Research on fast storage methods for large scale unstructured data resources

YAN Lifei, CHU Yuning, ZHAO Weiwei, HE Zhuangzhuang, LIU Xiaoqiang

Integrated Circuits and Embedded Systems ›› 2024, Vol. 24 ›› Issue (4): 77-81.

DOI: 10.20193/j.ices2097-4191.2024.04.014
Research article


Abstract

Unstructured data resources have high research value. As information technology and internet technology are applied more widely, the scale of unstructured data resources keeps growing, which poses a considerable challenge to storage technology. This paper therefore proposes a fast storage method for large-scale unstructured data resources. A hierarchical clustering algorithm first divides the unstructured data resources into groups. Taking one group as the object, the method combines factors such as transmission distance, node energy, and transmission direction to determine the forwarding path of the data resources, describes the storage process, and formulates a hierarchical expansion storage mechanism, thereby achieving fast storage of large-scale unstructured data resources. Experimental data show that, under different experimental conditions, the proposed method reaches a maximum storage rate of 1 920 MB/s and a maximum storage location accuracy of 98%, which supports the application performance of the proposed method.
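The abstract names two steps without giving formulas: grouping by hierarchical clustering, and choosing a forwarding path from transmission distance, node energy, and transmission direction. The following Python sketch illustrates one possible reading of those two steps. Everything specific in it is an assumption made for illustration and is not taken from the paper: the single-linkage merge rule, the numeric feature vectors, the function names `agglomerative_groups` and `next_hop`, the candidate-node dictionary layout, and the weights `w_dist`, `w_energy`, and `w_dir`.

```python
import math


def agglomerative_groups(points, n_groups):
    """Group feature vectors by bottom-up (agglomerative) clustering.

    Assumption: single linkage with Euclidean distance. The abstract does
    not say which linkage or distance the authors use.
    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_groups:
        best = None  # (distance, index_x, index_y) of the closest pair
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters


def next_hop(current, sink, candidates, w_dist=0.4, w_energy=0.3, w_dir=0.3):
    """Pick the next forwarding node from `candidates`.

    Assumption: each candidate is a dict with hypothetical keys 'id',
    'pos' (an (x, y) tuple), and 'energy' (residual energy in [0, 1]).
    The score is a weighted sum of three terms, all scaled to [0, 1]:
    closeness (shorter hop is better), residual energy, and how well the
    hop direction is aligned with the direction toward the storage node.
    """
    sx, sy = sink[0] - current[0], sink[1] - current[1]
    sink_len = math.hypot(sx, sy)

    def score(node):
        vx, vy = node["pos"][0] - current[0], node["pos"][1] - current[1]
        hop_len = math.hypot(vx, vy)
        closeness = 1.0 / (1.0 + hop_len)
        if hop_len == 0 or sink_len == 0:
            alignment = 0.5  # direction undefined, treat as neutral
        else:
            cos_angle = (vx * sx + vy * sy) / (hop_len * sink_len)
            alignment = (cos_angle + 1.0) / 2.0  # map [-1, 1] to [0, 1]
        return w_dist * closeness + w_energy * node["energy"] + w_dir * alignment

    return max(candidates, key=score)


# Usage: group four toy feature vectors, then pick a hop toward the sink.
features = [(0, 0), (0, 1), (10, 10), (10, 11)]
groups = agglomerative_groups(features, n_groups=2)

nodes = [
    {"id": "A", "pos": (1, 0), "energy": 0.9},   # toward the sink
    {"id": "B", "pos": (-1, 0), "energy": 0.9},  # away from the sink
]
chosen = next_hop(current=(0, 0), sink=(10, 0), candidates=nodes)
```

Repeating `next_hop` from each chosen node until the storage node is reached gives a forwarding path for one group. The weighted sum is only one way to combine the three factors. A real deployment would need to calibrate the weights, and the paper's own path criterion may differ.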

Key words

data resources / unstructured / secure storage / storage mechanism / fast storage

Cite this article

YAN Lifei, CHU Yuning, ZHAO Weiwei, et al. Research on fast storage methods for large scale unstructured data resources[J]. Integrated Circuits and Embedded Systems. 2024, 24(4): 77-81 https://doi.org/10.20193/j.ices2097-4191.2024.04.014
CLC number: TP315 (management programs, management systems)


Funding

Science and Technology Project of State Grid Corporation of China Headquarters (SGGSXT00XMXX2100312)

Editor: XUE Shiran