Abstract:
As data volumes continue to grow in the Internet era, the storage, analysis, and processing of big data are becoming prominent topics in academia and industry. Typical big data processing platforms adopt the MapReduce programming model to process applications. For example, Hadoop is deployed and computes as follows: it first collects data and stores them in a distributed storage system, which consists of storage nodes in a cluster. Then, the compute nodes read data from the storage nodes and perform map operations. Lastly, the compute nodes communicate with each other and obtain the computation results by performing reduce operations. While data are being collected and stored, the storage nodes mainly perform I/O operations; hence, their computing resources are not fully utilized. This paper proposes a big data preprocessing system based on the Hadoop platform. The main idea is to start computation earlier, during the data collection and storage phase, by using idle computing resources without affecting I/O performance. This approach can reduce the amount of data transferred from disk and over the network, as well as the runtime of applications. Experiments conducted with WordCount, a typical big data processing application, indicate that the system can improve the performance of Hadoop applications.
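The idea of computing during the storage phase can be illustrated with a minimal, single-process Python sketch of WordCount. It is not the paper's implementation and does not use the Hadoop API: the function names `ingest_block` and `reduce_counts`, and the sample blocks, are hypothetical and chosen only for illustration. The sketch assumes each storage node can compute partial word counts for a block while it stores that block, so that only the partial counts need to reach the reduce step.

```python
from collections import Counter

def ingest_block(block_text):
    """Hypothetical ingest-time step: while a block is being stored,
    use idle CPU to compute partial word counts for that block."""
    return Counter(block_text.split())

def reduce_counts(partials):
    """Reduce step: merge the per-block partial counts into totals."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Three illustrative blocks, as if written to three storage nodes.
blocks = ["big data big", "data hadoop", "big hadoop hadoop"]

# Preprocessing happens at collection/storage time, one block at a time.
partials = [ingest_block(block) for block in blocks]
totals = reduce_counts(partials)

# Compare how many records would travel to the reducers in each case.
raw_records = sum(len(block.split()) for block in blocks)       # one per word
preprocessed_records = sum(len(partial) for partial in partials)  # one per distinct word per block
```

In this toy input the preprocessed path ships fewer records than the raw path because repeated words inside a block collapse into one entry. The actual savings in a real cluster depend on the data and on how the system schedules the extra computation around I/O.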
Date of Conference: 12-14 March 2016
Date Added to IEEE Xplore: 14 July 2016