Enhancing in-memory Efficiency for MapReduce-based Data Processing

Veiga, Jorge; Expósito, Roberto R.; Taboada, Guillermo L.; Touriño, Juan

Use this link to cite:

http://hdl.handle.net/2183/21765

Files

R.R.Expósito_2018_Enhancin_In-memory_Efficiency_for_MapReduce-based_Data_Processing.pdf (1.01 MB)

Identifiers

URI: http://hdl.handle.net/2183/21765

DOI: 10.1016/j.jpdc.2018.04.001

Publication date

2018-10

Authors

Veiga, Jorge

Expósito, Roberto R.

Taboada, Guillermo L.

Touriño, Juan

Bibliographic citation

Jorge Veiga, Roberto R. Expósito, Guillermo L. Taboada, Juan Touriño, Enhancing in-memory efficiency for MapReduce-based data processing, Journal of Parallel and Distributed Computing, Volume 120, 2018, Pages 323-338, ISSN 0743-7315, https://doi.org/10.1016/j.jpdc.2018.04.001.

Abstract

As the memory capacity of computational systems increases, in-memory data management in Big Data processing frameworks becomes more crucial for performance. This paper analyzes and improves the memory efficiency of Flame-MR, a framework that accelerates Hadoop applications, providing valuable insight into the impact of memory management on performance. Optimizing memory allocation reduced garbage collection overheads and execution times by up to 85% and 44%, respectively, on a multi-core cluster. Moreover, different data buffer implementations are evaluated, showing that off-heap buffers achieve better results overall. Memory resources are also leveraged by caching intermediate results, improving the performance of iterative applications by up to 26%. The memory-enhanced version of Flame-MR has been compared with Hadoop and Spark on the Amazon EC2 cloud platform. The experimental results show significant performance benefits, reducing Hadoop execution times by up to 65% while providing very competitive results compared to Spark.

Description

This is a post-peer-review, pre-copyedit version of an article published in Journal of Parallel and Distributed Computing. The final authenticated version is available online at: https://doi.org/10.1016/j.jpdc.2018.04.001