Performance Evaluation of Big Data Frameworks for Large-Scale Data Analytics

Veiga, Jorge; Expósito, Roberto R.; Pardo, Xoán C.; Taboada, Guillermo L.; Touriño, Juan

Title

Author(s)

Veiga, Jorge

Expósito, Roberto R.

Pardo, Xoán C.

Taboada, Guillermo L.

Touriño, Juan

Date

2017-02-06

Bibliographic citation

J. Veiga, R. R. Expósito, X. C. Pardo, G. L. Taboada and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 424-431.

Abstract

[Abstract] The increasing adoption of Big Data analytics has led to a high demand for efficient technologies in order to manage and process large datasets. Popular MapReduce frameworks such as Hadoop are being replaced by emerging ones like Spark or Flink, which improve both the programming APIs and performance. However, few works have focused on comparing these frameworks. This paper addresses this issue by performing a comparative evaluation of Hadoop, Spark and Flink using representative Big Data workloads and considering factors like performance and scalability. Moreover, the behavior of these frameworks has been characterized by modifying some of the main parameters of the workloads such as HDFS block size, input data size, interconnect network or thread configuration. The analysis of the results has shown that replacing Hadoop with Spark or Flink can lead to a reduction in execution times by 77% and 70% on average, respectively, for non-sort benchmarks.

Keywords

Sparks
Benchmark testing
Big Data
Generators
Programming
Clustering algorithms
Computational modeling

Description

This is a post-peer-review, pre-copyedit version of a published article. The final authenticated version is available online at: http://dx.doi.org/10.1109/BigData.2016.7840633