RGen: Data Generator for Benchmarking Big Data Workloads

Pérez-Jove, R.; Expósito, R.R.; Touriño, J. RGen: Data Generator for Benchmarking Big Data Workloads. Eng. Proc. 2021, 7, 13. https://doi.org/10.3390/engproc2021007013

Abstract

This paper presents RGen, a parallel data generator for benchmarking Big Data workloads that integrates existing features and new functionalities in a standalone tool. The main functionalities developed in this work are the generation of text and graphs that meet the characteristics defined by the 4 Vs of Big Data. Text generation relies on the Latent Dirichlet Allocation (LDA) model, which extracts the topics or themes covered in a collection of documents, while graph generation is based on the Kronecker model. An experimental evaluation carried out on a 16-node cluster shows that RGen provides very good weak and strong scalability results. RGen is publicly available to download at https://github.com/rubenperez98/RGen (accessed on 30 September 2021).
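
To illustrate the kind of graph generation the abstract refers to, the following is a minimal sketch of sampling edges from a stochastic Kronecker graph by recursive descent over a 2x2 initiator matrix. It is not taken from RGen: the class name KroneckerSketch, the method sampleEdge, and the initiator values in main are assumptions chosen only for illustration.

```java
import java.util.Random;

// Illustrative sketch only; not RGen's implementation.
public class KroneckerSketch {

    // Samples one directed edge of a stochastic Kronecker graph with 2^levels vertices.
    // At each level, one cell of the 2x2 initiator p is chosen with probability
    // proportional to its weight; the chosen row and column each contribute one bit
    // to the source and destination vertex ids.
    static long[] sampleEdge(double[][] p, int levels, Random rng) {
        double total = p[0][0] + p[0][1] + p[1][0] + p[1][1];
        long src = 0;
        long dst = 0;
        for (int i = 0; i < levels; i++) {
            double r = rng.nextDouble() * total;
            int row;
            int col;
            if (r < p[0][0]) {
                row = 0; col = 0;
            } else if (r < p[0][0] + p[0][1]) {
                row = 0; col = 1;
            } else if (r < p[0][0] + p[0][1] + p[1][0]) {
                row = 1; col = 0;
            } else {
                row = 1; col = 1;
            }
            src = (src << 1) | row;
            dst = (dst << 1) | col;
        }
        return new long[] {src, dst};
    }

    public static void main(String[] args) {
        // Assumed initiator weights, chosen for illustration only.
        double[][] initiator = {{0.57, 0.19}, {0.19, 0.05}};
        Random rng = new Random(42L);
        int levels = 10; // 2^10 = 1024 vertices
        for (int e = 0; e < 5; e++) {
            long[] edge = sampleEdge(initiator, levels, rng);
            System.out.println(edge[0] + " -> " + edge[1]);
        }
    }
}
```

Because each edge is sampled independently, a generator built this way can split the requested number of edges across parallel tasks, each with its own random seed.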

Keywords

Data generator
MapReduce
HDFS
Apache Hadoop
Java
Big Data
Benchmarking

Description

Presented at the 4th XoveTIC Conference, A Coruña, Spain, 7–8 October 2021.

Editor version

https://doi.org/10.3390/engproc2021007013

Rights

Attribution 3.0 Spain