Use this link to cite:
http://hdl.handle.net/2183/26184
SparkEC: Reengineering and optimization of a Big Data tool for error correction in genetic datasets
Authors
Martínez-Sánchez, Marco
Academic degree
Bachelor's Degree in Computer Engineering (Grao en Enxeñaría Informática)
Abstract
[Abstract]
This BSc Thesis proposes the redesign and reimplementation of the parallel tool CloudEC, with the ultimate goal of improving its performance in cluster environments. CloudEC performs error correction on DNA reads in order to increase the quality of the bases obtained during the sequencing process.

Both the original tool (CloudEC) and the one developed in this project (SparkEC) are aimed at handling large volumes of data by relying on Big Data processing frameworks: CloudEC is implemented on top of Apache Hadoop, whereas Apache Spark has been selected for the new tool resulting from the reengineering process. Both frameworks are free, open source, and widely used in research and industry.

The tool has been developed following well-established Software Engineering practices. Regarding the design, the underlying architecture of the original system has been preserved but refined using design and architectural patterns, always aiming to improve the maintainability and extensibility of the software. In addition, SparkEC provides extra features and settings with which users are expected to better fit each execution to their input datasets and to the hardware available for the computation.

Regarding performance, an extensive experimental evaluation has been carried out with different datasets, settings, and versions of the software, comparing them against each other to analyze the impact of the proposed optimizations, and against the original tool to measure the speedup obtained in each scenario.

The tool developed in this project is publicly available for download at the following Git repository: https://github.com/mscrocker/SparkEC.
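As an illustration of the kind of correction these tools perform, the sketch below flags suspicious bases in short DNA reads using a simple k-mer spectrum: k-mers that occur frequently across the dataset are assumed correct ("solid"), while rare ones likely contain a sequencing error. This is a minimal, hypothetical example of the general technique only; it does not reproduce CloudEC's or SparkEC's actual algorithm, and all function names and thresholds are illustrative.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-mer across all reads (the 'k-mer spectrum')."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(read, spectrum, k, min_count=2):
    """Return the k-mers of a read whose dataset-wide count falls below
    min_count; these likely overlap a sequencing error."""
    return [read[i:i + k]
            for i in range(len(read) - k + 1)
            if spectrum[read[i:i + k]] < min_count]

# Three overlapping reads; the last one carries a G->T error at position 6.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACTT"]
spectrum = kmer_spectrum(reads, k=4)
print(weak_kmers("ACGTACTT", spectrum, k=4))  # -> ['TACT', 'ACTT']
```

In a distributed setting such as SparkEC's, the spectrum-counting step maps naturally onto the framework's aggregation primitives, which is one reason these tools are built on Hadoop or Spark in the first place.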







