SparkEC: speeding up alignment-based DNA error correction tools

UDC.coleccion: Research [es_ES]
UDC.departamento: Computer Engineering [es_ES]
UDC.grupoInv: Grupo de Arquitectura de Computadores (GAC) [es_ES]
UDC.issue: 464 [es_ES]
UDC.journalTitle: BMC Bioinformatics [es_ES]
UDC.volume: 23 [es_ES]
dc.contributor.author: Expósito, Roberto R.
dc.contributor.author: Martínez-Sánchez, Marco
dc.contributor.author: Touriño, Juan
dc.date.accessioned: 2022-12-14T19:26:08Z
dc.date.available: 2022-12-14T19:26:08Z
dc.date.issued: 2022
dc.description.abstract: [Abstract]: In recent years, huge improvements have been made in sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets currently generated through NGS. Therefore, new software capable of scaling out with high performance on a cluster of nodes is of great importance. In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been shown to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant improvements in the computational times of SparkEC compared to CloudEC for all the representative datasets and scenarios under evaluation, with average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, the speed of SparkEC can increase with the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS). [es_ES]
dc.description.sponsorship: Ministerio de Ciencia e Innovación; PID2019-104184RB-I00 / AEI / 10.13039 / 501100011033 [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431G 2019/01 [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431C 2021/30 [es_ES]
dc.description.sponsorship: FEDER; ED431G 2019/01 [es_ES]
dc.description.sponsorship: FEDER; ED431C 2021/30 [es_ES]
dc.identifier.citation: Expósito, R.R., Martínez-Sánchez, M. & Touriño, J. SparkEC: speeding up alignment-based DNA error correction tools. BMC Bioinformatics 23, 464 (2022). https://doi.org/10.1186/s12859-022-05013-1 [es_ES]
dc.identifier.issn: 1471-2105
dc.identifier.uri: http://hdl.handle.net/2183/32190
dc.language.iso: eng [es_ES]
dc.publisher: BioMed Central (Springer) [es_ES]
dc.relation.uri: https://doi.org/10.1186/s12859-022-05013-1 [es_ES]
dc.rights: Attribution 3.0 Spain [es_ES]
dc.rights.accessRights: open access [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/ [*]
dc.subject: Error correction [es_ES]
dc.subject: Big data [es_ES]
dc.subject: Distributed processing [es_ES]
dc.subject: Apache Spark [es_ES]
dc.title: SparkEC: speeding up alignment-based DNA error correction tools [es_ES]
dc.type: journal article [es_ES]
dspace.entity.type: Publication
relation.isAuthorOfPublication: 6a6967e9-a4f5-4006-afee-4fc9d5f3a658
relation.isAuthorOfPublication: 86e306a5-99a1-4c43-8faa-720f0a9f0a34
relation.isAuthorOfPublication.latestForDiscovery: 6a6967e9-a4f5-4006-afee-4fc9d5f3a658

Files

Original bundle

Name:
Exposito_Martinez_Tourino_2022_SparkEC_DNA_error_correction_tools.pdf
Size:
3.96 MB
Format:
Adobe Portable Document Format
Description: