SMusket: Spark-based DNA error correction on distributed-memory systems
Use this link to cite
http://hdl.handle.net/2183/34529
Unless otherwise indicated, the item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain
Collections
- GI-GAC - Articles [192]
Metadata
Title
SMusket: Spark-based DNA error correction on distributed-memory systems
Date
2020
Bibliographic citation
R. R. Expósito, J. González-Domínguez, and J. Touriño, "SMusket: Spark-based DNA error correction on distributed-memory systems", Future Generation Computer Systems, vol. 111, pp. 698-713, 2020, https://doi.org/10.1016/j.future.2019.10.038
Is version of
https://doi.org/10.1016/j.future.2019.10.038
Abstract
[Abstract]: Next-Generation Sequencing (NGS) technologies have revolutionized genomics research over the last decade, bringing new opportunities for scientists to perform groundbreaking biological studies. Error correction in NGS datasets is considered an important preprocessing step in many workflows, as sequencing errors can severely affect the quality of downstream analysis. Although current error correction approaches provide reasonably high accuracies, their computational cost can still be unacceptable when processing large datasets. In this paper we propose SparkMusket (SMusket), a Big Data tool built upon the open-source Apache Spark cluster computing framework to boost the performance of Musket, one of the most widely adopted and top-performing multithreaded correctors. Our tool efficiently exploits Spark features to implement a scalable error correction algorithm intended for distributed-memory systems built using commodity hardware. The experimental evaluation on a 16-node cluster using four publicly available datasets has shown that SMusket is up to 15.3 times faster than previous state-of-the-art MPI-based tools, also providing a maximum speedup of 29.8 over its multithreaded counterpart. SMusket is publicly available under an open-source license at https://github.com/rreye/smusket
Keywords
Next-Generation Sequencing (NGS)
Sequence analysis
Big Data
Apache Spark
Error correction
Description
©2020 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article has been accepted for publication in Future Generation Computer Systems. The Version of Record is available online at https://doi.org/10.1016/j.future.2019.10.038
This is the accepted version of: R. R. Expósito, J. González-Domínguez, and J. Touriño, "SMusket: Spark-based DNA error correction on distributed-memory systems", Future Generation Computer Systems, vol. 111, pp. 698-713, 2020, https://doi.org/10.1016/j.future.2019.10.038
Rights
Attribution-NonCommercial-NoDerivs 3.0 Spain