MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud

Expósito, Roberto R.; Veiga, Jorge; González-Domínguez, Jorge; Touriño, Juan

Title

Author(s)

Expósito, Roberto R.

Veiga, Jorge

González-Domínguez, Jorge

Touriño, Juan

Date

2017

Citation

Roberto R. Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño; MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud, Bioinformatics, Volume 33, Issue 17, 1 September 2017, Pages 2762–2764, https://doi.org/10.1093/bioinformatics/btx307

Abstract

[Abstract] This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool.

Keywords

MarDRe
Apache Hadoop
Big Data
Cloud platform
MapReduce
Cloud-ready duplicate

Description

This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record Roberto R. Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño; MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud, Bioinformatics, Volume 33, Issue 17, 1 September 2017, Pages 2762–2764 is available online at: https://doi.org/10.1093/bioinformatics/btx307