Description

Distmap (Pandey & Schlötterer 2013) is a wrapper around different mappers for distributed computation using Hadoop. The input for this tool is a modified FASTQ format, which is written to HDFS to save space and to make paired-end reads easy to distribute. It is a tab-delimited format with the following fields:

  1. Read name with ‘@’ symbol included.
  2. Read sequence.
  3. Read quality.
  4. Second read sequence (if paired-end).
  5. Second read quality (if paired-end).
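
As an illustration of the field layout above, the following minimal sketch builds one Distmap-style line from a single read or a read pair. The `FastqRecord` type and the `to_distmap_line` function are hypothetical names chosen for this example; they are not part of Distmap or ReadTools, and the sketch assumes the read name is stored without its leading ‘@’.

```python
from typing import Optional, Tuple

# Hypothetical record type: (name, sequence, quality).
# Assumption: the name is stored without the leading '@'.
FastqRecord = Tuple[str, str, str]


def to_distmap_line(first: FastqRecord,
                    second: Optional[FastqRecord] = None) -> str:
    """Format one read (or read pair) as a tab-delimited line."""
    name, seq, qual = first
    # Fields 1-3: read name (with '@'), sequence, quality.
    fields = ["@" + name, seq, qual]
    if second is not None:
        # Fields 4-5: only the mate's sequence and quality are added;
        # the mate's name is not repeated.
        _, seq2, qual2 = second
        fields.extend([seq2, qual2])
    return "\t".join(fields)


single = to_distmap_line(("read1", "ACGT", "IIII"))
paired = to_distmap_line(("read1", "ACGT", "IIII"),
                         ("read1", "TTGA", "HHHH"))
```

A single-end read yields three fields and a paired-end read yields five, so both mates travel on the same line.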

Limitations and ReadTools solution

There are several limitations in the current implementation (24/04/2017) of this software:

  • Distmap is only able to upload FASTQ files, and paired-end data only as split files. ReadTools provides a framework to upload any kind of format, which is used to allow uploading SAM/BAM/CRAM files.
  • Distmap loses barcode information from SAM/BAM/CRAM files. As with any other FASTQ-based file format, the information in the SAM tags is removed if it is not contained in the read name. ReadTools discourages storing information in the read name and proposes a standard SAM format for barcode information, so any compatible read source is handled accordingly when converting to the FASTQ-like Distmap format.
  • Distmap converts and writes the input into the local filesystem, and then the file is transferred to HDFS. This two-step behavior slows down processing by parsing two files, and uses twice the amount of disk space (in the local filesystem and in HDFS). ReadTools changes this into one-step processing: it reads the local file and converts it internally, which allows writing the output directly into HDFS.
  • Distmap downloads the part-files from the Map-Reduce tasks into the local filesystem and uses Picard Tools to merge/sort them into batches. As before, this two-step behavior slows down the process, and in this case the disk usage is large and redundant (part files, batches, final output). ReadTools changes this into “one-step” processing by reading (downloading) the files from HDFS in batches, sorting them on the fly, and then combining these pre-sorted batches into the final output.
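
The batch-wise sorting and merging described in the last point can be sketched as follows. This is only an illustration of the general technique (sort fixed-size batches, then k-way merge the pre-sorted batches), not the ReadTools implementation. The `Alignment` tuple, the function names and the batch size are assumptions made for this example, and the batches are kept in memory here, whereas a real tool would more likely spill them to temporary files.

```python
import heapq
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical alignment record: (reference index, position, read name).
Alignment = Tuple[int, int, str]


def sorted_batches(records: Iterable[Alignment],
                   batch_size: int) -> Iterator[List[Alignment]]:
    """Consume records in fixed-size batches, sorting each one in memory."""
    it = iter(records)
    while True:
        batch = sorted(islice(it, batch_size))
        if not batch:
            return
        yield batch


def merge_batches(batches: Iterable[List[Alignment]]) -> Iterator[Alignment]:
    """Combine pre-sorted batches into one globally sorted stream."""
    # heapq.merge performs a lazy k-way merge of already sorted inputs.
    return heapq.merge(*batches)


# Records as they might arrive from unordered part-files.
unsorted_records = [(1, 50, "a"), (0, 10, "b"), (0, 5, "c"),
                    (1, 2, "d"), (0, 7, "e")]
merged = list(merge_batches(sorted_batches(unsorted_records, batch_size=2)))
```

Because each batch is already sorted, the final merge only compares the head of each batch, so the full data set never has to be sorted in one pass.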

In ReadTools, the following tools are related to the Distmap software:

  • ReadsToDistmap: converts any kind of read source into the Distmap format. The tool can write its output directly into HDFS to avoid local disk overhead.
  • DownloadDistmapResult: downloads, sorts and merges the alignments generated by Distmap.