Towards a standard way of storing reads

Several formats have been developed for storing both raw and mapped reads, but there is still no consensus on which one is best. When development of ReadTools began, the most common formats were [FASTQ] for raw datasets and [SAM] for mapped or processed data.

The SAM format is well defined and well supported by the community (SAM specs), in contrast with the implementation-dependent FASTQ format. In addition, several widely used bioinformatics programs, such as Picard Tools or bwa (aln), are moving towards storing unmapped reads in SAM format. ReadTools uses a consistent format based on the [SAM specs] for output reads, and handles other sources by standardizing them.
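
As a rough illustration of what storing unmapped reads in SAM format means, the sketch below converts single-end FASTQ records into unmapped SAM lines. It is not how ReadTools is implemented: the function name `fastq_to_unmapped_sam` is hypothetical, the input is assumed to be strict 4-line FASTQ with Phred+33 qualities, and paired-end flags, read groups and barcodes are left out.

```python
def fastq_to_unmapped_sam(fastq_text):
    """Convert strict 4-line FASTQ records into unmapped SAM lines.

    Hypothetical helper for illustration only; assumes single-end reads
    and Phred+33 qualities (the encoding that the SAM QUAL field uses).
    """
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")

    # Minimal header line: format version and sort order.
    sam = ["@HD\tVN:1.6\tSO:unsorted"]
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")

        # Read name: header without '@', up to the first whitespace.
        name = header[1:].split()[0]
        # The 11 mandatory SAM fields for an unmapped read:
        # FLAG 4 (segment unmapped), no reference, position or CIGAR.
        fields = [name, "4", "*", "0", "0", "*", "*", "0", "0", seq, qual]
        sam.append("\t".join(fields))
    return sam


if __name__ == "__main__":
    example = "@read1 some comment\nACGT\n+\nIIII\n"
    for line in fastq_to_unmapped_sam(example):
        print(line)
```

The point of the sketch is that an unmapped read keeps its name, bases and qualities in well-defined columns, while the mapping-related columns take the placeholder values defined by the SAM specs.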

Supported formats

The following formats are accepted in ReadTools native tools:

Future support

The following formats are not accepted yet, but there are plans to support them in the future:

  • PacBio HDF5: the legacy PacBio format was stored in HDF5, but PacBio now uses a BAM-based format. We plan to support bas.h5 input files to allow conversion between legacy and current pipelines.
  • SRA: the NCBI format for storing public datasets. Supporting this source of reads will be important for downloading already standardized data. This support depends on a native library, and may require downloading libraries the first time it is used.