Efficiently Importing Millions Of Sequences

Memory Mapped Files

The fastest way to import tons of sequences in Fasta/Fastq/GSeq/... file format is to avoid slow C++ I/O streams and instead map the file directly into memory. This can be done via the MMapString which uses memory mapping of the operating system or via the ExternalString which emulates memory mapping by doing the paging by-hand. Most commonly used file formats concatenate sequences separated by a delimiter, e.g. >, @, line-break, that marks the begin of each sequence. In SeqAn there is also a data structure that represents multiple sequences using one concatenation string and the begin positions of each sequence, the ConcatDirectStringSet. We therefore defined the type MultiSeqFile as an alias for a ConcatDirectStringSet using a single MMapString.

In the next example we are going to open a sequence file, recognize its format, split the file into sequence fractions and import each sequence, its quality values and id.

#define SEQAN_PROFILE // enable time measurements
#include <seqan/file.h>
#include <iostream>

using namespace seqan;

First we associate our sequence file with the memory mapped string underlying the ConcatDirectStringSet using open.

int main (int argc, char const * argv[])
{
	SEQAN_PROTIMESTART(loadTime);

	MultiSeqFile multiSeqFile;
	if (argc < 2 || !open(multiSeqFile.concat, argv[1], OPEN_RDONLY))
		return 1;

Next we guess the file format of the single concatenation string and store the result in a AutoSeqFormat object, which is used subsequently to select the right import function. split expects a ConcatDirectStringSet and divides the underlying string into sequence fragments separated by a file format specific delimiter.

	AutoSeqFormat format;
	guessFormat(multiSeqFile.concat, format);
	split(multiSeqFile, format);

After calling split the multiSeqFile StringSet represents the sequence fragments and can be used to reserve memory for the StringSets that store sequences and ids.

	unsigned seqCount = length(multiSeqFile);
	StringSet<String<Dna5Q> > seqs;
	StringSet<CharString> seqIDs;

	reserve(seqs, seqCount, Exact());
	reserve(seqIDs, seqCount, Exact());

The main loop iterates over each sequence fragment and uses the functions assignSeq, assignQual and assignSeqId to extract sequence data, qualities and id. The quality values are encoded in ASCII and have to be converted into integer values between 0 and 62 before assigning it to a Dna5Q character via assignQualityValue.

	String<Dna5Q> seq;
	CharString qual;
	CharString id;

	for (unsigned i = 0; i < seqCount; ++i)
	{
		assignSeq(seq, multiSeqFile[i], format);    // read sequence
		assignQual(qual, multiSeqFile[i], format);  // read ascii quality values
		assignSeqId(id, multiSeqFile[i], format);   // read sequence id

		// convert ascii to values from 0..62
		// store dna and quality together in Dna5Q
		for (unsigned j = 0; j < length(qual) && j < length(seq); ++j)
			assignQualityValue(seq[j], (int)(ordValue(qual[j]) - 33));

		// we use reserve and append, as assign is not supported
		// by StringSet<..., Owner<ConcatDirect<> > >
		appendValue(seqs, seq, Generous());
		appendValue(seqIDs, id, Generous());
	}

Finally we output the number of imported sequences, the overall runtime and the first 10 sequences in Fasta format.

	std::cout << "Loading " << seqCount << " sequences took " << SEQAN_PROTIMEDIFF(loadTime);
	std::cout << " seconds." << std::endl << std::endl;
	for (unsigned i = 0; i < seqCount && i < 10; ++i)
	{
		std::cout << '>' << seqIDs[i] << std::endl;
		std::cout << seqs[i] << std::endl;
	}

	return 0;
}

Program Output

$ cd build/Release
$ make efficiently_import_sequences
[...]
$ ./core/demos/howto/efficiently_import_sequences reads.fq
Loading 1000000 sequences took 4.82109 seconds

>HWI-EAS299_3_30MAPAAXX:6:1:1561:1481/1
GTTTATTTCACCTCCTTTACTTGTAGTCCAGGCGGTA
>HWI-EAS299_3_30MAPAAXX:6:1:1561:1481/2
AAAGAATTTAAATATTTCCTTAATAAGGCACGCCGTT
>HWI-EAS299_3_30MAPAAXX:6:1:1703:1976/1
GTTTTGATGTACAACGCCGTTACAGGTATAGTGAGAG
>HWI-EAS299_3_30MAPAAXX:6:1:1703:1976/2
TTCTAAATTAAAACCTCCAGAATAAGGAACATAAGAG
>HWI-EAS299_3_30MAPAAXX:6:1:1638:1932/1
GAAATTTTTGAGGTTATTCGCTCTTGCAACACTTTTC
>HWI-EAS299_3_30MAPAAXX:6:1:1638:1932/2
CACCCATACTATTAAAGCAAGCATCGGGAAAAGTAAT
>HWI-EAS299_3_30MAPAAXX:6:1:1726:1928/1
GCATAATGCAAAGGGTTAGTATATGATTTTTAGTATG
>HWI-EAS299_3_30MAPAAXX:6:1:1726:1928/2
GAGACGACAACTCCCTCCGGGAACTAAACGTGCGTAT
>HWI-EAS299_3_30MAPAAXX:6:1:720:1208/1
GCATATTCTATAAATGCTAAGCATAAAAATAATTTTC
>HWI-EAS299_3_30MAPAAXX:6:1:720:1208/2
TGCCTGTTTACCATTTAGACAGGGTTCACAAATTTCA

Remarks

  • We intentionally use appendValue to fill the StringSets as for some applications it is more memory efficient to use a ConcatDirectStringSet to store imported sequences and ids. The ConcatDirectStringSet consists of only one String concatenating all sequences and a String containing the begin positions which induce less overhead compared to storing millions of single Strings separately on heap with their own begin, end and capacity information.
  • Although not visible in the example, the import functions can of course also import large sequences spanning multiple lines in various formats.

Fragment Store

The whole program above is condensed into the function loadReads working on a FragmentStore. An example for this function is given in Filtering Similar Sequences.

comments powered by Disqus