Simple Read Mapping

Learning Objective
You will be able to write read mappers using SeqAn.
Difficulty
Hard
Duration
2 h
Prerequisites
Indices, Fragment Store

In this tutorial, we will walk you through the code of a simple read mapper minimapper that uses the SWIFT filter and uses approximate string search for verification. There are severe limitations to its capabilities but it’s a read mapper in 12 effective lines of code (ignoring includes, comments, typedefs, I/O and lines with closing braces).

Try It

You can find the source code in the directory core/demos/tutorial/read_mapping/core/demos/tutorial/read_mapping. Copy over the FASTA files into your build directory and test it:

$ cp .../core/demos/tutorial/read_mapping/*.fasta .
$ make demo_tutorial_minimapper
...
$ ./core/demos/tutorial/read_mapping/demo_tutorial_minimapper
Invalid number of arguments.
USAGE: minimapper GENOME.fasta READS.fasta OUT.sam
$ ./core/demos/tutorial/read_mapping/tutorial_minimapper nc_001454.fasta reads_hamming.fasta out.sam
$ cat out.sam
@HD VN:1.0
@SQ SN:gi|9626553|ref|NC_001454.1|  LN:34214
@PG ID:SeqAn
nc_001454.fasta.fasta.000000005 8   gi|9626553|ref|NC_001454.1| 1396    255 36M *   0   0   TAGTGTTAGTTTATTCTGATGGAGTTGTGGAGTGAG    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
nc_001454.fasta.fasta.000000003 8   gi|9626553|ref|NC_001454.1| 20574   255 36M *   0   0   CCGGCGGCGTACACTGGCTGGCCCTNGCCTGGAACC    ]]]]]]]]]]]]]]]]]]]]]]]]]!]]]]]]]]]]
nc_001454.fasta.fasta.000000007 8   gi|9626553|ref|NC_001454.1| 23191   255 36M *   0   0   GTGACAACGCGCGTTTGGCCGTACTCAAACGCACCA    ]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]

Code Walkthrough

First, include the headers of the SeqAn modules we will use.

#include <cstdio>
#include <iostream>

#include <seqan/basic.h>
#include <seqan/file.h>
#include <seqan/index.h>
#include <seqan/store.h>

using namespace seqan;

We will now use some typedefs for the FragmentStore and SWIFT filter to get shortcuts to types used below.

// Some typedefs.
typedef FragmentStore<>::TReadSeqStore TReadSeqStore;
typedef Value<TReadSeqStore>::Type TReadSeq;
typedef FragmentStore<>::TContigStore TContigStore;
typedef Value<TContigStore>::Type TContigStoreElement;
typedef TContigStoreElement::TContigSeq TContigSeq;
typedef Index<TReadSeqStore, IndexQGram<Shape<Dna, UngappedShape<11> >, OpenAddressing> > TIndex;
typedef Pattern<TIndex, Swift<SwiftSemiGlobal> > TPattern;
typedef Finder<TContigSeq, Swift<SwiftSemiGlobal> > TFinder;
typedef FragmentStore<>::TAlignedReadStore TAlignedReadStore;
typedef Value<TAlignedReadStore>::Type TAlignedRead;

We define the global constant EPSILON (\(\vareps\)), the allowed error rate.

const double EPSILON = 0.08;

Evaluate the arguments from the command line. Use the functions loadContigs and loadReads to load the reference sequence (possibly more than one if the FASTA file contains more than one sequence) and reads into the FragmentStore. Note that these functions will automatically guess the file type for you.

int main(int argc, char *argv[]) {
    // 0) Handle command line arguments.
    if (argc < 3) {
        std::cerr << "Invalid number of arguments." << std::endl
                  << "USAGE: minimapper GENOME.fasta READS.fasta OUT.sam" << std::endl;
        return 1;
    }

    // 1) Load contigs and reads.
    FragmentStore<> fragStore;
    if (!loadContigs(fragStore, argv[1])) return 1;
    if (!loadReads(fragStore, argv[2])) return 1;

Initialize Finder and Pattern for the q-gram index used by the swift filter.

    // 2) Build an index over all reads and a SWIFT pattern over this index.
    TIndex index(fragStore.readSeqStore);
    TPattern pattern(index);

Now, iterate over all input sequence contigs and enumerate all SWIFT hits. These hits will contain all possible matches of the reads in the FragmentStore with up to \(\varepsilon \cdot \ell\) (with \(\ell =\) length(read)) errors. Mismatches and indels are taken into consideration.

    // 3) Enumerate all epsilon matches.
    for (unsigned i = 0; i < length(fragStore.contigStore); ++i) {
        TFinder finder(fragStore.contigStore[i].seq);
        while (find(finder, pattern, EPSILON)) {

Now, verify each possible match using a HammingSimplePattern. The verified matches will have Hamming distance \(< \lfloor \varepsilon \cdot \ell \rfloor\), edit distance is not considered.

            // Verify match.
            Finder<TContigSeq> verifyFinder(fragStore.contigStore[i].seq);
            setPosition(verifyFinder, beginPosition(finder));
            Pattern<TReadSeq, HammingSimple> verifyPattern(fragStore.readSeqStore[position(pattern).i1]);
            unsigned readLength = length(fragStore.readSeqStore[position(pattern).i1]);
            int minScore = -static_cast<int>(EPSILON * readLength);
            while (find(verifyFinder, verifyPattern, minScore) && position(verifyFinder) < endPosition(infix(finder))) {
                TAlignedRead match(length(fragStore.alignedReadStore), position(pattern).i1, i,
                                   beginPosition(verifyFinder), endPosition(verifyFinder));
                appendValue(fragStore.alignedReadStore, match);
            }
        }
    }

Finally, write out the resulting multiple read alignment to the SAM file with the file name on the command line.

    // 4) Write out Sam file.
    std::ofstream samFile(argv[3], std::ios_base::out);
    write(samFile, fragStore, Sam());

    return 0;
}

Hands On!

Programming can only be learned by programming, so let’s get started. We create a new sandbox and a new app for the minimapper. If you already have a sandbox, then you can skip the first step

$ ./util/bin/skel.py repository sandbox/my_sandbox
$ ./util/bin/skel.py app minimapper sandbox/my_sandbox

Now, we copy over the code from the original location into our new app and build it.

$ cp core/demos/tutorial/read_mapping/minimapper.cpp sandbox/my_sandbox/apps/minimapper/minimapper.cpp
$ cd build/Debug
$ cmake .
$ make minimapper
$ ./sandbox/my_sandbox/apps/minimapper/minimapper
Invalid number of arguments.
USAGE: minimapper GENOME.fasta READS.fasta OUT.sam

Now, play around with the source code. Here are some examples for things to try out. There are no solutions, and they are merely thought to get you started playing...

Task 1: Use the ArgumentParser

Global constants are kind of inflexible. Instead of the global constant EPSILON, create an Options struct with a member variable epsilon, initialize it to 0.8 in the constructor and use an Option struct in the main program. Make the value for configurable using the class ArgumentParser described in the Parsing Command Line Arguments Tutorial.

Task 2: Allow Edit Distance for Verification

Currently, the read mapper can only find reads with mismatches but not with indels. The SWIFT filter will already create hits for positions with indels so you only have to adjust the verification step.

Hint
Use the Myers Pattern for the approximate search. Don’t forget to call findBegin using the score (getScore) of the last hit as the find begin score. You can use one Myers Pattern object per read sequence to only perform the precomputation once. If you reuse your finder object, don’t forget to call clear.

Task 3: Find Matches On Reverse-Complement

Another limitation is that only reads from the forward strand will be found. Either reverse-complement all reads or the contigs to find reads from the reverse strand.

Maybe add options to limit searching to the forward or reverse strand.

Hint
Reverse-complementing the contigs will be faster in practice: First, an index is built over the reads which would have to be built twice if the reads were complemented. Second, there will usually be more reads data than genome data if the coverage is greater than 1.

Task 4: Allow Other Output Formats

Read the documentation on the function write of the class FragmentStore.

comments powered by Disqus