Filtering Similar Sequences

Using Swift

In the next example we are going to use the Swift filter to efficiently find pairs of similar reads. The Swift algorithms searches for so-called epsilon matches, local alignments, of two sequences with an error rate below a certain epsilon threshold.

The Swift implementation in SeqAn provides a find interface and requires the Finder and Pattern to be specialized with Swift<..>. Millions of sequences can be searched simultaneously with one Swift Pattern in a Swift Finder of a single haystack sequence. The error rate of a local alignment is the number of errors divided by the length of the needle sequence part of the match. There are currently two version of the Swift algorithm implemented in SeqAn, SwiftSemiGlobal and SwiftLocal. Both can be used to search epsilon-matches of a certain minimum length.

Hint

SwiftSemiGlobal should only be used for short needles (sequenced reads) as it always returns potential epsilon matches spanning a whole needle sequence. SwiftLocal should be preferred for large needles as it returns needle sequences potentially having an intersection with an epsilon match.

The following program searches for semi-global alignments between pairs of reads with a maximal error rate of 10%.

#include <seqan/stream.h>
#include <seqan/index.h>
#include <seqan/store.h>
#include <iostream>

using namespace seqan;

First we loads reads from a file into a FragmentStore with loadReads.

int main(int argc, char const * argv[])
{
    FragmentStore<> fragStore;
    if (argc < 2 || !loadReads(fragStore, argv[1]))
        return 1;

Swift uses a q-gram index of the needle sequences. Thus, we have to specialize the Swift Semi Global Pattern with a IndexQGram index of the needle StringSet in the first template argument, create the index over the readSeqStore and pass the index to the Pattern constructor. Swift Semi Global Finder and Swift Semi Global Pattern classes have to be specialized with SwiftSemiGlobal in the second template argument.

Note

Note, to use the local swift filter you simply switch the specialization tag to SwiftLocal: Swift Local Finder and Swift Local Pattern.

The main loop iterates over all potential matches which can be further processed, e.g. by a semi-global or overlap aligner.

    typedef FragmentStore<>::TReadSeqStore TReadSeqStore;
    typedef GetValue<TReadSeqStore>::Type TReadSeq;
    typedef Index<TReadSeqStore, IndexQGram<Shape<Dna, UngappedShape<11> >, OpenAddressing> > TIndex;
    typedef Pattern<TIndex, Swift<SwiftSemiGlobal> > TPattern;
    typedef Finder<TReadSeq, Swift<SwiftSemiGlobal> > TFinder;

    TIndex index(fragStore.readSeqStore);
    TPattern pattern(index);
    for (unsigned i = 0; i < length(fragStore.readSeqStore); ++i)
    {
        if ((i % 1000) == 0)
            std::cout << "." << std::flush;
        TFinder finder(fragStore.readSeqStore[i]);
        while (find(finder, pattern, 0.1))
        {
            if (i == position(pattern).i1)
                continue;
            // do further alignment here
/*			std::cout << "Found possible overlap of " << std::endl;
            std::cout << "\t" << infix(finder) << std::endl;
            std::cout << "\t" << seqs[position(pattern).i1] << std::endl;
*/      }
    }

    return 0;
}