Optimal Search Schemes

Learning Objective

In this tutorial you will learn how to search a known pattern in a string or StringSet using a bidirectional index. The implemented algorithm in SeqAn is based on Optimal Search Schemes from the FAMOUS paper which allows for searching for up to 4 errors based on Hamming distance or Edit distance.

Difficulty

Average

Duration

15 min

Prerequisites

Sequences, Indices

Overview

Searching in an index with more than two errors is computationally very expensive and thus trivial approaches such as simple backtracking are not recommended. Optimal Search Schemes make use of a bidirectional index, split the pattern to be searched into multiple pieces and enumerate it in a more efficient fashion. To use this algorithm, you can simply call the find() method.

    find<0, 2>(delegate, index, pattern, HammingDistance());

The first argument is the delegate function that will be called if a match of the pattern is found. It gets an iterator of the bidirectional index and you can choose how to proceed with the hits. Additionally it passes a reference to the original pattern searched as well as the number of errors for the current hit in the index.

    auto delegate = [](auto & iter, DnaString const &, uint8_t /* errors */)
    {
        for (auto occ : getOccurrences(iter))
            std::cout << occ << std::endl;
    };

The second argument has to be a bidirectional index, currently SeqAn offers only bidirectional FM indices. (Please check the corresponding tutorial on FMIndex for more information how FM indices can be configured and constructed.)

The third argument is a String that you want to search.

The number of allowed errors (lower and upper bounds) are passed as template arguments. Please note that these parameters have to be constexpr. The distance metric is passed as fourth argument. You can choose between HammingDistance() and `EditDistance() (also known as Levenshtein distance).

You also have to possibility to pass a StringSet of patterns to search instead of a single String. The search can then be parallized by passing a fifth parameter tag Parallel() instead of Serial().

    find<1, 2>(delegate, index, patterns, HammingDistance(), Serial());

Warning

Please be aware that the same hits might be reported multiple times and you might have to filter these duplicated depending on your application, especially with larger errors numbers.

Note

When searching a StringSet in parallel mode the delegate is likely being executed by different threads at the same time. You have to take care by yourself that the threads do not interfere, e.g. write not into a variable simultaneously. One way of achieving this is by using lock guards. The following example buffers all hits in a set (to filter duplicates) and prints them afterwards.

    std::mutex mtx;
    std::set<Pair<DnaString, unsigned> > hits;
    auto delegateParallel = [&hits, &mtx](auto & iter, DnaString const & needle, uint8_t /* errors */)
    {
        std::lock_guard<std::mutex> lck(mtx); // critical section below this line
        for (auto occ : getOccurrences(iter))
            hits.insert(Pair<DnaString, unsigned>(needle, occ));
    };
    find<2, 3>(delegateParallel, index, patterns, HammingDistance(), Parallel());
    std::cout << "Hits with 2-3 errors (HammingDistance, no duplicates):" << std::endl;
    for (auto hit : hits)
        std::cout << hit << std::endl;

Here is a complete example for searching a string or a stringset (in serial mode):

#include <set>
#include <mutex>

#include <seqan/index.h>

using namespace seqan2;

int main()
{

    DnaString genome(
        "GAGAGGCCACTCGCAGGATTAAGTCAATAAGTTAATGGCGTGGGGTTATGGTATGGGGGTTCTCGCCCACAGTGACCTCATCGGT"
        "GCATTTCCTCATCGTAGGCGGAACGGTAGACACAAGGCATGATGTCAAATCGCGACTCCAATCCCAAGGTCGCAAGCCTATATAG"
        "GAACCCGCTTATGCCCTCTAATCCCGGACAGACCCCAAATATGGCATAGCTGGTTGGGGGTACCTACTAGGCACAGCCGGAAGCA");
    Index<DnaString, BidirectionalIndex<FMIndex<> > > index(genome);

    //![Delegate]
    auto delegate = [](auto & iter, DnaString const &, uint8_t /* errors */)
    {
        for (auto occ : getOccurrences(iter))
            std::cout << occ << std::endl;
    };
    //![Delegate]

    DnaString pattern("GGGGTTAT");
    std::cout << "Hits with up to 2 errors (HammingDistance):" << std::endl;
    //![SinglePattern]
    find<0, 2>(delegate, index, pattern, HammingDistance());
    //![SinglePattern]

    StringSet<DnaString> patterns;
    appendValue(patterns, "GGGGTTAT");
    appendValue(patterns, "CTAGCTAA");
    std::cout << "Hits with 1-2 errors (HammingDistance):" << std::endl;
    //![MultiplePatterns]
    find<1, 2>(delegate, index, patterns, HammingDistance(), Serial());
    //![MultiplePatterns]

    //![ParallelMode]
    std::mutex mtx;
    std::set<Pair<DnaString, unsigned> > hits;
    auto delegateParallel = [&hits, &mtx](auto & iter, DnaString const & needle, uint8_t /* errors */)
    {
        std::lock_guard<std::mutex> lck(mtx); // critical section below this line
        for (auto occ : getOccurrences(iter))
            hits.insert(Pair<DnaString, unsigned>(needle, occ));
    };
    find<2, 3>(delegateParallel, index, patterns, HammingDistance(), Parallel());
    std::cout << "Hits with 2-3 errors (HammingDistance, no duplicates):" << std::endl;
    for (auto hit : hits)
        std::cout << hit << std::endl;
    //![ParallelMode]

    return 0;
}