Indexed Pattern Matching

Learning Objective

In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or StringSet.

Difficulty

Average

Duration

30 min

Prerequisites

Sequences, Indices, Online Pattern Matching

Overview

The Finder is an object that stores all necessary information for searching for a pattern. We have learned how to use it in Online Pattern Matching tutorial.

The following line of code shows how the Finder is initialized with an index. In this example, we search for the pattern ACGT.

    String<Dna5> genome = "ACGTACGTACGTN";
    Index<String<Dna5>, IndexEsa<> > esaIndex(genome);
//![esa]
    Finder<Index<String<Dna5>, IndexEsa<> > > esaFinder(esaIndex);

Calling the function find invokes the localization of all occurrences of a given pattern. It works by modifying pointers of the Finder to tables of the index. For example, the Finder of esaIndex stores two pointers, pointing to the first and last suffix array entry that stores an occurrence of the pattern. The return value of the find function tells us whether or not a given pattern occurs in the text. Furthermore, if there are several instances of a pattern, consecutive calls of find will modify the Finder such that it points to the next occurrence after each call:

    find(esaFinder, "ACGT");  // first occurrence of "ACGT"
    find(esaFinder, "ACGT");  // second occurrence of "ACGT"
    find(esaFinder, "ACGT");  // third occurrence of "ACGT"

The above code is not very useful, since we do not know the locations of the first, second or third pattern occurrence. The function position will help here. position called on a finder returns the location of the xth pattern, where x can be the first, second, or any other occurrence of the pattern.

    find(esaFinder, "ACGT"); // first occurrence of "ACGT"
    position(esaFinder); // -> 0
    find(esaFinder, "ACGT"); // second occurrence of "ACGT"
    position(esaFinder); // -> 4
    find(esaFinder, "ACGT"); // third occurrence of "ACGT"
    position(esaFinder); // -> 8

Tip

Indices in SeqAn are built on demand. That means that the index tables are not build when the constructor is called, but when we search for a pattern for the first time.

Approximate Filtration

Currently there are no indices directly supporting an approximate search. But nevertheless, there are approximate search filters available that can be used to filter out regions of the haystack that do not contain an approximate match, see SwiftFinder and SwiftPattern. The regions found by these filters potentially contain a match and must be verified afterwards. beginPosition, endPosition and infix can be used to return the boundaries or sequence of such a potential match. For more details on using filters, see the article Filtering Similar Sequences.