Indexed Pattern Matching¶
- Learning Objective
In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or StringSet.
- Difficulty
Average
- Duration
30 min
- Prerequisites
Overview¶
The Finder is an object that stores all necessary information for searching for a pattern. We have learned how to use it in Online Pattern Matching tutorial.
The following line of code shows how the Finder is initialized with an index. In this example, we search for the pattern ACGT
.
String<Dna5> genome = "ACGTACGTACGTN";
Index<String<Dna5>, IndexEsa<> > esaIndex(genome);
//![esa]
Finder<Index<String<Dna5>, IndexEsa<> > > esaFinder(esaIndex);
Calling the function find invokes the localization of all occurrences of a given pattern.
It works by modifying pointers of the Finder
to tables of the index.
For example, the Finder of esaIndex
stores two pointers, pointing to the first and last suffix array entry that stores an occurrence of the pattern. The return value of the find function tells us whether or not a given pattern occurs in the text. Furthermore, if there are several instances of a pattern, consecutive calls of find will modify the Finder such that it points to the next occurrence after each call:
find(esaFinder, "ACGT"); // first occurrence of "ACGT"
find(esaFinder, "ACGT"); // second occurrence of "ACGT"
find(esaFinder, "ACGT"); // third occurrence of "ACGT"
The above code is not very useful, since we do not know the locations of the first, second or third pattern occurrence.
The function position will help here.
position called on a finder returns the location of the x
th pattern, where x
can be the first, second, or any other occurrence of the pattern.
find(esaFinder, "ACGT"); // first occurrence of "ACGT"
position(esaFinder); // -> 0
find(esaFinder, "ACGT"); // second occurrence of "ACGT"
position(esaFinder); // -> 4
find(esaFinder, "ACGT"); // third occurrence of "ACGT"
position(esaFinder); // -> 8
Tip
Indices in SeqAn are built on demand. That means that the index tables are not build when the constructor is called, but when we search for a pattern for the first time.
Exact Search¶
For the index based search the Finder needs to be specialized with an Index of the haystack
in the first template argument.
The index itself requires two template arguments, the haystack
type and a index specialization.
In contrast, since the needle
is not preprocessed the second template argument of the Pattern has to be omitted.
The following source illustrates the usage of an index based search in SeqAn using the example of the IndexEsa index (an enhanced suffix array index).
This is the default index specialization if no second template argument for the index is given.
We begin to create an index object of our haystack
"tobeornottobe"
and a needle
"be"
.
int main()
{
Index<CharString> index("tobeornottobe");
CharString needle = "be";
Finder<Index<CharString> > finder(index);
We proceed to create a Pattern of the needle and conduct the search in the usual way.
Pattern<CharString> pattern(needle);
while (find(finder, pattern))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
Instead of creating and using a pattern solely storing the needle
we can pass the needle directly to find.
Please note that an Index based Finder has to be reset with clear before conducting another search.
clear(finder);
while (find(finder, "be"))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[11,13) be
[2,4) be
[11,13) be
[2,4) be
All indices also support StringSet texts and can therefore be used to search multiple haystacks
as the following example shows.
We simply exchange the CharString of the haystack with a StringSet of CharString and append some strings to it.
int main()
{
typedef StringSet<CharString> THaystacks;
THaystacks haystacks;
appendValue(haystacks, "tobeornottobe");
appendValue(haystacks, "thebeeonthecomb");
appendValue(haystacks, "beingjohnmalkovich");
Index<THaystacks> index(haystacks);
Finder<Index<THaystacks> > finder(haystacks);
The rest of the program remains unchanged.
clear(finder);
while (find(finder, "be"))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
[< 0 , 11 >,< 0 , 13 >) be
[< 1 , 3 >,< 1 , 5 >) be
[< 2 , 0 >,< 2 , 2 >) be
[< 0 , 2 >,< 0 , 4 >) be
The following index specializations support the Finder interface as described above.
- Specialization IndexEsa
Enhanced suffix array based index. Supports arbitrary needles.
- Specialization IndexQGram
Q-gram index. Needle mustn’t exceed the size of the q-gram.
- Specialization Open Adressing QGram Index
Q-gram index with open addressing. Supports larger q-grams. Needle and q-gram must have the same size.
Besides the find interface there is another interface for indices using suffix tree iterators to search exact needle
occurrences described in the tutorial Indices.
Assignment 1¶
- Type
Application
- Objective
Modify the example above to search with a Open Adressing QGram Index q-gram index for matches of “tobe” in “tobeornottobe”.
- Solution
Click more… to see the solution.
#include <iostream> #include <seqan/index.h> using namespace seqan2; int main() { typedef Index<CharString, IndexQGram<UngappedShape<4>, OpenAddressing> > TIndex; TIndex index("tobeornottobe"); Finder<TIndex> finder(index); while (find(finder, "tobe")) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
We simply add a second template argument to the definition of the Index as described in the documentation of the Open Adressing QGram Index. As shape we can use an UngappedShape of length 4.
Program output:
[0,4) tobe [9,13) tobe
Approximate Filtration¶
Currently there are no indices directly supporting an approximate search.
But nevertheless, there are approximate search filters available that can be used to filter out regions of the haystack
that do not contain an approximate match, see SwiftFinder and SwiftPattern.
The regions found by these filters potentially contain a match and must be verified afterwards.
beginPosition, endPosition and infix can be used to return the boundaries or sequence of such a potential match.
For more details on using filters, see the article Filtering Similar Sequences.