Pattern Matching¶
- Learning Objective
- In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or StringSet.
- Difficulty
- Average
- Duration
- 40 min
- Prerequisites
- Sequences, Indices
Pattern matching is about searching a known string or StringSet (needle
) in another string or StringSet (haystack
).
This tutorial will introduce you into the SeqAn classes finder and pattern.
It will demonstrate how to use the spezializations of the class finder to perform either an online search or an index based seach.
And you will learn how to specify the search algorithm, which can be exact or approximate.
Overview¶
In the case of approximate searching errors are allowed, which are either only mismatches or also indels.
Additionally there are filtration algorithms which return potential matches, i.e. haystack
segments possibly containing a pattern match.
All searching is done by calling the function find, which takes at least two arguments:
- A Finder that stores all necessary information about the
haystack
and the last found position of theneedle
within the haystack. - A Pattern that stores all information about the
needle
. Some variants of find support further arguments. The Finder and Pattern classes expect the underlyinghaystack
andneedle
types as first template arguments. In addition, a second template argument specifies the search algorithm.
Each call of find finds only one match (or potential match) of the needle
within the haystack.
The Finder can be asked for the begin and end position of the last found match.
The Pattern can be asked for the number of the found sequence if the needle
is a StringSet.
Subsequent calls of find can be used to find more occurrences of the needle
, until no more occurrences can be found and find returns false
.
In general, search algorithms can be divided into algorithms that preprocess the needle
(online search) or preprocess the haystack
(index search).
Online Search¶
For all online search algorithms, the Finder is an iterator that scans over the haystack
.
The Pattern is a search algorithm dependent data structure preprocessed from the needle
.
The second template argument of the Pattern selects the search algorithm.
Exact Search¶
The following code snippet illustrates the usage of online search algorithms in SeqAn using the example of the Hoorspool algorithm [Hor80].
We begin by creating two strings of type char
containing the haystack
and the needle
.
#include <iostream>
#include <seqan/find.h>
using namespace seqan;
int main()
{
CharString haystack = "Simon, send more money!";
CharString needle = "mo";
We then create Finder and Pattern objects of these strings and choose Horspool as the specialization in the second template argument of Pattern.
Finder<CharString> finder(haystack);
Pattern<CharString, Horspool> pattern(needle);
while (find(finder, pattern))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[2,4) mo
[12,14) mo
[17,19) mo
Currently the following exact online algorithms for searching a single sequence are implemented in SeqAn:
- Simple
- Brute force algorithm
- Horspool
- [Hor80]
- Bfam
- Backward Factor Automaton Matching
- BndmAlgo
- Backward Nondeterministic DAWG Matching
- ShiftAnd
- Exact string matching using bit parallelism
- ShiftOr
- Exact string matching using bit parallelism
... and for multiple sequences:
- WuManber
- Extension of Horspool.
- MultiBfam
- Multiple version of Bfam, uses an automaton of reversed needles.
- SetHorspool
- Another extension of Horspool using a trie of reversed needles.
- AhoCorasick
- [AC75]
- MultipleShiftAnd
- Extension of ShiftAnd, should only be used if the sum of needle lengths doesn’t exceed the machine word size.
Assignment 1¶
- Type
- Review
- Objective
- Use the given code example from below.
Extend the code to search the given
haystack
simultaneously for “mo”, “send” and “more”. For every match output the begin and end position in thehaystack
and whichneedle
has been found. - Hint
Online search algorithms for multiple sequences simply expect needles of type
String<String<...> >
.#include <iostream> #include <seqan/find.h> using namespace seqan; int main() { CharString haystack = "Simon, send more money!"; String<CharString> needles; appendValue(needles, "mo"); appendValue(needles, "send"); appendValue(needles, "more"); return 0; }
You can use the specialization WuManber.
- Solution
Click more... to see the solution.
#include <iostream> #include <seqan/find.h> using namespace seqan; int main() { CharString haystack = "Simon, send more money!"; String<CharString> needles; appendValue(needles, "mo"); appendValue(needles, "send"); appendValue(needles, "more"); Finder<CharString> finder(haystack); Pattern<String<CharString>, WuManber> pattern(needles); while (find(finder, pattern)) { std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t"; std::cout << position(pattern) << '\t' << infix(finder) << std::endl; } return 0; }
We use a Pattern specialized with the WuManber algorithm for the search and initialize it with our
needles
string. For every match found by find we output the begin and end position and the match region in thehaystack
as well as the index of the foundneedle
which is returned byposition(pattern)
.[2,4) 0 mo [7,11) 1 send [12,14) 0 mo [12,16) 2 more [17,19) 0 mo
Approximate Search¶
The approximate search can be used to find segments in the haystack
that are similar to a needle
allowing errors, such as mismatches or indels.
Note that if only mismatches are allowed, the difference of the end and begin position of a match is the length of the found needle
.
However, in the case of indels this difference may vary and is only a rough estimate for the length.
Therefore, to find a begin position for a certain end position the findBegin interface should be used.
The usage is similar to find and is shown in the next example.
We want to find all semi-global alignments of a needle
“more” with a SimpleScore of at least -2 using the scoring scheme (0,-2,-1) (match,mismatch,gap).
Again, we create haystack
and needle
strings first:
#include <iostream>
#include <seqan/find.h>
using namespace seqan;
int main()
{
CharString haystack = "Simon, send more money!";
CharString needle = "more";
We then create Finder and Pattern objects of these strings and choose DPSearch as the specialization in the second template argument of Pattern.
DPSearch expects the scoring function as the first template argument which is SimpleScore in our example.
The pattern is constructed using the needle
as a template and our scoring object is initialized with the appropriate scores for match, mismatch and gap.
As in the previous example, the main iteration uses find to iterate over all end positions with a minimum best score of -2.
If such a semi-global alignment end position is found the begin position is searched via findBegin.
Please note that we have to set the minimum score to the score of the match found (getScore) in order to find the begin of a best match.
We then output all begin and end positions and the corresponding haystack
segment for each match found.
Finder<CharString> finder(haystack);
Pattern<CharString, DPSearch<SimpleScore> > pattern(needle, SimpleScore(0, -2, -1));
while (find(finder, pattern, -2))
while (findBegin(finder, pattern, getScore(pattern)))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[2,4) mo
[12,14) mo
[12,15) mor
[12,16) more
[12,17) more
[12,18) more m
[17,19) mo
[17,21) mone
The following specializations are available:
- Specialization DPSearch
- Dynamic programming algorithm for many kinds of scoring scheme
- Specialization Myers
- [Mye99], [Ukk85]
- Specialization Pex
- [BYN99]
- Specialization AbndmAlgo
- Approximate Backward Nondeterministic DAWG Matching, adaption of AbndmAlgo
Assignment 2¶
- Type
- Application
- Objective
- Use the example from above.
Modify the code to search with the Myers algorithm for matches of
"more"
with an edit distance of at most 2. - Solution
Click more... to see the solution.
#include <iostream> #include <seqan/find.h> using namespace seqan; int main() { CharString haystack = "Simon, send more money!"; CharString needle = "more"; Finder<CharString> finder(haystack); Pattern<CharString, Myers<> > pattern(needle); while (find(finder, pattern, -2)) while (findBegin(finder, pattern, getScore(pattern))) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
We again set the
needle
to"more"
. We then change the specialization tag of the Pattern to Myers with default arguments. As Myers algorithm is only applicable to edit distance searches it cannot be specialized or initialized with a scoring scheme. In SeqAn, edit distance corresponds to the scoring scheme (0,-1,-1) (match, mismatch, gap) and an edit distance of 2 corresponds to a minimum score of -2 given to the find function.The program’s output is as follows.
[2,4) mo [2,5) mon [2,6) mon, [12,14) mo [12,15) mor [12,16) more [12,17) more [12,18) more m [17,19) mo [17,20) mon [17,21) mone [17,22) money
Index Search¶
Exact Search¶
For the index based search the Finder needs to be specialized with an Index of the haystack
in the first template argument.
The index itself requires two template arguments, the haystack
type and a index specialization.
In contrast, since the needle
is not preprocessed the second template argument of the Pattern has to be omitted.
The following source illustrates the usage of an index based search in SeqAn using the example of the IndexEsa index (an enhanced suffix array index).
This is the default index specialization if no second template argument for the index is given.
We begin to create an index object of our haystack
"tobeornottobe"
and a needle
"be"
.
int main()
{
Index<CharString> index("tobeornottobe");
CharString needle = "be";
Finder<Index<CharString> > finder(index);
We proceed to create a Pattern of the needle and conduct the search in the usual way.
Pattern<CharString> pattern(needle);
while (find(finder, pattern))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
Instead of creating and using a pattern solely storing the needle
we can pass the needle directly to find.
Please note that an Index based Finder has to be reset with clear before conducting another search.
clear(finder);
while (find(finder, "be"))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[11,13) be
[2,4) be
[11,13) be
[2,4) be
All indices also support StringSet texts and can therefore be used to search multiple haystacks
as the following example shows.
We simply exchange the CharString of the haystack with a StringSet of CharString and append some strings to it.
int main()
{
typedef StringSet<CharString> THaystacks;
THaystacks haystacks;
appendValue(haystacks, "tobeornottobe");
appendValue(haystacks, "thebeeonthecomb");
appendValue(haystacks, "beingjohnmalkovich");
Index<THaystacks> index(haystacks);
Finder<Index<THaystacks> > finder(haystacks);
The rest of the program remains unchanged.
clear(finder);
while (find(finder, "be"))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
[< 0 , 11 >,< 0 , 13 >) be
[< 1 , 3 >,< 1 , 5 >) be
[< 2 , 0 >,< 2 , 2 >) be
[< 0 , 2 >,< 0 , 4 >) be
The following index specializations support the Finder interface as described above.
- Specialization IndexEsa
- Enhanced suffix array based index. Supports arbitrary needles.
- Specialization IndexQGram
- Q-gram index. Needle mustn’t exceed the size of the q-gram.
- Specialization Open Adressing QGram Index
- Q-gram index with open addressing. Supports larger q-grams. Needle and q-gram must have the same size.
Besides the find interface there is another interface for indices using suffix tree iterators to search exact needle
occurrences described in the tutorial Indices.
Assignment 3¶
- Type
- Application
- Objective
- Modify the example above to search with a Open Adressing QGram Index q-gram index for matches of “tobe” in “tobeornottobe”.
- Solution
Click more... to see the solution.
#include <iostream> #include <seqan/index.h> using namespace seqan; int main() { typedef Index<CharString, IndexQGram<UngappedShape<4>, OpenAddressing> > TIndex; TIndex index("tobeornottobe"); Finder<TIndex> finder(index); while (find(finder, "tobe")) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
We simply add a second template argument to the definition of the Index as described in the documentation of the Open Adressing QGram Index. As shape we can use an UngappedShape of length 4.
Program output:
[0,4) tobe [9,13) tobe
Approximate Filtration¶
Currently there are no indices directly supporting an approximate search.
But nevertheless, there are approximate search filters available that can be used to filter out regions of the haystack
that do not contain an approximate match, see SwiftFinder and SwiftPattern.
The regions found by these filters potentially contain a match and must be verified afterwards.
beginPosition, endPosition and infix can be used to return the boundaries or sequence of such a potential match.
For more details on using filters, see the article Filtering Similar Sequences.