Online Pattern Matching
- Learning Objective
In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or StringSet.
- Difficulty
Average
- Duration
40 min
- Prerequisites
For all online search algorithms, the Finder is an iterator that scans over the haystack
.
The Pattern is a search algorithm dependent data structure preprocessed from the needle
.
The second template argument of the Pattern selects the search algorithm.
Exact Search
The following code snippet illustrates the usage of online search algorithms in SeqAn using the example of the Hoorspool algorithm [Hor80].
We begin by creating two strings of type char
containing the haystack
and the needle
.
#include <iostream>
#include <seqan/find.h>
using namespace seqan2;
int main()
{
CharString haystack = "Simon, send more money!";
CharString needle = "mo";
We then create Finder and Pattern objects of these strings and choose Horspool as the specialization in the second template argument of Pattern.
Finder<CharString> finder(haystack);
Pattern<CharString, Horspool> pattern(needle);
while (find(finder, pattern))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[2,4) mo
[12,14) mo
[17,19) mo
Currently the following exact online algorithms for searching a single sequence are implemented in SeqAn:
- Simple
Brute force algorithm
- Horspool
[Hor80]
- Bfam
Backward Factor Automaton Matching
- BndmAlgo
Backward Nondeterministic DAWG Matching
- ShiftAnd
Exact string matching using bit parallelism
- ShiftOr
Exact string matching using bit parallelism
… and for multiple sequences:
- WuManber
Extension of Horspool.
- MultiBfam
Multiple version of Bfam, uses an automaton of reversed needles.
- SetHorspool
Another extension of Horspool using a trie of reversed needles.
- AhoCorasick
[AC75]
- MultipleShiftAnd
Extension of ShiftAnd, should only be used if the sum of needle lengths doesn’t exceed the machine word size.
Assignment 1
- Type
Review
- Objective
Use the given code example from below. Extend the code to search the given
haystack
simultaneously for “mo”, “send” and “more”. For every match output the begin and end position in thehaystack
and whichneedle
has been found.- Hint
Online search algorithms for multiple sequences simply expect needles of type
String<String<...> >
.#include <iostream> #include <seqan/find.h> using namespace seqan2; int main() { CharString haystack = "Simon, send more money!"; String<CharString> needles; appendValue(needles, "mo"); appendValue(needles, "send"); appendValue(needles, "more"); return 0; }
You can use the specialization WuManber.
- Solution
Click more… to see the solution.
#include <iostream> #include <seqan/find.h> using namespace seqan2; int main() { CharString haystack = "Simon, send more money!"; String<CharString> needles; appendValue(needles, "mo"); appendValue(needles, "send"); appendValue(needles, "more"); Finder<CharString> finder(haystack); Pattern<String<CharString>, WuManber> pattern(needles); while (find(finder, pattern)) { std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t"; std::cout << position(pattern) << '\t' << infix(finder) << std::endl; } return 0; }
We use a Pattern specialized with the WuManber algorithm for the search and initialize it with our
needles
string. For every match found by find we output the begin and end position and the match region in thehaystack
as well as the index of the foundneedle
which is returned byposition(pattern)
.[2,4) 0 mo [7,11) 1 send [12,14) 0 mo [12,16) 2 more [17,19) 0 mo
Approximate Search
The approximate search can be used to find segments in the haystack
that are similar to a needle
allowing errors, such as mismatches or indels.
Note that if only mismatches are allowed, the difference of the end and begin position of a match is the length of the found needle
.
However, in the case of indels this difference may vary and is only a rough estimate for the length.
Therefore, to find a begin position for a certain end position the findBegin interface should be used.
The usage is similar to find and is shown in the next example.
We want to find all semi-global alignments of a needle
“more” with a SimpleScore of at least -2 using the scoring scheme (0,-2,-1) (match,mismatch,gap).
Again, we create haystack
and needle
strings first:
#include <iostream>
#include <seqan/find.h>
using namespace seqan2;
int main()
{
CharString haystack = "Simon, send more money!";
CharString needle = "more";
We then create Finder and Pattern objects of these strings and choose DPSearch as the specialization in the second template argument of Pattern.
DPSearch expects the scoring function as the first template argument which is SimpleScore in our example.
The pattern is constructed using the needle
as a template and our scoring object is initialized with the appropriate scores for match, mismatch and gap.
As in the previous example, the main iteration uses find to iterate over all end positions with a minimum best score of -2.
If such a semi-global alignment end position is found the begin position is searched via findBegin.
Please note that we have to set the minimum score to the score of the match found (getScore) in order to find the begin of a best match.
We then output all begin and end positions and the corresponding haystack
segment for each match found.
Finder<CharString> finder(haystack);
Pattern<CharString, DPSearch<SimpleScore> > pattern(needle, SimpleScore(0, -2, -1));
while (find(finder, pattern, -2))
while (findBegin(finder, pattern, getScore(pattern)))
std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl;
return 0;
}
Program output:
[2,4) mo
[12,14) mo
[12,15) mor
[12,16) more
[12,17) more
[12,18) more m
[17,19) mo
[17,21) mone
The following specializations are available:
- Specialization DPSearch
Dynamic programming algorithm for many kinds of scoring scheme
- Specialization Myers
- Specialization Pex
[BYN99]
- Specialization AbndmAlgo
Approximate Backward Nondeterministic DAWG Matching, adaption of AbndmAlgo
Assignment 2
- Type
Application
- Objective
Use the example from above. Modify the code to search with the Myers algorithm for matches of
"more"
with an edit distance of at most 2.- Solution
Click more… to see the solution.
#include <iostream> #include <seqan/find.h> using namespace seqan2; int main() { CharString haystack = "Simon, send more money!"; CharString needle = "more"; Finder<CharString> finder(haystack); Pattern<CharString, Myers<> > pattern(needle); while (find(finder, pattern, -2)) while (findBegin(finder, pattern, getScore(pattern))) std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder) << ")\t" << infix(finder) << std::endl; return 0; }
We again set the
needle
to"more"
. We then change the specialization tag of the Pattern to Myers with default arguments. As Myers algorithm is only applicable to edit distance searches it cannot be specialized or initialized with a scoring scheme. In SeqAn, edit distance corresponds to the scoring scheme (0,-1,-1) (match, mismatch, gap) and an edit distance of 2 corresponds to a minimum score of -2 given to the find function.The program’s output is as follows.
[2,4) mo [2,5) mon [2,6) mon, [12,14) mo [12,15) mor [12,16) more [12,17) more [12,18) more m [17,19) mo [17,20) mon [17,21) mone [17,22) money