Journaled Set¶
- Learning Objective
- This tutorial introduces you to the new data structures Journaled Set and Journaled String. You will learn how to use them and how to exploit these data structures for an efficient analysis, while implementing a native online search.
- Difficulty
- Advanced
- Duration
- 2 h
- Prerequisites
- Sequences, String Sets, Iterators
A typical task in bioinformatics is to find patterns in biological sequences e.g. transcription factors, or to examine different biological traits and the effects of modifications on such traits. With the current advances in sequencing technologies, sequences of whole populations have been made available. But the time for searching in all these sequences is proportional to the number of sequences searched. That’s why it is important to find novel strategies to cope with the deluge of sequencing data. Since, many biological problems often involve the analysis of sequences of the same species, one effective strategy would be to exploit the similarities of such sequences.
For this special purpose we provide two data structures that are designed to improve the algorithmic tasks. The first one is the JournaledString and the second is the JournaledSet.
In this tutorial, we will introduce you to both data structures and implement a simple online search step by step.
Journaled String¶
The JournaledString data structure behaves like a normal String in SeqAn, except that it is composed of two data structures.
- The first data structure is a Holder which stores a sequence.
- The second data structure stores modifications that are made to this particular sequence using a journal (see Journaling Filesystems for more information). This journal contains a list of deletions and insertions. The inserted characters are stored in an additional insertion buffer.
The advantage of this data structure lies in representing a String as a “patch” to another String. The journaled data structure can be modified without loosing the original context. We want to show you how to work with these data structures so you can build your own algorithms based on this. For this reason we keep the applicational background simple and implement an native online-search algorithm by which we examine different properties of the data structures.
Before we start implementing our online search algorithm, we show you how to work with the Journaled String to learn the basic principles. To get access to the Journaled String implementation you have to include the <seqan/sequence_journaled.h> header file. Note that you will need the <seqan/file.h> too in order to print the sequences.
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence_journaled.h>
using namespace seqan;
int main()
{
In the next step we define the Journaled String type. A Journaled String is a specialization of the String class and is defined as String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> >. The specialization takes two parameters: (1) TValue defines the alphabet type used for the Journaled String and (2) Journaled<> selects the Journaled String specialization of the String class.
Journaled<> is further specialized with
- THostSpec selects the specialization of the underlying host sequence (Alloc<> for [dox:AllocString Alloc String),
- TJournaleSpec selects the used implementation to manage the journaled differences (here: SortedArray), and
- TBufferSpec selects the used specialization for the internally managed insertion buffer (here: Alloc<> as well).
In our scenario we use a char alphabet and [dox:AllocString Alloc String for the host string and the insertion buffer. Additionally, we use a Sorted Array as the model to manage the recorded differences.
We use the metafunction Host to get the type of the underlying host string used for the Journaled String.
typedef String<char, Journaled<Alloc<>, SortedArray, Alloc<> > > TJournaledString;
typedef Host<TJournaledString>::Type THost;
Now we can define the variables holding data structures. First, we construct our host sequence and after that we construct the Journaled String. Then, we set the host sequence using the function setHost. Afterwards, we examine the data structure in more detail and print the host sequence the constructed journaled sequence and the nodes of it.
String<char> hostStr = "thisisahostsequence";
TJournaledString journalStr;
setHost(journalStr, hostStr);
std::cout << "After creating the Journaled String:" << std::endl;
std::cout << "Host: " << host(journalStr) << std::endl;
std::cout << "Journal: "<< journalStr << std::endl;
std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
std::cout << std::endl;
Tip
The Journal
A node in the Journaled String represents either a part of the host sequence or a part of the insertion buffer. The type of a node is distinguished by the member variable segmentSource and can be of value SOURCE_ORIGINAL to refere to a part in the host or SOURCE_PATCH to refere to a part in the insertion buffer. A node further consists of three variables which specify the virtual position, the physical position and the length of this part. The virtual position gives the relative position of the Journaled String after all modifications before this position have been “virtually” applied. The physical position gives the absolute position where this part of the journal maps to either the host sequence or the insertion buffer.
This is followed by modifying our Journaled String. We insert the string "modified" at position 7 and delete the suffix "sequence" at position 19. Note that position 19 refers to the string after the insertion of "modified" at position 7. Again we print the host, the journaled sequence and the nodes that represent the modifications to see how our changes affect the host and the journaled sequence.
insert(journalStr, 7, "modified");
erase(journalStr, 19,27);
std::cout << "After modifying the Journaled String:" << std::endl;
std::cout << "Host: " << host(journalStr) << std::endl;
std::cout << "Journal: " << journalStr << std::endl;
std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
std::cout << std::endl;
All of this is followed by calling flatten on our journeld string. This call applies all journaled changes to the host sequence. Again we print the sequences to see the effects.
flatten(journalStr);
std::cout << "After flatten the Journaled String:" << std::endl;
std::cout << "Host: " << host(journalStr) << std::endl;
std::cout << "Journal: " << journalStr << std::endl;
std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
return 0;
}
Here is the output of our small program.
After creating the Journaled String:
Host: thisisahostsequence
Journal: thisisahostsequence
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=19})
After modifying the Journaled String:
Host: thisisahostsequence
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=7}, {segmentSource=2, virtualPosition=7, physicalPosition=0, length=8}, {segmentSource=1, virtualPosition=15, physicalPosition=7, length=4})
After flatten the Journaled String:
Host: thisisamodifiedhost
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=19})
Important
Be careful when using the flatten function as it modifies the underlying host sequence. This might affect other journaled sequences that share the same host sequence. This becomes important especially when working with Journal Sets where a whole set of sequences is journaled to the same reference.
Journaled Set¶
The JournaledSet is a specialization of the StringSet which can be used exactly as such but also provides some additional functions optimized to work with JournaledStrings. The general interface is equal to the interface of the StringSet. But it also provides some interfaces specialized for the use of Journaled Strings. One of these interfaces is the join function which journales a contained Journaled String to the previously set global reference. The following code snippet demonstrates the usage of the Journal Set and how to join a sequence to the previously set reference sequence.
As usual we include the necessary headers. We need the header <seqan/journal_set.h> to get access to the Journal Set. Then we define a type for journaled sequences. After that we define our Journal Set. The Journal Set is a specialization of the Owner concept of StringSets and is defined as StringSet<TJournalString, Owner<JournaledSet> >.
#include <iostream>
#include <seqan/file.h>
#include <seqan/journaled_set.h>
using namespace seqan;
int main()
{
typedef String<char, Journaled<Alloc<>, SortedArray, Alloc<> > > TJournalString;
typedef Host<TJournalString>::Type THost;
typedef StringSet<TJournalString, Owner<JournaledSet> > TJournaledSet;
TJournaledSet journaledSet;
THost reference = "DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE";
THost seq0 = "DPKKPRGKMVNSPPAFFVQTSREEHKKKHPDASVFSKKCSERWKTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE";
THost seq1 = "DPHHPPKPRGKMVNSPPAFFVQTSREEHKPDASVFSKKCSERRMPNHHTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE";
THost seq2 = "DPKKPRGKMSSYAFFVQTSREEHKKKHPKKCDEFSKKCSERWKTMSAKEKGKFEDARYEREMKTYIPPKGE";
In the subsequent steps we want to set a reference sequence to the Journal Set and add some sequences to it. We can set the reference sequence by using the function setHost. This function stores only a pointer to the given sequence. In some cases it might be necessary to copy the reference sequence instead. For this purpose you can use the function createHost.
setGlobalReference(journaledSet, reference);
appendValue(journaledSet, TJournalString(seq0));
appendValue(journaledSet, TJournalString(seq1));
appendValue(journaledSet, TJournalString(seq2));
Just adding sequences to the Journal Set does not automatically journal them to the global reference sequence of the set. One can explicitly trigger this process using the function join. This function takes as parameters the Journal Set and the position of the contained Journaled String which is to be journaled to the reference sequence. Thus, the programmer is free in the decision which sequence has to be journaled and which not. Furthermore, we can use the JoinConfig object to specify the method that shall be used for the journaling process.
join(journaledSet, 0, JoinConfig<GlobalAlign<JournaledManhatten> >()); // Simply inserts the
join(journaledSet, 1, JoinConfig<GlobalAlign<JournaledCompact> >()); // Uses default scoring scheme to compute compact journal.
JoinConfig<GlobalAlign<JournaledCompact> > joinConfig;
setScoringScheme(joinConfig, Score<int, BiAffine>(0,-1,-1)); // Note the mismatch score is forbidden internally when used in the context of journaling.
join(journaledSet, 2, joinConfig); // Compute journal using Levenshtein distance.
std::cout << "Reference: " << globalReference(journaledSet) << std::endl;
for(unsigned i = 0; i < length(journaledSet); ++i)
std::cout << "Journaled Sequence " << i << ": " << value(journaledSet,i) << std::endl;
return 0;
}
Tip
Configuration of the Join Methods
The JoinConfig object differentiates two methods in general and each method further differs in the used strategy. The two methods are the GlobalAlign and the GlobalChain method. They differ in the approach of computing the alignment which is necessary to construct the journal. The first method uses a global alignment algorithm while the second one uses an anchor approach in which first exact seeds are found using a q-gram index and after that the optimal chain between the identified anchors is computed. For each method the user can specify a different strategy. The first strategy is triggered by using JournaledManhatten. This means for the the GlobalAlign method, that the complete sequence is inserted and the complete reference is deleted, while for the GlobalChain methods this means that the gaps between the anchors are connected through the Manhatten distance. The second strategy is specified using the JournaledCompact tag. It computes the most compact form of a journal by menas of memory requirements.
Here is the output of the program.
Reference: DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
Journaled Sequence 0: DPKKPRGKMVNSPPAFFVQTSREEHKKKHPDASVFSKKCSERWKTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE
Journaled Sequence 1: DPHHPPKPRGKMVNSPPAFFVQTSREEHKPDASVFSKKCSERRMPNHHTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE
Journaled Sequence 2: DPKKPRGKMSSYAFFVQTSREEHKKKHPKKCDEFSKKCSERWKTMSAKEKGKFEDARYEREMKTYIPPKGE
Implementing an Online-Search¶
Now we have all foundations laid down to implement the online-search algorithm. Let us begin with the first assignment where we read in some sequences and use the currently learned things about the Journal Set.
Assignment 1¶
- Type
- Review, Application
- Objective
- Download the fasta file sequences.fasta which contains some DNA sequences. Write a method called loadAndJoin that gets a Journal Set and a stream file pointing to the downloaded fasta file. The method reads in the sequences one after another using SeqAn’s RecordReader. The first read sequences is set to the reference sequence. All following sequences are first appended to the StringSet and afterwards joined to the StringSet using a global alignment strategy and the most compact form.
- Hints
- You can start using the following code snippet. Replace the path of the iostream such that it points to your path and fill in the missing parts A, B and C in the function loadAndJoin (Altogether, you will need 4 lines of code.).
#include <iostream> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan; template <typename TString, typename TStream, typename TSpec> inline int loadAndJoin(StringSet<TString, Owner<JournaledSet> > & /*journalSet*/, TStream & stream, JoinConfig<TSpec> const & /*joinConfig*/) { typedef typename Host<TString>::Type THost; // Define the RecordReader. RecordReader<std::ifstream, SinglePass<> > reader(stream); // [A] Ensure the Journal Set is not occupied by other sequences. // Construct the temporary buffers for the read id and sequence. String<char> tempSeqId; THost tempSeq; // No sequences in the fasta file! if (atEnd(reader)) { std::cerr << "Empty FASTA file." << std::endl; return -1; } // First read sequence for reference sequence. if (readRecord(tempSeqId, tempSeq, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // [B] Set the reference sequence to the Journal Set // Read remaining sequences. while (!atEnd(reader)) { if (readRecord(tempSeqId, tempSeq, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // [C] Append and join the current read sequence. } return 0; } int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/path/to/your/fasta/file/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. // [D] Construct Journaled Set and call loadAndJoin databaseFile.close(); return 0; }
- Solution
#include <iostream> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan; template <typename TString, typename TStream, typename TSpec> inline int loadAndJoin(StringSet<TString, Owner<JournaledSet> > & journalSet, TStream & stream, JoinConfig<TSpec> const & joinConfig) { typedef typename Host<TString>::Type THost; // Define the RecordReader. RecordReader<std::ifstream, SinglePass<> > reader(stream); // [A] clear(journalSet); // Construct the temporary buffers for the read id and sequence. String<char> tempSeqId; THost sequence; // No sequences in the fasta file! if (atEnd(reader)) { std::cerr << "Empty FASTA file." << std::endl; return -1; } // First read sequence for reference sequence. if (readRecord(tempSeqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // [B] createGlobalReference(journalSet, sequence); // When using create we copy the reference instead of storing a pointer. // Read remaining sequences. while (!atEnd(reader)) { if (readRecord(tempSeqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // [C] appendValue(journalSet, TString(sequence)); // First we append the sequence to the set. join(journalSet, length(journalSet) - 1, joinConfig); // Second we join it to the set. } return 0; } int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/Users/rahn_r/Downloads/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. TJournaledSet journalSet; JoinConfig<GlobalAlign<JournaledCompact> > joinConfig; loadAndJoin(journalSet, databaseFile, joinConfig); databaseFile.close(); return 0; }
Now we have loaded and journaled our sequences and we use the minimal possible memory requirements for our sequences. Let’s continue and implement the exact online-search on the Journal Set. For this purpose we write a function called searchPattern which takes a StringSet of String<int> which we use to store each hit for each sequence in the Journal Set including the reference sequence. First we have to check whether the reference sequence is set. If not we abort the search since we cannot guarantee a correct search when the reference is not set. We also have to clear our hitSet to ensure there remain no phantom hits from previous searches. Then we resize it to the number of contained Journaled Strings plus an additional space for the global reference sequence.
template <typename TString, typename TPattern>
void searchPattern(StringSet<String<int> > & hitSet,
StringSet<TString, Owner<JournaledSet> > const & journalSet,
TPattern const & pattern)
{
typedef typename Host<TString>::Type THost;
// Check for valid initial state.
if (empty(globalReference(journalSet)))
{
std::cout << "No reference set. Aborted search!" << std::endl;
return;
}
// Reset the hitSet to avoid phantom hits coming from a previous search.
clear(hitSet);
resize(hitSet, length(journalSet) + 1);
Before we can search for the pattern in the Journaled Strings we have to find all occurrences in the reference sequence. Therefore we call the function findPatternInReference which takes a String<int> which we use to store the hits, the global reference and the pattern.
// Access the reference sequence.
THost & globalRef = globalReference(journalSet);
// Search for pattern in the reference sequence.
findPatternInReference(hitSet[0], globalRef, pattern);
After that we implement the body to search for occurrences in the Journaled Strings. Therefore we use for-loop to iterate over all contained sequences and call for each sequence the function findPatternInJournalString. The function gets as parameters the corresponding String<int> from the hitSet, the journaled sequence the pattern and the identified hits in the reference sequence.
// Search for pattern in the journaled sequences.
for (unsigned i = 0; i < length(journalSet); ++i)
findPatternInJournalString(hitSet[i+1], journalSet[i], pattern, hitSet[0]);
}
So far our program won’t compile. We have to first implement the both functions findPatternInReference and findPatternInJournalString.
Assignment 2¶
- Type
- Application
- Objective
- Implement the function findPatternInReference using a brute force pattern search algorithm. Store all found hits in the passed hits variable. Print all occurrences in the end of the main function.
- Hints
template <typename TString, typename TPattern> void findPatternInReference(String<int> & hits, TString const & reference, TPattern const & pattern) { // [A] Check whether pattern fits into the sequence. // [B] Iterate over all positions at which the pattern might occur. // [C] Evaluate all positions of the pattern until you find a mismatch or you have found a hit. // [D] Report begin position at which pattern matches the sequence. }
- Solution
Here is the solution for this function. Click more... below, to see a complete solution of everything we have done so far.
template <typename TString, typename TPattern> void findPatternInReference(String<int> & hits, TString const & reference, TPattern const & pattern) { // [A] Check whether pattern fits into the sequence. if (length(pattern) > length(reference)) return; // [B] Iterate over all positions at which the pattern might occur. for (unsigned pos = 0; pos < length(reference) - length(pattern) + 1; ++pos) { bool isHit = true; // [C] Evaluate all positions of the pattern until you find a mismatch or you have found a hit. for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern) { if(pattern[posPattern] != reference[posPattern + pos]) { isHit = false; break; } } // [D] Report begin position at which pattern matches the sequence. if(isHit) appendValue(hits, pos); } }
Include the necessary headers.
#include <iostream> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan;
Implementation of the findPatternInReference function.
template <typename TString, typename TPattern> void findPatternInReference(String<int> & hits, TString const & reference, TPattern const & pattern) { // [A] Check whether pattern fits into the sequence. if (length(pattern) > length(reference)) return; // [B] Iterate over all positions at which the pattern might occur. for (unsigned pos = 0; pos < length(reference) - length(pattern) + 1; ++pos) { bool isHit = true; // [C] Evaluate all positions of the pattern until you find a mismatch or you have found a hit. for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern) { if(pattern[posPattern] != reference[posPattern + pos]) { isHit = false; break; } } // [D] Report begin position at which pattern matches the sequence. if(isHit) appendValue(hits, pos); } }
Implementation of the searchPattern function. Note that we haven’t implemented the function findPatternInJournalString yet.
template <typename TString, typename TPattern> void searchPattern(StringSet<String<int> > & hitSet, StringSet<TString, Owner<JournaledSet> > const & journalSet, TPattern const & pattern) { typedef typename Host<TString>::Type THost; // Check for valid initial state. if (empty(globalReference(journalSet))) { std::cout << "No reference set. Aborted search!" << std::endl; return; } // Reset the hitSet to avoid phantom hits coming from a previous search. clear(hitSet); resize(hitSet, length(journalSet) + 1); // Access the reference sequence. THost & globalRef = globalReference(journalSet); // Search for pattern in the reference sequence. findPatternInReference(hitSet[0], globalRef, pattern); // Search for pattern in the journaled sequences. for (unsigned i = 0; i < length(journalSet); ++i) { // findPatternInJournalString(hitSet[i+1], journalSet[i], pattern, hitSet[0]); } }
Implementation of the loadAndJoin function.
template <typename TString, typename TStream, typename TSpec> inline int loadAndJoin(StringSet<TString, Owner<JournaledSet> > & journalSet, TStream & stream, JoinConfig<TSpec> const & joinConfig) { typedef typename Host<TString>::Type THost; RecordReader<std::ifstream, SinglePass<> > reader(stream); clear(journalSet); String<char> tempSeqId; THost sequence; // No sequences in the fasta file! if (atEnd(reader)) { std::cerr << "Empty FASTA file." << std::endl; return -1; } // First read sequence for reference sequence. if (readRecord(tempSeqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // We have to create the global reference sequence otherwise we loose the information after this function terminates. createGlobalReference(journalSet, sequence); // If there are more while (!atEnd(reader)) { if (readRecord(tempSeqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } appendValue(journalSet, TString(sequence)); join(journalSet, length(journalSet) - 1, joinConfig); } return 0; }
Implementation of the main function.
int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/Users/rahn_r/Downloads/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. TJournaledSet journalSet; JoinConfig<GlobalAlign<JournaledCompact> > joinConfig; loadAndJoin(journalSet, databaseFile, joinConfig); databaseFile.close(); // Define a pattern and start search. StringSet<String<int> > hitSet; TSequence pattern = "GTGGT"; std::cout << "Search for: " << pattern << ":\n"; searchPattern(hitSet, journalSet, pattern);
Printing the hits of the reference sequence.
if (empty(hitSet[0])) { std::cout << "No hit in reference " << std::endl; } else { std::cout << "Hit in reference " << " at "; for (unsigned j = 0; j < length(hitSet[0]); ++j) std::cout << hitSet[0][j] << ": " << infix(globalReference(journalSet), hitSet[0][j],hitSet[0][j] + length(pattern)) << "\t"; } std::cout << std::endl; return 0; }
And here is the result.
Search for: GTGGT: Hit in reference at 311: GTGGT 644: GTGGT
We know can search for all occurrences of a pattern in the reference sequence. Now we can try to find all occurrences in the journaled sequences. Therefore we implement the function findPatternInJournalString. Our function gets the variable hitsTarget which stores the hits found in the JournaledString. It gets the search text and the pattern and finally the hits detected in the reference sequence. Instead of searching each position in the Journaled String, we only search in areas that are new to the search. This involves all inserted parts and all parts where the pattern crosses a border to another node. So instead of iterating over each position we iterate over the nodes of the Journaled String. To do so we have to determine the type of the data structure that is used by the Journaled String to manage the nodes. We can use the metafunction JournalType for this task. Afterwards we define an Iterator over the so called TJournalEntries data structure.
Again we check first whether the pattern fits into our sequence.
template <typename TValue, typename THostSpec, typename TJournalSpec, typename TBufferSpec, typename TPattern>
void findPatternInJournalString(String<int> & hitTarget,
String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const & journal,
TPattern const & pattern,
String<int> const & refHits)
{
typedef String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const TJournal;
typedef typename JournalType<TJournal>::Type TJournalEntries;
typedef typename Iterator<TJournalEntries>::Type TJournalEntriesIterator;
if (length(pattern) > length(journal))
return;
We then iterate over all nodes beginning from the first until we have reached the node in which the pattern reaches the end of the Journaled Sequence. The function findInJournalEntries helps us to find the corresponding node. We increment the position of the iterator by one such that it points behind the last element which is included by the search.
TJournalEntriesIterator it = begin(journal._journalEntries);
TJournalEntriesIterator itEnd = findInJournalEntries(journal._journalEntries, length(journal) - length(pattern) + 1) + 1;
Now we search in each node until we have reached the end. For each node we first check the type of the journaled operation. If we are in an “original” node we call the function _findInOriginalNode. If we are in a “patch” node we call the function _findInPatchNode and in the end we call the function _searchAtBorder which is called for each node type and scans all possible hits at the border of a node.
while(it != itEnd)
{
if (it->segmentSource == SOURCE_ORIGINAL)
{ // Find a possible hit in the current source vertex.
_findInOriginalNode(hitTarget, it, pattern, refHits);
}
if (it->segmentSource == SOURCE_PATCH)
{ // Search for pattern within the patch node.
_findInPatchNode(hitTarget, it, journal, pattern);
}
// Scan the border for a possible match.
_searchAtBorder(hitTarget, it, journal, pattern);
++it;
}
}
Let us begin with the implementation of the function _findInOriginalNode. In this function we exploit the journaling concept such that we can speed up the search algorithm from \(\mathcal{O}(p \cdot n)\) to \(\mathcal{O}(\log_2(k))\), where \(p\) is the length of the pattern, \(n\) is the length of the search text, and k is the number of hits identified in the reference sequence. We need at most \(\log_2(k)\) comparisons to find the first hit which occurred in the reference sequence that also occurs in the current original node.
Assignment 3¶
- Type
- Transfer
- Objective
- Implement the function _findInOriginalNode, which identifies all shared hits between the current original node and the corresponding part in the reference sequence. Note you do not need to scan all positions again. In the end print all occurrences to the console.
- Hints
Use the STL function std::upper_bound to conduct a binary search to find the first possible hit from the reference that is also represented by the current node.
template <typename TJournalEntriesIterator, typename TPattern> void _findInOriginalNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TPattern const & pattern, String<int> const & refHits) { // [A] Check if hits exist in the reference. // [B] Find upper bound to current physical position in sorted refHits using std::upper_bound. // [C] Make sure we do not miss hits that begin at physical position of current node. // [D] Store all hits that are found in the region of the reference which is covered by this node. // [E] Store the correct virtual position and check next hit. }
- Solution
Here is the solution to this function. Click more... below, to see a complete solution of everything we have done so far.
template <typename TJournalEntriesIterator, typename TPattern> void _findInOriginalNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TPattern const & pattern, String<int> const & refHits) { // Define an Iterator which iterates over the reference hit set. typedef typename Iterator<String<int> const, Standard>::Type THitIterator; // [A] Check if hits exist in the reference. if(!empty(refHits)) { // [B] Find upper bound to current physical position in sorted refHits using std::upper_bound. THitIterator itHit = std::upper_bound(begin(refHits),end(refHits),(int)entriesIt->physicalPosition); // [C] Make sure we do not miss hits that begin at physical position of current node. if(itHit != begin(refHits) && *(itHit - 1) >= (int)entriesIt->physicalPosition) --itHit; // [D] Store all hits that are found in the region of the reference which is covered by this node. while((int)*itHit < ((int)entriesIt->physicalPosition + (int)entriesIt->length - (int)length(pattern) + 1) && itHit != end(refHits)) { // [E] Store the correct virtual position and check next hit. appendValue(hitTarget, entriesIt->virtualPosition + (*itHit - (int)entriesIt->physicalPosition)); ++itHit; } } }
Include the necessary headers.
#include <iostream> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan;
We know implement the method to search for hits in an original node. We only need to check if the current node covers a region of the reference in which we’ve found a hit. We use the function std::upper_bound to find the first element that is greater than the current physical position. Since, we’ve found an upper bound we have to check additionally if there exists a previous hit that lies directly on the physical begin position of our current node. We then include all hits that fit into this current node until we have found the first position where the pattern would cross the border of this node or we have reached the end of the reference hit set.
template <typename TJournalEntriesIterator, typename TPattern> void _findInOriginalNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TPattern const & pattern, String<int> const & refHits) { // Define an Iterator which iterates over the reference hit set. typedef typename Iterator<String<int> const, Standard>::Type THitIterator; // [A] Check if hits exist in the reference. if(!empty(refHits)) { // [B] Find upper bound to current physical position in sorted refHits using std::upper_bound. THitIterator itHit = std::upper_bound(begin(refHits),end(refHits),(int)entriesIt->physicalPosition); // [C] Make sure we do not miss hits that begin at physical position of current node. if(itHit != begin(refHits) && *(itHit - 1) >= (int)entriesIt->physicalPosition) --itHit; // [D] Store all hits that are found in the region of the reference which is covered by this node. while((int)*itHit < ((int)entriesIt->physicalPosition + (int)entriesIt->length - (int)length(pattern) + 1) && itHit != end(refHits)) { // [E] Store the correct virtual position and check next hit. appendValue(hitTarget, entriesIt->virtualPosition + (*itHit - (int)entriesIt->physicalPosition)); ++itHit; } } }
Implementing the backbone to search for a pattern in the reference string.
template <typename TValue, typename THostSpec, typename TJournalSpec, typename TBufferSpec, typename TPattern> void findPatternInJournalString(String<int> & hitTarget, String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const & journal, TPattern const & pattern, String<int> const & refHits) { typedef String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const TJournal; typedef typename JournalType<TJournal>::Type TJournalEntries; typedef typename Iterator<TJournalEntries>::Type TJournalEntriesIterator; if (length(pattern) > length(journal)) return; TJournalEntriesIterator it = begin(journal._journalEntries); TJournalEntriesIterator itEnd = findInJournalEntries(journal._journalEntries, length(journal) - length(pattern) + 1) + 1; while(it != itEnd) { if (it->segmentSource == SOURCE_ORIGINAL) { // Find a possible hit in the current source vertex. _findInOriginalNode(hitTarget, it, pattern, refHits); } if (it->segmentSource == SOURCE_PATCH) { // Search for pattern within the patch node. // _findInPatchNode(hitTarget, it, journal, pattern); } // Scan the border for a possible match. // _searchAtBorder(hitTarget, it, journal, pattern); ++it; } }
Implementing the search within the reference sequence.
template <typename TString, typename TPattern> void findPatternInReference(String<int> & hits, TString const & reference, TPattern const & pattern) { // Check whether the pattern fits into the sequence. if (length(pattern) > length(reference)) return; // for (unsigned pos = 0; pos < length(reference) - length(pattern) + 1; ++pos) { bool isHit = true; for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern) { if(pattern[posPattern] != reference[posPattern + pos]) { isHit = false; break; } } // Report the position if found a hit. if(isHit) appendValue(hits, pos); } }
Implementing the backbone of the search.
template <typename TString, typename TPattern> void searchPattern(StringSet<String<int> > & hitSet, StringSet<TString, Owner<JournaledSet> > const & journalSet, TPattern const & pattern) { typedef typename Host<TString>::Type THost; // Check for valid initial state. if (empty(globalReference(journalSet))) { std::cout << "No reference set. Aborted search!" << std::endl; return; } // Reset the hitSet to avoid phantom hits coming from a previous search. clear(hitSet); resize(hitSet, length(journalSet) + 1); // Access the reference sequence. THost & globalRef = globalReference(journalSet); // Search for pattern in the reference sequence. findPatternInReference(hitSet[0], globalRef, pattern); // Search for pattern in the journaled sequences. for (unsigned i = 0; i < length(journalSet); ++i) findPatternInJournalString(hitSet[i+1], journalSet[i], pattern, hitSet[0]); }
Implement the laodAndJoin method.
Implementing the main method.
int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/Users/rahn_r/Downloads/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. TJournaledSet journalSet; JoinConfig<GlobalAlign<JournaledCompact> > joinConfig; loadAndJoin(journalSet, databaseFile, joinConfig); databaseFile.close(); // Define a pattern and start search. StringSet<String<int> > hitSet; TSequence pattern = "GTGGT"; std::cout << "Search for: " << pattern << ":\n"; searchPattern(hitSet, journalSet, pattern);
Printing the hits of the reference sequence.
if (empty(hitSet[0])) { std::cout << "No hit in reference " << std::endl; } else { std::cout << "Hit in reference " << " at "; for (unsigned j = 0; j < length(hitSet[0]); ++j) std::cout << hitSet[0][j] << ": " << infix(globalReference(journalSet), hitSet[0][j],hitSet[0][j] + length(pattern)) << "\t"; } std::cout << std::endl;
Printing the hits of the journaled sequences.
for (unsigned i = 1; i < length(hitSet); ++i) { if (empty(hitSet[i])) { std::cout << "No hit in sequence " << i - 1 << std::endl; } else { std::cout << "Hit in sequence " << i - 1 << " at "; for (unsigned j = 0; j < length(hitSet[i]); ++j) std::cout << hitSet[i][j] << ": " << infix(value(journalSet,i-1), hitSet[i][j],hitSet[i][j] + length(pattern)) << "\t"; } std::cout << std::endl; } return 0; }
And here is the result.
Search for: GTGGT: Hit in reference at 311: GTGGT 644: GTGGT Hit in sequence 0 at 312: GTGGT Hit in sequence 1 at 308: GTGGT Hit in sequence 2 at 311: GTGGT Hit in sequence 3 at 327: GTGGT Hit in sequence 4 at 317: GTGGT Hit in sequence 5 at 320: GTGGT
We are almost at the end of our online-search algorithm. Let’s now implement the method _findInPatchNode. We basically had this already implemented when we wrote the search function for the reference. Let’s recall this part together.
First we write the body of our function and define now an Iterator over the Journaled String.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern>
void _findInPatchNode(String<int> & hitTarget,
TJournalEntriesIterator & entriesIt,
TJournal const & journal,
TPattern const & pattern)
{
typedef typename Iterator<TJournal const, Standard>::Type TJournalIterator;
We know specify the range in which we are searching for the pattern. This range starts at the current physical position of the insertion buffer and ends at the last position of this node where the pattern completely fits.
// Search for pattern in the insertion node.
TJournalIterator patchIter = iter(journal, entriesIt->virtualPosition);
TJournalIterator patchEnd = patchIter + _max(0, (int)entriesIt->length - (int)length(pattern) + 1);
// Move step by step over search region.
for (; patchIter != patchEnd; ++patchIter)
{
We need to use a second temporary iterator which is used to compare the current value with the pattern. If all positions matches then we report a match at the current virtual position.
TJournalIterator verifyIter = patchIter;
bool isHit = true;
// Search for pattern in the insertion node.
for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern, ++verifyIter)
{
// Comparing the pattern value with the current value of the iterator.
if(pattern[posPattern] != getValue(verifyIter))
{
isHit = false;
break;
}
}
if (isHit)
appendValue(hitTarget, position(patchIter));
}
}
To ensure that we are not missing any hits we also have to scan the regions where the pattern is leaving the current node. You can solve this problem in the next assignment.
Assignment 4¶
- Type
- Review
- Objective
- Implement the last function _searchAtBorder, which identifies all hits that cross the border of the current node to the next node.
- Hints
The following code snippet provides you with the backbone for this function. Fill in the missing parts [A], [B], [C] and [D].
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _searchAtBorder(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & pattern) { // [A] Determine first position of the at which pattern crosses the border of current node. // [B] Determine last position before pattern exits the current node or reaches the end of the sequence. // [C] Move step by step over search region. // [D] Scan pattern in current window and report possible hits. }
- Solution
- Here is the solution to this function. Click more... below, to see a complete solution of everything we have done so far.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _searchAtBorder(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & pattern) { typedef typename Iterator<TJournal const, Standard>::Type TJournalIterator; // [A] Determine first position of the at which pattern crosses the border of current node. TJournalIterator nodeIter = iter(journal, entriesIt->virtualPosition + _max(0,(int)entriesIt->length - (int)length(pattern) + 1)); // [B] Determine last position before pattern exits the current node or reaches the end of the sequence. TJournalIterator nodeEnd = iter(journal, _min(entriesIt->virtualPosition + entriesIt->length, length(journal) - length(pattern) + 1)); if (nodeEnd == end(journal)) return; // [C] Move step by step over search region. for (; nodeIter != nodeEnd; ++nodeIter) { // [D] Scan pattern in current window and report possible hits. TJournalIterator verifyIter = nodeIter; bool isHit = true; // Compare pattern with current search window. for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern, ++verifyIter) { // Comparing the pattern value with the current value of the iterator. if(pattern[posPattern] != getValue(verifyIter)) { isHit = false; break; } } // Report hit if found. if (isHit) appendValue(hitTarget, position(nodeIter)); } }
Include the necessary headers.
#include <iostream> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan;
Search at the border the current node for the pattern.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _searchAtBorder(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & pattern) { typedef typename Iterator<TJournal const, Standard>::Type TJournalIterator; // [A] Determine first position of the at which pattern crosses the border of current node. TJournalIterator nodeIter = iter(journal, entriesIt->virtualPosition + _max(0,(int)entriesIt->length - (int)length(pattern) + 1)); // [B] Determine last position before pattern exits the current node or reaches the end of the sequence. TJournalIterator nodeEnd = iter(journal, _min(entriesIt->virtualPosition + entriesIt->length, length(journal) - length(pattern) + 1)); if (nodeEnd == end(journal)) return; // [C] Move step by step over search region. for (; nodeIter != nodeEnd; ++nodeIter) { // [D] Scan pattern in current window and report possible hits. TJournalIterator verifyIter = nodeIter; bool isHit = true; // Compare pattern with current search window. for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern, ++verifyIter) { // Comparing the pattern value with the current value of the iterator. if(pattern[posPattern] != getValue(verifyIter)) { isHit = false; break; } } // Report hit if found. if (isHit) appendValue(hitTarget, position(nodeIter)); } }
Search for the pattern in the insertion region covered by the current node.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _findInPatchNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & pattern) { typedef typename Iterator<TJournal const, Standard>::Type TJournalIterator; // Search for pattern in the insertion node. TJournalIterator patchIter = iter(journal, entriesIt->virtualPosition); TJournalIterator patchEnd = patchIter + _max(0, (int)entriesIt->length - (int)length(pattern) + 1); // Move step by step over search region. for (; patchIter != patchEnd; ++patchIter) { TJournalIterator verifyIter = patchIter; bool isHit = true; // Search for pattern in the insertion node. for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern, ++verifyIter) { // Comparing the pattern value with the current value of the iterator. if(pattern[posPattern] != getValue(verifyIter)) { isHit = false; break; } } if (isHit) appendValue(hitTarget, position(patchIter)); } }
Check if hit was reported for this region in the reference sequence.
template <typename TJournalEntriesIterator, typename TPattern> void _findInOriginalNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TPattern const & pattern, String<int> const & refHits) { // Define an Iterator which iterates over the reference hit set. typedef typename Iterator<String<int> const, Standard>::Type THitIterator; // Check if hits exist in the reference. if(!empty(refHits)) { // Find upper bound to physical position in sorted refHits. THitIterator itHit = std::upper_bound(begin(refHits),end(refHits),(int)entriesIt->physicalPosition); // Make sure we do not miss hits that begin at physical position of current node. if(itHit != begin(refHits) && *(itHit - 1) >= (int)entriesIt->physicalPosition) --itHit; // Store all hits that are found in the region of the reference which is covered by this node. while((int)*itHit < ((int)entriesIt->physicalPosition + (int)entriesIt->length - (int)length(pattern) + 1) && itHit != end(refHits)) { appendValue(hitTarget, entriesIt->virtualPosition + (*itHit - (int)entriesIt->physicalPosition)); ++itHit; } } }
Implementing the backbone of the search for the Journaled String.
template <typename TValue, typename THostSpec, typename TJournalSpec, typename TBufferSpec, typename TPattern> void findPatternInJournalString(String<int> & hitTarget, String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const & journal, TPattern const & pattern, String<int> const & refHits) { typedef String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const TJournal; typedef typename JournalType<TJournal>::Type TJournalEntries; typedef typename Iterator<TJournalEntries>::Type TJournalEntriesIterator; if (length(pattern) > length(journal)) return; TJournalEntriesIterator it = begin(journal._journalEntries); TJournalEntriesIterator itEnd = findInJournalEntries(journal._journalEntries, length(journal) - length(pattern) + 1) + 1; while(it != itEnd) { if (it->segmentSource == SOURCE_ORIGINAL) { // Find a possible hit in the current source vertex. _findInOriginalNode(hitTarget, it, pattern, refHits); } if (it->segmentSource == SOURCE_PATCH) { // Search for pattern within the patch node. _findInPatchNode(hitTarget, it, journal, pattern); } // Scan the border for a possible match. _searchAtBorder(hitTarget, it, journal, pattern); ++it; } }
Implementing the search for the reference sequence.
template <typename TString, typename TPattern> void findPatternInReference(String<int> & hits, TString const & reference, TPattern const & pattern) { // Check whether the pattern fits into the sequence. if (length(pattern) > length(reference)) return; // for (unsigned pos = 0; pos < length(reference) - length(pattern) + 1; ++pos) { bool isHit = true; for (unsigned posPattern = 0; posPattern < length(pattern); ++posPattern) { if(pattern[posPattern] != reference[posPattern + pos]) { isHit = false; break; } } // Report the position if found a hit. if(isHit) appendValue(hits, pos); } }
The backbone of the search method.
template <typename TString, typename TPattern> void searchPattern(StringSet<String<int> > & hitSet, StringSet<TString, Owner<JournaledSet> > const & journalSet, TPattern const & pattern) { typedef typename Host<TString>::Type THost; // Check for valid initial state. if (empty(globalReference(journalSet))) { std::cout << "No reference set. Aborted search!" << std::endl; return; } // Reset the hitSet to avoid phantom hits coming from a previous search. clear(hitSet); resize(hitSet, length(journalSet) + 1); // Access the reference sequence. THost & globalRef = globalReference(journalSet); // Search for pattern in the reference sequence. findPatternInReference(hitSet[0], globalRef, pattern); // Search for pattern in the journaled sequences. for (unsigned i = 0; i < length(journalSet); ++i) findPatternInJournalString(hitSet[i+1], journalSet[i], pattern, hitSet[0]); }
Loading and joining the sequences.
Implementing the main function.
int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/Users/rahn_r/Downloads/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. TJournaledSet journalSet; JoinConfig<GlobalAlign<JournaledCompact> > joinConfig; loadAndJoin(journalSet, databaseFile, joinConfig); databaseFile.close(); // Define a pattern and start search. StringSet<String<int> > hitSet; TSequence pattern = "GTGGT"; std::cout << "Search for: " << pattern << ":\n"; searchPattern(hitSet, journalSet, pattern);
Reporting the identified hits.
if (empty(hitSet[0])) { std::cout << "No hit in reference " << std::endl; } else { std::cout << "Hit in reference " << " at "; for (unsigned j = 0; j < length(hitSet[0]); ++j) std::cout << hitSet[0][j] << ": " << infix(globalReference(journalSet), hitSet[0][j],hitSet[0][j] + length(pattern)) << "\t"; } std::cout << std::endl; for (unsigned i = 1; i < length(hitSet); ++i) { if (empty(hitSet[i])) { std::cout << "No hit in sequence " << i - 1 << std::endl; } else { std::cout << "Hit in sequence " << i - 1 << " at "; for (unsigned j = 0; j < length(hitSet[i]); ++j) std::cout << hitSet[i][j] << ": " << infix(value(journalSet,i-1), hitSet[i][j],hitSet[i][j] + length(pattern)) << "\t"; } std::cout << std::endl; } return 0; }
And here is the result.
Search for: GTGGT: Hit in reference at 311: GTGGT 644: GTGGT Hit in sequence 0 at 151: GTGGT 312: GTGGT Hit in sequence 1 at 308: GTGGT Hit in sequence 2 at 311: GTGGT 507: GTGGT Hit in sequence 3 at 327: GTGGT Hit in sequence 4 at 307: GTGGT 312: GTGGT 317: GTGGT Hit in sequence 5 at 0: GTGGT 320: GTGGT 986: GTGGT
Congratulations! You have just implemented a cool online-search which is speed up by exploiting the parallelism given by the data set. And here is the final result.
Search for: GTGGT:
Hit in reference at 311: GTGGT 644: GTGGT
Hit in sequence 0 at 151: GTGGT 312: GTGGT
Hit in sequence 1 at 308: GTGGT
Hit in sequence 2 at 311: GTGGT 507: GTGGT
Hit in sequence 3 at 327: GTGGT
Hit in sequence 4 at 307: GTGGT 312: GTGGT 317: GTGGT
Hit in sequence 5 at 0: GTGGT 320: GTGGT 986: GTGGT
Assignment 5¶
- Type
- Transfer
- Objective
- Try to replace the brute force versions using using SeqAn’s Finder and Pattern concept. You can find additional material to this topic in the Pattern Matching Tutorial.
- Solution
Now we want to replace the brute force methods with some cool pattern matching algorithms. Therefore we include the header <seqan/finder.h>.
#include <iostream> #include <seqan/find.h> #include <seqan/seq_io.h> #include <seqan/journaled_set.h> using namespace seqan;
Now we can use the Finder interface of SeqAn. One cool thing of the usage of the Finder class is that we don’t have to check for the borders anymore. This will do the Finder for us. We only have to specify the correct infix over which the Finder should iterate to find the pattern. We first compute the positions that enclose the search region. Afterwards, we get an infix for this region and pass it to the Finder’s constructor. We also have to define the Pattern object which gets the pattern we are searching for. Then we can simply call the function find until we there is no more match. Be careful when storing the position that the Finder is returning. We have to recompute the correct virtual position since we used an infix of the original search text.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _searchAtBorder(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & needle) { typedef typename Position<TJournal>::Type TPosition; // Define region before the border to the next node to search for the pattern. TPosition infixBegin = entriesIt->virtualPosition + _max(0,(int)entriesIt->length - (int)length(needle) + 1); TPosition infixEnd = _min(length(journal),entriesIt->virtualPosition + entriesIt->length + length(needle) - 1); TPattern tmpInsBuffer = infix(journal, infixBegin, infixEnd); Finder<TPattern const> finder(tmpInsBuffer); Pattern<TPattern, Horspool> pattern(needle); while (find(finder, pattern)) { appendValue(hitTarget, infixBegin + beginPosition(finder)); } }
So the biggest change is done. We simply repeat the changes from above and watch to get the correct virtual position.
template <typename TJournalEntriesIterator, typename TJournal, typename TPattern> void _findInPatchNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TJournal const & journal, TPattern const & needle) { typedef typename Position<TJournal>::Type TPosition; TPosition infixBegin = entriesIt->virtualPosition; TPosition infixEnd = entriesIt->virtualPosition + entriesIt->length; TPattern tmpInsBuffer = infix(journal, infixBegin, infixEnd); Finder<TPattern const> finder(tmpInsBuffer); Pattern<TPattern, Horspool> pattern(needle); while (find(finder, pattern)) appendValue(hitTarget, entriesIt->virtualPosition + beginPosition(finder)); }
Of course we don’t need to change anything for the original nodes.
template <typename TJournalEntriesIterator, typename TPattern> void _findInOriginalNode(String<int> & hitTarget, TJournalEntriesIterator & entriesIt, TPattern const & pattern, String<int> const & refHits) { // Define an Iterator which iterates over the reference hit set. typedef typename Iterator<String<int> const, Standard>::Type THitIterator; // Check if hits exist in the reference. if(!empty(refHits)) { // Find upper bound to physical position in sorted refHits. THitIterator itHit = std::upper_bound(begin(refHits),end(refHits),(int)entriesIt->physicalPosition); // Make sure we do not miss hits that begin at physical position of current node. if(itHit != begin(refHits) && *(itHit - 1) >= (int)entriesIt->physicalPosition) --itHit; // Store all hits that are found in the region of the reference which is covered by this node. while((int)*itHit < ((int)entriesIt->physicalPosition + (int)entriesIt->length - (int)length(pattern) + 1) && itHit != end(refHits)) { appendValue(hitTarget, entriesIt->virtualPosition + (*itHit - (int)entriesIt->physicalPosition)); ++itHit; } } }
Also the function findPatternInJournalString remains the same.
template <typename TValue, typename THostSpec, typename TJournalSpec, typename TBufferSpec, typename TPattern> void findPatternInJournalString(String<int> & hitTarget, String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const & journal, TPattern const & pattern, String<int> const & refHits) { typedef String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> > const TJournal; typedef typename JournalType<TJournal>::Type TJournalEntries; typedef typename Iterator<TJournalEntries>::Type TJournalEntriesIterator; if (length(pattern) > length(journal)) return; TJournalEntriesIterator it = begin(journal._journalEntries); TJournalEntriesIterator itEnd = findInJournalEntries(journal._journalEntries, length(journal) - length(pattern) + 1) + 1; while(it != itEnd) { if (it->segmentSource == SOURCE_ORIGINAL) { // Find a possible hit in the current source vertex. _findInOriginalNode(hitTarget, it, pattern, refHits); } if (it->segmentSource == SOURCE_PATCH) { // Search for pattern within the patch node. _findInPatchNode(hitTarget, it, journal, pattern); } // Scan the border for a possible match. _searchAtBorder(hitTarget, it, journal, pattern); ++it; } }
We will switch to the Finder concept for the function findPatternInReference too. This is done quickly, since we have the basis already laid down in the previous functions.
template <typename TString> void findPatternInReference(String<int> & hits, TString const & reference, TString const & needle) { // Check whether the pattern fits into the sequence. if (length(needle) > length(reference)) return; Finder<TString const> finder(reference); Pattern<TString, Horspool> pattern(needle); while (find(finder, pattern)) appendValue(hits, beginPosition(finder)); }
From here on, we don’t have to change anything.
template <typename TString, typename TPattern> void searchPattern(StringSet<String<int> > & hitSet, StringSet<TString, Owner<JournaledSet> > const & journalSet, TPattern const & pattern) { typedef typename Host<TString>::Type THost; // Check for valid initial state. if (empty(globalReference(journalSet))) { std::cout << "No reference set. Aborted search!" << std::endl; return; } // Reset the hitSet to avoid phantom hits coming from a previous search. clear(hitSet); resize(hitSet, length(journalSet) + 1); // Access the reference sequence. THost & globalRef = globalReference(journalSet); // Search for pattern in the reference sequence. findPatternInReference(hitSet[0], globalRef, pattern); // Search for pattern in the journaled sequences. for (unsigned i = 0; i < length(journalSet); ++i) findPatternInJournalString(hitSet[i+1], journalSet[i], pattern, hitSet[0]); }
We write the same main body ...
template <typename TString, typename TStream, typename TSpec> inline int loadAndJoin(StringSet<TString, Owner<JournaledSet> > & journalSet, TStream & stream, JoinConfig<TSpec> const & joinConfig) { typedef typename Host<TString>::Type THost; RecordReader<std::ifstream, SinglePass<> > reader(stream); clear(journalSet); String<char> seqId; THost sequence; // No sequences in the fasta file! if (atEnd(reader)) { std::cerr << "Empty FASTA file." << std::endl; return -1; } // First read sequence for reference sequence. if (readRecord(seqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } // We have to create the global reference sequence otherwise we loose the information after this function terminates. createGlobalReference(journalSet, sequence); // If there are more while (!atEnd(reader)) { if (readRecord(seqId, sequence, reader, Fasta()) != 0) { std::cerr << "ERROR reading FASTA." << std::endl; return 1; } appendValue(journalSet, TString(sequence)); join(journalSet, length(journalSet) - 1, joinConfig); } return 0; }
and finally print the results.
int main() { // Definition of the used types. typedef String<Dna,Alloc<> > TSequence; typedef String<Dna,Journaled<Alloc<>,SortedArray,Alloc<> > > TJournal; typedef StringSet< TJournal, Owner<JournaledSet> > TJournaledSet; // Open the stream to the file containing the sequences. String<char> seqDatabasePath = "/Users/rahn_r/Downloads/sequences.fasta"; // String<char> seqDatabasePath = "/path/to/your/fasta/file/sequences.fasta"; std::ifstream databaseFile(toCString(seqDatabasePath), std::ios_base::in); if(!databaseFile.good()) { std::cerr << "Cannot open file <" << seqDatabasePath << ">!" << std::endl; } // Reading each sequence and journal them. TJournaledSet journalSet; JoinConfig<GlobalAlign<JournaledCompact> > joinConfig; loadAndJoin(journalSet, databaseFile, joinConfig); databaseFile.close(); // Define a pattern and start search. StringSet<String<int> > hitSet; TSequence pattern = "GTGGT"; std::cout << "Search for: " << pattern << "\n"; searchPattern(hitSet, journalSet, pattern); if (empty(hitSet[0])) { std::cout << "No hit in reference " << std::endl; } else { std::cout << "Hit in reference " << " at "; for (unsigned j = 0; j < length(hitSet[0]); ++j) std::cout << hitSet[0][j] << ": " << infix(globalReference(journalSet), hitSet[0][j],hitSet[0][j] + length(pattern)) << "\t"; } std::cout << std::endl; for (unsigned i = 1; i < length(hitSet); ++i) { if (empty(hitSet[i])) { std::cout << "No hit in sequence " << i - 1 << std::endl; } else { std::cout << "Hit in sequence " << i - 1 << " at "; for (unsigned j = 0; j < length(hitSet[i]); ++j) std::cout << hitSet[i][j] << ": " << infix(value(journalSet,i-1), hitSet[i][j],hitSet[i][j] + length(pattern)) << "\t"; } std::cout << std::endl; } return 0; }
And here is the result using the Finder and Pattern concept of SeqAn.
Search for: GTGGT: Hit in reference at 311: GTGGT 644: GTGGT Hit in sequence 0 at 151: GTGGT 312: GTGGT Hit in sequence 1 at 308: GTGGT Hit in sequence 2 at 311: GTGGT 507: GTGGT Hit in sequence 3 at 327: GTGGT Hit in sequence 4 at 307: GTGGT 312: GTGGT 317: GTGGT Hit in sequence 5 at 0: GTGGT 320: GTGGT 986: GTGGT