Journaled String Tree

Learning Objective
This tutorial introduces you to the new data structure Journaled String Tree. You will learn how to use this data structure and how to exploit it for an efficient analysis while searching multiple sequences simultaneously.
Difficulty
Average
Duration
45 min
Prerequisites
Sequences, String Sets

A typical task in bioinformatics is to find patterns in biological sequences e.g. transcription factors, or to examine different biological traits and the effects of modifications on such traits. With the current advances in sequencing technologies, sequences of whole populations have been made available. But the time for searching in all these sequences is proportional to the number of sequences searched. That’s why it is important to find novel strategies to cope with the deluge of sequencing data. Since, many biological problems often involve the analysis of sequences of the same species, one effective strategy would be to exploit the similarities of such sequences.

For this special purpose we will introduce you to two data structures that are designed to improve these algorithmic tasks. The first one is the JournaledString and the second is the JournaledStringTree.

Journaled String

The JournaledString data structure behaves like a normal String in SeqAn, except that it is composed of two data structures.

  1. The first data structure is a Holder which stores a sequence.
  2. The second data structure stores modifications that are made to this particular sequence using a journal (see Journaling Filesystems for more information). This journal contains a list of deletions and insertions. The inserted characters are stored in an additional insertion buffer.

The advantage of this data structure lies in representing a String as a “patch” to another String. The journaled data structure can be modified without loosing the original context. We want to show you how to work with these data structures so you can build your own algorithms based on this.

First, we show you how to work with the Journaled String so you can learn the basic principles. To get access to the Journaled String implementation you have to include the <seqan/sequence_journaled.h> header file. Note that you will need the <seqan/stream.h> too in order to print the sequences.

#include <seqan/stream.h>
#include <seqan/sequence_journaled.h>

using namespace seqan;

int main()
{

In the next step we define the Journaled String type. A Journaled String is a specialization of the String class and is defined as String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> >. The specialization takes two parameters: (1) TValue defines the alphabet type used for the Journaled String and (2) Journaled<> selects the Journaled String specialization of the String class.

Journaled<> is further specialized with

  • THostSpec selects the specialization of the underlying host sequence (Alloc<> for [dox:AllocString Alloc String),
  • TJournaleSpec selects the used implementation to manage the journaled differences (here: SortedArray), and
  • TBufferSpec selects the used specialization for the internally managed insertion buffer (here: Alloc<> as well).

In our scenario we use a char alphabet and [dox:AllocString Alloc String for the host string and the insertion buffer. Additionally, we use a Sorted Array as the model to manage the recorded differences.

We use the metafunction Host to get the type of the underlying host string used for the Journaled String.

    typedef String<char, Journaled<Alloc<>, SortedArray, Alloc<> > > TJournaledString;
    typedef Host<TJournaledString>::Type THost;

Now we can define the variables holding data structures. First, we construct our host sequence and after that we construct the Journaled String. Then, we set the host sequence using the function setHost. Afterwards, we examine the data structure in more detail and print the host sequence the constructed journaled sequence and the nodes of it.

    THost hostStr = "thisisahostsequence";
    TJournaledString journalStr;
    setHost(journalStr, hostStr);

    std::cout << "After creating the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: " << journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
    std::cout << std::endl;

Tip

The Journal

A node in the Journaled String represents either a part of the host sequence or a part of the insertion buffer. The type of a node is distinguished by the member variable segmentSource and can be of value SOURCE_ORIGINAL to refere to a part in the host or SOURCE_PATCH to refere to a part in the insertion buffer. A node further consists of three variables which specify the virtual position, the physical position and the length of this part. The virtual position gives the relative position of the Journaled String after all modifications before this position have been “virtually” applied. The physical position gives the absolute position where this part of the journal maps to either the host sequence or the insertion buffer.

This is followed by modifying our Journaled String. We insert the string "modified" at position 7 and delete the suffix "sequence" at position 19. Note that position 19 refers to the string after the insertion of "modified" at position 7. Again we print the host, the journaled sequence and the nodes that represent the modifications to see how our changes affect the host and the journaled sequence.

    insert(journalStr, 7, "modified");
    erase(journalStr, 19, 27);

    std::cout << "After modifying the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: " << journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
    std::cout << std::endl;

All of this is followed by calling flatten on our journeld string. This call applies all journaled changes to the host sequence. Again we print the sequences to see the effects.

    flatten(journalStr);
    std::cout << "After flatten the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: " << journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;

    return 0;
}

Here is the output of our small program.

After creating the Journaled String:
Host: thisisahostsequence
Journal: thisisahostsequence
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, physicalOriginPosition=0, length=19})

After modifying the Journaled String:
Host: thisisahostsequence
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, physicalOriginPosition=0, length=7}, {segmentSource=2, virtualPosition=7, physicalPosition=0, physicalOriginPosition=0, length=8}, {segmentSource=1, virtualPosition=15, physicalPosition=7, physicalOriginPosition=7, length=4})

After flatten the Journaled String:
Host: thisisamodifiedhost
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, physicalOriginPosition=0, length=19})

Important

Be careful when using the flatten function as it modifies the underlying host sequence. This might affect other journaled sequences that share the same host sequence. This becomes important especially when working with Journaled Sets where a whole set of sequences is journaled to the same reference.

Simultaneously Searching Multiple Sequences

Now, we come to a simple example to demonstrate the use of the Journaled String Tree (JST). As you could imagine, the JST internally uses a set of Journaled Strings to buffer the sequences, while requiring only a low memory footprint.

In this article, we are going to create a small JST, which we will use to search for a pattern using the Horspool algorithm.

Let’s just start with the include headers. In order to make the JST implementation visible to our source code we need to include the header include <seqan/journaled_string_tree.h>.

#include <iostream>
#include <seqan/stream.h>
#include <seqan/journaled_string_tree.h>

In the next step, we are going to define the type of the JST.

using namespace seqan;

int main()
{
    typedef JournaledStringTree<DnaString> TJst;
    typedef Pattern<DnaString, Horspool>   TPattern;
    typedef Traverser<TJst>::Type          TTraverser;

The only information that is required is the type of the reference sequence used for the underlying sequence. We also defined the pattern type and a traverser type, which we will explain soon.

Now, we are ready to initialize the JST. To construct a JST, we need to know the reference sequence and how many sequences should be represented by the JST. In our case we assume 10 sequences. The JST supports insertion or deletion of delta events. A delta event is a tuple consisting of four parameters: The reference position, the value, the coverage and the delta type. The reference position determines the position within the reference sequence, where this event occurs. The value represents the actual modification applied to the sequences, that are determined by the coverage. The type of the value depends on the delta type.

Tip

The internal types, e.g. the types of the different delta events, of the JST can be overloaded with a second optional traits object. If no trait object is given DefaultJstConfig is taken as default. See the API documentation for more information.

The following listing creates a JST and inserts some delta events into the object:

    DnaString seq = "AGATCGAGCGAGCTAGCGACTCAG";
    TJst jst(seq, 10);

    insert(jst, 1, 3, std::vector<unsigned>{1, 3, 5, 6, 7}, DeltaTypeDel());
    insert(jst, 8, "CGTA", std::vector<unsigned>{1, 2}, DeltaTypeIns());
    insert(jst, 10, 'C', std::vector<unsigned>{4, 9}, DeltaTypeSnp());
    insert(jst, 15, 2, std::vector<unsigned>{0, 4, 7}, DeltaTypeDel());
    insert(jst, 20, 'A', std::vector<unsigned>{0, 9}, DeltaTypeSnp());
    insert(jst, 20, Pair<unsigned, DnaString>(1, "CTC"), std::vector<unsigned>{1, 2, 3, 7}, DeltaTypeSV());

After creating the JST, we can now prepare the search. To do so, we first define a needle that we want to search. Second, we need to instantiate a traverser object. A traverser represents the current state of the traversal over the JST. It is comparable to an iterator, but it is not lightweight, as it uses a state stack to implement the traversal over the JST. The traverser is initialized with two arguments: The instance of the JST and the context length, which is in our case the length of the needle.

Here is the listing:

    DnaString ndl = "AGCGT";
    TTraverser trav(jst, length(ndl));

    TPattern pat(ndl);
    JstExtension<TPattern> ext(pat);

In line 4 and 5 in the listing above we initialize the pattern with the needle and then create an JstExtension object. This JstExtension is needed to extend the Pattern class of SeqAn with some auxiliary functions necessary for the JST based search. The only thing required, is that pat is fully initialized when passing it to ext.

The last preparation step we need before invoking the search algorithm is to create a functor that is called, whenever the search algorithm finds a match. In our scenario we simply want to print the sequences and the positions where the hit occurs. Therefor we create a simple MatchPrinter functor:

template <typename TTraverser>
struct MatchPrinter
{
    TTraverser & trav;

    MatchPrinter(TTraverser & _trav) : trav(_trav)
    {}

    void
    operator()()
    {
        auto pos = position(trav);
        for (auto p : pos)
        {
            std::cout  << "Seq: " << p.i1 << " Pos: " << p.i2 << std::endl;
        }
    }
};

This match printer, holds a reference to the actual traverser. So we can call the position function on the traverser, when the function-call-operator is invoked by the search algorithm.

Now we can invoke the search using the find interface:

    MatchPrinter<TTraverser> delegate(trav);
    find(trav, ext, delegate);

    return 0;
}

And finally the output:

Seq: 1 Pos: 7
Seq: 2 Pos: 10

The following list gives an overview of the available search algorithms:

Horspool
Exact online search using HorspoolPattern as base pattern class.
ShiftAnd
Exact online search using ShiftAndPattern as base pattern class.
ShiftOr
Exact online search using ShiftOrPattern as base pattern class.
MyersUkkonen
Approximate online search using MyersUkkonen as base pattern class.

Assignment 1

Type
Review

Objective

Use the code from above and find all patterns of the needle CCTCCA with up to 2 errors.
Hints
When searching with errors, the context size needs to be updated accordingly.
Solution

Since we are trying to find the needle approximatively, we need to use the Myers' bitvector algorithm. Here is the entire solution:

#include <iostream>
#include <seqan/stream.h>
#include <seqan/journaled_string_tree.h>

template <typename TTraverser>
struct MatchPrinter
{
    TTraverser & trav;

    MatchPrinter(TTraverser & _trav) : trav(_trav)
    {}

    void
    operator()()
    {
        auto pos = position(trav);
        for (auto p : pos)
        {
            std::cout  << "Seq: " << p.i1 << " Pos: " << p.i2 << std::endl;
        }
    }
};

using namespace seqan;

int main()
{
    typedef JournaledStringTree<DnaString>   TJst;
    typedef Pattern<DnaString, MyersUkkonen> TPattern;
    typedef Traverser<TJst>::Type            TTraverser;

    DnaString seq = "AGATCGAGCGAGCTAGCGACTCAG";
    TJst jst(seq, 10);

    insert(jst, 1, 3, std::vector<unsigned>{1, 3, 5, 6, 7}, DeltaTypeDel());
    insert(jst, 8, "CGTA", std::vector<unsigned>{1, 2}, DeltaTypeIns());
    insert(jst, 10, 'C', std::vector<unsigned>{4, 9}, DeltaTypeSnp());
    insert(jst, 15, 2, std::vector<unsigned>{0, 4, 7}, DeltaTypeDel());
    insert(jst, 20, 'A', std::vector<unsigned>{0, 9}, DeltaTypeSnp());
    insert(jst, 20, Pair<unsigned, DnaString>(1, "CTC"), std::vector<unsigned>{1, 2, 3, 7}, DeltaTypeSV());

    DnaString ndl = "CCTCCA";
    TTraverser trav(jst, length(ndl) + 2);

    TPattern pat(ndl, -2);
    JstExtension<TPattern> ext(pat);

    MatchPrinter<TTraverser> delegate(trav);
    find(trav, ext, delegate);

    return 0;
}

And here is the output:

Seq: 7 Pos: 17
Seq: 7 Pos: 18
Seq: 4 Pos: 20
Seq: 1 Pos: 23
Seq: 2 Pos: 26
Seq: 3 Pos: 19
Seq: 1 Pos: 24
Seq: 2 Pos: 27
Seq: 3 Pos: 20
Seq: 1 Pos: 25
Seq: 2 Pos: 28
Seq: 3 Pos: 21
Seq: 7 Pos: 19
Seq: 1 Pos: 26
Seq: 2 Pos: 29
Seq: 3 Pos: 22
Seq: 7 Pos: 20
Seq: 5 Pos: 19
Seq: 6 Pos: 19
Seq: 8 Pos: 22