Journaled Set

Learning Objective
This tutorial introduces you to the new data structures Journaled Set and Journaled String. You will learn how to use them and how to exploit these data structures for an efficient analysis, while implementing a native online search.
Difficulty
Advanced
Duration
2 h
Prerequisites
Sequences, String Sets, Iterators

A typical task in bioinformatics is to find patterns in biological sequences e.g. transcription factors, or to examine different biological traits and the effects of modifications on such traits. With the current advances in sequencing technologies, sequences of whole populations have been made available. But the time for searching in all these sequences is proportional to the number of sequences searched. That’s why it is important to find novel strategies to cope with the deluge of sequencing data. Since, many biological problems often involve the analysis of sequences of the same species, one effective strategy would be to exploit the similarities of such sequences.

For this special purpose we provide two data structures that are designed to improve the algorithmic tasks. The first one is the JournaledString and the second is the JournaledSet.

In this tutorial, we will introduce you to both data structures and implement a simple online search step by step.

Journaled String

The JournaledString data structure behaves like a normal String in SeqAn, except that it is composed of two data structures.

  1. The first data structure is a Holder which stores a sequence.
  2. The second data structure stores modifications that are made to this particular sequence using a journal (see Journaling Filesystems for more information). This journal contains a list of deletions and insertions. The inserted characters are stored in an additional insertion buffer.

The advantage of this data structure lies in representing a String as a “patch” to another String. The journaled data structure can be modified without loosing the original context. We want to show you how to work with these data structures so you can build your own algorithms based on this. For this reason we keep the applicational background simple and implement an native online-search algorithm by which we examine different properties of the data structures.

Before we start implementing our online search algorithm, we show you how to work with the Journaled String to learn the basic principles. To get access to the Journaled String implementation you have to include the <seqan/sequence_journaled.h> header file. Note that you will need the <seqan/file.h> too in order to print the sequences.

#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence_journaled.h>

using namespace seqan;

int main()
{

In the next step we define the Journaled String type. A Journaled String is a specialization of the String class and is defined as String<TValue, Journaled<THostSpec, TJournalSpec, TBufferSpec> >. The specialization takes two parameters: (1) TValue defines the alphabet type used for the Journaled String and (2) Journaled<> selects the Journaled String specialization of the String class.

Journaled<> is further specialized with

  • THostSpec selects the specialization of the underlying host sequence (Alloc<> for [dox:AllocString Alloc String),
  • TJournaleSpec selects the used implementation to manage the journaled differences (here: SortedArray), and
  • TBufferSpec selects the used specialization for the internally managed insertion buffer (here: Alloc<> as well).

In our scenario we use a char alphabet and [dox:AllocString Alloc String for the host string and the insertion buffer. Additionally, we use a Sorted Array as the model to manage the recorded differences.

We use the metafunction Host to get the type of the underlying host string used for the Journaled String.

    typedef String<char, Journaled<Alloc<>, SortedArray, Alloc<> > > TJournaledString;
    typedef Host<TJournaledString>::Type THost;

Now we can define the variables holding data structures. First, we construct our host sequence and after that we construct the Journaled String. Then, we set the host sequence using the function setHost. Afterwards, we examine the data structure in more detail and print the host sequence the constructed journaled sequence and the nodes of it.

    String<char> hostStr = "thisisahostsequence";
    TJournaledString journalStr;
    setHost(journalStr, hostStr);

    std::cout << "After creating the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: "<< journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
    std::cout << std::endl;

Tip

The Journal

A node in the Journaled String represents either a part of the host sequence or a part of the insertion buffer. The type of a node is distinguished by the member variable segmentSource and can be of value SOURCE_ORIGINAL to refere to a part in the host or SOURCE_PATCH to refere to a part in the insertion buffer. A node further consists of three variables which specify the virtual position, the physical position and the length of this part. The virtual position gives the relative position of the Journaled String after all modifications before this position have been “virtually” applied. The physical position gives the absolute position where this part of the journal maps to either the host sequence or the insertion buffer.

This is followed by modifying our Journaled String. We insert the string "modified" at position 7 and delete the suffix "sequence" at position 19. Note that position 19 refers to the string after the insertion of "modified" at position 7. Again we print the host, the journaled sequence and the nodes that represent the modifications to see how our changes affect the host and the journaled sequence.

    insert(journalStr, 7, "modified");
    erase(journalStr, 19,27);

    std::cout << "After modifying the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: " << journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;
    std::cout << std::endl;

All of this is followed by calling flatten on our journeld string. This call applies all journaled changes to the host sequence. Again we print the sequences to see the effects.

    flatten(journalStr);
    std::cout << "After flatten the Journaled String:" << std::endl;
    std::cout << "Host: " << host(journalStr) << std::endl;
    std::cout << "Journal: " << journalStr << std::endl;
    std::cout << "Nodes: " << journalStr._journalEntries << std::endl;

    return 0;
}

Here is the output of our small program.

After creating the Journaled String:
Host: thisisahostsequence
Journal: thisisahostsequence
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=19})

After modifying the Journaled String:
Host: thisisahostsequence
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=7}, {segmentSource=2, virtualPosition=7, physicalPosition=0, length=8}, {segmentSource=1, virtualPosition=15, physicalPosition=7, length=4})

After flatten the Journaled String:
Host: thisisamodifiedhost
Journal: thisisamodifiedhost
Nodes: JournalEntries({segmentSource=1, virtualPosition=0, physicalPosition=0, length=19})

Important

Be careful when using the flatten function as it modifies the underlying host sequence. This might affect other journaled sequences that share the same host sequence. This becomes important especially when working with Journal Sets where a whole set of sequences is journaled to the same reference.

Journaled Set

The JournaledSet is a specialization of the StringSet which can be used exactly as such but also provides some additional functions optimized to work with JournaledStrings. The general interface is equal to the interface of the StringSet. But it also provides some interfaces specialized for the use of Journaled Strings. One of these interfaces is the join function which journales a contained Journaled String to the previously set global reference. The following code snippet demonstrates the usage of the Journal Set and how to join a sequence to the previously set reference sequence.

As usual we include the necessary headers. We need the header <seqan/journal_set.h> to get access to the Journal Set. Then we define a type for journaled sequences. After that we define our Journal Set. The Journal Set is a specialization of the Owner concept of StringSets and is defined as StringSet<TJournalString, Owner<JournaledSet> >.

#include <iostream>
#include <seqan/file.h>
#include <seqan/journaled_set.h>

using namespace seqan;

int main()
{
    typedef String<char, Journaled<Alloc<>, SortedArray, Alloc<> > > TJournalString;
    typedef Host<TJournalString>::Type THost;
    typedef StringSet<TJournalString, Owner<JournaledSet> > TJournaledSet;

    TJournaledSet journaledSet;

    THost reference = "DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE";
    THost seq0 = "DPKKPRGKMVNSPPAFFVQTSREEHKKKHPDASVFSKKCSERWKTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE";
    THost seq1 = "DPHHPPKPRGKMVNSPPAFFVQTSREEHKPDASVFSKKCSERRMPNHHTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE";
    THost seq2 = "DPKKPRGKMSSYAFFVQTSREEHKKKHPKKCDEFSKKCSERWKTMSAKEKGKFEDARYEREMKTYIPPKGE";

In the subsequent steps we want to set a reference sequence to the Journal Set and add some sequences to it. We can set the reference sequence by using the function setHost. This function stores only a pointer to the given sequence. In some cases it might be necessary to copy the reference sequence instead. For this purpose you can use the function createHost.

    setGlobalReference(journaledSet, reference);
    appendValue(journaledSet, TJournalString(seq0));
    appendValue(journaledSet, TJournalString(seq1));
    appendValue(journaledSet, TJournalString(seq2));

Just adding sequences to the Journal Set does not automatically journal them to the global reference sequence of the set. One can explicitly trigger this process using the function join. This function takes as parameters the Journal Set and the position of the contained Journaled String which is to be journaled to the reference sequence. Thus, the programmer is free in the decision which sequence has to be journaled and which not. Furthermore, we can use the JoinConfig object to specify the method that shall be used for the journaling process.

    join(journaledSet, 0, JoinConfig<GlobalAlign<JournaledManhatten> >());  // Simply inserts the
    join(journaledSet, 1, JoinConfig<GlobalAlign<JournaledCompact> >());     // Uses default scoring scheme to compute compact journal.
    JoinConfig<GlobalAlign<JournaledCompact> > joinConfig;
    setScoringScheme(joinConfig, Score<int, BiAffine>(0,-1,-1));    // Note the mismatch score is forbidden internally when used in the context of journaling.
    join(journaledSet, 2, joinConfig);  // Compute journal using Levenshtein distance.

    std::cout << "Reference: " << globalReference(journaledSet) << std::endl;
    for(unsigned i = 0; i < length(journaledSet); ++i)
        std::cout << "Journaled Sequence " << i << ": " << value(journaledSet,i) << std::endl;

    return 0;
}

Tip

Configuration of the Join Methods

The JoinConfig object differentiates two methods in general and each method further differs in the used strategy. The two methods are the GlobalAlign and the GlobalChain method. They differ in the approach of computing the alignment which is necessary to construct the journal. The first method uses a global alignment algorithm while the second one uses an anchor approach in which first exact seeds are found using a q-gram index and after that the optimal chain between the identified anchors is computed. For each method the user can specify a different strategy. The first strategy is triggered by using JournaledManhatten. This means for the the GlobalAlign method, that the complete sequence is inserted and the complete reference is deleted, while for the GlobalChain methods this means that the gaps between the anchors are connected through the Manhatten distance. The second strategy is specified using the JournaledCompact tag. It computes the most compact form of a journal by menas of memory requirements.

Here is the output of the program.

Reference: DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
Journaled Sequence 0: DPKKPRGKMVNSPPAFFVQTSREEHKKKHPDASVFSKKCSERWKTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE
Journaled Sequence 1: DPHHPPKPRGKMVNSPPAFFVQTSREEHKPDASVFSKKCSERRMPNHHTMSAKEKGKFEDMAKARYEREMKTTYIPKGETYIPPKGE
Journaled Sequence 2: DPKKPRGKMSSYAFFVQTSREEHKKKHPKKCDEFSKKCSERWKTMSAKEKGKFEDARYEREMKTYIPPKGE
comments powered by Disqus