String Sets

Learning Objective
You will learn the advantages of StringSets and how to work with them.
Difficulty
Basic
Duration
15 min
Prerequisites
Sequences

A set of sequences can either be stored in a sequence of sequences, for example in a String<String<char> >, or in a StringSet. This tutorial will introduce you to the SeqAn class StringSet, its background and how to use it.

Background

One advantage of using StringSet is that it supports the function concat that returns a concatenator of all sequences in the string set. A concatenator is an object that represents the concatenation of a set of strings. This way, it is possible to build up index data structures for multiple sequences by using the same construction methods as for single sequences.

There are two kinds of StringSet specializations in SeqAn: Owner StringSet, the default specialization, and Dependent StringSet; see the list below for details. Owner StringSets actually store the sequences, whereas Dependent StringSets only refer to sequences that are stored outside of the string set.

    StringSet<DnaString>               ownerSet;
    StringSet<DnaString, Owner<> >     ownerSet2;      // same as above
    StringSet<DnaString, Dependent<> > dependentSet;

The specialization ConcatDirect StringSet already stores the sequences in a concatenation. The concatenators for all other specializations of StringSet are virtual sequences; that is, their interface simulates a concatenation of the sequences, but they do not literally concatenate the sequences into a single sequence. Hence, the sequences do not need to be copied when a concatenator is created.
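
To make "stores the sequences in a concatenation" more concrete, here is a minimal sketch in plain C++ rather than SeqAn. The type ConcatDirectSketch and its members (concat, limits, append, at, erase) are hypothetical names chosen for illustration, and SeqAn's actual implementation may differ. The sketch assumes one long string plus an array of boundary positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration type, not SeqAn's implementation.
// All sequences live in one long string; limits marks the boundaries.
struct ConcatDirectSketch
{
    std::string concat;                  // all sequences, back to back
    std::vector<std::size_t> limits{0};  // limits[i] = start of sequence i,
                                         // limits.back() = total length

    std::size_t count() const { return limits.size() - 1; }

    // Appending at the end only extends the long string.
    void append(std::string const & s)
    {
        concat += s;
        limits.push_back(concat.size());
    }

    // Return a copy of the sequence at position i.
    std::string at(std::size_t i) const
    {
        assert(i < count());
        return concat.substr(limits[i], limits[i + 1] - limits[i]);
    }

    // Removing sequence i shifts every later sequence in memory
    // and requires all later boundaries to be adjusted.
    void erase(std::size_t i)
    {
        assert(i < count());
        std::size_t const len = limits[i + 1] - limits[i];
        concat.erase(limits[i], len);
        limits.erase(limits.begin() + static_cast<std::ptrdiff_t>(i + 1));
        for (std::size_t j = i + 1; j < limits.size(); ++j)
            limits[j] -= len;
    }
};
```

In this sketch, returning the concatenation costs nothing because concat already is the concatenation, while erase has to move every later sequence.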

One string can be an element of several Dependent StringSets. Typical tasks are to find a specific string in a string set or to test whether the strings in two string sets are the same. Such tasks need a mechanism to identify the strings in the string set, and, for performance reasons, this identification should not involve string comparisons. SeqAn solves this problem by introducing ids, which are unsigned int values by default.

The StringSet specializations are:

Specialization Owner<ConcatDirect>
The sequences are stored as parts of a long string. Since the sequences are already concatenated, concat just needs to return this string. The string set also stores lengths and starting positions of the strings. Inserting new strings into the set or removing strings from the set is more expensive than for the default Owner StringSet specialization, since this involves moving all subsequent sequences in memory.
Specialization Owner<JournaledSet>
The sequences are stored as Journaled Strings against a common reference sequence, which is also stored within the container. A new String added to the set must be joined to this set of sequences, all of which are based on the common reference sequence. This way, a large collection of similar sequences can be held efficiently in memory.
Specialization Dependent<Tight>

This specialization stores sequence pointers consecutively in an array. Another array stores an id value for each sequence. Accessing a sequence by its id therefore requires a search through the id array.

Warning

The Dependent-Tight StringSet is deprecated and will likely be removed within the SeqAn-2.x lifecycle.

Specialization Dependent<Generous>
The sequence pointers are stored in an array at the position of their ids. If a specific id is not present, the array stores a zero at this position. The advantage of this specialization is that accessing a sequence given its id is very fast. On the other hand, accessing a sequence given its position i can be expensive, since we have to find the i-th non-zero value in the array of sequence pointers. The space requirements of a string set object depend on the largest id rather than on the number of sequences stored in the set. This can be inefficient for string sets that store a small subset out of a large number of sequences.
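
The trade-off between the Tight and Generous layouts described in the list above can be sketched in plain C++. TightSketch and GenerousSketch, along with their member names, are hypothetical illustration types and not SeqAn's implementation. The sketch assumes the referenced strings outlive the sets and that each id is assigned at most once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration types, not SeqAn's implementation.
// Both only point at strings owned elsewhere; the caller keeps them alive.

// Tight layout: pointers stored consecutively, ids in a parallel array.
struct TightSketch
{
    std::vector<std::string const *> strings;
    std::vector<unsigned> ids;

    void assignById(std::string const & s, unsigned id)
    {
        strings.push_back(&s);
        ids.push_back(id);
    }

    // Position lookup is a direct array access.
    std::string const & byPosition(std::size_t pos) const
    {
        assert(pos < strings.size());
        return *strings[pos];
    }

    // Id lookup needs a linear search through the id array.
    std::string const * byId(unsigned id) const
    {
        for (std::size_t i = 0; i < ids.size(); ++i)
            if (ids[i] == id)
                return strings[i];
        return nullptr;
    }
};

// Generous layout: the pointer for id k sits at index k; gaps are null.
struct GenerousSketch
{
    std::vector<std::string const *> slots;

    void assignById(std::string const & s, unsigned id)
    {
        if (id >= slots.size())
            slots.resize(id + 1, nullptr);  // size grows with the largest id
        slots[id] = &s;
    }

    // Id lookup is a direct array access.
    std::string const * byId(unsigned id) const
    {
        return id < slots.size() ? slots[id] : nullptr;
    }

    // Position lookup has to skip empty slots to find the pos-th entry.
    std::string const * byPosition(std::size_t pos) const
    {
        for (std::string const * p : slots)
        {
            if (p == nullptr)
                continue;
            if (pos == 0)
                return p;
            --pos;
        }
        return nullptr;
    }
};
```

Assigning a single string with id 3 to a GenerousSketch already allocates four slots, which illustrates why the space depends on the largest id rather than on the number of stored sequences.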

Building String Sets

Use the function appendValue to append strings to string sets.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

int main()
{
    StringSet<DnaString> stringSet;
    DnaString str0 = "TATA";
    DnaString str1 = "CGCG";
    appendValue(stringSet, str0);
    appendValue(stringSet, str1);

Working with StringSets

This section will give you a short overview of the functionality of the class StringSet.

There are two ways to access the sequences in a string set: (1) the function value, like operator[], returns a reference to the sequence at a specific position within the string set, and (2) valueById accesses a sequence given its id. We can retrieve the id of a sequence in a StringSet with the function positionToId.

    // (1) Access by position
    std::cout << "Owner: " << '\n';
    std::cout << "Position 0: " << value(stringSet, 0) << '\n';

    // Get the corresponding ids
    unsigned id0 = positionToId(stringSet, 0);
    unsigned id1 = positionToId(stringSet, 1);

    // (2) Access by id
    std::cout << "Id 0:  " << valueById(stringSet, id0) << '\n';
    std::cout << "Id 1:  " << valueById(stringSet, id1) << '\n';

    return 0;
}
Owner: 
Position 0: TATA
Id 0:  TATA
Id 1:  CGCG

In the case of Owner StringSets, the id and the position of a string are always the same, but for Dependent StringSets, the ids can differ from the positions. For example, if a Dependent StringSet is used to represent subsets of strings that are stored in an Owner StringSet, one can use the position of each string within the Owner StringSet as its id. With the function assignValueById, we can add the string with a given id from the source string set to the target string set.

    // Let's create a Dependent string set to represent strings
    // that are stored in the Owner string set
    StringSet<DnaString, Dependent<Tight> > depSet;
    // We assign the first two strings of the owner string set to the dependent StringSet,
    // but in reverse order
    assignValueById(depSet, stringSet, id1);
    assignValueById(depSet, stringSet, id0);

    std::cout << "Dependent: " << '\n';
    // (1) Access by position
    std::cout << "Pos 0: " << value(depSet, 0) << '\n';
    // (2) Access by id
    std::cout << "Id 0:  " << valueById(depSet, id0) << '\n';
Dependent:
Pos 0: CGCG
Id 0:  TATA

With the function positionToId we can show that, in this case, the position and the id of a string are different.

    std::cout << "Position 0: Id " << positionToId(depSet, 0) << '\n';
    std::cout << "Position 1: Id " << positionToId(depSet, 1) << '\n';
Position 0: Id 1
Position 1: Id 0

Iterating over String Sets

As for other containers, SeqAn provides iterators for StringSets. The following example illustrates how to iterate over a StringSet.

    typedef Iterator<StringSet<DnaString> >::Type TStringSetIterator;
    for (TStringSetIterator it = begin(stringSet); it != end(stringSet); ++it)
    {
        std::cout << *it << '\n';
    }
TATA
CGCG

If we want to iterate over the contained Strings as well, as if the StringSet were one sequence, we can use the function concat to get the concatenation of all sequences. To do so, we first use the metafunction Concatenator to obtain the type of the concatenation. Then, we can simply build an iterator for this type and iterate over the concatenation of all strings.

    typedef Concatenator<StringSet<DnaString> >::Type TConcat;
    TConcat concatSet = concat(stringSet);

    Iterator<TConcat>::Type it = begin(concatSet);
    Iterator<TConcat>::Type itEnd = end(concatSet);
    for (; it != itEnd; goNext(it))
    {
        std::cout << getValue(it) << " ";
    }
    std::cout << '\n';
T A T A C G C G
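
A concatenator that is a virtual sequence can be pictured roughly as in the following plain C++ sketch. ConcatView and toString are hypothetical names used only for illustration and are not part of SeqAn. The sketch assumes the underlying strings outlive the view, and it maps each global position to a string and an offset instead of copying any characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration type, not part of SeqAn.
// It presents several strings as one sequence without copying them.
class ConcatView
{
public:
    explicit ConcatView(std::vector<std::string> const & strings)
        : strings_(&strings)
    {
        // limits_[i] is the global start offset of string i;
        // limits_.back() is the total length.
        limits_.push_back(0);
        for (std::string const & s : strings)
            limits_.push_back(limits_.back() + s.size());
    }

    std::size_t size() const { return limits_.back(); }

    // Map a global position to (string index, local offset).
    // A linear scan keeps the sketch short; a binary search would also work.
    char operator[](std::size_t pos) const
    {
        assert(pos < size());
        std::size_t i = 0;
        while (limits_[i + 1] <= pos)
            ++i;
        return (*strings_)[i][pos - limits_[i]];
    }

private:
    std::vector<std::string> const * strings_;
    std::vector<std::size_t> limits_;
};

// Materialize the view, only to make the result easy to inspect.
std::string toString(ConcatView const & view)
{
    std::string out;
    for (std::size_t i = 0; i < view.size(); ++i)
        out += view[i];
    return out;
}
```

For the two strings "TATA" and "CGCG", the view has length 8 and position 4 resolves to the first character of the second string.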

Assignment 1

Type
Review
Objective
Build a string set with the default specialization that contains the strings "AAA", "CCC", "GGG" and "TTT". Then print the length of the string set and use a simple for-loop to print all elements of the string set.
Solution

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

int main()
{
    // Build strings
    DnaString str0 = "AAA";
    DnaString str1 = "CCC";
    DnaString str2 = "GGG";
    DnaString str3 = "TTT";
    // Build string set and append strings
    StringSet<DnaString> stringSet;
    appendValue(stringSet, str0);
    appendValue(stringSet, str1);
    appendValue(stringSet, str2);
    appendValue(stringSet, str3);
    // Print the length of the string set
    std::cout << length(stringSet) << std::endl;
    // Print all elements
    for (unsigned i = 0; i < length(stringSet); ++i)
    {
        std::cout << stringSet[i] << std::endl;
    }
    return 0;
}

Assignment 2

Type
Application
Objective

In this task you will test whether a Dependent StringSet contains a string without comparing the actual sequences. Use the code frame given below and adjust it in the following way:

  1. Build an Owner StringSet to store the given strings.
  2. Get the corresponding ids for each position and store them.
  3. Build a Dependent StringSet and assign the strings of the owner string set at positions 0, 1 and 3 to it by their ids.
  4. Write a function isElement that takes a Dependent StringSet and an id as arguments and checks whether the string set contains a string with the given id.
  5. Check whether the string set contains the strings at positions 3 and 2 and print the result.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

int main()
{
    // Build strings
    DnaString str0 = "TATA";
    DnaString str1 = "CGCG";
    DnaString str2 = "TTAAGGCC";
    DnaString str3 = "ATGC";
    DnaString str4 = "AGTGTCA";

    // Your code

    return 0;
}
Hints
You can use the SeqAn functions positionToId and assignValueById.
Solution

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

// Check whether the string set contains the string with the given id,
// without comparing the actual sequences
template <typename TStringSet, typename TId>
bool isElement(TStringSet & stringSet1, TId & id)
{

    for (unsigned i = 0; i < length(stringSet1); ++i)
    {
        // Get the id of the element at position i
        if (positionToId(stringSet1, i) == id)
            return true;
    }
    return false;
}

int main()
{
    // Build strings
    DnaString str0 = "TATA";
    DnaString str1 = "CGCG";
    DnaString str2 = "TTAAGGCC";
    DnaString str3 = "ATGC";
    DnaString str4 = "AGTGTCA";
    // Build owner string set and append strings
    StringSet<DnaString> stringSetOw;
    appendValue(stringSetOw, str0);
    appendValue(stringSetOw, str1);
    appendValue(stringSetOw, str2);
    appendValue(stringSetOw, str3);
    appendValue(stringSetOw, str4);
    // Get corresponding ids for positions
    unsigned id0 = positionToId(stringSetOw, 0);
    unsigned id1 = positionToId(stringSetOw, 1);
    unsigned id2 = positionToId(stringSetOw, 2);
    unsigned id3 = positionToId(stringSetOw, 3);
    // Build dependent string set and assign strings by id
    StringSet<DnaString, Dependent<Generous> > stringSetDep;
    assignValueById(stringSetDep, stringSetOw, id0);
    assignValueById(stringSetDep, stringSetOw, id1);
    assignValueById(stringSetDep, stringSetOw, id3);
    // Call function to check if a string is contained and output result
    std::cout << "Does the string set contain the string with the id 'id3'? " <<  isElement(stringSetDep, id3) << std::endl;
    std::cout << "Does the string set contain the string with the id 'id2'? " <<  isElement(stringSetDep, id2) << std::endl;

    return 0;
}