Graph Algorithms

Learning Objective

This tutorial shows how to use some graph algorithms in SeqAn. In particular we will use the dijkstra algorithm to find shortest path and viterbi Algorithm to compute Viterbi path of a sequence.

Difficulty

Average

Duration

1 h

Prerequisites

Graphs

Overview

The following graph algorithms are currently available in SeqAn:

Elementary Graph Algorithms
Minimum Spanning Tree
Single-Source Shortest Path
All-Pairs Shortest Path
Maximum Flow
Transitive Closure
Bioinformatics Algorithms

The biological algorithms use heavily the alignment graph. Most of them are covered in the tutorial Alignment. All others use the appropriate standard graph. All algorithms require some kind of additional input, e.g., the Dijkstra algorithm requires a distance property map, alignment algorithms sequences and a score type and the network flow algorithm capacities on the edges.

Generally, only a single function call is sufficient to carry out all the calculations of a graph algorithm. In most cases you will have to define containers that store the algorithms results prior to the function call.

In our example, we apply the shortest-path algorithm of Dijkstra. It is implemented in the function dijkstra.

Let’s have a look at the input parameters. The first parameter is of course the graph, g. Second, you will have to specify a vertex descriptor. The function will compute the distance from this vertex to all vertices in the graph. The last input parameter is an edge map containing the distances between the vertices. One may think that the distance map is already contained in the graph. Indeed this is the case for our graph type but it is not in general. The cargo of a graph might as well be a string of characters or any other type. So, we first have to find out how to access our internal edge map. We do not need to copy the information to a new map. Instead we can define an object of the type InternalPropertyMap of our type TCargo. It will automatically find the edge labels in the graph when the function property or getProperty is called on it with the corresponding edge descriptor.

The output containers of the shortest-path algorithm are two property maps, predMap and distMap. The predMap is a vertex map that determines a shortest-paths-tree by mapping the predecessor to each vertex. Even though we are not interested in this information, we have to define it and pass it to the function. The distMap indicates the length of the shortest path to each vertex.

    typedef Size<TGraph>::Type TSize;
    InternalPropertyMap<TCargo> cargoMap;
    String<TVertexDescriptor> predMap;
    String<TSize> distMap;

Having defined all these property maps, we can then call the function dijkstra:

    dijkstra(predMap, distMap, g, vertHannover, cargoMap);

Finally, we have to output the result. Therefore, we define a second vertex iterator itV2 and access the distances just like the city names with the function property on the corresponding property map.

    TVertexIterator itV2(g);
    while (!atEnd(itV2))
    {
        std::cout << "Shortest path from " << property(cityNames, vertHannover) << " to " << property(cityNames, value(itV2)) << ": ";
        std::cout << property(distMap, value(itV2)) << std::endl;
        goNext(itV2);
    }

    return 0;
}

Assignment 1

Type

Application

Objective

Write a program which calculates the connected components of the graph defined in Assignment 2 of the Graphs tutorial and Output the connected component for each vertex.

Solution

SeqAn provides the function stronglyConnectedComponents to compute the connected components of a directed graph. The first parameter of this function is of course the graph. The second parameter is an output parameter. It is a vertex map that will map a component id to each vertex. Vertices that share the same id are in the same component.

    String<unsigned int> component;
    stronglyConnectedComponents(component, g);

Now, the only thing left to do is to walk through our graph and ouput each vertex and the corresponding component using the function getProperty. One way of doing so is to define a VertexIterator.

    std::cout << "Strongly Connected Components: " << std::endl;
    typedef Iterator<TGraph, VertexIterator>::Type TVertexIterator;
    TVertexIterator it(g);
    while (!atEnd(it))
    {
        std::cout << "Vertex " << getProperty(nameMap, getValue(it)) << ": ";
        std::cout << "Component = " << getProperty(component, getValue(it)) << std::endl;
        goNext(it);
    }
    return 0;
}

The output for the graph defined in the Assignment 1 looks as follows:

Strongly Connected Components:
Vertex a: Component = 3
Vertex b: Component = 3
Vertex c: Component = 2
Vertex d: Component = 2
Vertex e: Component = 3
Vertex f: Component = 1
Vertex g: Component = 1
Vertex h: Component = 0

The graph consists of four components. The first contains vertex a, b, and e, the second contains vertex c and d, the third contains vertex f and g and the last contains only vertex h.

Assignment 2

Type

Application

Objective

Extend the program from the Assignment 3 of the Graphs tutorial. Given the sequence s = "CTTCATGTGAAAGCAGACGTAAGTCA".

  1. calculate the Viterbi path of s and output the path as well as the probability of the path and

  2. calculate the probability that the HMM generated s with the forward and backward algorithm.

Solution

In Assignment 3 of the Graphs tutorial we defined an HMM with three states: exon, splice, and intron.

The Viterbi path is the sequence of states that is most likely to produce a given output. In SeqAn, it can be calculated with the function viterbiAlgorithm. The produced output for this assignment is the DNA sequence s.

The first parameter of the function viterbiAlgorithm is of course the HMM, and the second parameter is the sequence s. The third parameter is an output parameter that will be filled by the function. Since we want to compute a sequence of states, this third parameter is a String of VertexDescriptors which assigns a state to each character of the sequence s.

The return value of the function viterbiAlgorithm is the overall probability of this sequence of states, the Viterbi path.

The only thing left is to output the path. The path is usually longer than the given sequence. This is because the HMM may have silent states, e.g. the begin and end state. To check if a state is silent SeqAn provides the function isSilent.

    String<Dna> sequence = "CTTCATGTGAAAGCAGACGTAAGTCA";
    String<TVertexDescriptor> path;
    TProbability p = viterbiAlgorithm(path, hmm, sequence);
    std::cout << "Viterbi algorithm" << std::endl;
    std::cout << "Probability of best path: " << p << std::endl;
    std::cout << "Sequence: " << std::endl;
    for (TSize i = 0; i < length(sequence); ++i)
        std::cout << sequence[i] << ',';
    std::cout << std::endl;
    std::cout << "State path: " << std::endl;
    for (TSize i = 0; i < length(path); ++i)
    {
        std::cout << path[i];
        if (isSilent(hmm, path[i]))
            std::cout << " (Silent)";
        if (i < length(path) - 1)
            std::cout << ',';
    }
    std::cout << std::endl;

The output of the above piece of code is:

Viterbi algorithm
Probability of best path: 1.25465e-18
Sequence:
C,T,T,C,A,T,G,T,G,A,A,A,G,C,A,G,A,C,G,T,A,A,G,T,C,A,
State path:
0 (Silent),1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,3,4 (Silent)

It is even simpler to use the forward algorithm in SeqAn since it needs only the HMM and the sequence as parameters and returns a single probability. This is the probability of the HMM to generate the given sequence. The corresponding function is named forwardAlgorithm.

    std::cout << "Forward algorithm" << std::endl;
    p = forwardAlgorithm(hmm, sequence);
    std::cout << "Probability that the HMM generated the sequence: " << p << std::endl;

Analogously, the function backwardAlgorithm implements the backward algorithm in SeqAn.

    std::cout << "Backward algorithm" << std::endl;
    p = backwardAlgorithm(hmm, sequence);
    std::cout << "Probability that the HMM generated the sequence: " << p << std::endl;
    return 0;
}

The output of these two code fragments is:

Forward algorithm
Probability that the HMM generated the sequence: 2.71585e-18
Backward algorithm
Probability that the HMM generated the sequence: 2.71585e-18