Working With Custom Score Matrices

This How To describes how to create new scoring matrices for Amino Acids and DNA alphabets and how to load score matrices from files.

Creating A New Built-In Score Matrix

The following program demonstrates how to implement a new built-in score matrix.

#include <iostream>

#include <seqan/basic.h>
#include <seqan/stream.h>   // For printing strings.
#include <seqan/score.h>    // The module score.

using namespace seqan2;

We need to perform the necessary definitions for the matrix. This consists of two steps:

  1. Defining a tag struct.

  2. Specializing the class ScoringMatrixData_ with your tag.

Note how we use enum values to compute the matrix size which itself is retrieved from the ValueSize metafunction.

// Extend SeqAn by a user-define scoring matrix.
namespace seqan2 {

// We have to create a new specialization of the ScoringMatrix_ class
// for the DNA alphabet.  For this, we first create a new tag.
struct UserDefinedMatrix {};

// Then, we specialize the class ScoringMatrix_ for the Dna5 alphabet.
template <>
struct ScoringMatrixData_<int, Dna5, UserDefinedMatrix>
{
    enum
    {
        VALUE_SIZE = ValueSize<Dna5>::VALUE,
        TAB_SIZE = VALUE_SIZE * VALUE_SIZE
    };

    static inline int const * getData()
    {
        // The user defined data table.  In this case, we use the data from BLOSUM-30.
        static int const _data[TAB_SIZE] =
        {
            1, 0, 0, 0, 0,
            0, 1, 0, 0, 0,
            0, 0, 1, 0, 0,
            0, 0, 0, 1, 0,
            0, 0, 0, 0, 0
        };
        return _data;
    }

};
}  // namespace seqan2

Now we define a function showScoringMatrix for displaying a matrix.

// Print a scoring scheme matrix to stdout.
template <typename TScoreValue, typename TSequenceValue, typename TSpec>
void showScoringMatrix(Score<TScoreValue, ScoreMatrix<TSequenceValue, TSpec> > const & scoringScheme)
{
    // Print top row.
    for (unsigned i = 0; i < ValueSize<TSequenceValue>::VALUE; ++i)
        std::cout << "\t" << TSequenceValue(i);
    std::cout << std::endl;
    // Print each row.
    for (unsigned i = 0; i < ValueSize<TSequenceValue>::VALUE; ++i)
    {
        std::cout << TSequenceValue(i);
        for (unsigned j = 0; j < ValueSize<TSequenceValue>::VALUE; ++j)
        {
            std::cout << "\t" << score(scoringScheme, TSequenceValue(i), TSequenceValue(j));
        }
        std::cout << std::endl;
    }
}

Finally, the main function demostrates some of the things you can do with scores:

  • Construct an empty score matrix object (2.)

  • Fill the score matrix in a loop (3.1)

  • Fill the matrix with the user-defined matrix values (3.2)

  • Directly create a score matrix with the user-defined matrix values (4)

int main()
{
    // 1. Define type and constants.
    //
    // Define types for the score value and the scoring scheme.
    typedef int TValue;
    typedef Score<TValue, ScoreMatrix<Dna5, Default> > TScoringScheme;
    // Define our gap scores in some constants.
    int const gapOpenScore = -1;
    int const gapExtendScore = -1;

    // 2. Construct scoring scheme with default/empty matrix.
    //
    // Construct new scoring scheme, alternatively only give one score
    // that is used for both opening and extension.
    TScoringScheme scoringScheme(gapExtendScore, gapOpenScore);

    // 3. Fill the now-existing ScoreMatrix
    //
    // The scoring scheme now already has a matrix of the size
    // ValueSize<Dna5>::VALUE x ValueSize<Dna5>::VALUE which
    // we can now fill.

    // 3.1 We fill the scoring scheme with the product of the coordinates.
    std::cout << std::endl << "Coordinate Products" << std::endl;
    for (unsigned i = 0; i < ValueSize<Dna5>::VALUE; ++i)
    {
        for (unsigned j = 0; j < ValueSize<Dna5>::VALUE; ++j)
        {
            setScore(scoringScheme, Dna5(i), Dna5(j), i * j);
        }
    }
    showScoringMatrix(scoringScheme);

    // 3.2 Now, we fill it with the user defined matrix above.
    std::cout << "User defined matrix (also Dna5 scoring matrix)..." << std::endl;
    setDefaultScoreMatrix(scoringScheme, UserDefinedMatrix());
    showScoringMatrix(scoringScheme);

    // 4. Show our user-defined Dna5 scoring matrix.
    std::cout << "User DNA scoring scheme..." << std::endl;
    Score<TValue, ScoreMatrix<Dna5, UserDefinedMatrix> > userScoringSchemeDna;
    showScoringMatrix(userScoringSchemeDna);

    return 0;
}

Here is the output of the program:

Coordinate Products
	A	C	G	T	N
A	0	0	0	0	0
C	0	1	2	3	4
G	0	2	4	6	8
T	0	3	6	9	12
N	0	4	8	12	16
User defined matrix (also Dna5 scoring matrix)...
	A	C	G	T	N
A	1	0	0	0	0
C	0	1	0	0	0
G	0	0	1	0	0
T	0	0	0	1	0
N	0	0	0	0	0
User DNA scoring scheme...
	A	C	G	T	N
A	1	0	0	0	0
C	0	1	0	0	0
G	0	0	1	0	0
T	0	0	0	1	0
N	0	0	0	0	0

Loading Score Matrices From File

This small demo program shows how to load a score matrix from a file. Examples for a score file are demos/howto/scores/dna_example.txt for DNA alphabets and tests/sPAM250 for amino acids.

Include the necessary headers.

#include <iostream>

#include <seqan/basic.h>
#include <seqan/stream.h>   // For printing strings.
#include <seqan/score.h>  // The module score.

using namespace seqan2;

We define a function that can show a scoring matrix.

// Print a scoring scheme matrix to stdout.
template <typename TScoreValue, typename TSequenceValue, typename TSpec>
void showScoringMatrix(Score<TScoreValue, ScoreMatrix<TSequenceValue, TSpec> > const & scoringScheme)
{
    // Print top row.
    for (unsigned i = 0; i < ValueSize<TSequenceValue>::VALUE; ++i)
        std::cout << "\t" << TSequenceValue(i);
    std::cout << std::endl;
    // Print each row.
    for (unsigned i = 0; i < ValueSize<TSequenceValue>::VALUE; ++i)
    {
        std::cout << TSequenceValue(i);
        for (unsigned j = 0; j < ValueSize<TSequenceValue>::VALUE; ++j)
        {
            std::cout << "\t" << score(scoringScheme, TSequenceValue(i), TSequenceValue(j));
        }
        std::cout << std::endl;
    }
}

Finally, the main program loads the scoring matrix and then shows it.

int main(int, char **)
{
    typedef int TScoreValue;

    Score<TScoreValue, ScoreMatrix<Dna5, Default> > scoreMatrix;
    loadScoreMatrix(scoreMatrix, toCString(getAbsolutePath("demos/howto/scores/dna_example.txt")));
    showScoringMatrix(scoreMatrix);

    return 0;
}

Here’s the program output.

	A	C	G	T	N
A	1	-1	-1	-1	-1
C	-1	1	-1	-1	-1
G	-1	-1	1	-1	-1
T	-1	-1	-1	1	-1
N	-1	-1	-1	-1	1