Q-gram Index

Learning Objective
You will know the features of the q-gram Index, how it can be used for searching and how to access the different fibres.
Difficulty
Average
Duration
1 h
Prerequisites
Sequences, Iterators

The Q-gram Index

A q-gram index can be used to efficiently retrieve all occurrences of a certain q-gram in the text. It consists of various tables, called fibres (see Accessing Index Fibres), to retrieve q-gram positions, q-gram counts, etc. However, it has no support for suffix tree iterators. A q-gram index must be specialized with a Shape type. A Shape defines q, the number of characters in a q-gram and possibly gaps between these characters. There are different specializations of Shape available:

Specialization Modifiable Number of Gaps
UngappedShape - 0
SimpleShape + 0
OneGappedShape + 0/1
GappedShape - any
GenericShape + any
  • - fixed at compile time, + can be changed at runtime

Each shape evaluates a gapped or ungapped sequence of q characters to a hash value by the Functions hash, hashNext, etc. For example, the shape 1101 represents a 3-gram with one gap of length 1. This shape overlayed with the Dna text "GATTACA" at the third position corresponds to "TT-C". The function hash converts this 3-gram into \(61 = (\mathbf{3} \cdot 4 + \mathbf{3}) \cdot 4 + 1\). 4 is the alphabet size in this example (see ValueSize).

With hash and hashNext, we can compute the hash values of arbitrary / adjacent q-grams and a loop that outputs the hash values of all overlapping ungapped 3-grams could look as follows:

    DnaString text = "AAAACACAGTTTGA";
    Shape<Dna, UngappedShape<3> > myShape;

    std::cout << hash(myShape, begin(text)) << '\t';
    for (unsigned i = 1; i < length(text) - length(myShape) + 1; ++i)
        std::cout << hashNext(myShape, begin(text) + i) << '\t';

Note that the shape not only stores the length and gaps of a q-gram shape but also stores the hash value returned by the last hash/hashNext call. This hash value can be retrieved by calling value on the shape. However, one drawback of the example loop above is that the first hash value must be computed with hash while the hash values of the following overlapping q-grams can more efficiently be computed by hashNext. This complicates the structure of algorithms that need to iterate all hash values, as they have to handle this first hash differently. As a remedy, the hashInit function can be used first and then hashNext on the first and all following text positions in the same way:

    hashInit(myShape, begin(text));
    for (unsigned i = 0; i < length(text) - length(myShape) + 1; ++i)
        std::cout << hashNext(myShape, begin(text) + i) << '\t';

The q-gram index offers different functions to search or count occurrences of q-grams in an indexed text, see getOccurrences, countOccurrences. A q-gram index over a StringSet stores occurrence positions in the same way as the ESA index and in the same fibre (FibreSA). If only the number of q-grams per sequence are needed the QGramCounts and QGramCountsDir fibres can be used. They store pairs (seqNo, count), count>0, for each q-gram that occurs counts times in sequence number seqNo.

To efficiently retrieve all occurrence positions or all pairs (seqNo, count) for a given q-gram, these positions or pairs are stored in contiguous blocks (in QGramSA, QGramCounts fibres), called buckets. The begin position of bucket i is stored in directory fibres (QGramDir, QGramCountsDir) at position i, the end position is the begin positions of the bucket i+1. The default implementation of the IndexQGram index maps q-gram hash values 1-to-1 to bucket numbers. For large q or large alphabets the Open Adressing QGram Index can be more appropriate as its directories are additionally bound by the text length. This is realized by a non-trivial mapping from q-gram hashes to bucket numbers that requires an additional fibre (QGramBucketMap).

For more details on q-gram index fibres see Accessing Index Fibres or QGram Index Fibres.

Example

We want to construct the q-gram index of the string "CATGATTACATA" and output the occurrences of the ungapped 3-gram "CAT". As 3 is fixed at compile-time and the shape has no gaps we can use a UngappedShape which is the first template argument of IndexQGram, the second template argument of Index. Next we create the string "CATGATTACATA" and specialize the first index template argument with the type of this string. The string can be given to the index constructor.

int main ()
{
	typedef Index<DnaString, IndexQGram< UngappedShape<3> > > TIndex;
	TIndex index("CATGATTACATA");

To get all occurrences of a q-gram, we first have to hash it with a shape of the same type as the index shape (we can even use the index shape returned by indexShape). The hash value returned by hash or hashNext is also stored in the shape and is used by the function getOccurrences to retrieve all occurrences of our 3-gram.

	hash(indexShape(index), "CAT");
	for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i)
		std::cout << getOccurrences(index, indexShape(index))[i] << std::endl;

	return 0;
}

Program output:

0
8

Assignment 1

Type
Review
Objective
Write a program that outputs all occurrences of the gapped q-gram “AT-A” in “CATGATTACATA”.
Solution

Before we can create a DnaString index of “CATGATTACATA”, we have to choose an appropriate Shape. Because our shape 1101 is known at compile-time and contains only one gap we could choose OneGappedShape, GappedShape, or GenericShape (see the commented-out code). Although the GenericShape could be used for every possible shape, it is a good idea to choose a Shape with restrictions as its hash functions are more efficient in general.

int main ()
{
	Index<DnaString, IndexQGram<OneGappedShape> > index("CATGATTACATA");
	stringToShape(indexShape(index), "1101");

Please note that the Shape object that corresponds to the IndexQGram index is empty initially and has to be set by stringToShape or resize. This initialization is not necessary for Shape that are defined at compile-time, i.e. UngappedShape and GappedShape. To search for “AT-A” we first have to hash it with the index shape or any other Shape with the same bitmap. The we can use getOccurrences to output all matches.

	hash(indexShape(index), "ATCA");
	for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i)
		std::cout << getOccurrences(index, indexShape(index))[i] << std::endl;

	return 0;
}

Tip

Instead of length(getOccurrences(...)) we could have used countOccurrences. But beware that countOccurrences requires only the QGram_Dir fibre, whereas getOccurrences requires both QGram_Dir and QGram_SA, see Accessing Index Fibres. Because QGram_SA can be much more efficiently constructed during the construction of QGram_Dir, QGram_Dir would be constructed twice.

Program output:

1
4

Assignment 2

Type
Review
Objective
Create and output a matrix M where M(i,j) is the number of common ungapped 5-grams between sequence i and sequence j for 3 random Dna sequences, each not longer than 200 characters. Optional: Run the matrix calculation twice, once for an IndexQGram and once for an Open Adressing QGram Index and output the directory sizes (QGram_Dir, QGram_CountsDir fibre).
Hint
A common q-gram that occurs \(a\) times in one and \(b\) times in the other sequence counts for \(\min(a,b)\).
Solution

For generating random numbers we use the MersenneTwisterRng which is a specialization of the random number generator class Rng. The random numbers returned by pickRandomNumber are arbitrary unsigned int values which we downscale to values between 0 and 3 and convert into Dna characters. The 3 generated strings are of random length and appended to a StringSet. The main algorithmus is encapsulated in a template function qgramCounting to easily switch between the two IndexQGram specializations.

int main ()
{
	//  for the sake of reproducibility
	Rng<MersenneTwister> rng;

	// create StringSet of 3 random sequences
	StringSet<DnaString> stringSet;
	reserve(stringSet, 3);
	for (int seqNo = 0; seqNo < 3; ++seqNo)
	{
		DnaString tmp;
		int len = pickRandomNumber(rng) % 100 + 10;
		for (int i = 0; i < len; ++i)
			appendValue(tmp, Dna(pickRandomNumber(rng) % 4));
		appendValue(stringSet, tmp);
		std::cout << ">Seq" << seqNo << std::endl << tmp << std::endl;
	}

	qgramCounting(stringSet, IndexQGram<UngappedShape<5> >());
	qgramCounting(stringSet, IndexQGram<UngappedShape<5>, OpenAddressing>());
	return 0;
}

The main function expects the StringSet and the Index specialization as a tag. First, we define lots of types we need to iterate and access the fibres directly. We then notify the index about the fibres we require. For storing the common q-grams we use a 2-dimensional Matrix object whose lengths have to be set with setLength for each dimension. The matrix is initialized with zeros by resize.

template <typename TStringSet, typename TIndexSpec>
void qgramCounting(TStringSet &set, TIndexSpec)
{
	typedef Index<TStringSet, TIndexSpec> TIndex;
	typedef typename Fibre<TIndex, QGramCounts>::Type TCounts;
	typedef typename Fibre<TIndex, QGramCountsDir>::Type TCountsDir;
	typedef typename Value<TCountsDir>::Type TDirValue;
	typedef typename Iterator<TCounts, Standard>::Type TIterCounts;
	typedef typename Iterator<TCountsDir, Standard>::Type TIterCountsDir;

	TIndex index(set);
	indexRequire(index, QGramCounts());

	// initialize distance matrix
	int seqNum = countSequences(index);
	Matrix<int, 2> distMat;
	setLength(distMat, 0, seqNum);
	setLength(distMat, 1, seqNum);
	resize(distMat, 0);

	std::cout << std::endl << "Length of the CountsDir fibre: " << length(indexCountsDir(index)) << std::endl;
	TIterCountsDir itCountsDir = begin(indexCountsDir(index), Standard());
	TIterCountsDir itCountsDirEnd = end(indexCountsDir(index), Standard());
	TIterCounts itCountsBegin = begin(indexCounts(index), Standard());

The main part of the function iterates over the CountsDir fibre. Each entry in this directory represents a q-gram bucket, a contiguous interval in the Counts fibre storing for every sequence the q-gram occurs in the number of occurrences in pairs (seqNo,count). The interval begin of each bucket is stored in the directory and the interval end is the begin of the next bucket. So the inner loops iterate over all non-empty buckets and two pairs (seqNo1,count1) and (seqNo2,count2) indicate that seqNo1 and seqNo2 have a common q-gram. At the end the matrix can simply be output by shifting it to the cout stream.

	// for each bucket count common q-grams for each sequence pair
	TDirValue bucketBegin = *itCountsDir;
	for(++itCountsDir; itCountsDir != itCountsDirEnd; ++itCountsDir)
	{
		TDirValue bucketEnd = *itCountsDir;

		// q-gram must occur in at least 2 different sequences
		if (bucketBegin != bucketEnd)
		{
			TIterCounts itA = itCountsBegin + bucketBegin;
			TIterCounts itEnd = itCountsBegin + bucketEnd;
			for(; itA != itEnd; ++itA)
				for(TIterCounts itB = itA; itB != itEnd; ++itB)
					distMat((*itA).i1, (*itB).i1) += _min((*itA).i2, (*itB).i2);
		}
		bucketBegin = bucketEnd;
	}

	std::cout << std::endl << "Common 5-mers for Seq_i, Seq_j" << std::endl;
	std::cout << distMat;
}

Please note that the open addressing q-gram index directories are smaller than the IndexQGram index directories.

Program output:

>Seq0
TCATTTTCTCGATGAAAGCGTTGACCCCACATATCGTTAGTACTCTTGTACCCT
>Seq1
TGATTGTGTAGAAACCGAACTACGGTACCTCCTGTTGGTAGTCACGATAGATTATAAAAGTATGTTCCCACCCTATCGACGAGACTGGCA
>Seq2
CCTAGGTGTTTGCGGTGTTGGTACGTGCG

Length of the CountsDir fibre: 1025

Common 5-mers for Seq_i, Seq_j
50   4       0
0    86      5
0    0       25

Length of the CountsDir fibre: 259

Common 5-mers for Seq_i, Seq_j
50   4       0
0    86      5
0    0       25
comments powered by Disqus