Qgram Index¶
 Learning Objective
You will know the features of the qgram Index, how it can be used for searching and how to access the different fibres.
 Difficulty
Average
 Duration
1 h
 Prerequisites
The Qgram Index¶
A qgram index can be used to efficiently retrieve all occurrences of a certain qgram in the text. It consists of various tables, called fibres (see Accessing Index Fibres), to retrieve qgram positions, qgram counts, etc. However, it has no support for suffix tree iterators. A qgram index must be specialized with a Shape type. A Shape defines q, the number of characters in a qgram and possibly gaps between these characters. There are different specializations of Shape available:
Specialization 
Modifiable 
Number of Gaps 

 
0 

+ 
0 

+ 
0/1 

 
any 

+ 
any 
 fixed at compile time, + can be changed at runtime
Each shape evaluates a gapped or ungapped sequence of q characters to a hash value by the Functions hash, hashNext, etc.
For example, the shape 1101
represents a 3gram with one gap of length 1.
This shape overlayed with the Dna text "GATTACA"
at the third position corresponds to "TTC"
.
The function hash converts this 3gram into \(61 = (\mathbf{3} \cdot 4 + \mathbf{3}) \cdot 4 + 1\).
4 is the alphabet size in this example (see ValueSize).
With hash and hashNext, we can compute the hash values of arbitrary / adjacent qgrams and a loop that outputs the hash values of all overlapping ungapped 3grams could look as follows:
DnaString text = "AAAACACAGTTTGA";
Shape<Dna, UngappedShape<3> > myShape;
std::cout << hash(myShape, begin(text)) << '\t';
for (unsigned i = 1; i < length(text)  length(myShape) + 1; ++i)
std::cout << hashNext(myShape, begin(text) + i) << '\t';
Note that the shape not only stores the length and gaps of a qgram shape but also stores the hash value returned by the last hash/hashNext call. This hash value can be retrieved by calling value on the shape. However, one drawback of the example loop above is that the first hash value must be computed with hash while the hash values of the following overlapping qgrams can more efficiently be computed by hashNext. This complicates the structure of algorithms that need to iterate all hash values, as they have to handle this first hash differently. As a remedy, the hashInit function can be used first and then hashNext on the first and all following text positions in the same way:
hashInit(myShape, begin(text));
for (unsigned i = 0; i < length(text)  length(myShape) + 1; ++i)
std::cout << hashNext(myShape, begin(text) + i) << '\t';
The qgram index offers different functions to search or count occurrences of qgrams in an indexed text, see getOccurrences, countOccurrences.
A qgram index over a StringSet stores occurrence positions in the same way and in the same fibre (FibreSA) as the ESA index.
If only the number of qgrams per sequence are needed the QGramCounts and QGramCountsDir fibres can be used.
They store pairs (seqNo, count)
, count
>0, for each qgram that occurs counts
times in sequence number seqNo
.
To efficiently retrieve all occurrence positions or all pairs (seqNo, count)
for a given qgram, these positions or pairs are stored in contiguous blocks (in QGramSA, QGramCounts fibres), called buckets.
The begin position of bucket i is stored in directory fibres (QGramDir, QGramCountsDir) at position i, the end position is the begin positions of the bucket i+1.
The default implementation of the IndexQGram index maps qgram hash values 1to1 to bucket numbers.
For large q or large alphabets the Open Addressing QGram Index can be more appropriate as its directories are additionally bound by the text length.
This is realized by a nontrivial mapping from qgram hashes to bucket numbers that requires an additional fibre (QGramBucketMap).
For more details on qgram index fibres see Accessing Index Fibres or QGram Index Fibres.
Example¶
We want to construct the qgram index of the string "CATGATTACATA"
and output the occurrences of the ungapped 3gram "CAT"
.
As 3 is fixed at compiletime and the shape has no gaps we can use an UngappedShape which is the first template argument of IndexQGram, the second template argument of Index.
Next we create the string "CATGATTACATA"
and specialize the first index template argument with the type of this string.
The string can be given to the index constructor.
int main()
{
typedef Index<DnaString, IndexQGram<UngappedShape<3> > > TIndex;
TIndex index("CATGATTACATA");
To get all occurrences of a qgram, we first have to hash it with a shape of the same type as the index shape (we can even use the index shape returned by indexShape). The hash value returned by hash or hashNext is also stored in the shape and is used by the function getOccurrences to retrieve all occurrences of our 3gram.
hash(indexShape(index), "CAT");
for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i)
std::cout << getOccurrences(index, indexShape(index))[i] << std::endl;
return 0;
}
Program output:
0
8
Assignment 1¶
 Type
Review
 Objective
Write a program that outputs all occurrences of the gapped qgram “ATA” in “CATGATTACATA”.
 Solution
Before we can create a DnaString index of “CATGATTACATA”, we have to choose an appropriate Shape. Because our shape
1101
is known at compiletime and contains only one gap we could choose OneGappedShape, GappedShape, or GenericShape (see the commentedout code). Although the GenericShape could be used for every possible shape, it is a good idea to choose a Shape with restrictions as its hash functions are more efficient in general.int main() { Index<DnaString, IndexQGram<OneGappedShape> > index("CATGATTACATA"); stringToShape(indexShape(index), "1101");
Please note that the Shape object that corresponds to the IndexQGram index is empty initially and has to be set by stringToShape or resize. This initialization is not necessary for Shape that are defined at compiletime, i.e. UngappedShape and GappedShape. To search for “ATA” we first have to hash it with the index shape or any other Shape with the same bitmap. The we can use getOccurrences to output all matches.
hash(indexShape(index), "ATCA"); for (unsigned i = 0; i < length(getOccurrences(index, indexShape(index))); ++i) std::cout << getOccurrences(index, indexShape(index))[i] << std::endl; return 0; }
Tip
Instead of
length(getOccurrences(...))
we could have used countOccurrences. But beware that countOccurrences requires only theQGram_Dir
fibre, whereas getOccurrences requires bothQGram_Dir
andQGram_SA
, see Accessing Index Fibres. BecauseQGram_SA
can be much more efficiently constructed during the construction ofQGram_Dir
,QGram_Dir
would be constructed twice.Program output:
1 4
Assignment 2¶
 Type
Review
 Objective
Create and output a matrix M where M(i,j) is the number of common ungapped 5grams between sequence i and sequence j for 3 random Dna sequences, each not longer than 200 characters. Optional: Run the matrix calculation twice, once for an IndexQGram and once for an Open Addressing QGram Index and output the directory sizes (QGram_Dir, QGram_CountsDir fibre).
 Hint
A common qgram that occurs \(a\) times in one and \(b\) times in the other sequence counts for \(\min(a,b)\).
 Solution
For generating random numbers we use the std::mt19937. The random numbers returned by the random number engine are arbitrary
unsigned int
values which we downscale to values between 0 and 3 and convert into Dna characters. The 3 generated strings are of random length and appended to a StringSet. The main algorithm is encapsulated in a template functionqgramCounting
to easily switch between the two IndexQGram specializations.int main() { // for the sake of reproducibility std::mt19937 rng; // create StringSet of 3 random sequences StringSet<DnaString> stringSet; reserve(stringSet, 3); for (int seqNo = 0; seqNo < 3; ++seqNo) { DnaString tmp; int len = rng() % 100 + 10; for (int i = 0; i < len; ++i) appendValue(tmp, Dna(rng() % 4)); appendValue(stringSet, tmp); std::cout << ">Seq" << seqNo << std::endl << tmp << std::endl; } qgramCounting(stringSet, IndexQGram<UngappedShape<5> >()); qgramCounting(stringSet, IndexQGram<UngappedShape<5>, OpenAddressing>()); return 0; }
The main function expects the StringSet and the Index specialization as a tag. First, we define lots of types we need to iterate and access the fibres directly. We then notify the index about the fibres we require. For storing the common qgrams we use a 2dimensional Matrix object whose lengths have to be set with
setLength
for each dimension. The matrix is initialized with zeros by resize.template <typename TStringSet, typename TIndexSpec> void qgramCounting(TStringSet & set, TIndexSpec) { typedef Index<TStringSet, TIndexSpec> TIndex; typedef typename Fibre<TIndex, QGramCounts>::Type TCounts; typedef typename Fibre<TIndex, QGramCountsDir>::Type TCountsDir; typedef typename Value<TCountsDir>::Type TDirValue; typedef typename Iterator<TCounts, Standard>::Type TIterCounts; typedef typename Iterator<TCountsDir, Standard>::Type TIterCountsDir; TIndex index(set); indexRequire(index, QGramCounts()); // initialize distance matrix int seqNum = countSequences(index); Matrix<int, 2> distMat; setLength(distMat, 0, seqNum); setLength(distMat, 1, seqNum); resize(distMat, 0); std::cout << std::endl << "Length of the CountsDir fibre: " << length(indexCountsDir(index)) << std::endl; TIterCountsDir itCountsDir = begin(indexCountsDir(index), Standard()); TIterCountsDir itCountsDirEnd = end(indexCountsDir(index), Standard()); TIterCounts itCountsBegin = begin(indexCounts(index), Standard());
The main part of the function iterates over the CountsDir fibre. Each entry in this directory represents a qgram bucket, a contiguous interval in the Counts fibre storing for every sequence the qgram occurs in the number of occurrences in pairs (seqNo,count). The interval begin of each bucket is stored in the directory and the interval end is the begin of the next bucket. So the inner loops iterate over all nonempty buckets and two pairs (seqNo1,count1) and (seqNo2,count2) indicate that seqNo1 and seqNo2 have a common qgram. At the end the matrix can simply be output by shifting it to the
cout
stream.// for each bucket count common qgrams for each sequence pair TDirValue bucketBegin = *itCountsDir; for (++itCountsDir; itCountsDir != itCountsDirEnd; ++itCountsDir) { TDirValue bucketEnd = *itCountsDir; // qgram must occur in at least 2 different sequences if (bucketBegin != bucketEnd) { TIterCounts itA = itCountsBegin + bucketBegin; TIterCounts itEnd = itCountsBegin + bucketEnd; for (; itA != itEnd; ++itA) for (TIterCounts itB = itA; itB != itEnd; ++itB) distMat((*itA).i1, (*itB).i1) += _min((*itA).i2, (*itB).i2); } bucketBegin = bucketEnd; } std::cout << std::endl << "Common 5mers for Seq_i, Seq_j" << std::endl; std::cout << distMat; }
Please note that the open addressing qgram index directories are smaller than the IndexQGram index directories.
Program output:
>Seq0 GGCATCCGTTCAGTATACGCCT >Seq1 GGACATACCGGTCCTAAGTACACGTGGCAGGGATGGTCGAAGAACCCGC >Seq2 TGCAAAGTTAGCGTACTAGTTAGTAACCGTGATACTAGCAAAATGAGTCTTCTCGTCGAAGTGGACGGGATGTGCTCACGGCCTTTTTT Length of the CountsDir fibre: 1025 Common 5mers for Seq_i, Seq_j 18 0 0 0 45 5 0 0 85 Length of the CountsDir fibre: 257 Common 5mers for Seq_i, Seq_j 18 0 0 0 45 5 0 0 85