GFF and GTF I/O

Learning Objective
In this tutorial, you will learn how to read and write GFF and GTF files.
Difficulty
Average
Duration
45 min
Prerequisites
Sequences, File I/O Overview, GFF Format Specification

This tutorial shows how to read and write GFF and GTF files using the GffFileIn and GffFileOut classes. It starts out with a quick reminder on the structure of GFF and GTF files and will then continue with how to read and write GFF and GTF files.

The GFF and GTF formats are used for annotating genomic intervals (an interval with begin/end position on a contig/chromosome). GFF exists in versions 2 and 3 and GTF is sometimes called “GFF 2.5”. There are specifications for GFF 2, GFF 3, and GTF available elsewhere. GFF and GTF are TSV-based formats and in general have the same structure. The main difference is the underlying system/ontology for the annotation but also smaller differences in the format.

In this tutorial, we will focus on the format GFF 3 since it is the most current one with most complete tool support. The information of this tutorial can easily be translated to the other two formats.

The SeqAn module gff_io allows the reading and writing of the GFF and GTF formats.

Tip

Format Version Support in SeqAn

GffFileIn allows to read GFF files in version 2 and 3 and GTF files. For writing, GffFileOut supports only GFF 3 and GTF.

GFF Format

The following is an example of a GFF 3 file:

ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00002;Parent=gene00001
ctg123	.	mRNA	1300	9000	.	+	.	ID=mRNA00003;Parent=gene00001
ctg123	.	exon	1300	1500	.	+	.	Parent=mRNA00003
ctg123	.	exon	1050	1500	.	+	.	Parent=mRNA00001,mRNA00002
ctg123	.	exon	3000	3902	.	+	.	Parent=mRNA00001,mRNA00003
ctg123	.	exon	5000	5500	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	exon	7000	9000	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	3301	3902	.	+	0	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	3391	3902	.	+	0	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00004;Parent=mRNA00003

The meaning of the columns are as follows:

seq id (1)
Name of the reference sequence.
source (2)
Free text field describing the source of the annotation, such as a software (e.g. “Genescan”) or a a database (e.g. “Genebank”), “.” for none.
type (3)
The type of the annotation.
start (4)
The 1-based begin position of the annotation.
end (5)
The 1-based end position of the annotation.
score (6)
The score of the annotation, “.” for none.
strand (7)
The strand of the annotation, “+” and “-” for forward and reverse strand, “.” for features that are not stranded.
phase (8)
Shift of the feature regarding to the reading frame, one of “0”, “1”, “2”, and “.” for missing/dont-care.
attributes (9)
A list of key/value attributes. For GFF 3, this is a list of key=value pairs, separated by semicolons (e.g. ID=cds00003;Parent=mRNA00003). For GTF and GFF 2, this is a list of tuples, separated by semicolon. The first entry gives the key, the following entries are values. Strings are generally enclosed in quotes (e.g. Target "HBA_HUMAN" 11 55 ; E_value 0.0003)

Tip

1-based and 0-based positions.

There are two common ways of specifying intervals.

  1. Start counting positions at 1 and give intervals by the first and last position that are part of the interval (closed intervals). For example, the interval [1000; 2000] starts at character 1,000 and ends at character 2000 and includes it. This way is natural to non-programmers and used when giving coordinates in GFF files or genome browsers such as UCSC Genome Browser and IGV.
  2. Start counting positions at 0 and give intervals by the first position that is part of the interval and giving the position behind the last position that is part of the interval. The interval from above would be [999; 2000) in this case.

In text representations, such as GFF and GTF, 1-based closed intervals are used whereas in the internal binary data structures, SeqAn uses 0-based half-open intervals.

A First Working Example

The following example shows an example of a program that reads the file example.gff and prints its contents back to the user on standard output.

#include <seqan/basic.h>
#include <seqan/gff_io.h>

using namespace seqan;

int main()
{
    // Get path to example file.
    CharString file = getAbsolutePath("demos/tutorial/gff_and_gtf_io/example.gff");

    // Open input file.
    GffFileIn gffIn(toCString(file));

    // Attach to standard output.
    GffFileOut gffOut(std::cout, Gff());

    // Copy the file record by record.
    GffRecord record;
    while (!atEnd(gffIn))
    {
        readRecord(record, gffIn);
        writeRecord(gffOut, record);
    }

    return 0;
}

The program first opens a GffFileIn for reading and a GffFileOut for writing. The GFF records are read into GffRecord objects which we will focus on below.

Assignment 1

Type
Reproduction
Objective
Create a file with the sample GFF content from above and adjust the path "example.gff" to the path to your GFF file (e.g. "/path/to/my_example.gff").
Solution
#include <seqan/gff_io.h>

using namespace seqan;

int main()
{
    // Get path to example file.
    CharString file = getAbsolutePath("demos/tutorial/gff_and_gtf_io/example.gff");

    // Open input file.
    GffFileIn gffIn;
    if (!open(gffIn, toCString(file)))
    {
        std::cerr << "ERROR: Could not open example.gff" << std::endl;
        return 1;
    }

    // Attach to standard output.
    GffFileOut gffOut(std::cout, Gff());

    // Copy the file record by record.
    GffRecord record;

    try
    {
        while (!atEnd(gffIn))
        {
            readRecord(record, gffIn);
            writeRecord(gffOut, record);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00002;Parent=gene00001
ctg123	.	mRNA	1300	9000	.	+	.	ID=mRNA00003;Parent=gene00001
ctg123	.	exon	1300	1500	.	+	.	Parent=mRNA00003
ctg123	.	exon	1050	1500	.	+	.	Parent=mRNA00001,mRNA00002
ctg123	.	exon	3000	3902	.	+	.	Parent=mRNA00001,mRNA00003
ctg123	.	exon	5000	5500	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	exon	7000	9000	.	+	.	Parent=mRNA00001,mRNA00002,mRNA00003
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00002;Parent=mRNA00002
ctg123	.	CDS	3301	3902	.	+	0	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00003;Parent=mRNA00003
ctg123	.	CDS	3391	3902	.	+	0	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	5000	5500	.	+	1	ID=cds00004;Parent=mRNA00003
ctg123	.	CDS	7000	7600	.	+	1	ID=cds00004;Parent=mRNA00003

Accessing the Records

The class GffRecord stores one record in a Gff file.

using namespace seqan;

class GffRecord
{
public:
    CharString ref;      // reference name
    int32_t rID;         // index in sequenceNames of GffFile
    CharString source;   // source free text descriptor
    CharString type;     // type of the feature
    int32_t beginPos;    // begin position of the interval
    int32_t endPos;      // end position of the interval
    float score;         // score of the annotation
    char strand;         // the strand
    char phase;          // one of '0', '1', '2', and '.'

    // The key/value list, split into a list of keys and values.
    StringSet<CharString> tagNames;
    StringSet<CharString> tagValues;

    // Returns float value for an invalid score.
    static float INVALID_SCORE();

    // Constants for marking reference id and position as invalid.
    static const int32_t INVALID_IDX = -1;
    static const int32_t INVALID_POS = -1;
};

The static members INVALID_POS and INVALID_REFID store sentinel values for marking positions and reference sequence ids as invalid. The static funtion INVALID_SCORE() returns the IEEE float “NaN” value.

Assignment 2

Counting Records

Type
Review
Objective
Change the result of Assignment 1 by counting the number of variants for each chromosome/contig instead of writing out the records.
Solution
#include <seqan/gff_io.h>
#include <seqan/misc/name_store_cache.h>

using namespace seqan;

int main()
{
    // Get path to example file.
    CharString file = getAbsolutePath("demos/tutorial/gff_and_gtf_io/example.gff");

    // Open input file.
    GffFileIn gffIn;
    if (!open(gffIn, toCString(file)))
    {
        std::cerr << "ERROR: Could not open example.gff" << std::endl;
        return 1;
    }

    // Array of counters and sequence names.
    String<unsigned> counters;
    StringSet<CharString> seqNames;
    NameStoreCache<StringSet<CharString> > cache(seqNames);

    // Read the file record by record.
    GffRecord record;

    try
    {
        while (!atEnd(gffIn))
        {
            readRecord(record, gffIn);
            unsigned rID = nameToId(cache, record.ref);

            // Resize counters if necessary and increment counter.
            assignValueById(counters, rID, getValueById(counters, rID) + 1);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    // Print result.
    std::cout << "RECORDS ON CONTIGS\n";
    for (unsigned i = 0; i < length(seqNames); ++i)
        if (counters[i] != 0u)
            std::cout << seqNames[i] << '\t' << counters[i] << '\n';

    return 0;
}

The output is

RECORDS ON CONTIGS
ctg123	23

Creating a New File

Assignment 3

Generating GFF From Scratch

Type
Application
Objective
Write a program that prints the following GFF file. Create GffRecord objects and write them to a GffFileOut using writeRecord().
Solution
#include <seqan/gff_io.h>

using namespace seqan;

int main()
{
    GffFileOut out(std::cout, Gff());

    GffRecord record;

    // Fill and write out the first record.
    record.ref = "ctg123";
    record.source = "";
    record.type = "gene";
    record.beginPos = 999;
    record.endPos = 9000;
    record.strand = '+';
    record.score = GffRecord::INVALID_SCORE();
    appendValue(record.tagNames, "ID");
    appendValue(record.tagValues, "gene0001");
    appendValue(record.tagNames, "Name");
    appendValue(record.tagValues, "EDEN");
    writeRecord(out, record);

    // Clear the record.
    clear(record.tagNames);
    clear(record.tagValues);

    // Fill and write out the second record.
    record.ref = "ctg123";
    record.source = "";
    record.type = "TF_binding_site";
    record.beginPos = 999;
    record.endPos = 1012;
    record.strand = '+';
    record.score = GffRecord::INVALID_SCORE();
    appendValue(record.tagNames, "Parent");
    appendValue(record.tagValues, "gene0001");
    writeRecord(out, record);

    return 0;
}

Next Steps