GFF and GTF I/O¶
- Learning Objective
- In this tutorial, you will learn how to use the high-level interface GffStream class to read and write GFF and GTF files.
- Difficulty
- Average
- Duration
- 45 min
- Prerequisites
- Exposure to the GFF and GTF formats is useful.
This tutorial deals with how to easily read and write GFF and GTF files using the GffStream class. It starts out with a quick reminder on the structure of GFF and GTF files and will then continue with how to read and write GFF and GTF files.
The GFF and GTF formats are used for annotating genomic intervals (an interval with begin/end position on a contig/chromosome). GFF exist in versions 2 and 3 and GTF is sometimes called “GFF 2.5”. There are specifications for GFF 2, GFF 3, and GTF available elsewhere. GFF and GTF are TSV-based formats and in general have the same structure. The main difference is the underlying system/ontology for the annotation but also smaller differences in the format.
In this tutorial, we will focus on the format GFF 3 since it is the most current one with most complete tool support. The information of this tutorial can easily be translated to the other two formats.
The SeqAn module gff_io allows the reading and writing of the GFF and GTF formats.
Tip
Format Version Support in SeqAn
GffStream allows to read GFF files in version 2 and 3 and GTF files. For writing, only GFF 3 and GTF are supported at the moment.
GFF File Structure¶
The following is an example of a GFF 3 file:
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=mRNA00003
ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003
The meaning of the columns are as follows:
- seq id (1)
- Name of the reference sequence.
- source (2)
- Free text field describing the source of the annotation, such as a software (e.g. “Genescan”) or a a database (e.g. “Genebank”), “.” for none.
- type (3)
- The type of the annotation.
- start (4)
- The 1-based begin position of the annotation.
- end (5)
- The 1-based end position of the annotation.
- score (6)
- The score of the annotation, “.” for none.
- strand (7)
- The strand of the annotation, “+” and “-” for forward and reverse strand, “.” for features that are not stranded.
- phase (8)
- Shift of the feature regarding to the reading frame, one of “0”, “1”, “2”, and “.” for missing/dont-care.
- attributes (9)
- A list of key/value attributes. For GFF 3, this is a list of key=value pairs, separated by semicolons (e.g. ID=cds00003;Parent=mRNA00003). For GTF and GFF 2, this is a list of tuples, separated by semicolon. The first entry gives the key, the following entries are values. Strings are generally enclosed in quotes (e.g. Target "HBA_HUMAN" 11 55 ; E_value 0.0003)
Tip
1-based and 0-based positions.
There are two common ways of specifying intervals.
- Start counting positions at 1 and give intervals by the first and last position that are part of the interval (closed intervals). For example, the interval [1,000; 2,000] starts at character 1,000 and ends at character 2,000 and includes it. This way is natural to non-programmers and used when giving coordinates in GFF files or genome browsers such as UCSC Genome Browser and IGV.
- Start counting positions at 0 and give intervals by the first position that is part of the interval and giving the position behind the last position that is part of the interval. The interval from above would be [999; 2,000) in this case.
In text representations, such as GFF and GTF, 1-based closed intervals are used whereas in the internal binary data structures, SeqAn uses 0-based half-open intervals.
A First Working Example¶
The following example shows an example of a program that reads the file with the path example.gff and prints its contents back to the user on stdout. If you want to try out this program then create a file with the sample GFF content from above and adjust the path "example.gff" in the program below to the path to your GFF file (e.g. "path/to/my_example.gff").
#include <seqan/basic.h>
#include <seqan/gff_io.h>
int main()
{
// Open input stream.
seqan::GffStream gffIn("example.gff");
// Open output stream, filename "-" means stdout.
seqan::GffStream gffOut("-", seqan::GffStream::WRITE);
// Read the file record by record.
seqan::GffRecord record;
while (!atEnd(gffIn))
{
readRecord(record, gffIn);
// If record is on a sequence that is not known to gffOut yet then we
// have to make it known there.
if (record.rID >= (int)length(gffOut.sequenceNames))
addSequenceName(gffOut, record.ref);
writeRecord(gffOut, record);
}
return 0;
}
The program first opens a GffStream for reading, then one for writing. You can read from stdin and write to stdout using "-" as the file name.
The member sequenceNames of your GffStream object gffIn contains the names of the reference sequences that have been seen in records so far. This StringSet of CharString thus gets new elements as you read the Gff file. For the translation between reference names and numeric ids, a cache is used. The function [dox:GffStream#addSequenceName addSequenceName can be used to register the sequence name with the gffOut stream. This will also update the cache.
Note that the example above is missing error handling. This means that if the input format is ill-formed, error return codes are not handled appropriately and the program might do something unexpected in the case of an error. We will fix this in Assignment 1.
You can see the output of the program below when called with the input file from above.
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=mRNA00003
ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003
To add error handling, we have to check return values. The readRecord call returns a status code different from 0, indicating an error.
In Assignment 1, we will add error handling to the program.
Assignment 1¶
Adding Error Handling
- Type
- Review
- Objective
- Add error handling using the hints below.
- Hints
- The functions readRecord and writeRecord return a status code int, 0 on success, 1 on errors. The function isGood checks whether the state of a GffStream is errorneous.
- Solution
#include <seqan/basic.h> #include <seqan/gff_io.h> int main() { // Open input stream seqan::GffStream gffIn("example.gff"); if (!isGood(gffIn)) { std::cerr << "ERROR: Could not open example.gff\n"; return 1; } // Open output stream, filename "-" means stdout. seqan::GffStream gffOut("-", seqan::GffStream::WRITE); // Read the file record by record. seqan::GffRecord record; while (!atEnd(gffIn)) { if (readRecord(record, gffIn) != 0) { std::cerr << "ERROR: Problem reading from example.gff\n"; return 1; } // If record is on a sequence that is not known to gffOut yet then we // have to make it known there. if (record.rID >= (int)length(gffOut.sequenceNames)) addSequenceName(gffOut, record.ref); if (writeRecord(gffOut, record) != 0) { std::cerr << "ERROR: Problem writing to stdout.\n"; return 1; } } return 0; }
The Class GffRecord¶
The class GffRecord stores one record in a Gff file.
namespace seqan {
class GffRecord
{
public:
CharString ref; // reference name
__int32 rID; // index in sequenceNames of GffStream
CharString source; // source free text descriptor
CharString type; // type of the feature
__int32 beginPos; // begin position of the interval
__int32 endPos; // end position of the interval
float score; // score of the annotation
char strand; // the strand
char phase; // one of '0', '1', '2', and '.'
// The key/value list, split into a list of keys and values.
StringSet<CharString> tagName;
StringSet<CharString> tagValue;
// Returns float value for an invalid score.
static float INVALID_SCORE();
// Constants for marking reference id and position as invalid.
static const __int32 INVALID_IDX = -1;
static const __int32 INVALID_POS = -1;
};
} // namespace seqan
The static members INVALID_POS, INVALID_REFID store sentinel values for marking positions and reference sequence ids as invalid. The static funtion INVALID_SCORE() returns the IEEE float “NaN” value. In C++11, there will be a std::nan() function but for now, we need this here.
The member ref stores the contig/reference name of the genomic interval. This information is somewhat redundant with the rID member that is filled automatically when reading from a GffStream such that the GffStream’s sequenceNames[record.rID] == record.ref. Translating reference names to integers is useful in many applications.
When writing and record.rID == INVALID_REFID then record.ref is written out as the reference name and sequenceNames[record.rID] is written out otherwise. The user has to take care that record.rID is a valid reference id in this case.
Assignment 2¶
Counting Records
- Type
- Review
- Objective
- Change the result of Assignment 1 by counting the number of variants for each chromosome/contig instead of writing out the records.
- Solution
#include <seqan/basic.h> #include <seqan/gff_io.h> int main() { // Open input stream seqan::GffStream gffIn("example.gff"); if (!isGood(gffIn)) { std::cerr << "ERROR: Could not open example.gff\n"; return 1; } // Array of counters and sequence names. seqan::String<unsigned> counters; seqan::StringSet<seqan::CharString> seqNames; // Read the file record by record. seqan::GffRecord record; while (!atEnd(gffIn)) { if (readRecord(record, gffIn) != 0) { std::cerr << "ERROR: Problem reading from example.gff\n"; return 1; } // Resize counters and write seqNames if necessary. if ((int)length(counters) <= record.rID) { resize(counters, record.rID + 1, 0); resize(seqNames, record.rID + 1); } if (counters[record.rID] == 0) seqNames[record.rID] = record.ref; // Register record with counters. counters[record.rID] += 1; } // Print result. std::cout << "RECORDS ON CONTIGS\n"; for (unsigned i = 0; i < length(seqNames); ++i) if (counters[i] > 0u) std::cout << seqNames[i] << '\t' << counters[i] << '\n'; return 0; }
The output is
RECORDS ON CONTIGS ctg123 23
Assignment 3¶
Generating GFF From Scratch
- Type
- Application
- Objective
Write a program that prints the following GFF file. Create GffRecord objects and write them to a GffStream using writeRecord().
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
- Solution
#include <seqan/basic.h> #include <seqan/gff_io.h> #include <sstream> int main() { seqan::GffStream out("-", seqan::GffStream::WRITE); // Add sequence names. addSequenceName(out, "ctg123"); // Write out the records. seqan::GffRecord record; record.rID = 0; record.source = ""; record.type = "gene"; record.beginPos = 999; record.endPos = 9000; record.strand = '+'; record.score = seqan::GffRecord::INVALID_SCORE(); appendValue(record.tagName, "ID"); appendValue(record.tagValue, "gene0001"); appendValue(record.tagName, "Name"); appendValue(record.tagValue, "EDEN"); if (writeRecord(out, record) != 0) std::cerr << "ERROR: Problem writing output file."; clear(record.tagName); clear(record.tagValue); record.rID = 0; record.source = ""; record.type = "TF_binding_site"; record.beginPos = 999; record.endPos = 1012; record.strand = '+'; record.score = seqan::GffRecord::INVALID_SCORE(); appendValue(record.tagName, "Parent"); appendValue(record.tagValue, "gene0001"); if (writeRecord(out, record) != 0) std::cerr << "ERROR: Problem writing output file."; return 0; }
GFF and GTF¶
The class GffStream transparently reads files in both GFF and GTF format. When writing, you can select the output format with the third parameter to the constructor GffStream or the function open. When using GffStream::GFF, GFF 3 is used, when using GffStream::GTF, the GTF format. The default is to use GffStream::GFF.
The following program converts a GFF file into a GTF file.
#include <seqan/basic.h>
#include <seqan/gff_io.h>
int main()
{
// Open input stream.
seqan::GffStream gffIn("example.gff");
// Open output stream, filename "-" means stdout.
seqan::GffStream gffOut("-", seqan::GffStream::WRITE, seqan::GffStream::GTF);
// Read the file record by record.
seqan::GffRecord record;
while (!atEnd(gffIn))
{
readRecord(record, gffIn);
// If record is on a sequence that is not known to gffOut yet then we
// have to make it known there.
if (record.rID >= (int)length(gffOut.sequenceNames))
addSequenceName(gffOut, record.ref);
writeRecord(gffOut, record);
}
return 0;
}
Given the GFF file at the top, the output of the example above will look as follows:
ctg123 . gene 1000 9000 . + . ID "gene00001"; Name "EDEN";
ctg123 . TF_binding_site 1000 1012 . + . Parent "gene00001";
ctg123 . mRNA 1050 9000 . + . ID "mRNA00001"; Parent "gene00001";
ctg123 . mRNA 1050 9000 . + . ID "mRNA00002"; Parent "gene00001";
ctg123 . mRNA 1300 9000 . + . ID "mRNA00003"; Parent "gene00001";
ctg123 . exon 1300 1500 . + . Parent "mRNA00003";
ctg123 . exon 1050 1500 . + . Parent "mRNA00001,mRNA00002";
ctg123 . exon 3000 3902 . + . Parent "mRNA00001,mRNA00003";
ctg123 . exon 5000 5500 . + . Parent "mRNA00001,mRNA00002,mRNA00003";
ctg123 . exon 7000 9000 . + . Parent "mRNA00001,mRNA00002,mRNA00003";
ctg123 . CDS 1201 1500 . + 0 ID "cds00001"; Parent "mRNA00001";
ctg123 . CDS 3000 3902 . + 0 ID "cds00001"; Parent "mRNA00001";
ctg123 . CDS 5000 5500 . + 0 ID "cds00001"; Parent "mRNA00001";
ctg123 . CDS 7000 7600 . + 0 ID "cds00001"; Parent "mRNA00001";
ctg123 . CDS 1201 1500 . + 0 ID "cds00002"; Parent "mRNA00002";
ctg123 . CDS 5000 5500 . + 0 ID "cds00002"; Parent "mRNA00002";
ctg123 . CDS 7000 7600 . + 0 ID "cds00002"; Parent "mRNA00002";
ctg123 . CDS 3301 3902 . + 0 ID "cds00003"; Parent "mRNA00003";
ctg123 . CDS 5000 5500 . + 1 ID "cds00003"; Parent "mRNA00003";
ctg123 . CDS 7000 7600 . + 1 ID "cds00003"; Parent "mRNA00003";
ctg123 . CDS 3391 3902 . + 0 ID "cds00004"; Parent "mRNA00003";
ctg123 . CDS 5000 5500 . + 1 ID "cds00004"; Parent "mRNA00003";
ctg123 . CDS 7000 7600 . + 1 ID "cds00004"; Parent "mRNA00003";
Next Steps¶
- Continue with the Tutorial.