VCF I/O

Learning Objective
In this tutorial, you will learn how to use the high-level interface VcfStream class to read and write VCF files.
Difficulty
Average
Duration
1 h (45 min if you know the VCF format)
Prerequisites
Sequences, exposure to the VCf format

This tutorial deals with how to easily read and write VCF files using the VcfStream class. It starts out with a quick reminder on the structure of VCF files and will then continue with how to read and write VCF files and access the tags of a record.

Important

Note that this tutorial is targeted at readers that already know about the VCF format. If you do not know about the VCF format yet then this tutorial will be harder for your to understand. The 1000 Genomes project hosts the VCF format specification (v.4.1).

The VCF format allows storing genomic variants of individuals with respect to a reference. The general file structure starts with (1) meta-information lines starting with ##, one (2) header line giving the names of the individuals, and (3) an arbitrary number of records.

The information of (1) and (2) will be read and written together as the “header” of the file. For simple variants such as SNPs and small indels, each record corresponds to a variant. More complex variants can be stored in multiple records (see the VCF standard on “breakends” for more information).

The vcf_io module of SeqAn allows the record-wise reading and writing to VCF files. Since the structure of the fields in the VCF format often is very complex and the format undergoes changes in this respect, SeqAn only offers basic parsing functionality: The position is stored as a 0-based integer, reference names are stored in a reference name store (similar as in the Basic SAM and BAM I/O Tutorial), and the quality is stored as a float value.

The remaining fields have to be parsed from and composed as strings in the user’s application.

VCF File Structure

This section gives a very brief overview of the VCF file structure. For more details, see the VCF format specification (v4.1).

The following is an example of a VCF file:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

The file starts with meta information lines (starting with ##) with a key/value structure. The most important lines have the keys contig, INFO, FILTER, and FORMAT.

contig
Lines with this key list the contigs of the reference genome.``
INFO
These lines give valid keys (and the format of the values) for the INFO column.
FILTER
Valid values of the FILTER column.
FORMAT
Valid entries for the INFO column.

The meta information lines are followed by the header line which gives the names of the first 9 columns which are always the same (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) and a non-empty list of sample names. The columns are separated by spaces.

The header line is followed by the records which contains a value for each column in the header.

CHROM
Name of the chromosome/reference sequence that the variant lies on.
POS
The 1-based position of the variant.
ID
A name of the variant. . is used if no name is available.
REF
The value of the reference allele.
ALT
The alternate allele values (multiple values are comma-separated).
QUAL
Quality value of the call (float).
FILTER
A value for the filter result (given in a FILTER meta information line).
INFO
Information about a variant.
FORMAT
Colon-separated list of entries that are found for each variant.

The 9 mandatory columns are followed by as many columns as there are individual. For each individual, there is a colon-separated list of values in the order given in the FORMAT cell.

Tip

1-based and 0-based positions.

There are two common ways of specifying intervals.

  1. Start counting positions at 1 and give intervals by the first and last position that are part of the interval (closed intervals). For example, the interval [1,000; 2,000] starts at character 1,000 and ends at character 2,000 and includes it. This way is natural to non-programmers and used when giving coordinates in GFF files or genome browsers such as UCSC Genome Browser and IGV.
  2. Start counting positions at 0 and give intervals by the first position that is part of the interval and giving the position behind the last position that is part of the interval. The interval from above would be [999; 2,000) in this case.

In text representations, such as VCF, 1-based closed intervals are used whereas in the internal binary data structures, SeqAn uses 0-based half-open intervals. When fields are reads as text, numbers are not translated, of course.

A First Working Example

The following example shows an example of a program that reads the file with the path example.vcf and prints its contents back to the user on stdout. If you want to try out this program then create a file with the sample VCF content from above and adjust the path "example.vcf" in the program below to the path to your VCF file (e.g. "path/to/my_example.vcf").

#include <seqan/basic.h>
#include <seqan/vcf_io.h>

int main()
{
    // Open input stream.
    seqan::VcfStream vcfIn("example.vcf");
    // Open output stream, filename "-" means stdout.
    seqan::VcfStream vcfOut("-", seqan::VcfStream::WRITE);

    // Copy over header.
    vcfOut.header = vcfIn.header;

    // Read the file record by record.
    seqan::VcfRecord record;
    while (!atEnd(vcfIn))
    {
        readRecord(record, vcfIn);
        writeRecord(vcfOut, record);
    }
    
    return 0;
}

The program first opens a VcfStream for reading, then one for writing. You can read from stdin and write to stdout using "-" as the file name.

The header is automatically read when a VcfStream is opened. After the header has been read, it is copied over into the output stream. Then, the input stream is read record by record and written out to the output stream. Note that the header is written out automatically before the first variant record is written.

The alignment records are read into VcfRecord objects which we will focus on below.

Note that the example above is missing error handling. This means that if the input format is ill-formed, error return codes are not handled appropriately and the program might do something unexpected in the case of an error. We will fix this in Assignment 1.

You can see the output of the program below when called with the input file from above.

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTC    G,GTCT  50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

To add error handling, we have to check return values. The readRecor call returns a status code different from 0, indicating an error.

In Assignment 1, we will add error handling to the program.

Assignment 1

Adding Error Handling

Type
Review
Objective
Add error handling using the hints below.
Hints
The functions readRecord and writeRecord return a status code int, 0 on success, 1 on errors. The function isGood checks whether the state of a VcfStream is errorneous.
Solution
#include <seqan/basic.h>
#include <seqan/vcf_io.h>

int main()
{
    // Open input stream
    seqan::VcfStream vcfIn("example.vcf");
    if (!isGood(vcfIn))
    {
        std::cerr << "ERROR: Could not open example.vcf\n";
        return 1;
    }
    // Open output stream, filename "-" means stdout.
    seqan::VcfStream vcfOut("-", seqan::VcfStream::WRITE);

    // Copy over header.
    vcfOut.header = vcfIn.header;

    // Read the file record by record.
    seqan::VcfRecord record;
    while (!atEnd(vcfIn))
    {
        if (readRecord(record, vcfIn) != 0)
        {
            std::cerr << "ERROR: Problem reading from example.vcf\n";
            return 1;
        }
        if (writeRecord(vcfOut, record) != 0)
        {
            std::cerr << "ERROR: Problem writing to stdout.\n";
            return 1;
        }
    }
    
    return 0;
}

The Class VcfRecord

The class VcfRecord stores one record in a VCF file. It is best explained by its definition. Note how most fields are represented by strings:

namespace seqan {

class VcfRecord
{
public:
    __int32 rID;                          // CHROM
    __int32 beginPos;                     // POS
    CharString id;                        // ID
    CharString ref;                       // REF
    CharString alt;                       // ALT
    float qual;                           // QUAL
    CharString filter;                    // FILTER
    CharString info;                      // INFO
    CharString format;                    // FORMAT
    StringSet<CharString> genotypeInfos;  // <individual1> <individual2> ..

    // Constants for marking reference id and position as invalid.
    static const __int32 INVALID_REFID = -1;
    static const __int32 INVALID_POS = -1;
    // This function returns the float value for "invalid quality".
    static float MISSING_QUAL();
};

}  // namespace seqan

The static members INVALID_POS, INVALID_REFID store sentinel values for marking positions and reference sequence ids as invalid. The static funtion MISSING_QUAL() returns the IEEE float “NaN” value. In C++11, there will be a std::nan() function but for now, we need this here.

Assignment 2

Counting Records

Type
Review
Objective
Change the result of Assignment 1 by counting the number of variants for each chromosome/contig instead of writing out the records.
Hints
The header contains the sequence names in vcfIn.header.sequenceNames. You can use the length of this StringSet of CharString to get the number of contigs.
Solution
#include <seqan/basic.h>
#include <seqan/vcf_io.h>

int main()
{
    // Open input stream
    seqan::VcfStream vcfIn("example.vcf");
    if (!isGood(vcfIn))
    {
        std::cerr << "ERROR: Could not open example.vcf\n";
        return 1;
    }
    // Open output stream, filename "-" means stdout.
    seqan::VcfStream vcfOut("-", seqan::VcfStream::WRITE);

    // Copy over header.
    vcfOut.header = vcfIn.header;

    // Get array of counters.
    seqan::String<unsigned> counters;
    resize(counters, length(vcfIn.header.sequenceNames), 0);

    // Read the file record by record.
    seqan::VcfRecord record;
    while (!atEnd(vcfIn))
    {
        if (readRecord(record, vcfIn) != 0)
        {
            std::cerr << "ERROR: Problem reading from example.vcf\n";
            return 1;
        }

        // Register record with counters.
        counters[record.rID] += 1;
    }

    // Print result.
    std::cout << "VARIANTS ON CONTIGS\n";
    for (unsigned i = 0; i < length(vcfIn.header.sequenceNames); ++i)
        std::cout << vcfIn.header.sequenceNames[i] << '\t'
                  << counters[i] << '\n';
    
    return 0;
}

The output is

VARIANTS ON CONTIGS
20  5

The Classes VcfHeader and VcfHeaderRecord

The header information is stored in the class VcfHeader. Objects of this class store the information present in the VCF meta information and header lines.

The class has three members: sequenceNames, sampleNames, and headerRecords. sequenceNames and sampleNames are StringSets of CharStrings. The member rID of VcfRecord points into sequenceNames and gives the reference sequence. The genotypeInfos member of VcfRecord has the same number of entires as sampleNames and record.genotypeInfos[i] contains the variant information for sampleNames[i].

When writing VCF files, you have to fill these three members of VcfHeader before writing any record.

Assignment 3

Generating VCF From Scratch

Type
Application
Objective
Write a program that prints the VCF file from above.
Hints

You can convert integers into strings using the <sstream> STL header.

#include <sstream>
// ...
std::stringstream ss;
ss << 10;
seqan::CharString str = ss.str();  // => == "10"
// To reset ss, we need two calls:
ss.str("");  // Remove contents.
ss.clear();  // Reset any error bits.
Solution
#include <seqan/basic.h>
#include <seqan/vcf_io.h>

#include <sstream>

int main()
{
    seqan::VcfStream out("-", seqan::VcfStream::WRITE);

    // Fill sequence names.
    appendValue(out.header.sequenceNames, "20");

    // Fill sample names.
    appendValue(out.header.sampleNames, "NA00001");
    appendValue(out.header.sampleNames, "NA00002");
    appendValue(out.header.sampleNames, "NA00002");

    // Write out headers.
    //
    // This is somewhat tedious.
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("fileformat", "VCFv4.1"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("fileDate", "20090805"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("source", "myImputationProgramV3.1"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("reference", "file:///seq/references/1000GenomesPilot-NCBI36.fasta"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("contig", "<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("phasing", "partial"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("INFO", "<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("FILTER", "<ID=q10,Description=\"Quality below 10\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("FILTER", "<ID=s50,Description=\"Less than 50% of samples have data\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("ID", "<ID=GT,Number=1,Type=String,Description=\"Genotype\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("ID", "<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("ID", "<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"));
    appendValue(out.header.headerRecords, seqan::VcfHeaderRecord("ID", "<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">"));

    // Write out the records.
    seqan::VcfRecord record;

    record.rID = 0;
    record.beginPos = 14369;
    record.id = "rs6054257";
    record.ref = "G";
    record.alt = "A";
    record.qual = 29;
    record.filter = "PASS";
    record.info = "NS=3;DP=14;AF=0.5;DB;H2";
    record.format = "GT:GQ:DP:HQ";
    appendValue(record.genotypeInfos, "0|0:48:1:51,51");
    appendValue(record.genotypeInfos, "1|0:48:8:51,51");
    appendValue(record.genotypeInfos, "1/1:43:5:.,.");
    if (writeRecord(out, record) != 0)
        std::cerr << "ERROR: Problem writing output file.";
    
    return 0;
}

Next Steps

comments powered by Disqus