VCF I/O

Learning Objective

In this tutorial, you will learn how to read and write VCF files.

Difficulty

Average

Duration

1 h (45 min if you know the VCF format)

Prerequisites

Sequences, File I/O Overview, VCF Format Specification (v4.2)

Overview

This tutorial shows how to read and write VCF files using the VcfFileIn and VcfFileOut classes. It starts out with a quick reminder on the structure of VCF files and will then continue with how to read and write VCF files and access the tags of a record.

Important

Note that this tutorial is targeted at readers that already know about the VCF format. If you do not know about the VCF format yet, then this tutorial will be harder for your to understand.

The VCF format allows storing genomic variants of individuals with respect to a reference. The general file structure starts with

  1. several meta-information lines starting with ##,

  2. one header line giving the names of the individuals, and

  3. an arbitrary number of records.

The information of (1) and (2) will be read and written together as the “header” of the file. For simple variants such as SNPs and small indels, each record corresponds to a variant. More complex variants can be stored in multiple records (see the VCF standard on “breakends” for more information).

The vcf_io module of SeqAn allows to read and write VCF files record-wise. Since the structure of the fields in the VCF format often is very complex and the format undergoes changes in this respect, SeqAn only offers basic parsing functionality: The position is stored as a 0-based integer, reference names are stored in a reference name store (similar as in the SAM and BAM I/O Tutorial), and the quality is stored as a float value.

The remaining fields have to be parsed from and composed as strings in the user’s application.

VCF Format

This section gives a very brief overview of the VCF file structure. For more details, see the VCF Format Specification (v4.2).

The following is an example of a VCF file:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

The file starts with meta information lines (starting with ##) with a key/value structure. The most important lines have the keys contig, INFO, FILTER, and FORMAT.

contig

Lines with this key list the contigs of the reference genome.``

INFO

These lines give valid keys (and the format of the values) for the INFO column.

FILTER

Valid values of the FILTER column.

FORMAT

Valid entries for the INFO column.

The meta information lines are followed by the header line which gives the names of the first 9 columns which are always the same (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT) and a non-empty list of sample names. The columns are separated by spaces.

The header line is followed by the records which contains a value for each column in the header.

CHROM

Name of the chromosome/reference sequence that the variant lies on.

POS

The 1-based position of the variant.

ID

A name of the variant. . is used if no name is available.

REF

The value of the reference allele.

ALT

The alternate allele values (multiple values are comma-separated).

QUAL

Quality value of the call (float).

FILTER

A value for the filter result (given in a FILTER meta information line).

INFO

Information about a variant.

FORMAT

Colon-separated list of entries that are found for each variant.

The 9 mandatory columns are followed by as many columns as there are individuals. For each individual, there is a colon-separated list of values in the order given in the FORMAT cell.

Tip

1-based and 0-based positions.

There are two common ways of specifying intervals.

  1. Start counting positions at 1 and give intervals by the first and last position that are part of the interval (closed intervals). For example, the interval [1000; 2000] starts at character 1000 and ends at character 2000 and includes it. This way is natural to non-programmers and used when giving coordinates in GFF files or genome browsers such as UCSC Genome Browser and IGV.

  2. Start counting positions at 0 and give intervals by the first position that is part of the interval and giving the position behind the last position that is part of the interval. The interval from above would be [999; 2000) in this case.

In text representations, such as VCF, 1-based closed intervals are used whereas in the internal binary data structures, SeqAn uses 0-based half-open intervals. When fields are read as text, numbers are not translated, of course.

A First Working Example

The following example shows a program that reads the file example.vcf and prints its contents back to the user on standard output.

#include <seqan/vcf_io.h>

using namespace seqan2;

int main()
{
    // Open input file.
    VcfFileIn vcfIn(toCString(getAbsolutePath("demos/tutorial/vcf_io/example.vcf")));

    // Attach to standard output.
    VcfFileOut vcfOut(vcfIn);
    open(vcfOut, std::cout, Vcf());

    // Copy over header.
    VcfHeader header;
    readHeader(header, vcfIn);
    writeHeader(vcfOut, header);

    // Copy the file record by record.
    VcfRecord record;
    while (!atEnd(vcfIn))
    {
        readRecord(record, vcfIn);
        writeRecord(vcfOut, record);
    }

    return 0;
}

The program first opens a VcfFileIn for reading the file, then a VcfFileOut for writing it. First, the header is copied by means of a VcfHeader object that we will see below. Then, the input file is read record by record and written out to standard output. The alignment records are read into VcfRecord objects which we will focus on below.

The output of the example program looks as follows:

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	microsat1	GTC	G,GTCT	50	PASS	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

Assignment 1

Type

Reproduction

Objective

Create a file with the sample VCF content from above and adjust the path "example.vcf" to the path to your SAM file (e.g. "/path/to/my_example.sam").

Solution
#include <seqan/vcf_io.h>

using namespace seqan2;

int main()
{
    // Open input file.
    VcfFileIn vcfIn(toCString(getAbsolutePath("demos/tutorial/vcf_io/example.vcf")));

    // Attach to standard output.
    VcfFileOut vcfOut(vcfIn);
    open(vcfOut, std::cout, Vcf());

    // Copy over header.
    VcfHeader header;
    readHeader(header, vcfIn);
    writeHeader(vcfOut, header);

    // Copy the file record by record.
    VcfRecord record;
    while (!atEnd(vcfIn))
    {
        readRecord(record, vcfIn);
        writeRecord(vcfOut, record);
    }

    return 0;
}

Accessing the Header

Sequence information from the VCF header is stored in the VcfIOContext. The VcfIOContext of a VcfFileIn can be accessed through the function context. The VCF sequence informations can be in turn accessed through functions contigNames and sampleNames. All remaining VCF header information is stored in the class VcfHeader.

Accessing the Records

The class VcfRecord stores one record in a VCF file. It is best explained by its definition. Note how most fields are represented by Strings:

using namespace seqan2;

class VcfRecord
{
public:
    int32_t rID;                          // CHROM
    int32_t beginPos;                     // POS
    CharString id;                        // ID
    CharString ref;                       // REF
    CharString alt;                       // ALT
    float qual;                           // QUAL
    CharString filter;                    // FILTER
    CharString info;                      // INFO
    CharString format;                    // FORMAT
    StringSet<CharString> genotypeInfos;  // <individual1> <individual2> ..

    // Constants for marking reference id and position as invalid.
    static const int32_t INVALID_REFID = -1;
    static const int32_t INVALID_POS = -1;
    // This function returns the float value for "invalid quality".
    static float MISSING_QUAL();
};

The static members INVALID_POS and INVALID_REFID store sentinel values for marking positions and reference sequence ids as invalid. The static funtion MISSING_QUAL() returns the IEEE float “NaN” value.

Tip

A VcfRecord is linked to a reference sequence by the field rID and to samples by genotypeInfos. The sequence information is stored in the VCF header and kept in the VcfIOContext.

Assignment 2

Counting Records

Type

Review

Objective

Change the result of Assignment 1 by counting the number of variants for each chromosome/contig instead of writing out the records.

Hints

The reference sequence information from the VCF header is stored inside the VcfIOContext of its VcfFileIn. You can obtain the number of contigs from the length of the contigNames.

Solution
#include <seqan/vcf_io.h>

using namespace seqan2;

int main()
{
    try
    {
        // Open input file.
        VcfFileIn vcfIn(toCString(getAbsolutePath("demos/tutorial/vcf_io/example.vcf")));

        // Copy over header.
        VcfHeader header;
        readHeader(header, vcfIn);

        // Get array of counters.
        String<unsigned> counters;
        unsigned contigsCount = length(contigNames(context(vcfIn)));
        resize(counters, contigsCount, 0);

        // Read the file record by record.
        VcfRecord record;
        while (!atEnd(vcfIn))
        {
            readRecord(record, vcfIn);

            // Register record with counters.
            counters[record.rID] += 1;
        }

        // Print result.
        std::cout << "VARIANTS ON CONTIGS\n";
        for (unsigned i = 0; i < contigsCount; ++i)
            std::cout << contigNames(context(vcfIn))[i] << '\t'
                      << counters[i] << '\n';
    }
    catch (seqan2::Exception const & e)
    {
        std::cerr << "ERROR:" << e.what() << std::endl;
        return 1;
    }

    return 0;
}

The output is

VARIANTS ON CONTIGS
20	5

Creating a New File

Assignment 3

Generating VCF From Scratch

Type

Application

Objective

Write a program that manually creates the VCF file from above and than prints it back on standard output.

Hint
Solution
#include <seqan/vcf_io.h>

using namespace seqan2;

int main()
{
    // Open output file.
    VcfFileOut out(std::cout, Vcf());

    // Fill sequence names.
    appendValue(contigNames(context(out)), "20");

    // Fill sample names.
    appendValue(sampleNames(context(out)), "NA00001");
    appendValue(sampleNames(context(out)), "NA00002");
    appendValue(sampleNames(context(out)), "NA00002");

    // Fill and write out headers - This is somewhat tedious.
    VcfHeader header;
    appendValue(header, VcfHeaderRecord("fileformat", "VCFv4.1"));
    appendValue(header, VcfHeaderRecord("fileDate", "20090805"));
    appendValue(header, VcfHeaderRecord("source", "myImputationProgramV3.1"));
    appendValue(header, VcfHeaderRecord("reference", "file:///seq/references/1000GenomesPilot-NCBI36.fasta"));
    appendValue(header, VcfHeaderRecord("contig", "<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"));
    appendValue(header, VcfHeaderRecord("phasing", "partial"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">"));
    appendValue(header, VcfHeaderRecord("INFO", "<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">"));
    appendValue(header, VcfHeaderRecord("FILTER", "<ID=q10,Description=\"Quality below 10\">"));
    appendValue(header, VcfHeaderRecord("FILTER", "<ID=s50,Description=\"Less than 50% of samples have data\">"));
    appendValue(header, VcfHeaderRecord("FORMAT", "<ID=GT,Number=1,Type=String,Description=\"Genotype\">"));
    appendValue(header, VcfHeaderRecord("FORMAT", "<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"));
    appendValue(header, VcfHeaderRecord("FORMAT", "<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"));
    appendValue(header, VcfHeaderRecord("FORMAT", "<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">"));
    writeHeader(out, header);

    // Fill and write out the record.
    VcfRecord record;
    record.rID = 0;
    record.beginPos = 14369;
    record.id = "rs6054257";
    record.ref = "G";
    record.alt = "A";
    record.qual = 29;
    record.filter = "PASS";
    record.info = "NS=3;DP=14;AF=0.5;DB;H2";
    record.format = "GT:GQ:DP:HQ";
    appendValue(record.genotypeInfos, "0|0:48:1:51,51");
    appendValue(record.genotypeInfos, "1|0:48:8:51,51");
    appendValue(record.genotypeInfos, "1/1:43:5:.,.");
    writeRecord(out, record);

    return 0;
}

Next Steps