BED I/O

Learning Objective

In this tutorial, you will learn how to read and write BED files.

Difficulty

Average

Duration

45min

Prerequisites

Sequences, File I/O Overview, BED Format Specification

This tutorial shows how to read and write BED files using the BedFileIn and BedFileOut classes. It starts out with a quick reminder on the structure of BED files and will then continue with how to read and write BED files.

Originally, the BED format was designed for storing annotation tracks in the UCSC genome browser. Such an annotation track consists of multiple annotation records. Each annotation adds some meta information to a genomic interval (an interval with begin/end position on a contig/chromosome) The original specification of the format can be found in the UCSC Genome Browser FAQ.

The BED format is a TSV format and contains 12 columns. The first three columns specify a genomic region (contig/chromsome name, begin, and end position) and the remaining columns contain additional information. The full format will be described below.

Since genomic intervals are very useful and because there were many tools for manipulating BED files (sorting, intersecting intervals etc.), many other authors and projects created variants of the BED format. Usually, three or more columns have the same meaning as in BED and are followed by other, arbitrary tab-separated columns with additional annotation information. The “full” BED format is then called BED12. BED3, BED4, BED5, and BED6 use the first 3-6 columns and keep the remaining information as data.

BED files can be manipuluated using standard Unix tools such as sed, awk, and sort. There also is the bedtools suite with additional functionality.

The SeqAn module bed_io allows the reading and writing of BED files.

BED Format

The following is an example of a BED file:

chr1	66999824	67210768	NM_032291	0	+	6700004167208778	0	25	227,64,25,72,57,55,176,12,12,25,52,86,93,75,501,128,127,60,112,156,133,203,65,165,2013,	0,91705,98928,101802,105635,108668,109402,126371,133388,136853,137802,139139,142862,145536,147727,155006,156048,161292,185152,195122,199606,205193,206516,207130,208931,
chr1	48998526	50489626	NM_032785	0	-	4899984450489468	0	14	1439,27,97,163,153,112,115,90,40,217,95,125,123,192,	0,2035,6787,54149,57978,101638,120482,130297,334336,512729,712915,1164458,1318541,1490908,
chr1	16767166	16786584	NM_018090	0	+	1676725616785385	0	8	182,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,
chr1	33546713	33585995	NM_052998	0	+	3354785033585783	0	12	182,121,212,177,174,173,135,166,163,113,215,351,0,275,488,1065,2841,10937,12169,13435,15594,16954,36789,38931,
chr1	16767166	16786584	NM_001145278	0	+	1676725616785385	0	8	104,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,

The meaning of the columns are as follows:

ref (1)

Name of the reference sequence.

beginPos (2)

Begin position of the interval.

endPos (3)

End position of the interval.

name (4)

Name of the interval.

score (5)

A score, could also be in scientific notation or several values in a comma/colon-separated list.

strand (6)

The strand of the feature, + for forward, - for reverse, . for unknown/dont-care.

thickBegin (7)

Begin position where the feature is drawn thick in the UCSC browser.

thickEnd (8)

End position where the feature is drawn thick in the UCSC browser.

itemRgb (9)

Comma-separated triple with RGB values (0..255 each)

blockCount (10)

The number of blocks (exons) in the BED line (for the UCSC browser).

blockStarts (11)

Comma-separated list with begin positions of exons (for the UCSC browser, should be consistent with blockCount).

blockSizes (12)

Comma-separated list with exon lists (for the UCSC browser, should be consistent with blockCount).

Tip

1-based and 0-based positions.

There are two common ways of specifying intervals.

  1. Start counting positions at 1 and give intervals by the first and last position that are part of the interval (closed intervals). For example, the interval [1000; 2000] starts at character 1000 and ends at character 2000 and includes it. This way is natural to non-programmers and used when giving coordinates in GFF files or genome browsers such as UCSC Genome Browser and IGV.

  2. Start counting positions at 0 and give intervals by the first position that is part of the interval and giving the position behind the last position that is part of the interval. The interval from above would be [999; 2000) in this case.

In text representations, such as GFF and GTF, 1-based closed intervals are used whereas in the internal binary data structures, SeqAn uses 0-based half-open intervals. BED is a text format using 0-based positions.

A First Working Example

The following example shows an example of a program that reads the file example.bed and prints its contents back to the user on standard output.

#include <seqan/bed_io.h>
using namespace seqan2;

int main()
{
    // Open input bed file.
    BedFileIn bedIn(toCString(getAbsolutePath("demos/tutorial/bed_io/example.bed")));

    // Attach to standard output.
    BedFileOut bedOut(std::cout, Bed());

    // Copy the file record by record.
    BedRecord<Bed3> record;

    while (!atEnd(bedIn))
    {
        readRecord(record, bedIn);
        writeRecord(bedOut, record);
    }

    return 0;
}

The program first opens a BedFileIn for reading and a BedFileOut for writing. The BED records are read into BedRecord objects which we will focus on below. In this case, we use the Bed3Record specialization of the BedRecord class.

The output of the example program looks as follows:

chr1	66999824	67210768	NM_032291	0	+	6700004167208778	0	25	227,64,25,72,57,55,176,12,12,25,52,86,93,75,501,128,127,60,112,156,133,203,65,165,2013,	0,91705,98928,101802,105635,108668,109402,126371,133388,136853,137802,139139,142862,145536,147727,155006,156048,161292,185152,195122,199606,205193,206516,207130,208931,
chr1	48998526	50489626	NM_032785	0	-	4899984450489468	0	14	1439,27,97,163,153,112,115,90,40,217,95,125,123,192,	0,2035,6787,54149,57978,101638,120482,130297,334336,512729,712915,1164458,1318541,1490908,
chr1	16767166	16786584	NM_018090	0	+	1676725616785385	0	8	182,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,
chr1	33546713	33585995	NM_052998	0	+	3354785033585783	0	12	182,121,212,177,174,173,135,166,163,113,215,351,0,275,488,1065,2841,10937,12169,13435,15594,16954,36789,38931,
chr1	16767166	16786584	NM_001145278	0	+	1676725616785385	0	8	104,101,105,82,109,178,76,1248,	0,2960,7198,7388,8421,11166,15146,18170,

Assignment 1

Type

Reproduction

Objective

Create a file with the sample BED content from above and adjust the path "example.bed" to the path to your BED file (e.g. "/path/to/my_example.bed").

Solution
#include <seqan/bed_io.h>

using namespace seqan2;

int main()
{
    // Open input bed file.
    BedFileIn bedIn;
    if (!open(bedIn, toCString(getAbsolutePath("demos/tutorial/bed_io/example.bed"))))
    {
        std::cerr << "ERROR: Could not open example.bed" << std::endl;
        return 1;
    }
    // Attach to standard output.
    BedFileOut bedOut(std::cout, Bed());

    // Read the file record by record.
    BedRecord<Bed3> record;

    try
    {
        while (!atEnd(bedIn))
        {
            readRecord(record, bedIn);
            writeRecord(bedOut, record);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}

Accessing the Records

The class BedRecord stores one record in a BED file. Note that there are various specializations, each storing a different number of fields. We show the quasi-definition of BedRecord below. The other specializations have less fields.

#include <seqan/basic.h>
#include <seqan/bed_io.h>

using namespace seqan2;

class BedRecord
{
public:
   CharString ref;      // reference name
   int32_t rID;         // index in sequenceNames of BedFile
   int32_t beginPos;    // begin position of the interval
   int32_t endPos;      // end position of the interval
   CharString name;     // name of the interval
   CharString score;    // score of the interval
   char strand;         // strand of the interval

   int32_t thickBegin;  // begin position for drawing thickly
   int32_t thickEnd;    // end position for drawing thickly
   BedRgb itemRgb;      // color for the item
   int32_t blockCount;  // number of blocks/exons
   String<int32_t> blockSizes;   // block sizes
   String<int32_t> blockBegins;  // block begin positions

   CharString data;    // any data not fitting into other members

   // Constants for marking reference id and position as invalid.
   static const int32_t INVALID_REFID = -1;
   static const int32_t INVALID_POS = -1;
};

The static members INVALID_POS and INVALID_REFID store sentinel values for marking positions and reference sequence ids as invalid.

Assignment 2

Counting Records

Type

Review

Objective

Change the result of Assignment 1 by counting the number of variants for each chromosome/contig instead of writing out the records.

Solution
#include <seqan/bed_io.h>
#include <seqan/misc/name_store_cache.h>

using namespace seqan2;

int main()
{
    // Open input bed file.
    BedFileIn bedIn;
    if (!open(bedIn, toCString(getAbsolutePath("demos/tutorial/bed_io/example.bed"))))
    {
        std::cerr << "ERROR: Could not open example.bed\n";
        return 1;
    }

    // Array of counters and sequence names.
    String<unsigned> counters;
    StringSet<CharString> seqNames;
    NameStoreCache<StringSet<CharString> > cache(seqNames);

    // Read the file record by record.
    BedRecord<Bed3> record;

    try
    {
        while (!atEnd(bedIn))
        {
            readRecord(record, bedIn);
            unsigned rID = nameToId(cache, record.ref);

            // Resize counters if necessary and increment counter.
            assignValueById(counters, rID, getValueById(counters, rID) + 1);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    // Print result.
    std::cout << "RECORDS ON CONTIGS\n";
    for (unsigned i = 0; i < length(seqNames); ++i)
        if (counters[i] != 0u)
            std::cout << seqNames[i] << '\t' << counters[i] << '\n';

    return 0;
}

The output is

RECORDS ON CONTIGS
chr1	5

Creating a New File

Assignment 3

Generating BED From Scratch

Type

Application

Objective

Write a program that prints the following BED file. Create BedRecord<Bed6> objects and write them to a BedFileOut using writeRecord().

chr7	127471195	127472363	Pos1	0	+
chr7	127472362	127473530	Pos2	0	+
Solution
#include <seqan/bed_io.h>

#include <sstream>

using namespace seqan2;

int main()
{
    BedFileOut out(std::cout, Bed());

    BedRecord<Bed6> record;

    // Fill and write out the first record.
    record.ref = "chr7";
    record.beginPos = 127471195;
    record.endPos = 127472363;
    record.name = "Pos1";
    record.score = "0";
    record.strand = '+';
    writeRecord(out, record);

    // Fill and write out the second record.
    record.ref = "chr7";
    record.beginPos = 127472362;
    record.endPos = 127473530;
    record.name = "Pos2";
    record.score = "0";
    record.strand = '+';
    writeRecord(out, record);

    return 0;
}

Next Steps