Playing with matches and CIGARs

Aligned reads in a SAM or BAM file typically have a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string that expresses how the read is mapped to the reference genome.

Table of Cigar Operators

When I first read the CIGAR operator table (above), I was confused by two things:

  1. the match, M, operator description, “alignment match (can be a sequence match or mismatch)“, struck me as odd.
  2. the relationship between the M, = and X operators isn’t explained in the spec.

I hope this blog post helps others with the same questions.

Given the following reference and read,

Reference: ...CTTCTATTATCCTT...
     Read:    CTTCTATTATCCTT

the CIGAR string would be 14M meaning that the aligner was able to match the read to a region on the reference 14 bases long. Simple enough.

Let’s use the same reference but alter a single base in the read at position 3, marked with the caret, e.g.

Reference: ...CTTCTATTATCCTT...
     Read:    CTTATATTATCCTT
                 ^

In this example, the CIGAR string would be exactly the same as the first example: 14M.

Why?

Both examples are alignment matches so they have the same CIGAR string. The first example is a sequence match while the second example is a sequence mismatch. Bare in mind that aligners try to match reads to a segment of the reference with the minimum edit distance. An alignment match is not the same as a sequence match. Since the reference genome and the read come from different individuals, there will be nature idiosyncracies.

These examples show the limitation of the match, M, operator. {“The match operator doesn’t allow you to calculate the edit distance or recreate the portion of the reference the read covers.“} In order to address these limitations, the = and X operators were later added to the CIGAR syntax (although they are rarely used). Using these new operators, the second example would have a CIGAR string of 3=1X10= meaning “three matching bases followed by one mismatched base and 10 matching bases”. These new operators allows you to easily calculate the edit distance (1 in this example for the mismatch) but you still can’t recreate the reference.

The MD tag, listed in the optional fields section of the SAM/BAM specification, is a superior to the =/X operator workaround. The MD tag can be used to calculate both the edit distance and recreate the reference the read covers.

The MD tag for the second example would be 3C10 meaning 3 matching bases followed by a single mismatch, denoted C the reference base at the mismatched position, followed by 10 matching bases.

To calculate the edit distance from an MD tag, just count the number of missing and mismatched bases. To recreate the reference, just merge the read sequence with the reference bases.

Here are some more example reads, with old and new CIGARS and MD tags…

Reference: ...CTTCTATTATCCTT...  M       =/X         MD
     Read:    CTTCTATTATCCTT     14M     14=         14       // example 1
     Read:    CTTATATTATCCTT     14M     3=1X10=     3C10     // example 2
     Read:    CTTATATTGGCCTT     14M     3=1X4=2X4=  3C4AT4
     Read:    CTTCTATTGGCCTT     14M     8=2X4=      8AT4
     Read:    TTTCTATTATCCTG     14M     1X12=1X     0C12T0

Not only does the MD tag provide information about the reference, it’s actually more compact than the ‘=’/‘X’ CIGAR syntax which probably explains why = and X are rarely used.

When I find more time, I’ll write a post on hard and soft clipping and the bitwise flags in BAM files.

Feel free to contact me if you have any question about parsing CIGAR or MD tags. I hope you found this post useful.

comments powered by Disqus