Your DNA holds over 60 zettabytes of data

Your DNA holds over 60 zettabytes of data. That’s about 5,000 times the estimated information content of all human knowledge. There are four nucleobases in DNA, adenine [A], cytosine [C], thymine [T] and guanine [G], which require 2 bits each to store. Each haploid cell (sperm or egg) in your body carries 3,234.83 million base pairs. Your somatic cells have twice as many base pairs, with one set coming from your dad and the other coming from dear old mom. There are an estimated 37. [Read More]
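The per-cell arithmetic behind that excerpt can be sketched in a few lines of Python. The base-pair count and the 2-bits-per-base figure come from the text above. The particular bit pattern assigned to each base (`BASE_CODE`) and the helper names are my own assumptions for illustration. The body-wide total needs a cell count, which the excerpt cuts off, so it is left as a caller-supplied argument rather than hard-coded.

```python
BITS_PER_BASE = 2                    # four bases -> log2(4) = 2 bits each
HAPLOID_BASE_PAIRS = 3_234_830_000   # 3,234.83 million base pairs (from the text)

# Hypothetical 2-bit code per base; any one-to-one assignment would work.
BASE_CODE = {"A": 0b00, "C": 0b01, "T": 0b10, "G": 0b11}


def pack(sequence: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in sequence:
        value = (value << BITS_PER_BASE) | BASE_CODE[base]
    return value


def genome_bytes(base_pairs: int) -> float:
    """Bytes needed to store the given number of base pairs."""
    return base_pairs * BITS_PER_BASE / 8


haploid_bytes = genome_bytes(HAPLOID_BASE_PAIRS)
somatic_bytes = 2 * haploid_bytes    # one set from each parent


def body_bytes(cell_count: float) -> float:
    """Total bytes across cell_count somatic cells (count supplied by the caller)."""
    return somatic_bytes * cell_count


print(f"haploid genome: {haploid_bytes / 1e6:.1f} MB")
print(f"somatic genome: {somatic_bytes / 1e9:.2f} GB")
```

Under these assumptions a haploid genome comes to roughly 809 MB and a somatic cell to roughly 1.6 GB, before multiplying by the number of cells.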

Introduction to Base Quality Score Recalibration (BQSR)

Thanks to Chris Hartl for writing the initial implementation of BQSR for ADAM and for taking the time to share his knowledge of BQSR with me over cappuccino at People’s Cafe. Hopefully this post will help others who are trying to understand how BQSR works. Drop a comment if you have any questions. DNA sequencing machines provide an estimate of the quality of each base (e.g. A, C, T or G) that they read. [Read More]
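As a small illustration of what a per-base quality estimate looks like, here is a sketch that assumes the scores are Phred-scaled (Q = -10 log10 of the error probability) and stored as ASCII characters with an offset of 33, as in common FASTQ files. The excerpt does not state the encoding, so both are assumptions, and the function names are my own. This is not the BQSR algorithm itself, only the score convention it operates on.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string, assuming the given offset (33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


# Each character maps to one base's score; higher means more confident.
scores = decode_qualities("I5!")
for q in scores:
    print(q, phred_to_error_prob(q))
```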

Playing with matches and CIGARs

Aligned reads in a SAM or BAM file typically have a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string that expresses how the read is mapped to the reference genome.

Table of CIGAR Operators

When I first read the CIGAR operator table (above), I was confused by two things:

  1. the match, M, operator description, “alignment match (can be a sequence match or mismatch)”, struck me as odd.
  2. the relationship between the M, = and X operators isn’t explained in the spec.

I hope this blog post helps others with the same questions.
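To make the two questions above concrete, here is a minimal Python sketch that parses a CIGAR string and then rewrites each M run as explicit = (sequence match) and X (sequence mismatch) runs by comparing the read against the reference. The function names are hypothetical, `ref` is assumed to be the reference segment starting at the read's alignment position, and the comparison is a plain case-sensitive character check. The idea it illustrates is that M only says "aligned here" and the =/X split is recovered from the bases.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")   # operators that advance along the read
CONSUMES_REF = set("MDN=X")     # operators that advance along the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operator) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def split_matches(ops: List[Tuple[int, str]], read: str, ref: str) -> List[Tuple[int, str]]:
    """Replace each M run with = and X runs based on read/reference identity."""
    out: List[Tuple[int, str]] = []
    q = r = 0  # current offsets into the read and the reference segment
    for length, op in ops:
        if op == "M":
            for i in range(length):
                sym = "=" if read[q + i] == ref[r + i] else "X"
                if out and out[-1][1] == sym:
                    out[-1] = (out[-1][0] + 1, sym)  # extend the current run
                else:
                    out.append((1, sym))
        else:
            out.append((length, op))
        if op in CONSUMES_QUERY:
            q += length
        if op in CONSUMES_REF:
            r += length
    return out


print(split_matches(parse_cigar("6M"), "ACGTTA", "ACCTTA"))
# [(2, '='), (1, 'X'), (3, '=')]
```

In this toy example the single 6M covers one mismatch at the third base, so it expands to 2=1X3=. Both forms describe the same alignment.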

[Read More]