Your DNA holds over 60 zettabytes of data. That’s about 5,000 times the estimated information content of all human knowledge.

  • There are four nucleobases in DNA, adenine [A], cytosine [C], thymine [T] and guanine [G], which require 2 bits each to store
  • Each haploid cell (sperm or egg) in your body contains 3,234.83 million base pairs
  • Your somatic cells have twice as many base pairs with one set coming from your dad and the other coming from dear old mom
  • There are an estimated 37.2 trillion cells in your body
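
As a rough check of the arithmetic, here is a minimal Python sketch that multiplies the figures from the list above. The constant and function names are my own, and the calculation makes the simplifying assumption that every cell carries a full diploid genome.

```python
# Figures taken from the bullet list above.
BITS_PER_BASE = 2                 # four nucleobases -> 2 bits each
HAPLOID_BASE_PAIRS = 3_234.83e6   # base pairs in one haploid set
CELLS_IN_BODY = 37.2e12           # estimated number of cells
ZETTABYTE = 1e21                  # bytes (decimal definition)


def dna_bytes_per_cell():
    """Bytes needed to store the diploid genome of one somatic cell."""
    diploid_base_pairs = 2 * HAPLOID_BASE_PAIRS
    return diploid_base_pairs * BITS_PER_BASE / 8


def dna_bytes_in_body():
    """Bytes for every cell, assuming each holds a full diploid genome."""
    return dna_bytes_per_cell() * CELLS_IN_BODY


if __name__ == "__main__":
    print(f"{dna_bytes_per_cell() / 1e9:.2f} GB per cell")
    print(f"{dna_bytes_in_body() / ZETTABYTE:.1f} ZB in total")
```

Each cell works out to roughly 1.6 GB, and the total lands just above 60 zettabytes.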

Thanks to Chris Hartl for writing the initial implementation of BQSR for ADAM and for taking the time to share his knowledge of BQSR with me over cappuccino at People’s Cafe. Hopefully this post will help others who are trying to understand how BQSR works. Drop a comment if you have any questions.

DNA sequencing machines provide an estimate of the quality of each base (e.g. A, C, T or G) that they read. This estimated quality score is called a Phred score and represents an error probability. A Phred score of 10 represents 90% accuracy, a Phred score of 20 equals 99%, a Phred score of 30 equals 99.9%, etc. Of course, it’s easy to go from probabilities back to Phred scores. Phred scores are also called quality scores or Q scores.
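
The conversion in both directions follows the standard Phred definition, Q = -10 · log10(P), where P is the error probability. A minimal Python sketch (the function names are mine, chosen for illustration):

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score implied by an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        accuracy = 1 - phred_to_error_prob(q)
        print(f"Q{q}: accuracy {accuracy:.1%}")
```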

Base Quality Score Recalibration (BQSR) is a method of adjusting your Phred quality scores to be more accurate by looking at every base in your file rather than at a single base in isolation.

To run BQSR, you need an input file that contains all known Single Nucleotide Polymorphisms (SNPs). A SNP is a genetic variation that occurs at a single base position, e.g. the reference genome has an ‘A’ and your sample has a ‘T’ at the same position. The National Institutes of Health (NIH) provides a database of SNPs called dbSNP. Why is this input file necessary? The BQSR algorithm assumes that any SNP in your file that is not in dbSNP is an error. This is a statistically sound assumption given that there are over 12 million SNPs in dbSNP (as of build 128) and, on average, a single individual will have no more than about 1,000 SNPs that are not in dbSNP.
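
To make the assumption concrete, here is a deliberately simplified Python sketch of the counting step. It is not the ADAM or GATK implementation: the function name, the `(position, reported_quality, is_mismatch)` tuple layout and the add-one smoothing are all assumptions of mine, and real implementations also group bases by further covariates such as read group and machine cycle. Bases at known SNP positions are skipped, and every other mismatch is counted as a sequencing error.

```python
import math
from collections import defaultdict


def empirical_qualities(observations, known_snp_positions):
    """Map each reported Phred score to an empirical Phred score.

    observations: iterable of (position, reported_quality, is_mismatch)
    known_snp_positions: set of reference positions listed in dbSNP
    (Both structures are hypothetical, for illustration only.)
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for position, reported_quality, is_mismatch in observations:
        if position in known_snp_positions:
            continue  # a known SNP is real variation, not an error
        totals[reported_quality] += 1
        if is_mismatch:
            mismatches[reported_quality] += 1

    recalibrated = {}
    for quality, total in totals.items():
        # Add-one smoothing (my assumption) avoids log10(0).
        error_rate = (mismatches[quality] + 1) / (total + 2)
        recalibrated[quality] = -10 * math.log10(error_rate)
    return recalibrated


if __name__ == "__main__":
    # 998 bases reported as Q30, 9 of them mismatches, plus one
    # mismatch at a known SNP position that must be ignored.
    obs = [(pos, 30, pos < 9) for pos in range(998)] + [(5000, 30, True)]
    print(empirical_qualities(obs, {5000}))
```

In this toy input the machine reported Q30, but the observed mismatch rate is about 1 in 100, so the recalibrated score comes out near Q20.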

Note: A cleaner, more efficient way to handle Avro objects in Spark can be seen in this gist

I love open-source projects that play nicely with others; no one likes to be locked into a single data processing framework or programming language. Mature open-source projects build software with integration and openness in mind to allow engineers to attack Big Data problems from a number of different angles using the most appropriate tool for the job. This post explains how to combine Spark, Parquet and Avro to create a fast, flexible and scalable data analysis system.

Aligned reads in a SAM or BAM file typically have a Compact Idiosyncratic Gapped Alignment Report (CIGAR) string that expresses how the read is mapped to the reference genome.

When I first read the CIGAR operator table (above), I was confused by two things:

  1. the match, M, operator description, “alignment match (can be a sequence match or mismatch)”, struck me as odd.
  2. the relationship between the M, = and X operators isn’t explained in the spec.

I hope this blog post helps others with the same questions.
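
For readers who want to poke at CIGAR strings directly, here is a minimal Python sketch that splits one into (length, operator) pairs and adds up how many read bases and reference bases it covers. The function names and the example string are mine. The operator sets follow the SAM specification, in which M, = and X all consume both the read and the reference; an M run simply does not say whether each base matched (=) or mismatched (X).

```python
import re

CIGAR_STRING = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that advance along the read and along the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operator) pairs."""
    if not CIGAR_STRING.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def query_length(cigar):
    """Number of read bases the CIGAR string accounts for."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar):
    """Number of reference bases the alignment spans."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


if __name__ == "__main__":
    example = "3S6M1I4=1X2D5M"  # made-up example
    print(parse_cigar(example))
    print(query_length(example), reference_length(example))
```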

I ran my first 50K today: the Chabot 50K Trail Run. The volunteers at Inside Trail Racing impressed me with their professionalism, friendliness and genuine concern for my well-being. Inside Trail Racing put on one of the best trail runs I’ve been a part of. As an example, a volunteer at the Two Rocks aid station (~mile 23) gave me one of her personal water bottles when she saw I wasn’t carrying one (since I forgot it at home).

The weather couldn’t have been better either. At race start (8:30am PST), the temperature was 45F and was soon in the mid-50s, sunny with mild, cool breezes. Perfect.

The hard drive in my iMac (Late 2009 27”) died last weekend and I decided to replace it myself. Here are some quick tips if you find yourself in the same situation. Not sure if disk errors are your problem? Boot your Mac and press “Command-V” during startup for verbose boot output. You’ll see messages about “Disk I/O Error” during boot.

There’s a great tutorial on iFixit that explains step-by-step how to replace the drive. I found my iMac had a 3.5” Hitachi Model HDE721010SLA330 SATA 3.0 Gb/s drive once I cracked it open. You can replace the drive with any 3.5” SATA drive you like. I chose to replace it with a comparable Western Digital drive that had more cache.

Open-source has been a part of Berkeley culture since the 1970s, when Bill Joy assembled the original Berkeley Software Distribution (BSD). As a reader of this blog, you probably know first-hand the time and effort it takes to create quality open-source software.

Over the last year, the AMPLab has seen exciting growth in the number of users and contributors. In order to keep code quality high, I’ve been hired to build a team of full-time engineers. I need to fill two software engineering positions immediately. Both positions require strong Linux skills and familiarity with EC2 and git. One position requires experience with one or more of Scala, Java, C++, Hadoop, Hive and NoSQL databases, while the other position will focus on automation, where knowledge of scripting, Maven, Jenkins, and rpm/deb packaging is important.

The picture to the right is from an old race t-shirt that I found in my closet. Luckily, it had the date of the run on it so it made it easier to search the web for information about the race.

The race had a staggered start based on age. Older racers got to start first and younger racers had to try to catch them. The race started outside of Busch Stadium and ended inside the stadium at center field.