Thanks to Chris Hartl for writing the initial implementation of BQSR for ADAM and for taking the time to share his knowledge of BQSR with me over cappuccino at People’s Cafe. Hopefully this post will help others who are trying to understand how BQSR works. Drop a comment if you have any questions.
DNA sequencing machines provide an estimate of the quality of each base (e.g. A, C, T or G) that they read. This estimated quality score is called a Phred score and represents an error probability on a logarithmic scale: a score of Q corresponds to an error probability of 10^(-Q/10). A Phred score of 10 represents 90% accuracy, a Phred score of 20 represents 99%, a Phred score of 30 represents 99.9%, etc. Of course, it’s easy to go from probabilities back to Phred scores: Q = -10 × log10(error probability). Phred scores are also called quality scores or Q scores.
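As a quick illustration of the conversion described above, here is a minimal Python sketch. The function names are made up for this example and do not come from ADAM or any other tool.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) back to a Phred score."""
    return -10 * math.log10(p)

# Print the accuracy implied by a few common Phred scores.
for q in (10, 20, 30):
    accuracy = 1 - phred_to_error_prob(q)
    print(f"Q{q}: accuracy {accuracy:.1%}")
```

Note that an error probability of exactly 0 has no finite Phred score, so the second function only makes sense for probabilities greater than zero.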
Base Quality Score Recalibration (BQSR) is a method of adjusting your Phred quality scores to be more accurate. Rather than looking at a single base in isolation, it looks at every base in your file.
To run BQSR, you need an input file that contains all known Single Nucleotide Polymorphisms (SNPs). A SNP is a genetic variation that occurs at a single base position, e.g. the reference genome has an ‘A’ and your sample has a ‘T’ at the same position. The National Institutes of Health (NIH) provides a database of SNPs called dbSNP. Why is this input file necessary? The BQSR algorithm assumes that any base in your file that differs from the reference at a position not listed in dbSNP is a sequencing error rather than a real variant. This is a statistically sound assumption given that there are over 12 million SNPs in dbSNP (as of build 128) and, on average, a single individual will have no more than about 1,000 SNPs that are not in dbSNP.
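To make the assumption concrete, here is a simplified Python sketch of how mismatches might be counted as errors for a single read. It is not how ADAM or any other tool implements BQSR. The function name, the toy reference and read, and the choice to skip known SNP positions entirely are all assumptions made for illustration.

```python
def count_errors(reference, read, read_start, known_snp_positions):
    """Count mismatches against the reference, ignoring known SNP sites.

    reference: reference sequence as a string (hypothetical toy data)
    read: read bases, assumed to align without gaps starting at read_start
    known_snp_positions: set of reference positions listed in a dbSNP-like file
    Returns (errors, observations).
    """
    errors = 0
    observations = 0
    for offset, base in enumerate(read):
        pos = read_start + offset
        if pos in known_snp_positions:
            # Assumption: a mismatch here may be real variation, so skip the site.
            continue
        observations += 1
        if base != reference[pos]:
            # Not a known SNP, so the mismatch is treated as a sequencing error.
            errors += 1
    return errors, observations

reference = "ACGTACGTAC"
read = "ACTTACCT"      # mismatches the reference at positions 2 and 6
known_snps = {2}       # position 2 is a known SNP, position 6 is not

errors, observations = count_errors(reference, read, 0, known_snps)
print(errors, observations)
```

In this toy example the mismatch at position 2 is ignored because it is a known SNP, while the mismatch at position 6 counts as an error.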