Introduction to Base Quality Score Recalibration (BQSR)

Thanks to Chris Hartl for writing the initial implementation of BQSR for ADAM and for taking the time to share his knowledge of BQSR with me over cappuccino at People’s Cafe. Hopefully this post will help others who are trying to understand how BQSR works. Drop a comment if you have any questions.

DNA sequencing machines provide an estimate of the quality of each base (e.g. A, C, T or G) that they read. This estimated quality score is called a Phred score and represents an error probability. A Phred score of 10 represents 90% accuracy, a Phred score of 20 equals 99%, a Phred score of 30 equals 99.9%, etc. Of course, it’s easy to go from probabilities back to Phred scores. Phred scores are also called quality scores or Q scores.

Base Quality Score Recalibration (BQSR) is a method of adjusting your Phred quality scores to be more accurate by looking at more than just a single base in isolation but, rather, every base in your file.

To run BQSR, you need a input file that contains all known Single Nucleic Polymorphisms (SNPs). A SNP is a genetic variation that occurs at a single base position, e.g. the reference genome has an ‘A’ and your sample has a ’T’ at the same position. The National Institutes of Health (NIH) provides a database of single SNPs called dbSNP. Why is this input file necessary? The BQSR algorithm assumes that any SNP in your file that is not in dbSNP is an error. This is a statistically sound assumption given that there are over 12 million SNPs in dbSNP (as of build 128) and, on average, a single individual will have no more than about 1,000 SNPs that are not in dbSNP.

Finding errors

Here’s an example to help understand the logic of BQSR at the individual base level. Let’s assume we have a read that is 10 bases long and starts at position 0 on a chromosome, e.g.

  base position:    0   1   2   3   4   5   6   7   8   9
reference bases:    A   T   C   A   G   G   A   A   T   C  ...
read bases:    A   G   C   G   G   T   A   C   T   C
phred score:   10  11  11  20  22  22  30  20  20  10
In dbSNP?         y   y           y
Mark as error?                 y               y


We mark the bases at positions 3 and 7 as errors because we observed SNPs that are not in the dbSNP database. The SNPs at position 1 and 5 are not errors because they are in dbSNP.

Simply as an exercise, let’s apply BQSR to this single read.

Aggregating the reported phred score.

For this example, we’re going to use the average phred score across the read for the reported phred score. Note, this sort of aggregation is never done in practice. A more realistic example follows but, for now, this will help you understand aggregating reported phred scores. Since the phred scores are logarithmic, we convert them to probabilities to average them, e.g. (0.1 + 0.079 + 0.079 + 0.01 + 0.006 + 0.006 + 0.001 + 0.01 + 0.01 + 0.1)/10 = 0.0401. We then convert this probability back to our log scale phred score, e.g. -10 * log10(0.0401) ~= 14. So, the average reported phred quality score across this entire read was 14.

Calculating the empirical phred score.

We have 10 observations and 2 errors (at position 3 and 7) so the error probability is 15 = 0.2 which is equal to a phred score of -10 log10(.2) ~= 7.

So for this single read, the sequencer estimated an average quality score of 14 but the real empirical quality score is 7. To recalibrate the score, we just shift all the estimated scores down by 7, e.g.

             base position:    0   1   2   3   4   5   6   7   8   9
observed phred score:   10  11  11  20  22  22  30  20  20  10
recalibrated phred score:    3   4   4  13  15  15  23  13  13   3


Running BQSR on a single read is useless and inaccurate but running BQSR across millions of reads in a file allows for very accurate recalibrations. Let’s look at more realistic example.

Not all bases are created equal

A base does not occur in a vacuum. It is part of a short strand of DNA (called a “read”). The read is part of a “read group” which contains all the reads for a single run of a sequencing machine. Since different sequencing machines may be calibrated differently, we group each read base into separate read group bins. In addition, we split up the data in a read group by the quality scores since the goal of BQSR is to compare the scores reported by the sequencing machine to the empirical scores we derive from the empirical error counts.

For this example, we have three read groups: RG1, RG2 and RG3. Each read group has two unique base quality scores except for RG1 which has three: Q16, Q18 and Q21. Each node in the graph has the number of errors and observations as a fraction as well as the calculated empirical score. The calculated score applied Yate’s correction, e.g -10 * log10( (errors + 1) / (observations + 2) ) which penalizes bins with a lower number of observations (see RG2 and its quality bins).

BQSR requires two passes over your data. During the first pass, a nested structure (above) is created to hold aggregate errors and observations for each read group/quality score combination. Note that you don’t need to calculate the read group or root aggregations since they can be easily derived from the quality score children. Once this structure is finalized, it’s used during the second pass of the data to recalibrate every base. For example, suppose that we had a read from read group RG1, we recalibrate any base with a Q score of 16 to be 17, any base with Q18 stays the same and any base with Q21 is recalibrated to be Q26.

Covariates

Just getting statistics about read group and quality scores is enough to improve your quality scores, but you improve accuracy by tracking covariates such as base position, base neighbors, machine cycle, etc.

Let’s look at an example of how to track the covariate: base position.

For this example, we are zoomed into read group RG1 and added a covariate table with a column for each read group/quality bucket (e.g. Q25, Q35, Q45) and rows for base position in a read (e.g. 0, 1, 2 …). This is an example of just once covariate table, you will have a separate table for each covariate you are tracking. Once you’ve completed the first pass over your data, these covariate tables can be used to improve your empirical quality calculation.

To calculate the empirical quality score, you use the following equation.

This takes the global quality score and shifts it by the read group, quality and covariates Δs.

For example, let’s say that we had a single base in read group RG1 with a reported quality score of Q24 at position 0 in the read. The GlobalQ is 38, Δall is 39-38=1, Δreadgroup is 40-41=-1, Δquality is 24-25=-1, and Δcovariate is 17-25=-8. Together, the recalibrated quality value is then 38 + 1 - 1 - 1 - 8 = 27.

As a side note, in the single covariate case, you could just use the empirical qualities in each leaf instead of using the general formula of BQSR above.