Theoretical foundation for probabilistic evaluation
In this section, we formalize the probabilistic formulation of assembly quality and the model of the sequencing process that allows us to compute the likelihood of any particular assembly of a set of reads. We will show that the proposed probabilistic score is correct in the sense that the score is maximized by the true genome sequence.
Likelihood of an assembly
Let A denote the event that a given assembly is the true genome sequence, and let R denote the event of observing a given set of reads. In the following, we will use the same symbol to denote the assembly sequence and the event of observing the assembly. We will also use the same symbol to denote the set of reads and the event of observing the set of reads.
According to Bayes’ rule, given the observed set of reads, the probability of the assembly can be written as:
(1)
where Pr[ A] is the prior probability of observing the genome sequence A. Any prior knowledge about the genome being assembled (e.g., approximate length, presence of certain genes, etc.) can be included in Pr[ A]; however, for the purpose of this paper, we will assume that this prior probability is constant across the set of “reasonable” assemblies of a same set of reads. Given commonly available information about the genomes, formulating a precise mathematical framework for defining Pr[ A] is an extensive endeavor beyond the scope of this paper.
Similarly, Pr[ R] is the prior probability of observing the set of reads R. Since our primary goal is to compare multiple assemblies of a same set of reads, rather than to obtain a universally accurate measure of assembly quality, we can assume Pr[ R] is a constant as well. Thus, for the purpose of comparing assemblies, the values Pr[ A|R] and Pr[ R|A] are equivalent. The latter, the posterior probability of a set of reads, given a particular assembly of the data, can be easily computed on the basis of an appropriately defined model of the sequencing process and will be used in our paper as a proxy for assembly quality.
Under the assumption that individual reads are independent of each other (violations of this assumptions in the case of mate-pair experiments will be discussed later in this section), . If the set of reads is unordered, we need to account for the different permutations that generate the same set of reads. As this value is a constant for any given set of reads, we ignore it in the rest of our paper.
Pr[ R|A], hereafter referred to as p
r
, can be computed using an appropriate model for the sequencing process. Throughout the remainder of the paper, we will discuss increasingly complex models and their impact on the accuracy of the likelihood score.
True genome obtains the maximum likelihood
Any useful assembly quality metric must achieve its maximum value when evaluating the true genome sequence; otherwise, incorrect assemblies of the data would be preferred. We prove below that the likelihood measure proposed in our paper satisfies this property.
Assuming that we have a set of reads R from the true genome, produced by generating exactly one single-end read from each location in the genome without errors and with a fixed length. Given the set of reads R, the probability a particular read is generated from the true genome is precisely the number of times the read occurs in R divided by the size of R (note that multiple reads can have the same sequence, e.g., when generated from repeats). Let N
s
denote number of times that the sequence s occurs in R, and q
s
= N
s
/|R| denote the probability that sequence s is generated from the true genome. To show that the true genome maximizes the likelihood score, let us assume that we have some assembly A and p
s
is the probability that the sequence s is generated from the assembly A.
Given assembly A, our likelihood score is then the product of over all sequences s in S, which can be rewritten as . Now, note that since |R| is a fixed constant, maximizing the likelihood score is equivalent to maximizing
The likelihood can be re-written as
where D
KL
(Q||P) is the KL-divergence for the distributions Q and P, and H(Q) is the Shannon entropy of Q. Since the KL-divergence is always non-negative and only equal to 0 if and only if Q = P, the average probability is maximized if the assembly is equal to the true genome.
Even though the true genome does maximize the likelihood in this model, there may be other assemblies that achieve the same optimal score as long as these assemblies yield probabilities p
s
which are equal to the probabilities q
s
for every sequence s. This can happen, for example, in the case of a misassembly that is nonetheless consistent with the generated reads. This situation highlights the loss of information inherent in modern sequencing experiments – without additional long-range information, the information provided by the reads themselves is insufficient to distinguish between multiple possible reconstructions of a genome [6].
Error-free model for fragment sequencing
The most basic model for the sequencing process is the error-free model. In this model, we assume reads of a given fixed length (a more general read length distribution can be included in the model but would not impact comparative analyses of assemblies derived from a same set of reads). We further assume that reads are uniformly sampled across the genome, i.e., that every position of the genome is equally likely to be a starting point for a read. This simplifying assumption is made by virtually all other theoretical models of genome assembly, despite the biases inherent to all modern sequencing technologies. A more accurate, technology-dependent, model can be obtained by including additional factors that account, for example, for DNA composition biases. For the purpose of generality, we restrict our discussion to the uniform sampling model. Furthermore, for the sake of simplicity, we assume (1) that the true genome consists of a single circular contiguous sequence, (2) that our assembly is also a single contig, and (3) that every read can be mapped to the assembly. We will later discuss extensions of our model that relax these assumptions.
Under these assumptions, we can compute the probability of a read r given the assembled sequence as:
where n
r
represents the number of places where the read occurs in the assembled sequence of length L. The factor 2 is due to the fact that reads are sampled with equal likelihood from both the forward and reverse strands of a DNA molecule. This formulation was previously used by Medvedev et al.[26] to define an objective function for genome assembly.
A realistic model of the sequencing process
The error-free model outlined above makes many simplifying assumptions that are not representative of real datasets. Here we demonstrate how the model can be extended to account for artifacts such as sequencing errors, mate-pair information, assemblies consisting of multiple contigs, and the presence of un-mappable reads.
Sequencing errors
All current technologies for sequencing DNA have a small but significant probability of error. Here we focus on three common types of errors: the insertion, deletion, and substitution of a nucleotide.
In the error-free model, the probability of a read having been generated from a position j in the sequence is one if the read exactly matches the reference at that position and zero otherwise. We now extend this model such that the probability of each read having been generated from any position j of the reference is a real value between zero and one, representing the likelihood that a sequencing instrument would have generated that specific read from that specific position of the reference. This value clearly depends on the number of differences between the sequence of the read and the sequence of the reference at position j. Given the assembled sequence, the probability of a particular read will be the cumulative probability of the read across all possible locations in the genome.
Specifically, let us denote the probability that read r is observed by sequencing the reference, ending at position j by pr,j. Then, the total probability of the read r is
(3)
The individual probabilities pr,j can be computed if we do not model insertion and deletion errors and only allow substitution errors which occur with probability ε. The per-base probability of a substitution error can be set individually for each based on the quality value produced by the sequencing instrument. Then, pr,j = εs(1 - ε)l-s, where s is the number of substitutions needed to match read r to position j of the reference sequence. In the more general case, pr,j values can be computed using dynamic programming.
Exact probability calculation via dynamic programming
For a model of the sequencing process that allows insertions, deletions, and substitutions with specific probabilities, we can exactly compute probability, p
r
= Pr[ r|A], of observing a read r given an assembly A using a dynamic programming algorithm. In general, we want to find the sum of the probabilities of all possible alignments of a read to a position of the assembly.
The number of such possible alignments grows exponentially as a function of read length. Most of those alignments have a very small probability. However, several alignments may have probabilities that are equal or close to the optimal. For example, when aligning the sequence ACG to the assembly ACCG, both A-CG and AC-G are optimal alignments and have the same probability. The contribution of all such alignments must be taken into account when computing the probability for a read.
We use a dynamic programming algorithm (similar to the “forward” algorithm in Hidden Markov Models) to efficiently calculate the sum of the probabilities of all alignments of a read to the assembly as follows. In the formula (3), and are the sum of the probabilities of all possible alignments of the read r to, respectively, the reference and its reverse complement, ending at position j.
We define T[ x,y] as the probability of observing prefix [ 1…y] of the read r, if y bases are sequenced from the reference, ending at position x. Therefore, pr,j = T[ j,l]. T[ x,0] represents the probability of observing an empty sequence if we sequence zero bases and is set to 1. T[ 0,y] represents the probability of observing prefix [ 1…y] of the read if y bases are sequenced from the reference, ending at position 0 (before the beginning), and is set to 0.
For x ≥ 1 and y ≥ 1, T[ x,y] is recursively defined:
(4)
where r[y] and A[ x] represent the nucleotides at positions y and x of the read r and the assembly A, respectively. Pr[ Substitute(A[ x],r[y])] is the probability of observing the nucleotide r[y] by sequencing the nucleotide A[ x]. In our experiments, we did not distinguish between different types of errors and considered their probability to be ε and the probability of observing the correct nucleotide to be 1-ε.
The dynamic programming algorithm outlined above has a running time of O(l L) per read. Even though the running time is polynomial, it is slow in practice. However, we can speed it up by using alignment seeds. The seeds would give us the regions of the assembly where a read may align with high probability. We can apply the dynamic programming only to those regions and get a very good approximate value of the total probability. We use exact seeds (k-mers) of a given length to build a hash index of the assembly sequence. Then, each read is compared to the regions where it has a common k-mer with the assembly sequence.
Mate pairs
Many of the current sequencing technologies produce paired reads – reads generated from the opposite ends of the same DNA fragment. This information is extremely valuable in resolving genomic repeats and in ordering the contigs along long-range scaffolds; however, the paired reads violate the assumption that reads are sampled independently, that we made in the discussion above. To address this issue, we can use the pairs rather than the individual reads as the underlying objects from which the assembly likelihood is computed. To address the possibility that assembly errors may result in violations of the constraints imposed by the paired reads, we only consider pairs for which both ends align to a same contig or scaffold within the constraints imposed by the parameters of the sequencing experiment. Any pairs that violate these constraints get classified as unassembled. Note that in addition to sequencing errors, we now also handle fragment sizing errors – deviations of the estimated distance between paired reads from the distance implied by the sequencing experiment. We model the distribution of fragment sizes within a same library by a normal distribution, using user-supplied parameters, and use this information to appropriately scale the likelihood estimate for each possible placement of a mate pair along the genome.
We modify the dynamic programming recurrence from formula (4) to handle the probability calculation for the paired reads as follows. The probability of the first read in the pair is calculated as the same as in the formula (4). For the second read, we adjust the dynamic programming to ensure that it is aligned within a certain distance downstream of the alignment of the first read. We modify the first column of the dynamic programming table of the second read in the pair to take into account the distance from the first read.
Formally, given a paired read, we define T2[ x,y] as the probability of observing prefix [ 1…y] of the 2nd read in the pair, if y bases are sequenced from the reference, ending at position x.
Assume that the second read occurs after the first read and is separated by a normally-distributed distance with mean μ and with a standard deviation σ.
Therefore,
(5)
where Pr[ insert(n)|N(μ,σ)))] is the probability of observing an insert size of length n from a normal distribution with parameters μ and σ, and l is the length of the first read in the pair.
Instead of using two tables, we can concatenate the read pair together with a special character (M), which will signal when the insert size should be taken into account.
For x ≥ 1 and y ≥ 1, T[ x,y] is recursively defined as follows:
(6)
Assemblies containing more than one contig
As we mentioned in the introduction, the output of an assembler usually consists of a (large) set of contigs rather than one single contig, representing the genome being assembled. In the extreme case, an “assembler” may return the set of unassembled input reads (or the set of all k-mers in De Bruijn-based assemblers) as its output. Our likelihood score must be modified to account for such fragmented assemblies.
In practice, most assemblers join contigs only if they overlap by more than a certain number of bases; however, we only consider the case where contigs are non-overlapping substrings of the true genome. In this case, the length of the original genome must be at least the sum of the lengths of the contigs, that is, , where L
j
is the length of the j th contig. Therefore, the probability of every read is at most:
Overlapping contigs can be handled by reducing the length of the contigs by a value representing the minimum overlap required by the assembler, as performed, for example, in Genovo [19].
Reads that do not align well
In practice, popular assemblers do not incorporate every read in the assembly. Possible reasons include assembly errors (such as collapsed tandem repeats), reads with high error rates, or contamination in the DNA sample. These “singleton” or “chaff” reads cannot be modeled by our likelihood approach as the likelihood of any assembly that does not incorporate every read is zero. When sequencing errors are modeled, every read obtains a non-zero likelihood, even if it does not align to the assembly. Since, in general, a non-trivial fraction of the total set of the reads cannot be mapped to the assembly, by their sheer number, the singleton reads would dominate the probability calculation.
To account for this factor, we argue that for any read that does not align well, the overall probability of the assembly should not be lower than the probability of the same assembly when the missing read is appended to its sequence as a separate contig. The effect of such an addition on the overall probability can be calculated as follows. First, the probability of observing this read exactly, , is multiplied to the product of the probabilities of all mapped reads. Second, the probabilities of the mapped reads are decreased slightly due to the increase in the length of the assembled sequence.
For simplicity, let us assume an error-free model where each read maps to exactly one position on the assembled sequence. Let k denote the number of the original reads. The ratio between the new probability for all original reads divided by their probability before adding the new read is:
Therefore, if the probability of observing a read is less than
(8)
we consider this read as “unmapped” and use formula (8) as its probability. The probability of an exact match Pr[ exact match] is approximated by (1-ε)l, where ε is the probability of an error (a mismatch, an insertion, or a deletion).
Performance considerations
Estimating the average read likelihood by sampling
Depending on the specific characteristics of the chosen sequencing model, the computation of the probability Pr[ R|A] can be expensive for the dataset sizes commonly encountered in current projects (tens to hundreds of millions of reads). In such cases, we can approximate the likelihood of an assembly by using a random subset of the reads R′ ⊆ R. To counteract the effect of the size of the sample on the computed probability, we define the assembly quality as the geometric mean of the individual read probabilities:
(9)
The logarithm of this value (Log Average Probability (LAP)) is reported in the remainder of the paper as the assembly quality “score”:
(10)
In other words, we define the assembly quality as the average log likelihood of the reads given an assembly. This formulation also allows us to estimate the accuracy of the approximate likelihood value produced by sub-sampling the set of reads. According to sampling theory, the distribution of the scores across multiple samples has the mean equal to the true likelihood of the assembly (computed from all the reads) and a standard error proportional to , i.e., the larger the sample is, the more accurate our estimation is for the likelihood of the true assembly. Since the probability of a read is bounded by formula (8), the variance of the sample can also be bounded by this value.
In practice, we increase the sample size until the assemblies can be unambiguously distinguished by the LAP value. Specifically, we increase the sample size, by binary search, until the LAP values are separated by at least a single standard deviation. The level of subsampling required will, thus, be dependent on the extent of the differences between the assemblies — for very different assemblies, low levels of subsampling are sufficient.
Approximating the likelihood value using an aligner
Alternatively, when it is impractical to calculate exact probabilities for large sets of reads, we can approximate these probabilities using fast and memory-efficient alignment search programs, which internally model the sequencing process. We use Bowtie 2 [29] to align the reads to the assembly. However, our programs are easy to adapt for any read alignment tool that stores the alignment results in SAM [30] format.
For each reported alignment, we use the number of substitutions s to compute the probability p
r
. The probability of this alignment, pr,j, can be approximated by εs(1 - ε)l-s and
(11)
where S
r
is the set of alignments in the SAM file for the read r.
We can further extend this equation to mated reads. A pair of mated reads aligns if the distance and orientation of the alignment of the pair are consistent with the experimental design parameters. Given read i1 and its mate i2, we compute by multiplying the probabilities of individually aligning each mate at their respective positions with the probability that they are separated by their distance from each other. That is,
(12)
where . Mate pair insert sizes follow a normal distribution with mean and standard deviation being estimated from the parameters of the sequencing process. Unless otherwise stated, the standard deviation is 10% of the insert size. If only one of the mates, i1 or i2, maps, the probability is 0. We use (8) to set the probability for this case.
In our experiments, Bowtie 2 was used to approximate the read probabilities for the larger datasets; however, it could be substituted with any other aligner.
Data sets
The read data for Rhodobacter sphaeroides 2.4.1 was downloaded from http://gage.cbcb.umd.edu/data/Rhodobacter_sphaeroides, and the corresponding reference sequence was obtained from the NCBI RefSeq database (NC_007493.1, NC_007494.1, NC_009007.1, NC_007488.1, NC_007489.1, NC_007490.1, NC_009008.1). In addition, two more Rhodobacter genomes were selected as reference genomes, specifically R. sphaeroides ATCC 17025 (NCBI IDs NC_009428.1, NC_009429.1, NC_009430.1, NC_009431.1, NC_009432.1), and R. capsulatus SB1003 (NC_014034.1, NC_014035.1).
The read data for Stapylococcus aureus USA300 was downloaded from http://http://gage.cbcb.umd.edu/data/{Staphylococcus}_aureus, and the corresponding reference sequence was obtained from the NCBI RefSeq database (NC_010063.1, NC_010079.1, NC_012417.1). In addition, two more Stapylococcus genomes were selected as reference genomes, specifically S. aureus 04-02981 (CP001844), and S. epidermidis ATCC 12228 (AE015929, AE015930, AE015931, AE015932, AE015933, AE015934, AE015935).
The read data for human chromosome 14 was downloaded from http://gage.cbcb.umd.edu/data/{Hg}_chr14/, and the corresponding reference sequence was obtained from the NCBI RefSeq database (NC_000014.8).
The Assemblathon 1 competition evaluates assemblies on the simulated short read dataset generated from the simulated 110 Mbp diploid genome. The competition provides sequence libraries with varying insert sizes (200-10,000 bp) and coverage (20-40x). Assemblathon 1 allowed teams to submit multiple entries, but for our analyses, we only examine the top ranking assemblies from each team. The raw reads and the consensus sequence of the top ranking assemblies were downloaded from http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon1/.
Also used in our analyses is the E. coli K12 MG1655 dataset, generated using Illumina MiSeq technology (300 bp insert size, 370 × coverage) (http://www.illumina.com/systems/miseq/{scientific}_data.ilmn).
For detailed software usage, please see Additional file 1.