Comparison of next-generation sequencing samples using compression-based distances and its application to phylogenetic reconstruction

Background Enormous volumes of short read data from next-generation sequencing (NGS) technologies have posed new challenges to the area of genomic sequence comparison. The multiple sequence alignment approach is hardly applicable to NGS data due to the challenging problem of short read assembly. Thus alignment-free methods are needed for the comparison of NGS samples of short reads. Results Recently several k-mer based distance measures such as CVTree, d2S, and co-phylog have been proposed or enhanced to address this problem. However, how to choose an optimal k value for those distance measures is not trivial since it may depend on different aspects of the sequence data. In this paper, we considered an alternative parameter-free approach: compression-based distance measures. These measures have shown good performance for the comparison of long genomic sequences, but they have not yet been tested on NGS short reads. Hence, we performed extensive validation in this study and showed that the compression-based distances are highly consistent with those distances obtained from the k-mer based methods, from the multiple sequence alignment approach, and from existing benchmarks in the literature. Moreover, as the compression-based distance measures are parameter-free, no parameter optimization is required and these measures still perform consistently well on multiple types of sequence data, for different kinds of species and taxonomy levels. Conclusions The compression-based distance measures are assembly-free, alignment-free, parameter-free, and thus represent useful tools for the comparison of long genomic sequences as well as the comparison of NGS samples of short reads.


Introduction
Escherichia/Shigella genomes, 70 Gammaproteobacteria genomes, and 39 mammalian gut metagenomic 87 samples. The data sets include various types of genomic sequences, in silico and real NGS short reads, dif-88 ferent species and taxonomy levels. The validation results show that the compression-based distances are 89 highly consistent with those distances obtained from the k-mer based methods, from the MSA approach, 90 and from existing benchmarks in the literature. The results also show that the k-mer based distance 91 measures depend remarkably on the choice of k, and the optimal k varies across different data sets. The

100
Compression-based distance measures 101 We used the compression-based distance measures that were proposed in [?, ?, ?]. Those measures were 102 first developed based on the theory of Kolmogorov complexity [?]. As Kolmogorov complexity is not 103 computable, the authors further refined the measures using data compression. In particular, the following Here C(x) denotes the size of the compressed file of sequence x from a standard compressor, xy 106 denotes the concatenation of two sequences x and y, and C(x|y) denotes the size of the compressed file 107 of sequence x conditioning on sequence y. The authors further proposed a more mathematically precise 108 version in [?] which was referred to as normalized compression-based distance (NCD): 109 d NCD (x, y) = max{C(x|y), C(y|x)} max{C(x), C(y)} .
A good compressor should be able to remove redundant information that is shared between two

119
We used the tool GenCompress [?] for compression. The advantage of GenCompress is that it can 120 perform conditional compression x|y. When applying GenCompress to an NGS sample, we simply con-121 catenated all short reads of the sample to form a single sequence and then compressed that sequence.

122
Using the same compression tool allows a consistent and fair comparison across different data sets.  the D 2 , D * 2 , and D S 2 statistics which were proposed in [?,?] for the comparison of long genome sequences.

129
The main difference between d S 2 and CVTree lies in the normalization of the frequency vectors of k-130 mers. The co-phylog distance measure is also based on k-mers but not in the frequency context. It 131 first constructs context-object structures from the observed k-mers and then measures the distance as 132 the number of structures that have the same contexts, but different objects in the two sequences (or 133 two samples). In other words, this is a micro-alignment process that attempts to estimate the average 134 nucleotide substitution rate between two DNA sequences (or two NGS samples).

6
The implementations of CVTree and d S 2 have the option to input the parameter k. Hence, we were 136 able to try all possible values of k allowed by the programs. For CVTree, we tried k from 2 to 32.

137
However, for d S 2 , we were not able to run the program for k > 9 due to some "segmentation fault". There 138 is no input option for co-phylog, thus we simply used its default settings.

139
Accuracy assessment the mtDNA sequences and compared their results with those obtained from the MSA method. Figure S1 180 and Table S1 show that the compression-based distances d NCD , d, d CDM and the k-mer based distances 181 CVTree, d S 2 (for optimal choices of k) show good agreement with the MSA distance, in terms of both 182 tree symmetric difference and distance correlation. Moreover, the phylogenetic trees reconstructed from 183 those distances are highly consistent with existing benchmarks in the literature [?, ?, ?, ?]. The co-phylog 184 measure, however, failed for this data set. One possible explanation is that co-phylog may be only suitable 185 for closely related species, as the authors have mentioned in [?]. We also noted that the CVTree and d S 2 186 distances varied considerably with respect to k (Table S2). For instance, the optimal symmetric difference 187 between the CVTree trees and the MSA tree was 6 (for k = 10), but the worst case was up to 48 (for 188 k = 16, 17). Similarly, the optimal correlation between the d S 2 distances and the MSA distance was 0.88 be found in Table S2. As shown in Table 1, the d S 2 distance achieved the highest correlation with the 204 MSA distance, followed by the CVTree and the compression-based distances. In terms of the symmetric 205 difference from the MSA tree, the d CDM distance performed consistently well for all four error models, 206 followed by the d S 2 and d NCD distances. Figure 1 shows an example of the phylogenetic trees reconstructed

210
The d S 2 tree, however, has more inconsistent branches in the group Ferungulates, although it has the 211 highest correlation with the MSA distance. Last but not least, we also found that the alignment-free 212 results obtained from the simulated NGS short reads were highly consistent with their corresponding 213 counterparts obtained from the mtDNA sequences in the previous section, especially for the Exact model 214 ( have shown that the co-phylog distance was highly consistent with the MSA distance in terms of both 231 tree symmetric difference and distance correlation. Hence, to avoid the time-consuming MSA, we used 232 the co-phylog distance as benchmark.

233
Performance on whole genome sequences 234 We first applied the five measures d NCD , d, d CDM , CVTree, and d S 2 to the whole genome sequences and 235 compared their results with the benchmark obtained from the co-phylog measure. Table 3 shows that the 236 d CDM distance performed the best in terms of both tree symmetric difference and distance correlation.

237
The results of d NCD and d were also considerably better than the optimal results of CVTree and d S 2 , 238 especially with the remarkably high distance correlation. The d S 2 distance failed for this data set and its 239 correlation with the benchmark co-phylog distance was significantly lower than that of the other measures.

240
The CVTree distance achieved good correlation but produced inconsistent phylogenetic trees for different 241 values of k. The most significant inconsistency among them is whether the genus Shigella violates the 242 monophyleticity of the genus Escherichia or the monophyleticity of E.coli strains ( Figure S2). This was 243 also mentioned previously in [?]. terms of both tree symmetric difference and distance correlation. We also noted that while the results

256
of the compression-based distances for the whole genome sequences (Table 3) and for the NGS short 257 reads (Table 4) were comparable, the performance of the CVTree and d S 2 distances became worse when 258 they were applied to the NGS short reads. Similar results for the NGS data sets obtained from the 5× 259 sampling depth can be found in Table S3. The last section has focused on closely related bacteria at the genus level. We next applied the MSA and 270 the alignment-free distance measures to a larger and more complicated data set at a higher taxonomy 271 level. In particular, the data set consists of 70 genomes that were randomly chosen from 15 orders of the 272 class Gammaproteobacteria (Table S4). As the number of genomes is large and they come from different 273 groups, it is interesting to ask if the distance measures can cluster and classify those genomes into their 274 correct orders. We used the parsimony score to measure the difference between a clustering tree and the 275 true classification [?]. As the number of groups is 15, the optimal parsimony score is 14. The higher the 276 parsimony score is, the more different the clustering tree is from the true classification. The parsimony 277 score of a clustering tree was also compared with that of randomly joined trees to assess its p-value and 278 statistical significance. For this data set a parsimony score lower than 40 corresponds to a p-value less 279 than 0.001 (10,000 random trees were generated). This data set has been studied previously in [?] and 280 the authors found that the co-phylog distance did not perform well because the bacteria of interest are 281 not closely related.

282
Performance on 16S rRNA sequences and whole genome sequences 283 As it is challenging to perform MSA of 70 whole genome sequences, we applied the MSA method to 284 16S rRNA sequences of those 70 genomes to obtain the benchmark distance and clustering tree ( Figure   285 S3, parsimony score = 18). We then applied all six alignment-free distance measures d NCD , d, d CDM ,

286
CVTree, d S 2 , and co-phylog to the 16S rRNA sequences. Table 5 shows that the alignment-free distances 287 were highly correlated with the MSA distance and they all achieved similar parsimony scores (17-18), 288 except for the co-phylog distance. Overall, the d NCD distance performed the best in terms of parsimony 289 score, tree symmetric difference, and distance correlation. Its clustering tree (Figure 3) shows that the 290 genomes of the six orders Aeromonadales, Enterobacteriales, Legionellales, Pasteurellales, Vibrionales,

291
and Xanthomonadales were all correctly classified into their groups. Majority of the genomes in the 292 remaining orders were also well clustered. Then, we applied the alignment-free distance measures to the 293 whole genome sequences, and the results were slightly worse than those obtained from the 16S rRNA 294 sequences (Table 5). We also noted that the optimal k of CVTree and d S 2 for the whole genome sequences 295 were different from those for the 16S rRNA sequences (Table S5).

296
Performance on NGS short reads 297 Finally, we applied all six alignment-free distance measures d NCD , d, d CDM , CVTree, d S 2 and co-phylog to 298 NGS short reads which were simulated from the whole genome sequences. Again, even at the very low 299 1× sampling depth, the clustering results obtained from the NGS short reads were quite similar to those 300 obtained from the whole genome sequences, although both were slightly worse than those obtained from 301 the 16S rRNA sequences (Table 5). As this experiment was conducted at a high taxonomy level and the 302 species were selected from different orders of the class Gammaproteobacteria, one could expect that the 303 16S rRNA sequences should be more suitable for the classification. It can also be seen from  (Table S6). This data set has been studied previously in [?, ?]. In [?] the authors applied the CVTree 328 and d S 2 distances to those 39 metagenomic samples and found that the sequence signatures (that is, the 329 k-mers) of the samples were strongly associated with the diet and gut physiology of the host species. primates. This interesting finding has not been observed in previous studies.

379
In this paper we studied the application of the compression-based distance measures for the problem 380 of sequence comparison with a special focus on NGS short reads data. The key advantages of the 381 compression-based distance measures are assembly-free, alignment-free, and parameter-free. We con- the optimal parameter k for each data set is of critical importance for using the k-mer based methods.

398
This task may be a difficult problem when there is no benchmark (e.g., true phylogenetic trees, true 399 classifications) available to guide the analysis and the selection of k.

416
To the best of our knowledge, our work is the first study to assess the performance of the compression- The short reads were simulated from the tool MetaSim using the Empirical model and 5× sampling depth. The group of three species platypus, opossum, and wallaroo was used as the outgroup to root the tree.    The short reads were simulated from the mtDNA sequences using four error models 454, Exact, Empirical, and Sanger of the tool MetaSim at 5× sampling depth. The two smallest tree symmetric differences and the two highest distance correlation coefficients for each error model are highlighted in boldface. Similar results for 1×, 10×, and 30× sampling depths can be found in Table S2. The short reads were simulated from the mtDNA sequences using four error models 454, Exact, Empirical, and Sanger of the tool MetaSim at 5× sampling depth. The two smallest tree symmetric differences for each error model are highlighted in boldface. Similar results for 1×, 10×, and 30× sampling depths can be found in Table S2. The two smallest tree symmetric differences and the two highest correlation coefficients are highlighted in boldface. The short reads were simulated from the Escherichia/Shigella genomes using four error models 454, Exact, Empirical, and Sanger of the tool MetaSim at 1× sampling depth. The two smallest tree symmetric differences and the two highest correlation coefficients for each error model are highlighted in boldface. Similar results for 5× sampling depth can be found in Table S3. Supplementary  Table S5. The parsimony score of the clustering trees for 39 metagenomic samples.