The complete chloroplast genome sequence of Brachypodium distachyon: sequence comparison and phylogenetic analysis of eight grass plastomes

Background
Wheat, barley, and rye, of tribe Triticeae in the Poaceae, are among the most important crops worldwide, but they present many challenges to genomics-aided crop improvement. Brachypodium distachyon, a close relative of those cereals, has recently emerged as a model for grass functional genomics. Sequencing of the nuclear and organelle genomes of Brachypodium is one of the first steps towards making this species available as a tool for researchers interested in cereal biology.
Findings
The chloroplast genome of Brachypodium distachyon was sequenced by a combined approach using BAC end sequences and shotgun sequences derived from a selected BAC containing the entire chloroplast genome. Comparative analysis indicated that the chloroplast genome is conserved in gene number and organization with respect to those of other cereals. However, several Brachypodium genes evolve at a faster rate than those in other grasses. Sequence analysis reveals that rice and wheat have a ~2.1 kb deletion in their plastid genomes and that this deletion must have occurred independently in both species.
Conclusion
We demonstrate that BAC libraries can be used to sequence plastid, and likely other organellar, genomes. As expected, the Brachypodium chloroplast genome is very similar to those of other sequenced grasses. The phylogenetic analyses and the pattern of insertions and deletions in the chloroplast genome confirmed that Brachypodium is a close relative of the tribe Triticeae. Nevertheless, we show that some large indels can arise multiple times and may confound phylogenetic reconstruction.

In addition to their important biological roles, plastids have the potential to make a big impact on biotechnology. Plastid transformation, achieved via homologous recombination, is very advantageous compared to nuclear genome transformation mainly because it can generate high levels of gene expression and the recombinant DNA is more easily contained since chloroplasts are maternally inherited in most species of angiosperms [3].
The family Poaceae, with approximately 10,000 species, contains the world's most important crops. The tribe Triticeae, of subfamily Pooideae, includes species grown in temperate regions, some of which are of great economic importance, namely wheat, rye, triticale, and barley. Despite their contribution to the human food supply, members of the Triticeae are not easily amenable to functional genomics aimed at crop improvement because of their large genome size and the difficulty of transforming them.
Brachypodium distachyon, a small grass in the Pooideae, has recently emerged as a new model species for functional genomics of temperate grasses. Brachypodium offers many advantages as a model grass, among them its small stature, short life cycle, and small genome [4].
In the last few years a considerable effort has been made to develop genetic and molecular tools for Brachypodium, including ESTs [5], Bacterial Artificial Chromosome (BAC) libraries [6], cytological characterization of accessions [7,8,9], and techniques to perform rapid and efficient transformation [10,11]. Finally, sequencing of the Brachypodium distachyon genotype Bd21 has been initiated by the DOE Joint Genome Institute and will soon be available to the public.
Here we report the sequencing of the chloroplast genome of the Bd21 genotype of Brachypodium, and perform a sequence analysis and phylogeny reconstruction with the completely sequenced chloroplast genomes from seven grass species. We compare the evolutionary dynamics of Brachypodium chloroplast genes with those of wheat, rice and maize, and discuss the significance of some indels in the framework of grass evolution.

Sequencing of the Brachypodium chloroplast genome
Sequencing of plastid genomes is usually done by isolating chloroplasts and then purifying and amplifying plastid DNA for library construction. To sequence the chloroplast genome of Brachypodium distachyon, we took advantage of existing BAC libraries [12] and identified several chloroplast BACs from a database of BAC end sequences (BES). In our analysis, 1,725 BES matched wheat chloroplast queries. A clone generated from a single restriction cut of the chloroplast genome should contain the entire chloroplast genome, and its two BES should assemble in the same region in opposite orientations. The two BES from BAC DH037I03 matched the sequence of the wheat psbC gene back-to-back (Fig. 1C). Overall, we identified over 30 BACs harboring the complete chloroplast genome, suggesting that this strategy is efficient in identifying full-length chloroplast genomes from genomic BAC libraries.
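The selection rule described above (two end reads from one clone landing in the same region in opposite orientations) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the input format, the function name `full_length_candidates`, the `max_gap` tolerance, and the toy clone names and coordinates are all hypothetical.

```python
from collections import defaultdict

def full_length_candidates(hits, genome_len, max_gap=50):
    """Return clones whose two end reads map back-to-back on a circular genome.

    hits: iterable of (clone_id, position, strand) tuples, one per BES hit,
          with strand given as '+' or '-'.
    max_gap: hypothetical tolerance (bp) between the two end positions.
    """
    by_clone = defaultdict(list)
    for clone_id, position, strand in hits:
        by_clone[clone_id].append((position, strand))

    candidates = []
    for clone_id, ends in by_clone.items():
        if len(ends) != 2:
            continue  # need exactly two end sequences
        (pos_a, strand_a), (pos_b, strand_b) = ends
        if strand_a == strand_b:
            continue  # ends must be in opposite orientations
        linear = abs(pos_a - pos_b)
        circular = min(linear, genome_len - linear)  # allow wrap-around
        if circular <= max_gap:
            candidates.append(clone_id)
    return sorted(candidates)

# Toy input: invented clone names and coordinates, for illustration only.
toy_hits = [
    ("cloneA", 35_000, "+"), ("cloneA", 35_004, "-"),  # same site, opposite strands
    ("cloneB", 10_000, "+"), ("cloneB", 60_000, "-"),  # ends far apart: partial insert
    ("cloneC", 20_000, "+"), ("cloneC", 20_002, "+"),  # same orientation: rejected
]
print(full_length_candidates(toy_hits, genome_len=135_197))
```

The circular distance matters because a restriction site near the arbitrary start coordinate of the reference would otherwise place the two ends at opposite extremes of a linear coordinate system.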
As expected, the chloroplast sequence assembled using the BES contained many gaps due to the distance between restriction sites (Fig. 1). To complete the Brachypodium chloroplast genome, a shotgun sequencing library of DH037I03 was constructed. The complete genome sequence was assembled using 1,725 BES, 410 sequences from the shotgun library, and 264 gap-filling sequences generated by primer walking. The sequence coverage of the entire chloroplast genome is 8.9×.
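As a rough consistency check on these figures, the sketch below assumes that coverage is simply total read bases divided by genome length (the text does not say how the 8.9× value was calculated) and works out the mean usable read length that this would imply. The function name is hypothetical, and the result is a derived value, not one reported here.

```python
def implied_mean_read_length(coverage, genome_len, n_reads):
    """Mean read length implied by coverage = n_reads * mean_len / genome_len."""
    return coverage * genome_len / n_reads

# Read counts and genome length as quoted in the text.
n_reads = 1725 + 410 + 264  # BES + shotgun reads + primer-walking reads
mean_len = implied_mean_read_length(8.9, 135_197, n_reads)
print(n_reads, round(mean_len))
```

Under this assumption the implied mean read length is on the order of 500 bp.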

Genome organization of Brachypodium chloroplast
The chloroplast genome of Brachypodium distachyon is 135,197 bp in length. The Inverted Repeats (IR) are 21,540 bp in length each, and the Large Single Copy (LSC) and Small Single Copy (SSC) regions are 79,446 bp and 12,668 bp long respectively. The Brachypodium chloroplast genome contains 118 unique genes, 18 of which are duplicated in the IRs, making a total of 136 genes of known function. In addition, there are 9 predicted open reading frames (ORFs) and 3 tRNA pseudogenes. With a few exceptions discussed below, the gene number and order are identical to other grass chloroplast genomes (Fig. 2).

Grass chloroplast phylogeny based on complete chloroplast genomes
In a landmark article that included data from multiple sources, the Grass Phylogeny Working Group [13] examined relationships among grasses using a large and diverse assemblage of species. That study highlighted the existence of two major lineages, the BEP clade and the PACCAD clade, that together encompass the majority of grasses. The BEP clade includes the subfamilies Bambusoideae, Ehrhartoideae, and Pooideae. Rice belongs to subfamily Ehrhartoideae while wheat, barley, bentgrass, and Brachypodium are in the Pooideae. The PACCAD clade includes several subfamilies, among them the Panicoideae, a large group of mainly tropical and subtropical species, some of which are important crops worldwide, like maize, sugarcane, and sorghum.
So far, all phylogeny reconstructions of the Poaceae have used selected genes or partial regions as data. However, with sequenced chloroplast genomes of several species in this family and the computer power to align them, it is possible for the first time to perform whole chloroplast genome phylogenetic analyses. To examine whether a genome-wide phylogenetic analysis is consistent with those based on selected genes, we employed Bayesian [14] and Maximum Parsimony [15] methods to reconstruct a grass phylogeny using whole chloroplast sequences. Both Bayesian and Maximum Parsimony estimates produced the same topology with maximum node support (Fig. 3). The topology shown in Fig. 3 contained 99% of the Bayesian credible trees, and the tree is in agreement with the results obtained with a larger group of species [13]. The phylogram also shows that branches in the BEP clade are much longer than those in the PACCAD clade. A similar result was found by Saski et al. [16] in a phylogenetic study using 61 protein-coding genes, indicating that the rates of evolution are higher in the BEP clade compared to the PACCAD species sampled here. However, it is possible that these slower rates do not extend to other species of the PACCAD clade, since maize, sorghum, and sugarcane are closely related, with all three belonging to subfamily Panicoideae.

Evolution of Brachypodium chloroplast genes
For a given protein-coding gene, the ratio of substitutions that change the amino acid sequence (nonsynonymous) to those that do not (synonymous) is a commonly used estimator of the evolutionary dynamics operating on that gene [15]. To find out whether Brachypodium plastid genes show the same evolutionary dynamics as those of other grasses, we calculated the ratio of nonsynonymous to synonymous substitution rates for Brachypodium chloroplast genes using tobacco as an outgroup.
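To make the ratio concrete, here is a minimal Python sketch of one common way to compute it: convert the proportions of nonsynonymous and synonymous differences into distances with the Jukes-Cantor correction, then divide. The choice of correction is an assumption (the text does not state which estimator was used), the function names are hypothetical, and the counts are invented.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites (p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from counts of differences and sites."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return d_n, d_s, d_n / d_s

# Invented counts for one gene compared with an outgroup.
d_n, d_s, omega = dn_ds(nonsyn_diffs=3, nonsyn_sites=300,
                        syn_diffs=15, syn_sites=100)
print(f"dN={d_n:.4f} dS={d_s:.4f} dN/dS={omega:.3f}")
```

A ratio well below 1, as in this toy case, is the pattern usually read as purifying selection on the protein sequence.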
We found that the nonsynonymous/synonymous ratios for Brachypodium chloroplast genes are similar to those of rice, maize and wheat, with photosynthetic genes having the lowest ratio (Table 1), in agreement with previous findings [17]. Within the NADH class, ndhB and rps12 have very low rates of both kinds of substitutions compared to other genes in the same class, a result explained by their position in the IRs and most likely due to the dynamics of the IRs' evolution rather than to evolutionary constraints on ndhB and rps12.
Figure 1 BAC end sequences (BES) coverage of the Brachypodium distachyon plastid chromosome.
Figure 2 Alignment of grass chloroplast genomes. The sequence of the rice chloroplast genome is compared to those of Brachypodium (top alignment), maize (middle), and wheat (bottom). Sequences were aligned in mVISTA [24] and the annotation shown above the alignment corresponds to the rice genome. Grey arrows above the alignment indicate genes and their orientation. Colors indicate location of exons, conserved non-coding sequences (CNS), and untranslated regions (UTRs). Ribosomal genes are colored as CNS. Thick black lines show the position of the IRs. Other grass genomes mentioned in the text have been omitted for the sake of simplicity.
The rate of evolution of a particular gene, i.e., the estimated number of substitutions per site, can vary among organisms for reasons such as rapid gene duplication that creates opportunity for sequence divergence, different generation times, and differences in DNA repair mechanisms [15]. We conducted a relative rate test [18] for all Brachypodium chloroplast genes with known function against their orthologs in maize, wheat, and rice and found that most Brachypodium genes evolve at rates similar to those of wheat, rice, and maize. However, there are unequal rates of evolution (at P = 0.05) in 15 genes and 17 cases of species comparisons, and Brachypodium genes evolved at a faster rate in 14 out of those 17 comparisons (Table 2).
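For readers unfamiliar with relative rate tests, the sketch below shows one widely used form, Tajima's chi-square test on three aligned sequences. Whether this is the exact test cited as [18] is an assumption, and the function name and toy sequences are invented. The idea is that if two ingroup lineages evolve at the same rate, changes unique to each should be equally frequent when judged against an outgroup.

```python
def tajima_relative_rate(seq1, seq2, outgroup, critical=3.841):
    """Tajima-style relative rate test on three aligned sequences.

    Returns (m1, m2, chi2, significant), where m1 and m2 count sites at which
    only seq1 or only seq2 differs from the other two sequences.
    critical=3.841 is the chi-square cutoff for 1 degree of freedom at P = 0.05.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if "-" in (a, b, o):
            continue        # skip alignment gaps
        if a != b and b == o:
            m1 += 1         # change unique to seq1
        elif a != b and a == o:
            m2 += 1         # change unique to seq2
    if m1 + m2 == 0:
        return m1, m2, 0.0, False
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    return m1, m2, chi2, chi2 > critical

# Toy alignment (invented): seq1 carries 8 unique changes, seq2 carries 1.
m1, m2, chi2, significant = tajima_relative_rate(
    "CCCCCCCCAA", "AAAAAAAAAC", "AAAAAAAAAA")
print(m1, m2, round(chi2, 3), significant)
```

Sites where all three sequences differ carry no information about which ingroup lineage changed, so they are ignored.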

Sequence comparison among grass chloroplast genomes
The structure and gene number of the chloroplast genome are very similar among land plants, although the Poaceae have three large inversions compared to the canonical plastid genome usually represented by the tobacco chloroplast genome [19]. This conservation of overall structure in the chloroplast genomes of grasses allowed us to align the chloroplast genome sequences of eight grass species at the genome-wide level.
Comparison of the sequences of eight chloroplast genomes (only rice, Brachypodium, wheat, and maize are represented in Fig. 2) reveals several regions of high sequence length polymorphism, as well as shared deletions and insertions. The IRs show lower sequence divergence among grasses than the single-copy regions (Fig. 2), a result previously reported by other authors [20]. The region between rbcL and psaI (at position ~54 kb, Fig. 2) is one of the most polymorphic chloroplast loci in grasses. In rice, this region is 1532 bp long and contains ORF133 and the accD gene, but it is much shorter in other grasses. In Brachypodium, both ORF133 and accD are missing, and the entire rbcL-psaI spacer region, containing only the rbcL 3'UTR and psaI promoter sequences, is reduced to 296 bp.
As expected from its phylogenetic placement, Brachypodium shares several indels with barley, wheat, and bentgrass, all of which are in subfamily Pooideae, including a 410 bp deletion in ORF70 (~14.5 kb, Fig. 2) and the duplication of a 5' portion of ndhH in IRb (~102 kb in Fig. 2) that is also shared with rice [16,21]. The size of this duplication is variable, ranging from 238 bp in rice to 311 bp in Brachypodium. Insertions in rpoC2 (~25 kb, Fig. 2) have been described and used previously in phylogenetic analyses [13, and references therein] and will not be discussed here.
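Indels of the kind discussed in this section show up in an alignment as runs of gap characters in one sequence relative to another. The Python sketch below lists such runs for one row of an alignment. It is an illustration only: the function name `gap_runs`, the `min_len` filter, and the toy sequence are hypothetical and unrelated to the mVISTA output.

```python
import re

def gap_runs(aligned_seq, min_len=1):
    """Return (start, length) for each run of '-' in one row of an alignment.

    start is a 0-based alignment column; min_len filters out short gaps.
    """
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"-+", aligned_seq)
            if m.end() - m.start() >= min_len]

# Toy alignment row (invented); each '-' marks a base missing from this sequence.
query = "ACGT------GTAC-TACGT"
print(gap_runs(query))             # all gap runs
print(gap_runs(query, min_len=3))  # only runs of at least 3 columns
```

Running the same function on every row of a multiple alignment and comparing the reported coordinates is one simple way to spot gaps shared by several species.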

Rice and wheat have identical and independently derived deletions
Despite the overall sequence conservation of IRs, the region between ndhB and trnI (~84 kb and ~131 kb in Fig. 2) appears to be a hot spot for large indels. Previously, Ogihara et al. [21] described a 2,131 bp deletion in wheat and rice with respect to maize. This deletion is located between ORF249 and ORF28 (~84 kb and ~131 kb, Fig. 2). Because rice is more closely related to wheat than to maize, the authors concluded that the deletion was present in the common ancestor of rice and wheat. However, this deletion is present only in rice and wheat, which are not sister species (Fig. 3), whereas in Brachypodium, barley, and bentgrass there is a smaller deletion of about 1,141 bp (Fig. 4).
Figure 3 Complete chloroplast genome phylogeny of the grasses. The phylogram was obtained from an exhaustive parsimony search and was identical to the topology obtained from a Bayesian analysis. The tree was rooted making maize, sugarcane, and sorghum the outgroup. Support for the nodes is shown as posterior probability after 1,000,000 generations and bootstrap values from 1,000 repetitions. The GenBank accessions used for the analyses are X15901 (rice), EU325680 (Brachypodium), EF115543 (bentgrass), EF115541 (barley), X86563 (maize), AP006714 (sugarcane), EF115542 (sorghum), and AB042240 (wheat). The sequences were aligned and visualized using mVISTA [25]. MrBayes [14] and PAUP* [24] were used to analyze the data.
To confirm that the 2,131-bp deletion in rice and wheat was not an artifact of the alignment or missing sequence, we used the Brachypodium sequence missing in wheat and rice as a BLAST query against grass sequence databases. We recovered sequences from many grasses except wheat and rice, confirming the presence of the deletion in their genomes. In addition, we searched the GenBank angiosperm databases with the maize sequence corresponding to the deleted wheat and rice region and found that the region is present in species representing diverse lineages of flowering plants, including the monocot Dioscorea, the early-diverging angiosperms Amborella and Nymphaea, and several core eudicots (data not shown). Therefore, we concluded that the 2,131-bp deletions in the wheat and rice chloroplast genomes are derived characters that arose independently in those species.
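The logic of this conclusion can be illustrated with a small Fitch parsimony count in Python. This is a sketch, not the analysis performed here: the fully resolved tree below is an assumption that is consistent with the relationships described in the text (rice outside the Pooideae, Brachypodium close to the Triticeae) but is not taken from Fig. 3, and the presence of the 2,131-bp deletion is coded as a simple binary character.

```python
def fitch(tree, states):
    """Fitch parsimony on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> character state.
    Returns (state_set_at_root, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Assumed topology (hypothetical resolution, for illustration only).
tree = (
    ("maize", ("sugarcane", "sorghum")),
    ("rice", ("bentgrass", ("Brachypodium", ("wheat", "barley")))),
)
# 1 = carries the 2,131-bp deletion, 0 = does not.
has_deletion = {name: 0 for name in (
    "maize", "sugarcane", "sorghum", "bentgrass", "Brachypodium", "barley")}
has_deletion.update({"rice": 1, "wheat": 1})

root_states, changes = fitch(tree, has_deletion)
print(root_states, changes)
```

On this assumed tree the most parsimonious reconstruction places the undeleted state at the root and requires two changes, one on the rice lineage and one on the wheat lineage, which matches the inference of independent origins.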
The 2,131-bp deletions in rice and wheat are identical in both IRs and the sequences bordering them align unambiguously with those of other grasses (Fig. 4). In addition, the lack of short direct repeats in those sequences indicates that recombination via short repeats is not the way by which they arose. Thus, despite the fact that deletions of varying lengths in the ndhB-trnI region seem to be common in the BEP clade, the mechanism underlying these specific deletions remains unclear. In tobacco, nucleotide mutations in plastid coding sequences are quickly eliminated by gene conversion, a process facilitated by the polyploid nature of the plastid genome [22]. Whatever the mechanism that generates deletions in the trnI-ndhB region in species of the BEP clade, their multiple occurrences suggest that they may provide a selective advantage that allows them to overcome gene conversion and become fixed in the population.
Figure 4 Deletions in the IR region. Rice and wheat have an identical 2.1 kb deletion in both IRs (indicated by the dashes). Brachypodium, bentgrass, and barley have a 1.14 kb deletion in the same region. The sequences flanking the deletions are shown. The positions shown on top of the alignment correspond to the maize sequence. Two slashes indicate that the sequence continues but is not shown here.