DNA barcode trnH-psbA is a promising candidate for efficient identification of forage legumes and grasses

Objective
Grasslands are widespread ecosystems that fulfil many functions. Plant species richness (PSR) is known to have beneficial effects on such functions, and monitoring PSR is crucial for tracking the effects of land use and agricultural management on these ecosystems. Unfortunately, traditional morphology-based methods are labor-intensive and cannot be adapted for high-throughput assessments. DNA barcoding could help increase the throughput of PSR assessments in grasslands. In this proof-of-concept work, we aimed to determine which of three plant DNA barcodes (rbcLa, matK and trnH-psbA) best discriminates 16 key grass and legume species common in temperate sub-alpine grasslands.
Results
Barcode trnH-psbA had a 100% correct assignment rate (CAR) in the five analyzed legumes, followed by rbcLa (93.3%) and matK (55.6%). Barcode trnH-psbA also had a 100% CAR in the grasses Cynosurus cristatus, Dactylis glomerata and Trisetum flavescens. However, the closely related Festuca, Lolium and Poa species were not always correctly identified, which led to an overall CAR in grasses of 66.7%, 50.0% and 46.4% for trnH-psbA, matK and rbcLa, respectively. Barcode trnH-psbA is thus the most promising candidate for PSR assessments in permanent grasslands and could greatly support plant biodiversity monitoring on a larger scale.


Introduction
Grasslands are some of the most widespread ecosystems on Earth, covering two-fifths of its land surface [1]. They provide roughage for ruminant livestock production and many other environmental services related to carbon sequestration, water flow regulation and soil stabilization [2,3]. Plant species richness (PSR) is a component of biodiversity with major effects on the ecosystem functioning of grasslands. In experimental grassland plant communities, high levels of PSR stabilize yields and confer tolerance against environmental stressors [4]. Similar effects have been observed in semi-natural grasslands, which are composed of a limited number of species and are an important component of sustainable livestock production [5]. Assessing PSR is thus crucial for tracking its changes and effects on ecosystem services. However, such assessments have traditionally relied on morphology-based surveys that are labor-intensive and require trained taxonomists, limiting their use for surveying PSR over large scales and long time periods [3]. Furthermore, grasses and legumes (the two plant families of major economic relevance in temperate grasslands) can be taxonomically assessed with the highest precision only when certain distinctive morphological characters are on display (e.g., flowering bodies and leaves). Even then, some grass and legume species are difficult to distinguish from closely related species. A standardized, precise, high-throughput solution for PSR surveys in grasslands is therefore desirable for large-scale assessments of changes in PSR.
DNA barcoding is a methodology that has been successfully applied for standardizing and increasing the throughput of PSR surveys in ecological studies [6,7]. DNA barcodes are organellar or nuclear loci that show a high degree of species-level conservation [8,9]. By comparing newly sequenced DNA barcodes to reference databases, it is possible to assign an unknown biological sample to its correct taxonomy. An international effort is currently in place to maintain a well-curated, public reference database of DNA barcodes (The Barcode Of Life Datasystems database, BOLD [10]).
In animals, the DNA barcode of choice is the mitochondrial COI gene, which can reproducibly differentiate most of the major animal phyla [8]. In plants, in contrast, there is no single DNA barcode with comparable success [11]. Most plant DNA barcodes are located in the chloroplast genome, either within coding sequences (such as rbcLa and matK) or in intergenic regions (such as trnH-psbA) [11,12], although some nuclear loci have also been used as DNA barcodes, e.g., the internal transcribed spacer of the ribosomal DNA (ITS) [13]. More than one barcode per individual plant is typically sequenced and used for taxonomical assignments [11,12]. However, sequencing more than one DNA barcode per plant may not be technically feasible in higher-throughput settings, particularly when analyzing mixed-species samples.
The aim of the present study was to determine the best DNA barcode sequences for forage species by screening the BOLD database for promising candidates and sequencing three DNA barcodes (rbcLa, matK and trnH-psbA) from multiple cultivars of 16 forage plant species that are common in sub-alpine grasslands.

Plant material and DNA extraction
Seeds of 2-3 cultivars of 16 forage species (Alopecurus pratensis L., Arrhenatherum elatius L., Cynosurus cristatus L., Dactylis glomerata L., Festuca pratensis Huds., F. rubra L., Lolium perenne L., L. multiflorum Lam., Lotus corniculatus L., Medicago sativa L., Phleum pratense L., Poa pratensis L., Trifolium pratense L., T. repens L. and Trisetum flavescens L.), kindly provided by Agroscope, Zurich, Switzerland, were used for the study (Table 1). Seeds were germinated and transferred into pot trays (77 wells, 50 cm × 32 cm, with compost as substrate). The species selected are predominant components of sub-alpine grasslands and hold great potential for multifunctional, species-rich agriculture [14,15]. Plants were grown for 3 weeks, after which DNA was extracted from three plants per species. For grasses, three leaf fragments of ~1 cm were harvested, and for legumes, three young leaflets. The plant material was freeze-dried for 48 h and pulverized in a QIAGEN TissueLyser II (QIAGEN, Hilden, Germany). DNA was extracted using the NucleoSpin® II kit (Macherey-Nagel, Düren, Germany) and its integrity was visually inspected by agarose gel electrophoresis (1% w/v). DNA purity and concentration were determined with a NanoDrop™ spectrophotometer (ThermoFisher Scientific, Waltham, MA, USA).

DNA barcode amplification and sequencing
The BOLD database was screened for DNA barcode sequences of the selected species and close relatives; barcodes rbcLa, matK and trnH-psbA were selected as candidates because they had the most available sequences. These DNA barcodes are located in the chloroplast genome and are not known to have paralogs that can interfere with taxonomic assignments, as is the case for some nuclear loci such as ITS [13]. Primer sequences for the three barcodes were obtained from BOLD [10] and were optimized for amplification in the target plant families (Additional file 1: Table S1). Amplicons were purified in a MultiScreen PCR96 filter plate (Merck, Darmstadt, Germany). Sequencing reactions were prepared with 1× BigDye™ Terminator 3.1 Reaction Mix (ThermoFisher Scientific, Waltham, MA, USA), 1× BigDye™ 3.1 Sequencing Buffer, forward or reverse primer at 0.16 µM and 800 ng of purified amplicon to a final volume of 5 µL. The same primers used for PCR were used for sequencing. Capillary electrophoresis was performed on a 3130 ABI (ThermoFisher Scientific, Waltham, MA, USA). The resulting traces were quality filtered and merged using GAP4 [16] with the default settings. All traces and sequences were uploaded to BOLD v4 (project code: SWFRG; http://www.boldsystems.org/index.php/Public_SearchTerms).

Taxonomical assignments
Sequences of matK, rbcLa and trnH-psbA were downloaded from BOLD v4 on May 23, 2019 [10]. In total, 6232 rbcLa, 11,971 matK and 1236 trnH-psbA sequences were present in the downloaded fasta files, which also include the plants from the BOLD project SWFRG (Additional file 1: Table S2). The taxonomical identifiers of the BOLD fasta files were reformatted to remove spaces and rearrange their informative fields in a consistent manner (fasta_name_reformat.py script from https://github.com/mloera/forage-barcoding). Each barcode-specific fasta file was then used to make a blast database, and the SWFRG sequences were queried against their corresponding database with blastn using the option -outfmt 6 (i.e., tabular format). The resulting blast output tables were parsed with the blastn_matcher.R script from the above-mentioned GitHub repository. The script removes self-hits and corrects some misspellings in the taxonomy of queries and hits. The script then compares the taxonomy of the queries and hits at the species and genus level. A "match" was called when the taxonomy of a query sequence was equal to the taxonomy of the highest scoring hit or hits (Additional file 1: Table S3). A "taxonomical assignment rate" (the correct assignment rate, CAR) for each barcode was then calculated as the ratio between the sum of its correct taxonomical assignments and the total number of query sequences.
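The matching step described above can be illustrated with a minimal Python sketch. It is not the blastn_matcher.R script from the repository and makes several assumptions: sequence identifiers follow a hypothetical "RECORD|Genus_species" layout, the highest scoring hits are those with the largest bitscore, and a query counts as a match only if every tied top hit carries the same taxon as the query. The function names and the example rows are invented for illustration; only the column order of the default blastn tabular output (-outfmt 6) is taken as given.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def species_of(seq_id):
    """Return 'Genus_species' from a hypothetical 'RECORD|Genus_species' id."""
    return seq_id.split("|", 1)[1]


def genus_of(seq_id):
    """Return the genus part of the hypothetical id layout."""
    return species_of(seq_id).split("_", 1)[0]


def top_hits(handle):
    """Map each query id to the ids of its highest-bitscore hits, skipping self-hits."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        if row["qseqid"] == row["sseqid"]:
            continue  # a query that hits its own reference record
        query, score = row["qseqid"], float(row["bitscore"])
        if query not in best or score > best[query][0]:
            best[query] = (score, [row["sseqid"]])
        elif score == best[query][0]:
            best[query][1].append(row["sseqid"])  # tied top hit
    return {query: hits for query, (_, hits) in best.items()}


def assignment_rate(hits_by_query, n_queries, taxon=species_of):
    """Correct assignments divided by the total number of query sequences.

    Assumption: all tied top hits must agree with the query taxon.
    Queries without any non-self hit simply never count as correct.
    """
    correct = sum(
        1 for query, hits in hits_by_query.items()
        if all(taxon(hit) == taxon(query) for hit in hits)
    )
    return correct / n_queries


# Invented example rows: Q1 has a self-hit and a clear best hit,
# Q2 has two tied top hits from different Lolium species.
EXAMPLE = (
    "Q1|Trifolium_repens\tQ1|Trifolium_repens\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t924\n"
    "Q1|Trifolium_repens\tR1|Trifolium_repens\t99.8\t500\t1\t0\t1\t500\t1\t500\t0.0\t918\n"
    "Q1|Trifolium_repens\tR2|Trifolium_pratense\t97.0\t500\t15\t0\t1\t500\t1\t500\t0.0\t840\n"
    "Q2|Lolium_perenne\tR3|Lolium_multiflorum\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t924\n"
    "Q2|Lolium_perenne\tR4|Lolium_perenne\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t924\n"
)

hits = top_hits(io.StringIO(EXAMPLE))
species_rate = assignment_rate(hits, n_queries=2, taxon=species_of)
genus_rate = assignment_rate(hits, n_queries=2, taxon=genus_of)
print(f"species-level rate: {species_rate:.1%}, genus-level rate: {genus_rate:.1%}")
```

In this toy input the Lolium query is resolved at the genus level but not at the species level, because its two tied top hits belong to different species. A looser rule (any tied top hit matching) would give a different species-level rate, so the tie-handling choice should be checked against the actual script before comparing numbers.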

Results and discussion
PCR and sequencing results
The primer sequences of trnH-psbA and matK were adapted to allow for amplification within the target species, while the primer sequences of rbcLa did not need any modification (Additional file 1: Table S1). From the 48 processed specimens, 130 sequences were obtained (46 for matK, 43 for rbcLa and 41 for trnH-psbA) after repeating and optimizing failed amplifications. The size of the sequences ranged from 470 to 588 bp for rbcLa, 185 to 888 bp for matK and 268 to 614 bp for trnH-psbA (Table 1).
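As a quick check of the counts above, the following Python sketch recomputes the total number of sequences and a per-barcode sequencing success rate. The success rate is not reported in the text; it assumes that each specimen contributes at most one sequence per barcode, so that 48 is the maximum for each barcode.

```python
# Counts reported in the text: 48 processed specimens, sequences per barcode.
SPECIMENS = 48
sequences = {"matK": 46, "rbcLa": 43, "trnH-psbA": 41}

# Total number of sequences obtained across the three barcodes.
total = sum(sequences.values())

# Assumption: at most one sequence per specimen and barcode, so the
# success rate is the sequence count divided by the number of specimens.
success = {barcode: round(100 * n / SPECIMENS, 1) for barcode, n in sequences.items()}

print(total)
for barcode, rate in success.items():
    print(f"{barcode}: {rate}% of specimens sequenced")
```

Under that assumption, matK was recovered from the largest share of specimens and trnH-psbA from the smallest, which is worth keeping in mind when comparing assignment rates that are computed only over the sequences actually obtained.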
The low CARs for grass DNA barcodes could be due to various factors. Some grass species, such as Poa spp., are notoriously hard to discriminate morphologically and their phylogeny is subject to controversy [17,18]. This could have resulted in misidentified reference sequences. Another factor is the high genetic similarity between some grass taxa. For example, the genetic similarity of some species of the Festuca-Lolium complex is reported to be > 90%, as calculated from transcriptomic data of orthologous genes [19]. This may result in a higher proportion of incorrect taxonomic assignments for such grass species [20].
Barcode trnH-psbA is a good candidate for large-scale DNA barcoding of forage legumes and some grasses, such as C. cristatus, D. glomerata and T. flavescens (Table 3). However, further work is needed to produce reference sequences for more forage species and cultivars. Overall, our results provide the basic tools to implement DNA barcoding in forage species (i.e., family-specific primer pairs and a standard bioinformatic workflow for taxonomic assignments) and can help in choosing an appropriate DNA barcode for high-throughput applications. Such high-throughput applications could greatly enhance the biodiversity monitoring protocols that are used to study the ecology of grasslands, their dynamics and their interplay with agriculture.

Limitations
This is exploratory work focused on the most common forage plant species from sub-alpine temperate grasslands; further work is needed to address other forage species from different kinds of grasslands. As a proof of concept, three specimens per species were analyzed.