Identification of annotation artifacts concerning the chalcone synthase (CHS)
BMC Research Notes volume 16, Article number: 109 (2023)
Chalcone synthase (CHS) catalyzes the initial step of the flavonoid biosynthesis. The CHS encoding gene is well studied in numerous plant species. Rapidly growing sequence databases contain hundreds of CHS entries that are the result of automatic annotation. In this study, we evaluated apparent multiplication of CHS domains in CHS gene models of four plant species.
CHS genes with an apparent triplication of the CHS domain encoding part were discovered through database searches. Such genes were found in Macadamia integrifolia, Musa balbisiana, Musa troglodytarum, and Nymphaea colorata. A manual inspection of the CHS gene models in these four species with massive RNA-seq data suggests that these gene models are the result of artificial fusions in the annotation process. While there are hundreds of seemingly correct CHS records in the databases, it is not clear why these annotation artifacts appeared.
Flavonoids are one of the most important groups of specialized plant metabolites. Their enormous chemical diversity results in a plethora of biological functions . Most noticeable are the anthocyanins which can provide blue to red coloration to flowers. Flavonoids are distributed across the whole plant kingdom. Given the visual phenotype of several flavonoids, this pathway was established as a model system for plant metabolism and transcriptional regulation [2, 3]. Today, the flavonoid biosynthesis and its regulation are among the best studied processes in plants. Numerous studies are published every year which investigate the flavonoid biosynthesis and the corresponding regulators in different plant species.
One of the best studied enzymes in the flavonoid biosynthesis pathway is the chalcone synthase (CHS). This enzyme catalyzes the first committed step of the flavonoid biosynthesis, particularly the reaction leading to the formation of the naringenin chalcone from one p-coumaroyl-CoA and three malonyl-CoA . CHS belongs to the type III polyketide synthases, a larger protein family that also harbors several closely related enzymes like the stilbene synthase (STS) . CHS and STS differ by only two functionally important residues which are Q166 and Q167 in the Arabidopsis thaliana CHS sequence . The gene structure of CHS comprises usually two coding exons .
Numerous plant genomes are sequenced every year  and the rapid development of long read sequencing technologies enables the generation of high quality genome sequences . Since an automatic annotation of all genes in the flavonoid biosynthesis and many regulators is possible [10, 11], it is feasible to perform large scale analyses of gene families acting in this pathway. Analyses at this scale are inherently prone to errors concerning individual species and sequences that require human intervention at the final interpretation step. A systematic analysis of domains in the chalcone synthase revealed an apparent triplication in several species.
Here, we describe an investigation of several CHS gene models that appeared to encode a protein with a triplicated CHS domain. A manual inspection of multiple cases based on transcriptomic data suggests mis-annotation leading to artificially fused gene models.
Identification of sequences with potential CHS domain duplications
A BLASTp analysis with the Nymphae colorata CHS (XP_049936683.1) as query and nr (containing 545,546,009 sequences) as subject was performed through the NCBI webportal (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) on the 2nd of June 2023. The word size was set to 5 and an e-value cutoff of 10–30 was applied. Matched sequences longer than 500 amino acid residues were manually screened for CHS domain duplications or triplications. Redundant sequences were removed from the list.
Evaluation of gene models via RNA-seq read mapping
The genome sequences and corresponding annotations of four species with apparent fusion of multiple CHS gene copies were selected for manual inspection: Musa balbisiana , Musa troglodytarum , Macadamia integrifolia , and Nymphaea colorata . All available paired-end RNA-seq data sets of these species were retrieved from the Sequence Read Archive via fastq-dump  (Additional File 1 in ). Read pairs were aligned to the genome sequence with STAR v2.5.1b [18, 19] in 2-pass-mode using a minimal similarity of 95% and a minimal alignment length of 90% as previously described . The resulting BAM files were processed with customized Python scripts . All reads mapped to the locus of interest were extracted from each individual BAM file with samtools v1.15.1 . These subset BAM files were merged to produce one BAM file per species that contains all reads belonging to the locus of interest (Additional File 2, Additional File 3, Additional File 4, Additional File 5 in ). The final BAM file was visualized with Integrative Genomics Viewer (IGV) .
Analysis and visualization of RNA-seq read mapping coverage
A previously developed  Python script was applied to convert BAM files into coverage files that list the number of aligned reads per genomic position. These coverage files served as the basis for the generation of gene model-focused coverage plots . These plots visualize the number of aligned reads per position around a given locus of interest. High coverage with RNA-seq reads indicates exon positions while introns are characterized by very low or no RNA-seq read coverage at all. Introns are not included in the final mRNAs and usually mostly mature mRNAs are extracted with standard RNA extraction protocols used for RNA-seq experiments.
Comparison of sequence similarity
Sequences around the CHS loci of interest were retrieved from the respective genome sequence using the Python script seqex3.py v0.5 . Sequence similarity between the different domains predicted in the inspected sequences was analyzed by constructing dot plots with dotplotter v0.1  using a k-mer size of 31. Dotplotter compares two sequences by generating k-mers from one given sequence and identifying matches in the other sequence. The corresponding k-mer positions in both sequences are visualized in a dotplot. This allows an intuitive visualization of repeats.
An analysis of the CHS sequences retrieved from the NCBI revealed apparent fusion proteins harboring multiple (two or three) CHS domains in 28 protein sequences across different plant species (Additional File 6 in ). More specifically, 8 sequences showed a triplication of the CHS domain, while 20 sequences showed a duplication. Representative examples of predicted proteins with multiple CHS domains were identified in Macadamia integrifolia, Musa balbisiana, Musa troglodytarum, and Nymphae colorata (Fig. 1). A comparison of each CHS locus in the four species against itself revealed the expected high similarity (Additional File 7 in ).
Loci with an unexpected CHS annotation were inspected in an RNA-seq read mapping (Fig. 2). The annotation indicates a single CHS gene encoding three CHS domains, but the results of our RNA-seq read mapping suggest that there are multiple individual CHS genes. The coverage continuously drops towards the ends of several exons. That is an indication of the end of a gene. In contrast, an abrupt drop in coverage to almost zero would indicate a splice site. There are almost no connecting reads between some of the annotated exons and individual exons show very different RNA-seq coverage. There are also substantial fractions of annotated exons without RNA-seq support. It is not expected that exons at the 5’- and 3’-end of a gene have high coverage, while the enclosed exons have low coverage. The underlying RNA-seq read mapping files of Macadamia integrifolia (Additional File 2 in ), Musa balbisiana (Additional File 3 in ), Musa troglodytarum (Additional File 4 in ), and Nymphae colorata (Additional File 5 in ) are available for in depth inspection. Detailed instructions are provided to enable researchers to perform a similar investigation of other cases of potential annotation artifacts .
We outlined how surprising results of a quick database search and in-depth inspection of these entries resulted in the identification of most likely annotation artifacts. We investigated the assembly and annotation process underlying the annotations of the four analyzed species. The Macadamia integrifolia assembly was produced with MaSuRCA v3.2.6 and ALLMAPS v.Jul-2019 [14, 26, 27]. The Musa balbisiana and Musa troglodytarum genome sequences were assembled with wtdbg v1.2.8 and NextDenovo v2.4.0, respectively [12, 13, 28, 29]. The Nymphae colorata assembly was produced by Canu v1.3 [15, 30]. All these assemblers are well established tools that have been deployed in numerous plant genome sequencing projects. The annotation was produced by Gnomon  which is the default annotation pipeline operated at the NCBI. There is no indication how any of these steps could have caused the mis-annotation. Gnomon was also applied on many other plant genome sequences and usually generated CHS annotations comprising only a single CHS domain [32,33,34]. The repetitive regions with multiple CHS copies in an array might contribute to the mis-annotation as suggested for repeats before .
To the best of our knowledge, there are no reports that validated a CHS protein with multiple repeated domains. Despite all efforts, it might not be possible to automatically maintain a perfect database free of any mis-annotations. Despite technological advances, manual inspection and correction might be required in some cases . Swiss-Prot is an initiative to establish a collection of sequences that were curated by experts [37, 38]. However, such a manual inspection is not a scalable approach given the rapid growth of sequence collections due to improved sequencing technologies [8, 9]. As most records in the database are likely correct, users should carefully inspect any substantially deviating records. Especially sequences, which appear as exciting discoveries, should be considered as suspicious and thorough inspection is required.
Unexpected sequences should be carefully checked, because annotation artifacts are more likely than striking differences between closely related species. We describe our strategy for the gene model investigation in detail to enable repetition by others .
A comparison with orthologs in many other species is often a way to identify unexpected sequence properties. Sequences should only differ systematically if there are particularly striking events during evolution. If such events are not known, any major sequence differences could point to artifacts and should be considered as such.
Generally, the alignment of RNA-seq reads or other types of transcriptomic evidence should be used to gain insights into the structure of genes. It is important to use a proper split read aligner like STAR [18, 19] or HISAT2  for this step. The coverage should be consistent or should drop continuously towards the ends of a gene. Splice sites within a gene are characterized by abrupt changes of the coverage to almost zero. In addition, introns should be spanned by a number of reads that is almost equivalent to the coverage at the border of the flanking exons.
If no suitable datasets are available, it is also possible to validate the gene structure and sequence through amplification via RT-PCR followed by Sanger sequencing . As this approach is more time consuming and requires more financial resources, the aforementioned data-reuse approaches should be applied first.
Our investigation was restricted to four examples of apparent CHS domain triplication events within a single CHS gene in four different plant species. The gene models in question were contradicted by the analyzed RNA-seq datasets. However, we cannot rule out the possibility that such modified CHS versions exist elsewhere.
The developed scripts, a detailed description for the inspection of suspicious gene models, and additional sequence collections are available via GitHub: https://github.com/bpucker/CHS. All analyzed data sets are publicly available (Additional File 1 in ). Genome sequences and corresponding annotations were retrieved from the BananaGenomeHub (https://banana-genome-hub.southgreen.fr/) and the NCBI (https://www.ncbi.nlm.nih.gov/), respectively.
Integrative Genomics Viewer
Winkel-Shirley B. Flavonoid biosynthesis. A colorful model for genetics, biochemistry, cell biology, and biotechnology. Plant Physiol. 2001;126:485–93.
Dubos C, Le Gourrierec J, Baudry A, Huep G, Lanet E, Debeaujon I. MYBL2 is a new regulator of flavonoid biosynthesis in Arabidopsis thaliana. Plant J. 2008;55:940–53.
Ramsay NA, Glover BJ. MYB-bHLH-WD40 protein complex and the evolution of cellular diversity. Trends Plant Sci. 2005;10:63–70.
Dao TTH, Linthorst HJM, Verpoorte R. Chalcone synthase and its functions in plant resistance. Phytochem Rev. 2011;10:397–412.
Flores-Sanchez IJ, Verpoorte R. Plant Polyketide Synthases: a fascinating group of enzymes. Plant Physiol Biochem. 2009;47:167–74.
Schröder G, Schröder J. A single change of histidine to glutamine alters the substrate preference of a stilbene synthase. J Biol Chem. 1992;267:20558–60.
Yang J, Gu H. Duplication and divergent evolution of the CHS and CHS-like genes in the chalcone synthase (CHS) superfamily. Chin SCI BULL. 2006;51:505–9.
Marks RA, Hotaling S, Frandsen PB, VanBuren R. Representation and participation across 20 years of plant genome sequencing. Nat Plants. 2021;7:1571–8.
Pucker B, Irisarri I, de Vries J, Xu B. Plant genome sequence assembly in the era of long reads: Progress, challenges and future directions. Quant Plant Biology. 2022;3:e5.
Rempel A, Pucker B. KIPEs3:Automaticannotationofbiosynthesispathways.2022;:2022.06.30.498365.
Pucker B. Automatic identification and annotation of MYB gene family members in plants. BMC Genomics. 2022;23:220.
Wang Z, Miao H, Liu J, Xu B, Yao X, Xu C. Musa balbisiana genome reveals subgenome evolution and functional divergence. Nat plants. 2019;5:810–21.
Li Z, Wang J, Fu Y, Jing Y, Huang B, Chen Y. The Musa troglodytarum L. genome provides insights into the mechanism of non-climacteric behaviour and enrichment of carotenoids. BMC Biol. 2022;20:186.
Nock CJ, Baten A, Mauleon R, Langdon KS, Topp B, Hardner C. Chromosome-scale assembly and annotation of the macadamia genome (Macadamia integrifolia HAES 741). G3: genes, genomes. Genetics. 2020;10:3497–504.
Zhang L, Chen F, Zhang X, Li Z, Zhao Y, Lohaus R. The water lily genome and the early evolution of flowering plants. Nature. 2020;577:79–84.
NCBI. sra-tools. 2020. https://github.com/ncbi/sra-tools.
Pucker B. Manual inspection of CHS gene models. 2023. https://github.com/bpucker/CHS.
Dobin A, Gingeras TR. Mapping RNA-seq reads with STAR. Curr protocols Bioinf. 2015;51:11–4.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Haak M, Vinke S, Keller W, Droste J, Rückert C, Kalinowski J. High quality de novo transcriptome assembly of Croton tiglium. Front Mol Biosci. 2018;5:62.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–92.
Pucker B, Brockington SF. Genome-wide analyses supported by RNA-Seq reveal non-canonical splice sites in plant genomes. BMC Genomics. 2018;19:1–13.
Pucker B, Schilbert HM, Schumacher SF. Integrating Molecular Biology and Bioinformatics Education.J Integr Bioinform.2019;16.
Pucker B. PBBtools v0.1. 2023. https://github.com/bpucker/PBBtools.
Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013;29:2669–77.
Tang H, Zhang X, Miao C, Zhang J, Ming R, Schnable JC. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 2015;16:3.
Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nat Methods. 2020;17:155–8.
GrandOmics. NextDenovo. 2023. https://github.com/Nextomics/NextDenovo.
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017;27:722–36.
Souvorov A, Kapustin Y, Kiryutin B, Chetvernin V, Tatusova T, Lipman D. Gnomon-theNCBIeukaryoticgenepredictiontool.2018.https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/.Accessed13Nov2018.
Liu Y-J, Wang X-R, Zeng Q-Y. De novo assembly of white poplar genome and genetic diversity of white poplar population in Irtysh River basin in China. Sci China Life Sci. 2019;62:609–18.
Daccord N, Celton J-M, Linsmith G, Becker C, Choisne N, Schijlen E. High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development. Nat Genet. 2017;49:1099–106.
Guo L, Winzer T, Yang X, Li Y, Ning Z, He Z. The opium poppy genome and morphinan production. Science. 2018;362:343–7.
Tørresen OK, Star B, Mier P, Andrade-Navarro MA, Bateman A, Jarnot P. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Res. 2019;47:10994–1006.
Vaattovaara A, Leppälä J, Salojärvi J, Wrzaczek M. High-throughput sequencing data and the impact of plant gene annotation quality. J Exp Bot. 2019;70:1069–76.
Boeckmann B, Bairoch A, Apweiler R, Blatter M-C, Estreicher A, Gasteiger E. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–70.
Schneider M, Tognolli M, Bairoch A. The swiss-prot protein knowledgebase and ExPASy: providing the plant community with high quality proteomic data and tools. Plant Physiol Biochem. 2004;42:1013–21.
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37:907–15.
Pucker B, Holtgräwe D, Weisshaar B. Consideration of non-canonical splice sites improves gene prediction on the Arabidopsis thaliana Niederzenz–1 genome sequence. BMC Res Notes. 2017;10:1–6.
This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). Many thanks to the Bioinformatics Resource Facility (BRF) at the Center for Biotechnology (CeBiTec) at Bielefeld University for providing an environment to perform the computational analyses.
We acknowledge support by the Open Access Publication Funds of Technische Universität Braunschweig.
Open Access funding enabled and organized by Projekt DEAL.
The authors declare no competing interests.
Ethics approval and consent to participate
Consent for publication
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Bartas, M., Volna, A., Cerven, J. et al. Identification of annotation artifacts concerning the chalcone synthase (CHS). BMC Res Notes 16, 109 (2023). https://doi.org/10.1186/s13104-023-06386-z
- Chalcone synthase
- Flavonoid biosynthesis
- Annotation error
- RNA-seq mapping
- Domain composition