- Data Note
- Open access
Gene function annotations for the maize NAM founder lines
BMC Research Notes volume 17, Article number: 9 (2024)
Abstract
Objectives
We annotated the latest published sequences of the 26 Zea mays Nested Association Mapping (NAM) founder lines using GOMAP, the Gene Ontology Meta Annotator for Plants. The maize NAM panel enables researchers to understand and identify the genetic basis of complex traits. Annotations of predicted functions for genes can help researchers investigate gene-phenotype associations, prioritize candidate genes for phenotypes of interest, and formulate testable hypotheses about gene function/phenotype associations. The creation and release of high-confidence, high-coverage gene function annotation sets for the NAM founder lines is critical to accelerate the generation of knowledge in maize genetics research. GOMAP is a high-throughput computational pipeline that annotates gene functions genome-wide in plant genomes using Gene Ontology functional class terms. Here we report and share GOMAP-generated functional annotations for the NAM founder lines.
Data description
Datasets include the protein sequences used as input, GOMAP-generated annotation files, scripts used to update obsolete terms, and GAF-formatted tab-delimited text files of gene function annotations along with README files that describe formatting, content, and how files relate to each other.
Objective
GOMAP is an annotation tool that generates high-coverage, high-quality (based on F-measure), whole-genome functional annotations for plants. It assigns Gene Ontology (GO) terms to genes through sequence similarity, domain presence, and mixed-method pipelines [1]. The GO framework includes a standardized vocabulary designed to describe gene functions under three categories: biological process, molecular function, and cellular component [2]. Using GOMAP, we annotated the 26 Zea mays ssp. mays Nested Association Mapping (NAM) founder lines [3]. The maize NAM population was established to enhance the genetic diversity available for study and to determine the genetic structure of complex traits, merging the benefits of quantitative trait locus and association mapping studies while reducing their limitations [4].
The GOMAP-generated annotation datasets for the Zea mays ssp. mays NAM founder lines can be of great use to scientists in the plant community, especially those whose research focuses on maize. GO-based function predictions allow researchers to identify novel candidate genes for hypothesis generation and for testing gene functions. The annotation datasets can also be used for gene-phenotype association analyses, identification of novel genes in a pathway of interest, and investigation of functional differences among subpopulations of maize, to name a few. We expect the cleaned datasets described here to support new gene function findings and experimental validation.
Data description
A standardized functional annotation dataset is available for each of the NAM founder lines (Table 1). Datasets include:
- Protein sequences of the maize lines that were used as input for GOMAP. We have included the original protein sequences, the Python script we used to reformat the original sequences to produce the GOMAP input file, and the GOMAP input file itself. A README file describes the data further, including where the original file was downloaded from and how to run the Python script. Reformatting was required for proper text wrapping, removal of any asterisks in the sequences, and selection of the longest transcript of each gene.
- The raw output gene annotation file produced by GOMAP. This file is the aggregated functional annotation generated by the pipeline and follows the GO Annotation File 2 (GAF 2) format.
- Python scripts and supplementary resources to modify and clean the GOMAP output file. Gene and transcript names are modified by adding a transcript identifier column. Cleanup consists of removing obsolete GO terms and duplicate entries. Descriptions of these files and details on how to run the scripts are provided in an accompanying README file. For consistency, the go.obo file used for all maize datasets reported here is release 2022-07-01, the same release incorporated in GOMAP v1.3.9.
- The final cleaned functional annotation dataset. This is the GAF 2 file generated by the 2.3_cleanup.py script. These GO-based gene function predictions can be readily used by the public.
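As an illustration of how a GAF 2 annotation file might be consumed, the following minimal Python sketch groups GO terms by gene and by GO category. It assumes the standard 17-column, tab-delimited GAF 2 layout with `!` comment lines; the column positions in the cleaned files may differ because of the added transcript identifier column, so the dataset README should be checked first. The function name `read_gaf`, the helper `_row`, and all example values (`DB`, `REF:0`, `geneA`, `geneB`) are hypothetical and are not taken from the released datasets.

```python
import csv
from collections import defaultdict
from io import StringIO

# 0-based column positions in the standard GAF 2 layout (an assumption here;
# verify against the dataset README before using on the cleaned files).
COL_OBJECT_ID = 1  # DB Object ID (gene identifier)
COL_GO_ID = 4      # GO ID
COL_ASPECT = 8     # Aspect: P, F, or C

ASPECT_NAMES = {
    "P": "biological_process",
    "F": "molecular_function",
    "C": "cellular_component",
}


def read_gaf(handle):
    """Map gene ID -> GO category -> set of GO IDs from a GAF 2 stream."""
    annotations = defaultdict(lambda: defaultdict(set))
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Skip blank lines, '!' comment lines, and rows too short to parse.
        if not row or row[0].startswith("!") or len(row) <= COL_ASPECT:
            continue
        go_id = row[COL_GO_ID]
        if not go_id.startswith("GO:"):
            continue  # for example a column-header row, if the file has one
        aspect = ASPECT_NAMES.get(row[COL_ASPECT], row[COL_ASPECT])
        annotations[row[COL_OBJECT_ID]][aspect].add(go_id)
    return annotations


def _row(gene, go_id, aspect):
    # Hypothetical 17-column row; only gene, GO ID, and aspect vary.
    return "\t".join(["DB", gene, gene, "", go_id, "REF:0", "IEA", "", aspect,
                      "", "", "gene", "taxon:4577", "20221001", "DB", "", ""])


EXAMPLE_GAF = "\n".join([
    "!gaf-version: 2.0",
    _row("geneA", "GO:0008150", "P"),
    _row("geneA", "GO:0003674", "F"),
    _row("geneA", "GO:0003674", "F"),  # repeated row collapses in the set
    _row("geneB", "GO:0005575", "C"),
]) + "\n"

result = read_gaf(StringIO(EXAMPLE_GAF))
for gene in sorted(result):
    for aspect in sorted(result[gene]):
        print(gene, aspect, sorted(result[gene][aspect]))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `StringIO` object.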
Each directory has its own README that provides more information about the files. There is also a top-level README that describes the dataset more generally. Moreover, each dataset has its own standardized metadata. The datasets are publicly available on CyVerse [5] and can be accessed using the links provided in Table 1.
The structure of our dataset is intended to ensure data reproducibility and adherence to the data principles of findability, accessibility, interoperability, and reusability (FAIR) [6]. The overall organization of our annotation datasets is not new; we developed and applied this form for previously studied GOMAP-generated annotation files [7]. A full list of our annotated plant genomes is available online [8] and includes annotation sets for 24 species, including sorghum, rice, wheat, barley, cotton, and hemp. For users interested in generating their own GO-based annotations, the GOMAP pipeline itself is available for general use [1], and a description of how to use the pipeline is also available [9].
We have generated and publicly released new functional annotations using the most up-to-date version of GOMAP (v1.3.9) on our older datasets, including the previously annotated maize lines Mo17 [36, 37, 38], W22 [39, 40], and PH207 [41, 42]. As an example, the annotations for Zea mays B73v5 reported here are an update of a previously released dataset [43]. We anticipate that the availability and maintenance of our datasets will benefit researchers by providing plant gene function predictions, paving the way for the generation of testable hypotheses on novel candidate genes of interest.
Limitations
The quality of the annotations depends on the quality of the input sequences. Genomes with high-quality sequencing and coverage are expected to have better annotations, whereas annotations derived from lower-quality sequencing will limit downstream analyses.
When a gene ID has multiple transcripts, GOMAP requires the selection of the longest transcript for that gene ID because the pipeline contains a reciprocal best hit step. As a result, not every transcript ID of a gene ID is annotated in the resulting file. We have included a transcript ID column in the final cleaned file that allows the user to identify which transcript was included in the reformatted input file and annotated through GOMAP.
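The input reformatting described in this note (asterisk removal, selection of the longest transcript per gene, and text wrapping) can be sketched in Python as follows. This is not the script distributed with the datasets. It assumes that transcript IDs have the form `<gene>_<suffix>`, that ties in length are resolved by keeping the first transcript encountered, and that the transcript ID is retained as the output header; the function names and the example sequences are hypothetical.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id_of(transcript_id):
    # ASSUMPTION: the gene ID is everything before the last underscore.
    return transcript_id.rsplit("_", 1)[0]


def longest_per_gene(text, width=60):
    """Return FASTA text holding the longest cleaned sequence per gene."""
    best = {}
    for transcript_id, seq in parse_fasta(text):
        seq = seq.replace("*", "")  # remove asterisks (stop symbols)
        gene = gene_id_of(transcript_id)
        # Replace only when strictly longer, so the first seen wins ties.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (transcript_id, seq)
    lines = []
    for transcript_id, seq in best.values():
        lines.append(">" + transcript_id)
        # Wrap the sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


EXAMPLE_FASTA = """>geneA_P001
MKT*
>geneA_P002
MKTAYIAK*
>geneB_P001
MS*
"""

print(longest_per_gene(EXAMPLE_FASTA, width=5), end="")
```

With the example input and `width=5`, only `geneA_P002` and `geneB_P001` remain, and the eight-residue sequence is wrapped onto two lines.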
A cleanup step is performed on our GOMAP output file to remove any obsolete GO terms. This step relies on a go.obo file. For the datasets reported here, we used the go.obo file released on 2022-07-01. Users may substitute the most recent go.obo release when cleaning their own output.
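A cleanup of this kind can be sketched in Python as follows. The sketch is not the 2.3_cleanup.py script itself. It assumes that obsolete terms are marked with `is_obsolete: true` inside `[Term]` stanzas of the go.obo file, which is the OBO convention, and that the GO ID sits in the fifth column of each annotation row. It only removes rows and does not follow `replaced_by` tags. The function names, the GO IDs, and the rows (truncated to five columns for brevity) are invented for illustration.

```python
def obsolete_terms(obo_text):
    """Return the set of term IDs marked 'is_obsolete: true' in OBO text."""
    obsolete, current_id, in_term = set(), None, False
    for line in obo_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[4:]
        elif in_term and line == "is_obsolete: true" and current_id:
            obsolete.add(current_id)
    return obsolete


def clean_gaf_lines(gaf_lines, obsolete, go_col=4):
    """Keep comment lines; drop obsolete, malformed, and duplicate rows."""
    seen, kept = set(), []
    for line in gaf_lines:
        line = line.rstrip("\n")
        if line.startswith("!"):
            kept.append(line)
            continue
        fields = line.split("\t")
        # Rows without a GO column (including blank lines) are dropped.
        if len(fields) <= go_col or fields[go_col] in obsolete:
            continue
        if line in seen:
            continue  # exact duplicate of an earlier row
        seen.add(line)
        kept.append(line)
    return kept


# Invented OBO fragment: GO:9000002 is flagged obsolete.
OBO_EXAMPLE = """format-version: 1.2

[Term]
id: GO:9000001
name: example current term

[Term]
id: GO:9000002
name: example obsolete term
is_obsolete: true
"""

GAF_EXAMPLE = [
    "!gaf-version: 2.0",
    "DB\tgeneA\tgeneA\t\tGO:9000001",
    "DB\tgeneA\tgeneA\t\tGO:9000001",  # duplicate row
    "DB\tgeneB\tgeneB\t\tGO:9000002",  # obsolete term
]

for row in clean_gaf_lines(GAF_EXAMPLE, obsolete_terms(OBO_EXAMPLE)):
    print(row)
```

To clean against a newer ontology release, only the text passed to `obsolete_terms` needs to change.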
When using the functional annotation datasets, it is worth noting that the GO Directed Acyclic Graph (DAG) poorly represents plant functions that are underrepresented in the model species Arabidopsis thaliana. As a result, a gene may be assigned an unconventional function because no related plant function exists in the ontology [7].
Abbreviations
- GOMAP: Gene Ontology Meta Annotator for Plants
- NAM: Nested Association Mapping
- GO: Gene Ontology
- DAG: Directed Acyclic Graph
- FAIR: Findability, accessibility, interoperability, and reusability
References
1. Wimalanathan K, Lawrence-Dill CJ. Gene ontology meta annotator for plants (GOMAP). Plant Methods. 2021;17(1):1–4.
2. Thomas PD. The gene ontology and the meaning of biological function. The gene ontology handbook. Springer; 2017. p. 15–24.
3. Hufford MB, Seetharam AS, Woodhouse MR, Chougule KM, Ou S, Liu J, Ricci WA, Guo T, Olson A, Qiu Y, Della CR. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science. 2021;373(6555):655–62.
4. Yu J, Holland JB, McMullen MD, Buckler ES. Genetic design and statistical power of nested association mapping in maize. Genetics. 2008;178(1):539–51.
5. Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, Muir A. The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci. 2011;2:34.
6. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data. 2016;3(1):1–9.
7. Fattel L, Psaroudakis D, Yanarella CF, Chiteri KO, Dostalik HA, Joshi P, Starr DC, Vu H, Wimalanathan K, Lawrence-Dill CJ. Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants. GigaScience. 2022;11:giac023.
8. Publicly available GOMAP Datasets; 2023. https://faculty.sites.iastate.edu/triffid/gomap
9. Wimalanathan K, Lawrence-Dill CJ. Dill-PICL/GOMAP-singularity. GitHub; 2023. https://github.com/Dill-PICL/GOMAP-singularity
10. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_B73_NAM_5.0_October_2022_v2.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/cfvb-jn16
11. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_B97_NAM_1.0_October_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/abf6-pa81
12. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML52_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/qgb3-8743
13. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML69_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/xvga-0f52
14. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML103_NAM_1.0_October_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/1n89-rd43
15. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML228_NAM_1.0_October_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/e6hc-0406
16. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML247_NAM_1.0_October_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/jnwv-g571
17. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML277_NAM_1.0_October_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/ggj0-by23
18. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML322_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/36bb-f096
19. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_CML333_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/tnhe-yr36
20. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_HP301_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/2jhr-hy41
21. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Il14H_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/t500-af32
22. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Ki3_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/y2t8-zp24
23. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Ki11_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/thx1-dm44
24. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Ky21_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/ay3t-b914
25. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_M37W_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/cgmt-s267
26. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_M162W_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/pewv-k336
27. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Mo18W_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/w0zf-jc74
28. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Ms71_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/9gb5-aq74
29. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_NC350_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/q46m-qy91
30. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_NC358_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/0w9q-ta36
31. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Oh7B_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/910q-f303
32. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Oh43_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/8a63-3n35
33. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_P39_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/dgda-md18
34. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Tx303_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/gz5q-rw97
35. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Tzi8_NAM_1.0_November_2022.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/9g8d-ny61
36. Lawrence-Dill C. GOMAP Maize Zm-Mo17-REFERENCE-CAU-1.0 Zm00014a.1. CyVerse Data Commons; 2019. https://doi.org/10.25739/m634-cn58
37. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_Mo17_CAU_1.0_May_2023_v2.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/zjmm-vf13
38. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_CyVerse_Mo17_CAU_2.0_July_2023.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/tr4x-ta89
39. Lawrence-Dill C. GOMAP Maize Zm-W22-REFERENCE-NRGENE-2.0 Zm00004b.1. CyVerse Data Commons; 2019. https://doi.org/10.25739/e4va-9f09
40. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_W22_NRGENE_2.0_May_2023_v2.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/x1fn-w309
41. Lawrence-Dill C. GOMAP Maize Zm-PH207-REFERENCE_NS-UIUC_UMN-1.0 Zm00008a.1. CyVerse Data Commons; 2019. https://doi.org/10.25739/dm9s-aa15
42. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_PH207_NS-UIUC_UMN_1.0_May_2023_v2.r1. CyVerse Data Commons; 2023. https://doi.org/10.25739/e047-e733
43. Lawrence-Dill C. Carolyn_Lawrence_Dill_GOMAP_Maize_MaizeGDB_B73_NAM_5.0_December_2021.r1. CyVerse Data Commons; 2022. https://doi.org/10.25739/g1rt-b278
Acknowledgements
We thank CyVerse for providing a collaborative cyberinfrastructure to share data with the research community. We also thank Iowa State University High Performance Computing facility for the equipment and resources that accelerate our research. Finally, we thank the researchers who sequenced and assembled the plant datasets that were used as input in our research.
Funding
We gratefully acknowledge support from: NSF and USDA for AIIRA 2021-67021-35329; IOW0417 Hatch Funding to Iowa State University; Iowa State Predictive Plant Phenomics NSF Research Traineeship (DGE-1545453; CJLD is a co-principal investigator, and CFY is a trainee).
Author information
Contributions
LF generated and organized the maize NAM founder line datasets. BN generated the updated datasets of maize lines Mo17, PH207, and W22. OTJ generated the original maize B73v5 dataset. LF created the metadata for each dataset and requested DOIs. CFY established the dataset structure to be applied to all our GOMAP-generated datasets. DAC supervised the release of datasets and creation of DOIs through CyVerse. KW created the GOMAP system. LF and CJLD wrote the manuscript. All authors read, suggested improvements, and approved the final copy of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Fattel, L., Yanarella, C.F., Ngara, B. et al. Gene function annotations for the maize NAM founder lines. BMC Res Notes 17, 9 (2024). https://doi.org/10.1186/s13104-023-06668-6
DOI: https://doi.org/10.1186/s13104-023-06668-6