Wisconsin diversity panel phenotypes: spoken descriptions of plants and supporting data

Objectives
Phenotyping plants in a field environment can involve a variety of methods, including the use of automated instruments and labor-intensive manual measurement and scoring. Researchers also collect language-based phenotypic descriptions and use controlled vocabularies and structures such as ontologies to enable computation on descriptive phenotype data, including methods to determine phenotypic similarities. In this study, spoken descriptions of plants were collected; observers were instructed to use their own vocabulary to describe plant features that were present and visible. These plants were also measured and scored manually as part of a larger study to investigate whether spoken plant descriptions can be used to recover known biological phenomena.

Data description
Data comprise phenotypic observations of 686 accessions of the maize Wisconsin Diversity panel and 25 positive control accessions that carry visible, dramatic phenotypes. The data include the list of accessions planted, the field layout, data collection procedures, observation transcripts from student participants (whose personal data are protected for ethical reasons) and volunteers, volunteers' audio data files, terrestrial and aerial images of the plants, Amazon Web Services method selection experimental data, and manually collected phenotypes (e.g., plant height, ear and tassel features; measurements and scores). Data were collected during the summer of 2021 at Iowa State University's Agricultural Engineering and Agronomy Research Farms.

Objective
Formative research using free-text descriptions of plant phenotypes along with Natural Language Processing (NLP) methods has demonstrated that computing on plant phenotypes alone can recover known genotype-phenotype associations [1, 2]. Building on these successes, continued efforts to generate plant phenotype descriptions that are both structured (e.g., ontologies) and unstructured (i.e., free text) hold great promise for enabling researchers to advance analytics for phenotypes and traits, especially when these data are made publicly accessible [3].

We developed this dataset as a foundation for analyzing large volumes of spoken phenotype descriptions collected in a field environment. The phenotype observations were drawn from the Wisconsin Diversity panel, which contains sufficient phenotypic diversity in a field environment for various genotype-to-phenotype analyses [4-6]. Observers generating the datasets were not confined to rigid vocabularies and were not strictly limited to a list of traits to comment on. The protocols that we have developed, along with additional pipeline development, can enable the involvement of citizen scientists and make the low-cost capture of large volumes of free-text phenotype descriptions of plants practical.
In addition to the spoken descriptions of plant phenotypes and the text derived from these observations, measurements and scores for traits of interest were also collected as ground truth. Field layout and weather data are reported, along with images of the rows in the field and aerial images from a drone. Consequently, this dataset may be useful to investigators interested in data collected from diversity panels and to those interested in processing natural language and its use in describing scientific phenomena.
An example of the use of this dataset for investigating biological relevance and utility, including tools developed to assist in the use of spoken descriptions for field-based plant phenotype analytics, is available [7].

Data description
This dataset [8] was collected and derived from observations of an experimental field at Iowa State University's Agricultural Engineering and Agronomy Research Farms in Boone, Iowa. The Wisconsin Diversity panel (686 accessions), an environmental control line (B73, the maize reference line used for genetics and genomics), and 25 positive control accessions were planted in two replicates, and observations were generated over the summer of 2021. This dataset includes the following elements (Table 1).
• Each student participant and volunteer is identified in datasets only by a code name (from the NATO phonetic alphabet).
• Audio text processing data contain the spoken data collected by the volunteers (WAV files) and descriptions of the recordings generated by student participants using Sony ICD-UX570 recorders.
Additionally included are metadata (summary statistics) derived from the recordings and code to generate these statistics. Further, all intermediate files (JSON, TXT, and Excel files) and code to generate the final cleaned transcripts for all student participant recordings and a subset of the volunteers' recordings are included. These files provide a resource for investigators to utilize field-collected spoken natural language descriptions of maize plants.
• Methods selection data include data and code for generating transcriptions using several Amazon Web Services (AWS) Transcribe methods: an individualized custom vocabulary for each student participant (with an example using data collected by volunteer "Whiskey"), a generalized custom vocabulary applied to each student participant's and volunteer Whiskey's data, and no custom vocabulary. A subset of data was selected for processing and compared to a manually generated gold-standard transcription to calculate a similarity score, which determined the method used for transcribing all spoken descriptions collected during the summer of 2021.

Limitations
Some audio observations were incomplete due to technical difficulties, including microphones disengaging from the recording devices or observers recording observations for the incorrect row. Also, speech-to-text pipelines and post-processing cleaning steps are fallible, leading to transcription inaccuracies. These data were collected over approximately seven weeks, and apparent growth and developmental changes occurred throughout the duration of the study. Additionally, the observations within this dataset cover two replicates in the same environment; additional years, plots, and environments could supplement these speech data for a more robust dataset.