Automated gene data integration with Databio

Objective: Although sequencing and other high-throughput data production technologies are increasingly affordable, data analysis and interpretation remain a significant factor in the cost of -omics studies. Despite the broad acceptance of findable, accessible, interoperable, and reusable (FAIR) data principles, which focus on data discoverability and annotation, data integration remains a significant bottleneck in linking prior work to better understand novel research. Relevant and timely information discovery is difficult for increasingly multi-disciplinary projects when scientists cannot easily keep up with work across multiple fields. Computational tools are necessary to accurately describe data contents and to enable linkage to existing resources without prior knowledge of the various database resources.
Results: We developed the Databio tool, accessible at https://datab.io/, to automate data parsing and identifier detection and to streamline common tasks, providing a point-and-click approach to data manipulation and integration in life sciences research and translational medicine. Databio uses fast real-time data structures and a data warehouse of 137 million identifiers, with automated heuristics to describe data provenance without highly specialized knowledge or bioinformatics training.


Introduction
Although sequencing and other high-throughput data production technologies are increasingly affordable, data analysis remains a significant factor in the cost of -omics studies [1]. Without improving the ability to automate data integration and interoperation, the cost of analysis will continue to impede access to precision medicine for underserved populations with limited resources. Many resources have been developed around the concept of a central "Data Commons", but the path forward remains unclear [2], and current large data repositories are highly specialized and difficult to apply broadly. Despite the acceptance and proliferation of the Findable, Accessible, Interoperable, and Reusable (FAIR) data principles [3], current data provider implementations focus on descriptive metadata and keyword-oriented search applications, leaving the detailed gene and other -omics data inaccessible to computational discovery methods.
Data producers recognize the need to enable greater access to hosted data, but there are no well-accepted machine-readable means for annotating the contents of data sets across the biomedical landscape [4]. The lack of available standards and tools makes it cumbersome and time-consuming to properly annotate identifier sources, record their provenance throughout an analytical process, and track subsequent data quality metrics. These challenges exist regardless of the level of research activity, including mammalian, marine, and agricultural research domains [5][6][7]. As a result, the majority of useful scientific results remain buried in supplementary tables, figures, and poorly indexed data archives. Although manual curation efforts have made more data available in data portals and publication annotations, these efforts require specialized knowledge of biomedical resources. Even seemingly trivial tasks are burdensome, such as those required for secondary analysis of a gene list in a supplementary table. One must be able to recognize obscure identifiers such as 'ENSG00000168653', find tools or mapping data that support them, and translate them into symbols (e.g. 'NDUFS5') or identifiers (Entrez Gene ID 4725, RefSeq Accession NM_004552.3, etc.) useful for one's own analysis methods. Using these resources necessitates experience with the extract-transform-load (ETL) process, and the resource knowledge and technical expertise involved have little to do with the science itself.

BMC Research Notes
These challenges represent an increasing burden on data producers, which is deferred to data consumers who are faced with the need to integrate loosely described high-throughput experiments into novel studies [8]. Because data consumers only need these analytical skills occasionally, they are more prone to implementation errors and struggle to fully integrate complex data relationships [9,10]. Thus there is a need to simplify and automate the discovery and retrieval process.

Main text
We present Databio, a novel framework for automating the extraction, annotation, and integration of gene-oriented data sets. Databio automates data parsing and identifier detection, and streamlines many common tasks to provide a point-and-click approach to data manipulation and integration across a broad spectrum of applications in life sciences research and translational medicine. This ability to quickly and accurately streamline complex tasks will enable faster and better analysis of -omics data.

Implementation and available data
Databio is implemented as a web-based data portal (https://datab.io) that allows users to interact with the embedded tools through an interactive browser-based interface.
User data uploads are first handled by an automatic detection framework that determines the source data format (see top, Fig. 1). The current implementation supports tab-separated values (TSV), comma-separated values (CSV), and Excel 2007+ spreadsheets (XLSX). Records (rows) and fields (columns) within these documents are exposed to the rest of the application through a modular interface, so that additional data formats can be supported in future software updates. Heuristic techniques are applied to the parsed data to remove headers and determine field labels, allowing for a more descriptive display interface (see Fig. 1).
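As a rough illustration of the format-detection step, the following Python sketch distinguishes the three supported formats. It is not Databio's implementation: the function names `detect_format` and `parse_delimited` are hypothetical, and the rule for choosing between TSV and CSV (counting delimiters in the first line) is an assumed heuristic. The one firm fact it relies on is that XLSX files are ZIP containers and therefore begin with the bytes `PK\x03\x04`.

```python
import csv
import io


def detect_format(raw: bytes) -> str:
    """Return 'xlsx', 'tsv', or 'csv' for an uploaded payload (hypothetical helper)."""
    # XLSX files are ZIP containers, which start with this signature.
    if raw[:4] == b"PK\x03\x04":
        return "xlsx"
    lines = raw.decode("utf-8", errors="replace").splitlines()
    first_line = lines[0] if lines else ""
    # Assumed heuristic: pick whichever delimiter is more frequent in the first line.
    return "tsv" if first_line.count("\t") > first_line.count(",") else "csv"


def parse_delimited(raw: bytes) -> list[list[str]]:
    """Parse a TSV or CSV payload into rows (records) of cells (fields)."""
    fmt = detect_format(raw)
    if fmt == "xlsx":
        raise ValueError("XLSX needs a spreadsheet reader; not covered by this sketch")
    delimiter = "\t" if fmt == "tsv" else ","
    text = raw.decode("utf-8", errors="replace")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    upload = b"geneId\tsymbol\nENSG00000168653\tNDUFS5\n"
    print(detect_format(upload))
    print(parse_delimited(upload))
```

Header removal and field labelling are left out here, because a useful heuristic for them depends on knowing which cells look like identifiers.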
Once fields are parsed, their values are aggregated and searched against our warehouse of multiple gene identifier data sources. Our current snapshot contains over 137 million unique gene, transcript, and protein identifiers and 92 million unique mapping pairs (Table 1). Despite the scale of the warehouse, the identifier source of a field can be determined accurately in real time (less than 1 s) using Bloom filters for fast approximate matching [11]. The top hits for each field are collected (along with sample values) and returned to the web interface so that users can verify the accuracy of the predicted identifier type.
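To make the Bloom filter idea concrete, here is a minimal Python sketch that builds one filter per identifier source and ranks sources by the fraction of a field's values each filter accepts. The class, the `classify_field` function, the filter size, the number of hash functions, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration; the source does not describe Databio's actual parameters or scoring.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


def classify_field(values, filters):
    """Rank identifier sources by the fraction of field values each filter accepts."""
    scores = {}
    for source, bloom in filters.items():
        hits = sum(1 for v in values if v in bloom)
        scores[source] = hits / len(values) if values else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy warehouse with two sources; the source names are illustrative only.
    ensembl, symbols = BloomFilter(), BloomFilter()
    ensembl.add("ENSG00000168653")
    symbols.add("NDUFS5")
    filters = {"ensembl_gene": ensembl, "gene_symbol": symbols}
    print(classify_field(["ENSG00000168653"], filters))
```

Because a Bloom filter can report false positives but never false negatives, a top-ranked source is a prediction that the user can confirm, which matches the verification step described above.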
In addition to the classification index, the Databio database also contains mappings that allow supported identifiers to be translated into other identifier types. Although this common task has been supported by other tools such as DAVID, UniProt, and BioMart [12][13][14], these tools require manual data manipulation and specialized knowledge of identifier sources, and they cannot replace identifiers within the context of the original data file [15]. Databio is able to translate identifiers in-place, removing multiple opportunities for error and keeping the data in context. These changes are applied to the existing data schema and exported to a CSV-format data set that can be readily imported into other tools for subsequent analysis (see bottom of Fig. 1).
Further easing the burden of data manipulation on the user, Databio is able to track important data quality issues such as missing identifiers and ambiguous mappings. The Databio warehouse maintains a record of publication and citation info for each identifier source, the last fetch and access dates, and analysis logs describing processing steps and data quality metrics. Using this information, Databio can establish that necessary metadata for publication, distribution, and reuse is present and accurately tracked. This ensures that data consumers know the state of a data set including access dates, citations, and relevant usage limitations.
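The in-place translation and quality tracking described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Databio's code: `translate_column` and `to_csv` are hypothetical names, the mapping is a plain dictionary from a source identifier to a list of target identifiers, and the policies of keeping the original value when no mapping exists and joining ambiguous targets with ";" are choices made only for this example.

```python
import csv
import io


def translate_column(rows, column, mapping):
    """Replace identifiers in one column, leaving every other cell untouched.

    Returns the new rows plus counts of translated, missing, and ambiguous values.
    """
    stats = {"translated": 0, "missing": 0, "ambiguous": 0}
    out = []
    for row in rows:
        new_row = list(row)
        targets = mapping.get(row[column], [])
        if not targets:
            stats["missing"] += 1  # assumed policy: keep the original value
        elif len(targets) > 1:
            stats["ambiguous"] += 1  # assumed policy: keep all candidates
            new_row[column] = ";".join(targets)
        else:
            stats["translated"] += 1
            new_row[column] = targets[0]
        out.append(new_row)
    return out, stats


def to_csv(header, rows):
    """Serialize the header and rows as CSV text for export."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Toy mapping; the second key and its targets are made-up placeholders.
    mapping = {
        "ENSG00000168653": ["NM_004552.3"],
        "TOY_GENE_MULTI": ["TOY_TX_1", "TOY_TX_2"],
    }
    rows = [["ENSG00000168653", "0.01"], ["TOY_GENE_MULTI", "0.20"], ["UNKNOWN", "0.50"]]
    translated, stats = translate_column(rows, 0, mapping)
    print(to_csv(["transcript", "p_value"], translated))
    print(stats)
```

Returning the counts alongside the data keeps the quality metrics attached to the same processing step that produced them, which is the property an analysis log needs.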

Usage
For example, a study identified 634 genes associated with Type 2 Diabetes Genome-Wide Association Study loci [16], and provided the results in a Supplementary Table (see top, Fig. 1). We want to look for relationships between the RefSeq Transcript sequences of the genes and the listed loci. However, searching for 'ENSG00000168653' in RefSeq currently yields no results, and the gene symbol 'NDUFS5' returns 19 human results. One must therefore translate the gene identifiers into more specific RefSeq Transcript IDs.
Upon visiting the Databio site, the user is able to upload this Excel file (or any other TSV, CSV, or XLSX data file) even though it does not fit a pre-determined field layout. Column names (fields) are automatically parsed and identified for selection on the second page (see top, Fig. 1). Fields with high-quality automated classification are marked with a circle in the top right corner to indicate a high correspondence to a known Databio identifier source (for example, the blue box "geneId (GRCh37.66)" in Fig. 1). The user is then able to click on the field name that they want to remap. The exact match rate, as well as the percent coverage of the corresponding source dataset, is shown to the user under the 'Source Identifiers' header on the left.
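The two statistics shown under 'Source Identifiers' can be illustrated with a short Python sketch. The source does not define them precisely, so the definitions here are assumptions: the match rate is taken as the fraction of distinct uploaded values found in the source, and coverage as the fraction of the source's identifiers that appear in the upload. The function name `match_statistics` is hypothetical.

```python
def match_statistics(field_values, source_identifiers):
    """Return (match_rate, coverage) for one field against one identifier source.

    match_rate: share of distinct field values present in the source (assumed definition).
    coverage:   share of source identifiers present in the field (assumed definition).
    """
    unique_values = set(field_values)
    source = set(source_identifiers)
    matched = unique_values & source
    match_rate = len(matched) / len(unique_values) if unique_values else 0.0
    coverage = len(matched) / len(source) if source else 0.0
    return match_rate, coverage


if __name__ == "__main__":
    # Toy data: three of four uploaded values match a six-identifier source.
    uploaded = ["a", "b", "c", "z"]
    source = {"a", "b", "c", "d", "e", "f"}
    print(match_statistics(uploaded, source))
```

Under these definitions, a high match rate with low coverage would suggest a short gene list drawn from a large source, while high values for both would suggest the upload is close to a full dump of that source.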
We can see that for this example, even though the file did not explicitly mention the source of the gene identifiers, Databio determined them to be Ensembl Gene IDs. For other data sets, if the identifiers are more ambiguous (e.g. integers), the user can use the drop-down on the left to see the other matched identifier sources and find the most appropriate choice. The user can then choose the desired identifier type to map to, using the drop-down on the right, which offers an automatically generated list of identifier types that map to the original identifier source. Changing either the 'to' or 'from' drop-down selection automatically updates the display to show a sample of the original identifiers from the uploaded data and the associated remapped identifiers, so that the user can confirm expectations. Finally, the user may begin the translation processing, which leads to a new page including the remapped data file for download, statistics, text describing the methods and data sources used (with a bibliography), and analysis logs. This information is all available in a compressed ZIP archive, ensuring that important information is delivered together as one unit.
Table 1 (excerpt): HGNC Gene Names 1 42,050 [21]; OMIM Genes 1 16,197 [22]

Discussion
Databio automates and streamlines the process of gene identifier translation, enabling new approaches to data-driven discovery by lowering the burden of data manipulation and the need for prior knowledge of biomedical resources. Support for more identifier sources, more data formats, and chained identifier conversions (A → B → C) will greatly increase the utility of Databio across the life sciences. In addition, future computational analyses will build upon this base, enabling data set search based on related data contents and not just shared metadata. Together these improvements will enable future machine learning applications by removing the need for manual intervention in data import processes, shortening learning times and improving the pace of data-driven discovery.

Limitations
• Primarily gene-centric automated identifier detection. We are working to expand the data warehouse to include other data types. These methods will require further work to allow identification in the presence of noise or natural language (e.g. clinical reports).
• Cannot handle chained/multi-step conversions. For example, to translate from A to X when there is no direct mapping, manual translation to an intermediate value is necessary first (A to B, then B to X). This is likely unintuitive to new users, but it is an issue we hope to address in the future.
• Search methods currently scale linearly with search scope, i.e. as the data warehouse grows, so does the search time. We are working on algorithmic methods and data structures to address this limitation.
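For readers who need a multi-step conversion today, the manual two-step process (A to B, then B to X) amounts to composing two mapping tables. The Python sketch below shows that composition under simple assumptions: each mapping is a dictionary from one identifier to a list of targets, and `compose_mappings` is a hypothetical helper. It does not describe how Databio might implement chained conversions in the future.

```python
def compose_mappings(a_to_b, b_to_x):
    """Build an A -> X mapping by routing every A identifier through its B targets."""
    a_to_x = {}
    for a, b_values in a_to_b.items():
        x_values = []
        for b in b_values:
            for x in b_to_x.get(b, []):
                if x not in x_values:  # avoid duplicates, keep first-seen order
                    x_values.append(x)
        if x_values:  # identifiers with no route to X are left out
            a_to_x[a] = x_values
    return a_to_x


if __name__ == "__main__":
    # Identifiers taken from the example in the Introduction.
    ensembl_to_entrez = {"ENSG00000168653": ["4725"]}
    entrez_to_refseq = {"4725": ["NM_004552.3"]}
    print(compose_mappings(ensembl_to_entrez, entrez_to_refseq))
```

Note that ambiguity can grow at each step: if one A maps to several B values and each B maps to several X values, the composed mapping will list all of them, so the missing and ambiguous counts should be checked again after chaining.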