AutoClassWeb: a simple web interface for Bayesian clustering of omics data

Objective: Data clustering is a common exploration step in the omics era, notably in genomics and proteomics where many genes or proteins can be quantified from one or more experiments. Bayesian clustering is a powerful unsupervised algorithm that can classify several thousands of genes or proteins. AutoClass C, its original implementation, handles missing data and automatically determines the best number of clusters, but is not user-friendly.

Results: We developed an online tool called AutoClassWeb, which provides a simple, easy-to-use web interface for Bayesian clustering with AutoClass. Input data are entered as TSV files and quality controlled. Results are provided in formats that ease further analyses with spreadsheet programs or with programming languages such as Python or R. AutoClassWeb is implemented in Python and is published under the 3-Clause BSD license. The source code is available at https://github.com/pierrepo/autoclassweb along with detailed documentation.


Introduction
In biology, high-throughput technologies (notably in genomics and proteomics) enable identification and quantification of several thousands of genes or proteins in a single experiment. To analyze such a large amount of data, from one or more experiments, clustering algorithms are widely used unsupervised machine-learning methods that group genes or proteins with similar patterns. Bayesian clustering is one such algorithm, and one of its implementations, written in the C programming language (AutoClass C), was developed in 1996 at the NASA Ames Research Center [1,2]. The idea behind Bayesian clustering and the AutoClass algorithm is to find the classification that fits the data with the highest probability. The AutoClass algorithm provides some additional and interesting features: it handles missing data and automatically determines the best number of clusters.
AutoClass C has been used in a wide variety of applications, from clustering cells of the prefrontal cortex in rats and mice [3] to detecting body patterns in the common cuttlefish [4] (see also references [5] and [6] for a detailed list of applications). However, AutoClass C, originally developed by physicists, is not user-friendly: the program is accessible only through the command line, only 32-bit binaries are available, and result files are difficult to parse for subsequent analysis.
More than 10 years ago, Achcar et al. published AutoClass@IJM [5], a web interface for AutoClass C. This web service drastically simplified the use of AutoClass C and widened its adoption, especially in biology [3,7,8,9,10]. Unfortunately, this tool is no longer maintained, and its source code is not publicly available.
To continue to offer this powerful Bayesian clustering method to the community, we developed AutoClassWeb, a new easy-to-use open-source web interface for AutoClass C.

Main text
Implementation
AutoClassWeb uses AutoClassWrapper [6], a Python wrapper for the AutoClass C program. This wrapper facilitates the preparation and quality control of data, runs the actual classification, and finally prepares results in file formats that allow further analysis.
AutoClassWeb is written in Python [11] and uses the Flask library to build the web interface users interact with. For better reproducibility and sustainability, AutoClassWeb is packaged in a Docker image stored in the BioContainers [12] registry.
The web service itself has been designed to be user-friendly. There is no user authentication and, by default, results are kept for 30 days before being deleted. A comprehensive help page provides the guidance the user might need.
Using Docker technology, AutoClassWeb can be quickly deployed on a local machine or on a public web server. To this end and to reduce the installation burden, we provide two companion GitHub repositories with detailed instructions, for local (https://github.com/pierrepo/autoclassweb-app) and server installation (https://github.com/pierrepo/autoclassweb-server).

Data submission
The input data must be formatted as tab-separated values (TSV) files. The first line is a header containing the column names, which must be unique. The first column contains the names of the objects studied (e.g. protein or gene identifiers).
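The header requirements above can be checked before submission. The following is a minimal sketch using only the Python standard library; the function name `check_tsv_header` and the sample data are hypothetical illustrations and are not part of AutoClassWeb or AutoClassWrapper.

```python
import csv
import io


def check_tsv_header(handle):
    """Return the column names of a TSV file, or raise ValueError.

    Hypothetical helper: it only checks that a header line exists, that it
    holds a name column plus at least one data column, and that all column
    names are unique.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        raise ValueError("empty file: no header line")
    if len(header) < 2:
        raise ValueError("expected a name column plus at least one data column")
    duplicates = sorted({name for name in header if header.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicated column names: {duplicates}")
    return header


# Made-up example: first column holds gene identifiers, the second data line
# has an empty (missing) value for cond1.
sample = "gene\tcond1\tcond2\nYAL001C\t0.12\t-1.40\nYAL002W\t\t0.85\n"
columns = check_tsv_header(io.StringIO(sample))
print(columns)
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`.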
Missing data is supported and should be coded with an empty value (i.e. nothing). AutoClass supports three categories of data:
• real location: negative and positive values, such as position, elevation, microarray log ratio...
• real scalar: singly bounded real values, typically bounded below at zero (e.g. length, weight, age).
• discrete: qualitative data, for instance color, phenotype, name...
If the initial input dataset contains several data types (real scalar, real location, discrete), it is recommended to split the initial dataset into several datasets of homogeneous type and submit them in the input form (Figure 1 (A)).
For the data types real location and real scalar, the user can optionally specify an absolute and a relative error, respectively.
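Splitting a mixed dataset into homogeneous TSV files can be scripted. The sketch below assumes the user already knows the data type of each column; the mapping `COLUMN_TYPES`, the function `split_by_type` and the example rows are hypothetical and only illustrate the idea. Missing values are written as empty fields, as required above.

```python
import csv
import io

# Hypothetical mapping from column name to AutoClass data type.
COLUMN_TYPES = {
    "log_ratio": "real location",
    "length": "real scalar",
    "phenotype": "discrete",
}


def split_by_type(rows, name_column, column_types):
    """Group the columns of `rows` (a list of dicts) by data type.

    Returns a dict mapping each data type to TSV text holding the name
    column plus the columns of that type. Missing values (None) are
    written as empty fields.
    """
    tables = {}
    for data_type in sorted(set(column_types.values())):
        columns = [c for c, t in column_types.items() if t == data_type]
        buffer = io.StringIO()
        writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
        writer.writerow([name_column] + columns)
        for row in rows:
            values = ["" if row.get(c) is None else row[c] for c in columns]
            writer.writerow([row[name_column]] + values)
        tables[data_type] = buffer.getvalue()
    return tables


# Made-up example with one missing log ratio.
rows = [
    {"gene": "g1", "log_ratio": -0.5, "length": 1200, "phenotype": "wt"},
    {"gene": "g2", "log_ratio": None, "length": 830, "phenotype": "mutant"},
]
tables = split_by_type(rows, "gene", COLUMN_TYPES)
for data_type, text in tables.items():
    print(f"--- {data_type} ---")
    print(text, end="")
```

Each value of `tables` can then be saved to its own file and submitted in the matching field of the input form.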

Clustering
Upon submission, input data is quality checked and formatted to be usable by AutoClass C. The web interface provides a unique job name, a link to the status page and a quick summary of input data (toggled with the text Hide/show logs), as illustrated in Figure 1 (B).
The status page lists running, failed and completed runs with their respective identifier (Job name), creation date, status and running time (Figure 1 (C)).

Results
Once a job is completed, a green button allows the user to download the results of the clustering. Results are bundled in a zip archive with the following files (where xxx stands for the unique identifier of the job):
• xxx_autoclass_out.cdt and xxx_autoclass_out_withproba.cdt can be viewed with Java TreeView [13], a versatile viewer initially developed for microarray data. The file xxx_autoclass_out_withproba.cdt contains the probability for each object (gene or protein) to belong to each class.
• xxx_autoclass_out_stats.tsv contains means and standard deviations of distance between all classes.
• xxx_autoclass_out.tsv contains all the data with the class assignment and membership probabilities for all classes. This file is in the TSV format and can be easily parsed with spreadsheet programs such as Microsoft Excel or LibreOffice Calc, or programming languages such as R or Python.
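As an illustration of parsing the TSV results in Python, the sketch below counts how many objects fall in each class. The column names `main-class` and `main-class-proba` and the sample content are assumptions made for this example; the actual header of the results file should be checked before reuse.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a results file. The column names are assumptions
# and may differ from the real output.
sample = (
    "name\tcond1\tcond2\tmain-class\tmain-class-proba\n"
    "g1\t0.12\t-1.40\t1\t0.98\n"
    "g2\t\t0.85\t2\t0.75\n"
    "g3\t0.10\t-1.10\t1\t0.91\n"
)


def count_per_class(handle, class_column="main-class"):
    """Count the objects assigned to each class in a TSV results file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row[class_column] for row in reader)


sizes = count_per_class(io.StringIO(sample))
for class_id, n_objects in sorted(sizes.items()):
    print(f"class {class_id}: {n_objects} object(s)")
```

Class identifiers are kept as strings here, since `csv.DictReader` does not convert values.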

Performance
The AutoClass C algorithm has been designed to run on a single CPU. The running time depends exponentially on the size of the input dataset. Figure 2 illustrates the running time as a function of the input dataset size. Dataset size is expressed as the number of rows (usually genes or proteins) times the number of columns (features or properties of interest).
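The dataset size defined above can be computed from a TSV file before submission, to get a rough idea of where a dataset sits on the size axis. This is a minimal sketch; the function name `dataset_size` is hypothetical, and it assumes that the header line and the first column (object names) are not counted, which the text does not state explicitly.

```python
import csv
import io


def dataset_size(handle):
    """Return the number of data rows times the number of data columns.

    Assumption: the header line and the first column (object names) are
    excluded from the count.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_columns = len(header) - 1
    n_rows = sum(1 for row in reader if row)  # skip blank lines
    return n_rows * n_columns


# Made-up example: 3 genes measured in 2 conditions.
sample = "gene\tcond1\tcond2\ng1\t0.1\t0.2\ng2\t\t0.4\ng3\t0.5\t0.6\n"
size = dataset_size(io.StringIO(sample))
print(size)
```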

Conclusion
Data clustering is an essential step in most modern omics analyses. The AutoClass algorithm, while very powerful, is not widely used, mainly because its original AutoClass C implementation is difficult to use. AutoClassWeb provides an easy-to-use web interface for AutoClass C. The project is open source and packaged in a Docker image available in BioContainers for better reproducibility and sustainability.

Limitations
AutoClassWeb provides a convenient online service to cluster results from high-throughput experiments such as RNA-seq or mass spectrometry-based proteomics. However, we would like to point out that the processing time required to cluster data with AutoClass increases with the number of genes or proteins to be clustered. A parallel version of AutoClass C that potentially reduces the processing time has been published [14]. Unfortunately, the source code is not available, and the project has been discontinued.
Another limitation is that AutoClassWeb requires users to split input data by type (real location, real scalar or discrete), with special attention to real location and real scalar, which may sometimes be confused.

Figure 1 Views of AutoClassWeb. (A) Main page with the form to input data according to its type. (B) View after data input and quality check. (C) Status page.

Figure 2 Running time (in hours) as a function of the input dataset size. Dataset size is expressed as the number of rows (genes or proteins) times the number of columns (features or studied properties).