The rapid development of new sequencing technologies improves both the accuracy and the scope of many biological applications, such as the assembly of genomes, transcriptome (RNA) analysis, and ChIP-Seq (chromatin immunoprecipitation followed by next-generation DNA sequencing). Most of these applications execute read alignment as their first step. Sequence alignment is therefore the most important and fundamental part of almost all sequencing analysis applications.
New sequencing technologies in genomics generate enormous amounts of data at an ever lower cost per nucleotide. Manufacturers are constantly increasing output in terms of the number of reads and read length, while also working to improve read quality. While it took 10 years and over $3 billion to produce a first draft of the human reference genome (approximately 3 billion base pairs), the current generation of sequencing instruments can generate hundreds of billions of bases in only a few days. This output is projected to continue increasing dramatically over the next few years, roughly doubling every year, which is much faster than Moore's Law, the approximate rate of improvement in the semiconductor field over the past 40 or so years. For example, the latest sequencer from Illumina, the HiSeq 2000, can generate 25 Gb (gigabases) of sequence per day. In terms of price, the cost per base on the HiSeq is as much as 8 times lower than on the prior model, the GA sequencer. These, however, are still second-generation sequencing instruments. Third-generation single-molecule sequencing instruments are beginning to be introduced by Pacific Biosciences, offering longer reads at a much reduced reagent cost.
These massive genomic datasets create serious problems and challenges for popular alignment tools such as Bowtie, RMAP [3, 4], MAQ, BWA [6, 7], etc. The first challenge is performance. As the data grow, they take an increasing amount of time to compile, search, and analyze, and radically new approaches are required to ensure project scalability. The second issue is the enormous capital expense for equipment whose state-of-the-art half-life is typically six months. Both computing and sequencing technologies advance at a very fast pace; to keep up, biological organizations must spend heavily on replacing or updating devices.
In computer science, Cloud Computing has recently emerged as an evolutionary model that delivers storage and computing services as a utility. Cloud providers offer different computing services to users over the Internet. Cloud users pay only for the resources (computation, bandwidth, etc.) they actually consume, without worrying about maintenance expenses, provisioning resources for future needs, or availability and reliability issues. The price is based on the time and type of service. As a result, Cloud Computing services are a recent and very promising solution to the storage and computation issues in bioinformatics. With a Cloud solution, biologists no longer need to purchase and maintain powerful, high-capacity servers for their analyses. They can securely store their data in the Cloud with high availability, and can have thousands of powerful computers ready on demand to run their analyses. Nevertheless, users need some training to use the Cloud, and they also require a stable, high-speed Internet connection to the service providers.
The Cloud Computing solution, however, only provides the flexible and scalable infrastructure to deal with storage and computational issues. To achieve performance and scalability when processing huge amounts of data, we also need a suitable parallel programming model. Google has designed a parallel computing framework called MapReduce that scales efficiently to many thousands of commodity machines. These commodity machines form a cluster that can be accessed by users within an institution or rented over the Internet through utility computing services. The core idea of this framework is not new: map and reduce primitives have long existed in functional programming languages such as Haskell, Lisp, and Erlang.
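The functional-programming lineage mentioned above can be seen in miniature with Python's built-in `map` and `functools.reduce` (a single-process analogy only; the sequences and the length computation are illustrative, not part of any MapReduce framework):

```python
from functools import reduce

# The map primitive transforms every element independently; the reduce
# primitive folds the transformed values into a single result. These are
# the two ideas that the MapReduce framework generalizes to a
# distributed, fault-tolerant setting.
reads = ["ACGT", "ACGTACGT", "AC"]          # illustrative read sequences
lengths = list(map(len, reads))              # map: each read -> its length
total = reduce(lambda acc, n: acc + n, lengths, 0)  # reduce: sum lengths
print(lengths, total)                        # [4, 8, 2] 14
```

In the distributed framework, the same two roles are played by map and reduce tasks running on many machines, with the runtime handling data distribution and failure recovery.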
The basic idea of the MapReduce framework is shown in Figure 1. The data to be processed are divided into "input splits". Each split contains many records in a key-value pair structure <K, V>. The map blocks (code written by the developer to implement the application logic) transform these input key-value pairs into intermediate key-value pairs. The intermediate data are then sorted and grouped by key, so the input of each reduce block is a key together with a collection of values. The reduce blocks (also written by the MapReduce programmer) then produce the final results, likewise in the form of key-value pairs. One crucial feature that enables MapReduce to process huge amounts of data efficiently is that all map blocks run concurrently, as do all reduce blocks. There are two main phases, however: map and reduce. As the figure shows, all map tasks must finish before any reduce task can run.
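The data flow just described can be sketched as a toy single-machine simulation. This is a conceptual sketch only, not Hadoop code; the k-mer-counting map and reduce functions are hypothetical examples of application logic, loosely analogous to the per-read work an aligner's map tasks perform:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process simulation of the MapReduce data flow:
    map -> shuffle (group by key) -> reduce."""
    # Map phase: every input <K, V> record yields intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: sort/group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: each key plus its collected values -> a final pair.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

# Hypothetical application logic: count 3-mer occurrences across reads.
def map_fn(read_id, seq):
    k = 3
    return [(seq[i:i + k], 1) for i in range(len(seq) - k + 1)]

def reduce_fn(kmer, counts):
    return sum(counts)

reads = [("r1", "ACGTA"), ("r2", "CGTAC")]
print(run_mapreduce(reads, map_fn, reduce_fn))
# {'ACG': 1, 'CGT': 2, 'GTA': 2, 'TAC': 1}
```

Note that, as in the real framework, the reduce phase here cannot start until the map phase has produced all intermediate pairs, since grouping by key requires every value for a key to be available.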
There are many different implementations of the MapReduce framework, such as Hadoop, Phoenix, Disco, Mars, etc. In developing our tool, CloudAligner, we chose Hadoop http://hadoop.apache.org because it is open source (easy to fine-tune), written in Java (highly portable), and widely used in both academia and industry.
There have already been some initiatives toward this trend of using Hadoop, such as CloudBurst, SeqMapReduce, Crossbow, etc., and the results are very promising. These tools can provide better performance as well as web-based interfaces that are easier to use than the command-line interfaces of many existing tools.
In spite of these promising features, however, these Cloud-based applications do not significantly improve on existing functionality, nor do they offer the variety of user-friendly features and interfaces needed to popularize them. For instance, common functions often implemented in well-established on-premises alignment tools are bisulfite sequencing and paired-end mapping. These techniques are used to detect genome variations such as single nucleotide polymorphisms (SNPs) and large-scale structural variations, which are very important in biological analyses. CloudBurst, for example, supports neither of these features. It also does not support the FASTQ input format, a very common output format of current sequencers. In addition, its command-line interface is not very user-friendly. Another MapReduce-based tool, SeqMapReduce, is a performance-improved version of CloudBurst, but its website and code are inaccessible. Crossbow is read-mapping and SNP-calling software that runs in the Amazon EC2 cloud. It consists of a set of Perl and shell scripts that allow Bowtie and SOAPsnp to run on the Cloud. Crossbow has a very nice and friendly web interface created with the aid of JotForm, a web-interface creation tool. However, since its biological functionality depends entirely on other tools (Bowtie and SOAPsnp), it inherits their shortcomings as well. For example, Bowtie allows at most three mismatches in its mapping and was designed only for short reads; Crossbow therefore cannot improve or fine-tune the core alignment algorithms.
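To see concretely what a hard mismatch cap such as Bowtie's means, consider a naive brute-force sketch of mismatch-limited read mapping. This is purely illustrative: Bowtie itself uses an FM-index rather than this linear scan, the cap value and sequences below are made up, and the sketch handles substitutions only, not indels:

```python
def count_mismatches(read, ref_window):
    """Hamming distance between a read and an equal-length
    reference window (substitutions only, no indels)."""
    assert len(read) == len(ref_window)
    return sum(a != b for a, b in zip(read, ref_window))

def mapping_positions(read, reference, max_mismatches=3):
    """Scan every reference window of the read's length and report
    the positions where the read aligns with at most
    `max_mismatches` substitutions."""
    n, m = len(reference), len(read)
    return [i for i in range(n - m + 1)
            if count_mismatches(read, reference[i:i + m]) <= max_mismatches]

# With a cap of 1 mismatch, only position 2 (window "ACGA") qualifies.
print(mapping_positions("ACGT", "TTACGATT", max_mismatches=1))  # [2]
```

A read whose true origin differs from the reference by more substitutions than the cap, which becomes increasingly likely as reads get longer, is simply reported as unmapped; this is why a fixed cap limits a tool to short reads.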
Consequently, we developed CloudAligner to address these limitations of existing tools and to advocate a Cloud- and MapReduce-based solution for genomic problems. In particular, CloudAligner is designed to achieve better performance, support longer reads, and scale extremely well. It offers more of the common functions, such as bisulfite (BS) and paired-end mapping, provides a friendly user interface, and supports more input and output formats.