The assembly process consists of four major phases. In the first phase, the de Bruijn graph is straightforwardly constructed. The second phase (called the cleaning process) is a very important step in which the graph is simplified as much as possible by collapsing paths, removing tips and solving bubbles, as well as handling a few other different structures in the graph. In the third phase the graph components are detected before starting the assembly algorithm in the fourth step.
Our algorithm differs from previous works in the following ways:
-
1.
The cleaning process simplifies the graph by a few iterations without incorporating time-consuming algorithms, such as the Dijkstra-like breadth-first search in Velvet [9, 10] and the Dijkstra algorithm in SOAPdenovo [14].
-
2.
An algorithm was created to solve only simple bubbles (Figure 1), but by involving other algorithms (i.e. paths collapsing, tips, etc.) all complex bubbles are solved after a few iterations of the cleaning algorithm.
-
3.
The assembly algorithm uses the frequency values and lengths of k-mers in order to construct contigs as will be described below.
Most de novo assemblers focus on solving large genomes; this involves implementing time-consuming and very complicated algorithms. As a result, the construction of contigs becomes stricter, though this is not the case for small genomes, as shown in the results section.
Input data and graph construction
The entire dataset of k-mers is recorded using hash tables in order to speed up further operations. The reverse complements are also recorded without binding them with their original k-mers. All we need is a linear algorithm for constructing the de Bruijn graph. Since the alphabet is composed of four nucleotide letters, each k-mer will be connected to four k-mers at most. All k-mers that include unknown ‘N’ nucleotides are discarded. The pseudo-code of the algorithm is shown below:
-
1.
deBruijnGraphBuilder( HashTable kmerList , integer K )
-
2.
Integer N :=|kmerList|; //the size of kmerList
-
3.
String temp;
-
4.
for i:=1 to N do
-
5.
begin
-
6.
temp := kmerList[i][1..K−1];
-
7.
//forward connection
-
8.
if temp+“A” kmerList then createArc( i, kmerList.IndexOf(temp+“A”));
-
9.
if temp+“T” kmerList then createArc( i, kmerList.IndexOf(temp+“T”));
-
10.
if temp+“C” kmerList then createArc( i, kmerList.IndexOf(temp+“C”));
-
11.
if temp+“G” kmerList then createArc( i, kmerList.IndexOf(temp+“G”));
-
12.
//backward connection
-
13.
if “A”+ temp kmerList then createArc(kmerList.IndexOf(“A”+ temp), i );
-
14.
if “T”+ temp kmerList then createArc(kmerList.IndexOf(“T”+ temp), i );
-
15.
if “C”+ temp kmerList then createArc(kmerList.IndexOf(“C”+ temp), i )
-
16.
if “G”+ temp kmerList then createArc(kmerList.IndexOf(“G”+ temp), i);
-
17.
end
Let K be the length of the short reads. The variable temp will contain the first prefix of a given K-mer whose length is K − 1. The algorithm computes the out-neighbours in the forward orientation, and the in-neighbours in the opposite direction.
Cleaning process (simplifying the graph and solving errors)
The raw DNA data always suffer from errors, and since the de Bruijn graph is based on the exact matching of k-mers, error correction (or removal) becomes very important to the use of such graphs in representing and analyzing sequencing data. The coverage plays a vital role in guiding the cleaning and assembly algorithms to a more accurate result. After constructing the graph, some erroneous k-mers appear in the graph in different forms. The most common forms are the so-called “Tips, Bubbles and Chimeric connections”. However, while analyzing the graph, we found other forms as well. We have implemented an iterative algorithm that reduces the graph to its maximum simplification. The pseudo-code of the algorithm is shown below and its flowchart is given in Figure 2.
-
1.
cleaningAlgorithm
()
-
2.
Boolean
col, bub, intip, outip, less, great;
-
3.
Begin
-
4.
do
-
5.
col := collapsePaths();
-
6.
bub := solveBubbles();
-
7.
if
col==false
and
bub==false
then
-
8.
begin
-
9.
intip := removeInTips();
-
10.
outip := removeOutTips();
-
11.
less := removeLessMarkTips();
-
12.
great := removeGreatMarkTips();
-
13.
if
intip==false
and
outip==false
and
-
14.
less ==false
and
great==false
then
stop;
-
15.
end
-
16.
while
(true)
-
17.
removeSingletons();
-
18.
End
The collapsePaths() procedure will return false if it does not collapse any path, otherwise, it returns true. The other procedures behave exactly as collapsePaths() does. We will hereafter explain each procedure invoked by the cleaning algorithm.
Path collapsing
To simplify and shrink the graph before applying any cleaning procedure, a path collapsing algorithm should be run immediately after constructing the graph.
A path is a chain of nodes. Two nodes X and Y are merged if the node X has only one outgoing arc connected to the node Y that has only one incoming arc. Their corresponding k-mers must be concatenated accordingly. Most of the resulting nodes (we call them switch nodes) are seen in Figure 3.
Bubble solving
In genome assembly, a bubble appears where two sequences initially align, then diverge in the middle, and align again at the end. Bubbles are caused by repeats or heterozygotes of diploid chromosomes [14], or created by errors or biological variants, such as SNPs, diploids or cloning artefacts prior to sequencing.
A path is a chain of nodes in a graph. We call a path a simple path if each internal node (i.e., each node between the start node and the end node of the path) has one outgoing edge and one incoming edge. A bubble is a subgraph that consists of multiple simple paths all of which share the same start node and the same end node. In the original graph, the start node must not have any outgoing edges other than those in the bubble, and the end node must not have any incoming edges other than those in the bubble.
In Velvet [9, 10], detection of bubbles was done by an algorithm based on a Dijkstra-like breadth-first search called “The Tour Bus Algorithm”. Similarly, Dijkstra’s algorithm is also used to detect bubbles in SOAPdenovo [14], in which the detected bubbles are merged into a single path if the sequences of the parallel paths are very similar; that is, had fewer than four base pairs difference with more than 90% identity.
In Arapan-S, all bubbles will be relaxed by combining all the cleaning procedures and without incorporating a time-consuming algorithm. After collapsing all paths, bubbles will appear in the graph as shown in Figure 1. The node with a high coverage will not be removed from the bubble (However, the algorithm can also be parameterized to keep only the node that has the maximum k-mer’s length instead of high coverage).
Tips removal
Tips generally result from errors at the end of reads. In the graph, a tip is a node connected only on one end (Figure 4). In Velvet, a tip is removed if it is shorter than 2 k (k is chosen for the k-mer). After removing tips, new paths will appear again in the graph. Almost all the remaining nodes’ degrees are ≥ 2. We will hereafter call such nodes: switch nodes. The result of the cleaning process will be similar to what is shown in Figure 5.
Connected components detection
Once the graph is reduced and contains only switch nodes, we start determining the connected components of the graph. There are two cases in which we need to determine the connected component. The first case is the nature of the k-mers and their reverse complements. Since each k-mer was recorded along with its reverse complement, we will obtain a graph composed of two subgraphs, one being the reverse of the other. The second case is the sparseness of the graph, especially when the initial k-mer length is a bit longer. Our assembly algorithm can run on every connected component of the graph. Detection of these components can lead the assembly algorithm to be run in parallel. The breadth-first search or depth-first search can be applied to find the connected components in linear time. The search begins at an arbitrary node v from which the entire connected component including v will be detected. A loop through all nodes of the graph must be implemented in order to find all the connected components. The loop runs until no visited node can be found. The pseudo-code of the modified algorithm is shown as follows:
-
1.
connectedComponent(VertexSet
V
, EdgeSet
E
, Node
a
)
-
2.
Set
X;
-
3.
Boolean
visited[|V|];
-
4.
//Step 1
-
5.
-
6.
-
7.
//Step 2
-
8.
while
do
-
9.
begin
-
10.
-
11.
-
12.
end
-
13.
return
X;
The idea of this algorithm is to traverse the graph from an arbitrary node a, mark it as a visited node and record its neighbors in the set X. The same job is done for the recorded nodes until there are no visited nodes in the set X. The algorithm returns the connected component engendered from the node a. To find all connected components we apply the following algorithm:
-
1.
allComponents(VertexSet
V
, EdgeSet
E
)
-
2.
SetList
C;
-
3.
Set
X’ ;
-
4.
Integer
i;
-
5.
//Step1
-
6.
-
7.
-
8.
//Step 2
-
9.
while
do
-
10.
begin
-
11.
select an arbitrary x∈X’;
-
12.
-
13.
-
14.
-
15.
end
-
16.
return
C;
We only need to select an arbitrary node x and determine, due to the connectedComponent() procedure, the connected component C
i
having x. The determined component’s nodes will be removed from the X’ (Line 14). The same operation is performed until no connected components can be detected.
Assembly algorithm
Once the connected components are detected, we run the assembly algorithm for each component. The assembly algorithm can be run by using one of two parameters: the coverage (k-mer’s frequency), and the k-mer lengths. The latter parameter is obtained by the cleaning process, which provides us with switch nodes whose corresponding k-mers have longer lengths due to the merging process.
Most of the previous work on genome assembly has the following assumption: given a set of reads, the objective of the assembly program is to minimize the length of the assembled genome [18]. However, according to our knowledge, there is no proof that the shortest path can always faithfully represent the genome. The same can be concluded concerning the longest path, the Hamiltonian path and the Eulerian path.
The assembly algorithm is a greedy function. It traverses the graph by selecting only the nodes whose frequency values are higher. We have chosen this strategy by assuming that k-mers, which are characterized by high frequency values, are more likely to be free of sequencing errors (we call it “frequency function”). All procedures of the assembly algorithm are given as follows:
-
1.
stringPath( Set
C
)
-
2.
Ordered Set
path;
-
3.
Set
P, Visited;
-
4.
Node
u, v;
-
5.
//Step1: preprocessing
-
6.
u := the index of the node which have the longest k-mer length.
-
7.
-
8.
-
9.
Visited: =Ø
-
10.
//Step 2: forward direction
-
11.
do forever
-
12.
begin
-
13.
P := out_neighbors(u) − Visited;
-
14.
;
-
15.
if P= Ø then stop;
-
16.
u := bestNeighbor(u, P);
-
17.
-
18.
End
-
19.
//Step 3: backward direction
-
20.
do forever
-
21.
begin
-
22.
P := in_neighbors(v) − Visited;
-
23.
-
24.
if
P=Ø
then
stop;
-
25.
v := bestNeighbor(v, P);
-
26.
-
27.
end
-
28.
return
path;
The set C represents a connected component of the graph. The resulting path is kept in the ordered set path. After variables initialization, the algorithm goes in a forward direction selecting the best out-neighbors. In the last step, it goes backwards selecting the best in-neighbors. The bestNeighbor() function is the current node and the set of its in- or out-neighbors. Since each node could be connected to several neighbouring nodes, the best neighbor is characterized by the highest frequency value. The two loops stop when no more exploration can be done. To find all possible paths, we apply the following algorithm, called the stringPath() algorithm.
-
1.
allPaths()
-
2.
SetList C; //components list
-
3.
SetList P; //paths list
-
4.
Integer i;
-
5.
//Step 1
-
6.
-
7.
//Step 2
-
8.
for
i
:= 1 to |
C
| do
-
9.
begin
-
10.
-
11.
end
-
12.
return P;
By going through all connected components (determined by the allComponents() procedure), and due to the previous algorithm, a path P
i
will be constructed for each connected component C
i.
Availability and requirements
Arapan-S is open access and freely available. All questions, comments and requests should be sent by email to nihon.sahli@gmail.com.
Project name: Arapan project
Project home page: http://shibuyalab.hgc.jp/Arapan/
Operating system(s): Windows, Linux (Ubuntu)
Programming language: C/C++
Other requirements: None
License: None required
Any restrictions to use by non-academics: None required