Genome-wide detection of hybrid genes with multiple components in human

Background Previous studies showed that gene hybrid is one of the principal processes for generating new genes. Although some gene hybrid events have been reported within or between species, a well-organized method for large-scale detection of events with multiple components is still lacking. Hence, in this study, we focus on building an efficient method for exploring all candidate gene hybrid events in the human genome and on providing useful results for further study.
Findings We have developed a method, designated the Triad Comparison Algorithm (TCA), to detect all potential N-hybrid events (i.e., an N-hybrid gene and its N non-overlapping component regions derived from N different genes) in the human genome. The results reveal that there are many convoluted N-hybrid events with multiple components (N > 2) and that the most complicated N-hybrid genes detected in human by TCA are composed of six component regions. Interestingly, our results show that most of the hybrid events belong to the 3-hybrid category. Furthermore, we observe that a single gene may participate in different events. Twelve genes were found to have dual identities in different N-hybrid events (i.e., they were identified as hybrid genes as well as component genes). This indicates that, to a certain extent, the gene hybrid mechanism has generated new genes during the evolutionary history of the human genome.
Conclusion An efficient method, TCA, has been developed for exploring all candidate hybrid genes in the human genome, and it provides useful results for evolutionary analysis. The advantage of TCA is its power to detect any kind of hybrid event in any species with a large genome.

The expectation value (E-value) is the number of different alignments with scores equivalent to or better than S (the score of an alignment pair that cannot be improved by extension or trimming) that are expected to occur in a database search by chance. The lower the E-value, the more significant the alignment score.
The relatedness threshold for the expectation value (E < $10^{-10}$) was chosen according to the identification of duplicated genes [2]. The identity threshold for the overlapping segment was chosen based on the distribution of human-mouse ortholog similarities [3]. As a strict requirement for all N-hybrid genes, we also require that the different components of a hybrid gene cannot be aligned with each other in the BLAST report (i.e., the contributed components come from an unrelated gene pair, (B, C) in the above case).
The purifying criteria will be automatically applied after detecting an N-hybrid gene.
If two genes have an E-value lower than the threshold and pass the filtering criteria, they are called "related".
The algorithm for detecting a "triad" is described as follows. Let $R(i, j)$ be the relatedness function between genes i and j, defined by

$R(i, j) = 1$ if genes i and j are related, and $R(i, j) = 0$ otherwise.

With this function, the necessary and sufficient condition for a triad "j-i-k" is $R(i, j) = 1$, $R(i, k) = 1$, and $R(j, k) = 0$, for three arbitrary genes i, j, and k.
Note that $R$ is symmetric (i.e., $R(i, j) = R(j, i)$). Again, the middle gene is the 2-hybrid gene with two other component genes. The order of a triad (e.g., "j-i-k" versus "k-i-j") depends on the order of the contributed components in the middle hybrid gene.
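The triad condition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `find_triads` and the toy gene labels are hypothetical, the input is assumed to be a list of gene pairs already judged "related", and the sketch ignores the positions of the component regions within the hybrid gene (so it neither checks non-overlap nor orders the two components).

```python
from itertools import combinations


def find_triads(related_pairs):
    """Return triads (j, i, k) where i is related to both j and k,
    while j and k are not related to each other.

    Hypothetical helper: `related_pairs` is assumed to hold gene pairs
    that already passed the relatedness criteria.
    """
    neighbors = {}
    for a, b in related_pairs:
        # Relatedness is symmetric, so record both directions.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triads = []
    for i in sorted(neighbors):
        # Every pair of genes related to i is a candidate component pair.
        for j, k in combinations(sorted(neighbors[i]), 2):
            if k not in neighbors[j]:  # the two components must be unrelated
                triads.append((j, i, k))
    return triads


# Toy input: A is related to B, C, and D; C and D are related to each other.
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]
triads = find_triads(pairs)
```

With this toy input, gene A is the middle gene of the triads "B-A-C" and "B-A-D", while "C-A-D" is rejected because C and D are related.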

N-hybrid events detection by TCA
A hybrid gene with N component genes (N > 2) is called an "N-hybrid" gene. Following the same idea as in "triad" detection, we again use the relatedness function between gene pairs, $R(i, j)$, which equals 1 if genes i and j are related and 0 otherwise. For each gene i, we collect all other genes j with $R(i, j) = 1$ and find the appropriate N components from this list. Hence, the generalized necessary condition for an N-hybrid event with hybrid gene i and component genes $j_1, \dots, j_N$ is $R(i, j_m) = 1$ for all m and $R(j_m, j_n) = 0$ for all $m \neq n$. The number N is increased one at a time until no further N-hybrid events can be detected.
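The incremental search over N can be illustrated with a brute-force sketch. The names `build_neighbors`, `n_hybrid_events`, and `all_events` are hypothetical, the input is assumed to be a list of already "related" gene pairs, and exhaustive enumeration is used only for clarity; it would not scale to a genome-wide matrix without further pruning, and it does not check the positions of the component regions.

```python
from itertools import combinations


def build_neighbors(related_pairs):
    """Map each gene to the set of genes it is related to (symmetric)."""
    neighbors = {}
    for a, b in related_pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def n_hybrid_events(related_pairs, n):
    """List (hybrid, components) where all n components are related to the
    hybrid gene and pairwise unrelated to each other."""
    neighbors = build_neighbors(related_pairs)
    events = []
    for hybrid in sorted(neighbors):
        for comps in combinations(sorted(neighbors[hybrid]), n):
            if all(b not in neighbors[a] for a, b in combinations(comps, 2)):
                events.append((hybrid, comps))
    return events


def all_events(related_pairs):
    """Increase N one at a time until no N-hybrid event is found."""
    found = {}
    n = 2
    while True:
        events = n_hybrid_events(related_pairs, n)
        if not events:
            break
        found[n] = events
        n += 1
    return found


# Toy input: A is related to B, C, and D, which are mutually unrelated.
found = all_events([("A", "B"), ("A", "C"), ("A", "D")])
```

Stopping at the first N with no events is safe in this sketch because every (N+1)-hybrid event contains N-hybrid events as subsets.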
In simplified notation, we use an N-polygon to represent a detected N-hybrid event (i.e., the N-hybrid gene is in the center, surrounded by its N component genes).
Figure 2. Based on the UCSC-identified transcripts (38,086 different transcripts), we first use BLAST2 alignment to generate a 38,086 × 38,086 alignment score matrix, which contains the alignment results of 38,086 × (38,086 − 1)/2 different transcript pairs. The recorded alignment score of a transcript pair, which includes the E-value (expectation value), identity, and alignment length, is used as the criterion to retrieve candidates from all the transcript pairs. We only consider transcript pairs whose alignment score satisfies the following criteria: E-value < $10^{-10}$, identity > 70%, and alignment length > 50 bp.
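The filtering step can be sketched as follows. The `Hit` record, its field names, and the toy values are assumptions made for illustration and do not correspond to any specific BLAST output format; only the three thresholds are taken from the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one pairwise alignment result."""
    query: str
    subject: str
    evalue: float
    identity: float  # percent identity of the aligned segment
    length: int      # alignment length in bp


def passes_criteria(hit, max_evalue=1e-10, min_identity=70.0, min_length=50):
    """Apply the three thresholds stated in the text (all strict)."""
    return (hit.evalue < max_evalue
            and hit.identity > min_identity
            and hit.length > min_length)


# Number of unordered transcript pairs among 38,086 transcripts.
n_transcripts = 38086
n_pairs = n_transcripts * (n_transcripts - 1) // 2

# Toy hits: only the first one satisfies all three criteria.
hits = [
    Hit("A", "B", 1e-20, 85.0, 120),
    Hit("A", "C", 1e-5, 90.0, 200),   # E-value too high
    Hit("B", "D", 1e-30, 65.0, 300),  # identity too low
    Hit("C", "D", 1e-15, 75.0, 50),   # length not strictly above 50 bp
]
related = sorted({tuple(sorted((h.query, h.subject)))
                  for h in hits if passes_criteria(h)})
```

Storing each pair in sorted order reflects the symmetry of the comparison, so a pair is kept once regardless of which transcript was the query.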

For example, for N = 3, the graph representation is a triangle with the 3-hybrid gene in the center and its three component genes at the vertices.
A total of 688 transcript pairs (202 different transcripts) were identified under the above criteria. We then extracted 796 triads (2-hybrid events), i.e., one hybrid gene derived from two component genes, from these 688 alignable transcript pairs. Figure 1A shows an example of a triad "B-A-C", which is composed of a 2-hybrid gene (gene A) and two non-alignable component genes (genes B and C). At this stage, for each N-hybrid event, N × (N − 1)/2 triads are selected by this detection. For example, for the 3-hybrid event shown in Fig. 1B, three triads, "B-A-C", "B-A-D", and "C-A-D", are identified in the process.
Subsequently, we extract 2-, 3-, …, N-hybrid (N ≥ 2) events from the 796 triads identified. In this study, we identify 438 cases of 2-hybrid, 701 cases of 3-hybrid, 105 cases of 4-hybrid, 34 cases of 5-hybrid, and 14 cases of 6-hybrid events (Table 1) using the alignment parameters of E-value < $10^{-10}$, identity > 70%, and alignment length > 50 bp between the component regions of N-hybrid genes. Subsets of a 3-hybrid event, such as "B-A-C", "B-A-D", and "C-A-D", must be excluded from the counts of 2-hybrid events. In addition, the case illustrated in Figure 1C is regarded as two 2-hybrid events ("B-A-C" and "B-A-D") but not as a 3-hybrid event, because genes C and D are alignable.
Under the same alignment parameters of E-value and identity, the distributions for different alignable lengths (60~150 bp) are also listed in Table 1. We find that the number of N-hybrid events does not strictly decrease as N increases, because the majority of cases belong to N = 3 (Table 1). This tendency holds regardless of the alignable-length parameter.
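The N × (N − 1)/2 relation between an N-hybrid event and its triads can be checked with a short sketch. The function name `triads_of_event` and the gene labels are hypothetical; the triads are written in the "component-hybrid-component" form used in the text.

```python
from itertools import combinations


def triads_of_event(hybrid, components):
    """List the triads contained in one N-hybrid event: one triad per
    unordered pair of component genes, with the hybrid gene in the middle."""
    return [f"{a}-{hybrid}-{b}" for a, b in combinations(components, 2)]


# The 3-hybrid example from the text: hybrid gene A with components B, C, D.
triads = triads_of_event("A", ["B", "C", "D"])

# A 6-hybrid event contains 6 * 5 / 2 = 15 triads.
six_hybrid_triads = triads_of_event("A", ["B", "C", "D", "E", "F", "G"])
```

For the 3-hybrid example this yields "B-A-C", "B-A-D", and "C-A-D", which is why such subsets have to be excluded when counting 2-hybrid events.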
We also find that, in most cases, the number of N-hybrid events generally decreases as the length of the component regions increases (Table 1).