Comparative sequence and structure analysis
Figure 1 compares the number of precursor miRNAs (pre-miRNAs) predicted in the three non-human primates when filtering with different minimum thresholds of sequence identity. At a threshold of 85 percent, the number of pre-miRNAs begins to converge and is nearly the same in all genomes. With higher thresholds, the total number of predictions decreases while the differences between species increase. Only 40.8 percent of the precursors are 100 percent identical between human and orangutan (ppy), compared to 60.3 percent between human and chimpanzee (ptr). Notably, above 95 percent identity the number of homologs found in orangutan is closer to that of rhesus macaque (mml) than to that of chimpanzee. This could mean that, at the miRNA level, the evolutionary distances of orangutan and rhesus macaque to human are in fact more similar to each other. It remains unclear, however, to what extent these figures are influenced by differences in sequencing quality between the non-human primate genomes.
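The threshold comparison described above can be illustrated with a minimal Python sketch. It assumes that percent sequence identity is computed column-wise on a pairwise alignment and that gap columns never count as matches; the actual definition is given in Section Methods and may differ. The function names and the toy identity values are hypothetical and are not data from this study.

```python
from typing import Dict, Iterable, Sequence


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same base.

    Assumption of this sketch: gap columns ('-') never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def count_at_thresholds(
    identities: Iterable[float], thresholds: Sequence[float]
) -> Dict[float, int]:
    """Number of candidate homologs whose identity meets each minimum threshold."""
    values = list(identities)
    return {t: sum(1 for v in values if v >= t) for t in thresholds}


# Toy identities for four hypothetical candidates (not data from this study).
toy_identities = [100.0, 98.0, 90.0, 84.0]
counts = count_at_thresholds(toy_identities, [85, 95, 100])
```

Raising the threshold can only keep or shrink the count for each species, which is the monotone decrease visible in Figure 1.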
miRNA homologs with perfect (100 percent) or near-perfect (around 98 percent) sequence identity can be assumed to be functional, as their human counterparts are. Candidate sequences with lower identity that still exceeds 85 percent, which is the case for 38 percent of the precursor homologs found in orangutan, require the verification of additional miRNA features.
One important aspect is how well the structure of a miRNA precursor is preserved. Even a few nucleotide mutations can cause large structural changes or disrupt the hairpin structure completely (see below). Figure 2 plots the frequency distributions of precursor homologs over percent structure identity. As defined in Section Methods, this value is calculated between the predicted secondary structure sequences. The vast majority of structures meet a minimum requirement of 80 percent identity. For primate species more closely related to human, the distribution is shifted further towards 100 percent. In chimpanzee, 73.0 percent of the precursor homologs show no change in structure compared to human, for the most part because the sequence is 100 percent identical (see Figure 1). The corresponding figures are only 52.7 percent in orangutan and 42.5 percent in rhesus macaque. Again, the distributions of orangutan and rhesus macaque are more similar to each other than those of orangutan and chimpanzee.
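As an illustration of how such a structure comparison might be computed, the following Python sketch compares two secondary structure sequences position by position. It assumes that the structures are given in dot-bracket notation and have equal length; the actual definition of structure identity is the one in Section Methods and may differ. The hairpin check is likewise a simplified stand-in for illustration, and both function names are hypothetical.

```python
def structure_identity(struct_a: str, struct_b: str) -> float:
    """Percent of positions with the same dot-bracket symbol.

    Assumption of this sketch: both structure strings have equal length.
    """
    if len(struct_a) != len(struct_b):
        raise ValueError("structure strings must have equal length")
    if not struct_a:
        return 0.0
    same = sum(1 for x, y in zip(struct_a, struct_b) if x == y)
    return 100.0 * same / len(struct_a)


def is_single_hairpin(struct: str) -> bool:
    """True if the structure is balanced, has at least one base pair,
    and no '(' occurs after the first ')', i.e. there is a single stem-loop."""
    depth = 0
    pairs = 0
    seen_close = False
    for symbol in struct:
        if symbol == "(":
            if seen_close:
                return False  # a second stem opens after the first one closed
            depth += 1
        elif symbol == ")":
            seen_close = True
            depth -= 1
            pairs += 1
            if depth < 0:
                return False  # unbalanced closing bracket
    return depth == 0 and pairs > 0
```

In this toy setting, shortening a stem by one base pair changes two of twelve positions, so the structure identity drops to about 83 percent while the single stem-loop is still intact.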
Figure 3 shows the frequency distributions of nucleotide mismatches (including deletions and insertions) in the mature subsequence. First, it demonstrates that, with few exceptions, the number of single-nucleotide mutations in mature miRNA homologs ranges only from 0 to 2. Second, the majority of mature sequences (89.4/79.1/75.7 percent for ptr/ppy/mml) are identical to their human counterparts. Mutations in the mature region are the most important for explaining phenotypic differences between humans and other primates. Apart from requiring 100 percent conservation of the seed region, we do not further limit the number of mutations in the mature sequence. Because miRNA-target binding shows near-perfect complementarity, the seed is less variable, and mutations there are less likely to be under positive selection.
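The two criteria applied to the mature sequence can be sketched in Python as follows. The sketch assumes that the seed comprises nucleotides 2-8 of the mature miRNA, a common convention that may differ from the definition in Section Methods, and that mismatches are counted column-wise on a gapped pairwise alignment. The function names and the toy sequences are hypothetical.

```python
# 0-based slice bounds for nucleotides 2-8 of the mature miRNA.
# Assumption: this common seed convention, not necessarily the one in Methods.
SEED_START, SEED_END = 1, 8


def count_mismatches(aligned_human: str, aligned_homolog: str) -> int:
    """Number of differing alignment columns; gap columns ('-') count as
    insertions or deletions."""
    if len(aligned_human) != len(aligned_homolog):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(aligned_human, aligned_homolog) if x != y)


def seed_conserved(human_mature: str, homolog_mature: str) -> bool:
    """True if the assumed seed region is identical in both ungapped sequences."""
    return human_mature[SEED_START:SEED_END] == homolog_mature[SEED_START:SEED_END]


# Toy sequences for illustration only (not taken from the study).
toy_human = "UGAGGUAGUAGGUUGUAUAGUU"
toy_outside_seed = "UGAGGUAGUAGGUUGUACAGUU"  # one substitution outside the seed
toy_inside_seed = "UGAAGUAGUAGGUUGUAUAGUU"   # one substitution inside the seed
```

Both toy homologs carry exactly one mismatch, but only the first would pass a filter that requires 100 percent seed conservation.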
As an example, Figure 4 shows the multiple sequence alignment for pre-miRNA mir-618 together with the corresponding secondary structure sequences and stem-loop structures. Absolute distances to the human version (hsa) are given as the number of point mutations. In the case of orangutan, only a few nucleotide mutations result in a considerably higher number of local changes in the structure sequence. The global structure is, however, much less affected: the stem region, including the mature sequence, is the same as predicted for chimpanzee. In the case of rhesus macaque, on the other hand, the basic stem-loop structure is preserved despite many nucleotide mutations.
MicroRNA gene identification in orangutan
Lists of positive and negative predictions from our analyses are provided in the supplementary material (see Section Additional files). Additional file 1 contains all 605 homologs of human precursor miRNAs found for orangutan, including 77 sequences that are already known (i.e., listed in miRBase). A further 18 homologs are identical to, or overlap in genome location with, another miRNA. This leaves 510 newly discovered miRNAs in total (605 - 77 - 18 = 510).
In addition to known miRNAs, Additional file 1 marks candidates that pass various other filters (see Section Methods). This allows a flexible combination of filtering criteria, including those derived from the precursor structure or the mature sequence. Of the 605 orthologs, 526 remain after applying the hairpin filter and 494 after the seed filter. We also make use of the detection of a miRNA homolog in more than one non-human primate species. This improves the reliability of predictions and helps to reduce the effect of possible sequencing errors. In our setup, 499 human precursor miRNAs have a homologous sequence in all three primates; 563 are conserved in both chimpanzee and orangutan, and 530 are shared between orangutan and rhesus macaque.
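The cross-species counts reported here correspond to set intersections over the homologs detected per genome. A minimal Python sketch, assuming that each species is represented by the set of human miRNA identifiers for which a homolog was found; the identifiers, the toy sets, and the function names are hypothetical and serve only as illustration.

```python
from typing import Dict, Set


def conserved_in_all(found: Dict[str, Set[str]]) -> Set[str]:
    """Human miRNAs with a detected homolog in every listed species."""
    species_sets = list(found.values())
    if not species_sets:
        return set()
    return set.intersection(*species_sets)


def shared_between(found: Dict[str, Set[str]], first: str, second: str) -> Set[str]:
    """Human miRNAs with a detected homolog in both named species."""
    return found[first] & found[second]


# Toy detection results keyed by species code (hypothetical identifiers).
toy_found = {
    "ptr": {"mir-a", "mir-b", "mir-c", "mir-d"},
    "ppy": {"mir-a", "mir-b", "mir-c"},
    "mml": {"mir-a", "mir-c", "mir-e"},
}
in_all_three = conserved_in_all(toy_found)
ptr_and_ppy = shared_between(toy_found, "ptr", "ppy")
ppy_and_mml = shared_between(toy_found, "ppy", "mml")
```

The set conserved in all three species is necessarily a subset of each pairwise set, which is consistent with 499 being smaller than both 563 and 530.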
Additional file 2 lists all homologs of human mature miRNAs found in the orangutan precursors. The 682 entries include homologs of both 5' and 3' miRNAs, some originating from the same precursor. In total, 611 human mature miRNAs are conserved in at least one orangutan precursor, resulting in 624 different sequences.
Identification of lineage-specific microRNAs
Another question of interest is which and how many miRNAs are lineage- or species-specific. Our analysis in particular supports the identification of human-specific miRNAs. Since we cannot completely exclude the possibility that some homologs are missed because of an erroneous or incomplete genome assembly, we require a negative prediction for a miRNA to be confirmed by our method in at least two of the three non-human primate genomes at hand. Additional file 3 lists all 35 human miRNAs whose homologs are missing by this criterion. Of these, 12 human miRNAs could not be identified in any of the three primate genomes and are thus the most likely to be human-specific. These in particular may contribute to phenotypic differences between human and non-human primates, i.e., may help to explain what makes us human.
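The "at least two of three" criterion for negative predictions can be sketched in Python as follows. The sketch assumes the same representation as a per-species set of human miRNA identifiers with a detected homolog; all identifiers, the toy data, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def missing_counts(
    human_mirnas: Iterable[str], found: Dict[str, Set[str]]
) -> Dict[str, int]:
    """For each human miRNA, the number of species without a detected homolog."""
    return {
        mirna: sum(1 for detected in found.values() if mirna not in detected)
        for mirna in human_mirnas
    }


def lineage_specific_candidates(
    human_mirnas: Iterable[str], found: Dict[str, Set[str]], min_missing: int = 2
) -> List[str]:
    """Human miRNAs whose homolog is missing in at least `min_missing` species."""
    counts = missing_counts(human_mirnas, found)
    return sorted(m for m, n in counts.items() if n >= min_missing)


# Toy data (hypothetical identifiers, not results from this study).
toy_human = ["mir-a", "mir-b", "mir-c", "mir-d", "mir-e", "mir-f"]
toy_found = {
    "ptr": {"mir-a", "mir-b", "mir-c", "mir-d"},
    "ppy": {"mir-a", "mir-b", "mir-c"},
    "mml": {"mir-a", "mir-c", "mir-e"},
}
candidates = lineage_specific_candidates(toy_human, toy_found)
missing_everywhere = lineage_specific_candidates(
    toy_human, toy_found, min_missing=len(toy_found)
)
```

Setting `min_missing` to the number of genomes gives the stricter subset of miRNAs that are absent from every non-human primate considered.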
Sequence and structural similarities to a human miRNA are strong indications that a putative homolog is transcribed and functional. Nevertheless, the expression levels of the two miRNAs may differ due to alterations in the specific regulatory pathway that controls their expression. In addition, the regulatory effects, i.e., the selection and expression of target genes, may differ significantly. This is due to the fast evolution of miRNA binding sites [25], which has led to many lineage- or species-specific sites and is just as responsible for what makes us different from other primates.