- Research article
- Open Access
Mathematical modeling and comparison of protein size distribution in different plant, animal, fungal and microbial species reveals a negative correlation between protein size and protein number, thus providing insight into the evolution of proteomes
© Tiessen et al; licensee BioMed Central Ltd. 2012
- Received: 20 May 2011
- Accepted: 1 February 2012
- Published: 1 February 2012
The sizes of proteins are relevant to their biochemical structure and for their biological function. The statistical distribution of protein lengths across a diverse set of taxa can provide hints about the evolution of proteomes.
Using the full genomic sequences of over 1,302 prokaryotic and 140 eukaryotic species two datasets containing 1.2 and 6.1 million proteins were generated and analyzed statistically. The lengthwise distribution of proteins can be roughly described with a gamma type or log-normal model, depending on the species. However the shape parameter of the gamma model has not a fixed value of 2, as previously suggested, but varies between 1.5 and 3 in different species. A gamma model with unrestricted shape parameter described best the distributions in ~48% of the species, whereas the log-normal distribution described better the observed protein sizes in 42% of the species. The gamma restricted function and the sum of exponentials distribution had a better fitting in only ~5% of the species. Eukaryotic proteins have an average size of 472 aa, whereas bacterial (320 aa) and archaeal (283 aa) proteins are significantly smaller (33-40% on average). Average protein sizes in different phylogenetic groups were: Alveolata (628 aa), Amoebozoa (533 aa), Fornicata (543 aa), Placozoa (453 aa), Eumetazoa (486 aa), Fungi (487 aa), Stramenopila (486 aa), Viridiplantae (392 aa). Amino acid composition is biased according to protein size. Protein length correlated negatively with %C, %M, %K, %F, %R, %W, %Y and positively with %D, %E, %Q, %S and %T. Prokaryotic proteins had a different protein size bias for %E, %G, %K and %M as compared to eukaryotes.
Mathematical modeling of protein length empirical distributions can be used to asses the quality of small ORFs annotation in genomic releases (detection of too many false positive small ORFs). There is a negative correlation between average protein size and total number of proteins among eukaryotes but not in prokaryotes. The %GC content is positively correlated to total protein number and protein size in prokaryotes but not in eukaryotes. Small proteins have a different amino acid bias than larger proteins. Compared to prokaryotic species, the evolution of eukaryotic proteomes was characterized by increased protein number (massive gene duplication) and substantial changes of protein size (domain addition/subtraction).
- Eukaryotic Protein
- Protein Size
- Protein Length
- Eukaryotic Species
- Protein Number
The biological function of a protein is determined by its tertiary, i.e., three-dimensional, structure, which in turn is influenced by its primary structure, i.e. its amino acid sequence. Besides the given order of amino acids (aa), the total length of a protein is also important for determining the tertiary structure of any polypeptide. The longer a protein is, the more options there are for accommodating multiple secondary structures and folding loops [1–3]. The statistical distribution of the sizes of proteins has been investigated by several groups in the past, although with a limited number of representative taxa or focused on prokaryotes. Comparison of average protein size in the proteomes of 5 archaeal, 15 bacterial and 2 eukaryotic species revealed marked differences of protein size . A larger study compared also protein sizes in 16 archaeal, 67 bacterial and 5 eukaryotic species and came to a similar conclusion . It is actually well established that eukaryotic proteins are on average, significantly longer than bacterial proteins, and these in turn are longer on average than archaeal proteins [4, 5]. However, previous studies have not investigated whether there are protein size differences among eukaryotic organisms.
The fact that eukaryotic organisms have larger proteins than prokaryotic species has been interpreted as a true evolutionary trend towards an increase of protein size [4, 5]. It has been postulated that the evolution of eukaryotic proteins was influenced by processes of fusion of single-function proteins into extended multifunctional and multi-domain proteins . Fusion of domains of given size could be predicted to give rise to peaks of a multiple given size in the protein size histograms given the discontinuity of domain sizes and the limited number of different structural domain types. Fusion of domains increases the average size of proteins and this could potentially lead to a respective reduction of the number of individual proteins in the genome.
The evolutionary forces that have shaped protein number and size distributions in modern organisms are unknown. Some groups have tried to find answers based on theoretical models. According to the frequency of stop triplets in the genetic code (= 3/64), the expected size of an open reading frame (ORF) from a random DNA sequence should be on average 64 nucleotides long (~21 aa) . However, since stop codons are biased towards the nucleotides T and A, the expected size also depends on the %GC content of the random sequence, varying between 14 aa (for 35% GC) and 31 aa (for 60% GC) . According to a more detailed analysis of ORF statistics, non-coding DNA sequences are not fully random, but generate random ORF much longer than theoretically expected . Nevertheless, a sharp cutoff is found at 100 triplets (~33 aa) . Since most biologically active proteins are actually much larger than 50 aa, there must be a strong selective mechanism for maintaining the coding capacity of DNA (ORF length).
Two theories have been postulated to explain the relation between protein origin and size distribution: the starter-set and the random-origin hypotheses. The starter-set hypothesis assumes that proteins originated from a small set of starter sequences (functional domains) with lengths of 4 aa, 15 aa or 50 aa which were expanded by gene duplication and modification [9–13]. The premise of this hypothesis is that gene or exon duplication and fusion were essential from the very beginning of protein evolution for producing modern sequences of all organisms, including prokaryotic and eukaryotic species . In contrast, the random-origin hypothesis assumes that proteins emerged first from a very large number of random heteropeptides [15–17]. The random-origin hypothesis assumes that the length of proteins were initially determined by the "start" and "end" signals that delimited the primitive genes and that were distributed randomly along nucleotide sequences. The random-origin hypothesis explains the appearance of large proteins alone by chance [14, 18]. The starter set hypothesis assumes that primitive proteins were initially very small (< 50 aa) but biologically active, and by domain joining and gene fusion became gradually larger [14, 18].
In order to discriminate between those evolutionary theories, is it important to know the frequency of small proteins in the genomes of different species. It is also relevant to determine the selective advantage of having numerous proteins (gene duplications) or proteins of larger size (biological functionality). Some researchers have started by investigating which is the best theoretical model that underlies protein size distributions. White (1994) examined 1,792 sequences and reported that prokaryotic and eukaryotic protein sets had a similar statistical length distribution that could be described by a gamma distribution with shape parameter equal to two or with a distribution that results from the sum of two exponential distributions . A moderate fit to a gamma-type distribution was also found by  whereas  postulated a better fit to a log-normal distribution. Some have argued that a log-normal function was more appropriate because it describes distributions that arise from the product of many random independent events . A stochastic model based on multiplicative processes has also been used to explain protein length distributions .
Several groups have postulated that the sequence distributions of all organisms are similar and that it is possible to describe them in terms of simple mathematical functions [14, 18, 20, 22]. Considering that proteins increase in length by addition or duplication, the log-normal distribution has been said to be less appropriate . Gamma distributions, on the other hand, result from the addition of random intervals and have been said to be more reasonable a priori .
A simple theory to explain why protein lengths can follow gamma-type distributions , assumes that (i) protein sequences have exponentially distributed random lengths; (ii) there is a length dependence of protein fold stability and potential for biochemical activity which greatly limits the number of small proteins (< 100 aa). This theory assumes that maximal protein size (> 1,000 aa) is limited by the very frequent occurrence of stop codons, whereas minimal size is determined by the more limited biological usefulness of small proteins (< 100 aa) [14, 18]. In support of the hypothesis that proteins emerged first from random heteropeptides, it has been shown that modern proteins have a 90% chance of having a lengthwise distribution of amino acids that is indistinguishable from the random expectation . Preliminary surveys concluded that the abundance of different amino acids in proteins is not dependent on the protein length or species of origin . This suggest that small and large proteins should have indistinguishable amino acid composition. It also implies that bacterial and animal proteins (or other groups for that matter) should have the same amino acid bias.
In this work we wanted to revise many of these assumptions. We aimed to address the following questions: How does protein size vary in eukaryotic taxa? Which evolutionary forces influence protein number and protein size? Which theoretical function better describes the observed distributions of protein sizes? Is average protein size correlated to the total number of proteins or to the GC content of DNA? How well are small proteins annotated in the genomic releases? Is there an amino acids bias according to protein size? In order to answer those and other similar questions, we analyzed two independent sets of proteomic data.
Construction and curation of protein datasets
We decided to compare the protein size distributions of different taxa in order to examine the factors that determine protein function, stability and evolutionary trends. In order to achieve this, we first had to construct and validate two datasets for this purpose (set 1 and 2). Selection of biological species, automatic filtering and manual curation of the protein data files was a necessary requirement to ensuring the reliability of the statistical analysis that was performed subsequently. Set 1 was biased towards eukaryotes and plant species, whereas set 2 was biased towards prokaryotic species.
Average protein size in different species and lineages
Protein size summary.
Archaeoglobus profundus DSM5631
Candidatus Korarchaeum cryptofilum OPF8
Cenarchaeum symbiosum A
Desulfurococcus kamchatkensis 1221n
Nanoarchaeum equitans Kin4-M
Sulfolobus acidocaldarius DSM 639
Thermoproteus neutrophilus V24Sta
Thermoplasma volcanium GSS1
Acidimicrobium ferrooxidans DSM 10331
Bacteroides fragilis NCTC 9343
Bacillus subtilis 168
Bifidobacterium adolescentis ATCC 15703
Bradyrhizobium japonicum USDA 110
Burkholderia cepacia AMMD
Campylobacter jejuni RM1221
Chlamydia muridarum Nigg
Corynebacterium aurimucosum ATCC 700975
Deinococcus deserti VCD115
Escherichia coli O157:H7 str. EC4115
Gloeobacter violaceus PCC 7421
Hydrogenobacter thermophilus TK-6
Kocuria rhizophila DC2201
Leptospira biflexa Patoc 1 (Ames)
Persephonella marina EX-H1
Staphylococcus aureus aureus MW2
Streptomyces avermitilis MA-4680
Sulfurospirillum deleyianum DSM 6946
Synechocystis sp. PCC 6803
Thermosynechococcus elongatus BP-1
Thermus thermophilus HB27
Xanthomonas campestris pv armoraciae
Phanerochaete chrysosporium strain RP78
Oryza sativa ssp. japonica
Protein size of functional KO categories across taxonomic groups
Three strategies were followed in order to confirm that average protein size was not derived from genomic artefacts and data outliers generated by transposons, alternative spliced proteins and gene family duplications: 1) we performed mathematical modeling for decreasing the influence of outliers (see later sections), 2) we filtered transposon proteins and made a separate analysis of size distribution, and 3) we grouped genes according to functional categories and compared average sizes of different taxonomic groups (see below).
For strategy 2 we selected the best annotated plant genome (Arabidopsis thaliana) as representative eukaryotic species. We filtered all proteins that contained particular keywords in the gene annotation (e.g. transposon, transposase, retrovirus, etc.) and made a separate analysis of protein size (data not shown). The results confirmed that transposon related proteins did neither affect the distribution models nor any of the other results (averages, medians and percentiles shown in Table 1).
In contrast, when comparing bacteria and fungi (as eukaryotic representative) the size differences were present in many KO categories (Figure 4b). Proteins of most KO categories were larger in size in fungi than in bacteria (Figure 4b). This can be interpreted that the average differences of protein size (Table 2) was not an artifact, but it has been caused by a mechanism of gene extension, through domain addition or gene fusion.
Finally, eukaryotic taxa were also compared to each other. Proteins belonging to the same KO category were of variable size in different taxonomic groups, e.g. as shown between fungi and protist (Figure 4c). This means that the observed size differences between eukaryotic groups (see Table 1 and Figure 3) were caused by both evolutionary mechanisms, by gene duplication (increasing the number of large proteins) and by gene fusion (altering average protein size of some selected KO categories).
Protein number and protein average size in different species
Range of average protein size differences
Plotting average protein sizes showed that the values were quite narrowly distributed for prokaryotic species, but it had a much larger spread in eukaryotes (Figure 4). Protein size of archaeal species were even more narrowly distributed than bacterial proteins (Figure 4). Among eukaryotic species, the spread was larger for protists than for animals or fungi (Figure 6). Protist species have longest or shortest protein sizes, or also the genome with the most numerous genes (Figure 6). This indicates that there is larger diversity of protein number and size among unicellular protists species than among all other taxa. This parallels 16SrRNA, where protists showed the largest amount of diversity among eukaryotes .
GC content of coding DNA and average protein size in different species
We also studied the possible relationship between protein number, size and GC content of the genomes. We therefore measured the gene-based GC content for all species in dataset 2 (see methods). The overall correlation coefficient between average protein size and GC content was barely significant (r = 0.05, n = 1,442; p = 0.048). When analyzed as separate kingdoms, correlation values were non-significant in Eukaryotes (r = -0.05, n = 140; p = 0.53) and Archaea (r = 0.021; n = 101; p = 0.84) but significant in Bacteria (r = 0.21, n = 1,239; p = 9.7 × 10-14).
Correlation values between protein number and GC content were: All species (r = 0.17, n = 1,442; p = 3.1 × 10-11), Eukaryotes (r = -0.05, n = 140; p = 0.53), Archaea (r = 0.44; n = 101; p = 2.7 × 10-6) and Bacteria (r = 0.58, n = 1,239; p = 8.6 × 10-114).
These results are in accordance with the hypothesis that protein size in eukaryotic organisms (as compared to prokaryotes) has been under distinct selective pressures during the evolution of lineages . The positive correlation found for bacterial species gives support for the theoretical prediction that the length of ORFs increases with the GC content of DNA due to the AT bias of stop codons . It is interesting to note that this correlation applied only for bacterial genomes but not for complex organisms with large genomes. However, other potential explanations, like selection for smaller genes to increase the rate of duplication in genomes under a reductive process cannot be ruled out.
Size anomalies in the protein length histograms of different species
Theoretical fit of protein length distributions
Finding the best fit of protein length distributions to well characterized mathematical models can offer insights about evolutionary trends, selective pressures and constraints for protein function and structure . Since protein stability is determined in part by length [28, 29], the size of proteins has a selective advantage and therefore, influences the evolution of proteomes in each lineage. One could assume that proteins smaller than 130 aa are less functional or less stable than proteins larger than 200 aa. Indeed, the need to explain the origin of large proteins was one of the major reasons for invoking gene or exon duplication in the starter-set hypothesis [10, 30]. If size distributions can be explained by a simple stochastic model without assuming massive gene or exon duplication; then, the random-origin hypothesis would be supported. If the same model applies for all species, one could also hypothesize that protein size has increased by the same evolutionary process that caused bacterial proteins to be larger than archaeal proteins.
Therefore it is important to find the model that most accurately describes the protein distribution in each of the selected species. We compared the following theoretical distributions: 1) gamma with fixed shape parameter 2) gamma distributions with no restriction of the shape parameter, 3) the log-normal distribution, and 4) a distribution resulting from the sum of two exponential random variables. For each theoretical function we estimated all the parameters (see methods), and we obtained the Akaike's Information Criterion (AIC) for each fit for each species of set 1 (Additional file 1: Table S3) and set 2 (data not shown). For the species shown in Figure 7, 8 the model that had the lowest (most negative) AIC value is shown in small red letters. According to the AIC criterion the log-normal model best fits the data for 40 out of 84 species of set 1; the gamma model best fits the data for further 37 species and the sum of exponential model best fits the remaining 7 species (Additional file 1: Table S3). In set 2, the best fit models were: log-normal (601 species), gamma free (699 species), gamma with fixed shape parameter (71 species) and sum of exponential (71 species) (data not shown).
The shape parameter of the gamma function
Previous attempts to model protein size distributions had used a gamma distribution with fixed shape parameter equal to 2 [14, 18]. To examine the soundness of this assumption, we modeled gamma distributions without the fixed shape parameter restriction, and then analyzed the estimated shape parameter in all species of dataset 1 (Additional file 1: Table S3) and dataset 2 (data not shown). The empirical distribution of the shape parameter for the gamma models had a mean of 2.3 in both data sets (Additional file 1: Figure S1). Using hypothesis testing from normal theory, we determined that the statistical feasibly of a fixed value of 2 is negligible (p = 7.11 × 10-9). This means that gamma models with fixed shape parameter equal to 2 are inadequate for describing protein size distributions. Thus, the shape parameter is not strictly 2 as earlier assumed by , but it can vary between the extreme values of 1 and 3 depending on the species (Additional file 1: Table S3).
Modeling of size distribution can detect data outliers
In order to visualize the goodness of the theoretical fits, we plotted the models on top of the real data for a few selected species (Figure 9). In those figures one can observe datasets for which the log-normal model fits well but there are also cases for which other functions (e.g. the sum of exponential distribution) fit better (Figure 10). Although the models explain the sizes roughly, there are some extreme data points that lie outside the fitted curves (see size ~330 aa in Figure 9 MON_DOM and EQU_CAB).
The fitted values of the models can be used to estimate theoretical protein sizes based on the idealized functions. For example, the theoretical median size can be estimated with the meanlog value (μ) of the log-normal function, whereas the expected value can be calculated with the meanlog (μ) and sdlog (σ) (see methods). The same general conclusions were obtained when comparing the theoretical average and median values from the fitted models (data not shown) with the previous conclusions drawn from the arithmetic values (see Table 1 and Figure 3). Thus, the main conclusions from this study are robust and statistically reliable and not an artifact of extreme outliers or single size anomalies.
Modeling of size distribution can detect genomic annotation artifacts
Comparison of the fitted models between different species also allowed us to detect genomic releases that had an aberrant distribution of protein sizes with very abundant occurrence of small ORFs (see Figure 8 MED_TRU, CAR_PAP and POP_TRI). In those genomes, small proteins of size < 100 aa are more abundant then proteins of size 200 aa. As discussed below, mathematical modeling could be used as a tool for discriminating genomic releases that have a better annotation of small ORFs (see discussion).
Amino acid bias depending on protein length
In addition to studying the length of proteins, we also investigated the amino acid (aa) composition in both absolute terms (occurrence) and relative amounts (percentage occurrence). We asked whether protein length has any amino acid bias (e.g. if small/large proteins have more/less of any amino acid). Do small proteins contain more cysteines for stabilizing their structure? Do proteins from plant species have different properties from proteins of other phylogenetic groups? In order to answer these questions, we first calculated percentage aa composition of each protein of a species, we then calculated the Pearson correlation coefficient r of the 20 aa to protein length (Figure 9). The obtained r values (negative or positive) were then averaged across phylogenetic groups and plotted horizontally with error bars (Figure 9). There were marked differences between prokaryotic and eukaryotic species. For example, glycine was positively correlated to protein length in prokaryotes but not in eukaryotic proteomes (Figure 9g), whereas the opposite was true for glumatic acid (Figure 9e). The r value was also different for other amino acids (Figure 9). Threonine had a high positive r value in archaea and bacteria, much less in animal species and almost a zero or negative value in plant species (Figure 9). Negative r values were found for cysteine, methionine, lysine, fenilalanine, tryptophan and tyrosine, whereas positive r values were found for aspartate, glutamate, serine (Figure 9). Positive r values means either that longer proteins tend to have more from that particular amino acid or that short proteins tend to have less from that amino acid. The opposite is true for negative r values. For example, it seems that long proteins have less cysteine as expected, either because it is detrimental for long proteins, or because cysteine helps to stabilize short proteins. The lowest r values were found for histidine and valine (Figure 9), meaning that those amino acids have rather a neutral effect on protein length.
The overall goal of the study was to investigate how similar or dissimilar are protein sequences across different taxa of eukaryotic organisms. In this paper we focused on average protein length, protein number and amino acid compositional bias. We also tested the fit of the length distributions to different theoretical models. We determined the mathematical function that best fits the empirical distribution of protein size in each organism.
Protein length distribution has not been previously understood as a selective trait per se (i.e. it is not a character that is directly selected for in classical evolutionary terms ). Individuals are selected for having inheritable units (e.g. genes or epigenetic states) whose products (e.g. proteins) confer a selective advantage to their carriers and progeny. Protein features under direct selection could include many, like the specificity and efficiency of a reaction when the protein is an enzyme, or the thermo-stability of the protein, among others [31, 32]. Physicochemical restrictions must also play a role, for example, very small proteins might not fold properly, and the chances to evolve useful proteins are reduced for extreme sizes. Protein size directly affects the number of functions accessible to a polypeptide, and it is also indirectly associated to many features that are indeed under direct selection . The actual shape of the protein length distribution in a particular genome has to be an interplay between mutation, recombination, fusion, fission, deletion, selection, physicochemical restriction and history. The challenge then becomes to explain how these factors have contributed each, to originate a particular distribution.
There are significant differences of average protein size in different eukaryotic species
Previous studies on the statistical distributions of the lengths of modern protein sequences have focused on prokaryotic species. It was already known that archaeal proteins are on average smaller than bacterial proteins [4, 5]. Some preliminary surveys also concluded that proteins from eukaryotic species are larger than bacterial proteins. Brocchieri and Karlin (2005) analysed five eukaryotic species , whereas Zhang (2000) studied only two eukaryotic species (Saccharomyces cerevisiae and Caenorhabditis elegans) . Since the analysis of few species can lead to severe statistical bias due to limited sampling, we considered necessary to analyze protein length in a much larger and diverse set of eukaryotic species. We constructed large protein datasets of prokaryotic (set 1 n = 33, set 2 n = 1,302) and eukaryotic species (set 1 n = 51, set 2 n = 140), including fungal, animal and plant proteomes (Figure 1, 2). We then estimated size differences among all species (Table 1) and confirmed previous reports  (see above) that eukaryotic proteins are larger on average than bacterial and archaeal proteins (Figure 3). We show that average protein size could be due to an altered number of proteins or to an altered size of proteins within a functional KO category (Figure 4).
Furthermore, the large variability of eukaryotic protein length followed some phylogenetic relationships (Figure 3). Plant species had particularly small proteins among all eukaryotes (Figure 3). Simple animal organisms like Nematostella vectensis and Pristionchus pacificus had protein sizes that were similar to plant species (Table 1). We also found that unicellular eukaryotic organisms tend to have larger average sizes than multicelular species (Table 1). For example, the apicomplexa group has larger proteins than the group of vertebrates, whereas chlorophyta group has less but larger proteins than the other groups of plants (Figure 3). In order to confirm these observations we compared all unicellular eukaryotes against all multicellular species of dataset 2 and found that the abovementioned differences were significant for protein number (p = 9.6 × 10-12), GC content (p = 0.0011) and protein size (p = 0.0018).
There are marked peaks but no marked gaps in the protein length distribution curves
After observing the strong size differences among phylogenetic groups we analyzed the distribution curves to see how smooth and homogenous are proteins distributed in size (Figure 7, 8). We detected protein bins of 1 amino acid (aa) that were more frequent than expected by the theoretical models (Figure 9). Gaps in the distribution curves would have indicated that there are protein sizes that are prohibited by structural or functional reasons. For example stably functional proteins can be formed by the very common structure of an eight-stranded α/β barrel (TIM barrel) [33–35], but there are no reported functional proteins that have five or nine α/β strands [36, 37], and thus one could expect gaps at given size intervals. Since ~200 aa residues are required to fully form a TIM barrel (each α/β strand consists of ~25 aa residues) , one could expect that some protein sizes are less frequent than those which have a multiple of 25 aa or 200 aa. However, no marked gaps or such regularities were observed in the histograms (Figure 7, 8). This indicates that there are no prohibitive structural constrains of protein size along the whole range of observed sizes. Instead, marked peaks were indeed observed, which corresponded to data outliers in the fitted models (Figure 9). This can be explained by massive duplication of particular genes. Most noticeably, all mammal species had a very prominent single peak of size ~332 aa in the distribution curves (Figure 9 MON_DOM and EQU_CAB). In plants, some prominent peaks were due to multiple copies of transposon-encoded proteins like the 191 aa peak in Sorghum bicolor (Figure 8).
The significant deviations from the idealized functions, and the strong differences of the distributions among different organisms, indicates that protein length distributions are strongly influenced by specific selective pressures. One of the evolutionary mechanisms is gene duplication and subfunctionalization leading to large gene families. An example for selective pressure for gene duplication is the need for a large repertoire of olfactory receptors in mammal species , leading to large increase of proteins with length = 332 aa (Figure 9).
The theoretical models that best describe the distribution of protein length are the log-normal function and the gamma function with unrestricted shape parameter
Finding the best fit of distributions to known mathematical models can offer some useful biological insights. Protein length analyses of modern species not only could provide clues to better explain the origin of primitive proteins , but it can also provide valuable information on selective pressures that have prevailed during evolution. A good fit to a gamma function had been previously used to sustain the hypothesis that proteins evolved from random nucleotide sequences . The gamma function with shape parameter 2 describes a probability density function that results from the combined action of two independent random variables exponentially distributed with parameters α > 0 and β > 0 respectively. A simple theory for this theoretical distribution is that the occurrence of stop codons in a random nucleotide sequence leads to exponentially distributed protein lengths, whereas selective pressure for protein stability, folding capacity, and potential biochemical activity is dependent on sequence length, so that small proteins (< 100 aa) have a limited potential for a useful biological function, and thus are rather discarded or negatively selected for .
However, the assumption of a fixed shape parameter 2 had to be rejected on statistical grounds (Additional file 1: Figure S1). The lognormal function had a better fit in 48% of the species, making it almost equivalent to the gamma function with free shape parameter. Furthermore, the sum of exponential functions had a better fit in only 8% of the species, particularly of recently sequenced genomes with not so long history of curation and manual annotation. It can be concluded from all previous results that the theoretical model that better describe protein size distribution is the gamma function with unrestricted shape parameter.
Why do genomes have a protein size distribution different from the theoretically expected?
The genetic code allows making some theoretical predictions about average protein size and frequency distribution [7, 8, 38]. Since stop codons can appear stochastically after any start codon, then larger proteins should always be less frequent than smaller proteins. The most frequent protein sizes should be 1 aa in length . However, distributions of well annotated genomes such as Arabidopsis thaliana do not decrease monotonically but rather increase sharply at about ~80 aa, peak several times in the range of 150-250 aa and then decreases gradually (Figure 7, 8). In most genomes, proteins of size 151-250 aa were more frequent than proteins of size 51-150 aa and even more than proteins of size 1-100 aa (Figure 7, 8). One can interpret this as evidence of a selective pressure for the avoidance of proteins smaller than 100 aa and the selective advantage of functional proteins of > 250 aa. The characteristic increase of proteins in the range 50-200 aa can be explained with the abovementioned selection force, whereas the monotonic decrease of frequency in the range 500-1,000 aa can be explained by the probable occurrence of stop codons in the coding determining sequence (CDS).
If one considers simple models, the average protein size should be ~21 aa . If one considers more sophisticated models explaining the length of random open reading frames (ORFs) in the intergenic regions of yeast , random ORFs of ~33 aa can be explained by the mummy and baby ORF theory alone [8, 39]. However, the average eukaryotic protein is much larger than 100 aa (Table 1). We assume that the frequent occurrence proteins of size 150-250 aa is due to protein folding stability (for example TIM barrels) that generates a selective pressures avoiding stop codons within exons or genes. We postulate that this force is so strong in eukaryotic species, that it overcomes the influence of the GC content of DNA on average ORF length as indeed found in prokaryotic genomes.
Since most prokaryotes (archaea and bacteria) lack introns, the fact that eukaryotic proteins are much larger can be explained because proteins usually are encoded in multiple exons . In follow up studies we will analyze how the statistical frequency of stop codons limits the maximal protein length in prokaryotic species that do not have splicing mechanisms.
Limitations for the occurrence of small proteins
Why are proteins of size 150-250 aa so frequent? Why are proteins smaller than 150 aa so infrequent in some genomes but not in others? Is it the result of bioinformatic annotation procedures? In the yeast genome, many ORFs < 100 aa are likely non-coding or over annotated . Some genomic annotating algorithms are instructed to ignore small open reading frames with a minimal cutoff of ~33 aa in order to limit the number of false positive ORFs . Some proteomes show clearly such drastic cutoffs in the range of 20-60 aa (Figure 7, 8). Small proteins could have important biological functions, however there is a statistical justification in that smaller proteins are more difficult to predict than larger ones . Therefore, the definition of cut off limits the number of false ORF predictions. A cutoff increase from 21 aa to 33 aa might be supported by the mummy and baby ORF theory . Sophisticated measures of nucleotide bias at the DNA level (asymmetry in the composition of the first and the second positions in the codons) can help to detect spurious non-coding ORFs in the yeast genome . However, we propose using additional bioinformatic tools at the protein level, such as mathematical modeling and amino acid bias to exclude false positive small ORFs rather than defining an arbitrary cutoff of protein size. For example, from two equally small proteins, it is more likely to be functional the one that contains more %C, %M, %K, %F, %R, %W and %Y and less %D, %E, %Q, %S and %T (Figure 9). The importance of amino acids like C is not surprising, since cisteine bridges stabilize the structure of small proteins.
Biological functionality of small proteins
As previously mentioned, the protein length distribution of most eukaryotic organisms is not monotonically decreasing, but increases from 1-200 aa (see Figure 7, 8). What determines the minimal size of a biologically active protein? Is it the function, the structure or the capacity to be regulated? Finding a biochemical explanation for a threshold of small proteins is neither simple nor trivial. An active site of an enzyme typically consists of only 3-5 amino acid residues correctly accommodated in space . If enzymatic catalysis can be carried out by only so few amino acids, then the remaining residues (> 97%) of eukaryotic proteins are maybe only for accommodating those catalytic residues correctly in space and for spatial filling. The smallest known enzymes are about 10 kDa (~100 aa) in size with some extreme cases like an enzyme of 62 aa that forms a stable homopentamer (5 × 62 aa = 310 aa) . Thus it seems that even the smallest enzymes achieve stable folding and regulatory features only when more than 200 aa residues interact spatially. Proteomic surveys show that the most abundant proteins in SDS-PAGE gels of plant extracts are in the range of 20-60 kDa (~200-600 aa) with very few proteins appearing in the range below 10 kDa (data not shown). Most enzymes form quaternary structures of several polypeptides, for example the often found di-mers or tetra-mers (http://www.expasy.org/).
From the bioinformatic survey of eukaryotic organisms (n = 140) we conclude that the range of 150-250 aa is the optimal length for a biologically active polypetide. Does that represent a waste of resources? Probably not, since a minimal size of > 200 aa might be required for conferring regulatory properties to enzymes and proteins. More work and deeper studies are needed to address such open questions on enzymatic function and biological capabilities of small proteins.
What limits the occurrence large proteins in plants?
We found that there are not so many long proteins in plants (Figure 3). But, why? We found that plant proteins are on average encoded by less exons than in animal genomes (data not shown). What limits a more frequent appearance of multi-exon genes in plants in comparison to animals? Is there any metabolic efficiency and amino acid composition in plant proteomes? Biosynthetic cost-minimization of bacterial proteins has been postulated as an explanatory hypothesis for differences in evolutionary fitness . In bacteria, it has been shown that the energetic advantage of using different amino acids for highly expressed genes can be a substantial proportion of the total energy budget . Is there cost-minimization of amino acid usage in plant proteins? Or is the reason the appearance of more complex proteins in animals. Plant genomes have numerous genes, but it seems that the average plant protein is not only smaller but it is also encoded by less exons, thus suggesting that sequence length differences could reflect a difference in protein multi-functionalities between plant and animal proteins. More work is required for the analysis and comparison of the multiplicity of PFAM and Interpro domains in plant and animal proteins in order to statistically test such hypotheses.
Is there an universal evolutionary trend towards larger proteins?
The average length of polypeptides in archaea (~283 aa), bacteria (~319 aa) and eukaryot (~472 aa) are significantly different (Table 1). The progressive increase of protein size among archaea, bacteria and eukaryotes has been interpreted as a constant evolutionary trend for larger proteins . Did eukaryotic proteins become steadily larger through domain fusion as suggested by Brocchieri & Karlin (2005)  The negative correlation between protein number and average protein size among eukaryotes (Figure 4, 5, 6) provides support for the hypothesis that proteins can increase or decrease their average size through the fusion or splitting of protein domains. However, we rule out the possibility of a steady trend for size increase in all organisms. Some phylogenetic groups had larger proteins while others had smaller proteins, and this is not related to the evolutionary time of emergence of those eukaryotic lineages. Most noticeably was the extreme variability of protein size and protein number in protist species (Figure 6). This indicates that protein size can depend greatly on short-term environmental adaptations.
Among eukaryotes, there was an indirect relationship between total protein number and average protein size (Figure 5, 6). It is therefore tempting to speculate that proteins can fuse together, so that many small proteins can become fewer larger proteins. This could be because some organisms have better a adaptation when they have less but more multifunctional proteins (larger size), while other species are better adapted to specific environments, when they have more but less multifunctional proteins (smaller size).
In a nutshell, compared to prokaryotic species, eukaryotic proteomes have been shaped by distinct evolutionary forces that have favored massive gene duplication events (increase of protein number) and domain addition (increase of average protein size).
The shape of histograms and the fit to theoretical distributions could be indicative of the efficiency of the bioinformatic procedures for annotating small proteins
As shown in Figure 7, 8, the empirical distribution of protein lengths for many species show the characteristic shape of a log-normal or a gamma function. However, there are some exceptions. For example, the protein size distribution in Chlamydomonas reinhardtii (CHL_REI) is monotonically decreasing from 50 aa to 250 aa (best fit to sum of exponentials model) whereas all other algae show a characteristic increase in that range (Figure 7). The same is true for the histogram of Medicago trunculata (MED_TRU) and Carica papaya (CAR_PAP) in comparison to other dicot species (Figure 8). In comparison, the histograms of well characterized plant genomes like Arabidopsis thaliana (ARA_THA) and Zea mays (ZEA_MAY) show a typical gamma distribution (Figure 8). It seems therefore plausible to suggest that the bioinformatic procedures that were used for annotating small proteins in the genomes of CHL_REI and MED_TRU were not as accurate as the procedures that were implemented in other plant species.
More work is required to optimize bioinformatic procedures for correctly discriminating ORFs of biological active proteins and therefore filtering out false positives and baby ORFs of small size. One strategy would be to measure the nucleotide bias of the first and second codon positions . Another complementary strategy would be to use the amino acid bias information in dependence of protein length and taxonomy (Figure 9) in order to discriminate baby ORFs from mummy ORFs . Comparing the empirical size distribution to a log-normal or gamma function could then indicate how well a genomic release has been annotated.
There are some significant amino acid bias depending on protein length and on species
It has been reported that the isoelectric point (pI) of proteins has a bimodal distribution, with very low fractions of proteins with pI close to 7.4 [46, 47]. When all proteins are analyzed together, there is no correlation between protein size and pI values . When acidic or basic proteins were analyzed separately, the correlation coefficient between protein size and pI was positive (r = 0.15) for the acidic set of proteins and negative (r = -0.25) for the basic set of proteins [ There is also a negative correlation between the pI bias of proteins and the genomic GC content . We therefore asked whether protein size is correlated to the occurrence of certain amino acids influencing the pI values.
Previous studies have reported that there is a dependence of amino acid frequencies on sequence length . Using a non-redundant set of only 2,275 proteins they found that the frequency of cysteine (C) increases dramatically for sequences shorter than 100 amino acids. It was also reported that arginine (R) and Lysine (K) frequencies increase for short sequences whereas aspartate (D) and glutamate (E) frequencies decrease . In our datasets based on over 1.2 million proteins, we observed that the correlation of protein length with percentage amino acid composition is variable and not as restricted as postulated earlier . Protein length correlates negatively with %C, %M, %K, %F, %R, %W, %Y (Figure 9). Positive correlations where detected for %D, %E, %Q, %S and %T (Figure 9). Rather variable, small or no significant correlations were found for %H, %I, and %V (Figure 9). It is noteworthy to mention that protein size correlated negatively with the basic amino acids (K and R) and positively with the acidic amino acids (D and E). This agrees with previous reports on the bimodal pI distribution, where acidic proteins are significantly longer than basic ones .
There are also some strong differences among phylogenetic groups. In archaea and bacteria, marked differences were found for the amino acids %E, %G, %K, %M as compared to the observed values in other eukaryotic groups (Figure 9). The positive correlation of %G is in accordance to the occurrence of large glycine rich proteins in bacterial genomes, a feature that seems specific for bacteria but not for eukaryotes (Figure 9. Thus, prokaryotic and eukaryotic proteins not only are different in size (Table 1 and Figure 2), but have also a different amino acid composition in dependence on protein length (Figure 9).
Interestingly, there were also some differences between plant groups. In monocot plants, protein length correlated negatively with %A, %G, %P. This was not the case in the dicot plant group. The opposite effect was observed for other amino acids where in monocot plants, protein length correlated positively with %F, %I, %K, %N. The reason for this is also unknown and should be investigated in more detail in follow-up studies. What is clear is that bacterial, animal and plant proteins have not only different average sizes, but have also specific biases towards different amino acids.
In the present study, we demonstrate that proteins of different phylogenetic groups have different mean lengths. Some groups of species (e.g. protists, ciliophora, chlorophyta) have larger proteins than other groups of species (e.g. cnidaria, nematoda, placozoa, metazoa, viridiplantae). We found variable relationships between protein size, protein number or genomic %GC content. Among eukaryotes, protein number and protein size are negatively correlated (Figure 5, 6). The theoretical model of a gamma distribution with an unrestricted shape parameter can be a valuable tool to study protein evolution and to optimize automatic ORFs annotations.
Plants have usually more proteins in their genomes than animal species, but animal proteins are on average much larger. One possible reason for plants having smaller proteins than most animal species could be that plant proteins are encoded by less exons on average. We speculate that evolutionary forces related to functional domains in eukaryotic proteins explain the distribution of protein size in eukaryotes, but these hypotheses need to be tested statistically in some follow-up studies.
A comprehensive understanding of protein size across all taxonomic groups is relevant in the context of synthetic biology, which aims to construct minimal genomes and organisms. In addition to finding the minimal set of genes that are required to build a living cell (minimal number of proteins), it is also important to design synthetic proteins with minimal size (minimal aa usage) that still perform the desired biological function. Multi-domain proteins and multi-functional enzymes could be attractive strategies for synthetic DNA minimization.
Protein datafiles set 1
The protein sequences of all organisms where obtained from the NCBI and Ensemble public databases (download date February 2010), unless otherwise stated. We choose the species on following criteria: 1) full genomic versions containing all proteins of that species 2) sequence files publicly available. 3) focus primarily on all available higher plants species. We also included species from other taxonomic groups (animals, fungi, etc.) by selecting few representatives that also matched previous criteria. For a general comparison between eukaryotic and prokaryotic proteomes we therefore included 24 bacterial and 9 archaeal species in addition to the main 51 eukaryotic species.
The release version of the genomic sequences, the species IDs and the download sites of the protein fasta files are listed in Additional file 1: Table S1. In order to create a protein database that would be reliable for our purposes, we filtered the publicly available genomic protein fasta files for the presence of identical proteins, redundant, and thus created a database containing over 1.2 million non-redundant entries. The number of proteins in the original files (1,312,288), the duplicated genes (46,134) and the number of proteins finally kept for further analysis (1,266,154) are shown in Additional file 1: Table S2 for each species individually.
Protein datafiles set 2
Available genomes from KEGG database (27 of May 2011; http://www.genome.jp/kegg/) were downloaded, and a database of protein sequences (in fasta format) from these genomes was constructed, comprising ~6.1 million entries. The taxonomic classification, KEGG code, number of proteins and gene-based G + C content is shown for each species in Additional file 1: Table S6.
Sequence analysis and statistical procedures
In house developed perl scripts were used for sequence handling and analysis taking advantage of some standard BioPerl routines (http://www.bioperl.org/).
For all comparisons between samples, analysis of variance (ANOVA) or Linear Modelling (LM) were done with a threshold of p ≤ 0.01 for statistical significance. In certain cases, we used Bonferroni type corrections for multiple comparisons, so that highly stringent p-values were used such as p ≤ 0.00001. Statistical analysis was done mainly with the R program version 2.13 . Analysis of variance (aov), Linear models (lm), Principal component analysis (pca), hierarchical clustering (hclust) and heatmap biclusters were done with R using the bioconductor, lattice and pcaMethods libraries with default settings for the aov, lm, pca, histogram, hclust, dist, pairs, heatmap and plot functions .
Probability density functions for calculating theoretical protein length distribution
For describing the protein length distribution curves we chose some probability density functions that previous groups had chosen before: The gamma distribution , the sum of two independent exponential distributions  and the log-normal distribution [20, 21]. For proteins larger than 1,500 aa we also analyzed the sequence length distributions using the Pareto's function .
The sum of two independent exponential random variables
where z is the protein length (total number of amino acid residues).
One can obtain the same result by setting θ = 2 in (2), so it follows that the probability density function given in (3) corresponds to that of a Gamma random variable with scale parameter α and fixed shape parameter 2. The estimators of the parameters can be obtained easily using the maximum likelihood method with the moment's estimators as initial values.
Log normal distribution
This distribution can be used to approximate the frequency distributions of gamma distributed random variables since both distributions have the same support and similar asymmetries .
The Pareto's distribution is a power law probability distribution often used in social sciences and economics. This model has been used previously to describe the frequency distributions of protein lengths for the last few percentiles .
so , then taking natural logarithms in both sides leads to a linear model in z, log(P(Z > z)) = αz m - (α + 1)log z. That means that under the null hypotheses the data comes from a Pareto's distribution, if one plots log(P(Z > z)) vs log Z the corresponding scatter plot must look like a straight line.
We applied the Chi-squared test to whether the data comes from the Pareto's distribution for each data set.
The statistical models described previously were fitted by using the well known maximum likelihood method for each species with the observed data in the range 0 < z < 1,500. In the case of the gamma and log normal distributions we used the fitdistr function in the MASS package  in R . For the gamma distribution with fixed shape parameter and the distribution of the sum of two exponential distributions we used the optim function (also available in R) to maximize the likelihood function and obtain the estimates of the parameters of interest. The scripts used to fit all the models are available upon request.
Discriminating between models
where k is the number of parameters estimated in the model, and log(L) is the value of the log-likelihood function for the estimated parameters. For a given data set, several models can be fitted; according to the AIC criterion the model with smallest AIC is the best (more negative AIC value).
Testing whether the gamma type function has a fixed shape parameter equal to 2 or not
That is we want to know if the value hypothesized by White (1994) for the shape parameter 2 is feasible or not. This hypothesis can be easily tested using the abovementioned assumptions from the normal theory. If the obtained p-value is low enough, then the null hypothesis needs to be rejected, and that means that the gamma model with fixed shape parameter equal to 2 is inadequate.
We thank Andres Christen for critical reading of earlier versions of the manuscript and helpful mathematical advice. We also thank Cei Abreu-Goodger and Ruairidh Sawers for their helpful suggestions on the manuscript. We greatly thank Diego Riaño for providing perl scripts, obtaining the data set 1, and for his useful suggestions and critical comments. We thank also Luis David Alcaraz for bioinformatic advice and for providing some of the perl scripts we used in the preliminary proteomic surveys. This work was partially supported by grants from the Consejo Nacional de Ciencia y Tecnología (CONACYT México) to AT. PPR was partially funded by Subdirección de Investigación: Línea 15, Colegio de Postgraduados, México. We thank the Max Planck Institute of Molecular Plant Physiology in Potsdam, Germany for providing access to compute clusters for sequence analyses.
- Nelson DL, Cox MM: Lehninger. Principles of Biochemistry. Edited by: Freeman WH. 2004, 4Google Scholar
- Chothia C, Finkelstein AV: The Classification and Origins of Protein Folding Patterns. Annu Rev Biochem. 1990, 59: 1007-1039. 10.1146/annurev.bi.59.070190.005043.PubMedView ArticleGoogle Scholar
- Petsko GA, Ringe D: Protein Structure and Function. 2003, New Science Press, Ltd, 111-Google Scholar
- Zhang JZ: Protein-length distributions for the three domains of life. Trends Genet. 2000, 16 (3): 107-109. 10.1016/S0168-9525(99)01922-8.PubMedView ArticleGoogle Scholar
- Brocchieri L, Karlin S: Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 2005, 33 (10): 3390-3400. 10.1093/nar/gki615.PubMedPubMed CentralView ArticleGoogle Scholar
- Jukes TH, Holmquist R, Moise H: Amino-Acid Composition of Proteins-Selection against Genetic Code. Science. 1975, 189 (4196): 50-51. 10.1126/science.237322.PubMedView ArticleGoogle Scholar
- Oliver JL, Marin A: A relationship between GC content and coding-sequence length. J Mol Evol. 1996, 43 (3): 216-223. 10.1007/BF02338829.PubMedView ArticleGoogle Scholar
- Gierlik A, Mackiewicz P, Kowalczuk M, Cebrat S, Dudek MR: Some hints on open reading frame statistics-How ORF length depends on selection. Int J Mod Phys C. 1999, 10 (4): 635-643. 10.1142/S0129183199000474.View ArticleGoogle Scholar
- Eck RV, Dayhoff MO: Evolution of Structure of Ferredoxin Based on Living Relics of Primitive Amino Acid Sequences. Science. 1966, 152 (3720): 363-10.1126/science.152.3720.363. &PubMedView ArticleGoogle Scholar
- McLachlan A: Repeating sequences and gene duplication in proteins. J Mol Biol. 1972, 64: 417-437. 10.1016/0022-2836(72)90508-6.PubMedView ArticleGoogle Scholar
- Darnell J: Implications of RNA-RNA splicing in evolution of eukaryotic cells. Science. 1978, 202: 1257-1260. 10.1126/science.364651.PubMedView ArticleGoogle Scholar
- Dorit RL, Gilbert W: The limited universe of exons. Cur Opinion Struc Biol. 1991, 1: 973-977. 10.1016/0959-440X(91)90093-9.View ArticleGoogle Scholar
- Dorit R, Schoenbach L, Gilbert W: How big is the universe of exons?. Science. 1990, 250: 1377-1382. 10.1126/science.2255907.PubMedView ArticleGoogle Scholar
- White SH, Jacobs RE: The evolution of proteins from random amino-acid-sequences. 1. Evidence from the lengthwise distribution of amino-acids in modern protein sequences. J Mol Evol. 1993, 36 (1): 79-95. 10.1007/BF02407307.PubMedView ArticleGoogle Scholar
- White SH, Jacobs RE: Statistical distribution of hydrophobic residues along the length of protein chains-implications for protein folding and evolution. Biophys J. 1990, 57 (4): 911-921. 10.1016/S0006-3495(90)82611-4.PubMedPubMed CentralView ArticleGoogle Scholar
- Lau KF, Dill KA: Theory for protein mutability and biogenesis. P Natl Acad Sci USA. 1990, 87 (2): 638-642. 10.1073/pnas.87.2.638.View ArticleGoogle Scholar
- Shakhnovich EI, Gutin AM: Implications of thermodynamics of protein folding for evolution of primary sequences. Nature. 1990, 346 (6286): 773-775. 10.1038/346773a0.PubMedView ArticleGoogle Scholar
- White SH: The Evolution of proteins from random amino-acid-sequences. 2. Evidence from the statistical distributions of the lengths of modern protein sequences. J Mol Evol. 1994, 38 (4): 383-394. 10.1007/BF00163155.PubMedView ArticleGoogle Scholar
- Nei M, Chakraborty R, Fuerst P: Infinite allele model with varying mutation rate. P Natl Acad Sci USA. 1976, 73: 4164-4168. 10.1073/pnas.73.11.4164.View ArticleGoogle Scholar
- Sommer S, Cohen J: The size distributions of proteins, mRNA, and nuclear RNA. J Mol Evol. 1980, 15: 37-57. 10.1007/BF01732582.PubMedView ArticleGoogle Scholar
- Jain R, Ramakumar S: Stochastic dynamics modeling of the protein sequence length distribution in genomes: implications for microbial evolution. Physica A. 1999, 273 (3-4): 476-485. 10.1016/S0378-4371(99)00370-2.View ArticleGoogle Scholar
- Ross SM: Introduction to Probability models. 1989, San Diego: Academic press, 4Google Scholar
- Holmquist R, Moise H: compositional nonrandomness-quantitatively conserved evolutionary invariant. J Mol Evol. 1975, 6 (1): 1-14. 10.1007/BF01732670.View ArticleGoogle Scholar
- Eukaryota, Organisms with nucleated cells. The Tree of Life Web Project, Version 28. [http://tolweb.org/Eukaryotes/3/2009.10.28]
- Schlegela M: Molecular phylogeny of eukaryotes. Trends in Ecology & Evolution. 1994, 9 (9): 330-335. 10.1016/0169-5347(94)90153-8.View ArticleGoogle Scholar
- Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J: On the origin and highly likely completeness of single-domain protein structures. P Natl Acad Sci USA. 2006, 103 (8): 2605-2610. 10.1073/pnas.0509379103.View ArticleGoogle Scholar
- Denton MJ, Marshall CJ, Legge M: The protein folds as platonic forms: New support for the pre-Darwinian conception of evolution by natural law. J Theor Biol. 2002, 219 (3): 325-342. 10.1006/jtbi.2002.3128.PubMedView ArticleGoogle Scholar
- Dill KA: The theory of the folding and stability of globular proteins. Biochemistry. 1985, 24: 1501-1509. 10.1021/bi00327a032.PubMedView ArticleGoogle Scholar
- White SH: The amino acid preferences of small proteins: implications for protein stability and evolution. J Mol Biol. 1992, 227 (4): 991-995. 10.1016/0022-2836(92)90515-L.PubMedView ArticleGoogle Scholar
- Blake C: Exons-Present from the Beginning. Nature. 1983, 306 (5943): 535-537. 10.1038/306535a0.PubMedView ArticleGoogle Scholar
- Mayr E: The objects of selection. P Natl Acad Sci USA. 1994, 94: 2091-2094.View ArticleGoogle Scholar
- Gimelbrant AA, Skaletsky H, Chess A: Selective pressures on the olfactory receptor repertoire since the human-chimpanzee divergence. P Natl Acad Sci USA. 2004, 101 (24): 9019-9022. 10.1073/pnas.0401566101.View ArticleGoogle Scholar
- Aravind L, Iyer LM, Koonin EV: Comparative genomics and structural biology of the molecular innovations of eukaryotes. Curr Opin Struc Biol. 2006, 16 (3): 409-419. 10.1016/j.sbi.2006.04.006.View ArticleGoogle Scholar
- Chandonia JM, Kim SH: Structural proteomics of minimal organisms: Conservation of protein fold usage and evolutionary implications. BMC Structural Biology. 2006, 6: 7-10.1186/1472-6807-6-7.PubMedPubMed CentralView ArticleGoogle Scholar
- Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struc Biol. 2006, 16 (3): 393-398. 10.1016/j.sbi.2006.04.007.View ArticleGoogle Scholar
- Murzin AG: New-Protein Folds. Curr Opin Struc Biol. 1994, 4 (3): 441-449. 10.1016/S0959-440X(94)90114-7.View ArticleGoogle Scholar
- Yeates TO: Protein structure: evolutionary bridges to new folds. Curr Biol. 2007, 17 (2): R48-R50. 10.1016/j.cub.2006.12.003.PubMedView ArticleGoogle Scholar
- Jukes TH, Holmquist R, Moise H: Average proteins and genetic code. Science. 1976, 194 (4265): 642-643. 10.1126/science.982033.View ArticleGoogle Scholar
- Cebrat S, Dudek MR: Generation of overlapping open reading frames. Trends Genet. 1996, 12 (1): 12-12. 10.1016/0168-9525(96)81376-X.PubMedView ArticleGoogle Scholar
- Naora H, Deacon NJ: Relationship between the total size of exons and introns in protein-coding genes of higher eukaryotes. P Natl Acad Sci-Biol. 1982, 79 (20): 6196-6200. 10.1073/pnas.79.20.6196.View ArticleGoogle Scholar
- Mackiewicz P, Kowalczuk M, Mackiewicz D, Nowicka A, Dudkiewicz M, Laszkiewicz A, Dudek MR, Cebrat S: How many protein-coding genes are there in the Saccharomyces cerevisiae genome?. Yeast. 2002, 19: 619-629. 10.1002/yea.865.PubMedView ArticleGoogle Scholar
- Warren AS, Archuleta J, Feng WC, Setubal JC: Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics. 2010, 11: 131-10.1186/1471-2105-11-131.PubMedPubMed CentralView ArticleGoogle Scholar
- Chen LH, Kenyon GL, Curtin F, Harayama S, Bembenek ME, Hajipour G, Whitman CP: 4-oxalocrotonate tautomerase, an enzyme composed of 62 amino-acid-residues per monomer. J Biol Chem. 1992, 267 (25): 17716-17721.PubMedGoogle Scholar
- Seligmann H: Cost-minimization of amino acid usage. J Mol Evol. 2003, 56 (2): 151-161. 10.1007/s00239-002-2388-z.PubMedView ArticleGoogle Scholar
- Akashi H, Gojobori T: Metabolic efficiency and amino acid composition in the proteomes of escherichia coli and bacillus subtilis. P Natl Acad Sci USA. 2002, 99 (6): 3695-3700. 10.1073/pnas.062526999.View ArticleGoogle Scholar
- Kiraga J, Mackiewicz P, Mackiewicz D, Kowalczuk M, Biecek P, Polak N, Smolarczyk K, Dudek MR, Cebrat S: The relationships between the isoelectric point and: length of proteins, taxonomy and ecology of organisms. BMC Genomics. 2007, 8: 8-10.1186/1471-2164-8-8.View ArticleGoogle Scholar
- Nandi S, Mehra N, Lynn AM, Bhattacharya A: Comparison of theoretical proteomes: Identification of COGs with conserved and variable pI within the multimodal pI distribution. BMC Genomics. 2005, 6: 116-10.1186/1471-2164-6-116.PubMedPubMed CentralView ArticleGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. 2011, Vienna, AustriaGoogle Scholar
- White JV, Stultz CM, Smith TF: protein classification by stochastic modeling and optimal filtering of amino-acid-sequences. Math Biosci. 1994, 119 (1): 35-75. 10.1016/0025-5564(94)90004-3.PubMedView ArticleGoogle Scholar
- Venables WN, Ripley BD: Modern applied statistics with S. 2002, New York: SpringerView ArticleGoogle Scholar
- Lewis F, Butler A, Gilbert L: A unified approach to model selection using the likelihood ratio test. Methods in Ecology and Evolution. 2010, 2041-2210.Google Scholar
- Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974, 19 (6): 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.