Conservation of CD44 exon v3 functional elements in mammals

Background: The human CD44 gene contains 10 variable exons (v1 to v10) that can be alternatively spliced to generate hundreds of different CD44 protein isoforms. Inclusion of human CD44 variable exon v3 in the final mRNA depends on a multisite bipartite splicing enhancer located within the exon itself, which we have recently described. Exon v3 provides the protein domain responsible for growth factor binding to CD44.
Findings: We have analyzed the sequence of CD44v3 in 95 mammalian species and report high conservation levels for both its splicing regulatory elements (the 3' splice site and the exonic splicing enhancer) and the functional glycosaminoglycan (GAG) binding site encoded by v3. We also report the functional expression of CD44v3 isoforms in peripheral blood cells of different mammalian taxa with both consensus and variant v3 sequences.
Conclusion: CD44v3 mammalian sequences maintain all functional splicing regulatory elements, as well as the GAG binding site, with the same relative positions and sequence identity previously described for alternative splicing of human CD44. The sequence within the GAG attachment site, which in turn contains the Y motif of the exonic splicing enhancer, is more conserved than the rest of the exon. The amplification of CD44v3 sequence from mammalian species, but not from birds, fish or reptiles, suggests that CD44v3 may be an exclusively mammalian gene trait.


Background
The CD44 family of transmembrane glycoproteins mediates the response of cells to their extracellular microenvironment by regulating growth, survival, differentiation and motility. All human CD44 proteins are encoded by a single, highly conserved gene containing 20 exons, 12 of which undergo alternative splicing [1] (see figure 1A). Complex alternative splicing of the central region of the gene is responsible for the incorporation of the variable domains that shape, predominantly, the extracellular, membrane-proximal stem structure of the protein. The heterogeneity of the CD44 protein products can be further increased by post-translational modifications [1][2][3][4]. The sequence encoded in exon v3 contains an optimal Ser-Gly-Ser-Gly (SGSG) consensus motif for modification by heparan sulfate (HS) side chains, to which several heparin-binding proteins attach [5]. This unique HS addition site is critical for the capacity of CD44v3 isoforms to bind and present HS-dependent growth factors.
Human variable exon v3 can follow a specific alternative splicing route, different from that affecting other variable exons, so that it can be included in the mRNA either together with other variable exons or independently of them [6,7]. This inclusion is regulated by a multisite bipartite exonic splicing enhancer (ESE) consisting of a tandem nonamer (XX motif) and a heptamer (Y motif) that act cooperatively for the efficient recognition of the splice sites [8]. The XX motif is located centrally in the exon, while the Y motif is located within the sequence coding for the glycosaminoglycan (GAG) binding site, immediately downstream from the SGSG motif in v3 (figure 1B).
In order to address the existence and functional nature of the XXY ESE in non-human species, we have evaluated the overall level of conservation of CD44 exon v3, including its splicing regulatory elements (the 3' splice site (3'ss) and the XXY splicing enhancer) and the GAG binding site, in 95 mammalian species. We also provide RT-PCR data on CD44v3 inclusion into mRNA from peripheral blood samples in some representative mammalian taxa with differing levels of conservation of the sequence elements analyzed.

Methods
Frozen (-80°C) blood samples were selected from the animal tissue bank of the Department of R+D+I, Laboratorio Dr. Echevarne, Barcelona, Spain. Genomic DNA was isolated from 200 μl of blood using the NucleoSpin Blood kit (Macherey-Nagel) following the manufacturer's instructions. PCR amplification of CD44v3 was performed with the INT6SF and I7wtR primer set or the -49v3F and I7wtR primer set (table 1) using PCR Master Mix (Promega). PCR bands of interest were isolated from agarose using the NucleoSpin Extract II kit (Macherey-Nagel), sequenced in both directions with the primers used during PCR and the CEQ Dye Terminator Cycle Sequencing Quick Start kit (Beckman Coulter), and analyzed in a CEQ 8800 Genetic Analysis System (Beckman Coulter). All sequences were edited to remove ambiguous base calls and primer sequences and submitted to GenBank.
Figure 1 CD44 exon v3. A) Genomic structure of human CD44 gene (gray boxes, constitutive exons; white boxes, alternative exons; black box, variable exon v3; black line, introns). B) Schematic representation of exon v3 and its flanking introns (gray boxes, relative location of the XX and Y splicing enhancer motifs; white box, relative location of the nucleotides coding for the GAG binding site and the SGSG motif). Genomic DNA amplification was performed with PCR primers located in intron 6 and in the v3-intron 7 junction (indicated by arrows).
CD44 RT-PCR was performed on total RNA extracted from frozen blood samples with a modification of the QIAamp RNA Blood Mini kit protocol (QIAGEN). Briefly, 150 μl of frozen blood were lysed at 70°C for 10 min with RLT/β-mercaptoethanol buffer containing 4 mg/ml Proteinase K and centrifuged at 10,000 × g for 3 min. Then, 450 μl of the lysate supernatant were mixed with 225 μl of absolute ethanol and loaded onto a QIAamp spin column following the manufacturer's instructions. Eluted RNA was treated with RQ1 RNase-free DNase (Promega) at 37°C for 30 min and purified following the QIAamp RNA Mini protocol for RNA cleanup (QIAGEN). The first-strand reaction was performed with random primers (Promega) and SuperScript II Reverse Transcriptase (Invitrogen). As a control for RNA quality, total CD44 isoforms were amplified with the degenerate E20F-VI and E20R-QEM primer set (table 1) using the GC-Rich PCR System (Roche).
In order to amplify CD44v3-containing isoforms, PCR primers were designed based on a multiple sequence alignment containing the sequences corresponding to the 95 mammalian species. Exon v3 positions that showed full conservation were identified and selected to locate the 3' ends of the primers, ensuring perfect matches. Accordingly, v3 amplification was performed with primers 13v3F and 100v3R (table 1) and PCR Master Mix (Promega). As a control for complete genomic DNA digestion, non-reverse-transcribed RNAs were tested with primers 13v3F and 100v3R and yielded no amplification.
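The selection of fully conserved positions for the primer 3' ends can be illustrated with a minimal Python sketch. It is not the procedure actually used for 13v3F and 100v3R: the function names, the `clamp` parameter (how many consecutive conserved columns are required at the 3' end) and the toy alignment are all hypothetical and serve only to show the idea.

```python
def conserved_mask(alignment):
    """Return one boolean per alignment column: True if all sequences agree."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # A column counts as conserved only if it holds a single residue and no gap.
    return [len(set(col)) == 1 and "-" not in col for col in zip(*alignment)]


def candidate_3prime_ends(mask, clamp=1):
    """Return 0-based columns where a forward primer could place its 3' end.

    A column qualifies when it and the (clamp - 1) columns before it are all
    fully conserved. `clamp` is an illustrative assumption, not a value taken
    from the text.
    """
    return [
        i for i in range(clamp - 1, len(mask))
        if all(mask[i - clamp + 1 : i + 1])
    ]


# Hypothetical toy alignment (NOT real CD44v3 sequences).
toy_alignment = [
    "ACGTACGTAC",
    "ACGTTCGTAC",
    "ACGAACGTAC",
]
mask = conserved_mask(toy_alignment)
ends = candidate_3prime_ends(mask, clamp=3)
print(ends)
```

For a reverse primer the same mask applies, but the 3' end lies at the left edge of the primer window on the forward strand, so the conserved run would have to extend to the right of the chosen column instead.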
Molecular conservation analyses were conducted using the MEGA version 3.1 software [9] and sequence logos were generated with the WebLogo application [10].

CD44v3 sequencing
There is little sequence data available for CD44 variable exons from most animal species. The orthologue prediction for human CD44 in Ensembl release 48 provides v3 exon sequence for 16 species of mammals. In order to increase the data available we studied CD44 exon v3 in most of the animal samples stored in our tissue bank.
A region that enabled amplification of CD44v3 was located by a BLAST search against multiple species using a human genomic fragment spanning intron 6, exon v3 and intron 7. On this basis, our sample set was restricted to 95 mammalian species distributed in 29 families. The region amplified comprises, relative to the known human CD44v3 sequence, a 5' partial fragment of intron 6 and a 3' partial fragment of exon v3 (117 nucleotides out of 126) (see figure 1B). The resulting sequences are shown in Additional file 1.

CD44v3 splicing regulatory elements conservation
We have compared 140-nucleotide sequences, spanning 23 nucleotides of intron 6 and 117 nucleotides of exon v3, in the mammalian species listed in Additional file 1. The genomic region amplified enables the analysis of the CD44v3 splicing regulatory elements, namely the 3'ss and the XXY ESE [8]. The level of conservation at each position of the alignment in the 95 species analyzed is shown in figure 2A. Sequence alignment of exon v3 in these species reveals 69 (59%) conserved and 48 (41%) variable residues. Overall, the 95 species studied are represented by 28 different nucleotide sequences.
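The summary figures above (conserved and variable positions, and the number of distinct sequences) follow from simple column counts over the alignment. The analysis itself was carried out in MEGA; the following Python sketch only mirrors the arithmetic, using a hypothetical toy alignment and a hypothetical helper name.

```python
def summarize_alignment(alignment):
    """Count conserved/variable columns and distinct sequences in an alignment."""
    columns = list(zip(*alignment))
    conserved = sum(1 for col in columns if len(set(col)) == 1)
    total = len(columns)
    return {
        "conserved": conserved,
        "variable": total - conserved,
        "percent_conserved": round(100 * conserved / total),
        "distinct_sequences": len(set(alignment)),
    }


# Hypothetical toy alignment (NOT real CD44v3 sequences).
toy = ["ACGTAC", "ACGTAC", "ACGAAC", "TCGTAC"]
summary = summarize_alignment(toy)
print(summary)

# Percentages reported in the text for the 117 exon v3 positions.
reported_conserved, reported_variable = 69, 48
pct_conserved = round(100 * reported_conserved / 117)  # 59
pct_variable = round(100 * reported_variable / 117)    # 41
```

The two reported counts add up to the 117 aligned exon positions, and the rounded percentages match the 59% and 41% quoted above.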
The 3'ss is fully conserved in 84 out of 95 species. The remaining 11 species have single-nucleotide substitutions at positions -5 (n = 1), -6 (n = 2), -7 (n = 6) or -8 (n = 2) (see figure 2B). The functional significance of these variable positions in the splice site is addressed below by means of v3 expression analysis in peripheral blood.
The percentage of conserved residues in the ESE and in exon v3 is 68% and 59%, respectively. These values suggest that the level of conservation of the ESE is in the same range as that of the exon as a whole (figure 2A). Upon ESE dissection, XX reveals higher variability relative to Y (figure 2). If we consider the XXY ESE as a single unit of 25 nucleotides, this unit is represented by 13 different sequences whose relative frequencies are shown in table 2. The most common sequence is #10, present in 42% of the species tested, followed by #2, present in 28% of the species. The latter corresponds to the previously described human sequence [8] and is also detected in other primate families. The third most common sequence is #8, present in 11% of the species. The analysis of the XXY sequences reveals 17 (68%) conserved and 8 (32%) variable residues, most of which are single-nucleotide substitutions. Within the XX motif, positions 3 (conserved in 58 sequences out of 95), 4 (92 out of 95) and 10 (88 out of 95) are each represented by 3 different nucleotides. The positions that have been functionally shown elsewhere to decrease CD44v3 inclusion in mutant expression constructs (X mutant: AAATGggtA and Y mutant: ATGggtA) [8] remain invariant, corroborating their functional importance during splicing.
Figure 2 Nucleotide sequence conservation.
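A tally of motif variants such as the one summarized in table 2, together with the number of different nucleotides seen at each position, can be sketched in a few lines of Python. The motif strings below are hypothetical 7-nucleotide placeholders, not the real XXY sequences, and the function names are illustrative.

```python
from collections import Counter


def variant_frequencies(motifs):
    """Return (sequence, count, rounded percent) tuples, most common first."""
    counts = Counter(motifs)
    total = len(motifs)
    return [(seq, n, round(100 * n / total)) for seq, n in counts.most_common()]


def nucleotides_per_position(motifs):
    """Return how many different nucleotides occur at each motif position."""
    return [len(set(col)) for col in zip(*motifs)]


# Hypothetical motif sequences, one per species (NOT real XXY data).
toy_motifs = ["AATGCTA", "AATGCTA", "AATGCTA", "AATGTTA", "ACTGCTA"]

freqs = variant_frequencies(toy_motifs)
per_position = nucleotides_per_position(toy_motifs)
print(freqs)
print(per_position)
```

Positions with a value of 1 are invariant across the set. In the real data, values of 3 would correspond to positions such as 3, 4 and 10 of the XX motif.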

CD44v3 GAG binding site conservation
Considering the reading frame of v3 (codon start = 3), the 117 nucleotides of CD44 exon v3 translate into a 38-amino-acid extracellular protein domain. Amino acid sequence alignment of this domain from our data set shows 12 (32%) conserved and 26 (68%) variable residues, classifying the 95 species studied into 28 different amino acid sequence groups. This alignment also reveals full conservation of the SGSG motif (figure 3A), with the exception of Tursiops truncatus and Oryctolagus cuniculus, where the sequences have changed to SGSD and PGSG, respectively (see Additional file 1). Bourdon et al. [11] identified the amino acid sequence homology around the SG dipeptide sites that serve as GAG attachment sites in the core proteins of proteoglycans. The core protein must contain acidic amino acids on the amino-terminal side of the sequence SGXG, where X stands for any amino acid. The S is the most critical of the invariant residues, and the relative importance of the residues is S > first G > second G > acidic residues. Accordingly, only the rabbit v3 domain, which lacks the critical S, would not be expected to bind HS.
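The reading-frame arithmetic and the SGXG comparison can be made concrete with a short Python sketch. It assumes the standard genetic code and sequences free of ambiguous bases. The helper names and the 20-nucleotide toy sequence are hypothetical, and the flag function only reports which invariant SGXG residues are present; it is not a predictor of HS attachment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, codon_start=1):
    """Translate complete codons, starting at 1-based position `codon_start`."""
    offset = codon_start - 1
    dna = dna.upper()
    # Trailing nucleotides that do not fill a codon are ignored.
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3))


def sgxg_flags(tetrapeptide):
    """Report which invariant residues of the SGXG consensus are present."""
    return {
        "S": tetrapeptide[0] == "S",
        "first_G": tetrapeptide[1] == "G",
        "second_G": tetrapeptide[3] == "G",
    }


# With codon start = 3, a 117-nt fragment yields 38 complete codons.
n_codons = len(range(3 - 1, 117 - 2, 3))

# Hypothetical toy sequence (NOT real CD44v3): 2 leading nt + 6 codons.
toy_dna = "GG" + "GATGAATCTGGTTCTGGC"
toy_protein = translate(toy_dna, codon_start=3)
print(n_codons, toy_protein)

# The three tetrapeptides mentioned in the text.
for motif in ("SGSG", "SGSD", "PGSG"):
    print(motif, sgxg_flags(motif))
```

Under this comparison SGSD loses only the second G, whereas PGSG loses the S that is ranked as most critical, which is consistent with the reasoning applied to the rabbit sequence above.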
Human exon v3 contains acidic residues both upstream and downstream of the SGSG motif. The eight amino acids located downstream of the SGSG site consist of acidic residues flanked by hydrophobic residues that are necessary for the specific addition of HS at this site [12]. The species analyzed maintain a conserved GAG binding site both at the nucleotide (see figure 2A) and the amino acid level (figure 3B) implying that the secondary and/or tertiary structure around the SGSG motif is critical to initiate HS attachment, and this may have further contributed to the conservation of the Y ESE motif contained therein in all species tested.

CD44v3 expression in mammalian species
CD44v3 has been reported to be constitutively expressed in human peripheral blood cells, irrespective of their activation status [13][14][15]. We have accordingly used peripheral blood as a model to evaluate the expression of CD44v3 isoforms in some taxon-representative mammalian species. The RT-PCR results (see Additional files 1 and 2) show that there is no correlation between sequence variation within the 3'ss or the ESE and lack of v3 expression in peripheral blood. All species tested have revealed v3 expression, implying that the conservation observed is sufficient to maintain v3 inclusion. The human CD44 protein contains a unique HS binding site coded by CD44 exon v3 [16], therefore enabling only CD44v3-containing [...] [17,18] for CD44v3 in these species.
Figure 3 Amino acid sequence conservation.
In addition to mammals, CD44 constitutive exons are also found in birds, amphibians and fish, as described in public databases, although their expression in specific tissues in these taxa has not been addressed. In conclusion, we have obtained CD44v3 sequence from 95 mammalian species but have failed to amplify the homologous fragment from bird, reptile or fish species, in agreement with the absence of CD44 variable exons from the available genome sequences of model organisms in these taxa. This suggests that CD44v3 is an exclusively mammalian gene trait. The sequence conservation observed in our dataset would support a common origin and function for this exon in all mammals. Furthermore, CD44v3 sequence conservation in mammalian species enables maintenance of functional splicing regulatory elements and the GAG binding site. The level of conservation of the sequence encoding the GAG binding site, which in turn contains the Y motif of the ESE analyzed, is higher than the overall level found for the rest of the exon. Whether this is due to purifying selection acting on the GAG attachment domain alone or in conjunction with the Y motif of the ESE remains undetermined. Functional inclusion of CD44v3 has also been demonstrated in peripheral blood from mammalian species representative of the different sequence variations observed, implying in vivo use of exon v3 in these species. Further work is required to determine the exact evolutionary origin of CD44 exon v3 in mammals.