A significant amount of sequence data for various organisms has accumulated in databases; however, not all of the information on the genes of interest to researchers is accurate [1, 2]. At the same time, researchers often need to clone particular genes or non-coding genomic regions to establish or verify their functional role. The availability of sequence information makes the creation of complete genomic DNA libraries [3] unnecessary, thus speeding up research, since even advanced techniques for creating genomic libraries [4] are time-consuming.
However, working with genes that have not been fully investigated is risky because GenBank [5] entries can contain incorrect sequences. GenBank sequence accuracy and annotation problems have been widely discussed; for example, Wesche et al. [6] reported 0.1–0.2% mismatches in sequences of murine origin.
For our own research we needed to clone long genomic DNA sequences, but only partial sequence data (and unverified sequences for a closely related species) were available.
The traditional approach for this type of work involves amplification of the target region using a high-fidelity thermostable DNA polymerase. This usually requires optimization of the PCR conditions and often results in a low PCR product yield owing to the low processivity of high-fidelity DNA polymerases. Moreover, in practice, the error rates can be rather high, even for proof-reading polymerases. Experimental measurement of the mutation rate in PCR products showed that for a 349-bp fragment amplified by 30 PCR cycles, approximately 1% of clones had incorrect sequences [7]. It should be noted that a significant proportion of PCR-introduced point mutations were not detected in that study because a functional forward mutation assay was used [7]. Additionally, using higher numbers of PCR cycles to amplify target fragments from vertebrate genomic DNA, or amplifying large target fragments, can increase the proportion of incorrect PCR products to tens of percent, making identification of non-mutated clones problematic. The search for correct PCR-generated long DNA fragments is further complicated by the need for multiple specific sequencing primers directed to various regions of the target DNA sequence instead of generic vector-specific primers with known performance (Figure 1).
In some cases, if a region of interest has a complex spatial structure and a non-optimal GC-content, optimization of the PCR conditions to obtain a long PCR product can become a challenging task. A number of PCR additives and adjustments are commonly used in such cases (e.g. dimethyl sulfoxide [DMSO], betaine-Na, or alteration of the Mg²⁺ ion concentration), but they are not guaranteed to help, and may instead reduce the accuracy of the PCR. Moreover, if a PCR error is introduced at an early cycle, sequencing more clones will not increase the reliability of the sequence data, and the probability of picking a plasmid clone with a fully correct sequence will be marginal.
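The dependence of the mutated-clone fraction on fragment length and cycle number can be illustrated with a rough calculation. The sketch below is not taken from [7]; it assumes that every cycle is a full template doubling, that errors are independent, and that the fraction of mutated clones follows 1 - exp(-rate × length × cycles). The function names and the back-calculated effective rate are illustrative assumptions only, and because the cited assay missed part of the mutations, the real values are likely to be higher.

```python
import math

def effective_error_rate(mutated_fraction, length_bp, cycles):
    """Back-calculate an effective per-bp, per-cycle error rate.

    Assumes each cycle is one template doubling and errors are independent.
    """
    return -math.log(1.0 - mutated_fraction) / (length_bp * cycles)

def mutated_clone_fraction(length_bp, cycles, rate):
    """Expected fraction of clones carrying at least one PCR-induced mutation."""
    return 1.0 - math.exp(-rate * length_bp * cycles)

# Figures quoted in the text: 349 bp, 30 cycles, about 1% incorrect clones.
rate = effective_error_rate(0.01, 349, 30)

# Extrapolation to longer targets under the same (optimistic) effective rate.
for length in (349, 1000, 3000, 5000):
    fraction = mutated_clone_fraction(length, 30, rate)
    print(f"{length:>5} bp, 30 cycles: {fraction:.1%} of clones mutated")
```

Under these assumptions the mutated fraction grows almost linearly with length for short fragments and saturates for long ones, which is why screening becomes impractical for multi-kilobase amplicons.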
To overcome these limitations, we have developed an experimental approach and a computer tool which together simplify the process of generating a plasmid clone containing a long DNA fragment with an accurate sequence. The proposed modular assembly cloning (MAC) strategy is an alternative to direct PCR amplification and cloning for DNA fragments exceeding 3 kbp. Instead of attempting to obtain a sufficient yield of the long target amplicon and then extensively screening clones with a low prospect of finding one without point mutations, the target fragment is divided into 500–1000-bp modules, or consecutive sub-fragments, each starting and ending with restriction sites and/or hybrid sites generated by pairs of compatible restriction endonucleases (REs). An example of a compatible RE pair is Nco I and Pci I, which recognize CCATGG and ACATGT sites, respectively, producing compatible cohesive ends with CATG overhangs and a non-palindromic (hybrid) site (CCATGT) after ligation of the cut DNA fragments.
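The Nco I / Pci I example can be checked with a few lines of Python. This is a minimal sketch that models only the top strand; the flanking sequences and helper names are hypothetical, and the cut position (after the first base of each site, leaving a CATG overhang) is the only enzyme property used.

```python
NCOI = "CCATGG"   # Nco I site, cut C^CATGG
PCII = "ACATGT"   # Pci I site, cut A^CATGT
CUT_OFFSET = 1    # both enzymes cut the top strand after the first base

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_palindrome(site):
    """True if the site equals its own reverse complement."""
    return site == site.translate(COMPLEMENT)[::-1]

def cut(seq, site, offset=CUT_OFFSET):
    """Split the top strand at the first occurrence of the site."""
    i = seq.index(site)
    return seq[:i + offset], seq[i + offset:]

# Hypothetical modules: one ends with an Nco I site, the next starts with a Pci I site.
upstream_module = "GATTACA" + NCOI + "TTT"
downstream_module = "AAA" + PCII + "GGCATTC"

left, _ = cut(upstream_module, NCOI)      # keeps the sequence up to ...C
_, right = cut(downstream_module, PCII)   # keeps CATGT... onwards
joined = left + right                     # sticky-end ligation of the CATG overhangs

junction = joined[len(left) - 1:len(left) + 5]
print(junction)                           # the hybrid site at the junction
print(is_palindrome(junction))            # hybrid site is not palindromic
print(NCOI in joined, PCII in joined)     # neither enzyme can re-cut the junction
```

Because the junction is no longer a recognition site for either enzyme, the same RE pair can be reused at later assembly steps without cutting already joined modules.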
PCR-amplified modules are cloned individually into a plasmid vector; their sequences are verified using generic primers, and the isolated inserts are then assembled by sticky-end ligation, starting from the first cloned module (Figure 1). PCR products may also be added to the assembly without sub-cloning if their direct sequencing gives clear and satisfactory results. Because the restriction-ligation procedure rarely generates point mutations, re-sequencing of the assembled fragment is unnecessary. The MAC strategy minimizes the number of custom synthetic oligonucleotides for DNA sequencing and the number of sequencing runs, and allows rapid molecular cloning of long DNA fragments with a high level of accuracy.
The simplest way to divide a target DNA into modules is to identify the recognition sites of all of the available restriction enzymes that produce 4-nt overhangs. If we assume that the distribution of nucleotides in the DNA fragment is random, a given 6-nt sequence will occur once in every $4^6$ (4096) bp. There are $4^3$ (64) different 6-bp palindromes; therefore, some 6-nt palindrome should occur once in every $4^3$ (64) nt. This number of RE recognition sites appears to be sufficient for finding at least one suitable site within a 100–200 bp area, but a significant proportion of REs will have no recognition sites in the entire target DNA fragment (the probability of an absence of recognition sites within a 5000-bp fragment for a given 6-bp RE is 29.5%, based on a binomial distribution). Many REs will have two or more recognition sites (34.4% probability of multiple recognition sites for a 5000-bp fragment). Moreover, some REs are not convenient for cloning purposes because of sensitivity to Dam/Dcm DNA methylation. In addition, for some 6-bp palindromes, no REs with 4-nt overhangs are available.
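The binomial estimates quoted above can be reproduced with a short calculation. The sketch below assumes a random sequence with equal base frequencies and treats each possible start position as an independent trial; the function name and the choice of fragment length minus site length plus one as the number of trials are our assumptions.

```python
def site_count_probabilities(fragment_len=5000, site_len=6):
    """Binomial probabilities of zero, one, and two or more sites.

    Assumes equal base frequencies and independent start positions.
    """
    p = 0.25 ** site_len                  # chance of a match at one position
    n = fragment_len - site_len + 1       # number of possible start positions
    p_none = (1 - p) ** n
    p_one = n * p * (1 - p) ** (n - 1)
    p_multiple = 1 - p_none - p_one
    return p_none, p_one, p_multiple

p_none, p_one, p_multiple = site_count_probabilities()
print(f"no site:        {p_none:.1%}")
print(f"exactly one:    {p_one:.1%}")
print(f"two or more:    {p_multiple:.1%}")
```

With the default arguments the first and last values should land close to the 29.5% and 34.4% figures given in the text, leaving roughly a one-in-three chance that a given 6-bp RE cuts the fragment exactly once.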
There are $4^3 = 64$ different 6-bp palindromes, whereas the number of different hybrid recognition sites, including palindromes, is $4^2 \times 4^2 = 256$. A hybrid recognition site therefore occurs in a random DNA fragment once in every $4^2$ (16) bp, so the probability of finding a suitable site within a short region is four times higher. More importantly, if the chosen hybrid site occurs in another part of the target DNA fragment, this does not render it unusable, because the hybrid site is created only after ligation of the adjacent modules and is not cut by any RE. The occurrence, in another part of the target DNA fragment, of recognition sites for either of the REs that make up the hybrid site should not affect the success of the cloning, because only the two adjacent modules are cut by these REs during assembly. The only limitation in using hybrid recognition sites is that a recognition site for an RE used to generate the hybrid site must not be present within the module that this RE cuts. The probability of this occurring is only 25% for 1000-bp modules and 12.5% for 500-bp modules.
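The site counts above can be verified by brute-force enumeration. The sketch below reflects our reading of the count: a hybrid site is taken to be any hexamer whose central 4 nt form a palindrome (the shared overhang), with the two outer bases free. That definition and the variable names are assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

hexamers = ["".join(bases) for bases in product("ACGT", repeat=6)]

# Classical palindromic 6-bp sites: equal to their own reverse complement.
palindromes = [h for h in hexamers if h == revcomp(h)]

# Assumed hybrid-site definition: palindromic 4-nt core, any outer bases.
hybrids = [h for h in hexamers if h[1:5] == revcomp(h[1:5])]

print(len(palindromes), len(hexamers) // len(palindromes))  # count, mean spacing in bp
print(len(hybrids), len(hexamers) // len(hybrids))          # count, mean spacing in bp
print(set(palindromes) <= set(hybrids))                     # palindromes are a subset
```

Every palindromic hexamer also satisfies the hybrid definition, which is why the text counts the 256 hybrid sites as including the 64 palindromes.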
The use of hybrid restriction sites greatly simplifies MAC. The obvious bottleneck in utilizing such sites, however, is the complexity of identifying their positions in the DNA fragment, and hence of generating a list of possible sub-fragments, when only the “Find” function of a text processor is available. An option for mapping hybrid recognition sites is not present in the common software packages used for molecular cloning, such as Vector NTI Suite or DNA Star. Hence, a specialized software tool is needed for such calculations.
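To illustrate the kind of calculation involved, the following minimal sketch scans a sequence for hexamers that could arise as hybrid sites from a pair of compatible REs. It is not the tool described in this work; the enzyme table is a small illustrative subset, and the function name and test sequence are hypothetical.

```python
# Illustrative subset of REs that cut after the first base of a 6-bp site,
# leaving a 4-nt overhang equal to the central four bases.
ENZYMES = {
    "NcoI": "CCATGG",
    "PciI": "ACATGT",
    "BspHI": "TCATGA",
    "BamHI": "GGATCC",
    "BglII": "AGATCT",
}

def find_hybrid_sites(seq, enzymes=ENZYMES):
    """List (position, site, upstream RE, downstream RE) for each match.

    A match means the native hexamer equals the first base of the upstream
    RE site, the shared 4-nt core, and the last base of the downstream RE site.
    Pairs of identical enzymes correspond to ordinary palindromic sites.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        window = seq[i:i + 6]
        for up_name, up_site in enzymes.items():
            for down_name, down_site in enzymes.items():
                if up_site[1:5] != down_site[1:5]:
                    continue  # overhangs are not compatible
                if window == up_site[:5] + down_site[5]:
                    hits.append((i, window, up_name, down_name))
    return hits

example = "TTTCCATGTAAAGGATCTCC"  # hypothetical target sequence
for hit in find_hybrid_sites(example):
    print(hit)
```

Each hit marks a position where the target could be split into two modules: the upstream module would be amplified with a primer ending in the upstream RE site and the downstream module with a primer starting with the downstream RE site, so that ligation restores the native sequence.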