MarkerSet: a marker selection tool based on markers location and informativity in experimental designs

Background The recent sequencing of full genomes has made many SNP markers available, which are very useful for mapping complex traits. In livestock production, there are still no commercial arrays and many studies use home-made sets of SNPs. Current methodologies for SNP genotyping therefore remain expensive, and selecting the SNPs to use is a crucial step. Indeed, the main factors affecting the power of linkage analyses are the density of the genetic map and the heterozygosity of markers in the parents of the tested animals.

Findings We have therefore developed a PERL program that selects a defined number of markers based on their locations on the genome and their informativity in specific experimental designs. As an option, different experimental designs can be combined in order to select the best possible common marker set. The program has been tested under different conditions of marker informativity and density with both real and simulated datasets. The results show that the program efficiently selects the most informative markers for whole genome scan mapping analyses, even when informativity varies widely. When different experimental crosses are combined, the multidesign mode can optimize the SNP marker selection.

Conclusion Because MarkerSet is written in PERL, it is highly portable across operating systems (OS) and its source code is available for user modifications. Except for the simulation mode, which can be time consuming, MarkerSet computes results in a very short time.


Findings
The recent sequencing of full genomes has made many SNP markers available ([1] for Human and [2] for Chicken). Current methodologies for genotyping home-made SNP sets are still expensive, meaning that only a few thousand SNPs can be used. Selecting the best suited SNPs is therefore a crucial step for a specific study. For linkage analyses, the main criteria for increasing the analysis power are the distances between markers and the ability to follow the segregation of marker alleles in the experimental design. This means that the markers must be heterozygous in as many parents of the phenotyped animals as possible. This is why the heterozygosity in the parents of phenotyped animals (hereafter called reference animals) must be included in the marker selection. In this manuscript, this heterozygosity in reference animals is referred to as the informativity of the markers. From our point of view, if no SNP arrays are available, the best strategy is a two-step genotyping: first, the informativity of a large panel of SNPs is tested on reference animals from the studied experimental design; then all the animals are genotyped for the markers selected from the results of the first step. Marker selection is complicated by the fact that the markers that are most heterozygous in reference animals are not homogeneously spaced across the genome, and that the number of markers to handle has greatly increased. It is therefore no longer possible to select the markers without dedicated software. Different tools have already been proposed to select Tag SNPs [3][4][5][6][7][8][9][10][11], but most of them rely on very high marker density and linkage disequilibrium information. They cannot be used in exotic species or in species without SNP arrays, for which linkage disequilibrium information is not always available. We propose here a tool to select the best possible markers for further linkage analysis, without any use of linkage disequilibrium information.
Its originality is the use of both marker location in the genome and heterozygosity in parental animals.
The MarkerSet software was written in the PERL programming language and can be downloaded with manual and example files at http://www.sigenae.org/index.php?id=136.
The software is designed to use already available information about marker informativity, expressed as the number of heterozygous animals out of all the reference animals tested in the experimental design. This allows the use of any kind of marker, as well as combinations of marker types if needed. If more than one experimental design is to be genotyped, a specific set of SNPs can be selected for each design, or the marker informativity for all these experimental designs can be used simultaneously to select a common set of markers. When a marker set common to all designs is selected, both the general informativity score and the scores specific to each experimental design are reported, so the informativity of the marker set can be evaluated for each experimental design.
In most species, the only available information for the markers will be their physical location (especially true for SNP markers), as not all markers have been tested on a reference population to estimate genetic distances. Nevertheless, for QTL mapping, genetic distances are the key point because the recombination rate can vary greatly between species. MarkerSet therefore takes physical distances as input and converts them into cM. This conversion can be adapted to the studied species (for example, in pigs, we can consider that 1 cM corresponds to approximately 1 Mb).
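As a minimal sketch of this conversion, assuming a single uniform recombination rate over the whole genome, the pig example can be written in Python as follows. The function and parameter names are hypothetical and are not taken from the MarkerSet source, which is written in PERL.

```python
def physical_to_genetic(position_bp: int, cm_per_mb: float = 1.0) -> float:
    """Convert a physical position (base pairs) into a genetic position (cM).

    Assumption: one uniform rate (cM per Mb) for the whole genome.
    The default of 1.0 mirrors the pig example given in the text.
    """
    return position_bp / 1_000_000 * cm_per_mb


# A marker at the end of a 63 Mb chromosome, with about 1 cM per Mb.
print(physical_to_genetic(63_000_000))  # 63.0
```

For a species with a different average recombination rate, only `cm_per_mb` would need to change under this assumption.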
Basically, the algorithm selects the most informative markers in two windows separated by a constant gap and sliding along the genome (see Figure 1A). When several markers in a window have similar informativity, MarkerSet selects the marker closest to the middle of the window. With this strategy, the distance between two markers is the first selection criterion, and informativity is used to discriminate between closely located markers. The two main variables are the first window starting point on the genome and the size of the gap separating the two windows. Depending on the number of markers to select and the size of the genome, MarkerSet computes different window starting points to get the best genome coverage.
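The choice made inside one window can be illustrated with a short Python sketch. It handles a single window only and assumes that "similar informativity" means an equal number of heterozygous reference animals. The `Marker` type and `pick_marker` function are hypothetical names introduced for illustration, not part of MarkerSet.

```python
from typing import NamedTuple, Optional, Sequence


class Marker(NamedTuple):
    name: str
    position_cm: float
    informativity: int  # number of heterozygous reference animals


def pick_marker(markers: Sequence[Marker], window_start: float,
                window_size: float) -> Optional[Marker]:
    """Return the most informative marker in the window.

    Ties on informativity are broken by distance to the window middle
    (assumption: a tie means exactly equal informativity values).
    """
    window_end = window_start + window_size
    middle = window_start + window_size / 2
    candidates = [m for m in markers
                  if window_start <= m.position_cm <= window_end]
    if not candidates:
        return None
    return max(candidates,
               key=lambda m: (m.informativity, -abs(m.position_cm - middle)))


markers = [Marker("snpA", 10.2, 3), Marker("snpB", 11.0, 3),
           Marker("snpC", 11.9, 2)]
print(pick_marker(markers, 10.0, 2.0).name)  # snpB
```

Here snpA and snpB are equally informative, so the marker located at the middle of the window (snpB) is retained.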
The gap size and the window size are defined by the average marker interval (AMI), which is the ratio of the whole genome size to the number of markers to select. The AMI percentage used to calculate the window size is defined in the config.pm file (set by default to 20% of the AMI). The size of the selection window is thus handled automatically by the software.

Figure 1 Principles of MarkerSet and main parameters. a) MarkerSet selects markers in two windows separated by the Average Marker Interval (AMI), which is the whole genome size divided by the number of markers to select. The window size is a percentage of the AMI (20% by default). Iteratively shifting the windows by the AMI gives full genome coverage. Different sets are created by using all the possible starting points (x and y). b) Several parameters and options are available to improve the quality of the sets. The space_plus and space_resampling parameters are used to enlarge the window size in case of low (or no) informativity. Space_plus is set by default to 50% of the window size on each side; this enlargement is performed automatically if the informativity of the available markers is lower than the defined informativity threshold. Space_resampling is used to iteratively enlarge the window size (by default +1 cM on each side at each step) until markers with informativity higher than the defined resampling threshold are found (resampling option mode).

These two parameters (AMI and window size) determine the number of possible starting points (i.e. the number of selected marker panels). Thus, for each combination of these parameters, a marker selection is performed with a fixed starting point and multiple iterations over the genome (Selection Frame). At each iteration step, the starting point of each pickup box is increased by AMI + window size (see Figure 1A).
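The arithmetic behind the AMI, the window size and one selection frame can be sketched as follows. This is an illustration in Python under stated assumptions, not the PERL implementation: the function names are hypothetical, genome size is taken in cM, and the step between pickup boxes is passed explicitly so that the AMI + window size rule from the text can be applied by the caller.

```python
def average_marker_interval(genome_size_cm: float, n_markers: int) -> float:
    """AMI: whole genome size divided by the number of markers to select."""
    return genome_size_cm / n_markers


def window_size(ami: float, ami_percent: float = 20.0) -> float:
    """Window size as a percentage of the AMI (20% is the stated default)."""
    return ami * ami_percent / 100.0


def window_starts(genome_size_cm: float, first_start: float,
                  step: float) -> list:
    """Start positions of the pickup boxes for one selection frame."""
    starts = []
    position = first_start
    while position < genome_size_cm:
        starts.append(position)
        position += step
    return starts


ami = average_marker_interval(1000.0, 100)  # 10.0 cM
size = window_size(ami)                     # 2.0 cM
frame = window_starts(100.0, 0.0, ami + size)
print(ami, size, frame[:3])  # 10.0 2.0 [0.0, 12.0, 24.0]
```

Changing `first_start` gives the alternative frames mentioned in the text, each of which would yield one candidate marker panel.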
For all analyses, an informativity threshold is set: if the best available marker in a window has an informativity strictly lower than this threshold, the window is enlarged (space plus: by default, 50% of the window size is added to each side of the window, see Figure 1B) and a more informative marker is searched for. By default, this threshold is set to half of the best possible informativity score for one marker (i.e. half of the total number of animals tested). If no marker with a higher informativity is found, the previous best marker is kept, as this results in a shorter distance between markers. As an option, the window can be enlarged repeatedly until a marker more informative than the resampling threshold (set by the user) is found (resampling option). The enlargement step is defined by the user through the space resampling parameter (see Figure 1B). The working principle of the software is shown in Figure 2.
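The threshold test and the single space plus enlargement can be sketched in Python as below. This is a hedged illustration: markers are plain (position in cM, number of heterozygous animals) tuples, the function names are hypothetical, and the handling of an empty window (enlarge and take whatever is found) is an assumption based on the "low (or no) informativity" wording of Figure 1. The iterative resampling option is not covered.

```python
def best_in_range(markers, start, end):
    """Most informative (position_cm, informativity) tuple in [start, end];
    ties go to the marker closest to the middle of the range."""
    middle = (start + end) / 2
    candidates = [m for m in markers if start <= m[0] <= end]
    if not candidates:
        return None
    return max(candidates, key=lambda m: (m[1], -abs(m[0] - middle)))


def select_with_space_plus(markers, start, size, n_animals,
                           space_plus=0.5, threshold=None):
    """Pick a marker, enlarging the window once if informativity is low."""
    if threshold is None:
        threshold = n_animals / 2  # default: half of the tested animals
    best = best_in_range(markers, start, start + size)
    if best is not None and best[1] >= threshold:
        return best
    extra = size * space_plus  # added on each side of the window
    enlarged = best_in_range(markers, start - extra, start + size + extra)
    if best is None:
        return enlarged  # assumption: an empty window takes any enlarged hit
    if enlarged is not None and enlarged[1] > best[1]:
        return enlarged
    return best  # nothing better found: keep the previous best marker


markers = [(9.2, 2), (10.5, 1), (12.8, 4)]
print(select_with_space_plus(markers, 10.0, 2.0, n_animals=6))  # (12.8, 4)
```

In this example the only marker inside the 10 to 12 cM window has 1 heterozygous animal out of 6, below the default threshold of 3, so the window grows to 9 to 13 cM and the marker at 12.8 cM is chosen.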
To score the different panels obtained, one approach is simply to sum the informativity value of each selected marker (i.e. the number of heterozygous parental animals in our case). This linear scoring is effective for markers with an extreme informativity value (i.e. 0 or 1 heterozygous animals or, at the other end, all animals heterozygous), but it does not discriminate enough between "middle-range" markers. As an example, out of a total of 6 tested animals, we prefer to give much more weight to a marker with 4 heterozygous animals than to one with 3 heterozygous animals. To better represent the informativity of a marker, we decided to transform the informativity value of each marker onto a sigmoid scale (see Figure 3). This approach maximises the score of the most informative markers and minimises the score of the least informative ones but, more importantly, it discriminates between "middle-range" informative markers. Finally, a panel score is obtained by summing the score values of all markers selected for this panel. In addition, the software computes some information describing each experimental design: the maximum informativity score (i.e. the sum of the informativity scores of all available markers) and the distribution of the number of markers in each informativity value class. These data are available to the user in a log file.
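The sigmoid scoring can be sketched in Python from the description given in the Figure 3 legend: informativity values are mapped onto a -5 to +5 scale, transformed with the arctangent, then shifted to be positive. The function names are hypothetical, and shifting by arctan(5) (about 1.37) is our reading of the "0 to 2.74" adjustment rather than a detail confirmed by the source code.

```python
import math

SCALE_HALF_WIDTH = 5.0  # informativity values are mapped onto [-5, +5]


def marker_score(n_heterozygous: int, n_tested: int) -> float:
    """Sigmoid informativity weight of one marker, in the range 0 to ~2.74."""
    # Equivalent to X_i = X_{i-1} + 10/n with X_0 = -5.
    x = -SCALE_HALF_WIDTH + 2 * SCALE_HALF_WIDTH * n_heterozygous / n_tested
    # Assumption: scores are made positive by adding arctan(5).
    return math.atan(x) + math.atan(SCALE_HALF_WIDTH)


def panel_score(het_counts, n_tested: int) -> float:
    """Panel score: sum of the weights of all selected markers."""
    return sum(marker_score(h, n_tested) for h in het_counts)


for het in range(7):
    print(het, round(marker_score(het, 6), 2))
```

With 6 tested animals, the ratio between the weights of markers with 4 and 3 heterozygous animals is larger under this transformation than under linear scoring (4/3), which is the behaviour the text asks for.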
When studying several experimental designs in the same species, users may want to compare two options: selecting markers perfectly fitted to each experimental design (for example, heterozygous for all F1 sires), or selecting a larger set of markers common to all experimental designs (in this case, some markers will be homozygous in some families, resulting in a loss of power in the linkage analysis). To help with this dilemma, a multidesign option has been implemented. The principle of marker selection and panel scoring is exactly the same, except that the software uses a global informativity value generated by summing, for each marker, the informativity values of all experimental designs. Based on this global informativity, MarkerSet selects the most informative markers for the multidesign and scores the set with the multidesign informativity values (score A). As mentioned above, this multidesign option makes it possible to evaluate which solution fits best for a given number of markers: one set of markers common to all experimental designs, or several sets of markers, each specific to one design.
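The global informativity used in multidesign mode is a per-marker sum over designs, which can be sketched in Python as follows. The data layout (a dictionary of designs, each mapping marker names to heterozygous counts), the marker names and the function name are assumptions made for illustration only.

```python
def global_informativity(per_design: dict) -> dict:
    """Sum, for each marker, its informativity over all experimental designs.

    per_design maps a design name to {marker name: heterozygous count}.
    Assumption: a marker absent from a design contributes 0 for that design.
    """
    totals = {}
    for counts in per_design.values():
        for marker, n_heterozygous in counts.items():
            totals[marker] = totals.get(marker, 0) + n_heterozygous
    return totals


designs = {
    "Exp1": {"snp1": 4, "snp2": 0, "snp3": 2},
    "Exp2": {"snp1": 1, "snp2": 6, "snp3": 3},
}
print(global_informativity(designs))  # {'snp1': 5, 'snp2': 6, 'snp3': 5}
```

In this toy example snp2 has the highest global value although it is uninformative in Exp1, which illustrates the loss of power in some families that the text describes.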
In order to measure the loss of informativity, MarkerSet performs a simulation of marker selection specific to each design using the same selection frame (with the number of markers to select in the multidesign option) and scores it (score B). A ratio between the multidesign score (score A) and the experimental design specific score (score B) is calculated (called MD/Sim, r in the logfile). This ratio gives an estimation of the "conserved" informativity score between multidesign and design-specific marker selection: as an example, a ratio of 0.82 means that only 18% of the informativity score is lost with the multidesign option.

Figure 2 Working principle of MarkerSet.
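The MD/Sim ratio itself is a simple quotient, sketched here in Python. The function name and the example scores (82 and 100) are made up to reproduce the 0.82 case quoted in the text; how score B is aggregated across designs is not specified here and is left to the caller.

```python
def md_sim_ratio(score_multidesign: float,
                 score_design_specific: float) -> float:
    """Ratio of the multidesign score (A) to the design-specific score (B)."""
    if score_design_specific <= 0:
        raise ValueError("design-specific score must be positive")
    return score_multidesign / score_design_specific


ratio = md_sim_ratio(82.0, 100.0)
print(f"ratio = {ratio:.2f}, informativity lost = {1 - ratio:.0%}")
# ratio = 0.82, informativity lost = 18%
```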
As the results can fluctuate greatly with the informativity and the density of the available markers, it is possible to perform a simulation to define the best suited window size percentage. This simulation can also be combined with all available options (resampling and multidesign).
In order to test the program core functions and options, MarkerSet was run on several different data files. First, a small data file corresponding to a real case was generated with 206 low informativity markers (among them, 162 are not informative at all), located on one chromosome of 63 Mb, and 4 tested animals. Using MarkerSet with this file in verbose mode, we checked that the algorithm effectively selects the most informative marker, taking the marker location into consideration in case of similar informativity, but also enlarges the window.

Figure 3 Computation of informativity weight. Empirically, this sigmoid scale is obtained by applying the arctangent function to values between -5 and +5 (corresponding to transformed informativity scores of -1.37 to +1.37). For each experimental design, we re-assign the different informativity values to a -5 to +5 scale (see Figure 3). Let $X = \{X_0, X_1, \ldots, X_n\}$ denote the informativity status values, where $n$ is the number of tested animals for one experimental design. Each informativity value is determined as $X_i = X_{i-1} + 10/n$, with $X_0 = -5$, to fit a scale from -5 to +5. The informativity score values are then expressed from -1.37 to +1.37 (the arctangent values of -5 and +5, respectively). The scores obtained are finally adjusted to a 0 to 2.74 range in order to get only positive score values. The vertical axis represents the informativity weight and the horizontal axis the informativity values.

Once the main concept of the program had been tested and validated with the small data file, we extended the evaluation of the program to various other situations by generating simulated data files with different marker densities and informativity distributions. Finally, the program was also tested on a real informativity file of 9216 markers with five experimental designs (cf. Figure 4 for the distribution of the number of markers in each informativity value).
For each informativity file, a selection of 384 markers was performed in the basic mode, with or without the resampling option, and in the multidesign mode (1536 markers requested), with or without the resampling option. Score results for the simulated data files and the real data file are shown in Tables 1 and 2, respectively (see additional files 1 and 2 for complete results). As expected, MarkerSet results are very sensitive to marker density and informativity distribution. Notably, with our real data file, there are not enough informative markers to select 1536 SNPs. Moreover, the multidesign option can have a drastic impact on the scores and on the loss of informativity (Ratio) with low informativity files (especially with the resampling option).

The simulation option was also tested on the simulated data and real data files (see additional files 1 and 2). As expected, the highest score is always obtained with the highest AMI percentage, since the windows are larger (see Figure 5). Depending on the priority given to marker locations or to marker informativity, users should test different conditions to find out which parameters best fit their experimental designs.

Competing interests
The authors declare that they have no competing interests.

Table 1 (simulated data files). For this purpose, six files with different marker informativity status have been generated with two variables. The first one is the marker density (5K markers, LD for Low Density, or 40K markers, HD for High Density, spread homogeneously over the genome). The second one is the marker informativity distribution. Considering a total of 100 reference animals, the following conditions have been explored: markers with heterozygosity values ranging from 50 to 100 (High Informativity, HI), 0 to 100 (Various Informativity, VI) or 0 to 50 (Low Informativity, LI). For each marker panel and condition, the maximal available informativity score (max info), the selected set score, the multidesign/monodesign ratio and the score gain obtained by using the resampling option (-r gain) are detailed. R and MD refer to the activation of the resampling option and of the multidesign option, respectively. Score results depend on marker density and informativity distribution (better with HI and lower with LI files). Nevertheless, there is only a slight score difference between HI and VI, showing the efficiency of MarkerSet in selecting the most informative markers. The resampling option is more useful with LD files but can have an impact on the loss of informativity (Ratio) in multidesign mode with the LI file.

Table 2 (real data file). The data file includes the genotypes of 9216 SNPs covering the whole genome for the 26 F1 sires of five real chicken F2 designs (4 in Exp1, 5 in Exp3 and Exp5, and 6 in Exp2 and Exp4). For each marker panel and condition, the maximal available informativity score (max info), the selected set score, the multidesign/monodesign ratio, the score gain obtained by using the resampling option (-r gain), the maximal (Dmax), minimal (Dmin), average (AveD) and standard deviation (StD) of the distances between two markers, the number of selected markers and the number of non-informative markers in this set are detailed. R and MD refer to the activation of the resampling option and of the multidesign option, respectively. With the resampling option, the gain is inversely proportional to the maximum informativity, except for Exp2, because of an overrepresentation of markers heterozygous for 0 and 6 animals in this experimental design. The results for multidesign mode (1536 markers) are similar to those obtained with the 5K markers file: the ratio is about 0.90, and the resampling option increases the number of selected markers (and thus the final score) without significant modification of the average distance and the standard deviation.
Figure 4 Experimental designs marker informativity distribution. Each bar represents the number of markers for each informativity value in each experimental design.

Figure 5 Impacts of window sizes upon informativity score and standard deviation. The horizontal axis represents the percentage of AMI used to define the window size (15 to 40%). The left vertical axis represents the best marker set score (full squares), and the right vertical axis the standard deviation (white diamonds). The simulation mode was performed on experimental design 1 for 384 markers requested without the resampling option.