An efficient clustering algorithm for partitioning Yshort tandem repeats data
 Ali Seman^{1},
 Zainab Abu Bakar^{1} and
 Mohamed Nizam Isa^{2}
DOI: 10.1186/1756-0500-5-557
© Seman et al.; licensee BioMed Central Ltd. 2012
Received: 1 March 2012
Accepted: 22 September 2012
Published: 6 October 2012
Abstract
Background
YShort Tandem Repeats (YSTR) data consist of many similar and almost similar objects. This characteristic of YSTR data causes two problems with partitioning: nonunique centroids and local minima problems. As a result, the existing partitioning algorithms produce poor clustering results.
Results
Our new algorithm, called kApproximate Modal Haplotypes (kAMH), obtains the highest clustering accuracy scores for five out of six datasets, and produces an equal performance for the remaining dataset. Furthermore, clustering accuracy scores of 100% are achieved for two of the datasets. The kAMH algorithm records the highest mean accuracy score of 0.93 overall, compared to that of other algorithms: kPopulation (0.91), kModesRVF (0.81), New Fuzzy kModes (0.80), kModes (0.76), kModesHybrid 1 (0.76), kModesHybrid 2 (0.75), Fuzzy kModes (0.74), and kModesUAVM (0.70).
Conclusions
The partitioning performance of the kAMH algorithm for YSTR data is superior to that of other algorithms, owing to its ability to solve the nonunique centroids and local minima problems. Our algorithm is also efficient in terms of time complexity, which is recorded as O(km(n − k)) and considered to be linear.
Keywords
Algorithms, Bioinformatics, Clustering, Optimization, Data mining
Background
YShort Tandem Repeats (YSTR) data represent the number of times an STR motif repeats on the Ychromosome. It is often called the allele value of a marker. For example, if there are eight allele values for the DYS391 marker, the STR would look like the following fragments: [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA] [TCTA]. The number of tandem repeats has effectively been used to characterize and differentiate between two people.
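The notion of an allele value can be illustrated with a minimal sketch that counts the longest uninterrupted run of a motif in a sequence. The function name `count_tandem_repeats` and the flanking bases in the example are hypothetical, and real STR genotyping involves considerably more than this string scan.

```python
def count_tandem_repeats(sequence, motif):
    """Return the number of repeats in the longest uninterrupted run of `motif`."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Extend the run while the motif keeps repeating back to back.
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


# Eight [TCTA] repeats, as in the DYS391 example, with made-up flanking bases.
fragment = "GG" + "TCTA" * 8 + "AC"
allele_value = count_tandem_repeats(fragment, "TCTA")
```

Under these assumptions, `allele_value` is 8, which is the value that would be recorded for the marker.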
In modern kinship analyses, the YSTR is very useful for distinguishing lineages and providing information about lineage relationships [1]. Many areas of study, including genetic genealogy, forensic genetics, anthropological genetics, and medical genetics, have taken advantage of the YSTR method. For example, it has been used to trace a similar group of Ysurname projects to support traditional genealogical studies, e.g., [2–4]. Further, in forensic genetics, the YSTR is one of the primary concerns in human identification for sexual assault cases [5], paternity testing [6], missing persons [7], human migration patterns [8], and the reexamination of ancient cases [9].
From a clustering perspective, the goal of partitioning YSTR data is to group a set of YSTR objects into clusters that represent similar genetic distances. The genetic distance of two YSTR objects is based on the mismatch results from comparing the YSTR objects and their modal haplotypes. For Ysurname applications, two people who share 0, 1, 2, or 3 allele value mismatches across the compared markers are considered to be the most familially related. For Yhaplogroup applications, the number of mismatches varies and is greater than that typically found in Ysurname applications. This is because the haplogroup application is based on larger family groups branched out from the same ancestor, covering certain geographical areas and ethnicities throughout the world. The established YDNA haplogroups, named by the letters A to T with further subdivisions using numbers and lowercase letters, are now available for reference (see [10] and [11] for details).
Efforts to group YSTR data based on genetic distances have recently been reported. For example, Schlecht et al. [12] used machine learning techniques to classify YSTR fragments into related groups. Furthermore, Seman et al. [13–19] used partitional clustering techniques to group YSTR data by the number of repeats, a method used in genetic genealogy applications. In this study, we continue efforts to partition the YSTR data based on the partitional clustering approaches carried out in [13–19]. Recently, we have also evaluated eight partitional clustering algorithms over six YSTR datasets [19]. As a result, we found that there is scope to propose a new partitioning algorithm to improve the overall clustering results for the same datasets.
A new partitioning algorithm is required to handle the characteristics of YSTR data and thus produce better clustering results. YSTR data differ somewhat from the common categorical data used in [20–25]: they show a higher degree of similarity among YSTR objects, both within classes (intraclass) and between classes (interclass). (Note that the degree of similarity is based on the mismatch results when comparing the objects and their modal haplotypes.) For example, many YSTR surname objects are found to be similar (zero mismatches) and almost similar (1, 2, or 3 mismatches) within their classes. In some cases, the mismatch values of interclass objects are not obviously far apart. YSTR haplogroup data contain similar, almost similar, and also quite distant objects. Occasionally, the YSTR haplogroup data may include subclasses that are sparse within their classes.
Partitional clustering algorithms
Classically, clustering has been divided into hierarchical and partitional methods. The main difference between the two is that the hierarchical method breaks the data up into hierarchical clusters, whereas the partitional method divides the data into mutually disjoint partitions. The pillar of the partitional algorithms is the kMeans algorithm [26], introduced almost four decades ago. As a consequence, the kMeans paradigm has been extended to various versions, including the kModes algorithm [25] for categorical data.
The kModes algorithm owes its existence to the ineffectiveness of the kMeans algorithm for handling categorical data. Ralambondrainy [27] attempted to rectify this using a hybrid numeric–symbolic method based on the binary characters 0 and 1. However, this approach suffered from an unacceptable computational cost, particularly when the categorical attributes had many categories. Since then, a variety of kModestype algorithms have been introduced, such as kModes with new dissimilarity measures [21, 22], kPopulation [23], and a new Fuzzy kModes [20].
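The definitions that follow refer to the objective function optimized by the kModes-type algorithms. A sketch of that objective, assuming the standard Fuzzy kModes formulation of Huang and Ng [24] (both the objective and the constraint set shown here are assumptions, not a reproduction of the authors' equation), is:

$P\left(W,Z\right)=\sum_{l=1}^{k}\sum_{i=1}^{n}w_{li}^{\alpha}\,d\left(X_{i},Z_{l}\right)$

subject to

$0\le w_{li}\le 1,\qquad \sum_{l=1}^{k}w_{li}=1,\qquad 0<\sum_{i=1}^{n}w_{li}<n,\qquad 1\le l\le k,\ 1\le i\le n$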
where:

w_{ li } is a (k × n) partition matrix that denotes the degree of membership of object i in the l th cluster that contains a value of 0 to 1,

k (≤ n) is a known number of clusters,

Z is the centroid such that [Z_{ 1 }, Z_{ 2 },…,Z_{ k }] ∈ R^{ mk },

α ∈ [1, ∞) is a weighting exponent,

d(X_{i}, Z_{l}) is the distance measure between the object X_{i} and the centroid Z_{l}, as described in Eqs. (2) and (2a), with m the number of attributes.
$d\left(x,z\right)=\sum_{j=1}^{m}\delta\left(x_{j},z_{j}\right)$ (2)
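The distance in Eq. (2) can be sketched in a few lines, assuming that δ is the usual simple matching indicator (0 when two attribute values agree, 1 when they differ); that form of δ is an assumption, since Eq. (2a) is not shown here. The function name and the allele values below are hypothetical.

```python
def simple_matching_distance(x, z):
    """Count the attributes on which object x and centroid z disagree."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same number of attributes")
    # delta(x_j, z_j) is taken to be 1 for a mismatch and 0 for a match.
    return sum(1 for xj, zj in zip(x, z) if xj != zj)


# Hypothetical allele values for four markers.
haplotype = [13, 24, 14, 11]
centroid = [13, 23, 14, 10]
mismatches = simple_matching_distance(haplotype, centroid)
```

Here the two vectors disagree on the second and fourth markers, so `mismatches` is 2.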
Huang and Ng [24] described the optimization process of P_{ 1 } and P_{ 2 } as follows:

Problem P_{1}: Fix Z = $\widehat{Z}$ and solve the reduced problem P(W, $\widehat{Z}$) as in Eq. (3). This step yields the minimizing values, each between 0 and 1, of the partition matrix w_{li}.
$w_{li}=\begin{cases}1, & \text{if } X_{i}=\widehat{Z}_{l}\\ 0, & \text{if } X_{i}=\widehat{Z}_{h},\ h\ne l\\ \dfrac{1}{\sum_{h=1}^{k}\left[\dfrac{d\left(X_{i},\widehat{Z}_{l}\right)}{d\left(X_{i},\widehat{Z}_{h}\right)}\right]^{1/\left(\alpha-1\right)}}, & \text{if } X_{i}\ne\widehat{Z}_{l}\ \text{and}\ X_{i}\ne\widehat{Z}_{h},\ 1\le h\le k\end{cases}$ (3)
and α ∈ [1, ∞) is a weighting exponent.
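The membership update of Eq. (3) can be sketched for a single object, given its distances to the k centroids. The function name is hypothetical, α is required to be strictly greater than 1 so that the exponent 1/(α − 1) is defined, and sending an object that coincides with several identical centroids to the first of them is an assumption made only for this illustration.

```python
def fuzzy_memberships(distances, alpha):
    """Return the k membership values of one object, following Eq. (3)."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    zero = [l for l, d in enumerate(distances) if d == 0]
    if zero:
        # The object coincides with a centroid: membership 1 there, 0 elsewhere.
        return [1.0 if l == zero[0] else 0.0 for l in range(len(distances))]
    exponent = 1.0 / (alpha - 1.0)
    return [
        1.0 / sum((d_l / d_h) ** exponent for d_h in distances)
        for d_l in distances
    ]


# One object at distance 1 from the first centroid and 3 from the second.
w = fuzzy_memberships([1, 3], alpha=1.5)
```

With these inputs the memberships are approximately 0.9 and 0.1, and they sum to 1.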
Problem of partitioning YSTR data
Due to the characteristics of YSTR data, there are two optimization problems for existing partitional algorithms: nonunique centroids and local minima problems. These two problems are caused by the drawback of the modes mechanism of determining the centroids. Nonunique centroids would result in empty clusters, whereas the local minima problem leads to poorer clustering results. Both problems are a result of the obtained centroids, which are not sufficient to represent their classes.
Therefore, problems will occur for the following two cases:
i) The total number of objects in a dataset is small while the number of classes is large. To illustrate this case, consider the following example.
As a result, the mode that consists of [a_{ 1 }, a_{ 2 }, c_{ 3 }] would be obtained twice. Thus, P_{ 2 } would not be minimized due to this nonunique centroid. Another possibility is that the two modes are different, but are not distinctive enough to represent their clusters, such as modes [a_{ 1 }, a_{ 2 }, a_{ 3 }] or [a_{ 1 }, a_{ 2 }, b_{ 3 }] for Cluster 2. As a consequence, this case would fall into a local minima problem.
ii) An extreme distribution of objects in a class. To illustrate this case, consider the following example.
As a result, object A becomes dominant in both clusters, and so the obtained modes might be represented solely by objects in Class A, e.g., [a_{ 1 }, a_{ 2 }, a_{ 3 }] and [a_{ 1 }, a_{ 2 }, b_{ 3 }].
The above situations cause P not to be fully optimized, thus producing poor clustering results. Therefore, a new algorithm with a new concept of P_{ 2 } is proposed in order to overcome these problems and improve the clustering accuracy results of YSTR data.
Methods
The center of a cluster
where m is the number of markers.
Table 1. Example of dominant objects

| Object | Membership value in c_{1} | Membership value in c_{2} | Probability of being the dominant object in c_{1} | Probability of being the dominant object in c_{2} |
| --- | --- | --- | --- | --- |
| x_{1} | 0.7 | 0.3 | 100% (1.0) | 50% (0.5) |
| x_{2} | 0.4 | 0.6 | 50% (0.5) | 100% (1.0) |
| x_{3} | 0.6 | 0.4 | 100% (1.0) | 50% (0.5) |
| x_{4} | 0.3 | 0.7 | 50% (0.5) | 100% (1.0) |
The kAMH algorithm

W_{li}^{α} is a (k × n) partition matrix that denotes the degree of membership of YSTR object i in the l th cluster, taking a value from 0 to 1 as described in Eq. (8), subject to Eqs. (8a) and (8b).
$W_{li}^{\alpha}=\left\{\begin{array}{ll}1, & \text{if } X_{i}=H_{l}\\ 0, & \text{if } X_{i}=H_{z},\ z\ne l\\ \dfrac{1}{\sum_{z=1}^{k}\left[\dfrac{d_{ystr}\left(X_{i},H_{l}\right)}{d_{ystr}\left(X_{i},H_{z}\right)}\right]^{1/\left(\alpha-1\right)}}, & \text{if } X_{i}\ne H_{l}\ \text{and}\ X_{i}\ne H_{z},\ 1\le z\le k\end{array}\right\}^{\alpha}$ (8)

subject to:
$w_{li}^{\alpha}\in\left[0,1\right],\quad 1\le i\le n,\quad 1\le l\le k$ (8a)
where,

k (≤ n) is a known number of clusters.

H is the Approximate Modal Haplotype (centroid) such that [H_{ 1 }, H_{ 2 },…,H_{ k }] ∈ X.

α ∈ [1, ∞) is a weighting exponent used to increase the precision of the membership degrees. Note that α is typically chosen between 1.1 and 2.0, as introduced by Huang and Ng [24].

d_{ystr}(X_{i}, H_{l}) is the distance measure between the YSTR object X_{i} and the Approximate Modal Haplotype H_{l}, as described in Eq. (5) and subject to Eq. (5a).

D_{li} is another (k × n) partition matrix which contains a dominant weighting value of 1.0 or 0.5, as explained above (see Table 1). The dominant weighting values are based on the value of W_{li}^{α} above. D_{li} is described in Eq. (9), subject to Eqs. (9a), (9b), and (9c).
$D_{li}=\begin{cases}1.0, & \text{if } w_{li}^{\alpha}=\max_{1\le l\le k} w_{li}^{\alpha}\\ 0.5, & \text{otherwise}\end{cases}$ (9)
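The dominant weighting of Eq. (9) can be sketched as follows, using the membership values of Table 1 as input. The function name `dominance_weights` is hypothetical, and giving every tied maximum a weight of 1.0 is an assumption about how ties are handled.

```python
def dominance_weights(memberships):
    """Map a k x n membership matrix to the k x n dominance matrix of Eq. (9)."""
    k = len(memberships)
    n = len(memberships[0])
    weights = [[0.5] * n for _ in range(k)]
    for i in range(n):
        top = max(memberships[l][i] for l in range(k))
        for l in range(k):
            # An object is dominant (1.0) in the cluster where its membership peaks.
            if memberships[l][i] == top:
                weights[l][i] = 1.0
    return weights


# Rows are clusters c1 and c2; columns are objects x1 to x4 (values from Table 1).
W = [[0.7, 0.4, 0.6, 0.3],
     [0.3, 0.6, 0.4, 0.7]]
D = dominance_weights(W)
```

The result reproduces the last two columns of Table 1: x1 and x3 are dominant in c1, and x2 and x4 are dominant in c2.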
i. The objects (the data themselves) are used as the centroids instead of modes. Since the distance of YSTR objects is measured by comparing the objects with their modal haplotypes, we need to find, approximately, the objects that can represent the modal haplotypes. To find the final Approximate Modal Haplotype for a particular group (cluster), each object is tested one by one and accepted as a replacement whenever it increases the cost function.
ii. The cost function is maximized, rather than minimized as in the kModes-type algorithms.
A detailed description of the kAMH algorithm is given below.
Step 1 – Select k initial objects randomly as Approximate Modal Haplotype (centroids). E.g. if k = 4, then choose randomly 4 objects as the initial Approximate Modal Haplotype.
Step 2 – Calculate the distance d_{ystr}(X_{i}, H_{l}) according to Eq. (5), subject to Eq. (5a).
Step 3 – Calculate the partition matrix w_{li}^{α} according to Eq. (8), subject to Eqs. (8a) and (8b). Note that w_{li}^{α} is based on the distance calculated in Step 2.
Step 4 – Assign a dominant weighting of 1.0 or 0.5 in the partition matrix D_{li} according to Eqs. (9), (9a), (9b), and (9c).
Step 5 – Calculate the cost function P(Á) based on W_{li}^{α}D_{li} according to Eqs. (7) and (7a).
Step 6 – Test each current Approximate Modal Haplotype against the other objects one by one. If the current cost function is greater than the previous cost function according to Eq. (6), replace the Approximate Modal Haplotype with the tested object.
Step 7 – Repeat Steps 2 to 6 for each object X_{i} and each Approximate Modal Haplotype H_{l}.
Furthermore, the implementation of the steps above of the algorithm is formalized in the form of pseudocode as follows.
INPUT: Dataset X, the number of clusters k, the number of dimensions d, and the fuzziness index α
Select H_{l} randomly from X such that 1 ≤ l ≤ k
for each Approximate Modal Haplotype H_{l} do
    for each X_{i} do
        Calculate P(À) = ∑_{l=1}^{k} ∑_{i=1}^{n} À_{li}
        if P(À) is maximized then
            Replace H_{l} by X_{i}
        end if
    end for
end for
Assign X_{i} to C_{l} for all l, 1 ≤ l ≤ k; 1 ≤ i ≤ n as Eq. (10)
Output results
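The pseudocode can be turned into a small runnable sketch under several stated assumptions: the distance d_ystr is taken to be the number of mismatching markers, the cost is taken to be the sum over all l and i of w_{li}^{α} multiplied by D_{li}, a replacement is accepted only when it strictly increases that cost, and each object is finally assigned to the cluster with its largest membership. The names `mismatches`, `cost`, and `k_amh`, the default α of 1.5, and the toy data are hypothetical and are not the authors' implementation.

```python
import random


def mismatches(x, h):
    """Assumed form of d_ystr: number of markers whose allele values differ."""
    return sum(1 for a, b in zip(x, h) if a != b)


def cost(X, H, alpha):
    """Return (P, W): the summed cost and the k x n matrix of w_li ** alpha."""
    k, n = len(H), len(X)
    exponent = 1.0 / (alpha - 1.0)
    W = [[0.0] * n for _ in range(k)]
    total = 0.0
    for i, x in enumerate(X):
        d = [mismatches(x, h) for h in H]
        if 0 in d:
            # Object coincides with a centroid: full membership there
            # (identical centroids are resolved in favour of the first).
            W[d.index(0)][i] = 1.0
        else:
            for l in range(k):
                s = sum((d[l] / d[z]) ** exponent for z in range(k))
                W[l][i] = (1.0 / s) ** alpha
        top = max(W[l][i] for l in range(k))
        for l in range(k):
            dominance = 1.0 if W[l][i] == top else 0.5  # Eq. (9)
            total += W[l][i] * dominance
    return total, W


def k_amh(X, k, alpha=1.5, seed=0):
    """Greedy search for k Approximate Modal Haplotypes among the objects."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(X)), k)  # random initial objects
    best, W = cost(X, [X[c] for c in centers], alpha)
    for l in range(k):                      # outer loop over centroids
        for i in range(len(X)):             # inner loop over candidate objects
            if i in centers:
                continue
            trial = centers[:l] + [i] + centers[l + 1:]
            p, w = cost(X, [X[c] for c in trial], alpha)
            if p > best:                    # keep only improving replacements
                centers, best, W = trial, p, w
    labels = [max(range(k), key=lambda l: W[l][i]) for i in range(len(X))]
    return centers, labels, best


# Toy data: two well-separated groups of three haplotypes with three markers.
X = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (7, 7, 7), (7, 7, 8), (7, 8, 7)]
centers, labels, best = k_amh(X, k=2)
```

On this toy input the search ends with objects 0 and 3 as the two Approximate Modal Haplotypes, each being the member closest to the rest of its group, and the labels separate the first three objects from the last three.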
Optimization of the problem P
In optimizing the problem P, the kAMH algorithm uses a maximization process instead of the minimization process imposed by the kModetype algorithms. This process is formalized in the kAMH algorithm as follows.
Step 1 – Choose an Approximate Modal Haplotype, H^{(t)} ∈ X. Calculate P(Á). Set t = 1.
Step 2 – Choose X^{(t+1)} such that P(Á)^{t+1} is maximized. Replace H^{(t)} by X^{(t+1)}.
Step 3 – Set t = t + 1. Stop when t = n; otherwise go to Step 2.
*Note: n is the number of objects.
The convergence of the algorithm is proven as P_{ 1 } and P_{ 2 } are maximized accordingly. The function P(Á) incorporates the P(W, H) function imposed by the Fuzzy kModes algorithm, where W is a partition matrix and H is the approximate modal haplotype that defines the center of a cluster. Thus, P_{ 1 } and P_{ 2 } are solved by Theorems 1 and 2, respectively.
Proof
Let X = {X_{1}, X_{2}, …, X_{n}} be a set of n YSTR categorical objects and H = {H_{1}, H_{2}, …, H_{k}} be a set of centroids (Approximate Modal Haplotypes) for k clusters. Suppose that P = {P_{1}, P_{2}, …, P_{k}} is a set of dissimilarity measures based on d_{ystr}(X_{i}, H_{l}), as described in Eq. (5) and subject to Eq. (5a), for all i and l with 1 ≤ i ≤ n and 1 ≤ l ≤ k.
For any P obtained from d_{ystr}(X_{i}, H_{l}): where X_{i} = H_{l}, w_{li}^{α} takes its maximum value of 1, and where X_{i} = H_{z} with z ≠ l, the value of w_{li}^{α} is 0. Therefore, because H_{l} is fixed, w_{li}^{α} is maximized.
where z ≠ l
Thus, $\sum_{z=1}^{k}\left[\frac{P_{li}}{P_{zi}}\right]^{1/\left(\alpha-1\right)}<\sum_{z=1}^{k}\left[\frac{P_{ti}}{P_{zi}}\right]^{1/\left(\alpha-1\right)}$, where t ≠ l.
Therefore, based on definitions 1 and 2, w_{li}^{α} is maximal. Because $\widehat{H}$ is fixed, $P\left(W,\widehat{H}\right)$ is maximized.
Proof
Because w_{li}^{α} and D_{li} are nonnegative, the product W_{li}^{α}D_{li} must be maximal. It follows that the sum of all quantities ∑_{l=1}^{k} ∑_{i=1}^{n} Á_{li} is also maximal. Hence, the result follows.
YSTR Datasets
The YSTR data were mostly obtained from a database called worldfamilies.net [30]. The first, second, and third datasets represent YSTR data for haplogroup applications, whereas the fourth, fifth, and sixth datasets represent YSTR data for Ysurname applications. All datasets were filtered for standardization on 25 similar attributes (25 markers). The chosen markers include DYS393, DYS390, DYS19 (394), DYS391, DYS385a, DYS385b, DYS426, DYS388, DYS439, DYS389I, DYS392, DYS389II, DYS458, DYS459a, DYS459b, DYS455, DYS454, DYS447, DYS437, DYS448, DYS449, DYS464a, DYS464b, DYS464c, and DYS464d. These markers are more than sufficient for determining a genetic connection between two people. According to Fitzpatrick [31], 12 markers (YDNA12 test) are already sufficient to determine who does or does not have a relationship to the core group of a family.
 1) The first dataset consists of 751 objects of the YSTR haplogroup belonging to the Ireland yDNA project [32]. The data contain only 5 haplogroups, namely E (24), G (20), L (200), J (32), and R (475). Thus, k = 5.
 2) The second dataset consists of 267 objects of the YSTR haplogroup obtained from the Finland DNA Project [33]. The data are composed of only 4 haplogroups: L (92), J (6), N (141), and R (28). Thus, k = 4.
 3) The third dataset consists of 263 objects obtained from the Yhaplogroup project [34]. The data contain Groups G (37), N (68), and T (158). Thus, k = 3.
 4) The fourth dataset consists of 236 objects combining four surnames: Donald [35], Flannery [36], Mumma [37], and Williams [38]. Thus, k = 4.
 5) The fifth dataset consists of 112 objects belonging to the Phillips DNA Project [39]. The data consist of eight family groups: Group 2 (30), Group 4 (8), Group 5 (10), Group 8 (18), Group 10 (17), Group 16 (10), Group 17 (12), and Group 29 (7). Thus, k = 8.
 6) The sixth dataset consists of 112 objects belonging to the Brown Surname Project [40]. The data consist of 14 family groups: Group 2 (9), Group 10 (17), Group 15 (6), Group 18 (6), Group 20 (7), Group 23 (8), Group 26 (8), Group 28 (8), Group 34 (7), Group 44 (6), Group 35 (7), Group 46 (7), Group 49 (10), and Group 91 (6). Thus, k = 14.
The values in parentheses indicate the number of objects belonging to that particular group. Datasets 1–3 represent YSTR haplogroups and datasets 4–6 represent YSTR surnames.
Results and discussion
The following results compare the performance of the kAMH algorithm with eight other partitional algorithms: the kModes algorithm [25], kModes with RVF [21, 22, 41], kModes with UAVM [21], kModes with Hybrid 1 [21], kModes with Hybrid 2 [21], the Fuzzy kModes algorithm [24], the kPopulation algorithm [23], and the New Fuzzy kModes algorithm [20].
where k is the number of clusters, a_{ i } is the number of instances occurring in both cluster i and its corresponding haplogroup or surname, and n is the number of instances in the dataset.
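A clustering accuracy score of this kind can be sketched as the fraction of instances that fall in a cluster together with that cluster's corresponding class. Two assumptions are made here: the score is taken to be the sum of the a_{i} divided by n, and each cluster is paired with its majority class, which is a simplification because two clusters may then map to the same class. The function name and the labels are hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_labels, true_labels):
    """Return sum(a_i) / n, pairing each cluster with its majority class."""
    n = len(true_labels)
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    # a_i: instances in cluster i that belong to its majority class.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / n


# Six hypothetical objects: cluster 0 holds two R and one L, cluster 1 holds three L.
clusters = [0, 0, 0, 1, 1, 1]
haplogroups = ["R", "R", "L", "L", "L", "L"]
accuracy = clustering_accuracy(clusters, haplogroups)
```

In this example five of the six objects sit with their cluster's majority haplogroup, giving an accuracy of 5/6.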
Clustering performance
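Clustering accuracy scores for all datasets

| Algorithm | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
| --- | --- | --- | --- | --- | --- | --- |
| kModes | 0.70 | 0.79 | 0.84 | 0.84 | 0.74 | 0.62 |
| kModesRVF | 0.79 | 0.83 | 0.87 | 0.78 | 0.87 | 0.72 |
| kModesUAVM | 0.65 | 0.75 | 0.83 | 0.87 | 0.56 | 0.54 |
| kModesHybrid 1 | 0.67 | 0.81 | 0.85 | 0.77 | 0.80 | 0.64 |
| kModesHybrid 2 | 0.56 | 0.82 | 0.83 | 0.79 | 0.81 | 0.70 |
| Fuzzy kModes | 0.56 | 0.74 | 0.74 | 0.97 | 0.76 | 0.66 |
| kPopulation | 0.80 | 0.90 | 0.97 | 1.00 | 0.97 | 0.84 |
| New Fuzzy kModes | 0.71 | 0.84 | 0.77 | 1.00 | 0.77 | 0.69 |
| kAMH | 0.83 | 0.93 | 0.96 | 1.00 | 1.00 | 0.87 |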
During the experiments, the kAMH algorithm did not encounter any difficulties. However, the Fuzzy kModes and the New Fuzzy kModes algorithms faced problems with datasets 1, 5, and 6. For dataset 1, the problem was caused by the extreme number of objects in Class R (475), which covered about 63% of the total objects. Further, for datasets 5 and 6, the problem was caused by many similar objects in a larger number of classes. In particular, both algorithms faced the problem P_{ 2 } caused by the initial centroid selections. Note also that the results for both algorithms were based on the diverse method, an initial centroid selection proposed by Huang [25].
Clustering accuracy scores for all YSTR datasets

| Algorithm | N | Mean | Std. Dev. | 95% CI for mean, lower bound | 95% CI for mean, upper bound | Min | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| kMode | 600 | 0.76 | 0.13 | 0.75 | 0.77 | 0.45 | 1.00 |
| kModeRVF | 600 | 0.81 | 0.11 | 0.80 | 0.82 | 0.56 | 1.00 |
| kModeUAVM | 600 | 0.70 | 0.17 | 0.69 | 0.71 | 0.38 | 1.00 |
| kModeHybrid 1 | 600 | 0.76 | 0.13 | 0.75 | 0.77 | 0.38 | 1.00 |
| kModeHybrid 2 | 600 | 0.75 | 0.14 | 0.74 | 0.76 | 0.45 | 1.00 |
| Fuzzy kMode | 600 | 0.74 | 0.16 | 0.73 | 0.75 | 0.32 | 1.00 |
| kPopulation | 600 | 0.91 | 0.09 | 0.91 | 0.92 | 0.59 | 1.00 |
| New Fuzzy kMode | 600 | 0.80 | 0.13 | 0.79 | 0.81 | 0.44 | 1.00 |
| kAMH | 600 | 0.93 | 0.07 | 0.93 | 0.94 | 0.79 | 1.00 |
Multiple comparisons for the kAMH algorithm (dependent variable: accuracy; Games–Howell test)

| (I) Algorithm | (J) Algorithm | Mean diff. (I − J) | Std. error | p value | 95% CI lower bound | 95% CI upper bound |
| --- | --- | --- | --- | --- | --- | --- |
| kAMH | kMode | 0.17^{*} | 0.01 | < 0.00001 | 0.16 | 0.19 |
| kAMH | kModeRVF | 0.12^{*} | 0.01 | < 0.00001 | 0.11 | 0.14 |
| kAMH | kModeUAVM | 0.23^{*} | 0.01 | < 0.00001 | 0.21 | 0.25 |
| kAMH | kModeHybrid 1 | 0.17^{*} | 0.01 | < 0.00001 | 0.16 | 0.19 |
| kAMH | kModeHybrid 2 | 0.18^{*} | 0.01 | < 0.00001 | 0.16 | 0.20 |
| kAMH | Fuzzy kMode | 0.19^{*} | 0.01 | < 0.00001 | 0.17 | 0.21 |
| kAMH | kPopulation | 0.02^{*} | 0.00 | 0.00271 | 0.01 | 0.03 |
| kAMH | New Fuzzy kModes | 0.13^{*} | 0.01 | < 0.00001 | 0.12 | 0.15 |
Efficiency
We now consider the time efficiency of the kAMH algorithm. The computational cost of the algorithm depends on the nested loop of k(n − k) iterations, where k is the number of clusters and n is the number of objects, required to obtain the cost function P(À). The function P(À) involves the number of attributes m in calculating the distances and the membership values for its partition matrix w_{li}. Thus, the overall time complexity is O(km(n − k)). However, the time efficiency of the kAMH algorithm will not reach O(n^{2}) because the value of k in the outer loop will not become equivalent to the value of n − k in the inner loop. See the pseudocode for a detailed implementation of these loops.
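As a rough illustration of the loop count, assuming one evaluation of the cost function per candidate replacement, the number of iterations k(n − k) for the largest dataset (dataset 1, n = 751, k = 5) and for the dataset with the most clusters (dataset 6, n = 112, k = 14) would be:

$k\left(n-k\right)=5\left(751-5\right)=3730\qquad\text{and}\qquad k\left(n-k\right)=14\left(112-14\right)=1372$

Both counts are far below the corresponding values of n^{2} (564,001 and 12,544), which is consistent with k remaining small relative to n in these datasets.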
Conclusions
Our experimental results indicate that the performance of the proposed kAMH algorithm for partitioning YSTR data was significantly better than that of the other algorithms. Our algorithm handled all problems, as described previously, and was not too sensitive to P_{ 0 }, the initial centroid selection, even though the datasets contained a lot of similar objects. Moreover, the concept of P_{ 2 } in using the object (the data itself) as the approximate center of a cluster has significantly improved the overall performance of the algorithm. In fact, our algorithm is the most consistent of those tested because the difference between the minimum and maximum scores is smaller. The kAMH algorithm always produces the highest minimum score for each dataset. In conclusion, the kAMH algorithm is an efficient method of partitioning YSTR categorical data.
Declarations
Acknowledgements
This research is supported by the Fundamental Research Grant Scheme, Ministry of Higher Education Malaysia. We would like to thank RMI, UiTM for their support for this research. We extend our gratitude to the many contributors toward the completion of this paper, including Prof. Dr. Daud Mohamed, En. Azizian Mohd Sapawi, Puan Nuru'l'Izzah Othman, Puan Ida Rosmini, and our research assistants: Syahrul, Azhari, Kamal, Hasmarina, Nurin, Soleha, Mastura, Fadzila, Suhaida, and Shukriah.
Authors’ Affiliations
References
 1. Kayser M, Kittler R, Erler A, Hedman M, Lee AC, Mohyuddin A, Mehdi SQ, Rosser Z, Stoneking M, Jobling MA, Sajantila A, Tyler-Smith C: A comprehensive survey of human Ychromosomal microsatellites. Am J Hum Genet. 2004, 74 (6): 1183-1197. 10.1086/421531.
 2. Perego UA, Turner A, Ekins JE, Woodward SR: The science of molecular genealogy. National Genealogical Society Quarterly. 2005, 93 (4): 245-259.
 3. Perego UA: The power of DNA: Discovering lost and hidden relationships. 2005, Oslo: World Library and Information Congress: 71st IFLA General Conference and Council Oslo.
 4. Hutchison LAD, Myres NM, Woodward S: Growing the family tree: The power of DNA in reconstructing family relationships. Proceedings of the First Symposium on Bioinformatics and Biotechnology (BIOT04). 2004, 1: 42-49.
 5. Dekairelle AF, Hoste B: Application of a YSTRpentaplex PCR (DYS19, DYS389I and II, DYS390 and DYS393) to sexual assault cases. Forensic Sci Int. 2001, 118: 122-125. 10.1016/S0379-0738(00)00481-3.
 6. Rolf B, Keil W, Brinkmann B, Roewer L, Fimmers R: Paternity testing using YSTR haplotypes: Assigning a probability for paternity in cases of mutations. Int J Legal Med. 2001, 115: 12-15. 10.1007/s004140000201.
 7. Dettlaff-Kakol A, Pawlowski R: First polish DNA “manhunt” - an application of Ychromosome STRs. Int J Legal Med. 2002, 116: 289-291.
 8. Stix G: Traces of the distant past. Sci Am. 2008, 299: 56-63.
 9. Gerstenberger J, Hummel S, Schultes T, Häck B, Herrmann B: Reconstruction of a historical genealogy by means of STR analysis and Yhaplotyping of ancient DNA. Eur J Hum Genet. 1999, 7: 469-477. 10.1038/sj.ejhg.5200322.
 10. International Society of Genetic Genealogy. http://www.isogg.org
 11. The Y Chromosome Consortium. http://ycc.biosci.arizona.edu
 12. Schlecht J, Kaplan ME, Barnard K, Karafet T, Hammer MF, Merchant NC: Machinelearning approaches for classifying haplogroup from Y chromosome STR data. PLoS Comput Biol. 2008, 4 (6): e1000093. 10.1371/journal.pcbi.1000093.
 13. Seman A, Abu Bakar Z, Mohd Sapawi A: Centrebased clustering for YShort Tandem Repeats (YSTR) as Numerical and Categorical data. Proc. 2010 Int. Conf. on Information Retrieval and Knowledge Management (CAMP’10). 2010, 1: 28-33. Shah Alam, Malaysia.
 14. Seman A, Abu Bakar Z, Mohd Sapawi A: CentreBased Hard and Soft Clustering Approaches for YSTR Data. Journal of Genetic Genealogy. 2010, 6 (1): 1-9. Available online: http://www.jogg.info
 15. Seman A, Abu Bakar Z, Mohd Sapawi A: Attribute Value Weighting in KModes Clustering for YShort Tandem Repeats (YSTR) Surname. Proc. of Int. Symposium on Information Technology 2010 (ITsim’10). 2010, 3: 1531-1536. Kuala Lumpur, Malaysia.
 16. Seman A, Abu Bakar Z, Mohd Sapawi A: Hard and Soft Updating Centroids for Clustering YShort Tandem Repeats (YSTR) Data. Proc. 2010 IEEE Conference on Open Systems (ICOS 2010). 2010, 1: 6-11. Kuala Lumpur, Malaysia.
 17. Seman A, Abu Bakar Z, Mohd Sapawi A: Modeling Centrebased Hard and Soft Clustering for Y Chromosome Short Tandem Repeats (Y‐STR) Data. Proc. 2010 International Conference on Science and Social Research (CSSR 2010). 2010, 1: 73-78. Kuala Lumpur, Malaysia.
 18. Seman A, Abu Bakar Z, Mohd Sapawi A: Centrebased Hard Clustering Algorithm for YSTR Data. Malaysia Journal of Computing. 2010, 1: 62-73.
 19. Seman A, Abu Bakar Z, Isa MN: Evaluation of kModetype Algorithms for Clustering YShort Tandem Repeats. Journal of Trends in Bioinformatics. 2012, 5 (2): 47-52. 10.3923/tb.2012.47.52.
 20. Ng M, Jing L: A new fuzzy kmodes clustering algorithm for categorical data. International Journal of Granular Computing, Rough Sets and Intelligent Systems. 2009, 1 (1): 105-119. 10.1504/IJGCRSIS.2009.026727.
 21. He Z, Xu X, Deng S: Attribute value weighting in kModes clustering. 2007, Ithaca, NY, USA: Cornell University Library, Cornell University, 1-15. Available online: http://arxiv.org/abs/cs/0701013v1
 22. Ng MK, Junjie M, Joshua L, Huang Z, He Z: On the impact of dissimilarity measure in kmodes clustering algorithm. IEEE Trans Pattern Anal Mach Intell. 2007, 29 (3): 503-507.
 23. Kim DW, Lee YK, Lee D, Lee KH: kPopulations algorithm for clustering categorical data. Pattern Recogn. 2005, 38: 1131-1134. 10.1016/j.patcog.2004.11.017.
 24. Huang Z, Ng M: A Fuzzy kModes algorithm for clustering categorical data. IEEE Trans Fuzzy Syst. 1999, 7 (4): 446-452. 10.1109/91.784206.
 25. Huang Z: Extensions to the kMeans algorithm for clustering large datasets with categorical values. Data Min Knowl Discov. 1998, 2: 283-304. 10.1023/A:1009769707641.
 26. MacQueen JB: Some methods for classification and analysis of multivariate observations. The 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 1: 281-297.
 27. Ralambondrainy H: A conceptual version of the kMeans algorithm. Pattern Recogn Lett. 1995, 16: 1147-1157. 10.1016/0167-8655(95)00075-R.
 28. Bobrowski L, Bezdek JC: cMeans clustering with the l1 and l∞ norms. IEEE Trans Syst Man Cybern. 1989, 21 (3): 545-554.
 29. Salim SZ, Ismail MA: kMeanstype algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell. 1984, 6: 81-87.
 30. WorldFamilies.net. http://www.worldfamilies.net
 31. Fitzpatrick C: Forensic genealogy. 2005, Fountain Valley, Cal.: Rice Book Press.
 32. Ireland yDNA project. http://www.familytreedna.com/public/IrelandHeritage/
 33. Finland DNA Project. http://www.familytreedna.com/public/Finland/
 34. YHaplogroup project. http://www.worldfamilies.net/yhapprojects/
 35. Clan Donald Genealogy Project. http://dnaproject.clandonaldusa.org
 36. Flannery Clan. http://www.flanneryclan.ie
 37. Doug and Joan Mumma’s Home Page. http://www.mumma.org
 38. Williams Genealogy. http://williams.genealogy.fm
 39. Phillips DNA Project. http://www.phillipsdnaproject.com
 40. Brown Genealogy Society. http://brownsociety.org
 41. San OM, Huynh V, Nakamori Y: An alternative extension of the KMeans Algorithm for clustering categorical data. IJAMCS. 2004, 14 (2): 241-247.
 42. Blake CL, Merz CJ: UCI repository of machine learning database. 1989.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.