A comparative study of the impacts of unbalanced sample sizes on the four synthesized methods of meta-analytic structural equation modeling

In the first stage of meta-analytic structural equation modeling (MASEM), researchers synthesize studies using univariate meta-analysis (UM) and multivariate meta-analysis (MM) approaches. The MM approaches are known to perform better than the UM approaches in meta-analyses with equal-sized studies. However, in real situations, where the studies might be of different sizes, the empirical performance of these approaches has yet to be studied in the first and second stages of MASEM. The present study aimed to evaluate the performance of the UM and MM methods when the primary studies have unequal sample sizes. Testing the homogeneity of correlation matrices, the empirical power, estimation of the pooled correlation matrix and estimation of the parameters of a path model were investigated for these approaches by simulation. The results of the first stage showed that the Type I error rate was well under control at the 0.05 level when the average sample sizes were 200 or more, irrespective of the method or the sample size design used. Moreover, the relative percentage biases of the pooled correlation matrices were lower than 2.5% for all methods. There was a dramatic decrease in the empirical power of all synthesis methods as the inequality of the sample sizes increased. In fitting the path model at the second stage, MM methods provided better estimation of the parameters. This study showed that the four methods differ in statistical power, especially when the sample sizes of the primary studies are highly unequal. Moreover, in fitting the path model, the MM approaches provided better estimation of the parameters.


Background
Meta-analysis (MA) is a popular statistical technique for integrating and summarizing the findings of independent studies in order to yield a more precise and reliable estimate of the effect size of interest. The dramatic growth of structural equation modeling (SEM) techniques in different sciences has drawn the attention of researchers to methods that combine the ideas of MA and SEM in synthesizing the results of several studies [1]. The term meta-analytic structural equation modeling (MASEM) refers to a set of statistical techniques used for testing hypothetical models in psychology, medicine, and management and accounting research [2][3][4]. Two stages are considered when analyzing data in MASEM. In the first stage, the correlation matrices of independent studies are combined to form a pooled correlation matrix, provided that the homogeneity hypothesis holds across studies. In the second stage, the SEM model is fitted to the pooled correlation matrix [1].
There are different methods for synthesizing correlation matrices in the first stage of MASEM. These methods are categorized as UM and MM methods. The UM methods are frequently used in applied research [5][6][7][8]. Univariate-z (UNIz) and Univariate-r (UNIr), introduced by Hedges and Olkin [9] and Hunter and Schmidt [10], are the most widely used UM techniques in MASEM research. These approaches synthesize correlation matrices among k studies by taking the weighted average of the correlations, r_i. However, one problem with these approaches is that they fail to take into account the dependencies between correlations. This can cause biased estimation of the pooled correlation matrix [11]. Given this deficiency, MM methods have been proposed and applied to provide more accurate results. Generalized least squares (GLS) and two-stage structural equation modeling (TSSEM) are the two best MM methods, introduced by Becker [12] and Cheung and Chan [13], respectively. Becker used the generalized least squares estimation method to model the dependency between correlation coefficients in the first stage. However, owing to the poor performance of this method in comparison with UMs [13][14][15], researchers recommended different modifications to improve the traditional GLS method [11,14,15]; the modified GLS approach is referred to here as MGLS. In the TSSEM approach, correlations are pooled by multiple-group SEM techniques at stage one and the pooled matrix is used for the SEM analysis in the second stage.
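The weighted averaging behind the two UM methods can be sketched for a single pair of variables. This is a minimal illustration, not the implementation used in the study: it assumes the conventional weights (n_i for raw correlations, n_i - 3 on Fisher's z scale), and the function names and the input correlations are hypothetical.

```python
import math

def pool_unir(rs, ns):
    """Sample-size-weighted mean of raw correlations (assumed weights: n_i)."""
    return sum(n * r for r, n in zip(rs, ns)) / sum(ns)

def pool_uniz(rs, ns):
    """Pool on Fisher's z scale (assumed weights: n_i - 3), then back-transform."""
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)

# Hypothetical correlations for one variable pair in k = 3 studies
rs = [0.30, 0.42, 0.35]
ns = [20, 50, 80]
r_unir = pool_unir(rs, ns)
r_uniz = pool_uniz(rs, ns)
print(round(r_unir, 4), round(r_uniz, 4))
```

Applying either function to every off-diagonal element separately yields a pooled matrix, which is exactly why the dependence between correlations within a study is ignored by these methods.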
Previous studies have shown that MM approaches perform better than UMs and provide good and relatively unbiased estimators [13][14][15][16][17]. It should be noted that in most of the previous studies, the comparison between the mentioned methods and their properties was based on equal sample sizes within each MA. However, this rarely occurs in actual practice. Since prior results showed that trial sample sizes, n, influence treatment effect estimates substantially [18], it was hypothesized that these methods would perform inadequately if a combination of very unequal-sized studies is included in an MA. Such a situation occurs frequently, especially in clinical trials and medical sciences. For example, in a sample of 22,453 meta-analyses, Davey et al. demonstrated that, in general, the sample size of individual studies varied considerably across MAs, with a median of 91, an interquartile range from 44 to 210 and a maximum of 1,242,071 individuals. They also concluded that sample sizes varied substantially across medical specialties, with the lowest and highest median sizes (61 and 154) for pathological conditions, symptoms and signs and for cancer, respectively [19].
Although several simulation studies have been carried out to compare the performance of the UM and MM approaches [11,13,17], there exists no empirical study evaluating these methods when an MA contains a mixture of very unequal sample sizes. Differences in the sample sizes of primary studies within each MA are one of the problems encountered when dealing with meta-analytical methods [20]. To the best of our knowledge, comparisons between the methods with unequal sample sizes have been evaluated only in some studies in which the variation of sample size was obtained under the specific requirements of the formula and spatial distributions [15,16,21,22]. Although the use of these uneven sample sizes for MA studies might improve the findings [22], the produced sample sizes did not differ significantly from those of the equal-sized studies.
This study aimed to assess the effect of different unequal sample size scenarios on the statistical properties of the approaches and to compare the results with those obtained for equal sample sizes.

Homogeneous studies
A simulation study was conducted to evaluate the performance of UNIr, UNIz, MGLS and TSSEM approaches in both stages under different combinations of sample sizes. In this study, a path model with four observed variables was considered as shown in Fig. 1, which was already used by the pioneer researchers [17,23].
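The general form of the model is written as:
$$Y = BY + \Gamma X + \zeta,$$
where $Y_{2\times1}$ and $X_{2\times1}$ are vectors of endogenous and exogenous variables with $B_{2\times2}$ and $\Gamma_{2\times2}$ as their coefficient matrices, respectively. The term $\zeta_{2\times1}$ is the disturbance vector with variance-covariance matrix $\Psi_{2\times2}$. This model is an over-identified model with one degree of freedom. The population covariance matrix, $\Sigma(\theta)$, which is a function of the model parameters, is given as:
$$\Sigma(\theta) = \begin{pmatrix} (I-B)^{-1}(\Gamma\Phi\Gamma' + \Psi)\left[(I-B)^{-1}\right]' & (I-B)^{-1}\Gamma\Phi \\ \Phi\Gamma'\left[(I-B)^{-1}\right]' & \Phi \end{pmatrix},$$
where $I_{2\times2}$ and $\Phi_{2\times2}$ are the identity matrix and the covariance matrix of X. If the model parameters are chosen appropriately, $\Sigma(\theta)$ can also serve as the common population correlation matrix. It was used to generate the simulated data. SEM techniques were used to estimate the parameters of the model [24].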
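To make the link between path parameters and the population matrix concrete, the following sketch evaluates the standard model-implied covariance matrix of a path model with observed variables, with blocks (I-B)^{-1}(ΓΦΓ' + Ψ)[(I-B)^{-1}]', (I-B)^{-1}ΓΦ and Φ. All parameter values are hypothetical and are not the values used in the study; they were picked so that every variance equals one, which lets the matrix double as a correlation matrix. The pattern of free paths (γ_22 fixed to zero) is an assumption based on the parameters named in the text.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def implied_covariance(B, G, Phi, Psi):
    """Model-implied covariance of (y1, y2, x1, x2) for Y = BY + GX + zeta."""
    i_minus_b = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(2)]
                 for i in range(2)]
    A = inv2(i_minus_b)
    inner = matmul(matmul(G, Phi), transpose(G))
    inner = [[inner[i][j] + Psi[i][j] for j in range(2)] for i in range(2)]
    s_yy = matmul(matmul(A, inner), transpose(A))
    s_yx = matmul(matmul(A, G), Phi)
    s_xy = transpose(s_yx)
    return [s_yy[i] + s_yx[i] for i in range(2)] + \
           [s_xy[i] + Phi[i] for i in range(2)]

# Hypothetical parameter values (illustration only)
B = [[0.0, 0.0], [0.4, 0.0]]            # beta_21
G = [[0.3, 0.2], [0.2, 0.0]]            # gamma_11, gamma_12, gamma_21
Phi = [[1.0, 0.3], [0.3, 1.0]]          # phi_12
Psi = [[0.834, 0.0], [0.0, 0.7424]]     # psi_11, psi_22 chosen to give unit variances

sigma = implied_covariance(B, G, Phi, Psi)
for row in sigma:
    print([round(v, 4) for v in row])
```

With nine free parameters (including the two variances of X) and ten distinct elements in a 4 x 4 covariance matrix, this specification leaves one degree of freedom.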

Heterogeneous studies
In order to evaluate the statistical power of the four methods for correctly rejecting the homogeneity hypothesis, another simulation study was performed in which simulated correlation matrices were classified into two homogeneous subgroups. Two fixed population matrices were used to represent between-group differences under the fixed-effects model [13]. Σ and Σ′ were used as the two population correlation matrices under the fixed-effects model in order to generate the heterogeneous studies. Heterogeneity was assessed at two levels: 20% for small heterogeneity and 50% for large heterogeneity. This implied that 20% or 50% of the correlation matrices were selected from the other population matrix. Selection of the parameters of the path model was in such a way that Σ′ was obtained as:

Sample sizes
In the MA of homogeneous and heterogeneous studies, the simulated data were based on three sample size designs: equal, moderately unequal and highly unequal sample sizes, such that the total sample size was the same across designs. First, equal numbers of subjects were assigned to each study in the MA. Second, for moderately unequal samples, the percentage of allocation of total sample sizes was set at 40 and 60% for the large and small studies, respectively. Under this design, larger studies had about 2.7 times more subjects than the small studies. Third, for highly unequal sized studies, the total sample size was assigned very unequally such that 40, 20 and 40% of the samples were selected as small, medium and large, respectively. In this case, studies with larger sample sizes had 1.6 and 4 times more subjects than the studies with medium and small sample sizes, respectively. The effects of inequality in each MA and of the number of studies (k = 5, 10 and 15) on the statistical properties of the four approaches, as well as the influence of heterogeneity on the statistical power of the four methods, were evaluated. In each simulation, a total of 1000 random samples were generated from a multivariate normal distribution with a mean vector of zero and the population matrix as the variance-covariance matrix in order to obtain simulated correlation matrices. Moreover, the value of n per study was set at 50, 100, 200, 500 and 1000 subjects. Hence, this study included 15 MAs for each of the synthesizing methods.
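One reading of the three designs that reproduces the stated ratios (about 2.7 for the moderate design; 1.6 and 4 for the highly unequal design) while keeping the total sample size fixed is sketched below. It is an assumption, not the authors' documented allocation: the 40/60 and 40/20/40 percentages are treated as shares of the k studies, and the multipliers 0.4, 0.6, 1.0 and 1.6 of the average n are hypothetical values chosen to match those ratios.

```python
def allocate_sizes(n_avg, k, design):
    """Per-study sample sizes for one meta-analysis of k studies.

    Assumed multipliers of n_avg (hypothetical, chosen to match the
    ratios quoted in the text): small 0.4 or 0.6, medium 1.0, large 1.6.
    """
    if k % 5 != 0:
        raise ValueError("this sketch only handles k divisible by 5")
    if design == "equal":
        return [n_avg] * k
    if design == "moderate":
        n_large = 2 * k // 5            # 40% of the studies are large
        n_small = k - n_large           # 60% of the studies are small
        return [6 * n_avg // 10] * n_small + [16 * n_avg // 10] * n_large
    if design == "high":
        n_small = n_large = 2 * k // 5  # 40% small, 40% large
        n_medium = k // 5               # 20% medium
        return ([4 * n_avg // 10] * n_small + [n_avg] * n_medium
                + [16 * n_avg // 10] * n_large)
    raise ValueError("design must be 'equal', 'moderate' or 'high'")

moderate = allocate_sizes(50, 5, "moderate")
high = allocate_sizes(50, 5, "high")
print(moderate, high)
```

Under these assumptions the total always equals k times the average n, so differences between designs reflect only how the same number of subjects is spread across studies.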

Estimation methods
In order to test the homogeneity of correlation matrices for the UM methods, the Bonferroni-adjusted at-least-one (BA1) approach [15] was used in the first stage. The Q_GLS and maximum likelihood (ML) methods, which have been described by Cheung and Chan [13], were used for the MGLS and TSSEM approaches, respectively. Rejection rates were calculated based on α = 0.05 in the first stage.
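The BA1 rule can be sketched as follows, under the assumption that it rejects homogeneity of the whole matrix when at least one element-wise Q test has a p-value below α divided by the number of correlations. The Q statistic shown is the usual Fisher-z version with weights n_i - 3 and k - 1 degrees of freedom; the chi-square tail probability is computed with a simple series that is adequate for illustration, not a substitute for a statistical library. All names and data are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the series expansion of the
    regularized lower incomplete gamma function (fine for moderate x)."""
    if x <= 0:
        return 1.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= y / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-y + a * math.log(y) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def q_fisher_z(rs, ns):
    """Homogeneity statistic for one correlation across k studies."""
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    return sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))

def ba1_rejects(corr_by_pair, ns, alpha=0.05):
    """corr_by_pair holds one list of k correlations per variable pair."""
    df = len(ns) - 1
    p_values = [chi2_sf(q_fisher_z(rs, ns), df) for rs in corr_by_pair]
    return min(p_values) < alpha / len(corr_by_pair)

# Hypothetical data: 6 correlations (4 variables), k = 5 studies of n = 200
ns = [200] * 5
homogeneous = [[0.3] * 5 for _ in range(6)]
heterogeneous = [[0.1, 0.1, 0.1, 0.7, 0.7]] + [[0.3] * 5 for _ in range(5)]
print(ba1_rejects(homogeneous, ns), ba1_rejects(heterogeneous, ns))
```

Because the adjusted threshold shrinks with the number of correlations, the rule keeps the matrix-level Type I error near α at the cost of some power for any single element.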
In the second stage, ML and asymptotically distribution-free (ADF) estimation methods were used for fitting the path model with the UM and MM approaches, respectively. In addition, the total sample sizes were used for the estimation of the parameters. For every parameter estimate, the relative percentage bias was defined as $\mathrm{Bias}(\hat{\theta}) = \frac{\bar{\theta} - \theta}{\theta} \times 100\%$, where $\bar{\theta}$ is the mean of the parameter estimates across the 1000 simulations and $\theta$ is the population value of the parameter.
The relative percentage bias of the standard error of each parameter estimate was used to assess the accuracy of the standard error estimates in fitting SEM. This value is defined as $\mathrm{Bias}(SE(\hat{\theta})) = \frac{\overline{SE}(\hat{\theta}) - SD(\hat{\theta})}{SD(\hat{\theta})} \times 100\%$, where $\overline{SE}(\hat{\theta})$ is the mean of the estimated standard errors and $SD(\hat{\theta})$ is the empirical standard deviation of the parameter estimates across the 1000 replications. Values of less than 5% for the parameter estimates and 10% for the standard errors were treated as acceptable bias [25]. The R software version 3.2.1 was used to perform these simulation analyses using the lavaan and metaSEM packages [26,27]. The metaSEM package runs under the OpenMx package [28].
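The two bias measures can be computed from a set of replications as in the sketch below. The function names and the three-replication inputs are hypothetical and only illustrate the arithmetic; the use of the n - 1 denominator for the empirical standard deviation is an assumption.

```python
import statistics

def relative_bias_estimate(estimates, theta):
    """Relative percentage bias of a parameter estimate."""
    return (statistics.mean(estimates) - theta) / theta * 100.0

def relative_bias_se(std_errors, estimates):
    """Relative percentage bias of the estimated standard errors, using the
    empirical SD of the estimates as the reference (n - 1 denominator assumed)."""
    sd_emp = statistics.stdev(estimates)
    return (statistics.mean(std_errors) - sd_emp) / sd_emp * 100.0

# Hypothetical output of three replications for one path coefficient
estimates = [0.29, 0.31, 0.33]
std_errors = [0.022, 0.022, 0.022]
bias_est = relative_bias_estimate(estimates, theta=0.30)
bias_se = relative_bias_se(std_errors, estimates)
print(round(bias_est, 2), round(bias_se, 2))
```

With the thresholds quoted above, an estimate would count as acceptable when the absolute value of the first quantity is below 5% and the second below 10%.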

Results of stage 1
The observed rejection percentages of the four approaches for the simulated combinations of sample sizes in the first stage are shown in Table 1. With small average sample sizes (e.g., 50 and 100), there was over-rejection of the true model in some cases by the UNIr, MGLS and TSSEM approaches. This over-rejection increased, especially for TSSEM, when the number of studies and the inequality in samples increased. However, the UNIz approach performed very well under different sample sizes. The present findings revealed that the error rates were well under control with large sample sizes (e.g., 200 and above), regardless of the method or the sample size design used for the analysis.
Table 2 shows the relative percentage biases of the correlation coefficients obtained by the four approaches at stage one. Compared with 2.5%, which is known as an acceptable criterion [29], all the methods exhibited relative biases lower than 2.5% for all sample size designs. The relative percentage biases generally decreased with increasing average sample sizes in almost all conditions. Furthermore, the findings showed that UNIr and MGLS had the same relative percentage biases in almost all conditions.
Table 3 illustrates the empirical power of the homogeneity tests under various combinations of k, n and inequality of the sample sizes within each study for 20% and 50% heterogeneity of population matrices. Broadly speaking, the power of the homogeneity tests increased in nearly all sample size design scenarios when the number of MA studies and the sample sizes within each study were increased, irrespective of the method studied. With a heterogeneity percentage equal to 20%, the power of the tests is ranked as MGLS ≥ UNIr ≥ TSSEM ≥ UNIz in all cases except for k = 5 and n = 50 with equal and moderately unequal sized studies.
Based on the results of this table, a substantial reduction in power occurred in moderately and highly unbalanced studies. Comparing moderately unequal with equal samples, average reductions of approximately 19, 17 and 13% were detected in the power of the UNIr method when the number of studies was equal to 5, 10 and 15, respectively. For the UNIz approach, the reductions were approximately 23, 32 and 27% when k was equal to 5, 10 and 15, respectively. These rates were about 17, 8 and 13% for the MGLS method for k = 5, 10 and 15. Moreover, the power of the TSSEM approach was reduced by approximately 21, 24 and 22% for the given values of k, respectively. For highly unequal sample sizes, the decrease in power relative to equal sample sizes was larger than that for moderately unequal samples for each of the four methods. There was an approximate decrease in the power of the test of 36, 24 and 25% for UNIr, 58, 50 and 47% for UNIz, 31, 17 and 22% for MGLS, and 54, 42 and 38% for TSSEM, for the same sequence of k.
When the heterogeneity of correlation matrices was 50%, the same results were observed, except for the TSSEM method, for which the power of the test was relatively higher than that of the others when the sample sizes were equal. Moreover, a smaller decrease in the average power was observed in this condition compared with 20% heterogeneity under the different unequal sample size designs. It should be noted that these results were obtained when the average sample sizes were less than 500. When the sample size was equal to or greater than 500, the power was approximately similar for all methods and no substantial reduction was observed.
The difference between the observed and expected values of the Chi square statistics increased significantly when n and k increased. The lowest and highest positive biases occurred for the UNIr method with moderately unequal samples (k = 5, n = 50) and highly unequal samples (k = 15, n = 1000), respectively. However, the test statistics of the MGLS and TSSEM approaches tended to converge to the expected means and standard deviations in almost all conditions. Furthermore, there was no dramatic difference between moderately or highly unequal and equal sample sizes for any of the approaches. Figure 2 displays the relative percentage bias of the parameter estimates for the given values of k. Figure 2a-c shows the bias values of the parameter estimates for the studies with equal, moderately unequal and highly unequal samples, respectively. Owing to space limitations, one representative parameter, γ_11, was selected to be displayed. Interested readers should refer to Additional file 1 for more details.

Results of stage 2
The results showed that the estimates of four parameters (namely γ_11, β_21, ϕ_12 and ψ_22) were unbiased for the UNIr and UNIz approaches, with values lower than 5% in all studies. Two parameters, namely γ_12 and ψ_11, were close to 5% in almost all conditions. The lowest and highest values of relative percentage bias for the last parameter, γ_21, were 11.3 and 14.2%. However, for MMs, relatively unbiased estimates were observed for all the parameters in all combinations of the number of studies, inequality in the sample sizes and n. In general, similar results were observed for the bias of the parameter estimates using the MGLS and TSSEM approaches. The relative percentage bias of the parameter estimates from these two methods was lower than 2% (the highest value was 1.97% for ψ_11 in TSSEM when k = 5, n = 50 for study samples of the same size). Relative biases were attenuated slightly towards zero when n was increased. Figure 3 compares the relative percentage biases of the standard errors (SE) of γ_11, as one of the parameters of interest, under different combinations of sample sizes (Fig. 3a-c). Additional file 2 presents the rest of the parameter estimates in more detail. Using 10% as the threshold for acceptable relative bias, the SEs of three parameters (γ_11, γ_21 and ψ_11) had relative biases larger than 10% for UMs, especially with small n. The bias values for these parameters ranged from 13 to 29%. In almost all situations, there were positive biases for a larger number of parameters (three path coefficients and the factor correlation were positively biased). Moreover, the same pattern was observed for the bias values when the average sample sizes or the number of studies were increased. However, unlike the UMs, the MMs were unbiased for almost all parameters, except one (γ_12, with the highest value being about 25% for the TSSEM method).
The relative percentage bias for these parameters ranged from 0 to 10.7%, 0 to 11.6%, and 0 to 14% in study sample sizes that were equal, moderately unequal, and highly unequal, respectively. These results showed that the MGLS and TSSEM techniques had a similar performance. In these approaches, the relative percentage biases generally decreased when n increased. Slight negative biases were observed for three path coefficients (γ_11, β_21 and γ_21), two error variances (ψ_11 and ψ_22) and the covariance of the observed X variables, ϕ_12. Generally, MMs outperformed the UMs in producing unbiased results for the parameters and their SE estimates. Relatively similar results were observed for all sample size designs.

Discussion
This study examined the effect of unbalanced sample size designs in different primary studies on the MA synthesis methods in the first and second stages of MASEM. For a number of reasons, unequal sample sizes commonly occur across studies in an MA and across centers in multicenter clinical trials [30]. This issue has not been investigated in most previous simulation studies. The present findings demonstrated that UM methods performed well in controlling the Type I error rate for the combinations of sample sizes and numbers of studies considered, except for a limited number of conditions. When the average sample sizes were lower than 200, MM methods, especially TSSEM, with moderately and highly unbalanced samples performed worse than UMs in terms of the incorrect rejection of a true null hypothesis. However, when the average sample sizes were 200 or more, both UM and MM methods were close to their nominal Type I error rates. These findings are in line with those generally reported by other researchers [13,14] and by Zhang for MM approaches [17]. These results imply that it is permissible to use any of the methods to estimate pooled correlation matrices in the first stage when there are relatively large sample sizes in the MA.
Compared with equal sample size designs, there was a decrease in the power of the UM and MM approaches for detecting heterogeneous studies when the same total sample size was assigned unequally. It is worth mentioning that, compared with moderately unequal sample sizes, high inequality had more adverse effects on the power of the homogeneity tests. Although the TSSEM approach provided a good balance between Type I error control and statistical power in the equal sample size design in this study and in other published studies [13,17], the present findings showed the relatively poor performance of this method for unequal sample sizes, especially for n lower than 200 with highly unequal sample sizes. The results of this study showed that TSSEM had the highest power for rejecting the incorrect null hypothesis only when there was high heterogeneity in the correlation matrices and the inequality of the sample sizes was negligible. Moreover, these results did not reveal the superiority of the TSSEM method over the other methods because there was inflation of the Type I error rates at the same points. However, the MGLS method had high power for detecting heterogeneous correlation matrices regardless of the sample sizes and inequality used in the simulations. This result is in agreement with previous studies that reported the good performance of the MGLS approach [15,17,31].
Given that small studies may be more heterogeneous than larger ones [32], the heterogeneous correlation matrices were allocated to the small simulated studies. In addition, other studies were also considered as heterogeneous cases. Based on the present findings, MGLS and UNIr are more stable than the UNIz and TSSEM methods even if the larger studies are selected as heterogeneous. In general, of the four tests of heterogeneity, the MGLS and UNIr approaches have higher statistical power in detecting heterogeneous studies than the two other methods. These findings are inconsistent with those of Cheung and Chan, who reported the superiority of the TSSEM and unmodified GLS procedures over the UM approaches [13].
The performance of the UNIr and UNIz methods in terms of the Chi square test statistics for fitting SEM was poor compared with the MGLS and TSSEM approaches at the second stage. As shown by previous studies [11,13], this test statistic did not perform well for UM approaches because it was affected by many factors, such as sample size [13]. In addition, when the number of studies increased, the Type I error rate related to the model fit exceeded the nominal level. Generally, final decisions in SEM analyses cannot be made solely on the basis of the Chi square test, and many researchers have recommended utilizing a range of other goodness-of-fit indices to assess model fit [33]. Bollen demonstrated that the means of the sampling distributions of the Tucker-Lewis index (TLI) and the incremental fit index (IFI) were relatively unaffected by the sample size [34]. In the current study, the performance of some fit indices such as TLI and IFI was also assessed, but details of the results are not presented here. The results indicated good fit with negligible differences between the MM and UM methods. Further studies are required to assess the performance of the approaches for combining correlation matrices in more complex models when fitting SEM at the second stage.
Based on the relative percentage bias of the parameter estimates and their SEs in the second stage, the present findings showed that MM approaches outperformed the UM approaches in almost all conditions. MMs produced less biased estimates of the parameters and the SEs than UMs. These findings are consistent with those of Cheung and Chan [13] and Furlow et al. [11], who reported good performance of MM approaches in estimating the parameters and their SEs. It should be pointed out that the number of studies (k) included in the MA did not affect the estimation of the pooled correlation matrix in the first stage [35] or the biases of the parameters and the SE estimates in the second stage [11,13,15]. This is also true when considering the impact of unequal sample sizes in MA studies. However, when the total sample sizes increased, the biases of the parameter estimates decreased and there was also a reduction in the magnitude of the SEs, although with a fluctuating pattern.
In the second stage of UM approaches, researchers choose different sample sizes, including the arithmetic mean, a weighted mean or the total sample size. In the current study, based on the rule presented by Bollen, the total sample size was used to reduce the adverse effect of the sample sizes on the SE of the parameters [36]. Nevertheless, UM approaches failed to yield satisfactory results. In general, using MM approaches for fitting the SEM model in the second stage avoided the problems encountered using UM approaches, such as over-rejection of the Chi square test, the goodness of fit indices, the power of homogeneity tests, and the relative biases of the standard errors of parameters [13]. Moreover, since it was difficult to determine the appropriate sample size in this stage for UM approaches, it seemed that MM approaches would be better choices for the analysis of MASEM in the second stage. However, owing to their popularity and ease of use, many researchers still use UM approaches for the analysis of synthesized correlation matrices. UM approaches perform well in controlling Type I error rates. Moreover, the relative percentage bias of the pooled correlation matrices is very good in the first stage, even under small or substantially unequal sample sizes. So it seems that, based on the current and other studies [13,16,23], there is no difficulty for applied researchers in using UMs to estimate pooled correlation matrices.
The present study had two main limitations which should be noted. First, comparison of the MA approaches with unbalanced sample sizes was performed under the fixed-effects model. In this model, the effect sizes of all studies in the MA are limited to one population effect size and the generalization of the results to the main population is not possible [21]. However, many applied researchers use fixed-effects models in MASEM studies [11]. Second, the estimation of the pooled correlation matrix was based on complete data with no missing variables in this simulation study. Cheung and Chan pointed out that when more studies are included in MASEM, it is more likely that there will be missing variables and heterogeneous correlation matrices in the MA studies [13]. In the present study, 15 was the largest number of studies considered in the simulation, with no missing variables. Further studies are necessary to assess larger numbers of MA studies using random-effects models with missing correlations in the first and second stages of MASEM.

Conclusion
In summary, MGLS was the most appealing approach in terms of Type I error rate, detection of heterogeneous studies and precision of parameter estimates under equal and unequal sample size designs. For large and balanced sample sizes, TSSEM can be applied not only in combining the correlation matrices, but also in estimating the parameters in the second stage. However, it is recommended that the UNIr and UNIz methods be used only for synthesizing the correlation matrices in the first stage.