Nonparametric Inference for Multivariate Data: The R Package npmv
We introduce the R package npmv, which performs nonparametric inference for the comparison of multivariate data samples and provides the results in easy-to-understand, but statistically correct, language. Unlike in classical multivariate analysis of variance, multivariate normality is not required for the data. In fact, the different response variables may even be measured on different scales (binary, ordinal, quantitative). p-values are calculated for overall tests (permutation tests and F approximations), and significant subsets of response variables and factor levels are identified using multiple testing algorithms that control the familywise error rate. The package may be used for low- or high-dimensional data with small or large sample sizes and many or few factor levels.
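To illustrate the permutation-test idea described in the abstract, the following Python sketch compares multivariate samples across factor levels with a generic rank-based permutation test. It is not the npmv implementation or its nonparametric test statistics; the function name `permutation_test` and the simplified statistic are assumptions for illustration only.

```python
# Generic illustration only: a rank-based permutation test for comparing
# multivariate samples across factor levels.  This is NOT the npmv algorithm or
# its test statistics; it just shows the permutation idea the abstract describes.
import numpy as np
from scipy.stats import rankdata

def permutation_test(X, groups, n_perm=5000, seed=0):
    """X: (n, p) responses, possibly on mixed scales; groups: (n,) factor labels."""
    rng = np.random.default_rng(seed)
    X, groups = np.asarray(X, dtype=float), np.asarray(groups)
    R = np.apply_along_axis(rankdata, 0, X)      # rank each response separately
    labels = np.unique(groups)

    def stat(r, g):
        # simplified, assumed statistic: spread of group mean ranks around the grand mean
        means = np.array([r[g == lab].mean(axis=0) for lab in labels])
        return ((means - r.mean(axis=0)) ** 2).sum()

    observed = stat(R, groups)
    perm = np.array([stat(R, rng.permutation(groups)) for _ in range(n_perm)])
    return (perm >= observed).mean()             # permutation p-value
```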
Size and Shape Distributions of Primary Crystallites in Titania Aggregates
The primary crystallite size of titania powder relates to its properties in a number of applications. Transmission electron microscopy was used in this interlaboratory comparison (ILC) to measure primary crystallite size and shape distributions for a commercial aggregated titania powder. Data for four size descriptors and two shape descriptors were evaluated across nine laboratories. Data repeatability and reproducibility were evaluated by analysis of variance. One-third of the laboratory pairs had similar size descriptor data, but 83% of the pairs had similar aspect ratio data. Size descriptor distributions were generally unimodal and were well described by lognormal reference models. Shape descriptor distributions were multimodal, but data visualization plots demonstrated that the Weibull distribution was preferred to the normal distribution. For the equivalent circular diameter size descriptor, the measurement uncertainties of the lognormal distribution scale and width parameters were 9.5% and 22%, respectively. For the aspect ratio shape descriptor, the measurement uncertainties of the Weibull distribution scale and width parameters were 7.0% and 26%, respectively. Both measurement uncertainty estimates and data visualizations should be used to analyze size and shape distributions of particles on the nanoscale.
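To make the distribution fitting concrete, here is a minimal Python sketch, not the ILC's analysis code, of fitting a lognormal reference model to a size descriptor (equivalent circular diameter) and a Weibull model to a shape descriptor (aspect ratio) with scipy. The arrays `ecd_nm` and `aspect_ratio` are hypothetical placeholder data, not measured values from the study.

```python
# Minimal sketch (not the ILC's analysis code): fit the reference models named in the
# abstract -- a lognormal model for a size descriptor (equivalent circular diameter)
# and a Weibull model for a shape descriptor (aspect ratio).  The arrays `ecd_nm`
# and `aspect_ratio` are hypothetical placeholder data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ecd_nm = rng.lognormal(mean=np.log(25.0), sigma=0.4, size=500)   # placeholder sizes, nm
aspect_ratio = 1.0 + 0.4 * rng.weibull(a=2.5, size=500)          # placeholder shapes

# Lognormal fit: scale = exp(mu) plays the role of the scale parameter, s = sigma the width.
s, _, scale = stats.lognorm.fit(ecd_nm, floc=0)
print(f"lognormal: scale exp(mu) = {scale:.1f} nm, width sigma = {s:.2f}")

# Weibull fit for the shape descriptor, with location fixed at the physical minimum of 1.
c, _, scale_w = stats.weibull_min.fit(aspect_ratio, floc=1.0)
print(f"Weibull:   width (shape) = {c:.2f}, scale = {scale_w:.2f}")

# stats.probplot(ecd_nm, sparams=(s, 0, scale), dist=stats.lognorm)  # one possible visualization
```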
A Comparison of Correlation Structure Selection Penalties for Generalized Estimating Equations
Correlated data are commonly analyzed using models constructed with population-averaged generalized estimating equations (GEEs). The specification of a population-averaged GEE model includes selection of a structure describing the correlation of repeated measures. Accurate specification of this structure can improve efficiency, whereas the finite-sample estimation of nuisance correlation parameters can inflate the variances of regression parameter estimates. Therefore, correlation structure selection criteria should penalize, or account for, correlation parameter estimation. In this article, we compare recently proposed penalties in terms of their impacts on correlation structure selection and regression parameter estimation, and give practical considerations for data analysts. Supplementary materials for this article are available online.
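As a hedged, generic illustration of working correlation structure selection for a population-averaged GEE (not the penalties compared in the article), the sketch below fits the same model under several structures with statsmodels and reports QIC as a familiar stand-in criterion; the longitudinal data are simulated placeholders.

```python
# Hedged illustration (not the article's methodology): fit the same
# population-averaged GEE under several working correlation structures with
# statsmodels and compare a familiar criterion (QIC) as a stand-in for the
# penalties discussed in the article.  The longitudinal data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_times = 100, 4
df = pd.DataFrame({
    "id": np.repeat(np.arange(n_subj), n_times),
    "x": rng.normal(size=n_subj * n_times),
})
subj_effect = np.repeat(rng.normal(scale=0.8, size=n_subj), n_times)  # within-subject correlation
df["y"] = 1.0 + 0.5 * df["x"] + subj_effect + rng.normal(size=n_subj * n_times)

structures = {
    "independence": sm.cov_struct.Independence(),
    "exchangeable": sm.cov_struct.Exchangeable(),
    "AR(1)":        sm.cov_struct.Autoregressive(),
}
for name, cov in structures.items():
    fit = smf.gee("y ~ x", groups="id", data=df, cov_struct=cov,
                  family=sm.families.Gaussian()).fit()
    qic, qic_u = fit.qic()    # available in recent statsmodels releases
    print(f"{name:13s} beta_x={fit.params['x']:.3f} se={fit.bse['x']:.3f} QIC={qic:.1f}")
```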
Optimal Designs of Pairwise Calculation: An Application to Free Energy Perturbation in Minimizing Prediction Variability
Predicting the binding free energy of ligand-protein complexes has been a grand challenge in computational chemistry since the early days of molecular modeling. Multiple computational methodologies exist to predict ligand binding affinities: pathway-based Free Energy Perturbation (FEP), Thermodynamic Integration (TI), Linear Interaction Energy (LIE), and Molecular Mechanics-Poisson Boltzmann/Generalized Born Surface Area (MM-PBSA/GBSA) have been applied to a variety of biologically relevant problems and have achieved different levels of predictive accuracy. Recent advancements in computer hardware, in molecular dynamics and Monte Carlo sampling algorithms, and in general force field parameters have made FEP a principal approach for calculating free energy differences, especially host-guest binding affinity differences upon chemical modification.

Because the FEP-calculated binding free energy difference, denoted ΔΔG_FEP, characterizes only the difference in free energy between a pair of ligands or complexes, not the absolute binding free energy of each individual host-guest system, denoted ΔG, we examine two rarely asked questions in FEP application:
1) Which values are more appropriate as predictions for assessing ligands prospectively: the calculated pairwise free energy differences ΔΔG_FEP, or the estimated absolute binding energies, denoted ΔĜ, transformed from the ΔΔG_FEP values?
2) When only a limited number of ligand pairs can be calculated by FEP, can the perturbation pairs be selected optimally with respect to the reference ligand(s) to maximize prediction precision?

These two questions probe an often-neglected assumption in pairwise comparisons: that a pairwise value is sufficient for a quantitative and reliable characterization of an individual ligand's properties or activities. This implicit assumption would hold if each pairwise calculation were error-free. Recent pair designs, such as multiple pathways or cycle-closure analyses, provide estimates of calculation error but do not address the statistical impact of the two questions above. The error impact is fully minimized only by an exhaustive study that computes all C(N,2) = N(N-1)/2 pairs for a set of N molecules, or more if there is directionality (ΔG(i,j) ≠ ΔG(j,i)). Such a design is obviously impractical and unnecessary. We therefore want to collect just enough data that is 1) feasibly attainable, 2) topologically sufficient, and 3) mathematically synthesizable, so that inherent calculation errors are mitigated and conclusions can be drawn with higher confidence.

The significance of these questions is illustrated by a motivating example (Figure 1 and Table 1) that compares two perturbation graph designs for 20 ligands, each with the same number of FEP perturbation pairs (19) and the same reference, Ligand 1. The two designs reach different conclusions when rank-ordering ligand potencies because of errors inherent in the FEP-derived estimates. Under design A, ligands 5, 7, 14, and 15 would be selected as the best four (20%) picks, since their ΔĜ estimates are the most favorable; design B would yield ligands 5, 12, 18, and 19 for the same reason.
Without knowing the true values, ΔG_True, of the other 19 ligands, we lack a prospective metric for assessing which design is more precise, even though, retrospectively, we know that both designs agree reasonably well with the true values as measured by correlation and error metrics. However, the top picks of neither design are consistent with the true top four ligands, which are ligands 7, 10, 12, and 18. Yet if all C(20,2) = 190 pairs could have been calculated, as listed in the last column of Table 1, the best four ligands would have been identified correctly, and the other metrics in Table 1 would have improved significantly. As noted above, though, calculating all possible pairs, or even a significant fraction of them, is rarely feasible in practice, especially when the number of molecules is large. Given this restriction, is it possible to determine objectively whether design A or design B will give more precise predictions?

In this report, we investigate the performance of the calculated ΔΔG_FEP values compared with the pairwise differences of least-squares-derived ΔĜ estimates, both analytically and through simulations. Based on our findings, we recommend applying weighted least squares to transform the ΔΔG_FEP values into ΔĜ estimates. We also investigate the factors that contribute to the precision of the ΔĜ estimates, such as the total number of computed pairs, the selection of computed pairs, and the uncertainty in the computed ΔΔG_FEP values. Mean squared error (MSE) and Spearman's rank correlation are used as performance metrics.

Finally, we demonstrate how structural similarity can be incorporated into the design and its potential impact on prediction precision. As in the majority of reported FEP studies of binding affinity prediction, the ΔΔG_FEP pairs were selected based on chemical structure similarity; pairs with small chemical differences are assumed to be more likely to have small errors in the ΔΔG_FEP calculation. Using the constructed mathematical framework together with literature examples, we show that some pair-selection schemes (designs) are better than others. To minimize prediction uncertainty, the design optimality criterion should be chosen to suit the practical application at hand.
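A minimal Python sketch of the weighted least-squares transformation the abstract recommends: converting pairwise ΔΔG_FEP values on a perturbation graph into per-ligand ΔĜ estimates relative to a reference ligand. The pair list, ΔΔG values, and uncertainties below are hypothetical, and the authors' design-optimality criteria are not reproduced here.

```python
# Minimal sketch, under assumed inputs, of weighted least-squares synthesis of
# per-ligand dG estimates (ΔĜ) from pairwise FEP differences (ΔΔG_FEP) on a
# perturbation graph.  Pair list, values, and uncertainties are hypothetical.
import numpy as np

def estimate_dG(n_ligands, pairs, ddG, sigma, reference=0):
    """pairs: list of (i, j) with ddG[k] ~ dG[j] - dG[i]; sigma: per-pair uncertainty."""
    A = np.zeros((len(pairs), n_ligands))
    for k, (i, j) in enumerate(pairs):
        A[k, i], A[k, j] = -1.0, 1.0
    # Fix the reference ligand's dG at 0 by dropping its column from the design matrix.
    keep = [c for c in range(n_ligands) if c != reference]
    w = 1.0 / np.asarray(sigma)                   # row weights 1/sigma give WLS
    x, *_ = np.linalg.lstsq(A[:, keep] * w[:, None], np.asarray(ddG) * w, rcond=None)
    dG_hat = np.zeros(n_ligands)
    dG_hat[keep] = x
    return dG_hat                                 # dG_hat[reference] == 0 by construction

# Example: a 4-ligand star design around reference ligand 0 plus one closure edge.
pairs = [(0, 1), (0, 2), (0, 3), (1, 2)]
ddG   = [-1.2, 0.4, -0.6, 1.7]                    # hypothetical values, kcal/mol
sigma = [0.3, 0.3, 0.3, 0.3]
print(estimate_dG(4, pairs, ddG, sigma))
```

Given such an estimator, candidate perturbation graphs (for example, a pure star around the reference versus a design that adds cycle-closure edges) can be compared prospectively through the variances of the resulting ΔĜ estimates, which is the kind of design question the report addresses.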