
    Model-based clustering with certainty estimation: implication for clade assignment of influenza viruses

    Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. Open issues remain, such as how to cluster molecular sequences accurately and, in particular, how to evaluate the certainty of clustering results. Results: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate the certainty of the clusters, and showed an intuitive way to examine clusters using 3D visualization. We applied this approach to influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for highly pathogenic H5N1 avian influenza, which agrees with previous findings. The certainty that a given sequence can be correctly assigned to a cluster was 1.0 for all sequences, and the certainty for a given cluster was also very high (0.92–1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; this variation is likely attributable to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated with the standard bootstrap method, suggesting that our bootstrap scheme is applicable to the estimation of clustering certainty. Conclusions: We formulated a clustering analysis approach with estimation of certainties and 3D visualization of sequence data. We analysed two sets of influenza A HA sequences, and the results indicate that our approach is applicable to clustering analysis of influenza viral sequences.

    Confidence intervals for ranks of age-adjusted rates across states or counties

    Health indices provide information to the general public on the health condition of the community. They can also be used to inform the government’s policy making, to evaluate the effect of a current policy or healthcare program, or for program planning and priority setting. It is common practice to rank health indices across geographic units and to report the ranks as fixed values. We argue that the ranks should be viewed as random and hence should be accompanied by an indication of precision (i.e., confidence intervals). A technical difficulty in doing so is how to account for the dependence among the ranks when constructing the confidence intervals. In this paper, we propose a novel Monte Carlo method for constructing individual and simultaneous confidence intervals of ranks for age-adjusted rates. The proposed method takes as input age-specific counts (of cases of disease or deaths) and their associated populations. We have further extended it to the case in which only the age-adjusted rates and their confidence intervals are available. Finally, we apply the proposed method to US age-adjusted cancer incidence rates and mortality rates for cancer and other diseases, by state and by county within a state, using a website that will be publicly available. The results show that for rare or relatively rare diseases (especially at the county level), ranks are essentially meaningless because of their large variability, while for more common diseases in larger geographic units, ranks can be used effectively.
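
The idea of attaching intervals to ranks can be illustrated with a minimal Monte Carlo sketch. It is not the paper's method, whose details the abstract does not give. The sketch assumes that age-specific counts are resampled as independent Poisson variates around the observed counts, that rank 1 denotes the highest age-adjusted rate, and that individual (not simultaneous) intervals are taken as percentiles of the simulated ranks. The names `poisson`, `adjusted_rate`, and `rank_intervals` and all example numbers are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate by Knuth's method (adequate for means up to a few hundred)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def adjusted_rate(counts, pops, weights):
    """Directly age-adjusted rate: weighted sum of age-specific rates."""
    return sum(w * d / n for w, d, n in zip(weights, counts, pops))


def rank_intervals(counts, pops, weights, draws=2000, level=0.95, seed=0):
    """Percentile intervals for each region's rank (assumed Poisson resampling)."""
    rng = random.Random(seed)
    regions = list(counts)
    simulated = {r: [] for r in regions}
    for _ in range(draws):
        rates = {
            r: adjusted_rate([poisson(d, rng) for d in counts[r]], pops[r], weights)
            for r in regions
        }
        # Rank 1 is the region with the highest simulated rate.
        ordered = sorted(regions, key=lambda r: rates[r], reverse=True)
        for rank, r in enumerate(ordered, start=1):
            simulated[r].append(rank)
    lo_idx = max(0, int((1 - level) / 2 * draws))
    hi_idx = min(draws - 1, int((1 + level) / 2 * draws) - 1)
    result = {}
    for r in regions:
        ranks = sorted(simulated[r])
        result[r] = (ranks[lo_idx], ranks[hi_idx])
    return result


# Hypothetical data: three regions, three age groups, standard-population weights.
WEIGHTS = [0.3, 0.5, 0.2]
POPS = {r: [10000, 12000, 8000] for r in ("A", "B", "C")}
COUNTS = {"A": [150, 200, 180], "B": [20, 30, 25], "C": [22, 28, 27]}

if __name__ == "__main__":
    for region, interval in rank_intervals(COUNTS, POPS, WEIGHTS).items():
        print(region, interval)
```

In this toy data, region A has a much higher rate than B and C, so its rank is stable. B and C have similar rates and small counts, so their rank intervals overlap, which is the situation the abstract describes for rare diseases.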

    A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance

    Background: The Significance Analysis of Microarrays (SAM) is a popular method for detecting significantly expressed genes and controlling the false discovery rate (FDR). Recently, it has been reported in the literature that the FDR is not well controlled by SAM. Given the wide application of SAM in microarray data analysis, an extensive evaluation of SAM and its associated R-package (sam2.20) is of great importance. Results: Our study has identified several discrepancies between SAM and sam2.20. One major difference is that SAM and sam2.20 use different methods for estimating FDR. Such discrepancies may cause confusion among researchers who are using SAM or developing SAM-like methods. We have also shown that SAM provides no meaningful estimates of FDR, and that this problem has been corrected in sam2.20 by using a different formula for estimating FDR. However, we have found that, even with this improvement over SAM, sam2.20 may still produce erroneous and even conflicting results in certain situations. Using an example, we show that the problem with sam2.20 is caused by its use of asymmetric cutoffs, which are due to the large variability of null scores at both ends of the order statistics. An obvious approach without the complication of the order statistics is the conventional symmetric cutoff method. For this reason, we have carried out extensive simulations to compare the performance of sam2.20 and the symmetric cutoff method. Finally, a simple modification is proposed to improve the FDR estimation of sam2.20 and the symmetric cutoff method. Conclusion: Our study shows that the most serious drawback of SAM is its poor estimation of FDR. Although this drawback has been corrected in sam2.20, the control of FDR by sam2.20 is still not satisfactory. The comparison between sam2.20 and the symmetric cutoff method reveals that their relative performance depends on the ratio of induced to repressed genes in a microarray data set, and is also affected by the ratio of differentially expressed (DE) to equally expressed (EE) genes and by the distributions of induced and repressed genes. Numerical simulations show that the symmetric cutoff method has the biggest advantage over sam2.20 when there are equal numbers of induced and repressed genes (i.e., the ratio of induced to repressed genes is 1). As the ratio of induced to repressed genes moves away from 1, the advantage of the symmetric cutoff method over sam2.20 gradually diminishes, until eventually sam2.20 becomes significantly better than the symmetric cutoff method when the DE genes are either all induced or all repressed. Simulation results also show that our proposed simple modification provides improved control of FDR for both sam2.20 and the symmetric cutoff method.

    Enhancing Magnetic Ordering in Cr-doped Bi2Se3 using High-TC Ferrimagnetic Insulator

    We report a study of enhancing the magnetic ordering in a model magnetically doped topological insulator (TI), Bi2-xCrxSe3, via the proximity effect using a high-TC ferrimagnetic insulator (FMI), Y3Fe5O12. The FMI provides the TI with a source of exchange interaction without removing the nontrivial surface state. By performing element-specific X-ray magnetic circular dichroism (XMCD) measurements, we have unequivocally observed an enhanced TC of 50 K in this magnetically doped TI/FMI heterostructure. We have also found a larger (6.6 nm at 30 K) but faster decreasing (by 80% from 30 K to 50 K) penetration depth compared to that of diluted ferromagnetic semiconductors (DMSs), which could indicate a novel mechanism for the interaction between FMIs and the nontrivial TI surface.

    The United States COVID-19 Forecast Hub dataset

    Academic researchers, government agencies, industry groups, and individuals have produced forecasts at an unprecedented scale during the COVID-19 pandemic. To leverage these forecasts, the United States Centers for Disease Control and Prevention (CDC) partnered with an academic research lab at the University of Massachusetts Amherst to create the US COVID-19 Forecast Hub. Launched in April 2020, the Forecast Hub is a dataset with point and probabilistic forecasts of incident cases, incident hospitalizations, incident deaths, and cumulative deaths due to COVID-19 at county, state, and national levels in the United States. Included forecasts represent a variety of modeling approaches, data sources, and assumptions regarding the spread of COVID-19. The goal of this dataset is to establish a standardized and comparable set of short-term forecasts from modeling teams. These data can be used to develop ensemble models, communicate forecasts to the public, create visualizations, compare models, and inform policies regarding COVID-19 mitigation. These open-source data are available via download from GitHub, through an online API, and through R packages.

    An Improved Nonparametric Approach for Detecting Differentially Expressed Genes with Replicated Microarray Data

    Previous nonparametric statistical methods for constructing the test and null statistics require at least 4 arrays under each condition. In this paper, we provide an improved method of constructing the test and null statistics which requires only 2 arrays under one condition if the number of arrays under the other condition is at least 3. The conventional testing method defines the rejection region by controlling the probability of Type I error. In this paper, we propose to determine the critical values (or the cut-off points) of the rejection region by directly controlling the false discovery rate. Simulations were carried out to compare the performance of our proposed method with several existing methods. Finally, our proposed method is applied to the rat data of Pan et al. (2003). Both the simulations and the rat data show that our method has lower false discovery rates than the significance analysis of microarrays (SAM) method of Tusher et al. (2001) and the mixture model method (MMM) of Pan et al. (2003).

    An Improved String Composition Method for Sequence Comparison

    Background: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison, one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although predominantly used by biologists, has both fundamental and computational limitations. Consequently, alignment-free methods have been explored as important alternatives for estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer from certain statistical problems and thereby underestimate the amount of evolutionary information in genetic sequences. Results: We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust than its existing counterparts and comparable in robustness with alignment-based methods. Conclusion: We observed two problems in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.
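
As a rough illustration of what a string composition vector is, the sketch below counts overlapping strings of length k in a sequence, converts the counts to frequencies, and compares two sequences by cosine distance. It shows only the basic frequency step. It does not implement the Markov-model correction of existing CV methods or the improved normalization proposed in the paper, neither of which the abstract specifies. The function names and the choice of cosine distance are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k):
    """Frequencies of all overlapping length-k strings in seq."""
    total = len(seq) - k + 1
    if k <= 0 or total <= 0:
        raise ValueError("k must be positive and no longer than the sequence")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: c / total for s, c in counts.items()}


def cosine_distance(f, g):
    """1 minus the cosine similarity of two sparse frequency vectors."""
    dot = sum(v * g.get(s, 0.0) for s, v in f.items())
    norm_f = math.sqrt(sum(v * v for v in f.values()))
    norm_g = math.sqrt(sum(v * v for v in g.values()))
    return 1.0 - dot / (norm_f * norm_g)


if __name__ == "__main__":
    a = kmer_frequencies("ACGTACGTAC", 3)
    b = kmer_frequencies("ACGTTCGTAC", 3)
    print(cosine_distance(a, b))
```

A matrix of such pairwise distances could then be passed to a standard tree-building method. The choice of k corresponds to the string-length question raised at the end of the abstract.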

    A note on the performance of the gamma kernel estimators at the boundary

    The gamma kernel estimator is proposed in Chen [Chen, S.X., 2000. Probability density function estimation using gamma kernels. Annals of the Institute of Statistical Mathematics 52, 471-480] to estimate densities with support [0, ∞). It is shown in his paper that the gamma kernel estimator is non-negative, free of boundary bias, and achieves the optimal rate of convergence for the mean integrated squared error. Numerical results reported in Chen's paper show that, in the boundary region, the gamma kernel estimator even outperforms some widely used boundary corrected density estimators such as the boundary kernel estimator. However, our study finds that the gamma kernel estimator at x=0 is actually the reflection estimator when the double exponential kernel is used and is only boundary problem free when the estimated density has a shoulder at x=0 (i.e., the first derivative of the density at x=0 is zero). For densities not satisfying the shoulder condition, we show that the gamma kernel estimator has a severe boundary problem and its performance is inferior to that of the boundary kernel estimator.
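
The equivalence at x=0 can be checked numerically with a short sketch. It assumes the usual form of Chen's first gamma kernel estimator, which the abstract does not write out: the average over the sample of a gamma density with shape x/b + 1 and scale b, evaluated at each observation. The reflection estimator is assumed to use the double exponential kernel K(u) = exp(-|u|)/2 with the same bandwidth b. Function names and sample values are hypothetical.

```python
import math


def gamma_pdf(t, shape, scale):
    """Gamma density at t for shape >= 1 (the only case needed here)."""
    if t < 0:
        return 0.0
    if t == 0:
        return 1.0 / scale if shape == 1.0 else 0.0
    log_pdf = ((shape - 1.0) * math.log(t) - t / scale
               - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_pdf)


def gamma_kernel_density(x, data, b):
    """Assumed form of the gamma kernel estimator: shape x/b + 1, scale b."""
    shape = x / b + 1.0
    return sum(gamma_pdf(xi, shape, b) for xi in data) / len(data)


def reflection_density(x, data, b):
    """Reflection estimator with the double exponential kernel exp(-|u|)/2."""
    def kernel(u):
        return 0.5 * math.exp(-abs(u))
    return sum(kernel((x - xi) / b) + kernel((x + xi) / b) for xi in data) / (len(data) * b)


if __name__ == "__main__":
    sample = [0.2, 0.5, 1.3, 2.0, 0.05]  # hypothetical non-negative data
    b = 0.4
    print(gamma_kernel_density(0.0, sample, b), reflection_density(0.0, sample, b))
```

At x=0 the gamma kernel has shape 1 and reduces to an exponential density with scale b, so both functions evaluate the mean of exp(-X_i/b)/b. Away from zero the two estimators differ.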

    The convergence rates of empirical Bayes estimation in a multiple linear regression model

    Empirical Bayes estimation, multiple linear regression model, convergence rates