371 research outputs found
Assessing the performance of an allocation rule
Abstract: The problem of estimating the error rates of a sample-based rule on the basis of the same sample used in its construction is considered. The apparent error rate is an obvious nonparametric estimate of the conditional error rate of a sample rule, but unfortunately it provides too optimistic an assessment. Attention is focussed on the formation of improved estimates, mainly through appropriate bias correction of the apparent error rate. In this respect the role of the bootstrap, a computer-based methodology, is highlighted.
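The bias correction described in this abstract can be illustrated with a minimal sketch. It is not the author's procedure: it assumes a simple bootstrap optimism correction, and the nearest-mean rule and all function names (fit_nearest_mean, bootstrap_corrected_error) are hypothetical choices made only for illustration.

```python
import random

def fit_nearest_mean(xs, ys):
    """Fit a two-class nearest-mean rule on 1-D data; returns the class means."""
    means = {}
    for label in (0, 1):
        vals = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda label: abs(x - means[label]))

def error_rate(means, xs, ys):
    """Proportion of points in (xs, ys) misallocated by the rule."""
    wrong = sum(predict(means, x) != y for x, y in zip(xs, ys))
    return wrong / len(xs)

def bootstrap_corrected_error(xs, ys, n_boot=200, seed=0):
    """Return (apparent error, bootstrap bias-corrected error)."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent error: the rule is assessed on the sample that built it.
    apparent = error_rate(fit_nearest_mean(xs, ys), xs, ys)
    optimism = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(by)) < 2:
            continue  # assumption: skip resamples that miss a class
        rule = fit_nearest_mean(bx, by)
        # Optimism: error on the original sample minus error on the resample.
        optimism.append(error_rate(rule, xs, ys) - error_rate(rule, bx, by))
    bias = sum(optimism) / n_boot
    return apparent, apparent + bias
```

The corrected estimate adds the average optimism over the resamples to the apparent error rate, so it is usually larger than the apparent rate when the classes overlap.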
Application of Gene Shaving and Mixture Models to Cluster Microarray Gene Expression Data
Researchers are frequently faced with the analysis of microarray data of a relatively large number of genes using a small number of tissue samples. We examine the application of two statistical methods for clustering such microarray expression data: EMMIX-GENE and GeneClust. EMMIX-GENE is a mixture-model based clustering approach, designed primarily to cluster tissue samples on the basis of the genes. GeneClust is an implementation of the gene shaving methodology, motivated by research to identify distinct sets of genes for which variation in expression could be related to a biological property of the tissue samples. We illustrate the use of these two methods in the analysis of Affymetrix oligonucleotide arrays of well-known data sets from colon tissue samples with and without tumors, and of tumor tissue samples from patients with leukemia. Although the two approaches have been developed from different perspectives, the results demonstrate a clear correspondence between gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. It is demonstrated, for the case of ribosomal proteins and smooth muscle genes in the colon data set, that both methods can classify genes into co-regulated families. It is further demonstrated that tissue types (tumor and normal) can be separated on the basis of subtle distributed patterns of genes. Application to the leukemia tissue data produces a division of tissues corresponding closely to the external classification, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), for both methods. In addition, we identify genes specific for the subgroup of ALL-Tcell samples.
Overall, we find that the gene shaving method produces gene clusters at great speed; allows variable cluster sizes and can incorporate partial or full supervision; and finds clusters of genes in which the gene expression varies greatly over the tissue samples while maintaining a high level of coherence between the gene expression profiles. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering, and then tissue clustering, and is favorable in its accuracy of ranking the clusters produced.
Variational Integrators for Almost-Integrable Systems
We construct several variational integrators--integrators based on a discrete
variational principle--for systems with Lagrangians of the form L = L_A +
epsilon L_B, with epsilon << 1, where L_A describes an integrable system. These
integrators exploit that epsilon << 1 to increase their accuracy by
constructing discrete Lagrangians based on the assumption that the integrator
trajectory is close to that of the integrable system. Several of the
integrators we present are equivalent to well-known symplectic integrators for
the equivalent perturbed Hamiltonian systems, but their construction and error
analysis are significantly simpler in the variational framework. One novel
method we present, involving a weighted time-averaging of the perturbing terms,
removes all errors from the integration at O(epsilon). This last method is
implicit, and involves evaluating a potentially expensive time-integral, but
for some systems and some error tolerances it can significantly outperform
traditional simulation methods.
Comment: 14 pages, 4 figures. Version 2: added informative example; as
accepted by Celestial Mechanics and Dynamical Astronomy
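The abstract notes that several of the integrators coincide with well-known symplectic integrators for the perturbed Hamiltonian system. As a rough illustration only, the sketch below shows a standard kick-drift-kick splitting for H = H_A + epsilon H_B, not the paper's novel time-averaged method. The choice of H_A as a unit harmonic oscillator, H_B = q^4/4, and all function names are assumptions made for the example.

```python
import math

def kick(q, p, eps, dt):
    """Apply the perturbation H_B = q**4 / 4 alone: dp/dt = -eps * q**3."""
    return q, p - dt * eps * q ** 3

def drift(q, p, dt):
    """Exact flow of the integrable part H_A = (p**2 + q**2) / 2 (a rotation)."""
    c, s = math.cos(dt), math.sin(dt)
    return c * q + s * p, -s * q + c * p

def step(q, p, eps, dt):
    """One kick-drift-kick step: half kick, exact integrable flow, half kick."""
    q, p = kick(q, p, eps, dt / 2)
    q, p = drift(q, p, dt)
    q, p = kick(q, p, eps, dt / 2)
    return q, p

def energy(q, p, eps):
    """Total energy H_A + eps * H_B, used here only to monitor the error."""
    return 0.5 * (p * p + q * q) + eps * q ** 4 / 4
```

Because the integrable part is advanced exactly, the scheme is exact when epsilon is zero, and for small epsilon the energy error stays small over long runs.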
Detection of elliptical shapes via cross-entropy clustering
The problem of finding elliptical shapes in an image is considered. We
discuss a solution that uses cross-entropy clustering. The proposed method
allows the search for ellipses with predefined sizes and position in the space.
Moreover, it works well for finding ellipsoids in higher dimensions.
Testing the validity of the proposed ICD-11 PTSD and complex PTSD criteria using a sample from Northern Uganda
Background: The International Classification of Diseases (ICD-11) is currently under development with proposed changes recommended for the posttraumatic stress disorder (PTSD) diagnosis and the inclusion of a separate complex PTSD (CPTSD) disorder. Empirical studies support the distinction between PTSD and CPTSD; however, less research has focused on non-western populations. Objective: The aim of this study was to investigate whether distinct PTSD and CPTSD symptom classes emerged and to identify potential risk factors and the severity of impairment associated with resultant classes. Methods: A latent class analysis (LCA) and related analyses were conducted on 314 young adults from Northern Uganda. Fifty-one percent were female and participants were aged between 18 and 25 years. Forty percent of the participants were former child soldiers (n=124) while the remaining participants were civilians (n=190). Results: The LCA revealed three classes: a CPTSD class (40.2%), a PTSD class (43.8%), and a low symptom class (16%). Child soldier status was a significant predictor of both CPTSD and PTSD classes (OR=5.96 and 2.82, respectively). Classes differed significantly on measures of anxiety/depression, conduct problems, somatic complaints, and war experiences. Conclusions: This study provides preliminary support for the proposed distinction between PTSD and CPTSD in a young adult sample from Northern Uganda. However, future studies using larger samples are needed to test alternative models before firm conclusions can be made.
A comparison of statistical machine learning methods in heartbeat detection and classification
In health care, patients with heart problems require quick responsiveness in a clinical setting or in the operating theatre. Towards that end, automated classification of heartbeats is vital as some heartbeat irregularities are time-consuming to detect. Therefore, analysis of electro-cardiogram (ECG) signals is an active area of research. The methods proposed in the literature depend on the structure of a heartbeat cycle. In this paper, we use interval and amplitude based features together with a few samples from the ECG signal as a feature vector. We studied a variety of classification algorithms, focusing especially on a type of arrhythmia known as ventricular ectopic fibrillation (VEB). We compare the performance of the classifiers against algorithms proposed in the literature and make recommendations regarding features, sampling rate, and choice of the classifier to apply in a real-time clinical setting. The extensive study is based on the MIT-BIH arrhythmia database. Our main contributions are the evaluation of existing classifiers over a range of sampling rates, the recommendation of a detection methodology to employ in a practical setting, and the extension of the notion of a mixture of experts to a larger class of algorithms.
Hierarchical Gaussian process mixtures for regression
As a result of their good performance in practice and their desirable analytical properties, Gaussian process regression models are of increasing interest in statistics, engineering and other fields. However, two major problems arise when the model is applied to a large data-set with repeated measurements. One stems from the systematic heterogeneity among the different replications, and the other is the requirement to invert a covariance matrix whose dimension equals the sample size of the training data-set. In this paper, a Gaussian process mixture model for regression is proposed for dealing with the above two problems, and a hybrid Markov chain Monte Carlo (MCMC) algorithm is used for its implementation. Application to a real data-set is reported.
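The computational problem mentioned in this abstract can be seen in a small sketch of ordinary Gaussian process regression, where prediction requires solving a linear system with an n-by-n covariance matrix for n training points. This is not the mixture model proposed in the paper: the squared-exponential kernel, the noise level, and the function names are assumptions chosen for illustration.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential covariance between two scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_chol(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior_mean(xs, ys, x_new, noise=1e-2):
    """Posterior mean at x_new; the n-by-n solve dominates the cost."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alpha = solve_chol(cholesky(K), ys)
    return sum(rbf(x_new, xs[i]) * alpha[i] for i in range(n))
```

The factorization costs on the order of n cubed operations, which is why the matrix dimension growing with the training sample size becomes a bottleneck for large data-sets.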
Latent class analysis variable selection
We propose a method for selecting variables in latent class analysis, which is the most common model-based clustering method for discrete data. The method assesses a variable's usefulness for clustering by comparing two models, given the clustering variables already selected. In one model the variable contributes information about cluster allocation beyond that contained in the already selected variables, and in the other model it does not. A headlong search algorithm is used to explore the model space and select clustering variables. In simulated datasets we found that the method selected the correct clustering variables, and also led to improvements in classification performance and in accuracy of the choice of the number of classes. In two real datasets, our method discovered the same group structure with fewer variables. In a dataset from the International HapMap Project consisting of 639 single nucleotide polymorphisms (SNPs) from 210 members of different groups, our method discovered the same group structure with a much smaller number of SNPs.