
    Data Mining For Robust Tests Of Spread [QA76.9.D343 T26 2008 f rb].

    A large quantity of multidimensional data (simulation data sets) was available from the SAS output listings of six hundred and thirty-four robust test-of-spread procedures conducted by Keselman, Wilcox, Algina, Othman, and Fradette (in press).

    The application of machine learning models in the concussion diagnosis process

    “Concussions represent a growing health concern and are challenging to diagnose and manage. Roughly four million concussions are diagnosed every year in the United States. Although research into the application of advanced metrics such as neuroimaging and blood biomarkers has shown promise, these have yet to be implemented at a clinical level due to cost and reliability concerns. Therefore, concussion diagnosis still relies on clinical evaluations of symptoms, balance, and neurocognitive status and function. The lack of a universal threshold on these assessments makes the diagnosis process entirely reliant on a physician’s interpretation of the assessment scores. This study aims to show that machine learning models can benefit the concussion diagnosis process. While studies on machine learning applications for traumatic brain injuries are gaining traction, previous studies have relied primarily on neuroimaging metrics, and the few that used clinical assessment tests employed only univariate models. This study explores the use of multiple assessment scores in the models and evaluates the importance of each score from the clinical tests. A comprehensive predictive modeling approach was conducted, evaluating a number of candidate models and subsampling techniques. The findings in this research demonstrate the potential benefits of machine learning models to identify concussed and non-concussed subjects at a 24- to 48-hour post-injury time point. The results also suggest that not all clinical assessment test scores are of equal importance”--Abstract, page iv
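    The modeling approach described in this abstract can be sketched roughly as follows: several candidate classifiers are scored on clinical assessment features, with random undersampling to balance the concussed and non-concussed groups. This is a minimal illustration, assuming tabular assessment scores in X and binary labels in y; the candidate models and the undersampling choice are illustrative, not the study's exact configuration.

        # Minimal sketch: candidate models plus subsampling, not the study's
        # exact pipeline. X holds clinical assessment scores, y binary labels.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        def undersample(X, y, rng):
            """Randomly drop majority-class rows until classes are balanced."""
            classes, counts = np.unique(y, return_counts=True)
            n = counts.min()
            keep = np.concatenate([rng.choice(np.flatnonzero(y == c), n,
                                              replace=False) for c in classes])
            return X[keep], y[keep]

        def evaluate(X, y, seed=0):
            """Cross-validated accuracy for each candidate model."""
            Xb, yb = undersample(X, y, np.random.default_rng(seed))
            candidates = {
                "logistic": LogisticRegression(max_iter=1000),
                "random_forest": RandomForestClassifier(n_estimators=200),
            }
            return {name: cross_val_score(m, Xb, yb, cv=5).mean()
                    for name, m in candidates.items()}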

    Restricting Supervised Learning: Feature Selection and Feature Space Partition

    Many supervised learning problems are considered difficult to solve either because of redundant features or because of the structural complexity of the generative function. Redundant features increase the learning noise and therefore decrease prediction performance. Additionally, a number of problems in applications such as bioinformatics or image processing, whose data are sampled in a high-dimensional space, suffer from the curse of dimensionality: there are not enough observations to obtain good estimates. It is therefore necessary to reduce the set of features under consideration. Another issue in supervised learning is caused by the complexity of the unknown generative model. To obtain a low-variance predictor, linear or other simple functions are normally suggested, but they usually result in high bias. A possible solution is to partition the feature space into multiple non-overlapping regions such that each region is simple enough to be classified easily. In this dissertation, we propose several novel techniques for restricting supervised learning problems with respect to either feature selection or feature space partition.

    Among feature selection methods, 1-norm regularization is advocated by many researchers because it incorporates feature selection into the learning process itself. We focus on ranking problems because very little work has been done on ranking with an L1 penalty. We present a 1-norm support vector machine method that simultaneously finds a linear ranking function and performs feature subset selection in ranking problems. Because ranking is formulated as a classification task over pair-wise data, the computational complexity grows from linear to quadratic in the sample size; we propose a convex hull reduction method to reduce this impact. The method was tested on one artificial data set and two benchmark real data sets, the concrete compressive strength data set and the Abalone data set.

    Theoretically, by tuning the trade-off parameter between the 1-norm penalty and the empirical error, any desired size of feature subset can be achieved, but computing the whole solution path in terms of the trade-off parameter is extremely difficult, so 1-norm regularization alone may not yield a feature subset of small size. We therefore propose a recursive feature selection method based on 1-norm regularization which handles the multi-class setting effectively and efficiently. The selection is performed iteratively: in each iteration, a linear multi-class classifier is trained with 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero, and those zero-weight features are eliminated in the next iteration. The selection process has a fast rate of convergence. We tested the method on an earthworm microarray data set, and the empirical results demonstrate that the selected features (genes) have very competitive discriminative power. A minimal sketch of this recursive pruning idea follows.
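    The sketch below illustrates the recursive 1-norm pruning loop just described, assuming an L1-regularized linear SVM as the base classifier; the solver, regularization strength, and convergence test are illustrative choices, not the dissertation's exact algorithm.

        # Sketch of recursive feature selection via 1-norm regularization:
        # train an L1-penalized linear classifier, drop zero-weight features,
        # and repeat until no more features are pruned.
        import numpy as np
        from sklearn.svm import LinearSVC

        def recursive_l1_selection(X, y, C=0.1, max_rounds=20):
            """Return indices of features surviving repeated L1 pruning."""
            surviving = np.arange(X.shape[1])
            for _ in range(max_rounds):
                clf = LinearSVC(penalty="l1", dual=False, C=C).fit(
                    X[:, surviving], y)
                # Keep a feature if its weight is nonzero in any class.
                nonzero = np.any(np.abs(clf.coef_) > 1e-8, axis=0)
                if nonzero.all():      # nothing pruned this round: stop
                    break
                surviving = surviving[nonzero]
            return surviving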
    Feature space partition separates a complex learning problem into multiple non-overlapping simple sub-problems and is normally implemented in a hierarchical fashion. Unlike in a decision tree, a leaf node of this hierarchical structure does not represent a single decision; it represents a region (sub-problem) that is solvable with linear functions or other simple functions. In our work, we incorporate domain knowledge into the feature space partition process. We consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems; the domain information is useful if the partition simplifies the learning task. However, it is not trivial to select the discrete or categorical attribute that maximally simplifies the learning task. A naive approach exhaustively searches all possible restructured problems, which is computationally prohibitive when the number of discrete or categorical attributes is large. We describe a metric that ranks attributes according to their potential to reduce the uncertainty of a classification task, quantified as the conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid the high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach was tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem.

    Restricting supervised learning is ultimately about building simple learning functions using a limited number of features. The Top Selected Pair (TSP) method builds simple classifiers based on very few (for example, two) features with simple arithmetic calculation. However, the traditional TSP method only deals with static data. In this dissertation, we propose classification methods for time series data that depend on only a few pairs of features. Based on different comparison strategies, we developed the following approaches: TSP based on average, TSP based on trend, and TSP based on trend and absolute difference amount. In addition, inspired by the idea of using two features, we propose a time series classification method based on a few feature pairs using dynamic time warping and nearest neighbor classification. A sketch of the basic pair rule appears below.
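    As a rough illustration of the pair idea, the sketch below searches for the single feature pair whose "greater-than" rule best separates two classes after averaging each feature's series over time. This corresponds loosely to the "TSP based on average" strategy; the brute-force pair search and binary-label assumption are illustrative simplifications, not the dissertation's exact method.

        # Sketch of a pair-based rule in the spirit of "TSP based on average":
        # collapse each feature's time series to its mean, then find the pair
        # (i, j) for which the rule "mean_i > mean_j" best separates classes.
        import numpy as np

        def best_average_pair(X, y):
            """X: (n_samples, n_features, n_timesteps); y: 0/1 labels."""
            means = X.mean(axis=2)              # per-feature time averages
            best, best_acc = (0, 1), 0.0
            n_feat = means.shape[1]
            for i in range(n_feat):
                for j in range(n_feat):
                    if i == j:
                        continue
                    pred = (means[:, i] > means[:, j]).astype(int)
                    acc = (pred == y).mean()    # orientation fixed by (i, j)
                    if acc > best_acc:
                        best, best_acc = (i, j), acc
            return best, best_acc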

    Hypothesis-based machine learning for deep-water channel systems

    Machine learning algorithms are readily being incorporated into petroleum industry workflows for use in well-log correlation, prediction of rock properties, and seismic data interpretation. However, there is a clear disconnect between sedimentology and data analytics in these workflows because sedimentologic data is largely qualitative and descriptive. Sedimentology defines stratigraphic architecture and heterogeneity, which can greatly impact reservoir quality and connectivity and thus hydrocarbon recovery. Deep-water channel systems are an example where predicting reservoir architecture is critical to mitigating risk in hydrocarbon exploration. Deep-water reservoirs are characterized by spatial and temporal variations in channel body stacking patterns, which are difficult to predict given the paucity of borehole data and the low-quality seismic data available in these remote locations. These stacking patterns have been shown to be a key variable controlling reservoir connectivity. In this study, the gap between sedimentology and data analytics is bridged using machine learning algorithms to predict stratigraphic architecture and heterogeneity in a deep-water slope channel system. The algorithms classify variables that capture channel stacking patterns (i.e., channel positions: axis, off-axis, and margin) from a database of outcrop statistics sourced from 68 stratigraphic measured sections from outcrops of the Upper Cretaceous Tres Pasos Formation at Laguna Figueroa in the Magallanes Basin, Chile.

    An initial hypothesis, that channel position could be predicted from 1D descriptive sedimentologic data, was tested with a series of machine learning algorithms and classification schemes. The results confirmed this hypothesis: complex algorithms (i.e., random forest, XGBoost, and neural networks) achieved accuracies above 80%, while less complex algorithms (i.e., decision trees) achieved accuracies between 60% and 70%. However, certain classes, such as the transitional off-axis class, were difficult for the algorithms to classify. Additionally, an interpretive classification scheme performed better (by around 10%-20% in some cases) than a geometric scheme devised to remove interpretation bias, although outcrop observations reveal that the interpretive scheme may be an over-simplified approach and that more heterogeneity likely exists in each class, as revealed by the geometric scheme. A refined hypothesis was developed that a hierarchical machine learning approach could lend deeper insight into the heterogeneity within sedimentologic classes that are difficult for an interpreter to discern by observation alone. This hierarchical analysis revealed distinct sub-classes in the margin channel position that highlight variations in margin depositional style, and the conceptual impact of these varying margin styles on fluid flow and connectivity is shown.
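    The model comparison described above can be approximated with a short sketch: candidate classifiers of varying complexity are cross-validated on tabular features derived from measured sections. The feature matrix, labels, and model settings here are assumptions for illustration, not the study's actual data or configuration (scikit-learn's gradient boosting stands in for XGBoost).

        # Sketch of comparing simple vs. complex classifiers for channel
        # position (axis / off-axis / margin) from 1D sedimentologic features.
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.ensemble import (RandomForestClassifier,
                                      GradientBoostingClassifier)

        def compare_models(X, y, cv=5):
            """Return mean cross-validated accuracy per candidate model."""
            candidates = {
                "decision_tree": DecisionTreeClassifier(max_depth=5),
                "random_forest": RandomForestClassifier(n_estimators=300),
                "gradient_boosting": GradientBoostingClassifier(),
            }
            return {name: cross_val_score(m, X, y, cv=cv).mean()
                    for name, m in candidates.items()}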

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients; however, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict the subset of patients who respond differently to treatment. This study begins with a brief history of classification methods, with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical, and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to each method are used to make classification decisions, and approaches are outlined that employ these distances to measure treatment interactions and predict the patients most sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design, with results compared against logistic regression models. Parametric and nonparametric methods were found to perform reasonably well, with relative performance depending on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate further development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.
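    In the spirit of the power evaluation described above, a simulation can estimate empirical power as the rejection rate of a treatment-by-biomarker interaction across simulated trials. Everything below is a hedged toy example: the data-generating model, effect size, sample size, and use of a single continuous biomarker are illustrative assumptions, not the study's design.

        # Toy power simulation: generate trials with a treatment-sensitive
        # subgroup, test the treatment-by-biomarker interaction, and report
        # the fraction of trials in which it is detected at alpha = 0.05.
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)

        def one_trial(n=200, effect=1.0):
            biomarker = rng.normal(size=n)
            treat = rng.integers(0, 2, size=n)
            logit = -0.5 + effect * treat * biomarker   # assumed true model
            y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
            X = sm.add_constant(np.column_stack([treat, biomarker,
                                                 treat * biomarker]))
            fit = sm.Logit(y.astype(float), X).fit(disp=0)
            return fit.pvalues[3] < 0.05                # interaction found?

        power = np.mean([one_trial() for _ in range(500)])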