12 research outputs found

    Algebraic Comparison of Partial Lists in Bioinformatics

    The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling a biological phenotype in terms of a classification or regression model. Because of resampling protocols, or simply within a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained instead of a single list. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset.

    A family of measures for best top-n class-selective decision rules

    When classes strongly overlap in the feature space, or when some classes are not known in advance, the performance of a classifier degrades sharply. To overcome this problem, the reject option has been introduced. It consists in withholding the decision and letting another classifier, or an expert, decide whenever an exclusive classification is not reliable enough. The classification problem then becomes a matter of class selection, from none to all classes. In this paper, we propose a family of measures suitable for defining such decision rules. It is based on a new family of operators that, using a single threshold, are able to detect blocks of similar values within a set of numbers in the unit interval, namely the soft labels of an incoming pattern to be classified. Experiments on synthetic and real datasets available in the public domain show the efficiency of our approach.

    The Metabolomic Profile in Amyotrophic Lateral Sclerosis Changes According to the Progression of the Disease: An Exploratory Study

    Amyotrophic lateral sclerosis (ALS) is a multifactorial neurodegenerative pathology of the upper or lower motor neuron. Evaluation of ALS progression is based on clinical outcomes considering the impairment of body sites. ALS has been extensively investigated in terms of its pathogenetic mechanisms and clinical profile; however, no molecular biomarkers are used as diagnostic criteria to establish the pathological staging of ALS. Using the source-reconstructed magnetoencephalography (MEG) approach, we demonstrated that global brain hyperconnectivity is associated with early and advanced clinical ALS stages. Using nuclear magnetic resonance (1H-NMR) and high resolution mass spectrometry (HRMS) spectroscopy, here we studied the metabolomic profile of sera from ALS patients at different stages of disease progression, namely early and advanced. Multivariate statistical analysis of the data, integrated with network analysis, indicates that metabolites related to energy deficit, abnormal concentrations of neurotoxic metabolites and metabolites related to neurotransmitter production are pathognomonic of ALS in the advanced stage. Furthermore, analysis of the lipidomic profile indicates that advanced ALS patients show significant alteration of phosphocholine (PCs), lysophosphatidylcholine (LPCs), and sphingomyelin (SMs) metabolism, consistent with the need for lipid remodeling to repair advanced neuronal degeneration and inflammation.

    Information-Theoretic Measures for Objective Evaluation of Classifications

    This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs). First, we define objective measures as those that do not depend on any free parameter. This definition provides technical simplicity for examining "objectivity" or "subjectivity" directly in classification evaluations. Second, we propose twenty-four normalized ITMs, derived from either mutual information, divergence, or cross-entropy, for investigation. Contrary to conventional performance measures that apply empirical formulas based on users' intuitions or preferences, the ITMs are theoretically more sound for realizing objective evaluations of classifications. We apply them to distinguish "error types" and "reject types" in binary classifications without the need for cost terms as input data. Third, to better understand and select the ITMs, we suggest three desirable features for classification assessment measures, which appear more crucial and appealing from the viewpoint of classification applications. Using these features as "meta-measures", we can reveal the advantages and limitations of ITMs from a higher level of evaluation knowledge. Numerical examples are given to corroborate our claims and compare the differences among the proposed measures. The best measure is selected in terms of the meta-measures, and its specific properties regarding error types and reject types are analytically derived. Comment: 25 pages, 1 figure, 10 tables.
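To make the idea of a parameter-free, information-theoretic score for an abstaining classifier concrete, here is a minimal Python sketch. It assumes one common normalization, mutual information divided by the entropy of the true labels, and it is not claimed to be any of the twenty-four measures in the abstract. The function name and the layout of the confusion matrix (rows are true classes, columns are predicted classes plus a final "reject" column) are illustrative assumptions.

```python
import math


def normalized_mutual_information(confusion):
    """Return I(T; Y) / H(T) for a confusion matrix of counts.

    Assumed layout (hypothetical): rows are true classes, columns are the
    predicted classes followed by one extra "reject" column.
    """
    total = sum(sum(row) for row in confusion)
    n_cols = len(confusion[0])
    # Marginal distributions of true labels (rows) and decisions (columns).
    p_true = [sum(row) / total for row in confusion]
    p_pred = [sum(row[j] for row in confusion) / total for j in range(n_cols)]

    mutual_info = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p_joint = count / total
                mutual_info += p_joint * math.log2(p_joint / (p_true[i] * p_pred[j]))

    entropy_true = -sum(p * math.log2(p) for p in p_true if p > 0)
    return mutual_info / entropy_true if entropy_true > 0 else 0.0


# Hypothetical counts: two balanced classes, third column is "reject".
perfect = [[5, 0, 0], [0, 5, 0]]
all_rejected = [[0, 0, 5], [0, 0, 5]]
print(normalized_mutual_information(perfect))
print(normalized_mutual_information(all_rejected))
```

Under this normalization, a classifier that rejects every pattern conveys no information about the true class, so it scores zero without any cost terms being supplied.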

    An Optimum Class-Rejective Decision Rule and Its Evaluation


    A hybrid computational intelligence approach to groundwater spring potential mapping

    © 2019 by the authors. This study proposes a hybrid computational intelligence model that combines the alternating decision tree (ADTree) classifier and the AdaBoost (AB) ensemble, namely "AB-ADTree", for groundwater spring potential mapping (GSPM) at the Chilgazi watershed in the Kurdistan province, Iran. Although ADTree and its ensembles have been widely used for environmental and ecological modeling, they have rarely been applied to GSPM. To that end, a groundwater spring inventory map and thirteen conditioning factors tested by the chi-square attribute evaluation (CSAE) technique were used to generate training and testing datasets for constructing and validating the proposed model. The performance of the proposed model was evaluated using statistical-index-based measures, such as positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, root mean square error (RMSE), and the area under the receiver operating characteristic (ROC) curve (AUROC). The proposed hybrid model was also compared with five state-of-the-art benchmark soft computing models, including single ADTree, support vector machine (SVM), stochastic gradient descent (SGD), logistic model tree (LMT), logistic regression (LR), and random forest (RF). Results indicate that the proposed hybrid model significantly improved the predictive capability of the ADTree-based classifier (AUROC = 0.789). In addition, it was found that the hybrid model, AB-ADTree (AUROC = 0.815), had the highest goodness-of-fit and prediction accuracy, followed by the LMT (AUROC = 0.803), RF (AUC = 0.803), SGD, and SVM (AUROC = 0.790) models. Indeed, this model is a powerful and robust technique for mapping groundwater spring potential in the study area. Therefore, the proposed model is a promising tool to help planners, decision makers, managers, and governments in the management and planning of groundwater resources.
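The statistical-index measures named in this abstract (PPV, NPV, sensitivity, specificity, accuracy) all follow from the four cells of a binary confusion matrix. The Python sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard confusion-matrix indices from raw counts."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for a spring / non-spring test set.
for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```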

    Landslide susceptibility mapping using remote sensing data and geographic information system-based algorithms

    Whether they occur due to natural triggers or human activities, landslides lead to loss of life and damage to property, affecting infrastructure, road networks and buildings. A Landslide Susceptibility Map (LSM) provides policy and decision makers with valuable information. This study aims to detect landslide locations by using Sentinel-1 data, the only freely available online radar imagery, and to map areas prone to landslides using a novel AB-ADTree algorithm in Cameron Highlands, Pahang, Malaysia. A total of 152 landslide locations were detected by integrating the Interferometric Synthetic Aperture Radar (InSAR) technique, Google Earth (GE) images and an extensive field survey. Of these, 80% were used for training the machine learning algorithms and the remaining 20% for validation. Seventeen triggering and conditioning factors, namely slope, aspect, elevation, distance to road, distance to river, proximity to fault, road density, river density, Normalized Difference Vegetation Index (NDVI), rainfall, land cover, lithology, soil types, curvature, profile curvature, Stream Power Index (SPI) and Topographic Wetness Index (TWI), were extracted from satellite imagery, a digital elevation model (DEM), and geological and soil maps. These factors were used to generate landslide susceptibility maps using the Logistic Regression (LR) model, Logistic Model Tree (LMT), Random Forest (RF), Alternating Decision Tree (ADTree), Adaptive Boosting (AdaBoost) and a novel hybrid of the ADTree and AdaBoost models, namely the AB-ADTree model. The validation was based on the area under the ROC curve (AUC) and the statistical measures of Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, specificity, accuracy and Root Mean Square Error (RMSE). The results showed that AUC was 90%, 92%, 88%, 59%, 96% and 94% for the LR, LMT, RF, ADTree, AdaBoost and AB-ADTree algorithms, respectively.
The non-parametric Friedman and Wilcoxon tests were also applied to assess the models' performance; the findings revealed that ADTree is inferior to the other models used in this study. Using a handheld Global Positioning System (GPS) device, field study and validation were performed for almost 20% (30 locations) of the detected landslide locations, and the results revealed that the landslide locations were correctly detected. In conclusion, this study is applicable to hazard mitigation and regional planning.

    Kernel-Based Ranking. Methods for Learning and Performance Estimation

    Machine learning provides tools for automated construction of predictive models in data intensive areas of engineering and science. The family of regularized kernel methods has in recent years become one of the mainstream approaches to machine learning, due to a number of advantages the methods share. The approach provides theoretically well-founded solutions to the problems of under- and overfitting, allows learning from structured data, and has been empirically demonstrated to yield high predictive performance on a wide range of application domains. Historically, the problems of classification and regression have gained the majority of attention in the field. In this thesis we focus on another type of learning problem, that of learning to rank. In learning to rank, the aim is to learn, from a set of past observations, a ranking function that can order new objects according to how well they match some underlying criterion of goodness. As an important special case of the setting, we can recover the bipartite ranking problem, corresponding to maximizing the area under the ROC curve (AUC) in binary classification. Ranking applications appear in a large variety of settings; examples encountered in this thesis include document retrieval in web search, recommender systems, information extraction and automated parsing of natural language. We consider the pairwise approach to learning to rank, where ranking models are learned by minimizing the expected probability of ranking any two randomly drawn test examples incorrectly. The development of computationally efficient kernel methods based on this approach has in the past proven to be challenging. Moreover, it is not clear which techniques for estimating the predictive performance of learned models are the most reliable in the ranking setting, and how the techniques can be implemented efficiently. The contributions of this thesis are as follows.
First, we develop RankRLS, a computationally efficient kernel method for learning to rank that is based on minimizing a regularized pairwise least-squares loss. In addition to training methods, we introduce a variety of algorithms for tasks such as model selection, multi-output learning, and cross-validation, based on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm, which is one of the most well established methods for learning to rank. Third, we study the combination of the empirical kernel map and reduced set approximation, which allows the large-scale training of kernel machines using linear solvers, and propose computationally efficient solutions to cross-validation when using the approach. Next, we explore the problem of reliable cross-validation when using AUC as a performance criterion, through an extensive simulation study. We demonstrate that the proposed leave-pair-out cross-validation approach leads to more reliable performance estimation than commonly used alternative approaches. Finally, we present a case study on applying machine learning to information extraction from biomedical literature, which combines several of the approaches considered in the thesis. The thesis is divided into two parts. Part I provides the background for the research work and summarizes the most central results; Part II consists of the five original research articles that are the main contribution of this thesis.
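The link between bipartite ranking and AUC mentioned in this abstract can be illustrated with a short Python sketch: AUC equals the fraction of positive-negative pairs that a scoring function orders correctly. This is the standard pairwise definition, not the thesis's RankRLS or leave-pair-out implementation. The function name is hypothetical, and counting ties as half a correct pair is a common convention assumed here.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumptions: labels are 1 (positive) or 0 (negative); a tie in scores
    counts as half a correctly ranked pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical scores: three of the four positive-negative pairs are ordered correctly.
print(pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

The pairwise ranking error described in the abstract is one minus this quantity, which is why minimizing the probability of misordering a random pair corresponds to maximizing AUC.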