Algebraic Comparison of Partial Lists in Bioinformatics
The outcome of a functional genomics pipeline is usually a partial list of
genomic features, ranked by their relevance in modelling biological phenotype
in terms of a classification or regression model. Due to resampling
protocols, or within a meta-analysis comparison, it is often the case that
sets of alternative feature lists (possibly of different lengths) are
obtained instead of a single list. Here we introduce a method, based on the
algebraic theory of
symmetric groups, for studying the variability between lists ("list stability")
in the case of lists of unequal length. We provide algorithms evaluating
stability for lists embedded in the full feature set or just limited to the
features occurring in the partial lists. The method is demonstrated first on
synthetic data in a gene filtering task and then for finding gene profiles on a
recent prostate cancer dataset.
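To make the notion of "list stability" concrete, the sketch below embeds each partial ranked list into a full rank vector over the complete feature set (features absent from a list share a tied bottom rank) and averages a pairwise rank distance. The Canberra distance and the tied-rank convention are illustrative assumptions, not the authors' exact symmetric-group construction; function names are invented for the example.

```python
from itertools import combinations

def to_rank_vector(partial, universe):
    """Embed a partial ranked list into a full rank vector over `universe`.
    Features absent from the list share a tied bottom rank (the average of
    the remaining ranks k+1..N)."""
    k = len(partial)
    tied = (k + 1 + len(universe)) / 2.0
    ranks = {f: i + 1 for i, f in enumerate(partial)}
    return [ranks.get(f, tied) for f in universe]

def canberra(x, y):
    """Canberra distance between two rank vectors (all ranks are positive)."""
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y))

def stability(lists, universe):
    """Mean pairwise distance between the embedded rank vectors.
    Lower values mean more stable (more mutually consistent) lists."""
    vecs = [to_rank_vector(lst, universe) for lst in lists]
    pairs = list(combinations(vecs, 2))
    return sum(canberra(x, y) for x, y in pairs) / len(pairs)
```

Identical lists score 0, and the score grows as the lists disagree, regardless of whether they have equal lengths.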
A family of measures for best top-n class-selective decision rules
When classes strongly overlap in the feature space, or when some classes are not known in advance, the performance of a classifier decreases heavily. To overcome this problem, the reject option has been introduced: it consists in withholding the decision and letting another classifier, or an expert, decide whenever an exclusive classification is not reliable enough. The classification problem then becomes a matter of class selection, from none to all classes. In this paper, we propose a family of measures suitable for defining such decision rules. It is based on a new family of operators that detect blocks of similar values within a set of numbers in the unit interval (the soft labels of an incoming pattern to be classified) using a single threshold. Experiments on synthetic and real datasets available in the public domain show the efficiency of our approach.
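The block-detection idea can be sketched with a minimal rule: sort the soft labels in decreasing order and keep classes down to the first gap larger than a single threshold. This is an illustrative assumption about how such an operator might work, not the paper's actual family of operators, and it omits the full-reject case (selecting no class would need an extra rule).

```python
def select_classes(soft_labels, t):
    """Select the top block of classes: sort soft labels in decreasing
    order and keep classes until the first consecutive gap exceeds t.
    Returns a list of class indices, from one class up to all of them."""
    order = sorted(range(len(soft_labels)), key=lambda i: -soft_labels[i])
    selected = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if soft_labels[prev] - soft_labels[cur] > t:
            break  # the block of similar values ends here
        selected.append(cur)
    return selected
```

With a small threshold the rule selects only the clear winner; with a large one it selects every class whose soft label sits in the top block, which is exactly the "from one to all classes" behaviour described above.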
The Metabolomic Profile in Amyotrophic Lateral Sclerosis Changes According to the Progression of the Disease: An Exploratory Study
Amyotrophic lateral sclerosis (ALS) is a multifactorial neurodegenerative pathology of the upper or lower motor neuron. Evaluation of ALS progression is based on clinical outcomes considering the impairment of body sites. The pathogenetic mechanisms and clinical profile of ALS have been extensively investigated; however, no molecular biomarkers are used as diagnostic criteria to establish the ALS pathological staging. Using the source-reconstructed magnetoencephalography (MEG) approach, we demonstrated that global brain hyperconnectivity is associated with early and advanced clinical ALS stages. Using nuclear magnetic resonance (1H-NMR) and high-resolution mass spectrometry (HRMS) spectroscopy, here we studied the metabolomic profile of sera from ALS patients at different stages of disease progression, namely early and advanced. Multivariate statistical analysis of the data, integrated with network analysis, indicates that metabolites related to energy deficit, abnormal concentrations of neurotoxic metabolites, and metabolites related to neurotransmitter production are pathognomonic of ALS in the advanced stage. Furthermore, analysis of the lipidomic profile indicates that advanced ALS patients show significant alterations of phosphocholine (PCs), lysophosphatidylcholine (LPCs), and sphingomyelin (SMs) metabolism, consistent with the exigency of lipid remodeling to repair advanced neuronal degeneration and inflammation.
Information-Theoretic Measures for Objective Evaluation of Classifications
This work presents a systematic study of objective evaluations of abstaining
classifications using Information-Theoretic Measures (ITMs). First, we define
objective measures that do not depend on any free parameters. This definition
makes it straightforward to examine the "objectivity" or "subjectivity" of
classification evaluations directly. Second, we propose twenty-four
normalized ITMs, derived from mutual information, divergence, or
cross-entropy, for investigation. Contrary to conventional
performance measures that apply empirical formulas based on users' intuitions
or preferences, the ITMs are theoretically more sound for realizing objective
evaluations of classifications. We apply them to distinguish "error types" and
"reject types" in binary classifications without the need for input data of
cost terms. Third, to better understand and select the ITMs, we suggest three
desirable features for classification assessment measures, which appear more
crucial and appealing from the viewpoint of classification applications. Using
these features as "meta-measures", we can reveal the advantages and limitations
of ITMs from a higher level of evaluation knowledge. Numerical examples are
given to corroborate our claims and compare the differences among the proposed
measures. The best measure is selected in terms of the meta-measures, and its
specific properties regarding error types and reject types are analytically
derived.
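One instance of such a measure can be sketched from a confusion matrix whose columns include abstention as an extra prediction outcome. The normalization chosen here, mutual information divided by the entropy of the true labels, is only one of many possibilities (the paper proposes twenty-four variants), and the function name is invented for the example.

```python
from math import log

def normalized_mi(confusion):
    """Normalized mutual information I(T;Y) / H(T) from a confusion matrix.
    Rows are true classes, columns are predicted labels; a 'reject' outcome
    is represented simply as an extra column, so no cost terms are needed."""
    n = sum(sum(row) for row in confusion)
    p_true = [sum(row) / n for row in confusion]
    p_pred = [sum(confusion[i][j] for i in range(len(confusion))) / n
              for j in range(len(confusion[0]))]
    mi = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if count:
                p = count / n
                mi += p * log(p / (p_true[i] * p_pred[j]))
    h_true = -sum(p * log(p) for p in p_true if p)
    return mi / h_true
```

A perfect classifier scores 1, a prediction independent of the truth scores 0, and different ways of confusing or rejecting classes move the score differently, which is what lets such measures separate "error types" from "reject types".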
A hybrid computational intelligence approach to groundwater spring potential mapping
This study proposes a hybrid computational intelligence model that combines an alternating decision tree (ADTree) classifier with the AdaBoost (AB) ensemble, namely "AB-ADTree", for groundwater spring potential mapping (GSPM) at the Chilgazi watershed in the Kurdistan province, Iran. Although ADTree and its ensembles have been widely used for environmental and ecological modeling, they have rarely been applied to GSPM. To that end, a groundwater spring inventory map and thirteen conditioning factors tested by the chi-square attribute evaluation (CSAE) technique were used to generate training and testing datasets for constructing and validating the proposed model. The performance of the proposed model was evaluated using statistical-index-based measures, such as positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, root mean square error (RMSE), and the area under the receiver operating characteristic (ROC) curve (AUROC). The proposed hybrid model was also compared with six state-of-the-art benchmark soft computing models: single ADTree, support vector machine (SVM), stochastic gradient descent (SGD), logistic model tree (LMT), logistic regression (LR), and random forest (RF). Results indicate that the proposed hybrid model significantly improved the predictive capability of the ADTree-based classifier (AUROC = 0.789). In addition, it was found that the hybrid model, AB-ADTree (AUROC = 0.815), had the highest goodness-of-fit and prediction accuracy, followed by the LMT (AUROC = 0.803), RF (AUROC = 0.803), SGD, and SVM (AUROC = 0.790) models. Indeed, this model is a powerful and robust technique for mapping groundwater spring potential in the study area. Therefore, the proposed model is a promising tool to help planners, decision makers, managers, and governments in the management and planning of groundwater resources.
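The statistical-index measures listed above are simple functions of the binary confusion counts. The sketch below computes them; the function name and argument order are my own, not from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard statistical-index measures from binary confusion counts:
    tp/fp/tn/fn = true/false positives and true/false negatives."""
    return {
        "PPV": tp / (tp + fp),            # positive predictive value
        "NPV": tn / (tn + fn),            # negative predictive value
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```

For example, 8 true positives, 2 false positives, 9 true negatives and 1 false negative give a PPV of 0.8, an NPV of 0.9 and an accuracy of 0.85.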
Landslide susceptibility mapping using remote sensing data and geographic information system-based algorithms
Whether they occur due to natural triggers or human activities, landslides lead to loss of life and damage to property, impacting infrastructure, road networks and buildings. A Landslide Susceptibility Map (LSM) provides policy and decision makers with valuable information. This study aims to detect landslide locations using Sentinel-1 data, the only freely available online radar imagery, and to map areas prone to landslides using a novel AB-ADTree algorithm in Cameron Highlands, Pahang, Malaysia. A total of 152 landslide locations were detected by integrating the Interferometric Synthetic Aperture Radar (InSAR) technique, Google Earth (GE) images and extensive field surveys. Of these, 80% of the data were employed for training the machine learning algorithms and the remaining 20% for validation purposes. Seventeen triggering and conditioning factors, namely slope, aspect, elevation, distance to road, distance to river, proximity to fault, road density, river density, Normalized Difference Vegetation Index (NDVI), rainfall, land cover, lithology, soil type, curvature, profile curvature, Stream Power Index (SPI) and Topographic Wetness Index (TWI), were extracted from satellite imagery, a digital elevation model (DEM), and geological and soil maps. These factors were used to generate landslide susceptibility maps with a Logistic Regression (LR) model, Logistic Model Tree (LMT), Random Forest (RF), Alternating Decision Tree (ADTree), Adaptive Boosting (AdaBoost), and a novel hybrid of the ADTree and AdaBoost models, namely the AB-ADTree model. The validation was based on the area under the ROC curve (AUC) and the statistical measures of Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, specificity, accuracy and Root Mean Square Error (RMSE). The results showed that the AUC was 90%, 92%, 88%, 59%, 96% and 94% for the LR, LMT, RF, ADTree, AdaBoost and AB-ADTree algorithms, respectively.
Non-parametric Friedman and Wilcoxon tests were also applied to assess the models' performance; the findings revealed that ADTree is inferior to the other models used in this study. Using a handheld Global Positioning System (GPS), field study and validation were performed for almost 20% (30 locations) of the detected landslide locations, and the results confirmed that these locations were correctly detected. In conclusion, this study is applicable for hazard mitigation purposes and regional planning.
Kernel-Based Ranking: Methods for Learning and Performance Estimation
Machine learning provides tools for the automated construction of predictive
models in data-intensive areas of engineering and science. The family of
regularized kernel methods has in recent years become one of the mainstream
approaches to machine learning, due to a number of advantages the methods
share. The approach provides theoretically well-founded solutions to the
problems of under- and overfitting, allows learning from structured data,
and has been empirically demonstrated to yield high predictive performance
on a wide range of application domains. Historically, the problems of
classification and regression have gained the majority of attention in the
field. In this thesis we focus on another type of learning problem: learning
to rank.
In learning to rank, the aim is to learn, from a set of past observations,
a ranking function that can order new objects according to how well they
match some underlying criterion of goodness. As an important special case
of the setting, we can recover the bipartite ranking problem, corresponding
to maximizing the area under the ROC curve (AUC) in binary classification.
Ranking applications appear in a large variety of settings; examples
encountered in this thesis include document retrieval in web search,
recommender systems, information extraction, and automated parsing of
natural language. We consider the pairwise approach to learning to rank, where
ranking models are learned by minimizing the expected probability of ranking
any two randomly drawn test examples incorrectly. The development
of computationally efficient kernel methods, based on this approach, has in
the past proven to be challenging. Moreover, it is not clear what techniques
for estimating the predictive performance of learned models are the most
reliable in the ranking setting, and how the techniques can be implemented
efficiently.
The contributions of this thesis are as follows. First, we develop RankRLS,
a computationally efficient kernel method for learning to rank that is based
on minimizing a regularized pairwise least-squares loss. In
addition to training methods, we introduce a variety of algorithms for tasks
such as model selection, multi-output learning, and cross-validation, based
on computational shortcuts from matrix algebra. Second, we improve the fastest known training method for the linear version of the RankSVM algorithm,
which is one of the most well established methods for learning to
rank. Third, we study the combination of the empirical kernel map and reduced
set approximation, which allows the large-scale training of kernel machines
using linear solvers, and propose computationally efficient solutions
to cross-validation when using the approach. Next, we explore the problem
of reliable cross-validation when using AUC as a performance criterion,
through an extensive simulation study. We demonstrate that the proposed
leave-pair-out cross-validation approach leads to more reliable performance
estimation than commonly used alternative approaches. Finally, we present
a case study on applying machine learning to information extraction from
biomedical literature, which combines several of the approaches considered
in the thesis. The thesis is divided into two parts. Part I provides the
background for the research work and summarizes the most central results;
Part II consists of the five original research articles that are the main
contribution of this thesis.