73,613 research outputs found
Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction
In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from the parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of the classification method are examined both separately and jointly.
We conclude that the strategy to present only the optimal result is not acceptable, and suggest alternative approaches for properly reporting classification accuracy
Real or not? Identifying untrustworthy news websites using third-party partnerships
Untrustworthy content such as fake news and clickbait have become a pervasive problem on the Internet, causing significant socio-political problems around the world. Identifying untrustworthy content is a crucial step in countering them. The current best-practices for identification involve content analysis and arduous fact-checking of the content. To complement content analysis, we propose examining websites? third-parties to identify their trustworthiness. Websites utilize third-parties, also known as their digital supply chains, to create and present content and help the website function. Third-parties are an important indication of a website?s business model. Similar websites exhibit similarities in the third-parties they use. Using this perspective, we use machine learning and heuristic methods to discern similarities and dissimilarities in third-party usage, which we use to predict trustworthiness of websites. We demonstrate the effectiveness and robustness of our approach in predicting trustworthiness of websites from a database of News, Fake News, and Clickbait websites. Our approach can be easily and cost-effectively implemented to reinforce current identification methods
Predicting and Evaluating Software Model Growth in the Automotive Industry
The size of a software artifact influences the software quality and impacts
the development process. In industry, when software size exceeds certain
thresholds, memory errors accumulate and development tools might not be able to
cope anymore, resulting in a lengthy program start up times, failing builds, or
memory problems at unpredictable times. Thus, foreseeing critical growth in
software modules meets a high demand in industrial practice. Predicting the
time when the size grows to the level where maintenance is needed prevents
unexpected efforts and helps to spot problematic artifacts before they become
critical.
Although the amount of prediction approaches in literature is vast, it is
unclear how well they fit with prerequisites and expectations from practice. In
this paper, we perform an industrial case study at an automotive manufacturer
to explore applicability and usability of prediction approaches in practice. In
a first step, we collect the most relevant prediction approaches from
literature, including both, approaches using statistics and machine learning.
Furthermore, we elicit expectations towards predictions from practitioners
using a survey and stakeholder workshops. At the same time, we measure software
size of 48 software artifacts by mining four years of revision history,
resulting in 4,547 data points. In the last step, we assess the applicability
of state-of-the-art prediction approaches using the collected data by
systematically analyzing how well they fulfill the practitioners' expectations.
Our main contribution is a comparison of commonly used prediction approaches
in a real world industrial setting while considering stakeholder expectations.
We show that the approaches provide significantly different results regarding
prediction accuracy and that the statistical approaches fit our data best
Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
Undetected overfitting can occur when there are significant redundancies
between training and validation data. We describe AVE, a new measure of
training-validation redundancy for ligand-based classification problems that
accounts for the similarity amongst inactive molecules as well as active. We
investigated seven widely-used benchmarks for virtual screening and
classification, and show that the amount of AVE bias strongly correlates with
the performance of ligand-based predictive methods irrespective of the
predicted property, chemical fingerprint, similarity measure, or
previously-applied unbiasing techniques. Therefore, it may be that the
previously-reported performance of most ligand-based methods can be explained
by overfitting to benchmarks rather than good prospective accuracy
Correcting the optimally selected resampling-based error rate: A smooth analytical alternative to nested cross-validation
High-dimensional binary classification tasks, e.g. the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter adjusting the complexity of the applied method to the examined data set. By reporting the performance of the best tuning parameter value only, over-optimistic prediction errors are published. The contribution of this paper is two-fold. Firstly, we develop a new method for tuning bias
correction which can be motivated by decision theoretic considerations. The method is based on the decomposition of the unconditional error rate involving the tuning procedure. Our corrected error estimator can be written as
a weighted mean of the errors obtained using the different tuning parameter values. It can be interpreted as a smooth version of nested cross-validation (NCV) which is the standard approach for avoiding tuning bias. In contrast
to NCV, the weighting scheme of our method guarantees intuitive bounds for the corrected error. Secondly, we suggest to use bias correction methods also to address the bias resulting from the optimal choice of the classification method among several competitors. This method selection bias is particularly relevant to prediction problems in high-dimensional data. In the
absence of standards, it is common practice to try several methods successively, which can lead to an optimistic bias similar to the tuning bias. We demonstrate the performance of our method to address both types of bias based on microarray data sets and compare it to existing methods. This study confirms that our approach yields estimates competitive to NCV at a much lower computational price
Active learning in annotating micro-blogs dealing with e-reputation
Elections unleash strong political views on Twitter, but what do people
really think about politics? Opinion and trend mining on micro blogs dealing
with politics has recently attracted researchers in several fields including
Information Retrieval and Machine Learning (ML). Since the performance of ML
and Natural Language Processing (NLP) approaches are limited by the amount and
quality of data available, one promising alternative for some tasks is the
automatic propagation of expert annotations. This paper intends to develop a
so-called active learning process for automatically annotating French language
tweets that deal with the image (i.e., representation, web reputation) of
politicians. Our main focus is on the methodology followed to build an original
annotated dataset expressing opinion from two French politicians over time. We
therefore review state of the art NLP-based ML algorithms to automatically
annotate tweets using a manual initiation step as bootstrap. This paper focuses
on key issues about active learning while building a large annotated data set
from noise. This will be introduced by human annotators, abundance of data and
the label distribution across data and entities. In turn, we show that Twitter
characteristics such as the author's name or hashtags can be considered as the
bearing point to not only improve automatic systems for Opinion Mining (OM) and
Topic Classification but also to reduce noise in human annotations. However, a
later thorough analysis shows that reducing noise might induce the loss of
crucial information.Comment: Journal of Interdisciplinary Methodologies and Issues in Science -
Vol 3 - Contextualisation digitale - 201
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge is highly depending on data quality. Unfortunately real data use to contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex is the reality to be analyzed, the higher the risk of getting low quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results, depend not only on the quality of the results themselves, but on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of the pre and post processing in the whole process of Knowledge Discovery in environmental systems is discussed
Multimodal Classification of Urban Micro-Events
In this paper we seek methods to effectively detect urban micro-events. Urban
micro-events are events which occur in cities, have limited geographical
coverage and typically affect only a small group of citizens. Because of their
scale these are difficult to identify in most data sources. However, by using
citizen sensing to gather data, detecting them becomes feasible. The data
gathered by citizen sensing is often multimodal and, as a consequence, the
information required to detect urban micro-events is distributed over multiple
modalities. This makes it essential to have a classifier capable of combining
them. In this paper we explore several methods of creating such a classifier,
including early, late, hybrid fusion and representation learning using
multimodal graphs. We evaluate performance on a real world dataset obtained
from a live citizen reporting system. We show that a multimodal approach yields
higher performance than unimodal alternatives. Furthermore, we demonstrate that
our hybrid combination of early and late fusion with multimodal embeddings
performs best in classification of urban micro-events
- …