Detecting Outliers in Data with Correlated Measures
Advances in sensor technology have enabled the collection of large-scale
datasets. Such datasets can be extremely noisy and often contain a significant
amount of outliers that result from sensor malfunction or human operation
faults. In order to utilize such data for real-world applications, it is
critical to detect outliers so that models built from these datasets will not
be skewed by outliers.
In this paper, we propose a new outlier detection method that utilizes the
correlations in the data (e.g., taxi trip distance vs. trip time). Different
from existing outlier detection methods, we build a robust regression model
that explicitly models the outliers and detects outliers simultaneously with
the model fitting.
We validate our approach on real-world datasets against methods specifically
designed for each dataset as well as the state of the art outlier detectors.
Our outlier detection method achieves better performances, demonstrating the
robustness and generality of our method. Last, we report interesting case
studies on some outliers that result from atypical events.
Comment: 10 pages
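The paper's model fits the regression and flags outliers jointly; as a minimal stand-in for the general idea, the sketch below fits an ordinary least-squares line to a pair of correlated measures (trip distance vs. trip time) and flags points whose residual exceeds a few median absolute deviations. This is a generic robust-residual heuristic, not the authors' method, and all data values are made up.

```python
# Sketch: flag records that violate the distance/time correlation.
# Generic MAD-on-residuals heuristic, NOT the paper's joint model.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def flag_outliers(xs, ys, k=3.0):
    """Mark points whose residual is more than k MADs from the median residual."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    med = sorted(residuals)[len(residuals) // 2]
    mad = sorted(abs(r - med) for r in residuals)[len(residuals) // 2]
    return [abs(r - med) > k * mad for r in residuals]

# Mostly linear trips plus one implausible record (short distance, huge time).
dist = [1, 2, 3, 4, 5, 6, 7, 1]
time = [10, 20, 30, 40, 50, 60, 70, 500]
print(flag_outliers(dist, time))  # only the last record is flagged
```

A joint model, as in the paper, avoids the weakness of this two-step sketch: a single gross outlier skews the initial fit and can mask smaller ones.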
iFair: Learning Individually Fair Data Representations for Algorithmic Decision Making
People are rated and ranked for algorithmic decision making in an
increasing number of applications, typically based on machine learning.
Research on how to incorporate fairness into such tasks has prevalently pursued
the paradigm of group fairness: giving adequate success rates to specifically
protected groups. In contrast, the alternative paradigm of individual fairness
has received relatively little attention, and this paper advances this less
explored direction. The paper introduces a method for probabilistically mapping
user records into a low-rank representation that reconciles individual fairness
and the utility of classifiers and rankings in downstream applications. Our
notion of individual fairness requires that users who are similar in all
task-relevant attributes (such as job qualifications), disregarding all
potentially discriminating attributes (such as gender), should have similar
outcomes. We demonstrate the versatility of our method by applying it to
classification and learning-to-rank tasks on a variety of real-world datasets.
Our experiments show substantial improvements over the best prior work for this
setting.
Comment: Accepted at ICDE 2019. Please cite the ICDE 2019 proceedings version.
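The fairness notion above ("similar task-relevant attributes, similar outcomes") can be checked directly, independently of how the representation is learned. The sketch below is a simple consistency score, not the iFair algorithm: each user is compared with their nearest neighbour on task-relevant attributes only, and the score is the fraction of such pairs that receive the same outcome. All field names and records are hypothetical.

```python
# Sketch of an individual-fairness consistency check (not the iFair method):
# similar users (ignoring protected attributes) should get similar outcomes.

def consistency(records, outcomes, task_keys):
    """records: list of dicts; outcomes: list of 0/1 decisions.
    Returns the fraction of users whose nearest task-relevant
    neighbour received the same outcome (1.0 = perfectly consistent)."""
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in task_keys)

    agree = 0
    for i, r in enumerate(records):
        j = min((j for j in range(len(records)) if j != i),
                key=lambda j: dist(r, records[j]))
        agree += outcomes[i] == outcomes[j]
    return agree / len(records)

users = [
    {"qualification": 9, "experience": 5, "gender": 0},
    {"qualification": 9, "experience": 5, "gender": 1},  # identical merits
    {"qualification": 2, "experience": 1, "gender": 0},
    {"qualification": 2, "experience": 1, "gender": 1},
]
fair_outcomes = [1, 1, 0, 0]
unfair_outcomes = [1, 0, 0, 0]  # second user rejected despite equal merits
print(consistency(users, fair_outcomes, ["qualification", "experience"]))    # 1.0
print(consistency(users, unfair_outcomes, ["qualification", "experience"]))  # 0.5
```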
Mining the Web for Lexical Knowledge to Improve Keyphrase Extraction: Learning from Labeled and Unlabeled Data
A journal article is often accompanied by a list of keyphrases, composed of about five to fifteen important words and phrases that capture the article's main topics. Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. Good performance on this task has been obtained by approaching it as a supervised learning problem. An input document is treated as a set of candidate phrases that must be classified as either keyphrases or non-keyphrases. To classify a candidate phrase as a keyphrase, the most important features (attributes) appear to be the frequency and location of the candidate phrase in the document. Recent work has demonstrated that it is also useful to know the frequency of the candidate phrase as a manually assigned keyphrase for other documents in the same domain as the given document (e.g., the domain of computer science). Unfortunately, this keyphrase-frequency feature is domain-specific (the learning process must be repeated for each new domain) and training-intensive (good performance requires a relatively large number of training documents in the given domain, with manually assigned keyphrases). The aim of the work described here is to remove these limitations. In this paper, I introduce new features that are conceptually related to keyphrase-frequency and I present experiments that show that the new features result in improved keyphrase extraction, although they are neither domain-specific nor training-intensive. The new features are generated by issuing queries to a Web search engine, based on the candidate phrases in the input document. 
The feature values are calculated from the number of hits for the queries (the number of matching Web pages). In essence, these new features are derived by mining lexical knowledge from a very large collection of unlabeled data, consisting of approximately 350 million Web pages without manually assigned keyphrases.
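One classic way to turn hit counts into a feature of this kind is pointwise mutual information: how much more often a phrase's words co-occur on the web than chance would predict. The sketch below illustrates that computation; the `HITS` table stands in for real search-engine queries, and all counts are invented for the example.

```python
# Sketch: a hit-count feature in the spirit of the abstract above.
# PMI of a two-word phrase from (hypothetical) web page counts.

import math

TOTAL_PAGES = 350_000_000  # approximate corpus size mentioned in the abstract

# Hypothetical hit counts for single words and the quoted phrase.
HITS = {
    "machine": 4_000_000,
    "learning": 6_000_000,
    '"machine learning"': 900_000,
}

def pmi(word1, word2):
    """PMI = log P(w1 w2) / (P(w1) * P(w2)), probabilities from page counts."""
    p1 = HITS[word1] / TOTAL_PAGES
    p2 = HITS[word2] / TOTAL_PAGES
    p12 = HITS[f'"{word1} {word2}"'] / TOTAL_PAGES
    return math.log(p12 / (p1 * p2))

# Positive PMI: the words co-occur far more often than chance.
print(pmi("machine", "learning"))
```

Because such features depend only on general web statistics, not on any keyphrase-labeled training corpus, they are neither domain-specific nor training-intensive, which is the point the abstract makes.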
Learning the Joint Representation of Heterogeneous Temporal Events for Clinical Endpoint Prediction
The availability of a large amount of electronic health records (EHR)
provides huge opportunities to improve health care service by mining these
data. One important application is clinical endpoint prediction, which aims to
predict whether a disease, a symptom or an abnormal lab test will happen in the
future according to patients' history records. This paper develops deep
learning techniques for clinical endpoint prediction, which have proven
effective in many practical applications. The problem is challenging,
however, since patients' history records contain multiple heterogeneous
temporal events such as lab tests, diagnoses, and drug administrations. The visiting patterns of
different types of events vary significantly, and there exist complex nonlinear
relationships between different events. In this paper, we propose a novel model
for learning the joint representation of heterogeneous temporal events. The
model adds a new gate to control the visiting rates of different events which
effectively models the irregular patterns of different events and their
nonlinear correlations. Experiment results with real-world clinical data on the
tasks of predicting death and abnormal lab tests demonstrate the effectiveness
of our proposed approach over competitive baselines.
Comment: 8 pages, this paper has been accepted by AAAI 201
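The abstract's key mechanism is a gate that adapts to each event type's visiting rate. As a rough, hypothetical illustration of that idea (not the paper's architecture), the sketch below decays each event type's contribution by the time elapsed since that type was last observed, so stale, rarely-visited event types are down-weighted. All shapes and rates are invented.

```python
# Sketch: a per-event-type "rate gate" (hypothetical, not the paper's model).
# Each event type's value decays with the time since it was last observed.

import math

def gated_state(events, decay_rates, t_now):
    """events: {event_type: (last_time, value)}; returns decayed values."""
    state = {}
    for etype, (t_last, value) in events.items():
        gate = math.exp(-decay_rates[etype] * (t_now - t_last))  # in (0, 1]
        state[etype] = gate * value
    return state

# A recent lab test dominates a diagnosis recorded long ago.
events = {"lab_test": (9.0, 1.0), "diagnosis": (2.0, 1.0)}
rates = {"lab_test": 0.5, "diagnosis": 0.5}  # per-type decay rates
print(gated_state(events, rates, t_now=10.0))
```

In the paper this gating is learned inside a recurrent model rather than fixed as an exponential, but the effect sketched here is the same: irregular visiting patterns modulate how much each event type contributes to the joint representation.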
VizRank: Data Visualization Guided by Machine Learning
Data visualization plays a crucial role in identifying interesting patterns in exploratory data analysis. Its use is, however, made difficult by the large number of possible data projections showing different attribute subsets that must be evaluated by the data analyst. In this paper, we introduce a method called VizRank, which is applied on classified data to automatically select the most useful data projections. VizRank can be used with any visualization method that maps attribute values to points in a two-dimensional visualization space. It assesses possible data projections and ranks them by their ability to visually discriminate between classes. The quality of class separation is estimated by computing the predictive accuracy of a k-nearest neighbor classifier on the data set consisting of the x and y positions of the projected data points and their class information. The paper introduces the method and presents experimental results which show that VizRank's ranking of projections highly agrees with subjective rankings by data analysts. The practical use of VizRank is also demonstrated by an application in the field of functional genomics.
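The core loop described above is compact enough to sketch: enumerate candidate 2-D projections (here, simple attribute pairs), score each by leave-one-out 1-nearest-neighbour accuracy on the projected points, and rank by that score. This is a pure-Python stand-in for the idea, not the original VizRank implementation, and the toy data is invented.

```python
# Sketch of the VizRank idea: rank 2-D projections by how well a
# nearest-neighbour classifier separates the classes in each projection.

from itertools import combinations

def knn_accuracy(points, labels):
    """Leave-one-out 1-NN accuracy on 2-D points."""
    correct = 0
    for i, p in enumerate(points):
        j = min((j for j in range(len(points)) if j != i),
                key=lambda j: (points[j][0] - p[0]) ** 2
                            + (points[j][1] - p[1]) ** 2)
        correct += labels[j] == labels[i]
    return correct / len(points)

def rank_projections(data, labels):
    """data: list of attribute-value rows; returns (score, attr_pair) best-first."""
    n_attrs = len(data[0])
    scored = []
    for a, b in combinations(range(n_attrs), 2):
        points = [(row[a], row[b]) for row in data]
        scored.append((knn_accuracy(points, labels), (a, b)))
    return sorted(scored, reverse=True)

# Attribute 0 separates the classes; attributes 1 and 2 are noise.
data = [(0, 5, 1), (1, 3, 9), (2, 8, 2), (10, 4, 8), (11, 9, 3), (12, 2, 7)]
labels = ["a", "a", "a", "b", "b", "b"]
print(rank_projections(data, labels)[0])  # best projection includes attribute 0
```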
On the role of pre- and post-processing in environmental data mining
The quality of discovered knowledge is highly dependent on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.
Certifying and removing disparate impact
What does it mean for an algorithm to be biased? In U.S. law, unintentional
bias is encoded via disparate impact, which occurs when a selection process has
widely different outcomes for different groups, even as it appears to be
neutral. This legal determination hinges on a definition of a protected class
(ethnicity, gender, religious practice) and an explicit description of the
process.
When the process is implemented using computers, determining disparate impact
(and hence bias) is harder. It might not be possible to disclose the process.
In addition, even if the process is open, it might be hard to elucidate in a
legal setting how the algorithm makes its decisions. Instead of requiring
access to the algorithm, we propose making inferences based on the data the
algorithm uses.
We make four contributions to this problem. First, we link the legal notion
of disparate impact to a measure of classification accuracy that, while known,
has received relatively little attention. Second, we propose a test for
disparate impact based on analyzing the information leakage of the protected
class from the other data attributes. Third, we describe methods by which data
might be made unbiased. Finally, we present empirical evidence supporting the
effectiveness of our test for disparate impact and our approach for both
masking bias and preserving relevant information in the data. Interestingly,
our approach resembles some actual selection practices that have recently
received legal scrutiny.
Comment: Extended version of paper accepted at the 2015 ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
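The legal notion the abstract starts from is commonly operationalized as the "80% rule": a selection process is flagged when the protected group's selection rate falls below 80% of the unprotected group's. The sketch below computes that ratio; it illustrates the legal heuristic only, not the paper's accuracy-based test or repair method, and the counts are hypothetical.

```python
# Sketch: the "80% rule" ratio used to flag disparate impact.
# Values below 0.8 indicate potential disparate impact.

def disparate_impact(selected_protected, total_protected,
                     selected_unprotected, total_unprotected):
    """Ratio of the protected group's selection rate to the unprotected group's."""
    rate_p = selected_protected / total_protected
    rate_u = selected_unprotected / total_unprotected
    return rate_p / rate_u

ratio = disparate_impact(30, 100, 60, 100)  # 30% vs 60% selection rates
print(ratio, ratio < 0.8)  # 0.5 True -> flagged
```

The paper's test goes further: rather than observing outcomes directly, it asks how well the protected class can be predicted from the other data attributes, which connects this ratio to classification accuracy.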