Framework for Identification and Prevention of Direct and Indirect Discrimination using Data mining
Extraction of useful and important information from a huge collection of data is known as data mining. There are also negative social perceptions about data mining, among which are potential privacy invasion and potential discrimination. Discrimination involves treating people unequally or unfairly on the basis of their belonging to a specific group. Automated data collection and data mining techniques like classification rule mining have made it easier to make automated decisions, such as loan granting/denial, insurance premium computation, etc. If the training data sets are biased with respect to discriminatory (sensitive) attributes like age, gender, race, religion, etc., discriminatory decisions may ensue. For this reason, antidiscrimination techniques, including discrimination discovery, identification, and prevention, have been introduced in data mining. Discrimination may be of two types, either direct or indirect. Direct discrimination occurs when decisions are taken on the basis of sensitive attributes. Indirect discrimination occurs when decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones. In this paper, we deal with discrimination prevention in data mining and propose new methods applicable to direct or indirect discrimination prevention, individually or both at the same time. We discuss how to clean training data sets and transformed data sets in such a way that direct and/or indirect discriminatory decision rules are converted to legitimate (non-discriminatory) classification rules. We also propose new measures and metrics to analyse the utility of the proposed approaches, and we compare these approaches.
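The rule-based discrimination discovery this abstract builds on is often illustrated with the extended lift (elift) measure: a classification rule {sensitive attribute, context} → decision is flagged as discriminatory when its confidence exceeds α times the confidence of the context-only rule. A minimal sketch (the records, attribute names, and α threshold here are illustrative, not from the paper):

```python
# Hedged sketch of direct-discrimination discovery on classification
# rules via the extended lift (elift) measure. A rule
# {sensitive, context} -> decision is alpha-discriminatory when
# elift >= alpha.

def confidence(records, antecedent, consequent):
    """Fraction of records matching `antecedent` that also match `consequent`."""
    matching = [r for r in records if antecedent.issubset(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if consequent.issubset(r)) / len(matching)

def elift(records, sensitive, context, decision):
    """Extended lift of the rule (sensitive AND context) -> decision."""
    base = confidence(records, context, decision)
    if base == 0:
        return float("inf")
    return confidence(records, sensitive | context, decision) / base

# Toy data: each record is a set of attribute=value items.
records = [
    {"gender=female", "city=NYC", "loan=deny"},
    {"gender=female", "city=NYC", "loan=deny"},
    {"gender=female", "city=NYC", "loan=grant"},
    {"gender=male", "city=NYC", "loan=deny"},
    {"gender=male", "city=NYC", "loan=grant"},
    {"gender=male", "city=NYC", "loan=grant"},
]

e = elift(records, {"gender=female"}, {"city=NYC"}, {"loan=deny"})
# conf(female,NYC -> deny) = 2/3; conf(NYC -> deny) = 1/2; elift = 4/3
alpha = 1.2
print(f"elift = {e:.3f}, discriminatory at alpha={alpha}: {e >= alpha}")
```

The data-cleaning methods the paper proposes would then perturb records until flagged rules fall below the α threshold.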
Visual Detection of Structural Changes in Time-Varying Graphs Using Persistent Homology
Topological data analysis is an emerging area in exploratory data analysis
and data mining. Its main tool, persistent homology, has become a popular
technique to study the structure of complex, high-dimensional data. In this
paper, we propose a novel method using persistent homology to quantify
structural changes in time-varying graphs. Specifically, we transform each
instance of the time-varying graph into metric spaces, extract topological
features using persistent homology, and compare those features over time. We
provide a visualization that assists in time-varying graph exploration and
helps to identify patterns of behavior within the data. To validate our
approach, we conduct several case studies on real world data sets and show how
our method can find cyclic patterns, deviations from those patterns, and
one-time events in time-varying graphs. We also examine whether the
persistence-based similarity measure, as a graph metric, satisfies a set of
well-established, desirable properties for graph metrics.
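The paper's full pipeline presumably uses standard persistent-homology software; as a much-reduced illustration of the underlying idea, 0-dimensional persistence of a weighted graph snapshot can be computed with nothing more than a Kruskal-style union-find, recording the filtration value at which two connected components merge (the graph, vertex count, and weights below are my own toy example):

```python
# Minimal sketch of 0-dimensional persistent homology on one weighted
# graph snapshot: sort edges by weight and record the "death" of a
# connected component each time two components merge. All components
# are "born" at filtration value 0; one component never dies.

def zero_dim_persistence(n_vertices, weighted_edges):
    """Return sorted death times of 0-dim features of the edge filtration."""
    parent = list(range(n_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for w, u, v in sorted(weighted_edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            deaths.append(w)  # a component dies at filtration value w
    return deaths

# Toy snapshot of a time-varying graph: (weight, u, v)
edges = [(1.0, 0, 1), (2.0, 1, 2), (5.0, 2, 3), (0.5, 3, 4)]
print(zero_dim_persistence(5, edges))  # [0.5, 1.0, 2.0, 5.0]
```

Comparing such death-time multisets across snapshots (e.g. via a distance between persistence diagrams) is one way the topological features of a time-varying graph can be tracked over time.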
Quality and complexity measures for data linkage and deduplication
Summary. Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
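The chapter's warning that measures in the space of record-pair comparisons can be deceptive is easy to demonstrate: with n records there are n(n-1)/2 candidate pairs but only on the order of n true matches, so accuracy looks excellent even for a linker that finds nothing. A small sketch (the record and match counts are invented for illustration):

```python
# Hedged sketch: pair-space quality measures for record linkage, and
# why accuracy misleads. A useless linker that declares every pair a
# non-match still scores near-perfect accuracy because true non-matches
# dominate the comparison space; precision and recall expose it.

def pair_space_measures(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# 10,000 records -> 49,995,000 candidate pairs, of which 5,000 match.
n_pairs = 10_000 * 9_999 // 2
true_matches = 5_000

# A linker that classifies every pair as a non-match:
acc, prec, rec = pair_space_measures(0, 0, true_matches, n_pairs - true_matches)
print(f"accuracy={acc:.5f} precision={prec:.2f} recall={rec:.2f}")
# accuracy ~ 0.99990 despite finding no matches at all
```

This is why the chapter recommends match-oriented measures such as precision, recall, and f-measure over accuracy in the pair-comparison space.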
Performance Evaluation of EM and K-Means Clustering Algorithms in Data Mining System
In the emerging field of data mining systems there are different techniques, namely clustering, prediction, classification, association, etc. The clustering technique works by dividing a particular data set into related groups such that distinct groups have nothing in common. Clustering algorithms have emerged as an alternative, powerful meta-learning tool to accurately analyze the massive volumes of data generated by modern applications. The main goal is to partition data into clusters such that objects placed in the same cluster are related according to particular metrics. Classification, by contrast, is the organization of data sets into predefined sets using various mathematical models. This research discusses the comparison of the K-Means and Expectation-Maximization clustering algorithms. Empirically, we focused on wide-ranging experiments in which we compared the best typical algorithm from each category using a large number of real or big data sets. The effectiveness of the Expectation-Maximization clustering algorithm is measured through a number of internal and external validity metrics, stability, runtime, and scalability tests.
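Both algorithms compared in this abstract revolve around the same assign/update loop; EM replaces K-Means's hard assignment with soft posterior responsibilities under a mixture model. A deliberately tiny 1-D K-Means sketch of that loop (illustrative only; the data and initial centers are invented, and the paper evaluates full implementations on large data sets):

```python
# Minimal 1-D K-Means sketch showing the assignment/update iteration.

def kmeans_1d(points, centers, iters=50):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(sorted(kmeans_1d(data, [0.0, 5.0])))  # converges near [1.0, 9.0]
```

EM for a Gaussian mixture would replace the `min(...)` hard choice with per-cluster responsibilities proportional to each component's density at `p`, and the mean update with a responsibility-weighted average.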
A Survey on Discrimination Avoidance in Data Mining
ABSTRACT: Data mining is a very important technology for extracting useful knowledge hidden in large sets of data. There are, however, some negative perceptions about data mining, among them the potential for unfairly treating people who belong to a specific group. The classification rule mining technique has paved the way for making automated decisions such as loan granting/denial and insurance premium computation, built on automated data collection and data mining techniques. If training data sets are biased with respect to discriminatory attributes, discriminatory decisions may ensue. Thus, antidiscrimination techniques covering discrimination discovery and prevention have been introduced in data mining. Discrimination can be direct or indirect. Discrimination is direct when decisions are made based on sensitive attributes; it is indirect when decisions are made based on non-sensitive attributes that are strongly correlated with biased sensitive ones. The proposed system tries to tackle discrimination prevention in data mining. It proposes new, improved techniques applicable to direct or indirect discrimination prevention, individually or both at the same time. It discusses how to clean training data sets and outsourced data sets in such a way that direct and/or indirect discriminatory decision rules are converted to legitimate classification rules. New metrics to evaluate the utility of the proposed approaches are proposed, and a comparison of these approaches is also done.
DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication
Metrics for set similarity are a core aspect of several data mining tasks. To
remove duplicate results in a Web search, for example, a common approach looks
at the Jaccard index between all pairs of pages. In social network analysis, a
much-celebrated metric is the Adamic-Adar index, widely used to compare node
neighborhood sets in the important problem of predicting links. However, with
the increasing amount of data to be processed, calculating the exact similarity
between all pairs can be intractable. The challenge of working at this scale
has motivated research into efficient estimators for set similarity metrics.
The two most popular estimators, MinHash and SimHash, are indeed used in
applications such as document deduplication and recommender systems where large
volumes of data need to be processed. Given the importance of these tasks, the
demand for advancing estimators is evident. We propose DotHash, an unbiased
estimator for the intersection size of two sets. DotHash can be used to
estimate the Jaccard index and, to the best of our knowledge, is the first
method that can also estimate the Adamic-Adar index and a family of related
metrics. We formally define this family of metrics, provide theoretical bounds
on the probability of estimate errors, and analyze its empirical performance.
Our experimental results indicate that DotHash is more accurate than the other
estimators in link prediction and detecting duplicate documents with the same
complexity and similar comparison time.
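The core dot-product idea behind this style of estimator can be sketched compactly (the dimension, element encoding, and helper names below are my own choices, not the paper's implementation): map each element to a deterministic pseudorandom ±1/√d vector, sum the vectors of a set into a sketch, and use the dot product of two sketches as an estimate of the intersection size, from which the Jaccard index follows.

```python
# Hedged sketch of a dot-product set-intersection estimator. Vectors of
# distinct elements are nearly orthogonal in expectation, so the dot
# product of two set sketches concentrates around |A ∩ B|.
import random

def element_vector(element, d):
    # Deterministic pseudorandom +-1/sqrt(d) vector per element
    # (elements are assumed to be ints usable as RNG seeds).
    rng = random.Random(element)
    s = d ** 0.5
    return [rng.choice((-1.0, 1.0)) / s for _ in range(d)]

def sketch(items, d=4096):
    vec = [0.0] * d
    for x in items:
        for i, v in enumerate(element_vector(x, d)):
            vec[i] += v
    return vec

def est_intersection(sa, sb):
    return sum(a * b for a, b in zip(sa, sb))

A = set(range(0, 60))    # |A| = 60
B = set(range(40, 100))  # |B| = 60, true |A ∩ B| = 20
sa, sb = sketch(A), sketch(B)
inter = est_intersection(sa, sb)
jaccard = inter / (len(A) + len(B) - inter)
# expect an estimate near the true intersection size of 20
print(round(inter, 1), round(jaccard, 2))
```

Unlike a MinHash signature, which estimates the Jaccard index directly, an intersection-size estimate of this kind can be reweighted per element, which is what makes degree-weighted metrics such as Adamic-Adar reachable.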