Sampling Strategies for Mining in Data-Scarce Domains
Data mining has traditionally focused on the task of drawing inferences from
large datasets. However, many scientific and engineering domains, such as fluid
dynamics and aircraft design, are characterized by scarce data, due to the
expense and complexity of associated experiments and simulations. In such
data-scarce domains, it is advantageous to focus the data collection effort on
only those regions deemed most important to support a particular data mining
objective. This paper describes a mechanism that interleaves bottom-up data
mining, to uncover multi-level structures in spatial data, with top-down
sampling, to clarify difficult decisions in the mining process. The mechanism
exploits relevant physical properties, such as continuity, correspondence, and
locality, in a unified framework. This leads to effective mining and sampling
decisions that are explainable in terms of domain knowledge and data
characteristics. This approach is demonstrated in two diverse applications --
mining pockets in spatial data, and qualitative determination of Jordan forms
of matrices.
Self-configuration from a Machine-Learning Perspective
The goal of machine learning is to provide solutions that are trained by
data or by experience coming from the environment. Many training algorithms
exist, and some brilliant successes have been achieved. But even in structured
environments for machine learning (e.g. data mining or board games), most
applications beyond the level of toy problems need careful hand-tuning, human
ingenuity (i.e. detection of interesting patterns), or both. We discuss several
aspects of how self-configuration can help to alleviate these problems. One
aspect is self-configuration by tuning of algorithms, where recent advances
have been made in the area of SPO (Sequential Parameter Optimization). Another
aspect is self-configuration by pattern detection or feature construction.
Forming multiple features (e.g. random boolean functions) and using algorithms
(e.g. random forests) that easily digest many features can largely increase
learning speed. However, a full-fledged theory of feature construction is not
yet available and forms a current barrier in machine learning. We discuss
several ideas for systematic inclusion of feature construction. This may lead
to partly self-configuring machine learning solutions which show robustness,
flexibility, and fast learning in potentially changing environments.
Comment: 12 pages, 5 figures, Dagstuhl seminar 11181 "Organic Computing -
Design of Self-Organizing Systems", May 201
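The feature-construction idea above can be made concrete with a small sketch. This is an illustrative toy, not code from the paper: each constructed feature is the conjunction of a few randomly chosen, possibly negated, boolean inputs, of the kind a random forest could then digest. All names here are invented for illustration.

```python
import random

def make_boolean_features(n_inputs, n_features, k=3, seed=0):
    """Construct random boolean features: each feature is the
    conjunction (AND) of k randomly chosen, possibly negated inputs."""
    rng = random.Random(seed)
    specs = []
    for _ in range(n_features):
        idx = rng.sample(range(n_inputs), k)       # which inputs to use
        neg = [rng.random() < 0.5 for _ in range(k)]  # negate or not
        specs.append(list(zip(idx, neg)))

    def transform(x):
        # evaluate every constructed feature on the input vector x
        return [all((not x[i]) if n else x[i] for i, n in spec)
                for spec in specs]
    return transform

transform = make_boolean_features(n_inputs=8, n_features=20)
feats = transform([1, 0, 1, 1, 0, 0, 1, 0])
```

The transformed vectors would then be fed to an off-the-shelf ensemble learner; the point is that forming many cheap random features shifts work from hand-crafted representation design to the learner.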
Disentangling Aspect and Opinion Words in Target-based Sentiment Analysis using Lifelong Learning
Given a target name, which can be a product aspect or entity, identifying its
aspect words and opinion words in a given corpus is a fine-grained task in
target-based sentiment analysis (TSA). This task is challenging, especially
when we have no labeled data and we want to perform it for any given domain. To
address it, we propose a general two-stage approach. Stage one extracts/groups
the target-related words (called t-words) for a given target. This is relatively
easy as we can apply an existing semantics-based learning technique. Stage two
separates the aspect and opinion words from the grouped t-words, which is
challenging because we often do not have enough word-level aspect and opinion
labels. In this work, we formulate this problem in a PU learning setting and
incorporate the idea of lifelong learning to solve it. Experimental results
show the effectiveness of our approach.
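A core ingredient of any PU (positive-unlabeled) setting like the one above is extracting reliable negatives from the unlabeled pool before training a conventional classifier. The following is a minimal sketch of one common heuristic (distance to the positive centroid), not the paper's actual method; the function name and the centroid criterion are assumptions for illustration.

```python
import math

def pu_reliable_negatives(positives, unlabeled, frac=0.3):
    """Naive PU step: score unlabeled items by distance to the
    positive centroid and take the most distant fraction as
    reliable negatives (a simplified stand-in for spy techniques)."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives)
                for i in range(dim)]

    def dist(v):
        return math.sqrt(sum((v[i] - centroid[i]) ** 2 for i in range(dim)))

    ranked = sorted(unlabeled, key=dist, reverse=True)
    k = max(1, int(frac * len(ranked)))
    return ranked[:k]

pos = [[1.0, 1.0], [0.9, 1.1]]
unl = [[1.0, 0.9], [5.0, 5.0], [4.8, 5.2], [0.95, 1.05]]
negs = pu_reliable_negatives(pos, unl, frac=0.5)
```

With reliable negatives in hand, a standard binary classifier can be trained on positives versus extracted negatives, which is where lifelong-learning knowledge from earlier tasks could be injected.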
Concise Fuzzy System Modeling Integrating Soft Subspace Clustering and Sparse Learning
The superior interpretability and uncertainty modeling ability of
Takagi-Sugeno-Kang fuzzy system (TSK FS) make it possible to describe complex
nonlinear systems intuitively and efficiently. However, classical TSK FS
usually adopts the whole feature space of the data for model construction,
which can result in lengthy rules for high-dimensional data and lead to
degeneration in interpretability. Furthermore, for highly nonlinear modeling
tasks, it is usually necessary to use a large number of rules, which further
weakens the clarity and interpretability of TSK FS. To address these issues, a
concise zero-order TSK FS construction method, called ESSC-SL-CTSK-FS, is
proposed in this paper by integrating the techniques of enhanced soft subspace
clustering (ESSC) and sparse learning (SL). In this method, ESSC is used to
generate the antecedents and various sparse subspaces for different fuzzy rules,
whereas SL is used to optimize the consequent parameters of the fuzzy rules,
based on which the number of fuzzy rules can be effectively reduced. Finally,
the proposed ESSC-SL-CTSK-FS method is used to construct concise zero-order
TSK FS that can explain the scenes in high-dimensional data modeling more
clearly and easily. Experiments are conducted on various real-world datasets to
confirm the advantages.
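For readers unfamiliar with the model class, inference in a zero-order TSK fuzzy system can be sketched in a few lines: each rule has Gaussian antecedent membership functions and a constant consequent, and the output is the firing-strength-weighted average of the consequents. This is a generic textbook formulation, not the paper's ESSC-SL-CTSK-FS construction; all names and the toy rules are illustrative.

```python
import math

def tsk_zero_order(x, rules):
    """Evaluate a zero-order TSK fuzzy system: each rule is
    (centers, widths, b) with Gaussian antecedents per feature and a
    constant consequent b; output is the firing-weighted average."""
    num = den = 0.0
    for centers, widths, b in rules:
        # firing strength = product of per-feature Gaussian memberships
        w = 1.0
        for xi, c, s in zip(x, centers, widths):
            w *= math.exp(-((xi - c) ** 2) / (2 * s ** 2))
        num += w * b
        den += w
    return num / den

rules = [
    ([0.0, 0.0], [1.0, 1.0], -1.0),  # rule 1: output near -1 around the origin
    ([2.0, 2.0], [1.0, 1.0],  3.0),  # rule 2: output near 3 around (2, 2)
]
y = tsk_zero_order([0.0, 0.0], rules)
```

Sparse subspace clustering, in this framing, would zero out some feature memberships per rule so each rule's antecedent mentions only a few features, keeping rules short and interpretable.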
Fighting Accounting Fraud Through Forensic Data Analytics
Accounting fraud is a global concern and a significant threat to financial
system stability, as it diminishes market confidence and the trust of
regulatory authorities. Several tricks can be used to
commit accounting fraud, hence the need for non-static regulatory interventions
that take into account different fraudulent patterns. Accordingly, this study
aims to improve the detection of accounting fraud via the implementation of
several machine learning methods to better differentiate between fraud and
non-fraud companies, and to further assist the task of examination within the
riskier firms by evaluating relevant financial indicators. Out-of-sample
results suggest there is a great potential in detecting falsified financial
statements through statistical modelling and analysis of publicly available
accounting information. The proposed methodology can be of assistance to public
auditors and regulatory agencies as it facilitates auditing processes, and
supports more targeted and effective examinations of accounting reports.
Comment: Working Paper
A Multi-Objective Anytime Rule Mining System to Ease Iterative Feedback from Domain Experts
Data extracted from software repositories is used intensively in Software
Engineering research, for example, to predict defects in source code. In our
research in this area, with data from open source projects as well as an
industrial partner, we noticed several shortcomings of conventional data mining
approaches for classification problems: (1) Domain experts' acceptance is of
critical importance, and domain experts can provide valuable input, but it is
hard to use this feedback. (2) The evaluation of the model is not a simple
matter of calculating AUC or accuracy. Instead, there are multiple objectives
of varying importance, but their importance cannot be easily quantified.
Furthermore, the performance of the model cannot be evaluated on a per-instance
level in our case, because it shares aspects with the set cover problem. To
overcome these problems, we take a holistic approach and develop a rule mining
system that simplifies iterative feedback from domain experts and can easily
incorporate the domain-specific evaluation needs. A central part of the system
is a novel multi-objective anytime rule mining algorithm. The algorithm is
based on the GRASP-PR meta-heuristic but extends it with ideas from several
other approaches. We successfully applied the system in the industrial context.
In the current article, we focus on the description of the algorithm and the
concepts of the system. We provide an implementation of the system for reuse.
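The "multiple objectives of varying importance" aspect mentioned above is usually handled by comparing candidate rules under Pareto dominance rather than a single score. The sketch below shows only that generic comparison step, not the GRASP-PR-based algorithm itself; the rule names and objective values are invented for illustration, and all objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b
    (at least as good everywhere, strictly better somewhere)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep only non-dominated rules from (rule, objectives) pairs."""
    return [(r, obj) for r, obj in candidates
            if not any(dominates(other, obj) for _, other in candidates)]

# toy candidates: (rule id, (precision, coverage))
rules = [("r1", (0.9, 0.2)), ("r2", (0.6, 0.6)), ("r3", (0.5, 0.5))]
front = pareto_front(rules)
```

Presenting the whole front to a domain expert, instead of one "best" rule, is precisely what makes iterative feedback possible when objective importances cannot be quantified up front.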
Deep Learning for Sentiment Analysis : A Survey
Deep learning has emerged as a powerful machine learning technique that
learns multiple layers of representations or features of the data and produces
state-of-the-art prediction results. Along with the success of deep learning in
many other application domains, deep learning is also popularly used in
sentiment analysis in recent years. This paper first gives an overview of deep
learning and then provides a comprehensive survey of its current applications
in sentiment analysis.
Comment: 34 pages, 9 figures, 2 tables
A Study on Feature Selection Techniques in Educational Data Mining
Educational data mining (EDM) is a growing research area in which data mining
concepts are applied in the educational field to extract useful information on
the behaviors of students in the learning process. In EDM, feature selection
is performed to generate a subset of candidate variables. As feature selection
influences the
predictive accuracy of any performance model, it is essential to study
elaborately the effectiveness of student performance model in connection with
feature selection techniques. In this connection, the present study is devoted
not only to investigate the most relevant subset features with minimum
cardinality for achieving high predictive performance by adopting various
filtered feature selection techniques in data mining but also to evaluate the
goodness of subsets with different cardinalities and the quality of six
filtered feature selection algorithms in terms of F-measure value and Receiver
Operating Characteristics (ROC) value, generated by the NaiveBayes algorithm as
baseline classifier method. The comparative study we carried out on six
filter feature selection algorithms reveals the best method, as well as the
optimal dimensionality of the feature subset. Benchmarking of the filter
feature selection
method is subsequently carried out by deploying different classifier models.
The results of the present study effectively support the well-known fact that
predictive accuracy increases when only a minimal number of features is
retained. The expected outcomes show a reduction in computational time and
construction cost in both the training and classification phases of the student
performance model.
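The filter approach described above scores each feature independently of any classifier and keeps the top-ranked subset. The sketch below uses one very simple filter criterion (absolute difference of per-class means for a two-class problem) as a stand-in; the six algorithms studied in the paper are more sophisticated, and the function name and data here are illustrative.

```python
def filter_rank(X, y, k):
    """Rank features by a simple filter score: absolute difference of
    per-class feature means (classes 0/1), keeping the top-k indices."""
    n_feat = len(X[0])
    scores = []
    for j in range(n_feat):
        c0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        c1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        score = abs(sum(c1) / len(c1) - sum(c0) / len(c0))
        scores.append((score, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

X = [[1.0, 5.0, 0.10],
     [1.1, 5.1, 0.20],
     [3.0, 5.0, 0.15],
     [3.2, 4.9, 0.12]]
y = [0, 0, 1, 1]
top = filter_rank(X, y, k=1)
```

A baseline classifier (Naive Bayes in the study) would then be trained on the selected columns only, and F-measure/ROC values compared across subset cardinalities.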
A review on distance based time series classification
Time series classification is an increasing research topic due to the vast
amount of time series data that are being created over a wide variety of
fields. The particularity of the data makes it a challenging task and different
approaches have been taken, including the distance-based approach. 1-NN has
been a widely used method within distance-based time series classification due
to its simplicity combined with good performance. However, its supremacy may be
attributed to being able to use specific distances for time series within the
classification process and not to the classifier itself. With the aim of
exploiting these distances within more complex classifiers, new approaches have
arisen in the past few years that are competitive with, or outperform, the 1-NN
based approaches. In some cases, these new methods use the distance measure to
transform the series into feature vectors, bridging the gap between time series
and traditional classifiers. In other cases, the distances are employed to
obtain a time series kernel and enable the use of kernel methods for time
series classification. One of the main challenges is that a kernel function
must be positive semi-definite, a matter that is also addressed within this
review. The presented review includes a taxonomy of all those methods that aim
to classify time series using a distance-based approach, as well as a
discussion of the strengths and weaknesses of each method.
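The 1-NN-with-elastic-distance baseline discussed above can be sketched in a few lines with dynamic time warping (DTW), the most common such distance. This is a minimal textbook implementation (no warping window or lower bounds), with toy data invented for illustration.

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible alignments
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def nn1_classify(query, train):
    """1-NN classification under DTW; train is a list of (series, label)."""
    return min(train, key=lambda s: dtw(query, s[0]))[1]

train = [([0, 1, 2, 3], "up"), ([3, 2, 1, 0], "down")]
label = nn1_classify([0, 0, 1, 2, 3, 3], train)  # time-warped "up" shape
```

The feature-vector and kernel approaches surveyed in the review both start from such a distance: either the vector of DTW distances to reference series becomes the feature representation, or the distance is transformed into a (positive semi-definite) kernel.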
Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide features like a common vocabulary, reusability, and
machine-readable content; they also allow for semantic search, facilitate agent
interaction, and support the ordering & structuring of knowledge for Semantic
Web (Web 3.0) applications. However, a challenge in ontology engineering is
automatic learning, i.e., there is still no fully automatic approach for
forming an ontology from a text corpus or dataset of various topics using
machine learning techniques. In this paper, two topic modeling algorithms are explored,
namely LSI & SVD and Mr.LDA for learning topic ontology. The objective is to
determine the statistical relationship between document and terms to build a
topic ontology and an ontology graph with minimum human intervention.
Experimental analysis on building a topic ontology, and on semantically
retrieving the corresponding topic ontology for a user's query, demonstrates
the effectiveness of the proposed approach.
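The statistical relationship between documents and terms that LSI exploits comes from a low-rank decomposition of the term-document matrix. The sketch below computes only the leading left singular vector by power iteration as a dependency-free stand-in for a full SVD, which is not how a production LSI pipeline would do it; the toy term-document counts are invented for illustration.

```python
def top_topic(term_doc, iters=50):
    """Leading left singular vector of a term-document matrix A,
    via power iteration on A A^T. Its largest entries indicate
    the terms defining the dominant latent topic."""
    n = len(term_doc)
    m = len(term_doc[0])
    v = [1.0] * n
    for _ in range(iters):
        # w = A (A^T v), computed without forming A A^T explicitly
        at_v = [sum(term_doc[i][d] * v[i] for i in range(n))
                for d in range(m)]
        w = [sum(term_doc[i][d] * at_v[d] for d in range(m))
             for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# rows = terms, columns = documents (toy counts)
terms = ["data", "mining", "opera"]
A = [[3, 2, 0],   # "data" frequent in docs 1-2
     [2, 3, 0],   # "mining" frequent in docs 1-2
     [0, 0, 4]]   # "opera" appears only in doc 3
v = top_topic(A)
```

Repeating this with deflation (or a full SVD) yields further topics, and the highest-loading terms per topic become candidate concepts for the topic ontology.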