111,798 research outputs found
Hyperparameter Importance Across Datasets
With the advent of automated machine learning, automated hyperparameter
optimization methods are by now routinely used in data mining. However, this
progress is not yet matched by equal progress on automatic analyses that yield
information beyond performance-optimizing hyperparameter settings. In this
work, we aim to answer the following two questions: Given an algorithm, what
are generally its most important hyperparameters, and what are typically good
values for these? We present methodology and a framework to answer these
questions based on meta-learning across many datasets. We apply this
methodology using the experimental meta-data available on OpenML to determine
the most important hyperparameters of support vector machines, random forests
and Adaboost, and to infer priors for all their hyperparameters. The results,
obtained fully automatically, provide a quantitative basis to focus efforts in
both manual algorithm design and in automated hyperparameter optimization. The
conducted experiments confirm that the hyperparameters selected by the proposed
method are indeed the most important ones and that the obtained priors also
lead to statistically significant improvements in hyperparameter optimization.Comment: \c{opyright} 2018. Copyright is held by the owner/author(s).
Publication rights licensed to ACM. This is the author's version of the work.
It is posted here for your personal use, not for redistribution. The
definitive Version of Record was published in Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery & Data Minin
Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining
The notion of meta-mining has appeared recently and extends the traditional
meta-learning in two ways. First it does not learn meta-models that provide
support only for the learning algorithm selection task but ones that support
the whole data-mining process. In addition it abandons the so called black-box
approach to algorithm description followed in meta-learning. Now in addition to
the datasets, algorithms also have descriptors, workflows as well. For the
latter two these descriptions are semantic, describing properties of the
algorithms. With the availability of descriptors both for datasets and data
mining workflows the traditional modelling techniques followed in
meta-learning, typically based on classification and regression algorithms, are
no longer appropriate. Instead we are faced with a problem the nature of which
is much more similar to the problems that appear in recommendation systems. The
most important meta-mining requirements are that suggestions should use only
datasets and workflows descriptors and the cold-start problem, e.g. providing
workflow suggestions for new datasets.
In this paper we take a different view on the meta-mining modelling problem
and treat it as a recommender problem. In order to account for the meta-mining
specificities we derive a novel metric-based-learning recommender approach. Our
method learns two homogeneous metrics, one in the dataset and one in the
workflow space, and a heterogeneous one in the dataset-workflow space. All
learned metrics reflect similarities established from the dataset-workflow
preference matrix. We demonstrate our method on meta-mining over biological
(microarray datasets) problems. The application of our method is not limited to
the meta-mining problem, its formulations is general enough so that it can be
applied on problems with similar requirements
Automated data pre-processing via meta-learning
The final publication is available at link.springer.comA data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around.
As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives and nonexperienced users become overwhelmed.
We show that this problem can be addressed by an automated approach, leveraging ideas from metalearning.
Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result
of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.Peer ReviewedPostprint (published version
The machines take over: a comparison of various supervised learning approaches for automated scoring of divergent thinking tasks
Traditionally, researchers employ human raters for scoring responses to creative thinking tasks. Apart from the associated costs this approach entails two potential risks. First, human raters can be subjective in their scoring behavior (inter-rater-variance). Second, individual raters are prone to inconsistent scoring patterns (intra-rater-variance). In light of these issues, we present an approach for automated scoring of Divergent Thinking (DT) Tasks. We implemented a pipeline aiming to generate accurate rating predictions for DT responses using text mining and machine learning methods. Based on two existing data sets from two different laboratories, we constructed several prediction models incorporating features representing meta information of the response or features engineered from the response’s word embeddings that were obtained using pre-trained GloVe and Word2Vec word vector spaces. Out of these features, word embeddings and features derived from them proved to be particularly effective. Overall, longer responses tended to achieve higher ratings as well as responses that were semantically distant from the stimulus object. In our comparison of three state-of-the-art machine learning algorithms, Random Forest and XGBoost tended to slightly outperform the Support Vector Regression.Correction for this article: https://doi.org/10.1002/jocb.62
Problem-Solving Knowledge Mining from Users’\ud Actions in an Intelligent Tutoring System
In an intelligent tutoring system (ITS), the domain expert should provide\ud
relevant domain knowledge to the tutor so that it will be able to guide the\ud
learner during problem solving. However, in several domains, this knowledge is\ud
not predetermined and should be captured or learned from expert users as well as\ud
intermediate and novice users. Our hypothesis is that, knowledge discovery (KD)\ud
techniques can help to build this domain intelligence in ITS. This paper proposes\ud
a framework to capture problem-solving knowledge using a promising approach\ud
of data and knowledge discovery based on a combination of sequential pattern\ud
mining and association rules discovery techniques. The framework has been implemented\ud
and is used to discover new meta knowledge and rules in a given domain\ud
which then extend domain knowledge and serve as problem space allowing\ud
the intelligent tutoring system to guide learners in problem-solving situations.\ud
Preliminary experiments have been conducted using the framework as an alternative\ud
to a path-planning problem solver in CanadarmTutor
Ontology of core data mining entities
In this article, we present OntoDM-core, an ontology of core data mining
entities. OntoDM-core defines themost essential datamining entities in a three-layered
ontological structure comprising of a specification, an implementation and an application
layer. It provides a representational framework for the description of mining
structured data, and in addition provides taxonomies of datasets, data mining tasks,
generalizations, data mining algorithms and constraints, based on the type of data.
OntoDM-core is designed to support a wide range of applications/use cases, such as
semantic annotation of data mining algorithms, datasets and results; annotation of
QSAR studies in the context of drug discovery investigations; and disambiguation of
terms in text mining. The ontology has been thoroughly assessed following the practices
in ontology engineering, is fully interoperable with many domain resources and
is easy to extend
PRESISTANT: Learning based assistant for data pre-processing
Data pre-processing is one of the most time consuming and relevant steps in a
data analysis process (e.g., classification task). A given data pre-processing
operator (e.g., transformation) can have positive, negative or zero impact on
the final result of the analysis. Expert users have the required knowledge to
find the right pre-processing operators. However, when it comes to non-experts,
they are overwhelmed by the amount of pre-processing operators and it is
challenging for them to find operators that would positively impact their
analysis (e.g., increase the predictive accuracy of a classifier). Existing
solutions either assume that users have expert knowledge, or they recommend
pre-processing operators that are only "syntactically" applicable to a dataset,
without taking into account their impact on the final analysis. In this work,
we aim at providing assistance to non-expert users by recommending data
pre-processing operators that are ranked according to their impact on the final
analysis. We developed a tool PRESISTANT, that uses Random Forests to learn the
impact of pre-processing operators on the performance (e.g., predictive
accuracy) of 5 different classification algorithms, such as J48, Naive Bayes,
PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations on the
recommendations provided by our tool, show that PRESISTANT can effectively help
non-experts in order to achieve improved results in their analytical tasks
Student-centric Model of Learning Management System Activity and Academic Performance: from Correlation to Causation
In recent years, there is a lot of interest in modeling students' digital
traces in Learning Management System (LMS) to understand students' learning
behavior patterns including aspects of meta-cognition and self-regulation, with
the ultimate goal to turn those insights into actionable information to support
students to improve their learning outcomes. In achieving this goal, however,
there are two main issues that need to be addressed given the existing
literature. Firstly, most of the current work is course-centered (i.e. models
are built from data for a specific course) rather than student-centered;
secondly, a vast majority of the models are correlational rather than causal.
Those issues make it challenging to identify the most promising actionable
factors for intervention at the student level where most of the campus-wide
academic support is designed for. In this paper, we explored a student-centric
analytical framework for LMS activity data that can provide not only
correlational but causal insights mined from observational data. We
demonstrated this approach using a dataset of 1651 computing major students at
a public university in the US during one semester in the Fall of 2019. This
dataset includes students' fine-grained LMS interaction logs and administrative
data, e.g. demographics and academic performance. In addition, we expand the
repository of LMS behavior indicators to include those that can characterize
the time-of-the-day of login (e.g. chronotype). Our analysis showed that
student login volume, compared with other login behavior indicators, is both
strongly correlated and causally linked to student academic performance,
especially among students with low academic performance. We envision that those
insights will provide convincing evidence for college student support groups to
launch student-centered and targeted interventions that are effective and
scalable.Comment: 43 pages, 9 figures, 18 tables, Journal of Educational Data Mining
(Initial Submission
FSL-BM: Fuzzy Supervised Learning with Binary Meta-Feature for Classification
This paper introduces a novel real-time Fuzzy Supervised Learning with Binary
Meta-Feature (FSL-BM) for big data classification task. The study of real-time
algorithms addresses several major concerns, which are namely: accuracy, memory
consumption, and ability to stretch assumptions and time complexity. Attaining
a fast computational model providing fuzzy logic and supervised learning is one
of the main challenges in the machine learning. In this research paper, we
present FSL-BM algorithm as an efficient solution of supervised learning with
fuzzy logic processing using binary meta-feature representation using Hamming
Distance and Hash function to relax assumptions. While many studies focused on
reducing time complexity and increasing accuracy during the last decade, the
novel contribution of this proposed solution comes through integration of
Hamming Distance, Hash function, binary meta-features, binary classification to
provide real time supervised method. Hash Tables (HT) component gives a fast
access to existing indices; and therefore, the generation of new indices in a
constant time complexity, which supersedes existing fuzzy supervised algorithms
with better or comparable results. To summarize, the main contribution of this
technique for real-time Fuzzy Supervised Learning is to represent hypothesis
through binary input as meta-feature space and creating the Fuzzy Supervised
Hash table to train and validate model.Comment: FICC201
- …