Learning Modulo Theories for preference elicitation in hybrid domains
This paper introduces CLEO, a novel preference elicitation algorithm capable
of recommending complex objects in hybrid domains, characterized by both
discrete and continuous attributes and constraints defined over them. The
algorithm assumes minimal initial information, i.e., a set of catalog
attributes, and defines decisional features as logic formulae combining Boolean
and algebraic constraints over the attributes. The (unknown) utility of the
decision maker (DM) is modelled as a weighted combination of features. CLEO
iteratively alternates between a preference elicitation step, in which pairs of
candidate solutions are selected based on the current utility model, and a
refinement step, in which the utility model is updated to incorporate the
feedback received. The
elicitation step leverages a Max-SMT solver to return optimal hybrid solutions
according to the current utility model. The refinement step is implemented as
learning to rank, and a sparsifying norm is used to favour the selection of a few
informative features in the combinatorial space of candidate decisional
features.
CLEO is the first preference elicitation algorithm capable of dealing with
hybrid domains, thanks to its use of Max-SMT technology, while still accounting
for uncertainty in the DM's utility and noise in the feedback. Experimental
results on complex recommendation tasks show the ability of CLEO to quickly
converge towards optimal solutions, as well as its capacity to recover from
suboptimal initial choices. While no competitors exist in the hybrid setting,
CLEO outperforms a state-of-the-art Bayesian preference elicitation algorithm
when applied to a purely discrete task.
Comment: 50 pages, 3 figures, submitted to Artificial Intelligence Journal
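To make the refinement step concrete, the following is a minimal Python sketch of pairwise learning-to-rank with an L1 (sparsifying) penalty, one straightforward way to realise a sparse weighted utility over decisional features. The function names, the subgradient update, and the toy data are illustrative assumptions, and the Max-SMT elicitation step is not reproduced here.

# A minimal sketch of the utility-refinement step: pairwise learning-to-rank
# with an L1 (sparsifying) penalty so that few decisional features receive
# non-zero weight. Names are illustrative, not taken from the paper.
import numpy as np

def refine_utility(preferences, n_features, reg=0.1, lr=0.01, epochs=200):
    """preferences: list of (phi_preferred, phi_other) feature-vector pairs."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for phi_win, phi_lose in preferences:
            margin = w @ (phi_win - phi_lose)
            grad = np.zeros(n_features)
            if margin < 1.0:                       # hinge loss subgradient
                grad -= (phi_win - phi_lose)
            grad += reg * np.sign(w)               # L1 subgradient (sparsity)
            w -= lr * grad
    return w

# Hypothetical usage: the DM prefers the first candidate in each pair.
prefs = [(np.array([1.0, 0.0, 0.3]), np.array([0.0, 1.0, 0.8]))]
weights = refine_utility(prefs, n_features=3)
print(weights)   # sparse weight vector approximating the DM's utility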
Text Classification using Data Mining
Text classification is the process of classifying documents into predefined
categories based on their content. It is the automated assignment of natural
language texts to predefined categories. Text classification is the primary
requirement of text retrieval systems, which retrieve texts in response to a
user query, and text understanding systems, which transform text in some way
such as producing summaries, answering questions or extracting data. Existing
supervised learning algorithms to automatically classify text need sufficient
documents to learn accurately. This paper presents a new algorithm for text
classification using data mining that requires fewer documents for training.
Instead of using individual words, word relations, i.e., association rules
mined over these words, are used to derive the feature set from pre-classified
text documents. A Naive Bayes classifier is then applied to the derived
features, and finally a single Genetic Algorithm step is added for the final
classification. A system based on the proposed algorithm has been implemented
and tested. The experimental results show that the proposed system works as a
successful text classifier.
Comment: 19 pages, International Conference
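As a rough illustration of using word relations rather than individual words as features, the following Python sketch builds word co-occurrence pairs (a crude stand-in for mined association rules) and trains a Naive Bayes classifier on them. The Genetic Algorithm stage is omitted, and the toy corpus and names are illustrative, not taken from the paper.

# Word co-occurrence pairs serve as relational features; a Naive Bayes
# classifier is trained on them. Toy data for illustration only.
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pair_features(text):
    words = sorted(set(text.lower().split()))
    return " ".join("_".join(p) for p in combinations(words, 2))

docs = ["stock market rises", "market crash fears", "team wins final match"]
labels = ["finance", "finance", "sports"]

vec = CountVectorizer()
X = vec.fit_transform([pair_features(d) for d in docs])
clf = MultinomialNB().fit(X, labels)

test = vec.transform([pair_features("stock market crash")])
print(clf.predict(test))   # likely ['finance'] on this toy corpus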
A Deep Ensemble Framework for Fake News Detection and Classification
Detecting fake news, rumors, incorrect information, and misinformation is
nowadays a crucial issue, as such content can have serious consequences for our
social fabric. The volume of such information is increasing rapidly due to the
availability of enormous web information sources, including social media feeds,
news blogs, online newspapers, etc.
In this paper, we develop various deep learning models for detecting fake
news and classifying it into pre-defined fine-grained categories.
At first, we develop models based on Convolutional Neural Network (CNN) and
Bi-directional Long Short Term Memory (Bi-LSTM) networks. The representations
obtained from these two models are fed into a Multi-layer Perceptron (MLP) for
the final classification. Our experiments on a benchmark dataset show promising
results with an overall accuracy of 44.87%, which outperforms the current state
of the art.
Comment: 6 pages, 1 figure, accepted as a short paper in Web Intelligence 2018
(https://webintelligence2018.com/accepted-papers.html); title changed from
{"Going Deep to Detect Liars" Detecting Fake News using Deep Learning} to {A
Deep Ensemble Framework for Fake News Detection and Classification} as per
reviewers' suggestion
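The described architecture, CNN and Bi-LSTM branches whose representations are concatenated and passed to an MLP, could look roughly like the following PyTorch sketch. The layer sizes, class count, and names are illustrative assumptions, not the authors' configuration.

# A rough sketch of a CNN + Bi-LSTM ensemble whose fused representation is
# classified by an MLP. Dimensions and names are illustrative only.
import torch
import torch.nn as nn

class CnnBiLstmEnsemble(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=64, n_classes=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.emb(tokens)                        # (batch, seq, emb)
        c = torch.relu(self.conv(x.transpose(1, 2))).max(dim=2).values
        _, (h, _) = self.lstm(x)                    # h: (2, batch, hidden)
        l = torch.cat([h[0], h[1]], dim=1)          # forward + backward states
        return self.mlp(torch.cat([c, l], dim=1))   # fused representation -> MLP

model = CnnBiLstmEnsemble(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 20)))     # two dummy documents
print(logits.shape)                                 # torch.Size([2, 6])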
A Survey on Sampling and Profiling over Big Data (Technical Report)
Due to the development of internet technology and computer science, data is
exploding at an exponential rate. Big data brings us new opportunities and
challenges. On the one hand, we can analyze and mine big data to discover
hidden information and get more potential value. On the other hand, the 5V
characteristic of big data, especially Volume which means large amount of data,
brings challenges to storage and processing. For some traditional data mining
algorithms, machine learning algorithms and data profiling tasks, it is very
difficult to handle such a large amount of data, which demands substantial
hardware resources and is time consuming to process. Sampling methods can
effectively reduce the amount of data and help speed up data processing. Hence,
sampling technology has been widely studied and used in the big data context,
e.g., methods for determining sample size and for combining sampling with big
data processing frameworks. Data profiling is the activity of discovering
metadata about a data set and has many use cases, e.g., performing data
profiling tasks on
relational data, graph data, and time series data for anomaly detection and
data repair. However, data profiling is computationally expensive, especially
for large data sets. Therefore, this paper focuses on sampling and profiling
in the big data context and investigates the application of sampling to
different categories of data profiling tasks. The experimental results of these
studies show that the results obtained from sampled data are close to, or even
exceed, those obtained from the full data. Therefore, sampling technology
plays an important role in the era of big data, and we also have reason to
believe that sampling technology will become an indispensable step in big data
processing in the future.
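As a simple illustration of how sampling can precede expensive profiling, the following Python sketch uses textbook reservoir sampling to draw a fixed-size uniform sample in one pass and then profiles the sample. It is a generic example, not an algorithm taken from any specific work surveyed here.

# Reservoir sampling draws a fixed-size uniform random sample in a single
# pass, so expensive profiling can run on the sample instead of the full data.
import random

def reservoir_sample(stream, k):
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)       # keep item with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Profiling (here just a distinct-value count) on a 1%-sized sample:
data = range(1_000_000)
sample = reservoir_sample(data, 10_000)
print(len(set(sample)))                    # approximate view of distinctness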
A Novel Rough Set Reduct Algorithm for Medical Domain Based on Bee Colony Optimization
Feature selection refers to the problem of selecting relevant features which
produce the most predictive outcome. In particular, the feature selection task
is especially important for datasets containing a huge number of features.
Rough set theory has been one of the most successful methods used for feature
selection. However, this method is still not able to find optimal subsets. This
paper proposes a new feature selection method based on Rough set theory
hybridized with Bee Colony Optimization (BCO) in an attempt to address this.
The proposed method is applied in the medical domain to find minimal reducts
and is experimentally compared with the Quick Reduct, Entropy Based Reduct, and
other hybrid Rough Set methods such as Genetic Algorithm (GA), Ant Colony
Optimization (ACO) and Particle Swarm Optimization (PSO).
Comment: IEEE Publication Format, https://sites.google.com/site/journalofcomputing
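For context, the rough-set machinery such a reduct search builds on can be sketched in a few lines of Python: the dependency degree of a feature subset and a greedy QuickReduct-style loop. The Bee Colony Optimization search proposed in the paper would replace this greedy step, and the toy records and names below are purely illustrative.

# Dependency degree gamma(R, D) and a greedy QuickReduct-style loop.
from collections import defaultdict

def dependency(rows, features, decision):
    """Fraction of rows whose equivalence class (w.r.t. features) is pure."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[f] for f in features)].append(row[decision])
    positive = sum(len(v) for v in blocks.values() if len(set(v)) == 1)
    return positive / len(rows)

def quick_reduct(rows, conditional, decision):
    reduct, best = [], 0.0
    full = dependency(rows, conditional, decision)
    while best < full:
        feat = max((f for f in conditional if f not in reduct),
                   key=lambda f: dependency(rows, reduct + [f], decision))
        reduct.append(feat)
        best = dependency(rows, reduct, decision)
    return reduct

rows = [{"fever": 1, "cough": 1, "age": 0, "flu": 1},
        {"fever": 0, "cough": 1, "age": 1, "flu": 0},
        {"fever": 1, "cough": 0, "age": 1, "flu": 1},
        {"fever": 0, "cough": 0, "age": 0, "flu": 0}]
print(quick_reduct(rows, ["fever", "cough", "age"], "flu"))   # e.g. ['fever']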
Zero and Few Shot Learning with Semantic Feature Synthesis and Competitive Learning
Zero-shot learning (ZSL) is made possible by learning a projection function
between a feature space and a semantic space (e.g., an attribute space). Key to
ZSL is thus to learn a projection that is robust against the often large domain
gap between the seen and unseen class domains. In this work, this is achieved
by unseen class data synthesis and robust projection function learning.
Specifically, a novel semantic data synthesis strategy is proposed, by which
semantic class prototypes (e.g., attribute vectors) are used to simply perturb
seen-class data in order to generate unseen-class samples. As in any data
synthesis/hallucination approach, there are ambiguities and uncertainties on
how well the synthesised data can capture the targeted unseen class data
distribution. To cope with this, the second contribution of this work is a
novel projection learning model termed competitive bidirectional projection
learning (BPL) designed to best utilise the ambiguous synthesised data.
Specifically, we assume that each synthesised data point can belong to any
unseen class, and the two most likely class candidates are exploited to learn a
robust projection function in a competitive fashion. As a third contribution,
we show that the proposed ZSL model can be easily extended to few-shot learning
(FSL) by again exploiting semantic (class prototype guided) feature synthesis
and competitive BPL. Extensive experiments show that our model achieves
state-of-the-art results on both problems.
Comment: Submitted to IEEE TPAMI
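A minimal NumPy sketch of the synthesis idea, shifting seen-class features by the difference between class prototypes to hallucinate unseen-class samples, is given below. It is a simplified reading that assumes prototypes already live in the same space as the features, and the competitive bidirectional projection learning itself is not shown.

# Pseudo unseen-class samples obtained by shifting seen-class features by the
# prototype difference. Simplified illustration, not the paper's formulation.
import numpy as np

def synthesise_unseen(seen_feats, seen_proto, unseen_proto, noise=0.05):
    """seen_feats: (n, d) features of one seen class; protos: (d,) vectors."""
    shift = unseen_proto - seen_proto
    jitter = noise * np.random.randn(*seen_feats.shape)
    return seen_feats + shift + jitter          # pseudo unseen-class samples

rng = np.random.default_rng(0)
seen = rng.normal(size=(100, 16))               # toy seen-class features
synthetic = synthesise_unseen(seen, seen.mean(0), seen.mean(0) + 0.5)
print(synthetic.shape)                          # (100, 16)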
IDEL: In-Database Entity Linking with Neural Embeddings
We present a novel architecture, In-Database Entity Linking (IDEL), in which
we integrate the analytics-optimized RDBMS MonetDB with neural text mining
abilities. Our system design abstracts core tasks of most neural entity linking
systems for MonetDB. To the best of our knowledge, this is the first de facto
implemented system integrating entity linking in a database. We leverage the
ability of MonetDB to support in-database analytics with user-defined functions
(UDFs) implemented in Python. These functions call machine learning libraries
for neural text mining, such as TensorFlow. The system achieves zero cost for
data shipping and transformation by utilizing MonetDB's ability to embed Python
processes in the database kernel and exchange data in NumPy arrays. IDEL
represents text and relational data in a joint vector space with neural
embeddings and can compensate for errors with ambiguous entity representations.
For detecting matching entities, we propose a novel similarity function based
on joint neural embeddings, which are learned by minimizing a pairwise
contrastive ranking loss. This function utilizes high-dimensional index
structures for fast retrieval of matching entities. Our first implementation
and experiments using the WebNLG corpus show the effectiveness and potential of
IDEL.
Comment: This manuscript is a preprint for a paper submitted to VLDB201
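The matching step can be illustrated with a small NumPy sketch: cosine similarity between joint embeddings of text mentions and candidate entities. The contrastive ranking training and the MonetDB/Python-UDF integration are not reproduced, and all names and dimensions are assumptions.

# Cosine-similarity retrieval of candidate entities for a mention embedding.
import numpy as np

def top_k_entities(mention_emb, entity_embs, k=3):
    """mention_emb: (d,), entity_embs: (n, d) -> indices of best matches."""
    m = mention_emb / np.linalg.norm(mention_emb)
    e = entity_embs / np.linalg.norm(entity_embs, axis=1, keepdims=True)
    scores = e @ m                               # cosine similarities
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
entities = rng.normal(size=(1000, 64))           # toy entity embedding table
mention = entities[42] + 0.1 * rng.normal(size=64)
print(top_k_entities(mention, entities))         # entity 42 should rank first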
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
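As a toy illustration of the two core steps such workflows revolve around, the following Python sketch blocks records on a simple key and then matches pairs within each block. The blocking key, similarity measure, and threshold are illustrative choices, not recommendations from the survey.

# Blocking (indexing) to restrict comparisons, then pairwise matching within
# each block. Toy records, key, similarity, and threshold for illustration.
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

records = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "John Smith"},
           {"id": 3, "name": "Jane Doe"}]

blocks = defaultdict(list)                       # block on last name token
for r in records:
    blocks[r["name"].split()[-1].lower()].append(r)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
        if sim > 0.8:                            # toy matching threshold
            matches.append((a["id"], b["id"], round(sim, 2)))

print(matches)                                   # only the two Smith records match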
A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities
The explosive growth of fake news and its erosion of democracy, justice, and
public trust have increased the demand for fake news detection and intervention.
This survey reviews and evaluates methods that can detect fake news from four
perspectives: (1) the false knowledge it carries, (2) its writing style, (3)
its propagation patterns, and (4) the credibility of its source. The survey
also highlights some potential research tasks based on the review. In
particular, we identify and detail related fundamental theories across various
disciplines to encourage interdisciplinary research on fake news. We hope this
survey can facilitate collaborative efforts among experts in computer and
information sciences, social sciences, political science, and journalism to
research fake news, where such efforts can lead to fake news detection that is
not only efficient but, more importantly, explainable.
Comment: ACM Computing Surveys (CSUR), 37 pages
The automatic creation of concept maps from documents written using morphologically rich languages
A concept map is a graphical tool for representing knowledge. Concept maps have
been used in many different areas, including education, knowledge management,
business and intelligence. Constructing concept maps manually can be a
complex task; an unskilled person may encounter difficulties in determining and
positioning concepts relevant to the problem area. An application that
recommends concept candidates and their position in a concept map can
significantly help the user in that situation. This paper gives an overview of
different approaches to automatic and semi-automatic creation of concept maps
from textual and non-textual sources. The concept map mining process is
defined, and one method suitable for the creation of concept maps from
unstructured textual sources in highly inflected languages such as the Croatian
language is described in detail. The proposed method uses statistical and data
mining techniques enriched with linguistic tools. With minor adjustments, the
method can also be used for concept map mining from textual sources in other
morphologically rich languages.
Comment: ISSN 0957-417
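A minimal Python sketch of the statistical core of concept map mining is given below: frequent terms become concept candidates and sentence-level co-occurrence suggests links between them. The linguistic processing (lemmatisation, morphological analysis) that the described method relies on for inflected languages is omitted, and the toy text and thresholds are illustrative.

# Frequent terms as concept candidates; sentence co-occurrence as link evidence.
from collections import Counter
from itertools import combinations
import re

text = ("Concept maps represent knowledge. Concept maps link concepts. "
        "Knowledge workers draw concept maps.")

sentences = [re.findall(r"\w+", s.lower()) for s in text.split(".") if s.strip()]
freq = Counter(word for sent in sentences for word in sent)
concepts = {w for w, c in freq.items() if c >= 2}        # frequent candidates

edges = Counter()
for sent in sentences:
    present = sorted(set(sent) & concepts)
    edges.update(combinations(present, 2))               # co-occurrence links

print(concepts)
print(edges.most_common(3))     # strongest candidate links in the toy text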