
    Learning Modulo Theories for preference elicitation in hybrid domains

    This paper introduces CLEO, a novel preference elicitation algorithm capable of recommending complex objects in hybrid domains, characterized by both discrete and continuous attributes and constraints defined over them. The algorithm assumes minimal initial information, i.e., a set of catalog attributes, and defines decisional features as logic formulae combining Boolean and algebraic constraints over the attributes. The (unknown) utility of the decision maker (DM) is modelled as a weighted combination of features. CLEO iteratively alternates a preference elicitation step, where pairs of candidate solutions are selected based on the current utility model, and a refinement step, where the utility is refined by incorporating the feedback received. The elicitation step leverages a Max-SMT solver to return optimal hybrid solutions according to the current utility model. The refinement step is implemented as learning to rank, and a sparsifying norm is used to favour the selection of few informative features in the combinatorial space of candidate decisional features. Thanks to the use of Max-SMT technology, CLEO is the first preference elicitation algorithm capable of dealing with hybrid domains while handling uncertainty in the DM utility and noisy feedback. Experimental results on complex recommendation tasks show the ability of CLEO to quickly focus on optimal solutions, as well as its capacity to recover from suboptimal initial choices. While no competitors exist in the hybrid setting, CLEO outperforms a state-of-the-art Bayesian preference elicitation algorithm when applied to a purely discrete task. Comment: 50 pages, 3 figures, submitted to Artificial Intelligence Journal
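
    The abstract gives no implementation details, so the following is only a rough sketch of the elicitation/refinement alternation it describes: the Max-SMT elicitation step is replaced by a fixed pool of random candidate solutions, and the sparsifying norm is realised as L1 soft-thresholding on a perceptron-style ranking update. All names, sizes and the toy hidden utility are invented for illustration.

```python
# Hedged sketch of a CLEO-style refinement loop (not the authors' code).
import numpy as np

def rank_update(w, phi_pref, phi_other, lr=0.1, l1=0.01):
    """One ranking update: push the preferred solution's feature vector
    above the other's, then soft-threshold the weights for sparsity."""
    if w @ phi_pref <= w @ phi_other:           # ranking constraint violated
        w = w + lr * (phi_pref - phi_other)     # move towards the preferred solution
    # soft-thresholding plays the role of the sparsifying (L1) norm
    return np.sign(w) * np.maximum(np.abs(w) - l1, 0.0)

rng = np.random.default_rng(0)
w_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])   # DM's hidden utility (toy)
w = np.zeros(5)
candidates = rng.normal(size=(50, 5))            # stand-in for Max-SMT solutions
for _ in range(100):
    a, b = candidates[rng.choice(50, 2, replace=False)]
    pref, other = (a, b) if w_true @ a >= w_true @ b else (b, a)  # simulated DM feedback
    w = rank_update(w, pref, other)
print("learned weights:", np.round(w, 2))
```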

    Text Classification using Data Mining

    Text classification is the process of classifying documents into predefined categories based on their content; it is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way, such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for automatically classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using individual words, word relations, i.e., association rules mined from these words, are used to derive the feature set from pre-classified text documents. A Naive Bayes classifier is then applied to the derived features, and a single Genetic Algorithm step is added for the final classification. A system based on the proposed algorithm has been implemented and tested, and the experimental results show that it works as a successful text classifier. Comment: 19 pages, International Conference
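
    As a hedged illustration only (not the paper's implementation), the snippet below approximates "association rules between words" by unordered word-pair features and trains a Naive Bayes classifier on them with scikit-learn; the Genetic Algorithm stage is omitted, and the documents and labels are made up.

```python
# Word-pair features + Naive Bayes: a crude stand-in for the described pipeline.
from itertools import combinations
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def pair_features(doc):
    """Represent a document by its unordered word pairs
    (a rough proxy for mined word association rules)."""
    words = sorted(set(doc.lower().split()))
    return " ".join("_".join(p) for p in combinations(words, 2))

docs = ["cheap loan offer now", "project meeting agenda attached",
        "win a cheap prize now", "attached report for the meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer(analyzer=str.split)
X = vec.fit_transform([pair_features(d) for d in docs])
clf = MultinomialNB().fit(X, labels)

test = vec.transform([pair_features("cheap prize offer")])
print(clf.predict(test))   # prints ['spam'] for this toy data
```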

    A Deep Ensemble Framework for Fake News Detection and Classification

    Detecting fake news, rumors, incorrect information, and misinformation is nowadays a crucial issue, as such content can have serious consequences for our social fabric. The volume of such information is increasing rapidly due to the availability of enormous web information sources, including social media feeds, news blogs, online newspapers, etc. In this paper, we develop various deep learning models for detecting fake news and classifying it into pre-defined fine-grained categories. At first, we develop models based on Convolutional Neural Network (CNN) and Bi-directional Long Short Term Memory (Bi-LSTM) networks. The representations obtained from these two models are fed into a Multi-layer Perceptron (MLP) for the final classification. Our experiments on a benchmark dataset show promising results with an overall accuracy of 44.87%, which outperforms the current state of the art. Comment: 6 pages, 1 figure, accepted as a short paper in Web Intelligence 2018 (https://webintelligence2018.com/accepted-papers.html); title changed from "Going Deep to Detect Liars: Detecting Fake News using Deep Learning" to "A Deep Ensemble Framework for Fake News Detection and Classification" as per reviewers' suggestion
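
    A hedged architecture sketch in Keras of the general idea described: a CNN branch and a Bi-LSTM branch whose representations are concatenated and passed to an MLP. For simplicity the two encoders are placed in a single model rather than trained separately as in the paper's ensemble, and the vocabulary size, sequence length, layer sizes and number of classes are placeholders, not the paper's settings.

```python
# Illustrative CNN + Bi-LSTM + MLP text classifier (not the authors' configuration).
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, seq_len, n_classes = 20000, 200, 6

inp = layers.Input(shape=(seq_len,), dtype="int32")
emb = layers.Embedding(vocab_size, 100)(inp)

# CNN branch: local n-gram features
cnn = layers.Conv1D(128, 5, activation="relu")(emb)
cnn = layers.GlobalMaxPooling1D()(cnn)

# Bi-LSTM branch: sequential context
rnn = layers.Bidirectional(layers.LSTM(64))(emb)

# MLP over the concatenated representations
merged = layers.concatenate([cnn, rnn])
mlp = layers.Dense(128, activation="relu")(merged)
out = layers.Dense(n_classes, activation="softmax")(mlp)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```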

    A Survey on Sampling and Profiling over Big Data (Technical Report)

    Due to the development of internet technology and computer science, data is exploding at an exponential rate. Big data brings us new opportunities and challenges. On the one hand, we can analyze and mine big data to discover hidden information and extract more potential value. On the other hand, the 5V characteristics of big data, especially Volume, i.e., the sheer amount of data, bring challenges to storage and processing. For some traditional data mining algorithms, machine learning algorithms and data profiling tasks, it is very difficult to handle such a large amount of data: processing it places heavy demands on hardware resources and is time consuming. Sampling methods can effectively reduce the amount of data and help speed up data processing. Hence, sampling technology has been widely studied and used in the big data context, e.g., methods for determining sample size and for combining sampling with big data processing frameworks. Data profiling is the activity of finding metadata about a data set; it has many use cases, e.g., performing profiling tasks on relational data, graph data, and time series data for anomaly detection and data repair. However, data profiling is computationally expensive, especially for large data sets. Therefore, this paper focuses on sampling and profiling in the big data context and investigates the application of sampling to different categories of data profiling tasks. The experimental results of the surveyed studies show that results obtained from sampled data are close to, or even exceed, those obtained from the full data. Therefore, sampling technology plays an important role in the era of big data, and we have reason to believe that it will become an indispensable step in big data processing in the future.
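
    To make the sampling-for-profiling idea concrete, here is a small, self-contained sketch (not taken from any of the surveyed papers): reservoir sampling over a stream, followed by estimating a simple profiling statistic, the null rate of a column, from the sample instead of the full data. The synthetic column and sample size are invented.

```python
# Reservoir sampling (Algorithm R) + sample-based profiling estimate.
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# a synthetic "column" with roughly 10% missing values
column = [None if random.random() < 0.1 else x for x in range(1_000_000)]
sample = reservoir_sample(column, 10_000)
null_rate = sum(v is None for v in sample) / len(sample)
print(f"estimated null rate: {null_rate:.3f}")   # close to the true ~0.1
```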

    A Novel Rough Set Reduct Algorithm for Medical Domain Based on Bee Colony Optimization

    Feature selection refers to the problem of selecting the relevant features which produce the most predictive outcome. In particular, the feature selection task arises in datasets containing a huge number of features. Rough set theory has been one of the most successful methods used for feature selection; however, this method alone is still not able to find optimal subsets. To combat this, this paper proposes a new feature selection method based on rough set theory hybridized with Bee Colony Optimization (BCO). The proposed method is applied in the medical domain to find minimal reducts and is experimentally compared with Quick Reduct, Entropy Based Reduct, and other hybrid rough set methods based on Genetic Algorithms (GA), Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO). Comment: IEEE Publication Format, https://sites.google.com/site/journalofcomputing
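
    The abstract does not spell out the algorithm, so the sketch below only illustrates the rough-set dependency degree that reduct-search heuristics such as BCO commonly use as a fitness function for candidate feature subsets; the tiny decision table is invented.

```python
# Rough-set dependency degree gamma(R, D) for a candidate feature subset R.
from collections import defaultdict

def dependency_degree(rows, feature_idx, decision):
    """|POS_R(D)| / |U|: fraction of objects whose R-equivalence class
    is consistent with a single decision value."""
    classes = defaultdict(set)
    for row, d in zip(rows, decision):
        classes[tuple(row[i] for i in feature_idx)].add(d)
    pos = sum(1 for row, d in zip(rows, decision)
              if len(classes[tuple(row[i] for i in feature_idx)]) == 1)
    return pos / len(rows)

# invented decision table: 5 objects, 3 condition features, one decision
rows = [(1, 0, 2), (1, 1, 2), (0, 0, 1), (0, 1, 1), (1, 0, 0)]
decision = ['a', 'b', 'b', 'b', 'a']
print(dependency_degree(rows, [0], decision))     # 0.4: feature 0 alone is not enough
print(dependency_degree(rows, [0, 1], decision))  # 1.0: {0, 1} is a reduct for this toy table
```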

    Zero and Few Shot Learning with Semantic Feature Synthesis and Competitive Learning

    Zero-shot learning (ZSL) is made possible by learning a projection function between a feature space and a semantic space (e.g., an attribute space). Key to ZSL is thus learning a projection that is robust against the often large domain gap between the seen and unseen class domains. In this work, this is achieved by unseen class data synthesis and robust projection function learning. Specifically, a novel semantic data synthesis strategy is proposed, by which semantic class prototypes (e.g., attribute vectors) are used to simply perturb seen class data in order to generate unseen class ones. As in any data synthesis/hallucination approach, there are ambiguities and uncertainties in how well the synthesised data capture the targeted unseen class data distribution. To cope with this, the second contribution of this work is a novel projection learning model, termed competitive bidirectional projection learning (BPL), designed to best utilise the ambiguous synthesised data. Specifically, we assume that each synthesised data point can belong to any unseen class, and the two most likely class candidates are exploited to learn a robust projection function in a competitive fashion. As a third contribution, we show that the proposed ZSL model can be easily extended to few-shot learning (FSL) by again exploiting semantic (class prototype guided) feature synthesis and competitive BPL. Extensive experiments show that our model achieves state-of-the-art results on both problems. Comment: Submitted to IEEE TPAMI
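
    A minimal numpy sketch of the synthesis idea only (the competitive BPL model is not reproduced): seen-class samples are shifted along the difference between an unseen class prototype and the seen class prototype, mapped through a toy attribute-to-feature matrix. All dimensions, prototypes and the map W are fabricated for illustration.

```python
# Perturbing seen-class features with prototype differences to hallucinate unseen-class data.
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_attr, n = 64, 16, 200

W = rng.normal(size=(d_attr, d_feat)) * 0.1      # toy attribute -> feature map (assumption)
proto_seen = rng.normal(size=d_attr)             # seen class attribute prototype
proto_unseen = rng.normal(size=d_attr)           # unseen class attribute prototype

# seen-class samples scattered around the image of their prototype
x_seen = proto_seen @ W + rng.normal(scale=0.05, size=(n, d_feat))

# synthesis: shift seen-class samples along the prototype difference
x_synth = x_seen + (proto_unseen - proto_seen) @ W

# sanity check: synthesised data should sit near the unseen prototype's image
print(np.linalg.norm(x_synth.mean(0) - proto_unseen @ W))   # small residual
```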

    IDEL: In-Database Entity Linking with Neural Embeddings

    We present a novel architecture, In-Database Entity Linking (IDEL), in which we integrate the analytics-optimized RDBMS MonetDB with neural text mining abilities. Our system design abstracts the core tasks of most neural entity linking systems for MonetDB. To the best of our knowledge, this is the first de facto implemented system integrating entity linking in a database. We leverage the ability of MonetDB to support in-database analytics with user defined functions (UDFs) implemented in Python. These functions call machine learning libraries for neural text mining, such as TensorFlow. The system achieves zero cost for data shipping and transformation by utilizing MonetDB's ability to embed Python processes in the database kernel and exchange data in NumPy arrays. IDEL represents text and relational data in a joint vector space with neural embeddings and can compensate for errors caused by ambiguous entity representations. For detecting matching entities, we propose a novel similarity function based on joint neural embeddings, which are learned by minimizing a pairwise contrastive ranking loss. This function utilizes high-dimensional index structures for fast retrieval of matching entities. Our first implementation and experiments using the WebNLG corpus show the effectiveness and the potential of IDEL. Comment: This manuscript is a preprint for a paper submitted to VLDB201
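
    The following is not IDEL's code, but a small numpy sketch of a pairwise contrastive ranking loss of the kind the abstract mentions: it pulls embeddings of matching text/entity pairs together and pushes non-matching pairs at least a margin apart. The embeddings are random placeholders.

```python
# Pairwise contrastive ranking loss over matching and non-matching embedding pairs.
import numpy as np

def contrastive_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: distance to the matching entity should be smaller than
    the distance to a non-matching entity by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.mean(np.maximum(0.0, margin + d_pos - d_neg))

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32, 128))                           # mention embeddings (toy)
match_emb = text_emb + rng.normal(scale=0.1, size=(32, 128))    # matching entity embeddings
other_emb = rng.normal(size=(32, 128))                          # random non-matching entities
print(contrastive_ranking_loss(text_emb, match_emb, other_emb))
```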

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey we provide, for the first time, an end-to-end view of modern ER workflows and of the novel aspects of entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches and conclude with a detailed presentation of open research directions.
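
    As a toy illustration of the two core workflow steps such surveys cover, blocking (entity indexing) followed by matching within blocks, the snippet below uses invented records, a naive blocking key and a simple string-similarity threshold; real ER systems use far more sophisticated techniques.

```python
# Toy ER workflow: block on a cheap key, then compare pairs only within blocks.
from collections import defaultdict
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Jon Smith",  "city": "Berlin"},
    {"id": 2, "name": "John Smith", "city": "Berlin"},
    {"id": 3, "name": "Mary Jones", "city": "Paris"},
]

# blocking (indexing): group records by a cheap key to avoid comparing all pairs
blocks = defaultdict(list)
for r in records:
    blocks[(r["name"][0], r["city"])].append(r)

# matching: pairwise comparison restricted to each block
matches = []
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            sim = SequenceMatcher(None, block[i]["name"], block[j]["name"]).ratio()
            if sim > 0.8:
                matches.append((block[i]["id"], block[j]["id"], round(sim, 2)))
print(matches)   # [(1, 2, 0.95)] for these toy records
```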

    A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities

    The explosive growth of fake news and its erosion of democracy, justice, and public trust has increased the demand for fake news detection and intervention. This survey reviews and evaluates methods that can detect fake news from four perspectives: (1) the false knowledge it carries, (2) its writing style, (3) its propagation patterns, and (4) the credibility of its source. The survey also highlights some potential research tasks based on the review. In particular, we identify and detail related fundamental theories across various disciplines to encourage interdisciplinary research on fake news. We hope this survey can facilitate collaborative efforts among experts in computer and information sciences, social sciences, political science, and journalism to research fake news, where such efforts can lead to fake news detection that is not only efficient but, more importantly, explainable. Comment: ACM Computing Surveys (CSUR), 37 pages

    The automatic creation of concept maps from documents written using morphologically rich languages

    A concept map is a graphical tool for representing knowledge. Concept maps have been used in many different areas, including education, knowledge management, business and intelligence. Constructing concept maps manually can be a complex task; an unskilled person may encounter difficulties in determining and positioning the concepts relevant to the problem area. An application that recommends concept candidates and their positions in a concept map can significantly help the user in that situation. This paper gives an overview of different approaches to the automatic and semi-automatic creation of concept maps from textual and non-textual sources. The concept map mining process is defined, and one method suitable for the creation of concept maps from unstructured textual sources in highly inflected languages, such as Croatian, is described in detail. The proposed method uses statistical and data mining techniques enriched with linguistic tools. With minor adjustments, the method can also be used for concept map mining from textual sources in other morphologically rich languages. Comment: ISSN 0957-417
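
    A drastically simplified, language-agnostic sketch of the concept map mining idea (frequent terms as concept candidates, sentence-level co-occurrence as candidate relations); the real method relies on morphological and linguistic tools for Croatian, and the example text, stop-word list and thresholds here are invented.

```python
# Toy concept map mining: frequent terms as concepts, co-occurrence as relations.
from collections import Counter
from itertools import combinations
import re

text = ("Concept maps represent knowledge. Concept maps contain concepts "
        "and relations. Relations connect concepts in the map.")

sentences = [re.findall(r"[a-z]+", s.lower()) for s in text.split(".") if s.strip()]
stop = {"and", "in", "the"}
freq = Counter(w for s in sentences for w in s if w not in stop)

concepts = {w for w, c in freq.items() if c >= 2}           # concept candidates
edges = Counter()
for s in sentences:
    present = sorted(concepts & set(s))
    edges.update(combinations(present, 2))                   # co-occurrence links

print(sorted(concepts))
print(edges.most_common(3))
```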