54 research outputs found

    Knowledge Base Enrichment by Relation Learning from Social Tagging Data

    There has been considerable interest in transforming unstructured social tagging data into structured knowledge for semantic-based retrieval and recommendation. Research in this line mostly exploits data co-occurrence and often overlooks the complex and ambiguous meanings of tags. Furthermore, there have been few comprehensive evaluation studies regarding the quality of the discovered knowledge. We propose a supervised learning method to discover subsumption relations from tags. The key to this method is quantifying the probabilistic association among tags to better characterise their relations. We further develop an algorithm to organise tags into hierarchies based on the learned relations. Experiments were conducted using a large, publicly available dataset, Bibsonomy, and three popular, human-engineered or data-driven knowledge bases: DBpedia, Microsoft Concept Graph, and ACM Computing Classification System. We performed a comprehensive evaluation using different strategies: relation-level, ontology-level, and knowledge base enrichment based evaluation. The results clearly show that the proposed method can extract knowledge of better quality than the existing methods against the gold standard knowledge bases. The proposed approach can also enrich knowledge bases with new subsumption relations, having the potential to significantly reduce time and human effort for knowledge base maintenance and ontology evolution

    Learning Relations from Social Tagging Data

    An interesting research direction is to discover structured knowledge from user-generated data. Our work aims to find relations among social tags and organise them into hierarchies so as to better support discovery and search for online users. We cast relation discovery in this context as a binary classification problem in supervised learning. This approach takes as input features of two tags extracted using probabilistic topic modelling, and predicts whether a broader-narrower relation holds between them. Experiments were conducted using two large, real-world datasets: the Bibsonomy dataset, which is used to extract tags and their features, and the DBpedia dataset, which is used as the ground truth. Three sets of features were designed and extracted based on topic distributions, similarity and probabilistic associations. Evaluation results with respect to the ground truth demonstrate that our method outperforms existing ones based on various features and heuristics. Future work could study knowledge base enrichment from folksonomies and deep neural network approaches for processing tagging data
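As a rough illustration of the pairwise-feature idea in this abstract, the sketch below computes a few features for a pair of tags from their topic distributions. Everything specific here is an assumption made for illustration: the toy distributions, the tag names, and the choice of cosine similarity and KL divergence are not taken from the paper, which describes three feature sets without giving their formulas in this abstract.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against zero probabilities in q."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def pair_features(p_a, p_b):
    """Feature vector for the ordered tag pair (a, b).

    The two KL terms are asymmetric, which is what lets a downstream
    binary classifier tell "a is broader than b" from the reverse.
    """
    return [cosine(p_a, p_b), kl_divergence(p_a, p_b), kl_divergence(p_b, p_a)]


# Hypothetical topic distributions p(topic | tag) over three topics.
topics = {
    "machine_learning": [0.5, 0.3, 0.2],  # spread over several topics
    "svm": [0.9, 0.05, 0.05],             # concentrated on one topic
}

features = pair_features(topics["machine_learning"], topics["svm"])
```

In a full pipeline, vectors like `features` would be paired with labels from a gold-standard knowledge base and passed to any standard binary classifier; that step is omitted here.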

    Learning Structured Knowledge from Social Tagging Data: A critical review of methods and techniques

    For more than a decade, researchers have been proposing various methods and techniques to mine social tagging data and to learn structured knowledge. It is essential to conduct a comprehensive survey of the related work, which would benefit the research community by providing a better understanding of the state of the art and insights into future research directions. The paper first defines the spectrum of Knowledge Organization Systems, from unstructured with less semantics to highly structured with richer semantics. It then reviews the related work by classifying the methods and techniques into two main categories, namely, learning term lists and learning relations. Methods and techniques originating from natural language processing, data mining, machine learning, social network analysis, and the Semantic Web are discussed in detail under the two categories. We summarize the prominent issues with the current research and highlight future directions on learning constantly evolving knowledge from social media data

    TiFi: Taxonomy Induction for Fictional Domains [Extended version]

    Taxonomies are important building blocks of structured knowledge bases, and their construction from text sources and Wikipedia has received much attention. In this paper we focus on the construction of taxonomies for fictional domains, using noisy category systems from fan wikis or text extraction as input. Such fictional domains are archetypes of entity universes that are poorly covered by Wikipedia, as are enterprise-specific knowledge bases and highly specialized verticals. Our fiction-targeted approach, called TiFi, consists of three phases: (i) category cleaning, by identifying candidate categories that truly represent classes in the domain of interest, (ii) edge cleaning, by selecting subcategory relationships that correspond to class subsumption, and (iii) top-level construction, by mapping classes onto a subset of high-level WordNet categories. A comprehensive evaluation shows that TiFi is able to construct taxonomies for a diverse range of fictional domains such as Lord of the Rings, The Simpsons or Greek Mythology with very high precision and that it outperforms state-of-the-art baselines for taxonomy induction by a substantial margin

    Learning and Leveraging Structured Knowledge from User-Generated Social Media Data

    Knowledge has long been a crucial element in Artificial Intelligence (AI), which can be traced back to knowledge-based systems, or expert systems, in the 1960s. Knowledge provides contexts to facilitate machine understanding and improves the explainability and performance of many semantic-based applications. The acquisition of knowledge is, however, a complex step, normally requiring much effort and time from domain experts. In machine learning, one key domain of AI, the learning and leveraging of structured knowledge, such as ontologies and knowledge graphs, have become popular in recent years with the advent of massive user-generated social media data. The main hypothesis in this thesis is therefore that a substantial amount of useful knowledge can be derived from user-generated social media data. A popular, common type of social media data is social tagging data, accumulated from users' tagging in social media platforms. Social tagging data exhibit unstructured characteristics, including noisiness, flatness, sparsity, and incompleteness, which prevent efficient knowledge discovery and usage. The aim of this thesis is thus to learn useful structured knowledge from social media data while addressing these unstructured characteristics. Several research questions have then been formulated related to the hypothesis and the research challenges. A knowledge-centred view has been considered throughout this thesis: knowledge bridges the gap between massive user-generated data and semantic-based applications. The study first reviews concepts related to structured knowledge, then focuses on two main parts, learning structured knowledge and leveraging structured knowledge from social tagging data. To learn structured knowledge, a machine learning system is proposed to predict subsumption relations from social tags. 
The main idea is to learn to predict accurate relations with features generated with probabilistic topic modelling and founded on a formal set of assumptions on deriving subsumption relations. Tag concept hierarchies can then be organised to enrich existing Knowledge Bases (KBs), such as DBpedia and the ACM Computing Classification System. The study presents relation-level evaluation, ontology-level evaluation, and the novel Knowledge Base Enrichment based evaluation, and shows that the proposed approach can generate high-quality and meaningful hierarchies to enrich existing KBs. To leverage structured knowledge of tags, the research focuses on the task of automated social annotation and proposes a knowledge-enhanced deep learning model. Semantic-based loss regularisation has been proposed to enhance the deep learning model with the similarity and subsumption relations between tags. In addition, a novel guided attention mechanism has been proposed to mimic the users' behaviour of reading the title before digesting the content for annotation. The integrated model, Joint Multi-label Attention Network (JMAN), significantly outperformed the state-of-the-art, popular baseline methods, with consistent performance gain of the semantic-based loss regularisers on several deep learning models, on four real-world datasets. With the careful treatment of the unstructured characteristics and with the novel probabilistic and neural network based approaches, useful knowledge can be learned from user-generated social media data and leveraged to support semantic-based applications. This validates the hypothesis of the research and addresses the research questions. Future studies could explore methods to efficiently learn and leverage other types of structured knowledge and extend the current approaches to other user-generated data

    Using Data Mining for Facilitating User Contributions in the Social Semantic Web

    This thesis utilizes recommender systems to aid the user in contributing to the Social Semantic Web. In this work, we propose a framework that maps domain properties to recommendation technologies. Next, we develop novel recommendation algorithms for improving personalized tag recommendation and for recommendation of semantic relations. Finally, we introduce a framework to analyze different types of potential attacks against social tagging systems and evaluate their impact on those systems

    Community-driven & Work-integrated Creation, Use and Evolution of Ontological Knowledge Structures


    Knowledge extraction from fictional texts

    Knowledge extraction from text is a key task in natural language processing, which involves many sub-tasks, such as taxonomy induction, named entity recognition and typing, relation extraction, knowledge canonicalization and so on. By constructing structured knowledge from natural language text, knowledge extraction becomes a key asset for search engines, question answering and other downstream applications. However, current knowledge extraction methods mostly focus on prominent real-world entities with Wikipedia and mainstream news articles as sources. The constructed knowledge bases, therefore, lack information about long-tail domains, with fiction and fantasy as archetypes. Fiction and fantasy are core parts of our human culture, spanning from literature to movies, TV series, comics and video games. With thousands of fictional universes having been created, knowledge from fictional domains is the subject of search-engine queries - by fans as well as cultural analysts. Unlike in the real-world domain, knowledge extraction in specific domains such as fiction and fantasy has to tackle several key challenges: - Training data: Sources for fictional domains mostly come from books and fan-built content, which is sparse and noisy, and contains difficult text structures, such as dialogues and quotes. Training data for key tasks such as taxonomy induction, named entity typing or relation extraction is also not available. - Domain characteristics and diversity: Fictional universes can be highly sophisticated, containing entities, social structures and sometimes languages that are completely different from the real world. State-of-the-art methods for knowledge extraction make assumptions on entity-class, subclass and entity-entity relations that are often invalid for fictional domains. With different genres of fictional domains, another requirement is to transfer models across domains. 
- Long fictional texts: While state-of-the-art models have limitations on the input sequence length, it is essential to develop methods that are able to deal with very long texts (e.g. entire books), to capture multiple contexts and leverage widely spread cues. This dissertation addresses the above challenges, by developing new methodologies that advance the state of the art on knowledge extraction in fictional domains. - The first contribution is a method, called TiFi, for constructing type systems (taxonomy induction) for fictional domains. By tapping noisy fan-built content from online communities such as Wikia, TiFi induces taxonomies through three main steps: category cleaning, edge cleaning and top-level construction. Exploiting a variety of features from the original input, TiFi is able to construct taxonomies for a diverse range of fictional domains with high precision. - The second contribution is a comprehensive approach, called ENTYFI, for named entity recognition and typing in long fictional texts. Built on 205 automatically induced high-quality type systems for popular fictional domains, ENTYFI exploits the overlap and reuse of these fictional domains on unseen texts. By combining different typing modules with a consolidation stage, ENTYFI is able to do fine-grained entity typing in long fictional texts with high precision and recall. - The third contribution is an end-to-end system, called KnowFi, for extracting relations between entities in very long texts such as entire books. KnowFi leverages background knowledge from 142 popular fictional domains to identify interesting relations and to collect distant training samples. KnowFi devises a similarity-based ranking technique to reduce false positives in training samples and to select potential text passages that contain seed pairs of entities. 
By training a hierarchical neural network for all relations, KnowFi is able to infer relations between entity pairs across long fictional texts, and achieves gains over the best prior methods for relation extraction.

    Entity-centric knowledge discovery for idiosyncratic domains

    Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still the most dominant form used to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging from Named Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual contents. We propose a new method to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how the use of external knowledge bases (either domain-specific like DBLP or generic like DBPedia) can be leveraged to improve the effectiveness of NER for idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation leveraging the structure of a community-based ontology of scientific concepts. 
Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense that was originally picked by the authors of a scientific publication. We then turn to co-reference resolution, a task aiming at identifying entities that appear in various forms throughout the text. We propose an approach to type entities leveraging an inverted index built on top of a knowledge base, and to subsequently re-assign entities based on the semantic relatedness of the introduced types. Finally, we describe an application whose goal is to help researchers discover and manage scientific publications. We focus on the problem of selecting relevant tags to organize collections of research papers in that context. We experimentally demonstrate that the use of a community-authored ontology together with information about the position of the concepts in the documents makes it possible to significantly increase the precision of tag selection over standard methods