    From Text to Knowledge

    The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared all over the world. To make this unbelievably huge information space accessible, search engines index the uploaded content and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines, such as Google, Yahoo, or Bing, are keyword-based, which is indisputably a very powerful tool for serving information needs centered around documents. However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when searching for specific knowledge about real-world entities. When asked for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources of common facts. However, these knowledge bases are far from complete, and most knowledge still lies buried in unstructured documents. Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured representation. This thesis is devoted to reducing this gap with Probabilistic Graphical Models. Probabilistic Graphical Models play a crucial role in modern pattern recognition, as they merge two important fields of applied mathematics: Graph Theory and Probability Theory. The first part of the thesis presents a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations. An essential part of the system is a novel algorithm for extracting relations between entity mentions that is based on Conditional Random Fields, which are Undirected Probabilistic Graphical Models. In the second part of the thesis, we use the power of Directed Probabilistic Graphical Models to solve important knowledge discovery tasks in large, semantically annotated document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that are able to learn, in an unsupervised way, the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents might refer to concepts originating from a thesaurus or ontology, but also to user-generated informal tags in social tagging systems. These forms of annotation represent a first step towards converting the World Wide Web to a more structured form. In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical textual repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is subject to further analysis with respect to existing curated state-of-the-art databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to database size and properties of known disease genes, among other things. Our experimental analysis shows that the facts extracted from the literature are of high quality.

    Information Extraction in Illicit Domains

    Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for such domains. Our approach uses raw, unlabeled text from an initial corpus, and a few (12-120) seed annotations per domain-specific attribute, to learn robust IE models for unobserved pages and websites. Empirically, we demonstrate that our approach can outperform feature-centric Conditional Random Field baselines by over 18% F-Measure on five annotated sets of real-world human trafficking datasets in both low-supervision and high-supervision settings. We also show that our approach is demonstrably robust to concept drift, and can be efficiently bootstrapped even in a serial computing environment. Comment: 10 pages, ACM WWW 201

    Conceptualising teachers' professional learning with Web 2.0

    Purpose – This paper seeks to identify and develop an exploratory framework for conceptualising how teachers might use the affordances of Web 2.0 technologies to support their own professional learning. Design/methodology/approach – The paper draws on a large corpus of literature and recent research evidence to identify the principal elements and features of professional learning and the underlying affordances of Web 2.0 technologies and applications. It generates an exploratory conceptual framework based on the emerging findings from this review using a socio-cultural theoretical perspective. The framework is explored through three individual illustrations which are drawn from a much larger case study which the author is undertaking within a newly established Academy in the North of England. Findings – The findings indicate that there is potential value in exploring professional learning with Web 2.0 technologies in the ways described. The framework offers an exploratory instrument to examine how professional learning for teachers could be supported with Web 2.0 technologies in ways that might have significant benefits over traditional methods of continuing professional development (CPD). Originality/value – The potential value and affordances of Web 2.0 technologies for teachers' professional learning are largely unexplored and under-theorised, and this work seeks to establish a framework for further discussion and empirical exploration.

    Artequakt: Generating tailored biographies from automatically annotated fragments from the web

    The Artequakt project seeks to automatically generate narrative biographies of artists from knowledge that has been extracted from the Web and maintained in a knowledge base. An overview of the system architecture is presented here and the three key components of that architecture are explained in detail, namely knowledge extraction, information management and biography construction. Conclusions are drawn from the initial experiences of the project and future progress is detailed.

    Identifying Web Tables - Supporting a Neglected Type of Content on the Web

    The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend in open data publishing encourages the adoption of structured formats like CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which are still the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables table extraction from various open data sources in the HTML format and an automatic export to the RDF format, making the data linked. The paper also gives an evaluation of machine learning techniques in conjunction with string similarity functions to be applied in a table recognition task. Comment: 9 pages, 4 figures