17,389 research outputs found

    Neural Word Segmentation with Rich Pretraining

    Full text link
    Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model, pretraining the most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive to the best methods on six benchmarks.Comment: Accepted by ACL 201

    Knowledge Rich Natural Language Queries over Structured Biological Databases

    Full text link
    Increasingly, keyword, natural language and NoSQL queries are being used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity are undeniable for obvious reasons, their engineering is far from simple. In most part, semantics and intent preserving mapping of a well understood natural language query expressed over a structured database schema to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings that separate the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language querying to well defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made

    DataHub: Collaborative Data Science & Dataset Version Management at Scale

    Get PDF
    Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.Comment: 7 page

    Specification and implementation of mapping rule visualization and editing : MapVOWL and the RMLEditor

    Get PDF
    Visual tools are implemented to help users in defining how to generate Linked Data from raw data. This is possible thanks to mapping languages which enable detaching mapping rules from the implementation that executes them. However, no thorough research has been conducted so far on how to visualize such mapping rules, especially if they become large and require considering multiple heterogeneous raw data sources and transformed data values. In the past, we proposed the RMLEditor, a visual graph-based user interface, which allows users to easily create mapping rules for generating Linked Data from raw data. In this paper, we build on top of our existing work: we (i) specify a visual notation for graph visualizations used to represent mapping rules, (ii) introduce an approach for manipulating rules when large visualizations emerge, and (iii) propose an approach to uniformly visualize data fraction of raw data sources combined with an interactive interface for uniform data fraction transformations. We perform two additional comparative user studies. The first one compares the use of the visual notation to present mapping rules to the use of a mapping language directly, which reveals that the visual notation is preferred. The second one compares the use of the graph-based RMLEditor for creating mapping rules to the form-based RMLx Visual Editor, which reveals that graph-based visualizations are preferred to create mapping rules through the use of our proposed visual notation and uniform representation of heterogeneous data sources and data values. (C) 2018 Elsevier B.V. All rights reserved

    Flattening an object algebra to provide performance

    Get PDF
    Algebraic transformation and optimization techniques have been the method of choice in relational query execution, but applying them in object-oriented (OO) DBMSs is difficult due to the complexity of OO query languages. This paper demonstrates that the problem can be simplified by mapping an OO data model to the binary relational model implemented by Monet, a state-of-the-art database kernel. We present a generic mapping scheme to flatten data models and study the case of straightforward OO model. We show how flattening enabled us to implement a query algebra, using only a very limited set of simple operations. The required primitives and query execution strategies are discussed, and their performance is evaluated on the 1-GByte TPC-D (Transaction-processing Performance Council's Benchmark D), showing that our divide-and-conquer approach yields excellent result

    Death of mixed methods?: Or the rebirth of research as a craft

    Get PDF
    The classification by many scholars of numerical research processes as quantitative and other research techniques as qualitative has prompted the construction of a third category, that of ‘mixed methods’, to describe studies that use elements from both processes. Such labels might be helpful in structuring our understanding of phenomena. But they can also inhibit our activities when they serve as inaccurate or limiting descriptors. Based on the observation that mixed methods is fast becoming a common research approach in the social sciences, this paper questions whether the assumptions that are used and perpetuated by mixed methods are valid. The paper calls for a critical change in how we perceive research, in order to better describe actual research processes. A more ethological taxonomy of the mechanisms underlying research structures and processes is posited to encourage creative thinking around alternatives to the three purported paradigms of quantitative, qualitative and mixed methods. This ‘return to basics’ seeks to encourage new and innovative research designs to emerge, and suggests a rebirth of research from the ashes of mixed methods

    Application of Semantics to Solve Problems in Life Sciences

    Get PDF
    Fecha de lectura de Tesis: 10 de diciembre de 2018La cantidad de información que se genera en la Web se ha incrementado en los últimos años. La mayor parte de esta información se encuentra accesible en texto, siendo el ser humano el principal usuario de la Web. Sin embargo, a pesar de todos los avances producidos en el área del procesamiento del lenguaje natural, los ordenadores tienen problemas para procesar esta información textual. En este cotexto, existen dominios de aplicación en los que se están publicando grandes cantidades de información disponible como datos estructurados como en el área de las Ciencias de la Vida. El análisis de estos datos es de vital importancia no sólo para el avance de la ciencia, sino para producir avances en el ámbito de la salud. Sin embargo, estos datos están localizados en diferentes repositorios y almacenados en diferentes formatos que hacen difícil su integración. En este contexto, el paradigma de los Datos Vinculados como una tecnología que incluye la aplicación de algunos estándares propuestos por la comunidad W3C tales como HTTP URIs, los estándares RDF y OWL. Haciendo uso de esta tecnología, se ha desarrollado esta tesis doctoral basada en cubrir los siguientes objetivos principales: 1) promover el uso de los datos vinculados por parte de la comunidad de usuarios del ámbito de las Ciencias de la Vida 2) facilitar el diseño de consultas SPARQL mediante el descubrimiento del modelo subyacente en los repositorios RDF 3) crear un entorno colaborativo que facilite el consumo de Datos Vinculados por usuarios finales, 4) desarrollar un algoritmo que, de forma automática, permita descubrir el modelo semántico en OWL de un repositorio RDF, 5) desarrollar una representación en OWL de ICD-10-CM llamada Dione que ofrezca una metodología automática para la clasificación de enfermedades de pacientes y su posterior validación haciendo uso de un razonador OWL
    corecore