
    Unity in diversity : integrating differing linguistic data in TUSNELDA

    This paper describes the creation and preparation of TUSNELDA, a collection of corpus data built for linguistic research. The collection contains a number of linguistically annotated corpora that differ in aspects such as language, text sorts / data types, encoded annotation levels, and the linguistic theories underlying the annotation. The paper focuses on this variation on the one hand and, on the other, on the way these heterogeneous data are integrated into one resource.

    GitTables: A Large-Scale Corpus of Relational Tables

    The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io

    Combining Programming-by-Example with Transformation Discovery from large Databases

    Data transformation discovery is one of the most tedious tasks in data preparation. In particular, the generation of transformation programs for semantic transformations is tricky because additional sources for look-up operations are necessary. Current systems for semantic transformation discovery face two major problems: either they follow a program synthesis approach that only scales to a small set of input tables, or they rely on extraction of transformation functions from large corpora, which requires the identification of exact transformations in those resources and is prone to noisy data. In this paper, we try to combine both approaches to benefit from large corpora and the sophistication of program synthesis. To do so, we devise a retrieval and pruning strategy ensemble that extracts the most relevant tables for a given transformation task. The extracted resources can then be processed by a program synthesis engine to generate more accurate transformation results than state-of-the-art systems.

    The Interpersonal Entrainment in Music Performance Data Collection

    The Interpersonal Entrainment in Music Performance Data Collection (IEMPDC) comprises six related corpora of music research materials: Cuban Son & Salsa (CSS), European String Quartet (ESQ), Malian Jembe (MJ), North Indian Raga (NIR), Tunisian Stambeli (TS), and Uruguayan Candombe (UC). The core data for each corpus comprises media files and computationally extracted event onset timing data. Annotations of metrical structure and the code used in the preparation of the collection are also shared. The collection is unprecedented in size and level of detail and represents a significant new resource for empirical and computational research in music. In this article we introduce the main features of the data collection and the methods used in its preparation. Details of technical validation procedures and notes on data visualization are available as Appendices. We also contextualize the collection in relation to developments in Open Science and Open Data, discussing important distinctions between the two related concepts.