29 research outputs found

    Weaving Words for Textile Museums: The development of the Linked SILKNOW Thesaurus

    Get PDF
    The cultural heritage domain in general, and silk textiles in particular, are characterized by large, rich and heterogeneous data sets. Silk heritage vocabulary comes from multiple sources that have been mixed across time and space, which has led specialized organizations to use different terminology to describe their artefacts and makes data interoperability between independent catalogues very difficult. To address these issues, SILKNOW created a multilingual thesaurus of silk textiles. It was compiled by experts in textile terminology and art history and computationally implemented by experts in text mining, multi-/cross-linguality and semantic extraction from text. This paper presents the rationale behind the creation of this thesaurus.
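    The abstract does not describe the thesaurus's data model, but since it is published as a linked, multilingual resource, an entry can be sketched along SKOS lines. The namespace, concept and labels below are invented for illustration and are not taken from the actual SILKNOW thesaurus (Python with rdflib):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

# Hypothetical namespace and concept; the real SILKNOW URIs and labels may differ.
SILK = Namespace("http://example.org/silknow/thesaurus/")

g = Graph()
g.bind("skos", SKOS)
g.bind("silk", SILK)

concept = SILK["damask"]
g.add((concept, SKOS.prefLabel, Literal("damask", lang="en")))   # one preferred label per language
g.add((concept, SKOS.prefLabel, Literal("damasco", lang="es")))
g.add((concept, SKOS.prefLabel, Literal("damas", lang="fr")))
g.add((concept, SKOS.broader, SILK["weaving-technique"]))        # hierarchical link to a broader term

print(g.serialize(format="turtle"))
```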

    A Multilingual Benchmark to Capture Olfactory Situations over Time

    Get PDF
    We present a benchmark in six European languages containing manually annotated information about olfactory situations and events, following a FrameNet-like approach. The document selection covers ten domains of interest to cultural historians working on olfaction and includes texts published between 1620 and 1920, allowing a diachronic analysis of smell descriptions. With this work, we aim to foster the development of olfactory information extraction approaches as well as the analysis of changes in smell descriptions over time.
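    The abstract does not spell out the annotation schema; as a rough, FrameNet-inspired illustration of what one annotated olfactory event might look like, the following sketch uses invented frame-element names that need not match the benchmark's actual labels:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative frame-element names; the benchmark's actual schema may differ.
@dataclass
class SmellEvent:
    text: str                           # sentence containing the olfactory situation
    trigger: str                        # smell-evoking word that anchors the frame
    smell_source: Optional[str] = None  # what emits the smell
    perceiver: Optional[str] = None     # who perceives it
    quality: Optional[str] = None       # how the smell is described
    language: str = "en"                # one of the six benchmark languages
    year: Optional[int] = None          # publication year (texts span 1620-1920)

example = SmellEvent(
    text="The air was heavy with the scent of incense.",
    trigger="scent",
    smell_source="incense",
    quality="heavy",
    year=1850,
)
print(example)
```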

    The Download Estimation Task on KDD Cup 2003

    No full text
    This paper describes our work on the Download Estimation task of KDD Cup 2003. The task requires us to estimate how many times a paper has been downloaded in the first 60 days after it was published on arXiv.org, a preprint server for papers on physics and related areas. The training data consists of approximately 29,000 papers, the citation graph, and information about the downloads of a subset of these papers. Our approach is based on an extension of the bag-of-words model, with linear SVM regression as the learning algorithm. We describe our experiments with various kinds of features, focusing particularly on issues of feature construction and weighting, which turn out to be quite important for this task.
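    The paper predates today's standard machine-learning toolkits, but the pipeline it describes (bag-of-words text features fed into linear SVM regression) can be sketched with scikit-learn. The citation-graph features and the feature-weighting schemes studied in the paper are omitted here, and the data is a toy stand-in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# Toy stand-ins for paper abstracts and their 60-day download counts.
abstracts = [
    "quantum gravity in two dimensions",
    "lattice QCD at finite temperature",
    "string dualities and branes",
]
downloads = [120.0, 340.0, 210.0]

# Bag-of-words (here TF-IDF-weighted) features fed into linear SVM regression,
# loosely mirroring the approach described in the abstract.
model = make_pipeline(TfidfVectorizer(), LinearSVR())
model.fit(abstracts, downloads)
print(model.predict(["two dimensional quantum gravity"]))
```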

    Machine learning on large class hierarchies by transformation into multiple binary problems

    Get PDF
    An ontology is a shared conceptualization of some problem domain, usually consisting of concepts, instances, an is-a hierarchy between concepts, and possibly other relations and attributes. This thesis deals with several problems on ontologies from a machine-learning perspective, in which a simple ontology can be seen as a hierarchy of classes: ontology population (classification), ontology evaluation, ontology evolution (predicting structural change) and extraction of ontological data from a corpus of documents. We particularly focus on the problem of classification into a hierarchy of classes, which can be seen as one way to populate an ontology. One approach to multi-class problems such as this one is to convert them into several binary (two-class) problems and use a voting scheme to combine the predictions of the resulting ensemble of binary classifiers. The relationship between the classes of the original multi-class problem and the new binary problems can be concisely described by a coding matrix. A particularly interesting question is whether good classification performance can be achieved with a small number of binary classifiers. The number of different coding matrices (and thus of the ensembles defined by them) is exponentially large in the number of classes of the original problem. Although this space of coding matrices is intractably large, a substantial amount of it can be explored if the number of classes in the original problem is small. We present extensive experiments on one such small dataset and investigate the distribution of classification performance scores as a function of the number of binary classifiers in the ensemble. We demonstrate that good classification performance can be achieved with a small ensemble, but such an ensemble might be hard to find; on the other hand, allowing a larger ensemble makes it easier to achieve good performance but does not lead to further increases in the maximal performance (over all ensembles of that size). We also investigate the well-known claim that high row and column separation (average Hamming distance between rows/columns of the matrix) are important properties of the coding matrix, and show that while matrices with high separation do indeed tend to perform well, maximizing row/column separation does not lead to the best-performing matrices. We present a greedy algorithm that constructs the coding matrix one column at a time, based on the idea that the binary classifier defined by the new column should focus on separating those pairs of classes which are most frequently confused by the existing ensemble of classifiers. An empirical evaluation shows that this algorithm achieves comparable performance with a smaller number of classifiers than a baseline random-matrix approach. We also present an analysis which demonstrates that the impact of adding a single new classifier to the ensemble is necessarily quite limited, even if weighted voting is taken into account.

    We also deal with the topic of ontology evolution, in particular with predicting structural changes in an ontology. We studied the evolution of the Open Directory Project (ODP) ontology over several years, identified several common types of structural changes, and developed a heuristic approach to recognize them and quantify their frequency. The most common structural change turned out to be the addition of new categories, and we present a machine-learning approach to predicting where a new subcategory might be added by taking a few documents from an existing category.

    Ontology evaluation consists of various approaches and techniques for evaluating and comparing ontologies. We present a survey of such techniques and classify them depending on which level or aspect of ontologies they focus on, as well as on their general approach (gold-standard based, application based, data-driven, and manual evaluation). We introduce an ontology evaluation measure for scenarios where the ontology is a hierarchy of classes and is to be compared to a “gold standard” ontology built over the same set of instances, and we investigate how this measure responds to various kinds of structural changes in the ontology.

    We also discuss another approach to ontology population, aimed at “general knowledge” ontologies rather than document hierarchies. In this approach the goal is to extract useful triples of the form (concept1, relation, concept2) from a corpus of natural-language documents. In addition to triples that directly occur in the corpus, we also consider more abstract triples that can be obtained by replacing one or more components with a hypernym. We developed an efficient algorithm that can process a large corpus of documents and extract triples that satisfy a minimum-support threshold at any level of abstraction. However, high support by itself is not a sufficient condition for a triple to be interesting for inclusion in the ontology, since many triples with high support are irrelevant or too abstract to be interesting. We present several heuristics that can be used to identify interesting triples by comparing their support to that of their ancestors or neighbors in the triple space, and we evaluated these heuristics experimentally by comparing their results to manually assigned relevance labels.
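    As a minimal illustration of the coding-matrix idea described above (decomposing a multi-class problem into binary problems and combining the binary predictions by Hamming-distance voting), the following sketch uses numpy and scikit-learn. The random-matrix construction stands in for the greedy column-selection algorithm from the thesis, and all data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def random_coding_matrix(n_classes, n_columns):
    """Random +/-1 coding matrix whose columns each contain both signs."""
    cols = []
    while len(cols) < n_columns:
        col = rng.choice([-1, 1], size=n_classes)
        if col.min() != col.max():          # skip degenerate all-same columns
            cols.append(col)
    return np.stack(cols, axis=1)

def fit_ecoc(X, y, M):
    """Train one binary classifier per column of the coding matrix M."""
    classifiers = []
    for j in range(M.shape[1]):
        y_bin = M[y, j]                     # relabel each example via column j
        classifiers.append(LogisticRegression().fit(X, y_bin))
    return classifiers

def predict_ecoc(X, M, classifiers):
    """Assign each example to the class whose code word is nearest in Hamming distance."""
    votes = np.column_stack([clf.predict(X) for clf in classifiers])   # entries in {-1, +1}
    dists = np.array([[(row != code).sum() for code in M] for row in votes])
    return dists.argmin(axis=1)

# Synthetic 5-class problem, purely for illustration.
X = rng.normal(size=(200, 10))
y = rng.integers(0, 5, size=200)
M = random_coding_matrix(n_classes=5, n_columns=8)
classifiers = fit_ecoc(X, y, M)
print(predict_ecoc(X[:5], M, classifiers))
```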

    Q-CAT Corpus Annotation Tool

    No full text
    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system.

    Q-CAT Corpus Annotation Tool 1.1

    No full text
    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments.

    Q-CAT Corpus Annotation Tool 1.2

    No full text
    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags.

    Q-CAT Corpus Annotation Tool 1.5

    No full text
    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CoNLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link type visualizations). Version 1.5 supports listening to audio recordings (provided in the # sound_url comment line in CoNLL-U).
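    Q-CAT itself is a .NET GUI application, so the following is not part of the tool; it is only a sketch, using the Python conllu package, of how a # sound_url comment line like the one mentioned above appears in a CoNLL-U file and how it can be read programmatically. The sentence and URL are invented:

```python
import conllu

# Invented two-token Slovenian example; real ssj500k sentences are richer.
data = """\
# sent_id = s1
# sound_url = https://example.org/audio/s1.wav
# text = Dober dan
1\tDober\tdober\tADJ\t_\t_\t2\tamod\t_\t_
2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\t_

"""

sentences = conllu.parse(data)
# Comment lines such as "# sound_url = ..." are exposed as sentence metadata.
print(sentences[0].metadata["sound_url"])
```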

    Q-CAT Corpus Annotation Tool 1.4

    No full text
    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a tool for manual linguistic annotation of corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application, which runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags. Version 1.3 supports adding new layers of annotation on top of CoNLL-U (and then saving the corpus as XML TEI). Version 1.4 introduces new features in command-line mode (filtering by sentence ID, multiple link type visualizations).