20 research outputs found

    HsH: Estimating Semantic Similarity of Words and Short Phrases with Frequency Normalized Distance Measures

    This paper describes the approach of the Hochschule Hannover to the SemEval-2013 task Evaluating Phrasal Semantics. To compare a single word with a two-word phrase, we compute various distributional similarities, including a new similarity measure based on Jensen-Shannon divergence with a correction for frequency effects. Classification is done by a support vector machine that uses all similarities as features. The approach turned out to be the most successful one in the task.
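
    The Jensen-Shannon divergence mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the frequency correction, so it is not reproduced here, and the function names and the dictionary representation of context distributions are assumptions.

```python
import math


def to_distribution(counts):
    """Turn raw co-occurrence counts (context -> count) into probabilities."""
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two distributions.

    p and q are dicts mapping a context feature to its probability.
    The result lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    # Mixture distribution m = (p + q) / 2 over the union of contexts.
    contexts = set(p) | set(q)
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in contexts}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m); zero-probability terms contribute 0.
        return sum(pa * math.log2(pa / m[c]) for c, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical context counts for a word and a two-word phrase.
word = to_distribution({"drink": 4, "cup": 3, "hot": 3})
phrase = to_distribution({"drink": 2, "cup": 2, "milk": 1})
divergence = jensen_shannon_divergence(word, phrase)
```

    A lower divergence indicates more similar context distributions; a similarity score could be derived as one minus the divergence, which is again an assumption rather than the measure used in the paper.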

    The Hanover Tagger (Version 1.1.0) - Lemmatization, Morphological Analysis and POS Tagging in Python

    HanTa, or Hanover Tagger, is an open-source Python program for lemmatization and part-of-speech tagging. This document describes its functionality, introduces the ideas and techniques used, and provides some information on the annotated training data for Dutch, English and German.

    A Probabilistic Morphology Model for German Lemmatization

    Lemmatization is a central task in many NLP applications. Despite this importance, the number of freely available and easy-to-use tools for German is very limited. To fill this gap, we developed a simple lemmatizer that can be trained on any lemmatized corpus. For a full-form word, the tagger tries to find the sequence of morphemes that is most likely to generate that word. From this sequence of tags we can easily derive the stem, the lemma and the part of speech (PoS) of the word. We show (i) that the quality of this approach is comparable to state-of-the-art methods and (ii) that we can improve the results of PoS tagging when we include the morphological analysis of each word.

    Extending Linear Indexed Grammars

    This paper presents a way to extend the formalism of linear indexed grammars (LIGs). The extension uses tuples of pushdowns instead of a single pushdown to store indices during a derivation. If access to the pushdowns is restricted, the resulting formalisms can be shown to give rise to a hierarchy of languages that is equivalent to a hierarchy defined by Weir. This paper gives a new proof of this equivalence, which was already known for a slightly different formalism. Since all languages of Weir's hierarchy are known to be mildly context-sensitive, the proposed extensions of LIGs become comparable with extensions of tree adjoining grammars and head grammars.

    A Hybrid Approach to Assignment of Library of Congress Subject Headings

    Library of Congress Subject Headings (LCSH) are popular for indexing library records. We studied the possibility of assigning LCSH automatically by training classifiers for terms used frequently in a large collection of abstracts of the literature on hand and by extracting headings from those abstracts. The resulting classifiers reach an acceptable level of precision but fall short on recall, partly because we could only train classifiers for a small number of LCSH. Extraction, i.e., the matching of headings in the text, produces better recall but extremely low precision. We found that combining both methods leads to a significant improvement in recall and a slight improvement in F1 score with only a small decrease in precision.
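
    The trade-off described above can be illustrated with a small sketch. The abstract does not say how the two methods are combined; taking the union of both heading sets is an assumption made here for illustration, and the headings and function names are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one record's headings."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def combine(classifier_headings, extracted_headings):
    """Assumed combination strategy: union of both sets of headings."""
    return classifier_headings | extracted_headings


# Hypothetical headings for a single abstract.
gold = {"Machine learning", "Information retrieval", "Libraries"}
from_classifier = {"Machine learning"}
from_extraction = {"Information retrieval", "Libraries", "Computers", "Data"}

for name, predicted in [
    ("classifier", from_classifier),
    ("extraction", from_extraction),
    ("combined", combine(from_classifier, from_extraction)),
]:
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

    In this toy record the classifier alone is precise but misses headings, extraction alone recovers more headings at the cost of false matches, and the union raises recall further. The numbers are illustrative only and say nothing about the results reported in the paper.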

    Smart Data Analytics : Schriften des Forschungsclusters Smart Data Analytics 2020

    The Smart Data Analytics research cluster presents its research from 2019 and 2020 in this volume. In the first half of the volume, 20 brief portraits of ongoing or recently completed projects provide an overview of the research topics in the cluster. The brief portraits contain a complete, annotated list of scientific publications from 2019 and 2020. In the second half of the volume, four longer contributions provide deeper insight into the research of the cluster and deal with topics such as error detection in databases, analysis and visualization of security incidents in networks, knowledge modeling and data integration in medicine, and the question of whether a computer program can be the author of a work of art in the sense of copyright law.

    Structural Analysis of Contract Renewals

    In this paper we sketch an automated procedure for comparing different versions of a contract. The contract texts used for this purpose are PDF files with differing structures, which are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that support the manual comparison of changes between document versions. The main challenges are dealing with OCR errors and with different layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language-agnostic and can be applied to other contracts as well.

    Predicting the Concreteness of German Words

    The concreteness of words has been measured and used in psycholinguistics for decades. Recently, it has also been used in retrieval and NLP tasks. For English, a number of well-known datasets have been established with average values for perceived concreteness. We give an overview of the available datasets for German, examine their correlation, and evaluate prediction algorithms for the concreteness of German words. We show that these algorithms achieve results similar to those for English datasets. Moreover, we show that, for all datasets, there is no significant difference between a regression model that uses word embeddings as features and a prediction algorithm based on word similarity according to the same embeddings.
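
    A similarity-based predictor of the kind mentioned above might look like the following sketch. The abstract does not describe the algorithm in detail, so the similarity-weighted average over the k most similar rated words, the function names, and the toy vectors are all assumptions; the regression model is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_concreteness(vector, rated, k=2):
    """Predict a rating from the k rated words most similar to `vector`.

    rated is a list of (embedding, concreteness_rating) pairs.
    The prediction is the similarity-weighted mean of the neighbours'
    ratings; negative similarities get zero weight.
    """
    if not rated:
        raise ValueError("at least one rated word is required")
    neighbours = sorted(
        ((cosine(vector, emb), rating) for emb, rating in rated),
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in neighbours]
    total = sum(weights)
    if total == 0:
        # No positively similar neighbour: fall back to the plain mean.
        return sum(rating for _, rating in neighbours) / len(neighbours)
    return sum(w * rating for w, (_, rating) in zip(weights, neighbours)) / total


# Toy two-dimensional "embeddings" with hypothetical ratings on a 1-5 scale.
rated_words = [([1.0, 0.0], 5.0), ([0.0, 1.0], 1.0)]
prediction = predict_concreteness([1.0, 1.0], rated_words, k=2)
```

    In practice the vectors would come from pretrained word embeddings and the ratings from one of the concreteness datasets; both are outside the scope of this sketch.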

    Verbal Idioms: Concrete Nouns in Abstract Contexts

    In this paper, we present our approach for the KONVENS 2021 shared task Disambiguation of German Verbal Idioms. Our model is a decision tree-based classifier that uses static word embeddings and computed concreteness values to predict whether a verbal idiom is used figuratively or literally.

    Evaluierung von Verschlagwortung im Kontext des Information Retrieval

    This contribution aims to give an overview of the possibilities, challenges and limits, as discussed in the literature, of using retrieval as an extrinsic evaluation method for the results of verbal subject indexing.