
    SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain

    Domain portability and adaptation of NLP components and Word Sense Disambiguation (WSD) systems present new challenges. The difficulties supervised systems face when adapting to new domains may change how we assess the strengths and weaknesses of supervised and knowledge-based WSD systems. Unfortunately, all existing evaluation datasets for specific domains are lexical-sample corpora. This task presented all-words datasets on the environment domain for WSD in four languages (Chinese, Dutch, English, Italian). 11 teams participated, with supervised and knowledge-based systems, mainly on the English dataset. The results show that in all languages the participants were able to beat the most frequent sense heuristic as estimated from general corpora. The most successful approaches used some form of supervision in the form of hand-tagged examples from the domain.
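    The most frequent sense (MFS) heuristic that served as the baseline here is simple to state. Below is a minimal sketch (not the task organizers' code), assuming a sense-tagged general corpus represented as (lemma, sense) pairs; the sense keys in the toy usage are hypothetical:

    from collections import Counter, defaultdict

    def train_mfs(tagged_corpus):
        """Estimate the most-frequent-sense baseline from a sense-tagged
        general corpus given as (lemma, sense) pairs."""
        counts = defaultdict(Counter)
        for lemma, sense in tagged_corpus:
            counts[lemma][sense] += 1
        return {lemma: c.most_common(1)[0][0] for lemma, c in counts.items()}

    def disambiguate(mfs, lemma, fallback=None):
        # Every occurrence of a lemma gets its globally most frequent sense,
        # regardless of context -- which is exactly why domain shift hurts it.
        return mfs.get(lemma, fallback)

    # Toy usage with a hypothetical tagged corpus.
    corpus = [("bank", "bank%finance"), ("bank", "bank%finance"), ("bank", "bank%river")]
    mfs = train_mfs(corpus)
    print(disambiguate(mfs, "bank"))  # -> bank%finance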

    Tailored semantic annotation for semantic search

    This paper presents a novel method for semantic annotation and search of a target corpus using several knowledge resources (KRs). The method relies on a formal statistical framework in which KR concepts and corpus documents are homogeneously represented using statistical language models. Under this framework, we can perform all the operations necessary for efficient and effective semantic annotation of the corpus. First, we propose a coarse tailoring of the KRs w.r.t. the target corpus, with the main goal of reducing the ambiguity of the annotations and their computational overhead. Then, we propose the generation of concept profiles, which allow measuring the semantic overlap of the KRs as well as performing a finer tailoring of them. Finally, we propose how to semantically represent documents and queries in terms of the KR concepts, and how to use the statistical framework to perform semantic search. Experiments have been carried out with a corpus about web resources which includes several Life Sciences catalogs and Wikipedia pages related to web resources in general (e.g., databases, tools, services). Results demonstrate that the proposed method is more effective and efficient than state-of-the-art methods relying on either context-free annotation or keyword-based search. We thank the anonymous reviewers for their very useful comments and suggestions. The work was supported by the CICYT project TIN2011-24147 from the Spanish Ministry of Economy and Competitiveness (MINECO).
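    The abstract does not spell out the statistical framework, but a common instantiation of this idea represents each concept and each document as a smoothed unigram language model and ranks documents by distributional similarity to a concept's model. A minimal sketch under that assumption (the smoothing scheme and scoring function here are illustrative choices, not the paper's):

    from collections import Counter
    import math

    def unigram_lm(tokens, vocab, mu=0.1):
        """Unigram language model with add-mu smoothing over a shared vocab."""
        counts = Counter(tokens)
        total, v = len(tokens), len(vocab)
        return {w: (counts[w] + mu) / (total + mu * v) for w in vocab}

    def neg_kl(concept_lm, doc_lm):
        """Score a document model against a concept model by negative
        KL divergence: higher means semantically closer."""
        return -sum(p * math.log(p / doc_lm[w]) for w, p in concept_lm.items())

    # Toy usage with a hypothetical concept gloss and two documents.
    docs = {"d1": "protein database search tool".split(),
            "d2": "movie review blog".split()}
    concept = "database of protein sequences".split()
    vocab = set(concept) | {w for d in docs.values() for w in d}
    c_lm = unigram_lm(concept, vocab)
    ranked = sorted(docs, key=lambda d: neg_kl(c_lm, unigram_lm(docs[d], vocab)),
                    reverse=True)
    print(ranked)  # d1 should rank first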

    Supervised and unsupervised methods for learning representations of linguistic units

    Word representations, also called word embeddings, are generic representations, often high-dimensional vectors. They map the discrete space of words into a continuous vector space, which allows us to handle rare or even unseen events, e.g., by considering the nearest neighbors. Many Natural Language Processing tasks can be improved by word representations if we extend the task-specific training data with the general knowledge incorporated in the word representations. The first publication investigates a supervised, graph-based method to create word representations. This method leads to a graph-theoretic similarity measure, CoSimRank, with equivalent formalizations that show CoSimRank's close relationship to Personalized PageRank and SimRank. The new formalization is efficient because it can use the graph-based word representation to compute a single node similarity without having to compute the similarities of the entire graph. We also show how we can take advantage of fast matrix multiplication algorithms. In the second publication, we use existing unsupervised methods for word representation learning and combine them with semantic resources by learning representations for non-word objects such as synsets and entities. We also investigate improved word representations which incorporate the semantic information from the resource. The method is flexible in that it can take any word representations as input and does not need an additional training corpus. A sparse tensor formalization guarantees efficiency and parallelizability. In the third publication, we introduce a method that learns an orthogonal transformation of the word representation space which concentrates the information relevant for a task in an ultradense subspace whose dimensionality is a factor of 100 smaller than that of the original space. We use ultradense representations for a Lexicon Creation task in which words are annotated with three types of lexical information: sentiment, concreteness and frequency. The final publication introduces a new calculus for the interpretable ultradense subspaces, including polarity, concreteness, frequency and part-of-speech (POS). The calculus supports operations like “−1 × hate = love” and “give me a neutral word for greasy” (i.e., oleaginous) and extends existing analogy computations like “king − man + woman = queen”.
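    CoSimRank has a compact iterative form: starting from indicator vectors for the two nodes, Personalized-PageRank-style propagation vectors are computed over the normalized graph, and their inner products are accumulated with a decay factor c. The following numpy sketch is my reading of the published definition, not the author's code:

    import numpy as np

    def cosimrank(adj, i, j, c=0.8, iters=10):
        """Similarity of nodes i and j: accumulate decayed inner products
        of the two nodes' propagation vectors (truncated after `iters` steps).
        adj: dense 0/1 adjacency matrix."""
        # Row-normalize so each step spreads a node's mass over its neighbors.
        P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)
        p, q = np.zeros(len(adj)), np.zeros(len(adj))
        p[i], q[j] = 1.0, 1.0
        sim = p @ q  # k = 0 term
        for k in range(1, iters + 1):
            p, q = P.T @ p, P.T @ q
            sim += c**k * (p @ q)
        return sim

    # Toy usage: path graph 0-1-2-3; nodes 0 and 2 share the neighbor 1.
    A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], float)
    print(cosimrank(A, 0, 2))

    Note how the single-pair computation only ever touches two vectors; computing the whole similarity matrix at once would instead use the fast matrix multiplication route mentioned in the abstract.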

    Introduction: Modeling, Learning and Processing of Text-Technological Data Structures

    Researchers in many disciplines, sometimes working in close cooperation, have been concerned with modeling textual data in order to account for texts as the prime information unit of written communication. The list of disciplines includes computer science and linguistics, as well as more specialized fields such as computational linguistics and text technology. What many of these efforts have in common is the aim of modeling textual data by means of abstract data types or data structures that support at least the semi-automatic processing of texts in any area of written communication.

    Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population

    The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method includes distributional semantic approaches and NL formalization theories in order to develop a framework that relies upon deep linguistic analysis... [edited by author]

    On the Use of Natural Language Processing for Automated Conceptual Data Modeling

    This research involved the development of a natural language processing (NLP) architecture for the extraction of entity-relationship diagrams (ERDs) from natural language requirements specifications. Conceptual data modeling plays an important role in database and software design, and many approaches to automating this process and developing software tools for it have been attempted. NLP approaches to this problem appear plausible because, compared to general free text, natural language requirements documents are relatively formal and exhibit special regularities that reduce the complexity of the problem. The approach taken here involves a loose integration of several linguistic components. Outputs from syntactic parsing are fed to a set of heuristic rules developed for this particular domain to produce tuples representing the underlying meanings of the propositions in the documents, and semantic resources are used to distinguish between correct and incorrect tuples. Finally, the tuples are integrated into full ERD representations. The major challenge addressed in this research is how to bring the various resources to bear on the translation of the natural language documents into the formal language. This system is taken to be representative of a potential class of similar systems designed to translate documents in other restricted domains into corresponding formalisms. The system is incorporated into a tool that presents the final ERDs to users, who can modify them in the attempt to produce an accurate ERD for the requirements document. An experiment demonstrated that users with limited experience in ERD specifications could produce better representations of requirements documents with the system than without it, and could do so in less time.
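    To illustrate the kind of heuristic rule such a pipeline applies on top of a syntactic parse (this is a deliberately simplified sketch, not the thesis system), one can map a sentence's subject-verb-object structure to a candidate (entity, relationship, entity) tuple using spaCy dependency labels:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

    def candidate_tuples(text):
        """Nominal subject + verb + direct object -> candidate ERD tuple."""
        tuples = []
        for sent in nlp(text).sents:
            for tok in sent:
                if tok.pos_ == "VERB":
                    subjects = [c for c in tok.children if c.dep_ == "nsubj"]
                    objects = [c for c in tok.children if c.dep_ == "dobj"]
                    for s in subjects:
                        for o in objects:
                            tuples.append((s.lemma_, tok.lemma_, o.lemma_))
        return tuples

    print(candidate_tuples("A customer places orders. An order contains items."))
    # e.g. [('customer', 'place', 'order'), ('order', 'contain', 'item')]

    In the thesis architecture, such raw tuples would then be filtered against semantic resources and merged into a full ERD; this sketch only covers the first, rule-based step.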

    Linguistic Refactoring of Business Process Models

    In the past decades, organizations have had to face numerous challenges due to intensifying globalization and internationalization, shorter innovation cycles and growing IT support for business. Business process management is seen as a comprehensive approach to align business strategy, organization, controlling, and business activities so that an organization can react flexibly to market changes. For this purpose, business process models are increasingly utilized to document and redesign relevant parts of the organization's business operations. Since companies tend to have a growing number of business process models stored in a process model repository, analysis techniques are required that assess the quality of these process models in an automatic fashion. While available techniques can easily check the formal content of a process model, only a few techniques analyze its natural language content. Therefore, techniques are required that address linguistic issues caused by the actual use of natural language. In order to close this gap, this doctoral thesis explicitly targets inconsistencies caused by natural language and investigates the potential of automatically detecting and resolving them from a linguistic perspective. In particular, this doctoral thesis provides the following contributions. First, it defines a classification framework that structures existing work on process model analysis and refactoring. Second, it introduces the notion of atomicity, which implements a strict consistency condition between the formal content and the textual content of a process model. Based on an explorative investigation, we reveal several recurring violation patterns that are not compliant with the notion of atomicity. Third, this thesis proposes an automatic refactoring technique that formalizes the identified patterns to transform a non-atomic process model into an atomic one. Fourth, this thesis defines an automatic technique for detecting and refactoring synonyms and homonyms in process models, which is useful for unifying the terminology used in an organization. Fifth and finally, this thesis proposes a recommendation-based refactoring approach that addresses process models suffering from incompleteness and thus admitting several possible interpretations. The efficiency and usefulness of the proposed techniques are evaluated on real-world process model repositories from various industries.
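    One concrete atomicity violation is an activity label that packs two actions into a single node, e.g. "check and approve invoice". The following sketch (my own construction to illustrate the idea, not the thesis implementation) detects that pattern and splits the label into atomic candidates:

    import re

    CONJUNCTION = re.compile(r"\b(and|or|and/or)\b", re.IGNORECASE)

    def is_atomic(label):
        """Heuristic: a label with a coordinating conjunction likely
        describes more than one action."""
        return CONJUNCTION.search(label) is None

    def split_label(label):
        """Refactor a non-atomic label into candidate atomic labels by
        splitting on the conjunction and re-attaching the shared object."""
        parts = [p.strip() for p in CONJUNCTION.split(label) if p.strip()]
        actions = [p for p in parts if p.lower() not in ("and", "or", "and/or")]
        if len(actions) != 2:
            return [label]
        first, second = actions
        obj = " ".join(second.split()[1:])
        # "check" + "approve invoice" -> ["check invoice", "approve invoice"]
        if " " not in first and obj:
            return [f"{first} {obj}", second]
        return [first, second]

    print(is_atomic("approve invoice"))              # True
    print(split_label("check and approve invoice"))  # ['check invoice', 'approve invoice']

    The thesis formalizes a richer set of such patterns; this sketch only shows the flavor of mapping a detected pattern to a deterministic refactoring.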
    • 

    corecore