
    Unsupervised Chinese Verb Metaphor Recognition Based on Selectional Preferences

    PACLIC / The University of the Philippines Visayas Cebu College, Cebu City, Philippines / November 20-22, 200

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LRs) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which informs downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. Proceeding manually, however, it is impossible to supply the LRs that MT developers need for every possible pair of European languages, textual domain, and genre. Moreover, an LR for a given language can never be considered complete or final, because natural language continually changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of the LRs required by MT systems. Such a factory will significantly cut down the cost, time and human effort required to build LRs.

    WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts and the automatic collation of lexical information into LRs in a standardized format. The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications.

    One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation that must be applied to a text corpus before machine learning or other techniques are used to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical in cost and time to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques that can learn from corpora with less supervision. Less supervised methods can support both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce.

    Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications require increased precision, i.e. a high degree of confidence in the lexical information used; at other times greater coverage may be required, with information about more words at the expense of some accuracy. Lexical acquisition in PANACEA has therefore investigated confidence thresholds to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
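    As a rough illustration of the precision/coverage trade-off described above, the sketch below filters automatically acquired lexical entries by a confidence score. The AcquiredEntry record, the frame labels and the threshold values are hypothetical illustrations, not part of the PANACEA deliverable.

```python
from dataclasses import dataclass

@dataclass
class AcquiredEntry:
    lemma: str
    frame: str          # e.g. a subcategorization frame label
    confidence: float   # score assigned by the acquisition component

def filter_lexicon(entries, threshold):
    """Keep only entries whose confidence meets the threshold.

    A high threshold favours precision (fewer, more reliable entries);
    a low threshold favours coverage (more entries, more noise).
    """
    return [e for e in entries if e.confidence >= threshold]

entries = [
    AcquiredEntry("give", "NP-V-NP-NP", 0.92),
    AcquiredEntry("give", "NP-V-PP", 0.41),
    AcquiredEntry("sleep", "NP-V-NP", 0.18),
]
high_precision = filter_lexicon(entries, threshold=0.8)  # 1 entry kept
high_coverage = filter_lexicon(entries, threshold=0.2)   # 2 entries kept
```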

    Unsupervised Recognition of Motion Verbs Metaphoricity in Atypical Political Dialogues

    This thesis deals with the unsupervised recognition of the novel metaphorical use of lexical items in dialogical, naturally occurring political texts, without recourse to task-specific hand-crafted knowledge. The analysis focuses on the class of verbs of motion identified by Beth Levin. These lexical items are investigated in the atypical political genre of the White House Press Briefings because of their role in the communication strategies deployed in public and political discourse. The Computational White House Press Briefings (CompWHoB) corpus, a large resource developed as one of the main objectives of the present work, is used to extract the press briefings containing the lexical items under analysis. Metaphor recognition for the motion verbs is addressed with unsupervised techniques whose theoretical foundations lie primarily in the Distributional Hypothesis, i.e. word embeddings and topic models. Three algorithms are developed for the task, combining the Word2Vec and Latent Dirichlet Allocation models and based on two approaches that reflect their respective theoretical frameworks. The first, defined as "local", leverages the syntactic relation between the verb of motion and its direct object to detect metaphoricity. The second, termed "global", dispenses with syntactic knowledge as a feature of the system and relies only on information inferred from the discourse context. The three systems and their corresponding approaches are evaluated against 1220 instances of verbs of motion annotated for metaphoricity by human judges. Results show that the global approach performs poorly compared to the two models implementing the local approach, leading to the conclusion that a syntax-agnostic system is still far from reaching satisfactory performance. The evaluation of the local approach instead yields promising results, proving the importance of endowing the machine with syntactic knowledge, as also confirmed by a qualitative analysis of the influence of the linguistic properties of metaphorical utterances.
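    A minimal sketch of a "local", embedding-based cue for verb-object metaphoricity is given below. It is not the algorithm developed in the thesis: it merely compares a candidate direct object with a small set of objects assumed to be typical literal arguments of the verb, using pre-trained word vectors loaded through gensim. The vector file path and the seed object lists are hypothetical.

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained word2vec vectors.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Hypothetical seed objects assumed to be literal arguments of each motion verb.
LITERAL_OBJECTS = {
    "throw": ["ball", "stone", "javelin"],
    "push": ["door", "cart", "button"],
}

def metaphoricity_score(verb, direct_object):
    """Lower similarity to typical literal objects -> more likely metaphorical."""
    seeds = [w for w in LITERAL_OBJECTS[verb] if w in kv]
    if direct_object not in kv or not seeds:
        return None  # out-of-vocabulary: no decision
    max_sim = max(kv.similarity(direct_object, s) for s in seeds)
    return 1.0 - max_sim  # crude anomaly score; higher = more likely metaphorical

print(metaphoricity_score("throw", "party"))  # expected to score higher than ...
print(metaphoricity_score("throw", "ball"))   # ... a prototypically literal object
```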

    Streaming and Sketch Algorithms for Large Data NLP

    The availability of large and rich quantities of text data is due to the emergence of the World Wide Web, social media, and mobile devices. Such vast data sets have led to leaps in the performance of many statistically based systems. Given the magnitude of text data available, however, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on all of it. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution for exploiting large data effectively and efficiently in many NLP applications. Datasets are growing at an exponential rate, much faster than available memory. To provide a memory-efficient solution for handling large datasets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process large data sets in one pass and represent them with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on large data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop fast large-scale approximate graph (FLAG), a system that quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
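    To make the sketching idea concrete, below is a minimal Count-Min sketch in Python for approximate n-gram counting. It is a generic textbook implementation, not the dissertation's variant, and the width/depth settings are arbitrary illustrative values.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in sub-linear space (overestimates only)."""

    def __init__(self, width=2**20, depth=4):
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hashed bucket per row; seeding the hash with the row index
        # gives depth (pseudo-)independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.tables[row][col] += count

    def query(self, item):
        # The true count is at most the minimum over the rows.
        return min(self.tables[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
tokens = "the cat sat on the mat near the cat".split()
for bigram in zip(tokens, tokens[1:]):
    cms.add(" ".join(bigram))
print(cms.query("the cat"))  # approximately 2
```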

    隐喻计算研究进展 (Research Progress in Metaphor Computation)

    Metaphor is a pervasive phenomenon in natural language; if it is not addressed, it will become a bottleneck constraining natural language processing and machine translation. Building on the relevant theoretical foundations of metaphor, this survey introduces existing computational metaphor models and metaphor corpus resources in terms of the two subtasks of metaphor computation, metaphor recognition and metaphor interpretation, and compares the strengths, weaknesses and applicable scope of these models. National Natural Science Foundation of China (61075058).

    Designing Statistical Language Learners: Experiments on Noun Compounds

    The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: (i) it identifies a new class of designs by specifying an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements; and (ii) it explores the development of a mathematical theory to predict the expected accuracy of statistical language learning systems in terms of the volume of data used to train them. The theoretical work is illustrated by applying statistical language learning designs to the analysis of noun compounds. Both syntactic and semantic analysis of noun compounds are attempted using the proposed architecture. Empirical comparisons demonstrate that the proposed syntactic model is significantly better than those previously suggested, approaching the performance of human judges on the same task, and that the proposed semantic model, the first statistical approach to this problem, exhibits significantly better accuracy than the baseline strategy. These results suggest that the new class of designs identified is a promising one. The experiments also serve to highlight the need for a widely applicable theory of data requirements.
    Comment: PhD thesis (Macquarie University, Sydney; December 1995), LaTeX source, xii+214 pages
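    To give a sense of the syntactic task, the sketch below brackets a three-word noun compound by comparing corpus association scores in a dependency-style fashion. The counts are hypothetical placeholders and PMI is used as a stand-in association measure; the thesis's actual models assign probabilities to semantic forms (concepts) rather than to surface words.

```python
import math

# Hypothetical co-occurrence counts harvested from a corpus.
PAIR_COUNTS = {("computer", "science"): 90, ("computer", "department"): 15,
               ("science", "department"): 60}
WORD_COUNTS = {"computer": 500, "science": 400, "department": 300}
TOTAL = 1_000_000  # hypothetical number of word pairs in the corpus

def assoc(w1, w2):
    """Pointwise mutual information as a stand-in association measure."""
    p12 = PAIR_COUNTS.get((w1, w2), 0.5) / TOTAL   # 0.5 = crude smoothing
    p1, p2 = WORD_COUNTS[w1] / TOTAL, WORD_COUNTS[w2] / TOTAL
    return math.log(p12 / (p1 * p2))

def bracket(n1, n2, n3):
    """Decide whether n1 modifies n2 or the head n3."""
    if assoc(n1, n2) >= assoc(n1, n3):
        return f"[[{n1} {n2}] {n3}]"   # left-branching
    return f"[{n1} [{n2} {n3}]]"       # right-branching

print(bracket("computer", "science", "department"))  # -> [[computer science] department]
```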

    Computational modeling of lexical ambiguity

    Lexical ambiguity is a frequent phenomenon that occurs not only for single words but also at the phrase level. Natural language processing systems need to deal with these ambiguities efficiently in various tasks, yet such systems often fail to do so in real applications. This thesis studies several complex phenomena related to word and phrase ambiguity at the level of text and proposes computational models to tackle them. Throughout the thesis, we address a number of lexical ambiguity phenomena varying along the line of sense granularity. We start with the idiom detection task, in which candidate senses are constrained to 'literal' and 'idiomatic'. Then we move on to the more general case of detecting figurative expressions. In this task, target phrases are not lexicalized but bear nonliteral meanings; like the idiom task, it has two candidate sense categories ('literal' and 'nonliteral'). Next, we consider a more complicated situation, word sense disambiguation (WSD), where words often have more than two candidate senses and the sense boundaries are fuzzier. Finally, we discuss another lexical ambiguity problem in which the sense inventory is not explicitly specified, word sense induction (WSI).

    Computationally, we propose novel models that outperform state-of-the-art systems. We start with a supervised model in which we study a number of semantic relatedness features combined with linguistically informed features such as local/global context, part-of-speech tags, syntactic structure, named entities and sentence markers. While experimental results show that the supervised model can effectively detect idiomatic expressions, we further improve the work by proposing an unsupervised bootstrapping model which does not rely on human-annotated data but performs at a level comparable to the supervised model. To accommodate other lexical ambiguity phenomena, we propose a Gaussian Mixture Model that can be used not only for detecting idiomatic expressions but also for automatically extracting unlexicalized figurative expressions from raw corpora. Aiming to model multiple sense disambiguation tasks within a uniform framework, we propose a probabilistic model (a topic model), which encodes human knowledge as sense priors via paraphrases of gold-standard sense inventories, and which performs effectively on the idiom task as well as on two WSD tasks. Regarding WSI, we find that state-of-the-art WSI research is hindered by deficiencies of evaluation measures that favour either very fine-grained or very coarse-grained cluster output. We argue that the information-theoretic V-Measure is a promising approach to pursue in the future but should be based on more precise entropy estimators, supported by evidence from an entropy bias analysis, simulation experiments, and stochastic predictions. We evaluate all our proposed models against state-of-the-art systems on standard test data sets and show that our approaches advance the state of the art.
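    As an illustration of two ingredients mentioned above, the sketch below clusters context vectors for a target phrase with a two-component Gaussian Mixture Model (e.g. 'literal' vs. 'nonliteral') and scores the induced clusters with the entropy-based V-Measure via scikit-learn. The random vectors and stand-in gold labels are placeholders, not data from the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # hypothetical context vectors for 200 usages
gold = rng.integers(0, 2, size=200)  # stand-in gold labels: 0 = literal, 1 = nonliteral

# Two-component mixture: one cluster per candidate sense.
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
pred = gmm.fit_predict(X)

# V-Measure is the harmonic mean of homogeneity and completeness, both of which
# are defined in terms of (estimated) entropies of the label distributions.
print(homogeneity_score(gold, pred))
print(completeness_score(gold, pred))
print(v_measure_score(gold, pred))
```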