
    Complexity of Lexical Descriptions and its Relevance to Partial Parsing

    In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. However, increasing the complexity of descriptions makes the number of different descriptions for each lexical item much larger and hence increases the local ambiguity for a parser. This local ambiguity can be resolved by using supertag co-occurrence statistics collected from parsed corpora. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework, wherein supertag disambiguation provides a representation that is an almost parse. We have used the disambiguated supertag sequence in conjunction with a lightweight dependency analyzer to compute noun groups, verb groups, dependency linkages, and even partial parses. We have shown that a trigram-based supertagger achieves an accuracy of 92.1% on Wall Street Journal (WSJ) texts. Furthermore, we have shown that lightweight dependency analysis on the output of the supertagger correctly identifies 83% of the dependency links. We have exploited the representation of supertags with Explanation-Based Learning to improve parsing efficiency. In this approach, parsing in limited domains can be modeled as a finite-state transduction. We have implemented such a system for the ATIS domain, which improves parsing efficiency by a factor of 15. We have used the supertagger in a variety of applications to provide lexical descriptions at an appropriate granularity. In an information retrieval application, we show that the supertag-based system performs at higher levels of precision compared to a system based on part-of-speech tags. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application.
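    To make the trigram disambiguation step concrete, here is a minimal sketch in Python: each word contributes several candidate supertags, and a Viterbi search scored by trigram co-occurrence statistics picks the most likely sequence. The supertag names, lexicon, counts, and smoothing below are toy assumptions for illustration, not the dissertation's implementation.

```python
# A toy illustration of supertag disambiguation with a trigram model.
# The lexicon, supertag names, and counts are invented placeholders.
import math

# Hypothetical lexicon: word -> candidate supertags with emission log-probs.
LEXICON = {
    "prices": {"A_NXN": math.log(0.7), "B_NXN": math.log(0.3)},
    "soared": {"A_nx0V": math.log(0.9), "B_Vnx1": math.log(0.1)},
}

def trigram_logprob(prev2, prev1, tag, counts, vocab_size):
    """Add-one smoothed log P(tag | prev2, prev1)."""
    ctx = counts.get((prev2, prev1), {})
    return math.log((ctx.get(tag, 0) + 1) / (sum(ctx.values()) + vocab_size))

def supertag(words, counts, vocab_size):
    """Viterbi search over candidate supertags under a trigram model."""
    beams = {("<s>", "<s>"): (0.0, [])}   # state = last two supertags
    for w in words:
        new_beams = {}
        for (p2, p1), (score, seq) in beams.items():
            for tag, emit in LEXICON[w].items():
                s = score + emit + trigram_logprob(p2, p1, tag,
                                                   counts, vocab_size)
                key = (p1, tag)
                if key not in new_beams or s > new_beams[key][0]:
                    new_beams[key] = (s, seq + [tag])
        beams = new_beams
    return max(beams.values())[1]          # best-scoring supertag sequence

# Toy counts standing in for statistics collected from parsed corpora.
counts = {("<s>", "<s>"): {"A_NXN": 5}, ("<s>", "A_NXN"): {"A_nx0V": 8}}
print(supertag(["prices", "soared"], counts, vocab_size=4))
```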

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students, and postdocs in the Computer Science and Linguistics Departments and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatory Categorial Grammars, Tree Adjoining Grammars, syntactic parsing, and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it’s easier than ever to do so: this document is accessible on the “information superhighway”. Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html. In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors’ abstracts in the web version of this report. The abstracts describe the researchers’ many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.

    Is question answering fit for the Semantic Web? A survey

    With the recent rapid growth of the Semantic Web (SW), the processes of searching and querying content that is both massive in scale and heterogeneous have become increasingly challenging. User-friendly interfaces, which can support end users in querying and exploring this novel and diverse, structured information space, are needed to make the vision of the SW a reality. We present a survey on ontology-based Question Answering (QA), which has emerged in recent years to exploit the opportunities offered by structured semantic information on the Web. First, we provide a comprehensive perspective by analyzing the general background and history of the QA research field, from influential works from the artificial intelligence and database communities developed in the 70s and later decades, through open-domain QA stimulated by the QA track in TREC since 1999, to the latest commercial semantic QA solutions, before tackling the current state of the art in open user-friendly interfaces for the SW. Second, we examine the potential of this technology to go beyond the current state of the art to support end users in reusing and querying SW content. We conclude our review with an outlook for this novel research area, focusing in particular on the R&D directions that need to be pursued to realize the goal of efficient and competent retrieval and integration of answers from large-scale, heterogeneous, and continuously evolving semantic sources.

    Argument Labeling of Discourse Relations using LSTM Neural Networks

    A discourse relation can be described as a linguistic unit composed of sub-units that, when combined, present more information than the sum of their parts. A discourse relation is usually comprised of two arguments that relate to each other in a given form. A discourse relation may have another, optional sub-unit called the discourse connective, which connects the two arguments and describes the relationship between them more explicitly; such a relation is called an explicit discourse relation. Extracting or labeling the arguments present in explicit discourse relations is a challenging task. In recent years, spurred by the CoNLL shared tasks, feature engineering has allowed various machine learning models to achieve an F-measure of about 55%. However, feature engineering is brittle and hand-crafted, requiring advanced knowledge of linguistics as well as of the dataset in question. In this thesis, we propose an approach for segmenting (or identifying the boundaries of) Arg1 and Arg2 without feature engineering. We introduce a Bidirectional Long Short-Term Memory (LSTM) based model for argument labeling. We experimented with multiple configurations of our model. Using the Penn Discourse Treebank (PDTB) dataset, our best model achieved an F1 measure of 23.05% without any feature engineering. This is significantly higher than the 20.52% achieved by the state-of-the-art Recurrent Neural Network (RNN) approach, but significantly lower than feature-based state-of-the-art systems. On the other hand, because our approach learns only from the raw dataset, it is more widely applicable to multiple textual genres and languages.
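    The following is a minimal PyTorch sketch of the model class described here: token embeddings feed a bidirectional LSTM whose per-token outputs are projected onto BIO-style Arg1/Arg2 labels. The tag set, dimensions, and toy batch are illustrative assumptions, not the thesis configuration.

```python
# A minimal BiLSTM sequence labeler for argument segmentation.
import torch
import torch.nn as nn

TAGS = ["O", "B-Arg1", "I-Arg1", "B-Arg2", "I-Arg2"]  # assumed label scheme

class BiLSTMArgLabeler(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated: 2 * hidden_dim.
        self.out = nn.Linear(2 * hidden_dim, len(TAGS))

    def forward(self, token_ids):
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)              # (batch, seq_len, num_tags)

model = BiLSTMArgLabeler(vocab_size=10000)
batch = torch.randint(0, 10000, (2, 12))     # two toy 12-token sentences
logits = model(batch)
gold = torch.randint(0, len(TAGS), (2, 12))  # placeholder gold labels
loss = nn.CrossEntropyLoss()(logits.view(-1, len(TAGS)), gold.view(-1))
predicted = logits.argmax(dim=-1)            # per-token tag indices
```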

    Alternative Phrases: Theoretical Analysis and Practical Application

    Institute for Communicating and Collaborative Systems
    "All right, but apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the fresh-water system and public health, what have the Romans ever done for us?" (Monty Python, The Life of Brian) Alternative phrases identify selected elements from a set and subject them to particular scrutiny with respect to the sentence's predicate. For instance, in the above example, sanitation, medicine, etc. are all identified as elements in the set of "things the Romans have done for us" that should not be included in the response to the question. They are alternative responses to the desired ones. Alternative phrases come in a variety of constructions and perform a variety of tasks: excluding elements (apart from), expressing preference for particular elements (especially), and simply identifying representative examples (such as). Not a great deal of work has been done on alternative phrases in general. Hearst (1992) used a pattern-matching analysis of certain alternative phrases to learn hyponyms from unannotated corpora. Also, a few examples from a subset of alternative phrases, called exceptive phrases, have been studied, most recently by von Fintel (1993) and Hoeksema (1995). But not all constructions are amenable to pattern-matching techniques, and the work on exceptive phrases focuses on some very specific semantic points. The focus of this thesis is to present a general program for analyzing a wide variety of alternative phrases, including their presuppositional and anaphoric properties. I perform my analyses in Combinatory Categorial Grammar, a lexicalized formalism. The semantic aspects of the analysis benefit greatly from the concept of alternative sets, sets of propositions that differ in one or more arguments (Karttunen and Peters, 1979; Rooth, 1985, 1992; Prevost and Steedman, 1994; Steedman, 2000a). In addition, elegant solutions are made possible by separating the semantics into assertion and presupposition (Stalnaker, 1974; Karttunen and Peters, 1979; Stone and Doran, 1997; Stone and Webber, 1998; Webber et al., 1999b), with each performing quite different tasks. My second goal is to demonstrate the practicality and importance of this analysis to real systems. Although it is relevant to many practical applications, I will focus primarily on natural language information retrieval (NLIR) as a case study. In such a domain, queries like "Where can I find other web browsers than Netscape for download?" and "Where can I find shoes made by Buffalino, such as the Bushwackers?" are often observed. I review several techniques for NLIR and demonstrate that implementations of those techniques perform poorly on such queries. I show that understanding alternative phrases can enable simple techniques which greatly improve precision. To bridge the gap between these goals, I present Grok, a modular natural language system. Several general NLP issues necessary to support my linguistic analysis are discussed: anaphora resolution, processing of presuppositions, the interface to knowledge representation, and the creation of a wide-coverage lexicon. Special attention is paid to the lexicon, which is a combination of a hand-built and an acquired lexicon.
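    For a flavor of the pattern-matching approach mentioned above, here is a small sketch in the spirit of Hearst-style patterns over alternative-phrase constructions. The regular expressions are simplified assumptions for illustration, not the grammar developed in the thesis.

```python
# Illustrative pattern matching over alternative phrases: simple regular
# expressions extract the set noun phrase and the singled-out element.
import re

PATTERNS = [
    # "other NPs than NP" -> element excluded from the desired answers
    (re.compile(r"other (\w[\w ]*?) than ((?:the )?\w+)", re.I), "exclude"),
    # "NPs, such as NP"   -> representative example of the set
    (re.compile(r"(\w[\w ]*?),? such as ((?:the )?\w+)", re.I), "example"),
]

def find_alternatives(text):
    """Return (relation, set noun phrase, element) triples found in text."""
    return [(rel, m.group(1).strip(), m.group(2))
            for pattern, rel in PATTERNS
            for m in pattern.finditer(text)]

print(find_alternatives(
    "Where can I find other web browsers than Netscape for download?"))
# [('exclude', 'web browsers', 'Netscape')]
print(find_alternatives("shoes made by Buffalino, such as the Bushwackers"))
# [('example', 'shoes made by Buffalino', 'the Bushwackers')]
```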

    Using natural language processing for question answering in closed and open domains

    With regard to the growth in the amount of social, environmental, and biomedical information available digitally, there is a growing need for Question Answering (QA) systems that can empower users to master this new wealth of information. Despite recent progress in QA, the quality of interpretation and extraction of the desired answer is not yet adequate. We believe that striving for higher accuracy in QA systems is subject to ongoing research, i.e., no answer is better than a wrong answer. However, there are diverse queries that state-of-the-art QA systems cannot interpret and answer properly. The problem of interpreting a question in a way that preserves its syntactic-semantic structure is considered one of the most important challenges in this area. In this work we focus on the problems of semantic-based QA systems and analyze the effectiveness of NLP techniques, query mapping, and answer inferencing in both closed (first scenario) and open (second scenario) domains. For this purpose, the architecture of a Semantic-based closed and open domain Question Answering System (hereafter "ScoQAS") over ontology resources is presented with two different prototypes: an ontology-based closed domain and an open domain over Linked Open Data (LOD) resources. ScoQAS is based on NLP techniques combining semantic-based structure-feature patterns for question classification with the creation of a question syntactic-semantic information structure (QSiS). The QSiS builds constraints to formulate the related terms along syntactic-semantic aspects and generates a question graph (QGraph), which facilitates inference for obtaining a precise answer in the closed domain. In addition, our approach provides a convenient method to map the formulated information into a SPARQL query template for querying the LOD resources in the open domain. The main contributions of this dissertation are as follows: 1. Developing the ScoQAS architecture, integrating common and specific components compatible with closed and open domain ontologies. 2. Analyzing the user's question and building a question syntactic-semantic information structure (QSiS), constituted by several processes of the methodology: question classification, Expected Answer Type (EAT) determination, and constraint generation. 3. Presenting an empirical semantic-based structure-feature pattern for question classification and generalizing heuristic constraints to formulate the relations between the features in the recognized pattern in syntactic and semantic terms. 4. Developing a syntactic-semantic QGraph for representing the core components of the question. 5. Presenting an empirical graph-based answer inference in the closed domain. In a nutshell, a semantic-based QA system is presented that provides experimental results over the closed and open domains. The efficiency of ScoQAS is evaluated using measures such as precision, recall, and F-measure on LOD challenges in the open domain. We focus on quantitative evaluation in the closed domain scenario. Due to the lack of predefined benchmarks in the first scenario, we define measures that demonstrate the actual complexity of the problem and the actual efficiency of the solutions.
The results of the analysis corroborate the performance and effectiveness of our approach in achieving reasonable accuracy.
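    To illustrate the open-domain mapping step described above, the following sketch shows how classified question slots might be instantiated in a SPARQL template and sent to a LOD endpoint. The slot names, the template, and the DBpedia property/resource pair are illustrative assumptions, not the actual ScoQAS mapping.

```python
# Minimal sketch: fill a SPARQL template from hypothetical question slots
# and query a Linked Open Data endpoint (DBpedia here, for illustration).
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical output of question classification / QSiS construction
# for "Who is the mayor of Berlin?".
slots = {"relation": "dbo:mayor", "entity": "dbr:Berlin"}

TEMPLATE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?answer WHERE {{ {entity} {relation} ?answer . }}
"""

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(TEMPLATE.format(**slots))
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["answer"]["value"])
```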

    Beyond topic-based representations for text mining

    A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While all kinds of online information are useful in general, textual information is especially important—it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification or information retrieval, since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are of a topical nature or can be captured by words alone. For example, determining the native language of an English essay writer depends on more than just word choice. Competing methods to topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions for generating and analyzing a large space of non-topical features. First, structural parse tree features are based solely on the structural properties of a parse tree, ignoring all of the syntactic categories in the tree. An important advantage of these "skeletons" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences. These edit features are especially useful since they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis (CCLA) is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features.
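    To make the skeleton idea concrete, here is a brief sketch that strips every syntactic category from a parse tree so that only its shape remains. It uses nltk.Tree for illustration; the single "*" placeholder encoding is an assumed simplification, not the thesis implementation.

```python
# Reduce a parse tree to its structural skeleton by erasing all labels.
from nltk import Tree

def skeleton(tree):
    """Replace all node labels (and words) with '*', keeping the shape."""
    if isinstance(tree, Tree):
        return Tree("*", [skeleton(child) for child in tree])
    return "*"  # leaf tokens are anonymized as well

parse = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")
print(skeleton(parse))  # (* (* (* *) (* *)) (* (* *)))
```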