111 research outputs found
Research in the Language, Information and Computation Laboratory of the University of Pennsylvania
This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania.
It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as: Combinatorial Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition.
Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it’s easier than ever to do so: this document is accessible on the “information superhighway”. Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html
In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors’ abstracts in the web version of this report.
The abstracts describe the researchers’ many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn
Latent-Variable PCFGs: Background and Applications
Latent-variable probabilistic context-free grammars are
latent-variable models that are based on context-free grammars.
Nonterminals are associated with latent states that provide
contextual information during the top-down rewriting process of
the grammar.
We survey a few of the techniques used to estimate such grammars
and to parse text with them. We also give an overview of what the latent
states represent for English Penn treebank parsing, and provide
an overview of extensions and related models to these grammars
Cross-lingual Semantic Parsing with Categorial Grammars
Humans communicate using natural language. We need to make sure that computers can understand us so that they can act on our spoken commands or independently gain new insights from knowledge that is written down as text. A “semantic parser” is a program that translates natural-language sentences into computer commands or logical formulas–something a computer can work with. Despite much recent progress on semantic parsing, most research focuses on English, and semantic parsers for other languages cannot keep up with the developments. My thesis aims to help close this gap. It investigates “cross-lingual learning” methods by which a computer can automatically adapt a semantic parser to another language, such as Dutch. The computer learns by looking at example sentences and their translations, e.g., “She likes to read books”/”Ze leest graag boeken”. Even with many such examples, learning which word means what and how word meanings combine into sentence meanings is a challenge, because translations are rarely word-for-word. They exhibit grammatical differences and non-literalities. My thesis presents a method for tackling these challenges based on the grammar formalism Combinatory Categorial Grammar. It shows that this is a suitable formalism for this purpose, that many structural differences between sentences and their translations can be dealt with in this framework, and that a (rudimentary) semantic parser for Dutch can be learned cross-lingually based on one for English. I also investigate methods for building large corpora of texts annotated with logical formulas to further study and improve semantic parsers
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Headdriven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages including English, French, Modern Greek, Hebrew, Norwegian), and various applications (namely MWE detection, parsing, automatic translation) using both symbolic and statistical approaches
Recommended from our members
Inducing grammars from linguistic universals and realistic amounts of supervision
The best performing NLP models to date are learned from large volumes of manually-annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input — even that which can be collected in just a few hours — can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation. Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.Computer Science
Answering Deep Queries Specified in Natural Language with Respect to a Frame Based Knowledge Base and Developing Related Natural Language Understanding Components
abstract: Question Answering has been under active research for decades, but it has recently taken the spotlight following IBM Watson's success in Jeopardy! and digital assistants such as Apple's Siri, Google Now, and Microsoft Cortana through every smart-phone and browser. However, most of the research in Question Answering aims at factual questions rather than deep ones such as ``How'' and ``Why'' questions.
In this dissertation, I suggest a different approach in tackling this problem. We believe that the answers of deep questions need to be formally defined before found.
Because these answers must be defined based on something, it is better to be more structural in natural language text; I define Knowledge Description Graphs (KDGs), a graphical structure containing information about events, entities, and classes. We then propose formulations and algorithms to construct KDGs from a frame-based knowledge base, define the answers of various ``How'' and ``Why'' questions with respect to KDGs, and suggest how to obtain the answers from KDGs using Answer Set Programming. Moreover, I discuss how to derive missing information in constructing KDGs when the knowledge base is under-specified and how to answer many factual question types with respect to the knowledge base.
After having the answers of various questions with respect to a knowledge base, I extend our research to use natural language text in specifying deep questions and knowledge base, generate natural language text from those specification. Toward these goals, I developed NL2KR, a system which helps in translating natural language to formal language. I show NL2KR's use in translating ``How'' and ``Why'' questions, and generating simple natural language sentences from natural language KDG specification. Finally, I discuss applications of the components I developed in Natural Language Understanding.Dissertation/ThesisDoctoral Dissertation Computer Science 201
- …