87 research outputs found

    Development of a General-Purpose Categorial Grammar Treebank

    National Institute for Japanese Language and Linguistics; Ochanomizu University; The University of Tokyo

    ABC Treebank: An Infrastructure Resource for Interdisciplinary Language Research

    Conference: NINJAL (National Institute for Japanese Language and Linguistics) Open House 2021; Venue: Online; Date: September 10, 2021

    Towards a machine-learning architecture for lexical functional grammar parsing

    Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions that generalize robustly across multiple languages. The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks, we can reduce the amount of manual specification and improve robustness, accuracy and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve the acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing. In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is therefore important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.
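
    A minimal sketch of the lemmatization-classes idea, in Python (the helper names and toy word pairs below are invented for illustration and are not taken from the thesis): each (form, lemma) pair is reduced to a suffix edit rule, and it is these rules, rather than whole lemmas, that a supervised classifier would be trained to predict alongside morphological features.

        # Derive a suffix edit rule (a "lemmatization class") from a form/lemma pair.
        def edit_class(form, lemma):
            i = 0
            while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
                i += 1
            # e.g. ("walking", "walk") -> (3, ""): cut three characters, append nothing
            return (len(form) - i, lemma[i:])

        # Apply a learned rule to an unseen form.
        def apply_class(form, rule):
            cut, suffix = rule
            return (form[:-cut] if cut else form) + suffix

        # Toy training pairs; a real system would extract these from a treebank.
        pairs = [("walking", "walk"), ("cities", "city"), ("ran", "run")]
        classes = {form: edit_class(form, lemma) for form, lemma in pairs}
        print(classes)                                       # three distinct rules
        print(apply_class("talking", classes["walking"]))    # -> "talk"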

    CLiFF Notes: Research In Natural Language Processing at the University of Pennsylvania

    The Computational Linguistics Feedback Forum (CLiFF) is a group of students and faculty who gather once a week to discuss the members' current research. As the word feedback suggests, the group's purpose is the sharing of ideas. The group also promotes interdisciplinary contacts between researchers who share an interest in Cognitive Science. There is no single theme describing the research in Natural Language Processing at Penn. There is work done in CCG, Tree Adjoining Grammars, intonation, statistical methods, plan inference, instruction understanding, incremental interpretation, language acquisition, syntactic parsing, causal reasoning, free word order languages, ... and many other areas. With this in mind, rather than trying to summarize the varied work currently underway here at Penn, we suggest reading the following abstracts to see how the students and faculty themselves describe their work. Their abstracts illustrate the diversity of interests among the researchers, explain the areas of common interest, and describe some very interesting work in Cognitive Science. This report is a collection of abstracts from both faculty and graduate students in Computer Science, Psychology and Linguistics. We pride ourselves on the close working relations between these groups, as we believe that the communication among the different departments and the ongoing inter-departmental research not only improves the quality of our work, but makes much of that work possible.

    CLiFF Notes: Research in the Language Information and Computation Laboratory of The University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science, Psychology, and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatory Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface, but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. With 48 individual contributors and six projects represented, this is the largest LINC Lab collection to date, and the most diverse.

    CLiFF Notes: Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    One concern of the Computer Graphics Research Lab is simulating human task behavior and understanding why visualizing the appearance, capabilities and performance of humans is so challenging. Our research has produced a system, called Jack, for the definition, manipulation, animation and human factors analysis of simulated human figures. Jack permits the envisionment of human motion by interactive specification and simultaneous execution of multiple constraints, and is sensitive to such issues as body shape and size, linkage, and plausible motions. Enhanced control is provided by natural behaviors such as looking, reaching, balancing, lifting, stepping, walking, grasping, and so on. Although intended for highly interactive applications, Jack is a foundation for other research. The very ubiquitousness of other people in our lives poses a tantalizing challenge to the computational modeler: people are at once the most common objects around us, and yet the most structurally complex. Their everyday movements are amazingly fluid, yet demanding to reproduce, with actions driven not just mechanically by muscles and bones but also cognitively by beliefs and intentions. Our motor systems manage to learn how to make us move without leaving us the burden or pleasure of knowing how we did it. Likewise, we learn how to describe the actions and behaviors of others without consciously struggling with the processes of perception, recognition, and language. Present technology lets us approach human appearance and motion through computer graphics modeling and three-dimensional animation, but there is considerable distance to go before purely synthesized figures trick our senses. We seek to build computational models of human-like figures which manifest animacy and convincing behavior. Towards this end, we: create an interactive computer graphics human model; endow it with reasonable biomechanical properties; provide it with human-like behaviors; use this simulated figure as an agent to effect changes in its world; and describe and guide its tasks through natural language instructions. There are presently no perfect solutions to any of these problems; ultimately, however, we should be able to give our surrogate human directions that, in conjunction with suitable symbolic reasoning processes, make it appear to behave in a natural, appropriate, and intelligent fashion. Compromises will be essential, due to limits in computation, throughput of display hardware, and demands of real-time interaction, but our algorithms aim to balance the physical device constraints with carefully crafted models, general solutions, and thoughtful organization. The Jack software is built on Silicon Graphics Iris 4D workstations because those systems have 3-D graphics features that greatly aid the process of interacting with highly articulated figures such as the human body. Of course, graphics capabilities themselves do not make a usable system. Our research has therefore focused on software to make the manipulation of a simulated human figure easy for a rather specific user population: human factors design engineers or ergonomics analysts involved in visualizing and assessing human motor performance, fit, reach, view, and other physical tasks in a workplace environment. The software also happens to be quite usable by others, including graduate students and animators.
The point, however, is that program design has tried to take into account a wide variety of physical, problem-oriented tasks, rather than just offer a computer graphics and animation tool for the already computer-sophisticated or skilled animator. As an alternative to interactive specification, a simulation system allows a convenient temporal and spatial parallel programming language for behaviors. The Graphics Lab is working with the Natural Language Group to explore the possibility of using natural language instructions, such as those found in assembly or maintenance manuals, to drive the behavior of our animated human agents. (See the CLiFF note entry for the AnimNL group for details.) Even though Jack is under continual development, it has nonetheless already proved to be a substantial computational tool in analyzing human abilities in physical workplaces. It is being applied to actual problems involving space vehicle inhabitants, helicopter pilots, maintenance technicians, foot soldiers, and tractor drivers. This broad range of applications is precisely the target we intended to reach. The general capabilities embedded in Jack attempt to mirror certain aspects of human performance, rather than the specific requirements of the corresponding workplace. We view the Jack system as the basis of a virtual animated agent that can carry out tasks and instructions in a simulated 3D environment. While we have not yet fooled anyone into believing that the Jack figure is real, its behaviors are becoming more reasonable and its repertoire of actions more extensive. When interactive control becomes more labor-intensive than natural language instructional control, we will have reached a significant milestone toward an intelligent agent.
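
    As a purely illustrative aside (this is not Jack's actual solver, and the numbers below are invented), the idea of executing multiple spatial constraints simultaneously can be reduced to its simplest form: each constraint contributes a goal position and a weight, and the compromise position that minimizes the weighted sum of squared distances to the goals is simply their weighted mean.

        # Each constraint is a (goal position, weight) pair, e.g. "reach the
        # handle" versus "keep the hand near the torso"; all values are made up.
        def satisfy(constraints):
            total = sum(weight for _, weight in constraints)
            return tuple(
                sum(weight * goal[axis] for goal, weight in constraints) / total
                for axis in range(3)
            )

        reach = ((0.6, 1.1, 0.4), 3.0)
        comfort = ((0.2, 1.0, 0.1), 1.0)
        print(satisfy([reach, comfort]))   # compromise position between the two goals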

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LRs) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which informs downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply the LRs needed by MT developers for every possible pair of European languages, textual domain, and genre. Moreover, an LR for a given language can never be considered complete or final because of the characteristics of natural language, which continually undergoes change, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts and the automatic collation of lexical information into LRs in a standardized format. The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need for LR users to tune the accuracy level of LRs. Some applications may require increased precision, where a high degree of confidence in the lexical information used is required. At other times a greater level of coverage may be needed, with information about more words at the expense of some degree of accuracy.
    Lexical acquisition in PANACEA has therefore investigated confidence thresholds, to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
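
    The precision/coverage trade-off can be made concrete with a small sketch (the entries, scores and threshold values are invented for illustration; this is not PANACEA code): automatically acquired lexical entries carry a confidence score, and the user chooses the threshold at which entries are admitted to the lexicon.

        # Toy output of an automatic acquirer: (lemma, subcategorization frame, confidence).
        acquired = [
            ("give", "NP_NP_NP", 0.95),
            ("give", "NP_PP", 0.90),
            ("sleep", "NP_NP", 0.30),   # probably noise
            ("rely", "NP_PP", 0.75),
        ]

        # Keep only entries whose confidence meets the chosen threshold.
        def build_lexicon(entries, threshold):
            return [(lemma, frame) for lemma, frame, score in entries if score >= threshold]

        print(len(build_lexicon(acquired, 0.9)))   # 2 entries: higher precision, lower coverage
        print(len(build_lexicon(acquired, 0.5)))   # 3 entries: broader coverage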