Corpus Annotation for Parser Evaluation
We describe a recently developed corpus annotation scheme for evaluating
parsers that avoids shortcomings of current methods. The scheme encodes
grammatical relations between heads and dependents, and has been used to mark
up a new public-domain corpus of naturally occurring English text. We show how
the corpus can be used to evaluate the accuracy of a robust parser, and relate
the corpus to extant resources.
Comment: 7 pages, LaTeX (uses eaclap.sty)
Multilingual collocation extraction with a syntactic parser
An impressive amount of work has been devoted to collocation extraction over the past few decades. The state of the art shows a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is highly important, especially with a view to the subsequent integration of extraction results in other NLP applications.
D6.1: Technologies and Tools for Lexical Acquisition
This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).
D4.1. Technologies and tools for corpus creation, normalization and annotation
The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC) which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing, and named entity recognition.
Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines
Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
Natural Language Processing at the School of Information Studies for Africa
The lack of persons trained in computational linguistic methods is a severe obstacle to making the Internet and computers accessible to people all over the world in their own languages.
The paper discusses the experiences of designing and teaching an introductory course in Natural Language Processing to graduate computer science students at Addis Ababa University, Ethiopia, in order to initiate the education of computational linguists in the Horn of Africa region.
Uvid u automatsko izlučivanje metaforičkih kolokacija
Collocations have been the subject of much scientific research over the years. The focus of this research is on a subset of collocations, namely metaphorical collocations. In metaphorical collocations, a semantic shift has taken place in one of the components, i.e., one of the components takes on a transferred meaning. The main goal of this paper is to review the existing literature and provide a systematic overview of the existing research on collocation extraction, as well as an overview of existing methods, measures, and resources. The existing research is classified according to the approach (statistical, hybrid, and distributional semantics) and presented in three separate sections. Various association measures and existing ways of evaluating the results of automatic collocation extraction are also described. The insights gained from existing research serve as a first step in exploring the possibility of developing a method for automatic extraction of metaphorical collocations. The methods, tools, and resources that may prove useful for future work are highlighted.