236 research outputs found

    Software Infrastructure for Natural Language Processing

    Full text link
    We classify and review current approaches to software infrastructure for research, development and delivery of NLP systems. The task is motivated by a discussion of current trends in the field of NLP and Language Engineering. We describe a system called GATE (a General Architecture for Text Engineering) that provides a software infrastructure on top of which heterogeneous NLP processing modules may be evaluated and refined individually, or may be combined into larger application systems. GATE aims to support both researchers and developers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language learning). GATE promotes reuse of component technology, permits specialisation and collaboration in large-scale projects, and allows for the comparison and evaluation of alternative technologies. The first release of GATE is now available - see http://www.dcs.shef.ac.uk/research/groups/nlp/gate/Comment: LaTeX, uses aclap.sty, 8 page

    Supporting text mining for e-Science: the challenges for Grid-enabled natural language processing

    Get PDF
    Over the last few years, language technology has moved rapidly from 'applied research' to 'engineering', and from small-scale to large-scale engineering. Applications such as advanced text mining systems are feasible, but very resource-intensive, while research seeking to address the underlying language processing questions faces very real practical and methodological limitations. The e-Science vision, and the creation of the e-Science Grid, promises the level of integrated large-scale technological support required to sustain this important and successful new technology area. In this paper, we discuss the foundations for the deployment of text mining and other language technology on the Grid - the protocols and tools required to build distributed large-scale language technology systems, meeting the needs of users, application builders and researchers

    Corpora for Computational Linguistics

    Get PDF
    Since the mid 90s corpora has become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation and term extraction. Their influence on other fields is also briefly discussed

    Development of linguistic linked open data resources for collaborative data-intensive research in the language sciences

    Get PDF
    Making diverse data in linguistics and the language sciences open, distributed, and accessible: perspectives from language/language acquistiion researchers and technical LOD (linked open data) researchers. This volume examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and accessible, thus fostering wide data sharing and collaboration. It is unique in integrating the perspectives of language researchers and technical LOD (linked open data) researchers. Reporting on both active research needs in the field of language acquisition and technical advances in the development of data interoperability, the book demonstrates the advantages of an international infrastructure for scholarship in the field of language sciences. With contributions by researchers who produce complex data content and scholars involved in both the technology and the conceptual foundations of LLOD (linguistics linked open data), the book focuses on the area of language acquisition because it involves complex and diverse data sets, cross-linguistic analyses, and urgent collaborative research. The contributors discuss a variety of research methods, resources, and infrastructures. Contributors Isabelle Barrière, Nan Bernstein Ratner, Steven Bird, Maria Blume, Ted Caldwell, Christian Chiarcos, Cristina Dye, Suzanne Flynn, Claire Foley, Nancy Ide, Carissa Kang, D. Terence Langendoen, Barbara Lust, Brian MacWhinney, Jonathan Masci, Steven Moran, Antonio Pareja-Lora, Jim Reidy, Oya Y. Rieger, Gary F. Simons, Thorsten Trippel, Kara Warburton, Sue Ellen Wright, Claus Zin

    Automatic Extraction and Generation of XML Documents from Financial Reports

    Get PDF
    Web services require XML formatted data. Human translation of business information from the rapidly expanding volume of documents to XML is labor-intensive and impractical. Computer programs can be built to extract domain-specific facts from web documents and convert them into an XML format. With a continual feed of web articles, such a system could be used to maintain an up-to-date XML knowledge base that could power web services for businesses. In this research, we build a system to automatically extract information from electronic international corporate financial reports, and translate this information into XML or XBRL (a well-known XML extension for accounting and financial data)

    An evaluation of Lolita and related natural language processing systems

    Get PDF
    This research addresses the question, "how do we evaluate systems like LOLITA?" LOLITA is the Natural Language Processing (NLP) system under development at the University of Durham. It is intended as a platform for building NL applications. We are therefore interested in questions of evaluation for such general NLP systems. The thesis has two, parts. The first, and main, part concerns the participation of LOLITA in the Sixth Message Understanding Conference (MUC-6). The MUC-relevant portion of LOLITA is described in detail. The adaptation of LOLITA for MUC-6 is discussed, including work undertaken by the author. Performance on a specimen article is analysed qualitatively, and in detail, with anonymous comparisons to competitors' output. We also examine current LOLITA performance. A template comparison tool was implemented to aid these analyses. The overall scores are then considered. A methodology for analysis is discussed, and a comparison made with current scores. The comparison tool is used to analyse how systems performed relative to each-other. One method, Correctness Analysis, was particularly interesting. It provides a characterisation of task difficulty, and indicates how systems approached a task. Finally, MUC-6 is analysed. In particular, we consider the methodology and ways of interpreting the results. Several criticisms of MUC-6 are made, along with suggestions for future MUC-style events. The second part considers evaluation from the point of view of general systems. A literature review shows a lack of serious work on this aspect of evaluation. A first principles discussion of evaluation, starting from a view of NL systems as a particular kind of software, raises several interesting points for single task evaluation. No evaluations could be suggested for general systems; their value was seen as primarily economic. That is, we are unable to analyse their linguistic capability directly

    Development of Linguistic Linked Open Data Resources for Collaborative Data-Intensive Research in the Language Sciences

    Get PDF
    This book is the product of an international workshop dedicated to addressing data accessibility in the linguistics field. It is therefore vital to the book’s mission that its content be open access. Linguistics as a field remains behind many others as far as data management and accessibility strategies. The problem is particularly acute in the subfield of language acquisition, where international linguistic sound files are needed for reference. Linguists' concerns are very much tied to amount of information accumulated by individual researchers over the years that remains fragmented and inaccessible to the larger community. These concerns are shared by other fields, but linguistics to date has seen few efforts at addressing them. This collection, undertaken by a range of leading experts in the field, represents a big step forward. Its international scope and interdisciplinary combination of scholars/librarians/data consultants will provide an important contribution to the field
    • …
    corecore