New Methods, Current Trends and Software Infrastructure for NLP
The increasing use of `new methods' in NLP, which the NeMLaP conference
series exemplifies, occurs in the context of a wider shift in the nature and
concerns of the discipline. This paper begins with a short review of this
context and significant trends in the field. The review motivates and leads to
a set of requirements for support software of general utility for NLP research
and development workers. A freely-available system designed to meet these
requirements is described (called GATE - a General Architecture for Text
Engineering). Information Extraction (IE), in the sense defined by the Message
Understanding Conferences (ARPA \cite{Arp95}), is an NLP application in which
many of the new methods have found a home (Hobbs \cite{Hob93}; Jacobs ed.
\cite{Jac92}). An IE system based on GATE is also available for research
purposes, and this is described. Lastly, we review related work. Comment: 12 pages, LaTeX, uses nemlap.sty (included)
GATE -- an Environment to Support Research and Development in Natural Language Engineering
We describe a software environment to support research and development in natural language (NL) engineering. This environment -- GATE (General Architecture for Text Engineering) -- aims to advance research in the area of machine processing of natural languages by providing a software infrastructure on top of which heterogeneous NL component modules may be evaluated and refined individually or may be combined into larger application systems. Thus, GATE aims to support both researchers and developers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language learning). GATE will promote reuse of component technology, permit specialisation and collaboration in large-scale projects, and allow for the comparison and evaluation of alternative technologies. The first release of GATE is now available.
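The component-pipeline idea described above can be sketched as follows. This is a minimal illustration of heterogeneous modules sharing a common document interface so they can be evaluated individually or chained; all class and attribute names here are hypothetical, not GATE's actual API.

```python
# Sketch of a component pipeline in the spirit of GATE: each module reads
# and writes annotation layers on a shared document object, so components
# can be swapped, compared, or composed into larger applications.

class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = {}   # layer name -> annotation list

class Tokeniser:
    def process(self, doc):
        # Toy whitespace tokeniser standing in for a real component.
        doc.annotations["tokens"] = doc.text.split()
        return doc

class Tagger:
    def process(self, doc):
        # Toy tagger: mark capitalised tokens as proper nouns.
        doc.annotations["pos"] = [
            "NNP" if tok[0].isupper() else "NN"
            for tok in doc.annotations["tokens"]
        ]
        return doc

def run_pipeline(components, doc):
    for component in components:
        doc = component.process(doc)
    return doc

doc = run_pipeline([Tokeniser(), Tagger()], Document("GATE supports reuse"))
```

Because each component only touches annotation layers, an alternative tagger can be dropped in and evaluated against the same documents, which is the comparison-and-evaluation use case the abstract describes.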
Using Decision Trees for Coreference Resolution
This paper describes RESOLVE, a system that uses decision trees to learn how
to classify coreferent phrases in the domain of business joint ventures. An
experiment is presented in which the performance of RESOLVE is compared to the
performance of a manually engineered set of rules for the same task. The
results show that decision trees achieve higher performance than the rules in
two of three evaluation metrics developed for the coreference task. In addition
to achieving better performance than the rules, RESOLVE provides a framework
that facilitates the exploration of the types of knowledge that are useful for
solving the coreference problem. Comment: 6 pages; LaTeX source; 1 uuencoded compressed EPS file (separate); uses ijcai95.sty, named.bst, epsf.tex; to appear in Proc. IJCAI '95
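The classification step described above can be sketched as a tiny decision tree over features of a phrase pair. The features and branching order below are invented for illustration; RESOLVE learned its trees automatically from annotated joint-venture texts rather than using a hand-built tree like this one.

```python
# Sketch of decision-tree classification for coreference, in the spirit of
# RESOLVE: a phrase pair is described by boolean features and routed down
# a tree to a coreferent / not-coreferent decision.

def classify_coreferent(features):
    """Hand-built illustrative tree over boolean features of a phrase pair."""
    if features["string_match"]:
        # Identical strings: assume coreferent.
        return True
    if features["same_semantic_class"]:
        # Compatible entity types: fall back on proximity.
        return features["same_sentence"]
    return False

pair = {"string_match": False, "same_semantic_class": True, "same_sentence": True}
decision = classify_coreferent(pair)
```

A learned tree has the same shape but chooses its tests and their order from training data, which is what lets the abstract's experiments explore which knowledge sources actually help.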
A Progressive Visual Analytics Tool for Incremental Experimental Evaluation
This paper presents a visual tool, AVIATOR, that integrates the progressive
visual analytics paradigm in the IR evaluation process. This tool serves to
speed-up and facilitate the performance assessment of retrieval models enabling
a result analysis through visual facilities. AVIATOR goes one step beyond the
common "compute-wait-visualize" analytics paradigm, introducing a continuous
evaluation mechanism that minimizes human and computational resource
consumption.
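The continuous evaluation mechanism can be sketched as incrementally updating a metric while ranked results stream in, instead of recomputing it after a full run. The class and metric below are illustrative assumptions, not AVIATOR's actual implementation.

```python
# Sketch of incremental metric updates in the spirit of progressive visual
# analytics: precision is refreshed after every retrieved document, so the
# analyst sees an evolving estimate instead of waiting for the whole run.

class IncrementalPrecision:
    def __init__(self, relevant_ids):
        self.relevant = set(relevant_ids)
        self.seen = 0
        self.hits = 0

    def update(self, doc_id):
        """Consume the next ranked document; return the current precision."""
        self.seen += 1
        if doc_id in self.relevant:
            self.hits += 1
        return self.hits / self.seen

evaluator = IncrementalPrecision(relevant_ids={"d1", "d3"})
snapshots = [evaluator.update(d) for d in ["d1", "d2", "d3"]]
```

Each snapshot costs constant time, which is what makes it cheap enough to drive a visualisation continuously and to stop early once the estimate stabilises.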
Neural Vector Spaces for Unsupervised Information Retrieval
We propose the Neural Vector Space Model (NVSM), a method that learns
representations of documents in an unsupervised manner for news article
retrieval. In the NVSM paradigm, we learn low-dimensional representations of
words and documents from scratch using gradient descent and rank documents
according to their similarity with query representations that are composed from
word representations. We show that NVSM performs better at document ranking
than existing latent semantic vector space methods. The addition of NVSM to a
mixture of lexical language models and a state-of-the-art baseline vector space
model yields a statistically significant increase in retrieval effectiveness.
Consequently, NVSM adds a complementary relevance signal. Next to semantic
matching, we find that NVSM performs well in cases where lexical matching is
needed.
NVSM learns a notion of term specificity directly from the document
collection without feature engineering. We also show that NVSM learns
regularities related to Luhn significance. Finally, we give advice on how to
deploy NVSM in situations where model selection (e.g., cross-validation) is
infeasible. We find that an unsupervised ensemble of multiple models trained
with different hyperparameter values performs better than a single
cross-validated model. Therefore, NVSM can safely be used for ranking documents
without supervised relevance judgments. Comment: TOIS 2018
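The ranking step described above (composing a query representation from word representations and scoring documents by similarity) can be sketched as averaging word vectors and ranking by cosine similarity. The tiny two-dimensional vectors below are made up for illustration; NVSM learns its low-dimensional representations from scratch by gradient descent.

```python
import math

# Sketch of NVSM-style ranking: the query vector is the average of its
# word vectors, and documents are ranked by cosine similarity to it.

word_vecs = {"neural": [0.9, 0.1], "retrieval": [0.2, 0.8]}
doc_vecs = {"doc_a": [0.6, 0.4], "doc_b": [-0.5, 0.9]}

def compose_query(terms):
    """Average the word vectors of the query terms."""
    dims = len(next(iter(word_vecs.values())))
    vec = [0.0] * dims
    for term in terms:
        for i, x in enumerate(word_vecs[term]):
            vec[i] += x / len(terms)
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

query = compose_query(["neural", "retrieval"])
ranking = sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True)
```

The unsupervised ensemble the abstract recommends would simply average such similarity scores across several models trained with different hyperparameters before sorting.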
Software as theory: a case study in the domain of text analysis
This article proposes a reflection on a specific way of envisioning and valorising the scholarly contribution of scientific software, namely by making explicit the model of data analysis that underlies it. It seeks to illustrate this way of studying a software construct by applying it to a particular text analysis program. Fundamental aspects of this program's design (input and output, data structures, process model, and user interface) are reviewed and discussed from the point of view of their implications in terms of theoretical commitments to a specific conception of text and text analysis. The conclusions of this case study notably emphasise the central role of user modelling in the assessment of scientific software's epistemological contribution as well as the necessity of extending the proposed approach to a broader range of software applications.
BioC: a minimalist approach to interoperability for biomedical text processing
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net
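The interchange format described above can be sketched by parsing a BioC-style document with the standard library. The snippet below is simplified for illustration (it keeps only document, passage, and annotation elements, and the example text and identifiers are invented); see http://bioc.sourceforge.net/ for the full format and its reference implementations.

```python
import xml.etree.ElementTree as ET

# Sketch of reading a BioC-style annotation file: annotations carry a type
# (via an infon key/value pair), a character offset into the passage text,
# and the annotated text itself.

bioc_xml = """
<collection>
  <document>
    <id>PMC12345</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 mutations increase cancer risk.</text>
      <annotation id="T1">
        <infon key="type">gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>
"""

root = ET.fromstring(bioc_xml)
annotations = [
    (ann.find("infon").text, ann.find("text").text)
    for ann in root.iter("annotation")
]
```

Because the format is plain XML with this small, flat vocabulary, any tool that can read and write it can exchange sentences, tokens, named entities, or relations with any other, which is the interoperability goal the abstract describes.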