27 research outputs found

    Similarity rules! Exploring methods for ad-hoc rule detection

    We examine the role of similarity in ad hoc rule detection and show how previous methods can be made more corpus-independent and more generally applicable. Specifically, we show that the similarity of a rule to others in the grammar is a crucial factor in determining the reliability of a rule, providing information that frequency alone cannot supply. We also include a way to score rules which are not in the training data, thereby providing a platform for grammar generalization.
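
    A minimal sketch of the idea, under the assumption that rules are CFG productions represented as sequences of category labels; the toy grammar, the choice of similarity function, and the scoring scheme below are illustrative, not the paper's actual method:

    # Hypothetical sketch: score a (possibly unseen) CFG rule by its similarity
    # to rules already in the grammar, rather than by its corpus frequency.
    from difflib import SequenceMatcher

    grammar = {
        ("NP", ("DT", "NN")),
        ("NP", ("DT", "JJ", "NN")),
        ("VP", ("VBD", "NP")),
    }

    def rule_similarity(rhs_a, rhs_b):
        # Proportion of matching category labels between two right-hand sides.
        return SequenceMatcher(None, rhs_a, rhs_b).ratio()

    def rule_score(lhs, rhs, grammar):
        # A rule is judged more reliable if it closely resembles attested rules
        # with the same left-hand side; unseen rules still receive a score.
        peers = [r for (l, r) in grammar if l == lhs]
        if not peers:
            return 0.0
        return max(rule_similarity(rhs, peer) for peer in peers)

    print(rule_score("NP", ("DT", "JJ", "JJ", "NN"), grammar))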

    Mastering the Deluge of Heterogeneous Data

    Natural language processing increasingly draws on large textual corpora for knowledge acquisition. The current obstacle is no longer the availability of corpora, nor even their size, but the heterogeneity of the data gathered under that name. In this article, we examine the heterogeneity that articles from Le Monde exhibit when they are grouped according to the newspaper's editorial sections. The consequences of such heterogeneity for tagging and parsing are highlighted. Building on this observation, we define the notion of "corpus profiling" by means of tools for assessing the homogeneity of a corpus (over-use of vocabulary, of morpho-syntactic categories, or of patterns) and the use that can be made of it.
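
    As a rough illustration of one such homogeneity signal, the sketch below scores vocabulary over-use in a subcorpus against the full corpus; the log-ratio scoring and the threshold are assumptions, not the tool described in the article:

    # Hypothetical profiling signal: words "over-used" in a subcorpus
    # relative to the full corpus.
    from collections import Counter
    import math

    def overuse(sub_tokens, all_tokens, min_count=5):
        # Assumes the subcorpus is part of the full corpus, so every word
        # counted in `sub` also occurs in `full`.
        sub, full = Counter(sub_tokens), Counter(all_tokens)
        n_sub, n_full = len(sub_tokens), len(all_tokens)
        scores = {}
        for word, count in sub.items():
            if count < min_count:
                continue
            # Log relative-frequency ratio: > 0 means over-represented.
            scores[word] = math.log((count / n_sub) / (full[word] / n_full))
        return sorted(scores.items(), key=lambda kv: -kv[1])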

    Corpus-based acquisition and evaluation of syntactic subcategorization properties

    We carry out an experiment aimed at integrating subcategorization information into a syntactic parser for PP attachment disambiguation. The subcategorization lexicon consists of probabilities linking a word (verb, noun, or adjective) to a preposition. The lexicon is acquired automatically from a 200-million-word corpus that is partially tagged and parsed. To assess the lexicon, we use four corpora that differ in genre and domain. We assess several methods for PP attachment disambiguation: an exogenous method relies on the subcategorization lexicon, an endogenous method relies on the corpus-specific resource only, and a hybrid method makes use of both. The hybrid method proves to be the best, with results varying from 79.4% to 87.2%.
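
    A toy sketch of how such a lexicon might drive the attachment decision; the probability values, the back-off scheme, and the function names are invented for illustration and are not the authors' implementation:

    # Illustrative PP attachment decision: the lexicon maps a governing word
    # to the probability that it takes a given preposition; attachment goes
    # to the candidate governor with the higher probability.
    exogenous = {("manger", "avec"): 0.4, ("fourchette", "avec"): 0.1}   # general lexicon
    endogenous = {}                                                      # corpus-specific

    def p(lexicon, word, prep):
        return lexicon.get((word, prep), 0.0)

    def attach(verb, noun, prep):
        # Hybrid strategy: prefer the corpus-specific estimate, back off to
        # the general lexicon when the pair was not seen in the corpus.
        pv = p(endogenous, verb, prep) or p(exogenous, verb, prep)
        pn = p(endogenous, noun, prep) or p(exogenous, noun, prep)
        return "verb" if pv >= pn else "noun"

    print(attach("manger", "fourchette", "avec"))  # -> "verb"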

    Text profiling: a framework and an experiment

    The growing use of "very large corpora" in Natural Language Processing (NLP), as in textual analysis, requires control over the lexical, morpho-syntactic, and syntactic homogeneity of the data used. This presupposes, upstream, the development of text calibration tools. We build such tools and the associated methodology within the framework of the ELRA call for proposals Contribution à la réalisation de corpus du français contemporain. We present the first results of this approach on the broadcast speeches of De Gaulle and Mitterrand, and we draw the lessons of this experiment for the features we use to profile texts.

    Measures for corpus similarity and homogeneity

    How similar are two corpora? A measure of corpus similarity would be very useful for NLP for many purposes, such as estimating the work involved in porting a system from one domain to another. First, we discuss difficulties in identifying what we mean by 'corpus similarity': human similarity judgements are not fine-grained enough, corpus similarity is inherently multidimensional, and similarity can only be interpreted in the light of corpus homogeneity. We then present an operational definition of corpus similarity which addresses or circumvents the problems, using purpose-built sets of "known-similarity corpora". These KSC sets can be used to evaluate the measures. We evaluate the measures described in the literature, including three variants of the information-theoretic measure 'perplexity'. A χ²-based measure, using word frequencies, is shown to be the best of those tested. The Problem: How similar are two corpora? The question arises on many occasions. In NLP, many useful results can be generated from corpora, but when can the results developed using one corpus be applied to another? How much will it cost to port an NLP application from one domain, with one corpus, to another, with another? For linguistics, does it matter whether language researchers use this corpus or that, or are they similar enough for it to make no difference? There are also questions of more general interest. Looking at British national newspapers: is the Independent more like the Guardian or the Telegraph? What are the constraints on a measure for corpus similarity? The first is simply that its findings correspond to unequivocal human judgements.
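
    The sketch below illustrates a chi-square word-frequency comparison in the spirit of the measure named above; the word-list size and the normalisation are assumptions, not the paper's exact formulation:

    # Chi-square corpus comparison over the most frequent shared words:
    # lower values indicate more similar corpora.
    from collections import Counter

    def chi_square_distance(tokens_a, tokens_b, n_words=500):
        fa, fb = Counter(tokens_a), Counter(tokens_b)
        na, nb = len(tokens_a), len(tokens_b)
        common = [w for w, _ in (fa + fb).most_common(n_words)]
        chi2 = 0.0
        for w in common:
            oa, ob = fa[w], fb[w]
            # Expected counts if both corpora drew w at the pooled rate.
            ea = (oa + ob) * na / (na + nb)
            eb = (oa + ob) * nb / (na + nb)
            chi2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
        return chi2 / len(common)   # normalise so corpus pairs are comparable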

    A Dependency Parsing Approach to Biomedical Text Mining

    Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language. This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken uses full parsing (syntactic analysis of the entire structure of sentences) and machine learning, aiming to develop reliable methods that can further be generalized to other domains. The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general-English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance, and we introduced a domain-adapted parser that reduced the error rate of the original by 10% while also roughly halving parsing time. To establish the relative merits of parsers that differ in the applied formalisms and in the representation given to their syntactic analyses, we also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unifying diverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization. To support this and other research, we also designed and annotated BioInfer, the first domain corpus of its size to combine annotation of syntax and biomedical entities with a detailed annotation of their relationships.
    The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6,000 entities, 2,500 relationships, and 28,000 syntactic dependencies in 1,100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships. Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.
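
    As a minimal illustration of the representation-unification idea, the sketch below renames scheme-specific dependency labels into a shared label set; the mapping entries are invented, and the thesis's actual conversions also involve structural transformation, not just relabelling:

    # Hypothetical unification of dependency labels from different parsers
    # into one shared, application-oriented label set.
    SCHEME_MAPS = {
        "parserA": {"nsubj": "subject", "dobj": "object"},
        "parserB": {"SBJ": "subject", "OBJ": "object"},
    }

    def unify(dependencies, scheme):
        # Each dependency is (head, label, dependent); unknown labels pass
        # through unchanged.
        table = SCHEME_MAPS[scheme]
        return [(h, table.get(lbl, lbl), d) for (h, lbl, d) in dependencies]

    print(unify([("binds", "SBJ", "protein")], "parserB"))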

    Exploiting multi-word units in statistical parsing and generation

    Syntactic parsing is an important prerequisite for many natural language processing (NLP) applications. The task refers to the process of generating the tree of syntactic nodes, with associated phrase category labels, corresponding to a sentence. Our objective is to improve upon statistical models for syntactic parsing by leveraging multi-word units (MWUs) such as named entities and other classes of multi-word expressions. Multi-word units are phrases that are lexically, syntactically and/or semantically idiosyncratic in that they are to at least some degree non-compositional. If such units are identified prior to, or as part of, the parsing process, their boundaries can be exploited as islands of certainty within the very large (and often highly ambiguous) search space. Fortunately, certain types of MWUs can be readily identified automatically (using a variety of techniques) to a near-human level of accuracy. We carry out a number of experiments which integrate knowledge about different classes of MWUs into several commonly deployed parsing architectures. In a supplementary set of experiments, we attempt to exploit these units in the converse operation to statistical parsing: statistical generation (in our case, surface realisation from Lexical-Functional Grammar f-structures). We show that, by exploiting knowledge about MWUs, certain classes of parsing and generation decisions are more accurately resolved. This translates to improvements in overall parsing and generation results which, although modest, are demonstrably significant.
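
    A minimal sketch of the islands-of-certainty idea, assuming identified MWUs are simply merged into opaque tokens before parsing; the merging scheme and example are illustrative, not the thesis's actual integration method:

    # Merge multi-word units into single tokens so a downstream parser
    # cannot analyse their internal structure.
    def merge_mwus(tokens, mwus):
        out, i = [], 0
        while i < len(tokens):
            for mwu in mwus:
                if tokens[i:i + len(mwu)] == list(mwu):
                    out.append("_".join(mwu))   # one opaque token per MWU
                    i += len(mwu)
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    print(merge_mwus("He works for the New York Times".split(),
                     [("New", "York", "Times")]))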

    Maximum Entropy Models For Natural Language Ambiguity Resolution

    This thesis demonstrates that several important kinds of natural language ambiguities can be resolved to state-of-the-art accuracies using a single statistical modeling technique based on the principle of maximum entropy. We discuss the problems of sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization under the maximum entropy framework. In practice, we have found that maximum entropy models offer the following advantages. State-of-the-art accuracy: the probability models for all of the tasks discussed perform at or near state-of-the-art accuracies, or outperform competing learning algorithms when trained and tested under similar conditions; methods which outperform those presented here require much more supervision in the form of additional human involvement or additional supporting resources. Knowledge-poor features: the facts used to model the data, or features, are linguistically very simple, or knowledge-poor, yet succeed in approximating complex linguistic relationships. Reusable software technology: the mathematics of the maximum entropy framework are essentially independent of any particular task, and a single software implementation can be used for all of the probability models in this thesis. The experiments in this thesis suggest that experimenters can obtain state-of-the-art accuracies on a wide range of natural language tasks, with little task-specific effort, by using maximum entropy probability models.
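
    For concreteness, the sketch below computes a conditional probability in the standard maximum entropy form, p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x); the toy task, features, and weights are invented for illustration:

    # Conditional maximum entropy model: binary feature functions f_i(x, y)
    # with weights lambda_i, normalised over the label set.
    import math

    def maxent_prob(x, y, labels, features, weights):
        def score(label):
            return sum(weights[i] for i, f in enumerate(features) if f(x, label))
        z = sum(math.exp(score(l)) for l in labels)   # normalising constant Z(x)
        return math.exp(score(y)) / z

    # Toy sentence-boundary task: does the token end a sentence?
    features = [
        lambda x, y: x.endswith(".") and y == "EOS",
        lambda x, y: x[0].isupper() and y == "EOS",
    ]
    weights = [1.8, -0.6]
    print(maxent_prob("etc.", "EOS", ["EOS", "NOT_EOS"], features, weights))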