1,596 research outputs found
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora, is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances, to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processin
Unsupervised learning of relation detection patterns
L'extracció d'informació és l'à rea del processament de llenguatge natural l'objectiu de la qual és l'obtenir dades
estructurades a partir de la informació rellevant continguda en fragments textuals.
L'extracció d'informació requereix una quantitat considerable de coneixement lingüÃstic. La especificitat d'aquest
coneixement suposa un inconvenient de cara a la portabilitat dels sistemes, ja que un canvi d'idioma, domini o estil té un
cost en termes d'esforç humà . Durant dècades, s'han aplicat tècniques d'aprenentatge automà tic per tal de superar aquest
coll d'ampolla de portabilitat, reduint progressivament la supervisió humana involucrada. Tanmateix, a mida que augmenta
la disponibilitat de grans col·leccions de documents, esdevenen necessà ries aproximacions completament nosupervisades
per tal d'explotar el coneixement que hi ha en elles.
La proposta d'aquesta tesi és la d'incorporar tècniques de clustering a l'adquisició de patrons per a extracció d'informació,
per tal de reduir encara més els elements de supervisió involucrats en el procés En particular, el treball se centra en el
problema de la detecció de relacions. L'assoliment d'aquest objectiu final ha requerit, en primer lloc, el considerar les
diferents estratègies en què aquesta combinació es podia dur a terme; en segon lloc, el desenvolupar o adaptar algorismes
de clustering adequats a les nostres necessitats; i en tercer lloc, el disseny de procediments d'adquisició de patrons que
incorporessin la informació de clustering.
Al final d'aquesta tesi, havÃem estat capaços de desenvolupar i implementar una aproximació per a l'aprenentatge de
patrons per a detecció de relacions que, utilitzant tècniques de clustering i un mÃnim de supervisió humana, és competitiu i
fins i tot supera altres aproximacions comparables en l'estat de l'art.Information extraction is the natural language processing area whose goal is to obtain structured data from the relevant
information contained in textual fragments.
Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge supposes a
drawback on the portability of the systems, as a change of language, domain or style demands a costly human effort.
Machine learning techniques have been applied for decades so as to overcome this portability bottleneck¿progressively
reducing the amount of involved human supervision. However, as the availability of large document collections increases,
completely unsupervised approaches become necessary in order to mine the knowledge contained in them.
The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to
further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation
detection. The achievement of this ultimate goal has required, first, considering the different strategies in which this
combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third,
devising pattern learning procedures which incorporated clustering information.
By the end of this thesis, we had been able to develop and implement an approach for learning of relation detection patterns
which, using clustering techniques and minimal human supervision, is competitive and even outperforms other comparable
approaches in the state of the art.Postprint (published version
PromptRE: Weakly-Supervised Document-Level Relation Extraction via Prompting-Based Data Programming
Relation extraction aims to classify the relationships between two entities
into pre-defined categories. While previous research has mainly focused on
sentence-level relation extraction, recent studies have expanded the scope to
document-level relation extraction. Traditional relation extraction methods
heavily rely on human-annotated training data, which is time-consuming and
labor-intensive. To mitigate the need for manual annotation, recent
weakly-supervised approaches have been developed for sentence-level relation
extraction while limited work has been done on document-level relation
extraction. Weakly-supervised document-level relation extraction faces
significant challenges due to an imbalanced number "no relation" instances and
the failure of directly probing pretrained large language models for document
relation extraction. To address these challenges, we propose PromptRE, a novel
weakly-supervised document-level relation extraction method that combines
prompting-based techniques with data programming. Furthermore, PromptRE
incorporates the label distribution and entity types as prior knowledge to
improve the performance. By leveraging the strengths of both prompting and data
programming, PromptRE achieves improved performance in relation classification
and effectively handles the "no relation" problem. Experimental results on
ReDocRED, a benchmark dataset for document-level relation extraction,
demonstrate the superiority of PromptRE over baseline approaches
Multilingual opinion mining
170 p.Cada dÃa se genera gran cantidad de texto en diferentes medios online. Gran parte de ese texto contiene opiniones acerca de multitud de entidades, productos, servicios, etc. Dada la creciente necesidad de disponer de medios automatizados para analizar, procesar y explotar esa información, las técnicas de análisis de sentimiento han recibido gran cantidad de atención por parte de la industria y la comunidad cientÃfica durante la última década y media. No obstante, muchas de las técnicas empleadas suelen requerir de entrenamiento supervisado utilizando para ello ejemplos anotados manualmente, u otros recursos lingüÃsticos relacionados con un idioma o dominio de aplicación especÃficos. Esto limita la aplicación de este tipo de técnicas, ya que dicho recursos y ejemplos anotados no son sencillos de obtener. En esta tesis se explora una serie de métodos para realizar diversos análisis automáticos de texto en el marco del análisis de sentimiento, incluyendo la obtención automática de términos de un dominio, palabras que expresan opinión, polaridad del sentimiento de dichas palabras (positivas o negativas), etc. Finalmente se propone y se evalúa un método que combina representación continua de palabras (continuous word embeddings) y topic-modelling inspirado en la técnica de Latent Dirichlet Allocation (LDA), para obtener un sistema de análisis de sentimiento basado en aspectos (ABSA), que sólo necesita unas pocas palabras semilla para procesar textos de un idioma o dominio determinados. De este modo, la adaptación a otro idioma o dominio se reduce a la traducción de las palabras semilla correspondientes
Unsupervised and knowledge-poor approaches to sentiment analysis
Sentiment analysis focuses upon automatic classiffication of a document's sentiment (and more generally extraction of opinion from text). Ways of expressing sentiment have been
shown to be dependent on what a document is about (domain-dependency). This complicates supervised methods for sentiment analysis which rely on extensive use of training data or linguistic resources that are usually either domain-specific or generic. Both kinds of resources prevent classiffiers from performing well across a range of domains, as this requires appropriate in-domain (domain-specific) data.
This thesis presents a novel unsupervised, knowledge-poor approach to sentiment analysis aimed at creating a domain-independent and multilingual sentiment analysis system.
The approach extracts domain-specific resources from documents that are to be processed, and uses them for sentiment analysis. This approach does not require any training corpora, large sets of rules or generic sentiment lexicons, which makes it domain- and languageindependent but at the same time able to utilise domain- and language-specific information.
The thesis describes and tests the approach, which is applied to diffeerent data, including customer reviews of various types of products, reviews of films and books, and news items; and to four languages: Chinese, English, Russian and Japanese. The approach is applied not only to binary sentiment classiffication, but also to three-way sentiment classiffication (positive, negative and neutral), subjectivity classifiation of documents and sentences, and to the extraction of opinion holders and opinion targets. Experimental results suggest that the approach is often a viable alternative to supervised systems, especially when applied to large document collections
- …