Poetry: Identification, Entity Recognition, and Retrieval
Modern advances in natural language processing (NLP) and information retrieval (IR) make it possible to automatically analyze, categorize, process and search textual resources. However, generalizing these approaches remains an open problem: models that perform well on one type of data must be re-trained for other domains.
Models often make assumptions about the length, structure, discourse model and vocabulary of a particular corpus. Trained models can become biased toward their original dataset, learning, for example, that all capitalized words are names of people or that short documents are more relevant than long ones. As a result, small amounts of noise or shifts in style can cause models to fail on unseen data. The key to more robust models is to evaluate text analytics tasks on more challenging and diverse data.
Poetry is an ancient art form that is believed to pre-date writing and is still a key form of expression through text today. Some poetry forms (e.g., haiku and sonnets) have rigid structure but still break our traditional expectations of text. Other poetry forms drop punctuation and other rules in favor of expression.
Our contributions include a set of novel, challenging datasets that extend traditional tasks: a text classification task for which content features perform poorly, a named entity recognition task that is inherently ambiguous, and a retrieval corpus over the largest public collection of poetry ever released.
We begin with poetry identification, the task of finding poetry within existing textual collections. Since content models do not generalize well, we devise an effective method of extracting poetry based on how it is usually formatted within digitally scanned books. Then we turn to the content of poetry: we construct a dataset of around 6,000 tagged spans that identify the people, places, organizations and personified concepts within poetry. We show that cross-training with existing datasets based on news corpora helps modern models learn to recognize entities within poetry. Finally, we return to IR and construct a dataset of queries and documents, inspired by real-world data, that exposes some of the key challenges of searching through poetry. Our work is the first significant effort to apply poetry to these three tasks, and our datasets and models provide strong baselines for new avenues of research in this challenging domain.
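The abstract does not spell out which formatting cues the extraction method uses, but the idea of detecting poetry from page layout rather than content can be sketched as follows. The thresholds and features here (short ragged lines, capitalized line starts, few sentence-final punctuation marks) are illustrative assumptions, not the authors' actual model.

```python
def looks_like_poetry(lines, max_line_len=50, min_short_ratio=0.8):
    """Flag a block of OCR'd lines as verse using layout cues alone.

    Poetry in scanned books tends to appear as short, ragged lines that
    each begin with a capital letter and rarely end in a full stop.
    """
    lines = [ln.rstrip() for ln in lines if ln.strip()]
    if len(lines) < 3:
        return False
    short = sum(len(ln) <= max_line_len for ln in lines) / len(lines)
    capitalized = sum(ln.lstrip()[0].isupper() for ln in lines) / len(lines)
    unpunctuated = sum(not ln.endswith((".", "!", "?")) for ln in lines) / len(lines)
    return short >= min_short_ratio and capitalized > 0.5 and unpunctuated > 0.5

stanza = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
    "And summer's lease hath all too short a date;",
]
prose = [
    "It was the best of times, it was the worst of times, it was the age of wisdom,",
    "it was the age of foolishness, it was the epoch of belief, it was the epoch of",
    "incredulity, it was the season of Light, it was the season of Darkness.",
]
```

On these two examples the heuristic accepts the stanza and rejects the hard-wrapped prose, since the prose lines all exceed the line-length threshold.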
Cross-lingual genre classification
Automated classification of texts into genres can benefit NLP applications, since the structure, location and even interpretation of information within a text are dictated by its genre. Cross-lingual methods promise such benefits to languages which lack genre-annotated training data. While there has been work on genre classification for over two decades, none had considered cross-lingual methods before the start of this project. My research aims to fill this gap. It follows previous approaches to monolingual genre classification that exploit simple, low-level text features, many of which can be extracted in different languages and have similar functions. This contrasts with work on cross-lingual topic or sentiment classification, which typically uses word frequencies as features; these have been shown to be of limited use for genres. Many such methods also assume cross-lingual resources, such as machine translation, which limits the range of their application. A selection of these approaches serves as baselines in my experiments.
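The low-level, largely language-independent features alluded to above can be illustrated with a small sketch. The particular measures chosen here (average word and sentence length, type-token ratio, punctuation and digit density) are common surface features in genre classification, but they are an assumption for illustration, not the thesis's exact feature set.

```python
import re

def genre_features(text):
    """Surface features that can be extracted in most languages without
    translation or language-specific resources (illustrative choices)."""
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sent_len": n_words / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        "punct_density": sum(c in ",;:!?" for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
    }
```

Because none of these features look at word identity, the same extractor can be applied to a genre-labelled source language and an unlabelled target language alike, which is the property the cross-lingual methods below rely on.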
I report the results of two semi-supervised methods for exploiting genre-labelled source language texts and unlabelled target language texts. The first is a relatively simple algorithm that bridges the language gap by exploiting cross-lingual features and then iteratively re-trains a classification model on previously predicted target texts. My results show that this approach works well where only a few cross-lingual resources are available and texts are to be classified into broad genre categories. It is also shown that further improvements can be achieved through multi-lingual training or cross-lingual feature selection if genre-annotated texts are available in several source languages.

The second is a variant of the label propagation algorithm. This graph-based classifier learns genre-specific feature-set weights from both source and target language texts and uses them to adjust the propagation channels for each text. This allows further feature sets to be added as additional resources, such as part-of-speech taggers, become available. While the method performs well even with basic text features, it is shown to benefit from additional feature sets. Results also indicate that it handles fine-grained genre classes better than the iterative re-labelling method.
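The graph-based method builds on classic label propagation, which can be sketched as follows: labelled source texts keep their genre labels clamped while unlabelled target texts repeatedly absorb the label distributions of their neighbours in a similarity graph. This is a minimal sketch of the standard algorithm only; the thesis's variant additionally learns per-feature-set weights to adjust the propagation channels, which is omitted here.

```python
import numpy as np

def label_propagation(sim, labels, n_labeled, iters=100):
    """Propagate genre labels over a text-similarity graph.

    sim:       (n, n) non-negative similarity matrix over all texts,
               with the labelled source texts in the first rows
    labels:    (n_labeled,) integer genre labels for those texts
    n_labeled: number of clamped, labelled texts
    """
    n = sim.shape[0]
    k = labels.max() + 1
    # Row-normalise similarities into transition probabilities.
    P = sim / sim.sum(axis=1, keepdims=True)
    Y = np.zeros((n, k))
    Y[np.arange(n_labeled), labels] = 1.0
    F = Y.copy()
    for _ in range(iters):
        F = P @ F                      # each text averages its neighbours
        F[:n_labeled] = Y[:n_labeled]  # clamp labelled source texts
    return F.argmax(axis=1)

# Toy example: texts 0 and 1 are labelled; text 2 resembles 0, text 3 resembles 1.
sim = np.array([
    [1.0, 0.1, 0.8, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.8, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
])
preds = label_propagation(sim, np.array([0, 1]), n_labeled=2)  # classes: [0, 1, 0, 1]
```

In the cross-lingual setting the similarity matrix would be computed from language-independent surface features, so that source and target texts sit in the same graph despite sharing no vocabulary.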