Search CORE

1 research outputs found

Exploratory Relation Extraction in Large Multilingual Data

Author: Akbik Alan
Publication venue: Technische Universitaet Berlin (Germany)
Publication date: 01/01/2016
Field of study

The task of Relation Extraction (RE) is concerned with creating extractors that automatically find structured, relational information in unstructured data such as natural language text. Motivated by an explosion of sources of readily available text data such as the Web, RE offers intriguing possibilities for querying, organizing, and analyzing information by drawing upon the clean semantics of structured databases and the abundance of unstructured data. However, practical applications of RE are often characterized by vague and shifting information needs on the one hand and large multilingual datasets of unknown content on the other. Classical RE approaches are unable to handle such scenarios since they require a careful, upfront definition of extraction tasks before extractors can be created in an effort-intensive, time-consuming process. With this thesis, I propose the paradigm of Exploratory Relation Extraction (ERE), a user-driven but data-guided process of exploration for relations of interest in unknown data. I show how distributional evidence and an informed linguistic abstraction can be employed to allow users to openly explore a dataset for relations of interest and rapidly prototype extractors for discovered relations at minimal effort. Furthermore, I propose the use of a language-neutral representation of shallow semantics to address the issue of multilingual data. This representation enables a shared feature space for different languages against which extractors can be developed. I present a method that expands English-language Semantic Role Labeling (SRL) to other languages and use it to generate multilingual SRL resources for seven distinct languages from different language groups, namely Arabic, Chinese, French, German, Hindi, Russian and Spanish in order to bootstrap semantic parsers for these languages. Together, the researched approaches represent a novel way for data scientists to work with large multilingual datasets of unknown content

ProQuest OAI Repository