Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili
The Europe Media Monitor (EMM) family of applications is a set of
multilingual tools that currently gather, cluster and classify news in fifty languages and
extract named entities and quotations (reported speech) in twenty languages. In
this paper, we describe the recent effort of adding the African Bantu language Swahili
to EMM. EMM is designed in an entirely modular way, so that a new
language can be plugged in by providing its language-specific resources. We thus
describe the type of language-specific resources needed, the effort involved, and ways
of bootstrapping the generation of these resources in order to keep the effort of adding
a new language to a minimum. The text analysis applications pursued in our efforts
include clustering, classification, recognition and disambiguation of named entities
(persons, organisations and locations), recognition and normalisation of date expressions,
as well as the identification of reported speech quotations by and about people.
JRC.G.2-Global security and crisis managemen