Thematic Annotation: extracting concepts out of documents
Contrary to standard approaches to topic annotation, the technique used in
this work does not centrally rely on some sort of -- possibly statistical --
keyword extraction. In fact, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection.
Instead of using semantic similarity measures based on a semantic resource, the
latter is processed to extract the part of the conceptual hierarchy relevant to
the document content. Then this conceptual hierarchy is searched to extract the
most relevant set of concepts to represent the topics discussed in the
document. Notice that this algorithm is able to extract generic concepts that
are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
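As a rough illustration of how words in a segment might be aggregated into a more generic concept through hypernym links, here is a minimal sketch. The toy hierarchy, the function names, and the `min_cover` threshold are all assumptions made for this example; it does not use the EDR Electronic Dictionary and is not the report's actual algorithm.

```python
from collections import Counter

# Hypothetical toy hierarchy (child -> hypernym); a stand-in for a real
# semantic resource such as the EDR concept hierarchy.
HYPERNYM = {
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
    "mammal": "animal",
    "bird": "animal",
    "animal": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def ancestors(word):
    """Return the chain of hypernyms of `word`, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def depth(concept):
    """Depth of a concept in the hierarchy (the root has depth 0)."""
    return len(ancestors(concept))


def covering_concept(words, min_cover):
    """Pick the most specific concept that is a hypernym of at least
    `min_cover` distinct words of the segment, or None if there is none."""
    cover = Counter()
    for w in set(words):
        for c in ancestors(w):
            cover[c] += 1
    candidates = [c for c, n in cover.items() if n >= min_cover]
    if not candidates:
        return None
    # Prefer deeper (more specific) concepts, then broader coverage.
    return max(candidates, key=lambda c: (depth(c), cover[c]))


segment = ["dog", "cat", "sparrow"]
print(covering_concept(segment, min_cover=3))
```

In this toy run the selected concept, "animal", does not occur in the segment itself, which mirrors the point that generic concepts absent from the document can be extracted.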
Augmenting Translation Lexica by Learning Generalised Translation Patterns
Bilingual lexicons improve the quality of parallel corpora alignment, of newly extracted
translation pairs, of machine translation, and of cross-language information retrieval, among
other applications. In this regard, the first problem addressed in this thesis pertains to
the classification of automatically extracted translations from parallel corpora, that is,
collections of sentence pairs that are translations of each other. The second problem is concerned
with machine learning of bilingual morphology, with applications in the solution of the first
problem and in the generation of out-of-vocabulary translations.
With respect to the problem of translation classification, two separate classifiers for
handling multi-word and word-to-word translations are trained, using previously extracted
translation pairs that were manually classified as correct or incorrect. Several insights
are useful for distinguishing adequate multi-word candidates from inadequate ones: the
lack or presence of parallelism; spurious terms at translation ends, such as determiners
and coordinating conjunctions; properties such as orthographic similarity
between translations; and the occurrence and co-occurrence frequency of the translation
pairs. Morphological coverage, reflecting stem and suffix agreement, is explored as a key
feature in classifying word-to-word translations. Given that the evaluation of extracted
translation equivalents depends heavily on the human evaluator, incorporating an
automated filter for appropriate and inappropriate translation pairs prior to human evaluation
greatly reduces this work, thereby saving time
and progressively improving alignment and extraction quality. It can also be applied
to filtering translation tables used for training machine translation engines, and to
detecting bad translation choices made by translation engines, thus enabling significant
productivity gains in the post-editing of machine-made translations.
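The kinds of features listed above for multi-word candidates could be computed along the following lines. This is only a sketch under stated assumptions: the function name, the `EDGE_STOPWORDS` list, the use of `difflib` for orthographic similarity, and the Dice coefficient for co-occurrence strength are illustrative choices for this example, not the feature definitions used in the thesis.

```python
from difflib import SequenceMatcher

# Hypothetical list of determiners and conjunctions that should not appear
# at the edges of a good multi-word translation (English and Portuguese).
EDGE_STOPWORDS = {"the", "a", "an", "and", "or", "o", "os", "as", "um", "uma", "e", "ou"}


def pair_features(source, target, pair_count, source_count, target_count):
    """Compute a few illustrative features for one candidate translation pair.

    Both `source` and `target` are assumed to be non-empty strings.
    """
    src_tokens = source.lower().split()
    tgt_tokens = target.lower().split()
    edge_tokens = {src_tokens[0], src_tokens[-1], tgt_tokens[0], tgt_tokens[-1]}
    return {
        # Character-level similarity in [0, 1]; high for cognates.
        "orthographic_similarity": SequenceMatcher(
            None, source.lower(), target.lower()
        ).ratio(),
        # True if a determiner or conjunction sits at either end of either side.
        "spurious_edge_term": bool(edge_tokens & EDGE_STOPWORDS),
        # Dice coefficient from occurrence and co-occurrence counts.
        "dice": 2 * pair_count / (source_count + target_count),
    }


print(pair_features("european parliament", "parlamento europeu", 8, 10, 10))
print(pair_features("the european parliament", "parlamento europeu", 8, 12, 10))
```

Feature dictionaries of this shape could then be passed to any standard classifier trained on the manually labelled pairs.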
An important attribute of the translation lexicon is the coverage it provides. Learning
suffixes and suffixation operations from the lexicon or corpus of a language is an extensively
researched approach to tackling out-of-vocabulary terms. However, beyond mere words
or word forms, translations and their variants are a powerful source of information
for automatic structural analysis, which is explored here from the perspective of improving
word-to-word translation coverage and constitutes the second part of this thesis. In this
context, as a phase prior to the suggestion of out-of-vocabulary bilingual lexicon entries,
an approach is proposed to automatically induce segmentation and learn bilingual morph-like
units by identifying and pairing word stems and suffixes, using the bilingual
corpus of translations automatically extracted from aligned parallel corpora, manually
validated or automatically classified. A minimally supervised technique is proposed to enable
bilingual morphology learning for language pairs whose bilingual lexicons are highly
deficient in word-to-word translations representing inflectional diversity.
Apart from the above-mentioned applications in the classification of machine-extracted
translations and in the generation of out-of-vocabulary translations, learned bilingual
morph-units may also have a great impact on the establishment of correspondences between
sub-word constituents in the cases of word-to-multi-word and multi-word-to-multi-word
translations, and in compression, full-text indexing and retrieval applications.
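The idea of pairing word stems and suffixes across a bilingual lexicon can be sketched as follows. This is a deliberately crude approximation that takes the longest common prefix as the stem; the function name, the `min_stem` threshold, and the English-Portuguese example pairs are assumptions for this illustration, not the thesis's induction procedure.

```python
from collections import defaultdict
from itertools import combinations
import os


def induce_bilingual_morphs(pairs, min_stem=3):
    """Group translation pairs that share a stem on both sides.

    Returns a mapping from (source_stem, target_stem) to the set of
    (source_suffix, target_suffix) pairs observed with that stem pair.
    """
    morphs = defaultdict(set)
    for (s1, t1), (s2, t2) in combinations(pairs, 2):
        # Longest common prefix on each side serves as a candidate stem.
        src_stem = os.path.commonprefix([s1, s2])
        tgt_stem = os.path.commonprefix([t1, t2])
        if len(src_stem) >= min_stem and len(tgt_stem) >= min_stem:
            for s, t in ((s1, t1), (s2, t2)):
                morphs[(src_stem, tgt_stem)].add(
                    (s[len(src_stem):], t[len(tgt_stem):])
                )
    return dict(morphs)


# Hypothetical validated word-to-word translations.
lexicon = [("declared", "declarado"), ("declaring", "declarando")]
for stems, suffixes in induce_bilingual_morphs(lexicon).items():
    print(stems, sorted(suffixes))
```

Because the stem is just the longest shared prefix, the Portuguese side is cut at "declara" rather than at a linguistically motivated boundary; a real system would need frequency evidence across many pairs to settle on better segmentations.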
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
CONSTRAINTS ON IZĀFA IN SORANI KURDISH
This study examines the distribution and the status of the izāfa particle in Sorani Kurdish (Central Kurdish). It uses a corpus-based analysis to investigate the forms and the pattern of distribution of the izāfa particle in Sorani, a dominant dialect of Kurdish among the Western Iranian languages. The study details an investigation of the appearance of izāfa in various NPs using a variety of data, mostly from the corpus but supplemented by the grammaticality judgments of native speakers. I show that, alongside properties shared with other Western Iranian languages, Sorani Kurdish izāfa shows a form alternation. I examine the morphological status of the izāfa and other nominal morphological features in Kurdish, as well as the sensitivity of izāfa form variation to specificity in Kurdish NPs. I argue that the differences and distributional incoherence of the izāfa within Sorani and across Western Iranian languages call for a morphomic approach, which can be formally described using a constructional approach to grammar. The study focuses on the following questions: What type of head does the izāfa mark? What is the function of this marker? What are the constraints on its distribution? What are the syntactic and morphological rules governing its distribution?
Improving Product-related Patent Information Access with Automated Technology Ontology Extraction
Ph.D, DOCTOR OF PHILOSOPHY