METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development
Research on comparable corpora has grown in recent years, raising the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first describes the METRICC project, which is aimed at the automatic creation of comparable corpora, presents one of the crawlers developed for comparable corpus building, and then discusses the power of collocational networks for multilingual corpus-driven dictionary development.
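A collocational network in the sense of Williams (1998) starts from a node word, finds its strongest collocates, and then expands the network from each collocate in turn. The following is a minimal, hedged sketch of that idea; the scoring (raw window co-occurrence counts) and all function names are illustrative simplifications, not the paper's implementation.

```python
# Toy collocational network: breadth-first expansion of a co-occurrence
# graph from a seed word. Real systems use association measures such as
# mutual information instead of raw counts.
from collections import Counter

def collocates(tokens, node, window=2, top=3):
    """Most frequent words co-occurring with `node` within `window`."""
    counts = Counter()
    for i, w in enumerate(tokens):
        if w == node:
            lo, hi = max(0, i - window), i + window + 1
            for c in tokens[lo:hi]:
                if c != node:
                    counts[c] += 1
    return [w for w, _ in counts.most_common(top)]

def collocational_network(tokens, seed, depth=2, **kw):
    """Each collocate becomes a new node to expand, up to `depth` levels."""
    network, frontier = {}, [seed]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            if node not in network:
                network[node] = collocates(tokens, node, **kw)
                next_frontier.extend(network[node])
        frontier = next_frontier
    return network
```

For corpus-driven dictionary building, networks extracted from comparable corpora in two languages would then be compared to align node words and their collocates.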
Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm
In computational linguistics, parallel corpora and bilingual lexicons are used as important resources in fields such as machine translation and cross-language information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons enable direct word-to-word translation in cross-language information retrieval, and they also support the translation process in machine translation systems. Furthermore, the larger the parallel corpora and bilingual lexicons available for training, the better the performance of the machine translation system. However, building such bilingual lexicons manually, by human effort, requires a great deal of cost, time, and labor. For these reasons, research on extracting bilingual lexicons has attracted wide attention from researchers.
This thesis proposes a new and effective method for extracting bilingual lexicons. It builds on the vector space model, the approach most widely used in bilingual lexicon extraction, and iteratively learns weights for bilingual lexicon entries using the perceptron algorithm, a type of neural network. The final bilingual lexicons are then extracted using the iteratively learned weights and the perceptron.
As a result, the iteratively trained model achieved an average accuracy improvement of 3.5% over the initial, untrained results.
1. Introduction
2. Literature Review
2.1 Linguistic resources: The text corpora
2.2 A vector space model
2.3 Neural networks: The single layer Perceptron
2.4 Evaluation metrics
3. System Architecture of Bilingual Lexicon Extraction System
3.1 Required linguistic resources
3.2 System architecture
4. Building a Seed Dictionary
4.1 Methodology: Context Based Approach (CBA)
4.2 Experiments and results
4.2.1 Experimental setups
4.2.2 Experimental results
4.3 Discussions
5. Extracting Bilingual Lexicons
5.1 Methodology: Iterative Approach (IA)
5.2 Experiments and results
5.2.1 Experimental setups
5.2.2 Experimental results
5.3 Discussions
6. Conclusions and Future Work
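The abstract's core idea, re-ranking translation candidates by context-vector similarity with weights relearned by perceptron updates on a seed dictionary, can be sketched as follows. This is an illustrative reading, not the thesis implementation; the feature names and data structures are assumptions.

```python
# Hedged sketch: perceptron re-ranking of bilingual lexicon candidates.
# features[(src, tgt)] is a sparse dict of similarity features (e.g.
# context-vector overlaps); seed_pairs are known translations.

def dot(w, x):
    """Sparse dot product between weight dict and feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def perceptron_rerank(seed_pairs, candidates, features, epochs=10, lr=0.1):
    w = {}
    for _ in range(epochs):
        for src, gold in seed_pairs:
            # current best translation candidate under the model
            pred = max(candidates[src],
                       key=lambda t: dot(w, features[(src, t)]))
            if pred != gold:  # standard perceptron update
                for f, v in features[(src, gold)].items():
                    w[f] = w.get(f, 0.0) + lr * v
                for f, v in features[(src, pred)].items():
                    w[f] = w.get(f, 0.0) - lr * v
    return w
```

After training, the final lexicon would be read off by taking, for each source word, the candidate with the highest weighted score.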
Learning from Noisy Data in Statistical Machine Translation
In this work, methods were developed that reduce the negative effects of noisy data in SMT systems and thereby improve system performance. The problem is addressed at two different stages of the learning process: during preprocessing and during modeling. In preprocessing, two methods are developed that improve the statistical models by raising the quality of the training data. In modeling, several ways of weighting data according to its usefulness are presented.
First, the effect of removing false positives from the parallel corpus is shown. A parallel corpus consists of a text in two languages, where each sentence in one language is paired with the corresponding sentence in the other language; it is assumed that the number of sentences is the same in both language versions. False positives, in this sense, are sentence pairs that are paired in the parallel corpus but are not translations of each other. To detect them, a small, error-free parallel corpus (clean corpus) is assumed. Using various lexical features, false positives are reliably filtered out before the modeling phase. One important lexical feature is the bilingual lexicon generated from the clean corpus. In extracting this bilingual lexicon, several heuristics are implemented that lead to improved performance.
Next, we consider the problem of extracting the most useful parts of the training data. We rank the data by its relevance to the target domain, under the assumption that a good, representative tuning set exists. Since such tuning data are typically of limited size, word similarities are used to extend the coverage of the tuning data.
The word similarities used in the previous step are decisive for the quality of the procedure. For this reason, the thesis presents various automatic methods for deriving such word similarities from monolingual and bilingual corpora. Interestingly, this is possible even with limited data, by also drawing on monolingual data, which are available in large quantities, to estimate word similarity. For bilingual data, which are often available only in limited size, additional language pairs can be used that share at least one language with the given language pair.
In the modeling step, we address the problem of noisy data by weighting the training data according to the quality of the corpus. We use statistical significance measures to find the less reliable sequences and reduce their weight. As in the previous approaches, word similarities are used to handle the problem of limited data. A further problem arises, however, as soon as absolute frequencies are replaced with weighted frequencies; this work develops techniques for smoothing the probabilities in this situation.
The size of the training data becomes problematic when working with corpora of considerable volume. Two main difficulties arise: the length of training time and limited main memory. For the training-time problem, an algorithm is developed that distributes the computationally expensive calculations across multiple processors with shared memory. For the memory problem, special data structures and external-memory algorithms are used. This allows efficient training of extremely large models on hardware with limited memory.
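One way the false-positive filtering described above can work is to score each sentence pair by how many source words have a known translation on the target side, using the lexicon extracted from the clean corpus. The sketch below illustrates that idea under simplifying assumptions; the threshold and scoring are placeholders, not the thesis's actual lexical features.

```python
# Hedged sketch: flag sentence pairs as false positives when too few
# source words have any of their lexicon translations on the target side.

def lexicon_coverage(src_tokens, tgt_tokens, lexicon):
    """lexicon maps a source word to a set of possible target words."""
    tgt = set(tgt_tokens)
    known = [w for w in src_tokens if w in lexicon]
    if not known:
        return 0.0  # no lexical evidence either way
    covered = sum(1 for w in known if lexicon[w] & tgt)
    return covered / len(known)

def filter_parallel(pairs, lexicon, threshold=0.3):
    """Keep only sentence pairs whose coverage meets the threshold."""
    return [(s, t) for s, t in pairs
            if lexicon_coverage(s, t, lexicon) >= threshold]
```

A real system would combine several such lexical features and tune the threshold on held-out clean data.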
Translation Alignment and Extraction Within a Lexica-Centered Iterative Workflow
This thesis addresses two closely related problems. The first, translation alignment, consists of identifying bilingual document pairs that are translations of each other within
multilingual document collections (document alignment); identifying sentences, titles,
etc., that are translations of each other within bilingual document pairs (sentence alignment); and identifying corresponding word and phrase translations within bilingual
sentence pairs (phrase alignment). The second is extraction of bilingual pairs of equivalent word and multi-word expressions, which we call translation equivalents (TEs), from sentence- and phrase-aligned parallel corpora.
While these same problems have been investigated by other authors, their focus has
been on fully unsupervised methods based mostly or exclusively on parallel corpora.
Bilingual lexica, which are basically lists of TEs, have not been considered or given enough importance as resources in the treatment of these problems. Human validation of TEs, which consists of manually classifying TEs as correct or incorrect translations, has also not been considered in the context of alignment and extraction. Validation strengthens the importance of infrequent TEs (most of the entries of a validated lexicon) that otherwise would be statistically unimportant.
The main goal of this thesis is to revisit the alignment and extraction problems in the
context of a lexica-centered iterative workflow that includes human validation. Therefore, the methods proposed in this thesis were designed to take advantage of knowledge accumulated in human-validated bilingual lexica and translation tables obtained by unsupervised methods. Phrase-level alignment is a stepping stone for several applications, including the extraction of new TEs, the creation of statistical machine translation systems, and the creation of bilingual concordances. Therefore, for phrase-level alignment, the higher accuracy of human-validated bilingual lexica is crucial for achieving higher quality results in these downstream applications.
There are two main conceptual contributions. The first is the coverage maximization approach to alignment, which makes direct use of the information contained in a lexicon, or in translation tables when the lexicon is small or does not exist. The second is the introduction of translation patterns, which combine new and old ideas and enable precise and productive extraction of TEs. As material contributions, the alignment and extraction methods proposed in this thesis have produced source materials for three lines of research, in the context of three PhD theses (two of them already defended), all carried out under the same advisor as this thesis. The topics of these lines of research are statistical machine translation, algorithms and data structures for indexing and querying phrase-aligned parallel corpora, and bilingual lexica classification and generation. Four publications have resulted directly from the work presented in this thesis and twelve from the collaborative lines of research.
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources.
Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201
Thematic Annotation: extracting concepts out of documents
Contrary to standard approaches to topic annotation, the technique used in
this work does not centrally rely on some sort of -- possibly statistical --
keyword extraction. In fact, the proposed annotation algorithm uses a large
scale semantic database -- the EDR Electronic Dictionary -- that provides a
concept hierarchy based on hyponym and hypernym relations. This concept
hierarchy is used to generate a synthetic representation of the document by
aggregating the words present in topically homogeneous document segments into a
set of concepts best preserving the document's content.
This new extraction technique uses an unexplored approach to topic selection. Instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. This conceptual hierarchy is then searched to extract the most relevant set of concepts for representing the topics discussed in the document. Notice that this algorithm is able to extract generic concepts that are not directly present in the document.
Comment: Technical report EPFL/LIA. 81 pages, 16 figures
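The aggregation step described above, mapping segment words to concepts and walking up hypernym links so that a generic concept covering many words can be selected even if it never appears in the text, can be sketched as follows. The paper uses the EDR Electronic Dictionary; this toy uses a hand-made parent map, and the selection rule (highest ancestor count) is a simplification.

```python
# Hedged sketch: aggregate segment words into the concept of the
# hypernym hierarchy that covers the most of them.
from collections import Counter

def ancestors(concept, parent):
    """Walk the hypernym chain from a concept up to the hierarchy root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = parent.get(concept)
    return chain

def best_concept(words, word2concept, parent):
    """Count every ancestor of every word's concept; return the winner."""
    counts = Counter()
    for w in words:
        for c in ancestors(word2concept.get(w), parent):
            counts[c] += 1
    return counts.most_common(1)[0][0] if counts else None
```

Note how a hypernym such as a shared parent concept can win even though no word in the segment denotes it directly, which matches the abstract's claim about extracting generic concepts absent from the document.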