Experiments on domain adaptation for English-Hindi SMT
Statistical Machine Translation (SMT) systems are usually trained on large amounts of bilingual text and monolingual target-language text. If a significant amount of out-of-domain data is added to the training data, translation quality can drop. On the other hand, training an SMT system on only a small amount of in-domain data leads to narrow lexical coverage, which again results in low translation quality. In this paper, (i) we explore domain-adaptation techniques to combine large out-of-domain training data with small-scale in-domain training data for English-Hindi statistical machine translation, and (ii) we cluster large out-of-domain training data to extract sentences similar to in-domain sentences and apply adaptation techniques to combine the clustered sub-corpora with in-domain training data in a unified framework, achieving an absolute improvement of 0.44 BLEU points over the baseline, corresponding to a 4.03% relative improvement.
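The two reported figures (0.44 absolute, 4.03% relative) together imply an approximate baseline score, which the abstract does not state. The following is a minimal sketch of that arithmetic; the function name `implied_baseline` is hypothetical, and the result is only as precise as the rounded figures in the abstract.

```python
def implied_baseline(absolute_gain: float, relative_gain_pct: float) -> float:
    """Recover the baseline score from an absolute and a relative gain.

    relative_gain_pct = 100 * absolute_gain / baseline, so
    baseline = absolute_gain / (relative_gain_pct / 100).
    """
    if relative_gain_pct <= 0:
        raise ValueError("relative gain must be positive")
    return absolute_gain / (relative_gain_pct / 100.0)


# Figures taken from the abstract; the baseline itself is inferred, not reported.
baseline_bleu = implied_baseline(0.44, 4.03)
adapted_bleu = baseline_bleu + 0.44

print(f"implied baseline BLEU: about {baseline_bleu:.1f}")
print(f"implied adapted BLEU:  about {adapted_bleu:.1f}")
```

Under these assumptions the baseline works out to roughly 10.9 BLEU, so the reported gain is small in absolute terms.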
Clustering of syntactic and discursive information for the dynamic adaptation of Language Models
We present a strategy for clustering semantic and discursive dialogue elements. Using Latent Semantic Analysis (LSA), we group the different elements according to a correlation-based distance criterion. After selecting a set of groups that form a partition of the semantic or discursive space under consideration, we train stochastic language models (LMs) associated with each group. These models are used for the dynamic adaptation of the language model employed by the speech recognizer in a dialogue system. Using dialogue information (the posterior probabilities that the dialogue manager assigns to each dialogue element at each turn), we estimate the interpolation weights for each LM. Initial experiments show a reduction in word error rate when the information obtained from an utterance is used to re-estimate that same utterance.
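The interpolation step described above can be sketched as a weighted mixture of per-cluster language models. This is an illustration only: the abstract does not say how the weights are derived from the dialogue manager's posteriors, so simple normalization is assumed here, and the toy unigram models and the names `interpolation_weights` and `interpolate` are hypothetical.

```python
from typing import Dict, List


def interpolation_weights(posteriors: List[float]) -> List[float]:
    """Normalize per-cluster posteriors so the weights sum to 1.

    Assumption: the weights are the normalized posteriors. Falls back to
    uniform weights when the posteriors carry no mass.
    """
    total = sum(posteriors)
    if total <= 0:
        return [1.0 / len(posteriors)] * len(posteriors)
    return [p / total for p in posteriors]


def interpolate(word: str,
                models: List[Dict[str, float]],
                weights: List[float]) -> float:
    """Linear interpolation: P(word) = sum_i weight_i * P_i(word)."""
    return sum(w * m.get(word, 0.0) for w, m in zip(weights, models))


# Toy unigram LMs, one per cluster (hypothetical values).
cluster_lms = [
    {"flight": 0.6, "hotel": 0.4},
    {"flight": 0.1, "hotel": 0.9},
]
# Hypothetical posteriors assigned by the dialogue manager for one turn.
turn_weights = interpolation_weights([3.0, 1.0])
p_flight = interpolate("flight", cluster_lms, turn_weights)
```

Recomputing the weights at every turn is what makes the adaptation dynamic: the component models stay fixed while the mixture changes.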
Language Modeling by Clustering with Word Embeddings for Text Readability Assessment
We present a clustering-based language model using word embeddings for text
readability prediction. Presumably, a Euclidean semantic space hypothesis
holds true for word embeddings whose training is done by observing word
co-occurrences. We argue that clustering with word embeddings in the metric
space should yield feature representations in a higher semantic space
appropriate for text regression. Also, by representing features in terms of
histograms, our approach can naturally address documents of varying lengths. An
empirical evaluation using the Common Core Standards corpus reveals that the
features formed on our clustering-based language model significantly improve
on the previously known results for the same corpus in readability prediction.
We also evaluate the task of sentence matching based on semantic relatedness
using the Wiki-SimpleWiki corpus and find that our features lead to superior
matching performance.
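The histogram representation mentioned above can be sketched as follows: each word is assigned to its nearest cluster centroid under Euclidean distance, and a document becomes the normalized count of words per cluster. This is a minimal sketch, not the authors' pipeline; the 2-D embeddings, the fixed centroids, and the function names are hypothetical, and in practice the centroids would come from a clustering algorithm such as k-means.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_centroid(vec: Vector, centroids: Sequence[Vector]) -> int:
    """Index of the centroid closest to vec in Euclidean distance."""
    dists = [math.dist(vec, c) for c in centroids]
    return dists.index(min(dists))


def histogram_features(tokens: Sequence[str],
                       embeddings: Dict[str, Vector],
                       centroids: Sequence[Vector]) -> List[float]:
    """Normalized histogram of cluster assignments for one document.

    Out-of-vocabulary tokens are skipped. The output length equals the
    number of clusters, regardless of document length.
    """
    counts = [0] * len(centroids)
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            counts[nearest_centroid(vec, centroids)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(centroids)
    return [c / total for c in counts]


# Toy data (hypothetical): two clusters in a 2-D embedding space.
toy_embeddings = {"cat": (0.0, 0.1), "dog": (0.1, 0.0), "tax": (1.0, 0.9)}
toy_centroids = [(0.0, 0.0), (1.0, 1.0)]

short_doc = histogram_features("cat dog tax cat".split(),
                               toy_embeddings, toy_centroids)
long_doc = histogram_features(("cat dog tax cat " * 5).split(),
                              toy_embeddings, toy_centroids)
```

Because the histogram is normalized, `short_doc` and `long_doc` produce feature vectors of the same size and scale, which is how this representation accommodates documents of varying lengths.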
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique n-bit numbers the most
significant bit-patterns of which incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance.
Comment: 17 Page Paper. Self-extracting PostScript Fil
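The structural-tag idea can be sketched with plain bit operations: if the most significant bits of a word's tag encode its path through a binary class hierarchy, then the class at any depth is a prefix of the tag. The tag width, the example tags, and the function names below are hypothetical and are not taken from the paper.

```python
TAG_BITS = 8  # hypothetical tag width; the paper's actual width is not assumed


def class_at_depth(tag: int, depth: int, tag_bits: int = TAG_BITS) -> int:
    """Return the class ID formed by the `depth` most significant bits of `tag`.

    depth=0 puts every word in a single root class; depth=tag_bits
    distinguishes every word.
    """
    if not 0 <= depth <= tag_bits:
        raise ValueError("depth must be between 0 and tag_bits")
    return tag >> (tag_bits - depth)


def same_class(tag_a: int, tag_b: int, depth: int,
               tag_bits: int = TAG_BITS) -> bool:
    """True if both tags share the same bit prefix of length `depth`."""
    return (class_at_depth(tag_a, depth, tag_bits)
            == class_at_depth(tag_b, depth, tag_bits))


# Hypothetical tags: two related words share a long prefix, a third does not.
toy_tags = {
    "monday": 0b10110010,
    "tuesday": 0b10110111,
    "run": 0b01100001,
}
```

With these toy tags, `same_class(toy_tags["monday"], toy_tags["tuesday"], 4)` is true while the same comparison at depth 8 is false, so a single integer gives access to every level of class granularity.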