
    Limitations and challenges of unsupervised cross-lingual pre-training

    Cross-lingual alignment methods for monolingual language representations have received notable research attention in the past few years, largely due to their capacity to induce bilingual alignments with little or no parallel supervision. However, their use in machine translation pre-training, a role in which monolingual models excel and which should benefit from cross-lingual information, remains limited. This work tries to shed light on the effects of some of the factors that play a role in cross-lingual representations and pre-training strategies, with the hope that it can help guide future work in the field. To this end, it studies the two main components that constitute cross-lingual pre-training: cross-lingual mappings and their integration into pre-training. The former are explored through several widely known fully unsupervised cross-lingual methods, which rely mainly on distributional similarities between languages. Consequently, they are a useful testbed for examining the effects of language similarity on both cross-lingual mapping techniques and the representation spaces over which they operate. For pre-training integration, cross-lingual representation spaces are used to pre-train neural machine translation models, which are compared against schemes that employ independent monolingual spaces. The results show that weakly supervised cross-lingual methods are remarkably effective at inducing alignments even for distant language pairs and benefit noticeably from subword information. However, the effect of cross-linguality in pre-training is diminished by the difficulty of maintaining the structure of the projection during training and by the limited influence that pre-training itself has on the supervised model.
    Quesada Zaragoza, M. (2021). Limitations and challenges of unsupervised cross-lingual pre-training. Universitat Politècnica de València. http://hdl.handle.net/10251/174111 (TFG)
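    The mapping methods described above align independently trained monolingual embedding spaces using distributional similarity. As a point of reference only, below is a minimal sketch of the orthogonal Procrustes mapping that underlies many such alignment approaches; the function name, the random stand-in embeddings, and the seed-pair setup are illustrative assumptions, not the thesis's actual procedure.

    # Minimal sketch (assumption: a seed dictionary of translation pairs is
    # available): the closed-form orthogonal Procrustes mapping commonly used
    # to align two monolingual embedding spaces. Not the thesis's exact method.
    import numpy as np

    def orthogonal_mapping(X_src, Y_tgt):
        # Solve min_W ||X_src @ W - Y_tgt||_F with W constrained to be orthogonal.
        # X_src, Y_tgt: (n, d) arrays of embeddings for n seed translation pairs.
        U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
        return U @ Vt  # (d, d) orthogonal projection into the target space

    # Toy usage with random stand-ins for pre-trained word vectors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 300))  # source-language embeddings (seed pairs)
    Y = rng.normal(size=(1000, 300))  # target-language embeddings (seed pairs)
    W = orthogonal_mapping(X, Y)
    X_mapped = X @ W                  # source vectors projected into the target space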

    Local Byte Fusion for Neural Machine Translation

    Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid, and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in multilingual corpora, subword tokenization schemes over-segment low-resource languages, leading to a drop in translation performance. A simple alternative to subword tokenizers is byte-based methods, i.e., tokenization into byte sequences using encoding schemes such as UTF-8. Byte tokens often represent inputs at a sub-character granularity, i.e., one character can be represented by a sequence of multiple byte tokens. This results in byte sequences that are significantly longer than character sequences. Enforcing aggregation of local information in the lower layers can guide the model to build higher-level semantic information. We propose a Local Byte Fusion (LOBEF) method for byte-based machine translation -- utilizing byte n-gram and word boundaries -- to aggregate local semantic information. Extensive experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques. Further analysis also indicates that our byte-based models are parameter-efficient and can be trained faster than subword models.
    Comment: Accepted at ACL 2023 - Main Conference
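    Since the argument hinges on byte tokens being finer-grained than characters, here is a tiny, purely illustrative check of how UTF-8 byte sequences grow longer than character sequences for non-ASCII text; it demonstrates only the length blow-up, not the LOBEF fusion architecture itself.

    # Illustration only: UTF-8 encodes accented and CJK characters as 2-3 bytes
    # each, so byte-level token sequences are longer than character sequences.
    text = "Übersetzung 翻訳"                 # mixed Latin and CJK characters

    char_tokens = list(text)                  # character-level tokens
    byte_tokens = list(text.encode("utf-8"))  # byte-level tokens (ints 0-255)

    print(len(char_tokens))  # 14 characters
    print(len(byte_tokens))  # 19 bytes: Ü takes 2 bytes, each CJK character 3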