14,407 research outputs found

    Multilingual unsupervised word alignment models and their application

    Get PDF
    Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicon, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically-motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment where word alignments are associated with linguistically-motivated alignment types. We propose a novel task of joint prediction of word alignment and alignment types and propose novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural Hidden Markov Model for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle the part-of-speech tagging task for the zero-resource scenario where no part-of-speech (POS) annotated training data is available. We present a cross-lingual projection approach where neural HMM aligners are used to obtain high quality word alignments between resource-poor and resource-rich languages. Moreover, high quality neural POS taggers are used to provide annotations for the resource-rich language side of the parallel data, as well as to train a tagger on the projected data. Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines

    Conditional Random Field Autoencoders for Unsupervised Structured Prediction

    Full text link
    We introduce a framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is (re)generated, conditional on the latent structure, using models for which maximum likelihood estimation has a closed-form. Our autoencoder formulation enables efficient learning without making unrealistic independence assumptions or restricting the kinds of features that can be used. We illustrate insightful connections to traditional autoencoders, posterior regularization and multi-view learning. We show competitive results with instantiations of the model for two canonical NLP tasks: part-of-speech induction and bitext word alignment, and show that training our model can be substantially more efficient than comparable feature-rich baselines

    WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction

    Full text link
    Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully-aligned, and parallel sentences. Specifically, we make noisy, partially aligned, and non-parallel paragraphs. We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction. Extensive experiments with various settings empirically demonstrate that our approach, which is named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign has set a new state-of-the-art by improving upon the best-supervised baseline by 3.3~6.1 points in F1 and 1.5~6.1 points in AER. Furthermore, WSPAlign also achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot and cross-lingual tests, which demonstrates that WSPAlign is potentially more practical for low-resource languages than existing methods.Comment: To appear at ACL 202

    Text Line Segmentation of Historical Documents: a Survey

    Full text link
    There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines),automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest.Comment: 25 pages, submitted version, To appear in International Journal on Document Analysis and Recognition, On line version available at http://www.springerlink.com/content/k2813176280456k3
    • …
    corecore