
    Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT

    Pretrained contextual representation models (Peters et al., 2018; Devlin et al., 2018) have pushed forward the state of the art on many NLP tasks. A new release of BERT (Devlin, 2018) includes a model simultaneously pretrained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language-transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing. We compare mBERT with the best published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer. Comment: EMNLP 2019 camera-ready.
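
    A minimal sketch of the zero-shot transfer setup described above, assuming the HuggingFace Transformers and Datasets APIs: fine-tune mBERT on English NLI data only, then evaluate directly on another language with no target-language training. The task, dataset, and hyperparameters are illustrative, not the paper's exact configuration.

```python
# Sketch: zero-shot cross-lingual transfer with mBERT on XNLI (illustrative settings).
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

train_en = load_dataset("xnli", "en", split="train")   # English supervision only
test_es = load_dataset("xnli", "es", split="test")     # evaluated zero-shot

def encode(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128, padding="max_length")

train_en = train_en.map(encode, batched=True)
test_es = test_es.map(encode, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-xnli-en", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate(eval_dataset=test_es))  # zero-shot Spanish accuracy
```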

    An Open Dataset and Model for Language Identification

    Language identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, whose reliability we ensure by manually auditing a sample from each source and each language. We make both the model and the dataset available to the research community. Finally, we carry out a detailed analysis of our model's performance, both in comparison to existing open models and by language class. Comment: To be published in ACL 2023.
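
    The headline metrics above (macro-average F1 and false positive rate across many languages) can be computed as in the following sketch; the labels and predictions shown are toy placeholders, not the paper's data.

```python
# Sketch: macro-average F1 and mean per-language false positive rate for a LID system.
import numpy as np
from sklearn.metrics import f1_score

y_true = ["eng", "swh", "fra", "eng", "lug"]   # gold language codes (toy example)
y_pred = ["eng", "swh", "eng", "eng", "lug"]   # system predictions (toy example)

labels = sorted(set(y_true) | set(y_pred))
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

def false_positive_rate(lang):
    """Share of non-`lang` sentences that the system mislabels as `lang`."""
    fp = sum(t != lang and p == lang for t, p in zip(y_true, y_pred))
    tn = sum(t != lang and p != lang for t, p in zip(y_true, y_pred))
    return fp / (fp + tn) if (fp + tn) else 0.0

mean_fpr = float(np.mean([false_positive_rate(l) for l in labels]))
print(f"macro F1 = {macro_f1:.3f}, mean FPR = {mean_fpr:.3f}")
```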

    Aggregate Implications of Firm Heterogeneity: A Nonparametric Analysis of Monopolistic Competition Trade Models

    We measure the role of firm heterogeneity in counterfactual predictions of monopolistic competition trade models without parametric restrictions on the distribution of firm fundamentals. We show that two bilateral elasticity functions are sufficient to nonparametrically compute the counterfactual aggregate impact of trade shocks, and to recover changes in economic fundamentals from observed data. These functions are identified from two semiparametric gravity equations governing the impact of bilateral trade costs on the extensive and intensive margins of firm-level exports. Applying our methodology, we estimate elasticity functions implying that the impact of trade costs on trade flows falls when more firms serve a market, because extensive-margin responses are smaller. Compared to a baseline with constant elasticities, firm heterogeneity amplifies both the gains from trade in countries with more exporter firms and the welfare gains of European market integration in 2003-2012.
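
    For reference, the constant-elasticity benchmark that the paper's semiparametric elasticity functions generalize can be written in standard gravity notation (ours, not the paper's): bilateral flows respond to trade costs with a single elasticity. In the paper, by contrast, this elasticity is allowed to vary and falls when more firms serve a market.

```latex
% Constant-elasticity gravity baseline (standard notation; illustrative, not the paper's):
X_{ij} \;=\; \frac{Y_i \, E_j}{Y^{W}}
\left( \frac{\tau_{ij}}{\Pi_i \, P_j} \right)^{1-\sigma},
\qquad
\frac{\partial \ln X_{ij}}{\partial \ln \tau_{ij}} \;=\; 1 - \sigma .
```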

    BantuBERTa : using language family grouping in multilingual language modeling for Bantu languages

    Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023. This work investigates whether a multilingual Bantu pretraining corpus can be created from freely available data. To build the dataset, Bantu text was extracted from corpora that are freely available online (mainly via Hugging Face). The resulting multilingual language model (BantuBERTa) pretrained on this data proved predictive across multiple Bantu languages on a higher-order NLP task (NER) and on a simpler NLP task (classification). This demonstrates that the dataset can be used for Bantu multilingual pretraining and transfer to multiple Bantu languages. It was also investigated whether this Bantu dataset benefits transfer learning in downstream NLP tasks. BantuBERTa under-performed relative to other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER's Bantu-language tests (Swahili, Luganda, and Kinyarwanda). However, it produced state-of-the-art results on the Bantu-language benchmarks (Zulu and Lingala) in the African News Topic Classification dataset. The pretraining dataset size (30% smaller than AfriBERTa's) and dataset quality were surmised to be the main causes of the poor NER performance. We believe this is a case-specific failure due to poor data quality, since the pretraining dataset consisted mainly of web-scraped pages, chiefly MC4 and CC100 Bantu text. On lower-order NLP tasks such as classification, however, pretraining on languages solely within the language family appeared to benefit transfer to other similar languages within the family. This potentially opens a method for effectively including low-resourced languages in low-level NLP tasks.
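
    A minimal sketch of how such a pretraining corpus might be assembled from freely available Hugging Face sources (MC4 and CC100 are the sources named above). Dataset identifiers, configuration names, and language codes below are illustrative and may differ across library versions; this is not the thesis's exact recipe.

```python
# Sketch: gathering Bantu-language text from freely available HuggingFace corpora.
# Dataset names/configs and language codes are illustrative, not the thesis's exact setup.
from datasets import load_dataset

BANTU_CODES = ["sw", "zu", "xh", "sn", "ny", "rw", "lg", "ln"]  # Swahili, Zulu, Xhosa, ...

def collect(code, max_docs=50_000):
    """Stream web-crawled text for one language and write one document per line."""
    out_path = f"bantu_{code}.txt"
    # mC4 via the allenai/c4 repository; the config name is assumed to be the language code.
    stream = load_dataset("allenai/c4", code, split="train", streaming=True)
    with open(out_path, "w", encoding="utf-8") as f:
        for i, doc in enumerate(stream):
            if i >= max_docs:
                break
            f.write(doc["text"].replace("\n", " ") + "\n")
    return out_path

shards = [collect(code) for code in BANTU_CODES]
print("pretraining shards:", shards)  # ready for tokenizer training and MLM pretraining
```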

    Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.

    The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and such systems cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches have started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component consists of a novel possibilistic clustering and feature-weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation. Our approach is based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes' rule; and (iv) a probabilistic model based on possibilistic membership degrees to annotate an image. The third component consists of an image retrieval framework based on multi-modal similarity propagation. The proposed framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested on a standard benchmark dataset. We show the effectiveness of our clustering algorithm in handling high-dimensional and noisy data. We compare our proposed image annotation approach to three state-of-the-art methods and demonstrate the effectiveness of the proposed image retrieval system.
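
    The two complementary membership degrees described above can be sketched in generic form (standard mixture and possibilistic-clustering notation; the exact expressions used in the thesis may differ): a posterior probability that sums to one across clusters, and a typicality degree that depends only on how far a point lies from a cluster. Low typicality flags a point as noise regardless of how the clusters compete for it.

```latex
% Generic forms of the two membership degrees (illustrative notation):
% posterior probability of cluster k for point x_i under the mixture,
p_{ik} \;=\; \frac{\pi_k \, f(\mathbf{x}_i \mid \theta_k)}
                  {\sum_{j} \pi_j \, f(\mathbf{x}_i \mid \theta_j)},
\qquad
% and a possibilistic typicality degree that decays with the distance d_{ik}.
t_{ik} \;=\; \left( 1 + \left( \frac{d_{ik}^{2}}{\eta_k} \right)^{\tfrac{1}{m-1}} \right)^{-1}.
```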

    Essays in International Economics

    This dissertation consists of three chapters. These chapters use the gravity model widely employed in the international economics literature. The first chapter investigates how economic integration agreements impact countries' local technology, wages, prices, and market access, and how to aggregate these effects to compute the changes in countries' welfare. By examining 16 countries that have harmonized with the European Union since 1980, we show that almost all of the participating countries experience welfare gains as a result of signing integration agreements with the European Union. The objective of the second chapter is to explain the determinants of foreign direct investment (FDI) flows; in particular, we focus on estimating the effects of economic integration agreements on FDI flows while controlling for time-varying country-specific unobserved variables as well as time-constant country-pair unobserved variables. Compared to the previous literature, we find that the coefficient estimates for common market and customs union membership are overestimated, and the free trade agreement coefficient becomes insignificant after accounting for the above-mentioned unobserved variables. Building on the work of Baier and Bergstrand (2009), the third chapter aims to obtain unbiased, consistent and efficient coefficient estimates of trade cost variables by accounting for unobserved country heterogeneity and approximation errors.
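
    The identification strategy of the second chapter (controlling for time-varying country-specific and time-constant country-pair unobservables) corresponds to a fixed-effects gravity specification of roughly the following form; the notation is illustrative rather than the dissertation's exact equation.

```latex
% Illustrative fixed-effects gravity specification for FDI (or trade) flows from i to j at t:
% FTA, CU, CM are free trade agreement, customs union, and common market dummies;
% \gamma_{ij} absorbs time-constant pair unobservables, while \delta_{it} and \lambda_{jt}
% absorb time-varying country-specific unobservables.
\ln X_{ijt} \;=\; \beta_1\,\mathrm{FTA}_{ijt} + \beta_2\,\mathrm{CU}_{ijt}
                + \beta_3\,\mathrm{CM}_{ijt}
                + \gamma_{ij} + \delta_{it} + \lambda_{jt} + \varepsilon_{ijt}.
```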

    Global Trade and GDP Co-Movement

    We revisit the association between trade and GDP comovement for 135 countries from 1970 to 2009. Guided by a simple theory, we introduce two notions of trade linkages: (i) the usual direct bilateral trade index and (ii) new indexes of common exposure to third countries capturing the role of similarity in trade networks. Both measures are economically and statistically associated with GDP correlation, suggesting an additional channel through which GDP fluctuations propagate through trade linkages. Moreover, high-income countries become more synchronized when the content of their trade is tilted toward inputs, while trade in final goods is key for low-income countries. Finally, we present evidence that a denser international trade network is associated with a stronger link between global trade flows and bilateral GDP comovement, leading to a significant evolution of the trade-comovement slope over the last two decades.
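
    The "usual direct bilateral trade index" in (i) is commonly measured as total bilateral trade scaled by the two countries' combined GDP; a standard form is sketched below. The paper's exact definitions, in particular the third-country exposure indexes in (ii), may differ.

```latex
% Standard bilateral trade-intensity index (illustrative; the paper's exact indexes may differ):
T_{ij,t} \;=\; \frac{X_{ij,t} + X_{ji,t}}{\mathrm{GDP}_{i,t} + \mathrm{GDP}_{j,t}} .
```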

    How Do Multilingual Encoders Learn Cross-lingual Representation?

    NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations as their foundation. As BERT revolutionized representation learning and NLP, it also revolutionized cross-lingual representations and cross-lingual transfer. Multilingual BERT was released as a multilingual replacement for single-language BERT, trained on Wikipedia data in 104 languages. Surprisingly, without any explicit cross-lingual signal, multilingual BERT learns cross-lingual representations in addition to representations for individual languages. This thesis first demonstrates this surprising cross-lingual effectiveness by comparing against prior art on various tasks. Naturally, this raises a set of questions, most notably how these multilingual encoders learn cross-lingual representations. In exploring these questions, the thesis analyzes the behavior of multilingual models in a variety of settings on high- and low-resource languages. We also look at how to inject different cross-lingual signals into multilingual encoders, and at the optimization behavior of cross-lingual transfer with these models. Together, these analyses provide a better understanding of multilingual encoders and cross-lingual transfer. Our findings lead us to suggest improvements to multilingual encoders and cross-lingual transfer.