50 research outputs found
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
Pretrained contextual representation models (Peters et al., 2018; Devlin et
al., 2018) have pushed forward the state-of-the-art on many NLP tasks. A new
release of BERT (Devlin, 2018) includes a model simultaneously pretrained on
104 languages with impressive performance for zero-shot cross-lingual transfer
on a natural language inference task. This paper explores the broader
cross-lingual potential of multilingual BERT (mBERT) as a zero-shot language
transfer model on 5 NLP tasks covering a total of 39 languages from various
language families: NLI, document classification, NER, POS tagging, and
dependency parsing. We compare mBERT with the best-published methods for
zero-shot cross-lingual transfer and find mBERT competitive on each task.
Additionally, we investigate the most effective strategy for utilizing mBERT in
this manner, determine to what extent mBERT generalizes away from
language-specific features, and measure factors that influence cross-lingual transfer.
Comment: EMNLP 2019 Camera Ready
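The zero-shot transfer protocol the abstract describes — fine-tune on labeled data in one language, then evaluate directly on another with no target-language labels — can be sketched as follows. The featurizer below is a hypothetical stand-in for mBERT (hashed character trigrams feeding a toy centroid classifier), used only to make the protocol concrete; it is not the paper's model.

```python
import zlib
from collections import Counter

def featurize(text):
    # Hypothetical stand-in for an mBERT encoder: hashed character
    # trigram counts in a small shared feature space.
    grams = [text[i:i + 3] for i in range(len(text) - 2)]
    return Counter(zlib.crc32(g.encode()) % 512 for g in grams)

def train(examples):
    # "Fine-tuning": accumulate per-label feature counts (a toy centroid model).
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(featurize(text))
    return model

def predict(model, text):
    feats = featurize(text)
    # Score each label by feature overlap with the accumulated counts.
    return max(model, key=lambda lbl: sum((model[lbl] & feats).values()))

def zero_shot_accuracy(source_train, target_test):
    model = train(source_train)  # labels come only from the source language
    hits = sum(predict(model, text) == label for text, label in target_test)
    return hits / len(target_test)
```

With a real multilingual encoder in place of `featurize`, `source_train` would hold, say, English NLI examples and `target_test` examples in any of the other 103 pretraining languages.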
An Open Dataset and Model for Language Identification
Language identification (LID) is a fundamental step in many natural language
processing pipelines. However, current LID systems are far from perfect,
particularly on lower-resource languages. We present a LID model which achieves
a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201
languages, outperforming previous work. We achieve this by training on a
curated dataset of monolingual data, the reliability of which we ensure by
auditing a sample from each source and each language manually. We make both the
model and the dataset available to the research community. Finally, we carry
out a detailed analysis of our model's performance, both in comparison to
existing open models and by language class.
Comment: To be published in ACL 202
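The reported metrics (macro-average F1 of 0.93, false positive rate of 0.033 across 201 languages) weight every language equally, so low-resource languages count as much as high-resource ones. A minimal sketch of the computation on toy labels (not the paper's evaluation code):

```python
def per_class_counts(y_true, y_pred, cls):
    # One-vs-rest confusion counts for a single language class.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 scores.
    f1s = []
    for cls in sorted(set(y_true)):
        tp, fp, fn, _ = per_class_counts(y_true, y_pred, cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def macro_fpr(y_true, y_pred):
    # Unweighted mean of per-class false positive rates.
    rates = []
    for cls in sorted(set(y_true)):
        _, fp, _, tn = per_class_counts(y_true, y_pred, cls)
        rates.append(fp / (fp + tn) if fp + tn else 0.0)
    return sum(rates) / len(rates)
```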
Aggregate Implications of Firm Heterogeneity: A Nonparametric Analysis of Monopolistic Competition Trade Models
We measure the role of firm heterogeneity in counterfactual predictions of monopolistic competition trade models without parametric restrictions on the distribution of firm fundamentals. We show that two bilateral elasticity functions are sufficient to nonparametrically compute the counterfactual aggregate impact of trade shocks, and to recover changes in economic fundamentals from observed data. These functions are identified from two semiparametric gravity equations governing the impact of bilateral trade costs on the extensive and intensive margins of firm-level exports. Applying our methodology, we estimate elasticity functions implying that the impact of trade costs on trade flows falls when more firms serve a market, because of smaller extensive-margin responses. Compared to a baseline where elasticities are constant, firm heterogeneity amplifies both the gains from trade in countries with more exporter firms and the welfare gains of European market integration in 2003-2012.
BantuBERTa: using language family grouping in multilingual language modeling for Bantu languages
Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023.
This work investigated whether a multilingual Bantu pretraining corpus could be created from
freely available data. To create the dataset, Bantu text was extracted from corpora that
are freely available online (mainly from Hugging Face). The resulting multilingual
language model (BantuBERTa), pretrained on this data, proved to be predictive across multiple
Bantu languages on a higher-order NLP task (NER) and on a simpler NLP task (classification).
This shows that the dataset can be used for Bantu multilingual pretraining and transfer to
multiple Bantu languages. Additionally, this work investigated whether using this Bantu dataset
could benefit transfer learning in downstream NLP tasks. BantuBERTa under-performed with
respect to other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER's
Bantu language tests (Swahili, Luganda, and Kinyarwanda), but it produced state-of-the-art
results on the Bantu language benchmarks (Zulu and Lingala) in the African News
Topic Classification dataset. The pretraining dataset's size (30% smaller than AfriBERTa's)
and quality were surmised to be the main causes of the poor performance on the NER test.
We believe this is a case-specific failure due to poor data quality: the pretraining
dataset consisted mainly of web-scraped pages, chiefly mC4 and CC100 Bantu text.
However, on lower-order NLP tasks, like classification, pretraining on languages solely
within the language family seemed to benefit transfer to other similar languages within
the family. This potentially opens a method for effectively including low-resourced
languages in low-level NLP tasks.
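The data-quality concern raised above is typical of web-scraped corpora such as mC4 and CC100. Below is a sketch of the kind of line-level filtering and de-duplication commonly applied when curating such pretraining data; the thresholds are illustrative assumptions, not BantuBERTa's actual pipeline.

```python
def keep_line(line, min_words=5, min_alpha_ratio=0.7):
    # Heuristic quality filters of the kind commonly applied to
    # web-scraped corpora. The thresholds here are illustrative,
    # not the settings used for BantuBERTa.
    line = line.strip()
    if len(line.split()) < min_words:        # drop short fragments / nav menus
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in line)
    if alpha / len(line) < min_alpha_ratio:  # drop markup/number-heavy lines
        return False
    return True

def clean_corpus(lines):
    # Keep lines passing the filters, with exact de-duplication.
    seen, kept = set(), []
    for line in lines:
        line = line.strip()
        if keep_line(line) and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept
```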
Image annotation and retrieval based on multi-modal feature clustering and similarity propagation.
The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component of our system consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first one is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second membership represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data to make the features independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation. Our approach is based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes' rule; and (iv) a probabilistic model that is based on possibilistic membership degrees to annotate an image.
The third component of the proposed system consists of an image retrieval framework based on multi-modal similarity propagation. The proposed framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our proposed image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested on a standard benchmark dataset. We show the effectiveness of our clustering algorithm in handling high-dimensional and noisy data. We compare our proposed image annotation approach to three state-of-the-art methods and demonstrate the effectiveness of the proposed image retrieval system.
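The two complementary membership degrees described in the first component can be illustrated generically: a posterior-style membership is relative across clusters, while a typicality degree depends only on a point's fit to one cluster, which is what lets outliers be flagged as noise. The sketch below uses a standard possibilistic-c-means-style typicality over squared distances; it is an illustration of the idea, not the thesis's Generalized Dirichlet estimator, and `eta` is a hypothetical cluster scale parameter.

```python
import math

def posterior_memberships(dists):
    # Probabilistic (posterior-style) degrees: relative fit across clusters,
    # normalized to sum to 1 over clusters for each point.
    weights = [math.exp(-d * d) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def typicality(dist, eta, m=2.0):
    # Possibilistic degree of typicality (possibilistic-c-means style):
    # depends only on the distance to one cluster, so a point far from
    # every cluster receives a low degree everywhere and can be
    # discarded as noise. `eta` is a per-cluster scale parameter.
    return 1.0 / (1.0 + (dist * dist / eta) ** (1.0 / (m - 1.0)))
```

The contrast: a point equidistant from two clusters gets posterior memberships of 0.5 each no matter how far away it is, while its typicality for both clusters drops toward zero with distance.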
Essays in International Economics
This Dissertation consists of three chapters. These chapters use the power of the gravity model widely employed in the international economics literature. The first chapter investigates how economic integration agreements impact countries' local technology, wages, prices and market access, and how to aggregate these effects to compute the changes in countries' welfare. By examining 16 countries that have harmonized with the European Union since 1980, we show that almost all of the participating countries experience welfare gains as a result of signing integration agreements with the European Union. The objective of the second chapter is to explain the determinants of foreign direct investment (FDI) flows; in particular, we focus on the estimated effects of economic integration agreements on FDI flows while controlling for time-varying country-specific unobserved variables as well as time-constant country-pair unobserved variables. Compared to the previous literature, we find that the coefficient estimates for common market and customs union membership are overestimated and that the free trade agreement coefficient becomes insignificant after accounting for the above-mentioned unobserved variables. Building on the work of Baier and Bergstrand (2009), the third chapter aims to obtain unbiased, consistent and efficient coefficient estimates of trade cost variables by accounting for unobserved country heterogeneity and approximation errors.
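For reference, the canonical gravity specification these chapters build on relates bilateral trade flows to economic size and trade costs; this is the generic textbook form, not the chapters' exact estimating equations:

```latex
X_{ij} = G \, \frac{Y_i^{\alpha} \, Y_j^{\beta}}{D_{ij}^{\theta}}
\qquad\Longrightarrow\qquad
\ln X_{ij} = \ln G + \alpha \ln Y_i + \beta \ln Y_j - \theta \ln D_{ij} + \varepsilon_{ij}
```

Here \(X_{ij}\) is the trade flow from country \(i\) to country \(j\), \(Y_i\) and \(Y_j\) are the countries' economic sizes (typically GDP), \(D_{ij}\) proxies bilateral trade costs such as distance, and the log-linear form on the right is the usual estimating equation to which fixed effects for unobserved heterogeneity are added.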
Global Trade and GDP Co-Movement
We revisit the association between trade and GDP comovement for 135 countries from 1970 to 2009. Guided by a simple theory, we introduce two notions of trade linkages: (i) the usual direct bilateral trade index and (ii) new indexes of common exposure to third countries capturing the role of similarity in trade networks. Both measures are economically and statistically associated with GDP correlation, suggesting an additional channel through which GDP fluctuations propagate through trade linkages. Moreover, high-income countries become more synchronized when the content of their trade is tilted toward inputs, while trade in final goods is key for low-income countries. Finally, we present evidence that the density of the international trade network is associated with an amplification of the association between global trade flows and bilateral GDP comovement, leading to a significant evolution of the trade-comovement slope over the last two decades.
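The two building blocks above — bilateral GDP correlation and a direct bilateral trade index — can be sketched as follows. The trade index here is one common normalization (two-way trade over the pair's combined GDP), an assumption for illustration rather than the paper's exact definition.

```python
def pearson(x, y):
    # Pearson correlation between two GDP (growth) series,
    # the usual measure of bilateral GDP comovement.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def bilateral_trade_index(exports_ij, exports_ji, gdp_i, gdp_j):
    # A common direct bilateral trade intensity index: total two-way
    # trade scaled by combined GDP. Illustrative normalization only.
    return (exports_ij + exports_ji) / (gdp_i + gdp_j)
```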
How Do Multilingual Encoders Learn Cross-lingual Representation?
NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations serving as the foundation. As BERT revolutionized representation learning and NLP, it also revolutionized cross-lingual representations and cross-lingual transfer. Multilingual BERT was released as a replacement for single-language BERT, trained with Wikipedia data in 104 languages.
Surprisingly, without any explicit cross-lingual signal, multilingual BERT learns cross-lingual representations in addition to representations for individual languages. This thesis first demonstrates this surprising cross-lingual effectiveness against prior art on various tasks. Naturally, it raises a set of questions, most notably: how do these multilingual encoders learn cross-lingual representations? In exploring these questions, this thesis analyzes the behavior of multilingual models in a variety of settings on high- and low-resource languages. We also look at how to inject different cross-lingual signals into multilingual encoders, and at the optimization behavior of cross-lingual transfer with these models. Together, these analyses provide a better understanding of multilingual encoders on cross-lingual transfer. Our findings lead us to suggest improvements to multilingual encoders and cross-lingual transfer.