Character-level and syntax-level models for low-resource and multilingual natural language processing
There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages.
This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter.
In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
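To make the idea of annotation projection as label propagation more concrete, the following is a minimal sketch of generic label propagation over a small alignment graph. It is not the method of the thesis: the graph layout, the node names (`s1`, `t1`, ...), the edge weights, and the function `propagate_labels` are all assumptions chosen for illustration. Labeled high-resource tokens act as clamped seeds, and unlabeled low-resource tokens repeatedly take the weighted average of their neighbours' label distributions.

```python
def propagate_labels(edges, seeds, num_iters=20):
    """Generic label propagation (illustrative sketch, hypothetical API).

    edges: dict mapping each node to a list of (neighbour, weight) pairs;
           every neighbour must itself be a key of `edges`.
    seeds: dict mapping labeled nodes to their known label.
    Returns a dict mapping every node to its most probable label
    (or None if no label mass ever reached the node).
    """
    labels = sorted(set(seeds.values()))
    # Every node holds a distribution over labels; seeds start one-hot.
    dist = {n: {l: 0.0 for l in labels} for n in edges}
    for node, label in seeds.items():
        dist[node][label] = 1.0

    for _ in range(num_iters):
        new_dist = {}
        for node, neighbours in edges.items():
            if node in seeds:
                # Seed nodes stay clamped to their known label.
                new_dist[node] = dist[node]
                continue
            # Weighted sum of the neighbours' current distributions.
            scores = {l: sum(w * dist[m][l] for m, w in neighbours) for l in labels}
            total = sum(scores.values())
            new_dist[node] = {l: (s / total if total else 0.0) for l, s in scores.items()}
        dist = new_dist

    return {n: (max(d, key=d.get) if any(d.values()) else None) for n, d in dist.items()}


# Hypothetical toy graph: s1, s2 are labeled source-language tokens,
# t1, t2 are unlabeled target-language tokens linked by word alignments.
edges = {
    "s1": [("t1", 1.0)],
    "s2": [("t2", 1.0)],
    "t1": [("s1", 1.0), ("t2", 0.2)],
    "t2": [("s2", 1.0), ("t1", 0.2)],
}
seeds = {"s1": "NOUN", "s2": "VERB"}
projected = propagate_labels(edges, seeds)
print(projected)
```

In this toy graph each target token ends up with the label of the source token it is most strongly aligned to; the weak link between `t1` and `t2` only shifts some probability mass without changing the final decision.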
Multilingual sentiment analysis in social media.
252 p.
This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language, a tool providing analysis only for Basque would not be enough for a real-world application. Thus, we set out to develop a multilingual system covering Basque, English, French and Spanish. The thesis addresses the following challenges in building such a system:
- Analysing methods for creating sentiment lexicons suitable for less-resourced languages.
- Analysis of social media (specifically Twitter): tweets pose several challenges for understanding and extracting opinions from such messages. Language identification and microtext normalization are addressed.
- Researching the state of the art in polarity classification and developing a supervised classifier that is tested against well-known social media benchmarks.
- Developing a social media monitor capable of analysing sentiment with respect to specific events, products or organizations.
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone largely unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
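As a rough illustration of the mutual-information criterion mentioned above, the sketch below computes the average mutual information between the classes of adjacent tokens for a given word-to-class assignment. This is only the quantity that Brown clustering tries to keep high when merging classes; it is neither the clustering algorithm itself nor the utility model of the paper. The function name `average_mutual_information` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter


def average_mutual_information(tokens, word_to_class):
    """Average mutual information (in bits) between classes of adjacent tokens.

    tokens: list of word strings in corpus order.
    word_to_class: dict mapping every word in `tokens` to a class id.
    """
    classes = [word_to_class[t] for t in tokens]
    # Count class bigrams, i.e. pairs of classes of adjacent tokens.
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())

    # Marginal counts of a class appearing in left or right bigram position.
    left, right = Counter(), Counter()
    for (c1, c2), n in bigrams.items():
        left[c1] += n
        right[c2] += n

    ami = 0.0
    for (c1, c2), n in bigrams.items():
        p_joint = n / total
        p_left = left[c1] / total
        p_right = right[c2] / total
        ami += p_joint * math.log2(p_joint / (p_left * p_right))
    return ami


# Toy corpus (hypothetical): compare a sensible 3-class assignment
# against putting every word into a single class.
tokens = "the cat sat the dog sat".split()
three_classes = {"the": 0, "cat": 1, "dog": 1, "sat": 2}
one_class = {w: 0 for w in tokens}

ami_three = average_mutual_information(tokens, three_classes)
ami_one = average_mutual_information(tokens, one_class)
print(ami_three, ami_one)
```

With a single class the adjacent-class distribution carries no information, so the score is zero; the number of classes is exactly the kind of setting the abstract argues should be tuned rather than left at a default.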
Ancient Polities, Modern States
Political science is concerned with the study of polities. However, remarkably few scholars are familiar with the polities of the premodern era, such as Vijayanagara, Siam, Abyssinia, the Kingdoms of Kongo or Mutapa, or the Mysore or Maratha empires. This dissertation examines the legacies of precolonial polities in India, during the period from 1707 to 1857. I argue that, contrary to the widespread perception that the Indian subcontinent was a pre-state society, the late eighteenth and early nineteenth centuries were a time of rapid defensive modernization across the subcontinent, driven by the requirements of gunpowder weaponry and interstate warfare among South Asian regimes and against European colonial powers. These changes included the broadening and deepening of the tax base, consolidation of territorial control, reorganization of domestic militaries to use infantry and gunpowder weapons, rationalization of the administration through use of accounts and printed records, and the professionalization and functional differentiation of the executive branch.
I then trace the boundaries of precolonial eighteenth-century South Asian polities, in order to show that districts of India that lie narrowly within the boundary lines of historically centralized states perform significantly better today on a wide variety of district-level indicators of state effectiveness than those narrowly outside these boundaries, despite the fact that these borders largely ceased to exist in the early nineteenth century. These estimated effects are robust to a wide variety of controls, placebo tests for border displacement, the exclusion of individual polities, and controls for the boundaries of India’s contemporary federal states. I verify the persistent legacy of precolonial states using a combination of archival research, district-level colonial data on taxation and public goods from 1853 to 1901, and a field test of bureaucratic responsiveness conducted in the state of Karnataka. Using extensive archival research on the fiscal and bureaucratic structure of Indian states in the eighteenth century, I show that following the decline of the Mughal Empire, warfare between “challenger states” prompted an accumulation of bureaucratic and fiscal capacity at the local level, and that this capacity has persisted through the colonial era to the present day. In contrast to “bottom-up” theories of state capacity which root institutional strength in societal characteristics such as ethnic homogeneity, social capital, or land equality, it is argued that government effectiveness is cumulatively built through long-term historical investments in state capacity, and that, in India, an important phase of investment occurred during the warring states period of the eighteenth century.
Finally, I show that this relationship exists beyond the South Asian context, both in cross-country regressions of the effect of state antiquity on contemporary state capacity, and by conducting a subnational historical analysis within districts of the Former Soviet Union. I conclude that augmenting the state’s power to tax, regulate, or conscript is, in Weber’s phrase, “a long and slow boring of hard boards”, and the resources required in order to attain a functioning state - bureaucratic infrastructure, norms of compliance, and affective loyalty - are accumulated only very gradually. Yet where long-extant political regimes were successful in monitoring, coercing, and mobilizing citizens towards state goals, they generate a reservoir of legitimacy and compliance that is essential for making states work in the world today.