9 research outputs found

    Recreating the Network of Early Modern Natural Philosophy: A Mono- and Multilingual Text Data Vectorization Method

    How could one create a network representation of a book corpus spanning more than two hundred years? In this paper, we present a method based on text data vectorization for a complex and multifaceted network representation of an early modern corpus of 239 natural philosophy textbooks published in Latin, French, and English. On the one hand, we use unsupervised methods (namely topic modeling, term frequency–inverse document frequency, and multilingual word embeddings) to represent the broader features of this corpus, such as the homogeneity of style and linguistic usage, both among works written in the same language and across languages. On the other hand, we use collocate analysis of specific keywords to explore how certain concepts were understood, reshaped, and disseminated in the corpus. We call this the ‘semantic dimension.’ Each of these two dimensions provides a different way of correlating the books via text data vectorization and representing them as a network. Since each dimension is itself complex and multifaceted, the network we construct for each is a multiplex one, made of several layer-graphs. Furthermore, provided that enough information is available about the authors of the works included in our inventory, this research lays the grounds for extending the described network representation with a third multiplex, one that explores some of the social features of the authors in question.
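One of the vectorization steps named in this abstract, term frequency–inverse document frequency followed by pairwise similarity, translates directly into a network-building recipe: vectorize each book, then connect pairs whose similarity clears a threshold. The sketch below illustrates that idea only; the function names, tokenization, and threshold are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors: term frequency weighted by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: (count / len(tokens)) * math.log(n / df[t])
               for t, count in tf.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_edges(docs, threshold=0.1):
    """Edge list (i, j, weight) for one layer-graph of the network:
    connect documents whose TF-IDF similarity exceeds the threshold."""
    vecs = tfidf_vectors(docs)
    return [(i, j, cosine(vecs[i], vecs[j]))
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) > threshold]
```

Each choice of vectorization (topic model, TF-IDF, embeddings) would yield its own edge list, i.e. one layer of the multiplex.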

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7,000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the need to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax levels. Specifically, we propose to (i) use orthographic similarities and transliteration between named entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities in the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs, since they are simple to extract, effective for bootstrapping the mapping of BWEs, and able to overcome the failure of unsupervised methods. The fourth paper presents our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages such as Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
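The simplest of the "cheap cross-lingual signals" this abstract mentions is orthographic similarity between named entities across languages sharing a script. A minimal, generic way to score it is one minus the normalized edit distance; the sketch below is an illustration of that idea, not the thesis's feature set:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def orthographic_similarity(a, b):
    """1 - normalized edit distance: 1.0 for identical strings, 0.0 for
    strings with nothing in common."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Across scripts this signal fails by construction, which is exactly where the transliteration models described above would have to take over.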

    Quantitative stability of optimal transport maps and linearization of the 2-Wasserstein space

    This work studies an explicit embedding of the set of probability measures into a Hilbert space, defined using optimal transport maps from a reference probability density. This embedding linearizes the 2-Wasserstein space to some extent and enables the direct use of generic supervised and unsupervised learning algorithms on measure data. Our main result is that the embedding is (bi-)Hölder continuous when the reference density is uniform over a convex set, and it can be equivalently phrased as a dimension-independent Hölder-stability result for optimal transport maps. (Comment: 21 pages)
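The paper treats general dimension, where the embedding is only bi-Hölder; in one dimension the construction is explicit and exact, since the transport map from a uniform reference is the quantile function and the L2 distance between embeddings equals the 2-Wasserstein distance. The sketch below illustrates that 1-D special case under those simplifying assumptions (function names are ours, not the paper's):

```python
import math

def embed(sample, m=100):
    """Embed an empirical 1-D measure as its quantile function evaluated on a
    uniform grid -- i.e. the transport map from a Uniform(0, 1) reference."""
    xs = sorted(sample)
    n = len(xs)
    # Empirical quantile at levels (k + 0.5) / m, k = 0, ..., m - 1.
    return [xs[min(int((k + 0.5) / m * n), n - 1)] for k in range(m)]

def w2(sample_a, sample_b, m=100):
    """In 1-D, the 2-Wasserstein distance between the empirical measures is
    the (discretized) L2 distance between their quantile embeddings."""
    ea, eb = embed(sample_a, m), embed(sample_b, m)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ea, eb)) / m)
```

Once measures live in a Hilbert space like this, any off-the-shelf learner that consumes fixed-length vectors can be run on measure data, which is the practical point of the linearization.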

    Minimax estimation of smooth optimal transport maps

    Brenier's theorem is a cornerstone of optimal transport that guarantees the existence of an optimal transport map $T$ between two probability distributions $P$ and $Q$ over $\mathbb{R}^d$ under certain regularity conditions. The main goal of this work is to establish the minimax estimation rates for such a transport map from data sampled from $P$ and $Q$ under additional smoothness assumptions on $T$. To achieve this goal, we develop an estimator based on the minimization of an empirical version of the semi-dual optimal transport problem, restricted to truncated wavelet expansions. This estimator is shown to achieve near minimax optimality using new stability arguments for the semi-dual and a complementary minimax lower bound. Furthermore, we provide numerical experiments on synthetic data supporting our theoretical findings and highlighting the practical benefits of smoothness regularization. These are the first minimax estimation rates for transport maps in general dimension. (Comment: 53 pages, 6 figures)
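The wavelet-restricted semi-dual estimator itself is beyond a snippet, but the object being estimated is easy to illustrate in one dimension, where the monotone (Brenier) map reduces to the target quantile function composed with the source CDF, $T = G^{-1} \circ F$. The sketch below is a naive plug-in estimator under that 1-D simplification; it is an illustration of what a transport-map estimator produces, not the paper's method:

```python
import bisect
import math

def fit_transport_map(xs, ys):
    """Toy 1-D plug-in estimate of the monotone transport map T = G^{-1} o F,
    where F and G are the empirical CDFs of the samples xs and ys."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)

    def T(x):
        # Empirical CDF level of x under the source sample...
        u = bisect.bisect_right(xs_sorted, x) / n
        # ...mapped to the corresponding empirical quantile of the target.
        k = min(max(math.ceil(u * m) - 1, 0), m - 1)
        return ys_sorted[k]

    return T
```

A plug-in like this is a step function and cannot exploit smoothness of $T$; imposing smoothness through a restricted function class (as the wavelet truncation above does) is what makes the faster minimax rates attainable.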