
    Extracting information from S-curves of language change

    It is well accepted that the adoption of innovations is described by S-curves (a slow start, an accelerating period, and a slow end). In this paper, we analyze how much information on the dynamics of innovation spreading can be obtained from a quantitative description of S-curves. We focus on the adoption of linguistic innovations, for which detailed databases of written texts from the last 200 years allow for unprecedented statistical precision. Combining data analysis with simulations of simple models (e.g., the Bass dynamics on complex networks), we identify signatures of endogenous and exogenous factors in the S-curves of adoption. We propose a measure to quantify the strength of these factors and three different methods to estimate it from S-curves. We find cases in which exogenous factors are dominant (the adoption of German orthographic reforms and of one irregular verb) and cases in which endogenous factors are dominant (the adoption of conventions for the romanization of Russian names and the regularization of most studied verbs). These results show that the shape of the S-curve is not universal and contains information on the adoption mechanism. (Published in J. R. Soc. Interface, vol. 11, no. 101 (2014); DOI: http://dx.doi.org/10.1098/rsif.2014.1044) Comment: 9 pages, 5 figures; Supplementary Material is available at http://dx.doi.org/10.6084/m9.figshare.122178
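    The Bass dynamics mentioned in the abstract can be illustrated with a minimal discrete-time sketch (a mean-field toy model, not the paper's network simulations): the adoption rate combines an exogenous coefficient p (external influence) and an endogenous coefficient q (imitation), and their ratio shapes the curve.

```python
# Minimal mean-field Bass diffusion sketch.  The adopter fraction F obeys
# dF/dt = (p + q*F) * (1 - F): p is the exogenous (innovation) coefficient,
# q the endogenous (imitation) coefficient.
def bass_curve(p, q, steps, dt=0.01):
    """Return the trajectory of the adopter fraction F(t) via Euler steps."""
    F, traj = 0.0, [0.0]
    for _ in range(steps):
        F += (p + q * F) * (1.0 - F) * dt
        traj.append(F)
    return traj

endogenous = bass_curve(p=0.001, q=1.0, steps=2000)  # imitation-driven
exogenous = bass_curve(p=0.5, q=0.0, steps=2000)     # purely external influence
```

    When imitation dominates (q >> p) the curve has the slow S-shaped start the abstract describes; a purely exogenous process (q = 0) rises fastest at the very beginning, so the two regimes leave different signatures in the curve's shape.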

    Exploiting Cross-Lingual Representations For Natural Language Processing

    Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages. In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches to using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assisting in monolingual lexical semantics like word sense induction, identifying asymmetric lexical relationships like hypernymy between words in different languages, and combining supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.

    The Use of Cyrillic Metadata for Enhancing Discovery of Russian Digital Collection Items: A Case Study of the Bowman Gray World War I Postcards Digital Collection

    This paper examines the online discoverability of multilingual digital collections, focusing on the effectiveness of romanized and original-script metadata for providing access to materials in non-roman-script languages. Using the World War I Postcards from the Bowman Gray Collection digital collection at the University of North Carolina at Chapel Hill as a case study, the dynamics of Russian-language user access to postcards with and without Cyrillic description were compared with those of other major language user groups accessing the collection. While limited by its dependence on Google's system of determining user language, the results suggest that the Cyrillic metadata included in postcard records, limited to title, publisher, and other information transcribed from the resource in a bibliographic cataloging context, did not enhance the discoverability of the postcards. Moreover, every language group was at a distinct disadvantage compared to English-language users in terms of the number of items discovered. In conclusion, I discuss various factors that may have affected these results, as well as implications for cultural heritage institutions with multilingual and multi-script collections. Master of Science in Information Science

    The "handedness" of language: Directional symmetry breaking of sign usage in words

    Language, which allows complex ideas to be communicated through symbolic sequences, is a characteristic feature of our species and manifested in a multitude of forms. Using large written corpora for many different languages and scripts, we show that the occurrence probability distributions of signs at the left and right ends of words have a distinctly heterogeneous nature. Characterizing this asymmetry using quantitative inequality measures, viz. information entropy and the Gini index, we show that the beginning of a word is less restrictive in sign usage than the end. This property is not simply attributable to the use of common affixes, as it is seen even when only word roots are considered. We use the existence of this asymmetry to infer the direction of writing in undeciphered inscriptions, which agrees with the archaeological evidence. Unlike traditional investigations of phonotactic constraints, which focus on language-specific patterns, our study reveals a property valid across languages and writing systems. As both language and writing are unique aspects of our species, this universal signature may reflect an innate feature of the human cognitive phenomenon. Comment: 10 pages, 4 figures + Supplementary Information (15 pages, 8 figures), final corrected version
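    The two inequality measures named in the abstract can be sketched as follows. This is a toy illustration, not the paper's corpora or code: the word list is invented and deliberately chosen so that the word-initial sign distribution is more even than the word-final one, mirroring the qualitative finding that word beginnings are less restrictive.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gini(counts):
    """Gini index of a frequency distribution (0 = perfectly even)."""
    vals = sorted(counts.values())
    n, total = len(vals), sum(vals)
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

# Invented sample: ten distinct initial letters, but endings cluster on -s/-e.
words = ["apple", "bring", "cats", "dogs", "eats",
         "final", "grape", "house", "index", "jumps"]
first = Counter(w[0] for w in words)   # signs at the left end of words
last = Counter(w[-1] for w in words)   # signs at the right end of words
```

    For this list, the initial distribution is uniform (higher entropy, Gini of zero) while the final distribution is skewed (lower entropy, higher Gini) — the direction of asymmetry the paper reports across corpora.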

    Bilingual dictionaries for all EU languages

    Bilingual dictionaries can be automatically generated using the GIZA++ tool. However, these dictionaries contain a lot of noise, which negatively affects the output quality of tools relying on them. In this work, we present three different methods for cleaning noise from automatically generated bilingual dictionaries: an LLR-based, a pivot-based, and a transliteration-based approach. We have applied these approaches to the GIZA++ dictionaries – dictionaries covering the official EU languages – in order to remove noise. Our evaluation showed that all methods help to reduce noise; however, the best performance is achieved using the transliteration-based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones) free for download. We also provide the cleaning tools and scripts for free download.
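    The LLR approach mentioned above is typically based on Dunning's log-likelihood ratio over co-occurrence counts from a sentence-aligned corpus. The sketch below is a generic version under invented counts, not the authors' exact implementation: genuine translation pairs co-occur far more often than chance and score high, while noisy pairs score near zero and can be filtered by a threshold.

```python
import math

def _ll(k, n, p):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), guarded near 0/1."""
    eps = 1e-12
    return k * math.log(max(p, eps)) + (n - k) * math.log(max(1.0 - p, eps))

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 co-occurrence table.

    k11: sentence pairs containing both source and target word;
    k12: source without target; k21: target without source; k22: neither.
    Larger scores indicate stronger association.
    """
    n = k11 + k12 + k21 + k22
    p = (k11 + k21) / n        # target-word rate overall (independence model)
    p1 = k11 / (k11 + k12)     # target rate given the source word
    p2 = k21 / (k21 + k22)     # target rate without the source word
    return 2.0 * (_ll(k11, k11 + k12, p1) + _ll(k21, k21 + k22, p2)
                  - _ll(k11, k11 + k12, p) - _ll(k21, k21 + k22, p))

good = llr(900, 100, 50, 8950)   # invented counts for a genuine pair
noise = llr(10, 990, 90, 8910)   # invented counts for an independent pair
```

    A cleaning pass would keep only dictionary entries whose score exceeds a tuned threshold; the pivot and transliteration approaches use language-independent evidence instead of corpus statistics.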

    Photography as Translation. Visual Meaning, Digital Imaging, Trans-Mediality

    The idea of envisaging photography through the concept of translation is based on the work of Umberto Eco on literary translation (2003) and its application to the cinema by Nicola Dusi (2003). In this article, the author seeks to clarify the terms and limits of this idea, all the while paying attention to debates surrounding the iconic sign and to issues raised by the coming of digital photography.

    Creation of Digital Libraries in Indian Languages Using Unicode

    Unicode is a 16-bit code for character representation in computers, designed by the Unicode Consortium. It represents almost all of the world's scripts, including many extinct scripts such as Brahmi and Kharosthi. ISCII is another code that was developed to represent Indian-language characters in computers, but character representation using ISCII poses problems. It is found that Unicode can solve these problems. This paper suggests measures for the creation of digital libraries in Indic languages and discusses the problems associated with Unicode.

    Estopped by Grand Playsaunce: Flann O'Brien's Post-colonial Lore

    This article seeks to extend our understanding of the Irish writer Flann O'Brien (Myles na gCopaleen, Brian O'Nolan) by reading him from a Law and Literature perspective. I suggest that O'Nolan's painstaking and picky mind, with its attention to linguistic nuance, was logically drawn to the languages of law. In this he confirmed the character that he showed as a civil servant of the cautious, book-keeping Irish Free State. The Free State, like other post-colonial entities, was marked at once by a rhetoric of rupture from the colonial dispensation and by a degree of legal and political continuity. I suggest that O'Nolan's writing works away at both these aspects of the state, alternating between critical and utopian perspectives. After establishing an initial context, I undertake a close reading of O'Nolan's parodies of actual legal procedure, focusing on questions of language and censorship. I then consider his critical work on the issue of Irish sovereignty, placing this in its post-colonial historical context. Finally I describe O'Nolan's treatment of Eamon de Valera's 1937 Constitution. I propose that his attention to textual detail prefigures in comic form the substantial rereadings of the Constitution that have been made in the last half-century
    • 

    corecore