12 research outputs found

    Zero-shot language transfer for cross-lingual sentence retrieval using bidirectional attention model

    We present a neural architecture for cross-lingual mate sentence retrieval which encodes sentences in a joint multilingual space and learns to distinguish true translation pairs from semantically related sentences across languages. The proposed model combines a recurrent sequence encoder with a bidirectional attention layer and an intra-sentence attention mechanism. This way the final fixed-size sentence representations in each training sentence pair depend on the selection of contextualized token representations from the other sentence. The representations of both sentences are then combined using the bilinear product function to predict the relevance score. We show that, coupled with a shared multilingual word embedding space, the proposed model strongly outperforms unsupervised cross-lingual ranking functions, and that further boosts can be achieved by combining the two approaches. Most importantly, we demonstrate the model's effectiveness in zero-shot language transfer settings: our multilingual framework boosts cross-lingual sentence retrieval performance for unseen language pairs without any training examples. This enables robust cross-lingual sentence retrieval even for pairs of resource-lean languages, without any parallel data.
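
The bilinear product mentioned in the abstract above can be illustrated with a minimal sketch. It assumes two fixed-size sentence vectors `u` and `v` and a weight matrix `W` (all hypothetical names, with `W` learned in a real model) and computes the score as u^T W v. The abstract does not specify dimensions, normalisation, or any output non-linearity, so none is shown here.

```python
def bilinear_score(u, W, v):
    """Return the bilinear product u^T W v for plain Python lists.

    u: sentence vector of length m (hypothetical encoder output)
    W: m x n weight matrix given as a list of rows (learned in a real model)
    v: sentence vector of length n
    """
    if len(W) != len(u) or any(len(row) != len(v) for row in W):
        raise ValueError("W must have shape len(u) x len(v)")
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v)))


# With an identity matrix the score reduces to the dot product of u and v.
identity = [[1.0, 0.0], [0.0, 1.0]]
score = bilinear_score([1.0, 2.0], identity, [3.0, 4.0])
```

A non-identity `W` lets the score weight interactions between different dimensions of the two vectors, which a plain dot product cannot do.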

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed

    Multi-view Representation Learning for Unifying Languages, Knowledge and Vision

    The growth of content on the web has raised various challenges, yet also provided numerous opportunities. Content exists in varied forms, such as text appearing in different languages, entity-relationship graphs represented as structured knowledge, and visual embodiments such as images and videos. These forms are often referred to as modalities. In many instances, different combinations of modalities co-exist to complement each other or to provide consensus, thus making the content either heterogeneous or homogeneous. Having an additional point of view for each instance in the content is beneficial for data-driven learning and intelligent content processing. However, despite the availability of such content, most advances in data-driven learning (i.e., machine learning) have been made by solving tasks separately for a single modality. A similar effort has not been made for challenges that require input from all modalities or a subset of them. In this dissertation, we develop models and techniques that can leverage multiple views of heterogeneous or homogeneous content and build a shared representation for aiding several applications which require a combination of the modalities mentioned above. In particular, we aim to address applications such as content-based search, categorization, and generation by providing several novel contributions. First, we develop models for heterogeneous content by jointly modeling diverse representations emerging from two views depicting text and image by learning their correlation. To be specific, modeling such correlation is helpful for retrieving cross-modal content. Second, we replace the heterogeneous content with homogeneous content to learn a common space representation for content categorization across languages. Furthermore, we develop models that take input from both homogeneous and heterogeneous content to facilitate the construction of a common space representation from more than two views. Specifically, the representation is used to generate one view from another. Lastly, we describe a model that can handle missing views, and demonstrate that the model can generate missing views by utilizing external knowledge. We argue that the techniques the models leverage internally provide many practical benefits and a lot of immediate value for applications. From the modeling perspective, the model designs contributed in this thesis can be summarized under the phrase Multi-view Representation Learning (MVRL). These models are variations and extensions of shallow statistical and deep neural network approaches that can jointly optimize and exploit all views of the input content arising from different independent representations. We show that our models advance the state of the art on tasks including, but not limited to, cross-modal retrieval, cross-language text classification, image-caption generation in multiple languages, and caption generation for images containing unseen visual object categories.

    Feature-based transfer learning in natural language processing

    Language-Independent Methods for Identifying Cross-Lingual Similarity in Wikipedia

    The diversity and richness of multilingual information available in Wikipedia have increased its significance as a language resource. The information extracted from Wikipedia has been utilised for many tasks, such as Statistical Machine Translation (SMT) and supporting multilingual information access. These tasks often rely on gathering data from articles that describe the same topic in different languages with the assumption that the contents are equivalent to each other. However, studies have shown that this might not be the case. Given the scale and use of Wikipedia, there is a need to develop an approach to measure cross-lingual similarity across Wikipedia. Many existing similarity measures, however, require the availability of "language-dependent" resources, such as dictionaries or Machine Translation (MT) systems, to translate documents into the same language prior to comparison. This presents some challenges for some language pairs, particularly those involving "under-resourced" languages where the required linguistic resources are not widely available. This study aims to present a solution to this problem by first, investigating cross-lingual similarity in Wikipedia, and secondly, developing "language-independent" approaches to measure cross-lingual similarity in Wikipedia. Two main contributions were provided in this work to identify cross-lingual similarity in Wikipedia. The first key contribution of this work is the development of a Wikipedia similarity corpus to understand the similarity characteristics of Wikipedia articles and to evaluate and compare various approaches for measuring cross-lingual similarity. The author elicited manual judgments from people with the appropriate language skills to assess similarities between a set of 800 pairs of interlanguage-linked articles. 
This corpus contains Wikipedia articles for eight language pairs (all pairs involving English and including well-resourced and under-resourced languages) of varying degrees of similarity. The second contribution of this work is the development of language-independent approaches to measure cross-lingual similarity in Wikipedia. The author investigated the utility of a number of "lightweight" language-independent features in four different experiments. The first experiment investigated the use of Wikipedia links to identify and align similar sentences, prior to aggregating the scores of the aligned sentences to represent the similarity of the document pair. The second experiment investigated the usefulness of content similarity features (such as char-n-gram overlap, links overlap, word overlap and word length ratio). The third experiment focused on analysing the use of structure similarity features (such as the ratio of section length, and similarity between the section headings). Finally, the fourth experiment investigated a combination of these features in a classification and a regression approach. Most of these features are language-independent whilst others utilised freely available resources (Wikipedia and Wiktionary) to assist in identifying overlapping information across languages. The approaches proposed are lightweight and can be applied to any languages written in Latin script; non-Latin script languages need to be transliterated prior to using these approaches. The performance of these approaches was evaluated against the human judgments in the similarity corpus. Overall, the proposed language-independent approaches achieved promising results. The best performance was achieved with the combination of all features in a classification and a regression approach. 
The results show that the Random Forest classifier was able to classify 81.38% of document pairs correctly (F1 score=0.79) in a binary classification problem and 50.88% of document pairs correctly (F1 score=0.71) in a 5-class classification problem, and achieved an RMSE of 0.73 in a regression approach. These results are significantly better than those of a classifier utilising machine translation and cosine similarity of the tf-idf scores. These findings showed that language-independent approaches can be used to measure cross-lingual similarity between Wikipedia articles. Future work is needed to evaluate these approaches in more languages and to incorporate more features.
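
As a rough illustration of the content similarity features named in the abstract above (char-n-gram overlap, word overlap and word length ratio), the following minimal sketch computes three such scores for a pair of texts. The exact definitions used in the thesis are not given in the abstract, so everything here is an assumption: overlap is measured with the Jaccard coefficient, "word length ratio" is read as the ratio of the two word counts, and all function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def char_ngram_overlap(doc_a, doc_b, n=3):
    # Assumption: overlap is the Jaccard coefficient of the n-gram sets.
    return jaccard(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


def word_overlap(doc_a, doc_b):
    # Assumption: words are whitespace-separated, case-insensitive tokens.
    return jaccard(set(doc_a.lower().split()), set(doc_b.lower().split()))


def word_length_ratio(doc_a, doc_b):
    # Assumption: ratio of the shorter word count to the longer one.
    la, lb = len(doc_a.split()), len(doc_b.split())
    if max(la, lb) == 0:
        return 0.0
    return min(la, lb) / max(la, lb)


features = [
    char_ngram_overlap("Berlin is big", "Berlin ist gross"),
    word_overlap("Berlin is big", "Berlin ist gross"),
    word_length_ratio("Berlin is big", "Berlin ist gross"),
]
```

A feature vector like `features` could then be passed to a classifier or regressor; the abstract reports a Random Forest, but the training setup is not described there and is not reproduced here.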

    Positioning : a linguistic ethnography of Cameroonian children in and out of South African primary school spaces

    Philosophiae Doctor - PhD. This thesis traces the trajectories of a group of young Cameroonian learners as they engage in new social and educational spaces in two South African primary schools. Designed as a Linguistic Ethnography and using data from observations, interviews and more than 50 hours of recorded interaction, it illustrates the ways in which these learners position themselves and are differentially positioned within evolving discourses of inclusion and exclusion. As a current study in a multilingual African context, it joins a growing body of literature in Europe which points to the ways in which young people’s language choices and practices are socially and politically embedded in their histories of migration and implicated in relations of power, social difference and social inequality. The study is a Linguistic Ethnography of young school learners’ language experience, which falls outside the scope of much mainstream research. It is one of very few studies to focus on migrant children in contexts of the South where multilingualism is the reality yet where language-in-education policies tend to follow monoglossic norms. The focus is on how a group of 10-16 year old Cameroonian children use their multilingual repertoires to construct and negotiate identities both inside and outside the classroom. It also investigates in more detail the acts of identity of two individuals entering the same school with different linguistic profiles, who are positioned in differentiated ways in relation to transnational and local flows and interconnections. The context is a low socio-economic suburb of Cape Town, South Africa, where Cameroonian practices of language, class, and ethnicity become entangled with local economies of meaning. 
The study also contributes to an emerging body of qualitative research that seeks to develop greater understanding of the relationships between language learners, their socio-cultural worlds and processes of identity construction (Cummins, 1996; Gee, 2001; Holland, Lachicotte, Skinner, & Cain, 1998; Rampton, 1995, 2006). Recent international and South African studies tend to focus on secondary school learners, showing how they are struggling to negotiate the currents of a complex society (Adebanji, 2010; Sayed, 2002; Sookrajh, Gopal & Maharaj, 2005), although there is a recent and rapidly growing body of Scandinavian research on primary school children (for example, Cekaite & Evaldsson, 2008; Madsen, 2008; Møller, 2009; Møller, Holmen & Jørgensen, 2012). In contrast, the children in this study are negotiating the transition between childhood and adolescence, faced with issues of race, linguistic competence and discrimination at a time when moving from one age group to the next should have been relatively unproblematic. They are thus entangled in different levels of transition: emotional, physical and spatial. These issues of transition and negotiation will be highlighted through the lens of positioning. The concepts of ‘position’ and ‘positioning’ (Davies & Harré, 1990) appear to have origins in marketing, where position refers to the communication strategies that allow certain products to be placed in a market among their competitors (Tirado & Gálvez, 2007, p. 20). Holloway (1984) first used the concept of positioning in the social sciences to analyse the construction of subjectivity in the area of heterosexual relationships (Tirado & Gálvez, 2007). Positioning here was explained as relational processes that constitute interaction with other individuals. The present study focuses on how ‘interactants’ position themselves vis-à-vis their words and texts, their audiences and the contexts they both "respond to and construct linguistically" (Jaffe, 2009, p. 3). 
As they make use of lexical and grammatical tools available to them in interaction, it becomes apparent that the process of identity construction through positioning does not "reside within the individual but in intersubjective relations of sameness and difference, […] power and disempowerment" (Bucholtz & Hall, 2005, p. 607). Thus to interpret multilingual children’s positioning requires a recursive process, using a double perspective: it means looking at the day-to-day moments of interactional and other practices, and also the wider political discourses in which these practices may be embedded and historically rooted (Maguire, 2005) and which they index in different ways. These day-to-day moments of practice thus involve different “acts of identity” (Le Page & Tabouret-Keller, 1985) which can also be described as acts of stance-taking (Jaffe, 2009). A stance may index multiple selves and social identities. However, not all stances are open to everyone: those who have their social, cultural or linguistic capital (Bourdieu, 1991, 1997) recognized in a particular space will be able to position themselves more strongly there than those who do not. Moreover, stances are not successful unless 'taken up' by interactants (Jaffe, 2009): this uptake may take the form of interlocutors’ stances of alignment, realignment, or misalignment (C. Goodwin, 2007; Matoesian, 2005). Uptake in multilingual contexts is influenced by the prevailing "linguistic market" (Bourdieu, 1991, pp. 55-67): day-to-day acts of positioning take place in inequitable markets. These ‘markets’ are fertile grounds for social stratification where speech acts and the languages in which they are realized are assigned different symbolic values (Bourdieu, 1991, 1997). Mastery of the 'legitimate' language or languages is then often a pre-condition for claiming symbolic and material resources. 
New institutional spaces in South Africa become interesting here, because they are characterized by new formations of class, changes in gender roles and relations and other instances of macro-structural shifts. In such spaces, linguistic hierarchies and patterns of distribution of linguistic resources are rapidly changing (Kerfoot & Bello-Nonjengele, 2014). The school as a key institution in the distribution of social, cultural and linguistic capital is thus an important site for exploring the role of language and multilingualism in social and educational change. This thesis sets out to answer the following research questions: a) How do immigrant learners use their linguistic repertoires to construct, negotiate or contest identities in new school spaces? b) How do different spaces enable or constrain the new identities negotiated? c) What are the implications for language learning policy and practice? Data collection took place over two years between February 2010 and June 2013, and followed participants from grades 5 to 7 in the English medium and Afrikaans language classrooms. Participants were 10-16 year old Cameroonian children in two Cape Town schools, ten in each. The study contains nine chapters, with chapter 1 providing an overview of the background, rationale, and conceptual and methodological framework. Chapter 2 traces the shift towards the social in language studies, considering frameworks for understanding the differential values placed on linguistic resources as actors move across social spaces, both local and transnational. Here interaction is viewed as a crucial site for identity construction, generating a social stage through which reality is constructed, shared, and made meaningful. Chapter 3 reviews studies of interactional positioning amongst multilingual learners in social and educational contexts in South Africa and more globally. 
Chapter 4 focuses on the methodology used in the study, discussing the research design based on Linguistic Ethnography, a qualitative approach which is based on the two broad planks of ethnography and Interactional Sociolinguistics (IS) and which enables an analytical framework combining Conversation Analysis (CA), Discourse Analysis (DA) and Systemic Functional Linguistics (SFL). Together, these analytical tools enable a multifaceted illumination of the construction of identity in discourse. The various tools used in data collection are discussed in depth followed by comment on reflexivity, challenges in the field and limitations of the study. Chapter 5 delineates the researcher’s trajectory in the field. This comprises profiles of the study schools (including the schools’ socio-economic, ethnic and linguistic make-up in relation to teachers and learners), perspectives on why the schools were chosen, the differing receptions to a research presence there, and some reflections on the researcher’s identity construction. The chapter further explores different techniques of data collection within this context: field notes and thick description, interviews, and audio recordings of interactions in and out of schools. Chapters 6, 7 and 8 present and analyse findings from classroom observation and interview data, together with audio-recordings of a group of Cameroonian learners interacting with each other and with children of other nationalities in classrooms, community and home spaces. These chapters aim to illustrate how these learners used linguistic resources to position themselves and others, to build, maintain and negotiate identities, and to assert or negate identifications. Chapters 7 and 8 build on the analysis presented in chapter 6 by focusing respectively on two key emergent themes: owning participatory spaces and defying positioning in multilingual spaces. 
Chapter 7 centres on the interactional and other means by which a 12-year-old Anglophone learner, James, navigated his way increasingly successfully through new social and educational spaces, expanding his linguistic repertoire. Chapter 8 focuses on a 12-year-old Francophone learner, Aline, and the ways in which she tried to convert her linguistic capital on new linguistic markets. Her efforts were more often than not met with negative evaluation, leading to a loss of both social and academic identities. The analysis of data thus serves as a rich point of entry for understanding the connections between linguistic repertoires, relations between ethnic groups, youth culture, and the experience of social change. Through their discursive production of selves, these adolescent learners (supposed to be negotiating only the normal transition from one age group to the next) are here negotiating the currents of a complex society and dealing with issues of race, language and segregation. Findings suggest that participants had multiple identity options that were negotiated through different practices, from food choices to language and interactional norms. These different identity options were however constrained by existing norms and linguistic hierarchies in each space, allowing some to accommodate new linguistic practices and ways of doing things, while others experienced more ambivalent and contradictory processes of adaptation. In informal settings there was evidence of a third space characterized by a mélange of languages in which both formal and informal versions of English and French, along with Cameroonian Pidgin English (CPE) and other Cameroonian languages, were used. However, even in these settings there was a gradual shift to English, indicating the penetration of macrosocial and institutional discourses into private spaces. 
The thesis concludes with a set of recommendations for caregivers, teachers and policymakers seeking to create schools more welcoming of diversity. It is hoped, then, that this study will help families and schools to realize the variety of ways in which linguistic repertoires influence school success, both social and educational, and to find ways of using these repertoires for development and learning. In this way, they might contribute to immigrant youngsters’ ability to construct strong identities as learners and valued social beings.

    Discourse and Digital Practices

    Discourse and Digital Practices shows how tools from discourse analysis can be used to help us understand new communication practices associated with digital media, from video gaming and social networking to apps and photo sharing. This cutting-edge book: draws together fourteen eminent scholars in the field, including James Paul Gee, David Barton, Ilana Snyder, Phil Benson, Victoria Carrington, Guy Merchant, Camilla Vasquez, Neil Selwyn and Rodney Jones; answers the central question "How does discourse analysis enable us to understand digital practices?"; addresses a different type of digital media in each chapter; demonstrates how digital practices and the associated new technologies challenge discourse analysts to adapt traditional analytic tools and formulate new theories and methodologies; and examines digital practices from a wide variety of approaches, including textual analysis, conversation analysis, interactional sociolinguistics, multimodal discourse analysis, object ethnography, geosemiotics, and critical discourse analysis. Discourse and Digital Practices will be of interest to advanced students studying courses on digital literacies or language and digital practices.