
    Patterns of grammaticalization in African languages

    The approach outlined in the present paper is based on observations made with African languages. Although the 1000-odd African languages display a remarkable extent of structural variation, there are certain structures that do not seem to occur in Africa. Thus, to our knowledge, an African language having anything that could be called an ergative case or a numeral classifier system has not been discovered so far. It may turn out that our approach can, in a modified form, be made applicable to languages outside Africa. This, however, is a possibility that has not been considered here. The present approach is based essentially on diachronic findings in that it uses observations on language evolution in order to account for structural differences between languages. Thus, it has double potential: apart from describing and explaining typological diversity, it can also be material to reconstructing language history.

    Topic Segmentation: How Much Can We Do by Counting Words and Sequences of Words

    In this paper, we present an innovative topic segmentation system based on a new informative similarity measure that takes word co-occurrence into account, so that the system does not depend on existing linguistic resources such as electronic dictionaries or lexico-semantic databases (thesauri or ontologies). Topic segmentation is the task of breaking documents into topically coherent multi-paragraph subparts, and it has been used extensively in information retrieval and text summarization. In particular, our architecture proposes a language-independent topic segmentation system that addresses three main problems evidenced by previous research: systems based uniquely on lexical repetition show reliability problems; systems based on lexical cohesion use existing linguistic resources that are usually available only for dominant languages and, as a consequence, do not apply to less favored languages; and, finally, some systems need previously harvested training data. For that purpose, we only use statistics on words and sequences of words based on a set of texts. This provides a flexible solution that may narrow the gap between dominant languages and less favored languages, thus allowing equivalent access to information.
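The abstract does not define its informative similarity measure, so the sketch below is not the authors' method. It only illustrates the general idea of segmenting text from word statistics alone, using plain cosine similarity between adjacent units. The function names, the use of sentences as units, and the `threshold` value are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(units, threshold=0.1):
    """Return each index i where a topic boundary falls before units[i].

    A boundary is placed wherever two adjacent units share too little
    vocabulary. No dictionary, thesaurus, or training data is used.
    """
    bags = [Counter(u.lower().split()) for u in units]
    return [i for i in range(1, len(bags))
            if cosine(bags[i - 1], bags[i]) < threshold]


units = [
    "the cat chased the mouse",
    "the mouse fled the cat",
    "stock markets fell sharply",
    "markets fell on weak stock data",
]
boundaries = segment(units)  # boundary before the third unit
```

A pure repetition-based baseline like this is the kind of system the abstract describes as unreliable, which is the motivation it gives for a co-occurrence-aware measure.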

    Linguistic and Cultural Changes in the Spanish and English Revised Editions of The Famous Five. The Case of Five on Kirrin Island Again

    This undergraduate paper provides a comparative analysis of Enid Blyton’s Five on Kirrin Island Again across four versions of the book, two in English and two in Spanish. The main purpose of this work is to compile and classify the different kinds of changes that were carried out in the adapted editions of the book, namely changes that involve degree of formality, simplifications, updating of language, and addition and deletion of content, as well as the fixing of some problematic elements, especially those that deal with gender issues. Due to the large number of changes, this study focuses on the most relevant ones, most of them found in the first half of the book. All these alterations are listed in the appendix, following their order of occurrence in the body of this paper. The methodology is divided into three main processes. First, the two English editions were compared, annotating the main modifications and classifying them under different categories. Second, the same procedure was followed for the Spanish editions. Finally, the English and Spanish books were juxtaposed, so as to establish their main differences and similarities, taking into account the concepts of adaptation and retranslation. Through the analysis and classification of the changes it is possible to reach some conclusions, mainly concerning the updating of language and content to meet the standards of our current society. On the one hand, this study shows how the alterations in the Spanish adaptation usually correlate with those in the English one, suggesting a similar pattern of modifications. On the other hand, the divergent aspects between the English and Spanish versions are also revealed, showing a larger number of changes in the Spanish adaptation, as well as slight differences, some of which are related to the nature of each language.
Lastly, this paper invites reflection upon the role of language and how some words or expressions become outdated through the years, and it raises the issue of whether older children’s books should be adapted for the current generations, even if this implies certain changes of content. Undergraduate dissertation (Traballo fin de grao, UDC.FIL). English: linguistic and literary studies. Academic year 2019/202

    Text Normalisation of Dialectal Finnish

    Text normalisation is the process of converting non-standard written language into a standardised form. Dialects are one example of non-standard language, and they can differ considerably from the standardised literary language. In addition, Finnish orthography is largely phonemic, which makes it possible to render the features of spoken language in writing. Especially on informal platforms and in colloquial contexts such as social media, Finnish speakers may write words the way they would normally pronounce them. Material consisting of such non-standard language can also be found for natural language processing purposes, for example on Twitter. However, natural language processing tools designed for traditional standard-language text do not necessarily achieve the desired results when applied to colloquial material, in which case text normalisation can be used as an intermediate step. In the normalisation process, input text that is colloquial or otherwise contains non-standard language is converted into a standardised written form that natural language processing tools handle better. This work builds on earlier research on the normalisation of Finnish dialects. Earlier studies have found that character-based BRNN (Bidirectional Recurrent Neural Network) models achieve good results in normalising Finnish dialects when the input consists of words in chunks of three. This means that the system receives a set of three words at a time as input, and each word is further split into characters separated by spaces. This work aimed to use the same methods and data as the earlier research so that the results would be comparable.
The data comes from the Suomen kielen näytteitä (Samples of Spoken Finnish) corpus maintained by the Institute for the Languages of Finland (Kotimaisten kielten keskus), and the open-source library OpenNMT was used for the normalisation. The results of the experiments carried out in this work appear to confirm the findings of the earlier research, but there are also indications that neural network models might benefit from inputs consisting of longer chunks. In addition to the BRNN model, other neural network architectures are also tested, but a comparison of WER (Word Error Rate) values shows that the BRNN model performs the normalisation task better than the other neural network architectures.
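The input format and the evaluation metric described above can be sketched as follows. This is a minimal illustration, not the thesis's preprocessing code. The `_` word-boundary symbol, the function names, and the dialect example (`mie oon` for standard `minä olen`) are assumptions. The abstract states only that words are fed in chunks of three with characters separated by spaces, and that models are compared by WER.

```python
def to_char_chunks(tokens, chunk_size=3, boundary="_"):
    """Group tokens into chunks and spell each word as spaced characters.

    `boundary` marks word edges inside a chunk; this symbol is an
    assumption, since the abstract does not say how words are delimited.
    """
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        words = tokens[start:start + chunk_size]
        spelled = [" ".join(word) for word in words]
        chunks.append(f" {boundary} ".join(spelled))
    return chunks


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


chunks = to_char_chunks(["mie", "oon", "täällä", "nyt"])
wer = word_error_rate(["minä", "olen", "täällä"], ["minä", "oon", "täällä"])
```

Here `chunks` holds one three-word chunk and a final shorter one, and `wer` counts one substitution over three reference words. Raising `chunk_size` is the kind of change that would probe the abstract's suggestion that longer chunks might help.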

    Language In My Mouth: Linguistic Variation in the Nmbo Speech Community of Southern New Guinea

    This thesis is a mixed-methods investigation into the question of the sociolinguistics of linguistic diversity in Papua New Guinea. Social and cultural traits of New Guinean speech communities have been hypothesised as conducive to language differentiation and diversification (Laycock 1991, Thurston 1987, 1992, Foley 2000, Ross 2001); however, there have been few empirical studies to support these hypotheses. In this thesis I investigate linguistic micro-variations within a contemporary New Guinean speech community, with the goal of identifying socio-cultural pressures that affect language variation and change. The community under investigation is the Nmbo speech community located in the Morehead area of Southern New Guinea. It is a highly multilingual community in the middle of the Nambu branch dialect chain, and consists primarily of the three villages Govav, Bevdvn, and Arovwe. The ideologically licensed speakers of Nmbo are the Kerake tribe people, but due to the practice of marriage exogamy, a large portion of non-Kerake people speak Nmbo as an additional language learnt from their parents or spouse. This thesis embraces the complexities of the multilingual ecology by including data from Kerake women who have married out of the Nmbo villages into the neighbouring Nen language village of Bimadbn. The empirical investigations bring data from three directions. First are the qualitative descriptions based on my own ethnographic fieldwork supported by prior ethnographic descriptions. The picture to emerge is of an egalitarian multilingual speech community. The qualitative descriptions also provide basic facts about demographics and social structures of the community. Second is the linguistic description of the Nmbo language. Nmbo is an under-described language without substantial prior description, and this thesis contains a sketch grammar covering the basic aspects of Nmbo grammar. Finally there are three quantitative studies of variation.
The vowel sociophonetic study and the word-initial [h]-drop study are classic Labovian variationist studies that investigate patterns of variation across a sample of speakers. The former is based on elicited word list data, and the latter on naturalistic speech data. The third quantitative study takes a grammaticalisation approach to an emergent topic marker in a topicalising construction from a relative clause construction. This is the first thesis ever produced providing qualitative, descriptive, and quantitative data from a New Guinean speech community within a language ecology of vital indigenous multilingualism. The contributions of the thesis are twofold. Firstly, this thesis brings grammatical and sociolinguistic descriptions from an under-studied language. It is a socio-grammar (Nagy 2009) that considers language ecology, sociolinguistics, and grammatical description. Secondly, this thesis contributes empirical data on the sociolinguistics of small-scale speech communities. The classic sociolinguistic variable of gender is not found to be particularly significant in the variables studied, despite the community being highly gendered in other social domains. Village, however, shows some significance. As far as the three variables are concerned, Nmbo speakers show little community-internal variation and paint a picture of a tight-knit society of intimates (Trudgill 2011). The conclusion to the question of the sociolinguistics of diversification is that while there is some evidence of sociolinguistic differentiation within the Nmbo speech community, the most important social groups to orient against are the other sister language groups in the Morehead area. The nascent variation within the Nmbo speech community, combined with the ethnographic evidence of a cluster of dense and multiplex social networks, suggests that should the social need to differentiate between other Kerake arise, linguistic differentiation may occur rapidly.

    Collocational processing in typologically different languages, English and Turkish: Evidence from corpora and psycholinguistic experimentation

    Unlike the traditional words-and-rules approach to language processing (Pinker, 1999), usage-based models of language have emphasised the role of multi-word sequences (Christiansen & Chater, 2016b; Ellis, 2002). Various psycholinguistic experiments have demonstrated that multi-word sequences (MWS) are processed quantitatively faster than novel phrases by both L1 and L2 speakers (e.g. Arnon & Snider, 2010; Wolter & Yamashita, 2018). Collocations, a specific type of MWS, hold a prominent position in psycholinguistics, corpus linguistics and language pedagogy research (Gablasova, Brezina, & McEnery, 2017a). In this dissertation, I explored the processing of adjective-noun collocations in Turkish and English by L1 speakers of these languages through a corpus-based study and psycholinguistic experiments. Turkish is an agglutinating language with a rich morphology; it is therefore valid to ask whether the agglutinating structure of Turkish affects collocational processing in L1 Turkish and whether the same factors affect the processing of collocations in English and Turkish. In addition, this study looked at L1 and L2 processing of collocations in English. This thesis first investigated the frequency counts and association statistics of English and Turkish adjective-noun collocations through a corpus-based analysis of general reference corpora of English and Turkish. The corpus study showed that unlemmatised collocations, which do not take into account the inflected forms of the collocations, have similar mean frequency and association counts in both languages. This suggests that the base (uninflected) forms of the collocations in English and Turkish do not appear to have notably different frequency and association counts from each other. To test the effect of the agglutinating structure of Turkish on the collocability of adjectives and nouns, the lemmatised forms of the collocations in both languages were examined.
In other words, collocations in the two languages were lemmatised. Lemmatisation brings the benefit of including the frequency counts of both the base and inflected forms of the collocations. The findings indicated that the vast majority (75%) of the lemmatised Turkish adjective-noun combinations occur at a higher frequency than their English equivalents. In addition, the agglutinating structure of Turkish appears to increase adjective-noun collocations’ association scores in both frequency bands, since the vast majority of Turkish collocations reach higher collocational strength scores than their unlemmatised forms. After the corpus study, I designed psycholinguistic experiments to explore the sensitivity of speakers of these languages to the frequency of adjectives, nouns and whole collocations in acceptability judgment tasks in English and Turkish. Mixed-effects regression modelling revealed that collocations which have similar collocational frequency and association scores are processed at comparable speeds in English and Turkish by L1 speakers of these languages. That is to say, both Turkish and English speakers are sensitive to the collocation frequency counts. This finding is in line with many previous empirical studies showing that language users process MWS quantitatively faster than control phrases (e.g. Arnon & Snider, 2010; McDonald & Shillcock, 2003; Vilkaite, 2016). However, lemmatised collocation frequency counts affected the processing of Turkish and English collocations differently, and Turkish speakers appeared to attend to word-level frequency counts of collocations to a lesser extent than English speakers. These findings suggest that different mechanisms underlie L1 processing of English and Turkish collocations. The present study also looked at the sensitivity of L1 and advanced L2 speakers to the frequency of adjectives, nouns and whole collocations in English.
Mixed-effects regression modelling revealed that advanced L2 speakers are sensitive to the collocation frequency counts, like L1 English speakers: as the collocation frequency counts increased, L1 Turkish, L2 English speakers responded to the collocations in English more quickly, as L1 English speakers did. The results indicated that both groups showed sensitivity to noun frequency counts, and advanced L2 English speakers did not appear to rely on the noun frequency scores more heavily than the L1 English group while processing adjective-noun collocations. These findings are in conflict with the claims that L2 speakers process MWS differently from L1 speakers (Wray, 2002).

    Thematic Annotation: extracting concepts out of documents

    Contrary to standard approaches to topic annotation, the technique used in this work does not centrally rely on some sort of -- possibly statistical -- keyword extraction. In fact, the proposed annotation algorithm uses a large scale semantic database -- the EDR Electronic Dictionary -- that provides a concept hierarchy based on hyponym and hypernym relations. This concept hierarchy is used to generate a synthetic representation of the document by aggregating the words present in topically homogeneous document segments into a set of concepts best preserving the document's content. This new extraction technique uses an unexplored approach to topic selection. Instead of using semantic similarity measures based on a semantic resource, the latter is processed to extract the part of the conceptual hierarchy relevant to the document content. Then this conceptual hierarchy is searched to extract the most relevant set of concepts to represent the topics discussed in the document. Notice that this algorithm is able to extract generic concepts that are not directly present in the document. Comment: Technical report EPFL/LIA. 81 pages, 16 figures.