4,235 research outputs found

    Building a Text Collection for Urdu Information Retrieval

    Urdu is a widely spoken language in the Indian subcontinent, with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared to those in European and other Asian languages. Therefore, following Text Retrieval Conference (TREC) standards, we attempted to construct an extensive text collection of 85,304 documents from diverse categories, covering over 52 topics with relevance judgment sets at a pool depth of 100. We also present several applications to demonstrate the effectiveness of our collection. Although this collection is primarily intended for text retrieval, it can also be used for named entity recognition, text summarization, and other linguistic applications with suitable modifications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.

    A survey on sentiment analysis in Urdu: A resource-poor language

    © 2020 Background/introduction: The dawn of the internet opened the doors to the easy and widespread sharing of information on subject matters such as products, services, events and political opinions. While the volume of studies conducted on sentiment analysis is rapidly expanding, these studies mostly address the English language. The primary goal of this study is to present a state-of-the-art survey that identifies the progress and shortcomings of Urdu sentiment analysis and proposes rectifications. Methods: We describe the advancements made thus far in this area by categorising the studies along three dimensions, namely text pre-processing, lexical resources and sentiment classification. The pre-processing operations include word segmentation, text cleaning, spell checking and part-of-speech tagging. An evaluation of sophisticated lexical resources, including corpora and lexicons, was carried out, and investigations were conducted on sentiment analysis constructs such as opinion words, modifiers and negations. Results and conclusions: Performance is reported for each of the reviewed studies. The experimental results and proposals put forward in this paper provide the groundwork for further studies on Urdu sentiment analysis.

    On the use of prepositional verbs by Pakistani ESL learners: A corpus study

    English as a second language (ESL) is widely used in Pakistan, but Pakistani ESL students still face difficulties in writing and speaking English. Errors are a natural part of second language learning, so teachers need to be aware of the causes, effects, and consequences of errors. A common challenge for ESL students is using prepositions correctly. Several studies have been conducted on ESL learners' use of English prepositions, but few on Pakistani ESL learners' errors in using prepositional verbs. Therefore, it is important to investigate the hypothesis that Pakistani ESL learners may make relatively more errors in prepositional verbs than other Asian ESL learners. The main objectives of this study are to assess Pakistani ESL students' proficiency in the use of prepositional verbs, to investigate recurring errors, and to determine the role of transfer and L1 interference in these errors. The study employs a corpus linguistics approach with a mixed method that combines quantitative and qualitative data analysis. Qualitative methods identify and describe individual items as correct or incorrect, whereas quantitative methods group the correct and incorrect prepositional verbs in order to compare Pakistani students with other Asian ESL students. The study focuses on identifying common prepositional verbs and exploring the errors and challenges faced by Pakistani learners. The data come from ICNALE (International Corpus Network of Asian Learners of English), which has collected data from ESL learners in ten Asian regions. Samples of prepositional verbs belonging to 28 lemmas in specified grammatical contexts were selected from the written essays module, which has 5,600 entries by 2,800 participants. A total of 19,027 observations were manually annotated as correct, incorrect, or irrelevant. Of these, 5,106 relevant observations were subjected to a quantitative analysis, which suggests that Pakistani learners make relatively more errors (only 74.62% correct) than the other groups. Lemmas with high error rates among Pakistani learners are discussed, including an analysis of transfer and L1 interference as possible factors. There has been insufficient research on prepositional verbs used by Pakistani ESL learners; hence, this study will help English language teachers to identify these issues and rethink their teaching methods. It contributes to a pedagogical understanding of the challenges of second language acquisition. The study concludes by emphasizing the importance of addressing the challenges of prepositional verbs in ESL classes and provides insights for teachers, students, and translators. Linguistics master's thesis, LING350, MAHF-LIN.

    Urdu intonation

    The current study is an analysis of an Urdu speech corpus using a Tone and Break Indices (ToBI) transcription system to develop a model of Urdu intonation. The analysis indicates that Urdu has three pitch accents (L*, L*+H, H*) and boundary tones associated with two phrase types: accentual phrase (AP) boundaries (Ha, La) and intonational phrase (IP) boundaries (L%, H%, LH%). The AP is a pitch-bearing unit on a single word, or on more than one word in the context of (a) izāfat, (b) conjunctive vāo, (c) case markers, (d) complex postpositions, and (e) complex verbs. Moreover, this study also investigates the tonal structure of declarative, interrogative (wh-questions, yes/no-questions), and imperative (semi-honorific, polite honorific) sentences in a neutral focus context using 50 utterances produced by ten speakers. Results indicate that (i) all declarative sentences consist of a series of APs, represented as (aL) L* (H) Ha, except the sentence-final AP, represented as (H*) L%; (ii) wh-questions differ from their corresponding declaratives in terms of pitch range and the final boundary tone; and (iii) imperatives differ from their corresponding declaratives in terms of the final boundary tone.

    THE IDENTIFICATION OF FORMULAIC SEQUENCES IN URDU LANGUAGE AND THEIR PEDAGOGICAL IMPLICATION FOR SLA (ESL/USL)

    In this study, an effort has been made to explore formulaicity in the Urdu language and its pedagogical implications for second language acquisition, both for learners of English as a second language and for learners of Urdu as a second language. It is believed that formulaic sequences, or prefabs, make up more than fifty percent of a language. These formulaic sequences are of various kinds, encompassing idioms, proverbs, collocations and, sometimes, simple fillers. For the current study, data will be collected from two widely circulated Urdu newspapers. The data will consist of lexical chunks or formulas, which will be identified on the basis of eleven criteria proposed by Wray and Namba (2003). To maintain inter-rater reliability, the data will be shared with an Urdu language expert. After identification, the formulaic sequences will be classified into six classes. Results of the pilot study show that there is formulaicity in the Urdu language: like many other languages, Urdu is replete with almost all kinds of formulaic sequences.

    UPPC - Urdu Paraphrase Plagiarism Corpus

    Paraphrase plagiarism is a significant and widespread problem, and research shows that it is hard to detect. Several methods and automatic systems have been proposed to deal with it. However, evaluation and comparison of such solutions is not possible because of the unavailability of benchmark corpora with manual examples of paraphrase plagiarism. To deal with this issue, we present the novel development of a paraphrase plagiarism corpus containing simulated (manually created) examples in the Urdu language - a language widely spoken around the world. This resource is the first of its kind developed for the Urdu language, and we believe that it will be a valuable contribution to the evaluation of paraphrase plagiarism detection systems.

    Bilingual Lexicography: Some Issues with Modern English Urdu Lexicography – a User's Perspective

    The tradition of bilingual lexicography in the Indian subcontinent is more than two centuries old and goes back as far as 1772 (Hadley). This article examines the development of bilingual lexicography in the Indian subcontinent with special reference to English-Hindustani or English-Urdu dictionary development. It further explores some issues specific to this field and suggests some solutions. First, it describes the historical perspective of linguistic work in the subcontinent, and then it discusses issues relating to English-Urdu bilingual lexicography in particular.