154 research outputs found

    Classifying Homographs in Japanese Social Media Texts Using a User Interest Model

    Abstract: The analysis of text data from social media is hampered by irrelevant noisy data, such as homographs. Noisy data is not usable and makes analysis of the target data, such as counting estimates, difficult, which adversely affects the quality of the analysis results. We focus on this issue and propose a method to classify homographs contained in social media texts (i.e. Twitter) using topic models. We also report the results of an evaluation experiment, in which the proposed method showed an accuracy improvement of 8.5% and a reduction of 16.5% in the misidentification rate compared with conventional methods.
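    The abstract does not spell out the user-interest topic model itself, so the following is only a minimal sketch of the general approach it describes: infer a topic distribution for each tweet with LDA (here via gensim) and assign the homograph to the sense whose reference topic profile is most similar. All data, names, and sense labels below are invented for illustration.

```python
# Hypothetical sketch of topic-model-based homograph classification:
# train LDA on tweets, build a reference topic profile per sense from a
# few labelled examples, then assign new tweets to the closest sense.
import numpy as np
from gensim import corpora, models

# Toy tokenised tweets (real input would be preprocessed Japanese text).
tweets = [
    ["bank", "river", "water", "fishing"],
    ["bank", "money", "account", "loan"],
    ["river", "water", "flood", "rain"],
    ["money", "loan", "interest", "account"],
]
dictionary = corpora.Dictionary(tweets)
bow = [dictionary.doc2bow(t) for t in tweets]
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(tokens):
    """Dense topic distribution for one tokenised tweet."""
    dist = lda.get_document_topics(dictionary.doc2bow(tokens),
                                   minimum_probability=0.0)
    return np.array([p for _, p in dist])

# Reference profiles per sense, here built from single labelled examples.
sense_profiles = {
    "bank/river": topic_vector(["bank", "river", "water"]),
    "bank/finance": topic_vector(["bank", "money", "loan"]),
}

def classify(tokens):
    """Pick the sense whose reference profile is closest (cosine)."""
    v = topic_vector(tokens)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(sense_profiles, key=lambda s: cos(v, sense_profiles[s]))

print(classify(["bank", "fishing", "river"]))  # expected: "bank/river"
```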

    Classification of colloquial Arabic tweets in real-time to detect high-risk floods

    Twitter has eased real-time information flow for decision makers and is also one of the key enablers of open-source intelligence (OSINT). Tweet mining has recently been used in the context of incident response to estimate the location and damage caused by hurricanes and earthquakes. We aim to detect a specific type of high-risk natural disaster that frequently occurs and causes casualties in the Arabian Peninsula, namely `floods', and we investigate how to achieve accurate classification of short, informal (colloquial) Arabic text (as typically used on Twitter), which is highly inconsistent and has received very little attention in this field. First, we provide a thorough technical demonstration consisting of the following stages: data collection (Twitter REST API), labelling, text pre-processing, data division and representation, and model training, all implemented in `R' in our experiment. We then evaluate classifier performance via four experiments conducted to measure the impact of different stemming techniques on the following classifiers: SVM, J48, C5.0, NNET, NB and k-NN. The dataset used consisted of 1434 tweets in total. Our findings show that the Support Vector Machine (SVM) was prominent in terms of accuracy (F1 = 0.933). Furthermore, applying McNemar's test shows that using SVM without stemming on colloquial Arabic is significantly better than using stemming techniques.
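    The experiments above were run in `R', and the paper's exact pipeline is not reproduced here; as a hedged sketch of the same pipeline shape (TF-IDF features over unstemmed text feeding a linear SVM), here is a hypothetical scikit-learn version, with an invented two-tweet dataset standing in for the 1434 labelled tweets.

```python
# Hypothetical sketch of the flood-tweet classification pipeline:
# TF-IDF features over raw (unstemmed) text feeding a linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented examples; the study used colloquial Arabic tweets.
texts = ["heavy rain flooded the valley road",
         "sunny and calm in the city today"]
labels = [1, 0]  # 1 = flood-related, 0 = not flood-related

clf = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # no stemming
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["flash flood warning near the wadi"]))  # -> [1]
```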

    The Lexicon Graph Model: a generic model for multimodal lexicon development

    Trippel T. The Lexicon Graph Model: a generic model for multimodal lexicon development. Bielefeld (Germany): Bielefeld University; 2006.
    The Lexicon Graph Model provides a model and framework for lexicons that can be corpus based and contain multimodal information. The focus is on the lexicon theory perspective, looking at the underlying data structures of both lexicons and annotations; the latter come into view because they serve as the basis for lexicon construction. The term lexicon is used here in its most generic sense, covering traditional print dictionaries, CD-ROM editions, Web-based versions of the same, and computerized lexicon databases integrated into applications such as human-machine communication systems and spell checkers. Existing formalisms and approaches to lexicon development expose several problems with lexicons, for example combining different lexical resources into one, disambiguating ambiguities on different lexical levels, representing other modalities in a lexicon, and selecting the lexical key term for lexicon entries. The Lexicon Graph Model presupposes that lexicons differ in content but share a fundamentally similar structure, making it possible to combine heterogeneous lexicons free of duplicates in a unification process; the result is a declarative lexicon. The underlying data structure is a graph, the Lexicon Graph, which is modelled analogously to the Annotation Graphs described by Bird and Liberman and can therefore be processed in a similar way. The investigation of the lexicon formalism proceeds in four steps: the analysis of existing lexicons, the introduction of the Lexicon Graph Model as a generic representation for lexicons, the implementation and testing of the formalism in different contexts, and an evaluation of the formalism. It is shown that Annotation Graphs and Lexicon Graphs are related not only in their formalism, and which standards annotations must meet to be usable for lexicon development.
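    The thesis's formalism is not reproduced here, but the core idea, that lexicons sharing one underlying graph structure can be unified free of duplicates, can be illustrated with a deliberately simplified sketch in which a lexicon is a set of labelled edges and unification is set union.

```python
# Rough illustration (not the thesis's formalism): lexicon entries as
# labelled edges of a graph, so two lexicons can be unified by taking
# the duplicate-free union of their edge sets.
LexiconGraph = set  # edges: (source_node, relation, target_node)

lexicon_a: LexiconGraph = {
    ("walk", "part-of-speech", "verb"),
    ("walk", "pronunciation", "/wɔːk/"),
}
lexicon_b: LexiconGraph = {
    ("walk", "part-of-speech", "verb"),  # duplicate of an edge in A
    ("walk", "translation-de", "gehen"),
}

unified = lexicon_a | lexicon_b  # set union drops the duplicate edge
for edge in sorted(unified):
    print(edge)
```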

    AN ANALYSIS OF THE EFL SECONDARY READING CURRICULUM IN MALAYSIA: APPROACHES TO READING AND PREPARATION FOR HIGHER EDUCATION

    This case study examined the overarching approaches to second language (L2) reading instruction reflected in the Malaysian EFL secondary curriculum and how well this curriculum prepares students for tertiary reading in EFL. The Malaysian context was chosen because it places a high value on EFL instruction and shares many similarities with other English as a Foreign Language (EFL) countries in terms of EFL reading issues at the tertiary level. The research questions for this study were: What types of reading tasks are reflected in the Malaysian EFL secondary reading curriculum? What types and lengths of reading passages are used in the Malaysian Form Five English language textbook? What levels of cognitive demand do the reading tasks in the Malaysian EFL secondary reading curriculum reflect? What types of learner roles are reflected in the Malaysian EFL secondary reading curriculum? This exploratory study used document review as the primary data collection and analysis method. The Malaysian EFL Secondary Curriculum and the EFL secondary textbook were analyzed using a revision of Richards and Rodgers's (2001) framework for analyzing EFL teaching. The findings indicate that the Malaysian EFL secondary reading curriculum frequently uses reading as an explicit skill to achieve the listed learning outcomes in the EFL Secondary Curriculum. Nonetheless, the curriculum is developed based on the cognitive information-processing theory of SLA and the top-down theory of L2 reading, reflecting non-interactive whole-language instruction as well as learner roles that primarily take the form of individual tasks. The passage analysis shows that the EFL textbook primarily uses narrative passages, with the majority of passages below grade-level length. The curriculum, however, emphasizes reading tasks that require high cognitive demand as well as important types of reading tasks.

    Mining Twitter for crisis management: realtime floods detection in the Arabian Peninsula

    A thesis submitted to the University of Bedfordshire, in partial fulfilment of the requirements for the degree of Doctor of Philosophy. In recent years, large amounts of data have been made available on microblog platforms such as Twitter; however, it is difficult to filter and extract information and knowledge from such data because of its high volume and the noise it contains. On Twitter, the general public can report real-world events such as floods in real time, acting as social sensors. Consequently, it is beneficial to have a method that can detect flood events automatically in real time, helping governmental authorities, such as crisis management authorities, to detect events and make decisions during their early stages. This thesis proposes a real-time flood detection system that mines Arabic tweets using machine learning and data mining techniques. The proposed system comprises six main components: data collection, pre-processing, flood event extraction, location inference, location named-entity linking, and flood event visualisation. An effective method of flood detection from Arabic tweets is presented and evaluated using supervised learning techniques. Furthermore, this work presents a location named-entity inference method based on the Learning to Search approach; the results show that the proposed method outperformed existing systems, with significantly higher accuracy in inferring flood locations from tweets written in colloquial Arabic. For location named-entity linking, a method has been designed that utilises Google API services as a knowledge base to extract accurate geocode coordinates associated with location named entities mentioned in tweets. The results show that the proposed linking method locates 56.8% of tweets within a distance of 0 – 10 km from the actual location. Further analysis has shown that the accuracy of locating tweets in the actual city and region is 78.9% and 84.2% respectively.
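    The thesis names Google API services as the knowledge base for the location linking component; a minimal sketch of such a lookup against the public Google Geocoding endpoint might look as follows (the API key is a placeholder, and production code would need quota and error handling).

```python
# Minimal sketch of resolving a location named entity to coordinates
# via the Google Geocoding API; the key below is a placeholder.
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode(place_name: str, api_key: str):
    """Return (lat, lng) for a place name, or None if not found."""
    resp = requests.get(GEOCODE_URL,
                        params={"address": place_name, "key": api_key})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]

# Example (requires a valid key):
# print(geocode("Riyadh, Saudi Arabia", api_key="YOUR_API_KEY"))
```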

    Special Libraries, December 1975

    Volume 66, Issue 12

    A pragmatic guide to geoparsing evaluation

    Abstract: Empirical methods in geoparsing have thus far lacked a standard evaluation framework describing the task, metrics and data used to compare state-of-the-art systems. Evaluation is made further inconsistent, even unrepresentative of real-world usage, by the lack of distinction between the different types of toponyms, which necessitates new guidelines, a consolidation of metrics and a detailed toponym taxonomy, with implications for Named Entity Recognition (NER) and beyond. To address these deficiencies, our manuscript introduces a new framework in three parts. (Part 1) Task definition: clarified via corpus-linguistic analysis, proposing a fine-grained Pragmatic Taxonomy of Toponyms. (Part 2) Metrics: discussed and reviewed for a rigorous evaluation, including recommendations for NER/geoparsing practitioners. (Part 3) Evaluation data: shared via a new dataset called GeoWebNews, to provide test/train examples and enable immediate use of our contributions. In addition to fine-grained geotagging and toponym resolution (geocoding), this dataset is also suitable for prototyping and evaluating machine-learning NLP models.
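    The paper's consolidated metrics are not reproduced here; as an illustration of one widely used toponym-resolution measure, the sketch below scores predicted against gold coordinates with haversine error distance and accuracy within 161 km (100 miles). The coordinate pairs are invented.

```python
# Illustrative toponym-resolution scoring (not the paper's exact
# metrics): haversine distance between predicted and gold coordinates,
# plus the commonly used accuracy@161km.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

pred = [(51.5074, -0.1278), (40.4168, -3.7038)]  # predicted coordinates
gold = [(51.5072, -0.1276), (48.8566, 2.3522)]   # gold coordinates

errors = [haversine_km(p[0], p[1], g[0], g[1])
          for p, g in zip(pred, gold)]
acc_161 = sum(e <= 161 for e in errors) / len(errors)
print(f"mean error: {sum(errors)/len(errors):.1f} km, "
      f"accuracy@161km: {acc_161:.2f}")
```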

    On the effects of English elements in German print advertisements

    This thesis studies the influence of English elements in German print advertisements on the emotional appeal of the advertisement, the evaluation of the advertised product and brand, and the evaluation of the implied target group. Four specially designed print advertisements, which differed only in their use of English elements, were evaluated by 297 participants in a quantitative online study. Statistically significant differences between the evaluations of the German advertisement versions and the English-German mixed versions were found in only a few cases. Since each participant saw only one version of an advertisement and the linguistic background of the study was disguised, the results mirror the effects of English elements in real contact situations. This advertising-effect study was preceded by a study on language assignment, which tested which variables influence whether a visually presented stimulus word is perceived as English or German. Next to the etymological origin of a word, its integration into the German lexicon (operationalised by consulting the Duden Universalwörterbuch, 7th ed.) proved to be a good predictor. Moreover, graphemic markers of foreignness significantly influenced the language to which lexemes were assigned. This effect was observed both for words of English origin and for words of non-English origin (e.g. LINEAL, CREMIG), which emphasises the importance of visual word form for language assignment.