
    Cross-Lingual Blog Analysis by Cross-Lingual Comparison of Characteristic Terms and Blog Posts


    Towards Conceptual Indexing of the Blogosphere through Wikipedia Topic Hierarchy

    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

    Neural Graph Transfer Learning in Natural Language Processing Tasks

    Natural language is essential in our daily lives, as we rely on language to communicate and exchange information. A fundamental goal of natural language processing (NLP) is to let machines understand natural language so that they can help or replace human experts in mining knowledge and completing tasks. Many NLP tasks deal with sequential data; for example, a sentence is a sequence of words. Recently, deep learning-based language models such as BERT (Devlin et al., 2018) have achieved significant improvements on many existing tasks, including text classification and natural language inference. However, not all tasks can be formulated with sequence models. Graph-structured data is also fundamental in NLP, arising in entity linking, entity classification, relation extraction, abstract meaning representation, and knowledge graphs (Santoro et al., 2017; Hamilton et al., 2017; Kipf and Welling, 2016). In this setting, BERT-based pretrained models may not be suitable. The Graph Convolutional Network (GCN) (Kipf and Welling, 2016) is a deep neural network model designed for graphs; it has shown great potential in text classification, link prediction, question answering, and other tasks. This dissertation presents novel graph models for NLP tasks, including text classification, prerequisite chain learning, and coreference resolution. We focus on different perspectives of graph convolutional network modeling: for text classification, we propose a novel graph construction method that makes predictions interpretable; for prerequisite chain learning, we propose multiple aggregation functions that exploit neighbors for better information exchange; for coreference resolution, we study how graph pretraining can help when labeled data is limited. An important complementary direction is applying pretrained language models to these tasks, so this dissertation also develops transfer learning methods that generalize pretrained models to other domains, including medical, cross-lingual, and web data. Finally, we propose a new task, unsupervised cross-domain prerequisite chain learning, and study novel graph-based methods to transfer knowledge across graphs.
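    The GCN propagation rule this abstract refers to can be sketched in plain NumPy. The layer below follows Kipf and Welling's symmetrically normalized update H' = ReLU(D^-1/2 (A+I) D^-1/2 H W); the toy graph, features, and weights are illustrative assumptions, not data from the dissertation:

    ```python
    import numpy as np

    def gcn_layer(A, H, W):
        """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
        A_hat = A + np.eye(A.shape[0])            # add self-loops
        d = A_hat.sum(axis=1)                     # node degrees of A_hat
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^-1/2
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
        return np.maximum(0, A_norm @ H @ W)      # aggregate, transform, ReLU

    # toy graph: 3 nodes in a chain, 2 input features, identity weights
    A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    H = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    W = np.eye(2)
    H_next = gcn_layer(A, H, W)
    print(H_next.shape)  # (3, 2)
    ```

    Each node's new representation mixes its own features with those of its neighbors, which is what makes the model suitable for the graph-structured tasks listed above.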

    Online suicide prevention through optimised text classification

    Online communication platforms are increasingly used to express suicidal thoughts. There is considerable interest in monitoring such messages, both for population-wide and individual prevention purposes, and to inform suicide research and policy. Online information overload prohibits manual detection, which is why keyword search methods are typically used. However, these are imprecise and unable to handle implicit references or linguistic noise. As an alternative, this study investigates supervised text classification to model and detect suicidality in Dutch-language forum posts. Genetic algorithms were used to optimise models through feature selection and hyperparameter optimisation. A variety of features was found to be informative, including token and character n-gram bags-of-words, presence of salient suicide-related terms, and features based on LSA topic models and polarity lexicons. The results indicate that text classification is a viable and promising strategy for detecting suicide-related and alarming messages, with F-scores comparable to human annotators (93% for relevant messages, 70% for severe messages). Both types of messages can be detected with high precision and minimal noise, even on large high-skew corpora. This suggests that they would be fit for use in a real-world prevention setting.
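    As a rough sketch of the character n-gram bag-of-words features mentioned above (the function name and toy input are illustrative; the actual system also used token n-grams, lexicon features, and genetic-algorithm feature selection, none of which are shown here):

    ```python
    from collections import Counter

    def char_ngrams(text, n=3):
        """Bag of character n-grams: a surface feature that is robust to
        spelling variation and linguistic noise in forum posts."""
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    feats = char_ngrams("help me", n=3)
    print(feats["hel"])  # 1
    ```

    Character n-grams overlap across word boundaries, which is one reason they tolerate the misspellings and creative orthography common in user-generated content.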

    “You’re trolling because…” – A Corpus-based Study of Perceived Trolling and Motive Attribution in the Comment Threads of Three British Political Blogs

    This paper investigates the linguistically marked motives that participants attribute to those they call trolls in 991 comment threads of three British political blogs. The study is concerned with how these motives affect the discursive construction of trolling and trolls. Another goal of the paper is to examine whether the mainly emotional motives ascribed to trolls in the academic literature correspond with those that the participants attribute to the alleged trolls in the analysed threads. The paper identifies five broad motives ascribed to trolls: emotional/mental health-related/social reasons, financial gain, political beliefs, being employed by a political body, and unspecified political affiliation. It also points out that, depending on these motives, trolling and trolls are constructed in various ways. Finally, the study argues that participants attribute motives to trolls not only to explain their behaviour but also to insult them.

    Sentiment Analysis of Text Guided by Semantics and Structure

    As moods and opinions play a pivotal role in various business and economic processes, keeping track of stakeholders' sentiment can be of crucial importance to decision makers. Today's abundance of user-generated content allows for the automated monitoring of the opinions of many stakeholders, like consumers. One challenge for such automated sentiment analysis systems is to identify whether pieces of natural language text are positive or negative. Typical methods of identifying this polarity involve low-level linguistic analysis. Existing systems predominantly use morphological, lexical, and syntactic cues for polarity, like a text's words, their parts-of-speech, and negation or amplification of the conveyed sentiment. This dissertation argues that the polarity of text can be analysed more accurately when additionally accounting for semantics and structure. Polarity classification performance can benefit from exploiting the interactions that emoticons have on a semantic level with words – emoticons can express, stress, or disambiguate sentiment. Furthermore, semantic relations between and within languages can help identify meaningful cues for sentiment in multi-lingual polarity classification. An even better understanding of a text's conveyed sentiment can be obtained by guiding automated sentiment analysis by the rhetorical structure of the text, or at least of its most sentiment-carrying segments. Thus, the sentiment in, e.g., conclusions can be treated differently from the sentiment in background information. The findings of this dissertation suggest that the polarity of natural language text should not be determined solely based on what is said. Instead, one should account for how this message is conveyed as well.
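    A minimal rule-based sketch of the emoticon-word interaction described above, using hypothetical toy lexicons (the dissertation's actual models are richer and data-driven; this only illustrates how an emoticon can override word-level polarity):

    ```python
    # Toy lexicons: illustrative assumptions, not the dissertation's resources.
    WORD_POLARITY = {"great": 1, "terrible": -1}
    EMOTICON_POLARITY = {":)": 1, ":(": -1}

    def polarity(text):
        """Sum word polarities, but let emoticons take precedence when present,
        since they can disambiguate (or flip) the sentiment of the words."""
        tokens = text.lower().split()
        emo = [EMOTICON_POLARITY[t] for t in tokens if t in EMOTICON_POLARITY]
        if emo:
            return sum(emo)
        return sum(WORD_POLARITY.get(t, 0) for t in tokens)

    print(polarity("great :("))   # -1: the emoticon overrides the word
    print(polarity("great day"))  # 1: no emoticon, words decide
    ```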

    A Technology-Enhanced German Language Course: Effects of Technology Implementation and Cross-Cultural Exchange on Students’ Language Skills, Perceptions and Cultural Awareness

    This study employed a within-group case study design using a mixed methods approach. The researcher used a concurrent triangulation process during a one-semester intermediate German language course. In addition to the textbook, the researcher implemented a Technology to Support German Language Enhancement (TSGLE) intervention. The TSGLE included the following Web 2.0 technologies: blogs, podcasts, online chat, and a wiki, to create an environment of increased asynchronous and synchronous interaction. Additionally, students embarked on a cross-cultural, virtual exchange with university students from Germany, interacting through a blog, a collaborative video conference session, a German film screening, email, and individual video conference sessions. Although certain challenges arose with adapting to technology use and communicating with native speakers, quantitative and qualitative data indicate that regular use of Web 2.0 technologies and participation in a cross-cultural exchange can enhance language acquisition and cultural awareness.

    Mono- and cross-lingual paraphrased text reuse and extrinsic plagiarism detection

    Text reuse is the act of borrowing text (either verbatim or paraphrased) from an earlier written text. It can occur within the same language (mono-lingual) or across languages (cross-lingual), where the reused text is in a different language than the original. Text reuse and its related problem, plagiarism (the unacknowledged reuse of text), are becoming serious issues in many fields, and research shows that paraphrased and especially cross-lingual cases of reuse are much harder to detect. Moreover, the recent rise in readily available multi-lingual content on the Web and social media has increased the problem to an unprecedented scale. To develop, compare, and evaluate automatic methods for mono- and cross-lingual text reuse and extrinsic plagiarism detection (finding the portion(s) of text reused from the original), standard evaluation resources are of utmost importance. However, previous efforts on developing such resources have mostly focused on English and a few other languages. The Urdu language, by contrast, is widely spoken and has a large digital footprint, yet lacks resources in terms of core language processing tools and corpora. With this in mind, this PhD research focuses on developing standard evaluation corpora, methods, and supporting resources to automatically detect mono-lingual (Urdu) and cross-lingual (English-Urdu) cases of text reuse and extrinsic plagiarism. 
    This thesis contributes a mono-lingual (Urdu) text reuse corpus (COUNTER Corpus) that contains real cases of Urdu text reuse at the document level. Another contribution is a mono-lingual (Urdu) extrinsic plagiarism corpus (UPPC Corpus) that contains simulated cases of Urdu paraphrase plagiarism. Evaluation results, obtained by applying a wide range of state-of-the-art mono-lingual methods to both corpora, show that it is easier to detect verbatim cases than paraphrased ones. Moreover, the performance of these methods decreases considerably on real cases of reuse. A couple of supporting resources are also created to assist methods for cross-lingual (English-Urdu) text reuse detection: a large-scale multi-domain English-Urdu parallel corpus (EUPC-20) containing parallel sentences is mined from the Web, and several bi-lingual (English-Urdu) dictionaries are compiled using multiple approaches from different sources. Another major contribution of this study is a large benchmark cross-lingual (English-Urdu) text reuse corpus (TREU Corpus), containing English-to-Urdu real cases of text reuse at the document level. A diversified range of methods is applied to the TREU Corpus to evaluate its usefulness and to show how it can be utilised in developing automatic methods for measuring cross-lingual (English-Urdu) text reuse. A new cross-lingual method is also proposed that uses bilingual word embeddings to estimate the degree of overlap between text documents by computing the maximum weighted cosine similarity between word pairs. The overall low evaluation results indicate that detecting real cases of cross-lingual text reuse is a challenging task, especially when the language pair has unrelated scripts, as English-Urdu does. However, an improvement is observed using a combination of the methods in the experiments. 
    The research undertaken in this PhD thesis contributes corpora, methods, and supporting resources for mono- and cross-lingual text reuse and extrinsic plagiarism detection for the significantly under-resourced Urdu language and English-Urdu language pair. It highlights that paraphrased and cross-lingual, cross-script real cases of text reuse are harder to detect and remain an open issue, and it emphasises the need to develop standard evaluation and supporting resources for under-resourced languages to facilitate research in them. The resources developed and methods proposed could serve as a framework for future research in other languages and language pairs.
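    The cross-lingual similarity idea above can be sketched as follows, assuming word vectors for both documents already live in a shared bilingual embedding space; the toy 2-d vectors are illustrative, and the thesis' exact weighting scheme may differ from the plain average used here:

    ```python
    import numpy as np

    def cosine(u, v):
        """Cosine similarity between two word vectors."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def max_cosine_overlap(doc_a, doc_b):
        """For each word vector in doc_a, take its best cosine match in doc_b,
        then average these maxima into a document-overlap score."""
        return float(np.mean([max(cosine(u, v) for v in doc_b) for u in doc_a]))

    # hypothetical 2-d "bilingual" embeddings for two short documents
    doc_en = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    doc_ur = [np.array([1.0, 0.0]), np.array([0.7, 0.7])]
    score = max_cosine_overlap(doc_en, doc_ur)
    print(round(score, 3))  # 0.854
    ```

    Because each source word only needs one good match on the target side, this kind of score can flag reuse even when word order and sentence structure differ between the documents.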

    SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

    We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages for Subtask A: English, Arabic, Danish, Greek, and Turkish; in addition, English also featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers. (Proceedings of the International Workshop on Semantic Evaluation, SemEval-2020.)