2,544 research outputs found
Echoes of Persuasion: The Effect of Euphony in Persuasive Communication
While the effect of various lexical, syntactic, semantic and stylistic
features have been addressed in persuasive language from a computational point
of view, the persuasive effect of phonetics has received little attention. By
modeling a notion of euphony and analyzing four datasets comprising persuasive
and non-persuasive sentences in different domains (political speeches, movie
quotes, slogans and tweets), we explore the impact of sounds on different forms
of persuasiveness. We conduct a series of analyses and prediction experiments
within and across datasets. Our results highlight the positive role of phonetic
devices on persuasion
Active learning in annotating micro-blogs dealing with e-reputation
Elections unleash strong political views on Twitter, but what do people
really think about politics? Opinion and trend mining on micro blogs dealing
with politics has recently attracted researchers in several fields including
Information Retrieval and Machine Learning (ML). Since the performance of ML
and Natural Language Processing (NLP) approaches are limited by the amount and
quality of data available, one promising alternative for some tasks is the
automatic propagation of expert annotations. This paper intends to develop a
so-called active learning process for automatically annotating French language
tweets that deal with the image (i.e., representation, web reputation) of
politicians. Our main focus is on the methodology followed to build an original
annotated dataset expressing opinion from two French politicians over time. We
therefore review state of the art NLP-based ML algorithms to automatically
annotate tweets using a manual initiation step as bootstrap. This paper focuses
on key issues about active learning while building a large annotated data set
from noise. This will be introduced by human annotators, abundance of data and
the label distribution across data and entities. In turn, we show that Twitter
characteristics such as the author's name or hashtags can be considered as the
bearing point to not only improve automatic systems for Opinion Mining (OM) and
Topic Classification but also to reduce noise in human annotations. However, a
later thorough analysis shows that reducing noise might induce the loss of
crucial information.Comment: Journal of Interdisciplinary Methodologies and Issues in Science -
Vol 3 - Contextualisation digitale - 201
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
Existing approaches to automatic VerbNet-style verb classification are
heavily dependent on feature engineering and therefore limited to languages
with mature NLP pipelines. In this work, we propose a novel cross-lingual
transfer method for inducing VerbNets for multiple languages. To the best of
our knowledge, this is the first study which demonstrates how the architectures
for learning word embeddings can be applied to this challenging
syntactic-semantic task. Our method uses cross-lingual translation pairs to tie
each of the six target languages into a bilingual vector space with English,
jointly specialising the representations to encode the relational information
from English VerbNet. A standard clustering algorithm is then run on top of the
VerbNet-specialised representations, using vector dimensions as features for
learning verb classes. Our results show that the proposed cross-lingual
transfer approach sets new state-of-the-art verb classification performance
across all six target languages explored in this work.Comment: EMNLP 2017 (long paper
Winsorize tree algorithm for handling outliers in classification problem
Classification and Regression Tree (CART) is designed to predict or classify the objects in the predetermined classes from a set of predictors. However, having outliers could affect the structures of CART, purity and predictive accuracy in classification. Some researchers opt to perform pre-pruning or post-pruning of the CART in handling the outliers. This study proposes a modified classification tree algorithm called Winsorize tree based on the distribution of classes in the training dataset. The Winsorize tree investigates all possible outliers from node to node before checking the potential splitting point to gain the node with the highest purity of the nodes. The
upper fence and lower fence of a boxplot are used to detect potential outliers whose values exceeding the tail of Q ± (1.5×Interquartile range). The identified outliers are neutralized using the Winsorize method whilst the Winsorize Gini index is then used to compute the divergences among probability distributions of the target predictor’s
values until stopping criteria are met. This study uses three stopping rules: node achieved the minimum 10% of total training set
Linking social media, medical literature, and clinical notes using deep learning.
Researchers analyze data, information, and knowledge through many sources, formats, and methods. The dominant data format includes text and images. In the healthcare industry, professionals generate a large quantity of unstructured data. The complexity of this data and the lack of computational power causes delays in analysis. However, with emerging deep learning algorithms and access to computational powers such as graphics processing unit (GPU) and tensor processing units (TPUs), processing text and images is becoming more accessible. Deep learning algorithms achieve remarkable results in natural language processing (NLP) and computer vision. In this study, we focus on NLP in the healthcare industry and collect data not only from electronic medical records (EMRs) but also medical literature and social media. We propose a framework for linking social media, medical literature, and EMRs clinical notes using deep learning algorithms. Connecting data sources requires defining a link between them, and our key is finding concepts in the medical text. The National Library of Medicine (NLM) introduces a Unified Medical Language System (UMLS) and we use this system as the foundation of our own system. We recognize social media’s dynamic nature and apply supervised and semi-supervised methodologies to generate concepts. Named entity recognition (NER) allows efficient extraction of information, or entities, from medical literature, and we extend the model to process the EMRs’ clinical notes via transfer learning. The results include an integrated, end-to-end, web-based system solution that unifies social media, literature, and clinical notes, and improves access to medical knowledge for the public and experts
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Vulnerability Clustering and other Machine Learning Applications of Semantic Vulnerability Embeddings
Cyber-security vulnerabilities are usually published in form of short natural
language descriptions (e.g., in form of MITRE's CVE list) that over time are
further manually enriched with labels such as those defined by the Common
Vulnerability Scoring System (CVSS). In the Vulnerability AI (Analytics and
Intelligence) project, we investigated different types of semantic
vulnerability embeddings based on natural language processing (NLP) techniques
to obtain a concise representation of the vulnerability space. We also
evaluated their use as a foundation for machine learning applications that can
support cyber-security researchers and analysts in risk assessment and other
related activities. The particular applications we explored and briefly
summarize in this report are clustering, classification, and visualization, as
well as a new logic-based approach to evaluate theories about the vulnerability
space.Comment: 27 pages, 13 figure
Recommended from our members
Leveraging the Power of Crowds: Automated Test Report Processing for The Maintenance of Mobile Applications
Crowdsourcing is an emerging distributed problem-solving model combining human and machine computation. It collects intelligence and knowledge from a large and diverse workforce to complete complex tasks. In the software engineering domain, crowdsourced techniques have been adopted to facilitate various tasks, such as design, testing, debugging, development, and so on. Specifically, in crowdsourced testing, crowdsourced workers are given testing tasks to perform and submit their feedback in the form of test reports. One of the key advantages of crowdsourced testing is that it is capable of providing engineers software engineers with domain knowledge and feedback from a large number of real users. Based on diverse software and hardware settings of these users, engineers can bugs that are not caught by traditional quality assurance techniques. Such benefits are particularly ideal for mobile application testing, which needs rapid development-and-deployment iterations and support diverse execution environments. However, crowdsourced testing naturally generates an overwhelming number of crowdsourced test reports, and inspecting such a large number of reports becomes a time-consuming yet inevitable task. This dissertation presents a series of techniques, tools and experiments to assist in crowdsourced report processing. These techniques are designed for improving this task in multiple aspects: 1. prioritizing crowdsourced report to assist engineers in finding as many unique bugs as possible, and as quickly as possible; 2. grouping crowdsourced report to assist engineers in identifying the representative ones in a short time; 3. summarizing the duplicate reports to provide engineers with a concise and accurate understanding of a group of reports; In the first step, I present a text-analysis-based technique to prioritize test reports for manual inspection. This technique leverages two key strategies: (1) a diversity strategy to help developers inspect a wide variety of test reports and to avoid duplicates and wasted effort on falsely classified faulty behavior, and (2) a risk-assessment strategy to help developers identify test reports that may be more likely to be fault-revealing based on past observations.Together, these two strategies form our technique to prioritize test reports in crowdsourced testing. Moreover, in the mobile testing domain, test reports often consist of more screenshots and shorter descriptive text, and thus text-analysis-based techniques may be ineffective or inapplicable. The shortage and ambiguity of natural-language text information and the well-defined screenshots of activity views within mobile applications motivate me to propose a novel technique based on using image understanding for multi-objective test-report prioritization. This technique employs the Spatial Pyramid Matching (SPM) technique to measure the similarity of the screenshots, and apply the natural-language processing technique to measure the distance between the text of test reports. Next, I design and implement CTRAS: a novel approach to leveraging duplicates to enrich the content of bug descriptions and improve the efficiency of inspecting these reports. CTRAS is capable of automatically aggregating duplicates based on both textual information and screenshots, and further summarizes the duplicate test reports into a comprehensive and comprehensible report.I validate all of these techniques on industrial data by collaborating with several companies. The results show my techniques can improve both the efficiency and effectiveness of crowdsourced test report processing. Also, I suggest settings for different usage scenarios and discuss future research directions
- …