242 research outputs found

    Mapping (Dis-)Information Flow about the MH17 Plane Crash

    Digital media enables not only fast sharing of information, but also disinformation. One prominent case of an event leading to circulation of disinformation on social media is the MH17 plane crash. Studies analysing the spread of information about this event on Twitter have focused on small, manually annotated datasets, or used proxies for data annotation. In this work, we examine to what extent text classifiers can be used to label data for subsequent content analysis; in particular, we focus on predicting pro-Russian and pro-Ukrainian Twitter content related to the MH17 plane crash. Even though we find that a neural classifier improves over a hashtag-based baseline, labeling pro-Russian and pro-Ukrainian content with high precision remains a challenging problem. We provide an error analysis underlining the difficulty of the task and identify factors that might help improve classification in future work. Finally, we show how the classifier can facilitate the annotation task for human annotators.

    Event Detection and Tracking Detection of Dangerous Events on Social Media

    Online social media platforms have become essential tools for communication and information exchange in our lives. They are used for connecting with people and sharing information. This phenomenon has been intensively studied in the past decade to investigate users’ sentiments for different scenarios and purposes. As the technology advanced and its popularity increased, different terms came to be used for similar topics, which often results in confusion. We study such trends and intend to propose a uniform solution that deals with the subject clearly. We gather all these ambiguous terms under the umbrella of the most recent and popular terms to reach a concise verdict. Many events have been addressed in recent works that cover only specific types and domains of events. For the sake of keeping things simple and practical, the events that are extreme, negative, and dangerous are grouped under the name Dangerous Events (DE). These dangerous events are further divided into three main categories of action-based, scenario-based, and sentiments-based dangerous events to specify their characteristics. We then propose deep-learning-based models to detect events that are dangerous in nature. The deep-learning models, which include BERT, RoBERTa, and XLNet, provide valuable results that can effectively help solve the issue of detecting dangerous events using various dimensions. Even though the models perform well, their performance is constrained by the scarcity of event datasets and the lower quality of some event data, a limitation that can be tackled by addressing these issues.
Online social media platforms have become essential tools for communication, connecting with others, and exchanging information in our lives. This phenomenon has been studied intensively over the past decade to investigate users' sentiments in different scenarios and for various purposes.
However, the use of social media has become more complex and a broader phenomenon owing to the involvement of multiple stakeholders, such as companies, groups, and other organizations. As the technology advanced and its popularity grew, the use of different terms for similar topics created confusion. In other words, models are trained on information from specific terms and scopes. Standardization is therefore imperative. The goal of this work is to unite the different terms in use under broader, standardized terms. Danger can be a threat such as social violence, natural disasters, intellectual or community harm, contagion, social unrest, economic loss, or simply the spread of hateful and violent ideologies. We study these different events and classify them into topics so that a topic-based detection technique can be designed and integrated under the term Dangerous Events (DE). Consequently, we define the proposed term “Dangerous Events” and divide it into three main categories to specify their characteristics, namely Dangerous Events, higher-level Dangerous Events, and lower-level Dangerous Events. The MAVEN dataset was used to derive the datasets for the experiment. These datasets are filtered manually by event type to separate dangerous events from general events. The transformer models BERT, RoBERTa, and XLNet were used to classify text data according to the respective Dangerous Event category. The results showed that BERT outperforms the other models and can be used effectively for the Dangerous Event detection task. Notably, the approach of splitting the datasets significantly increased the models' performance.
Several methods have been proposed for event detection. Event detection (ED) methods are mostly classified as supervised or unsupervised. Supervised methods include support vector machine (SVM), conditional random field (CRF), decision tree (DT), and naive Bayes (NB), among others, while the unsupervised category includes query-based, statistical-based, probabilistic-based, clustering-based, and graph-based methods. Two approaches are in use in event detection, known as document-pivot and feature-pivot. The difference between these approaches lies mostly in the clustering approach, the way documents are used to characterize vectors, and the similarity metric used to decide whether two documents correspond to the same event. Beyond event detection, event prediction is an important but complicated problem that spans several dimensions. Many of these events are difficult to predict before they become visible and occur. For example, it is impossible to anticipate natural catastrophes, which are detectable only after they happen. There is a limited number of resources in terms of event datasets. ACE 2005, MAVEN, and EVIN are some examples of datasets available for event detection. Recent work has shown that Transformer-based pre-trained models (PTMs) can achieve state-of-the-art performance on various NLP tasks. These models are pre-trained on large amounts of text. They learn embeddings, or vector representations, for the words of a language such that related words cluster together in the vector space. Three different transformers, namely BERT, RoBERTa, and XLNet, are used to conduct the experiment and draw conclusions by comparing these models.
The Transformer-based models are fine-tuned using a 70/30 split of the datasets for training and testing/validation. The hyperparameter settings are 10 epochs, a batch size of 16, and the AdamW optimizer with a learning rate of 2e-5 for BERT and RoBERTa and 3e-5 for XLNet. For dangerous events, BERT achieves 60%, RoBERTa 59%, and XLNet only 54% overall accuracy. In the other experimental setup, on higher-level events, BERT and XLNet achieve 71% and 70%, while RoBERTa outperforms the other models with 74% accuracy. For action-based DE, scenario-based DE, and sentiment-based DE, BERT achieves 62%, 85%, and 81% respectively; RoBERTa 61%, 83%, and 71%; and XLNet 52%, 81%, and 77% accuracy. There is a need to resolve the ambiguity among different works that address similar problems using different terms. The proposed idea of referring to specific events as dangerous events makes the problem easier to approach. However, the scarcity of event datasets limits the performance of the models and progress on detection tasks. The availability of more data related to dangerous events could improve the performance of the existing models. It is evident that deep-learning models such as BERT, RoBERTa, and XLNet can help detect and classify dangerous events efficiently. Overall, BERT outperforms RoBERTa and XLNet in detecting dangerous events. It is equally important to track events after they are detected.
Therefore, for future work, we propose implementing techniques that handle space and time in order to monitor how events emerge over time.
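The fine-tuning settings reported in this abstract (70/30 split, 10 epochs, batch size 16, AdamW, learning rates of 2e-5 and 3e-5) can be written down as a small configuration sketch. The class name `FineTuneConfig`, the function `split_70_30`, the shuffling step, and the fixed seed are illustrative assumptions and do not come from the thesis. The sketch only records the settings and shows the split; it does not train any model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the settings quoted in the abstract."""
    model_name: str
    learning_rate: float
    epochs: int = 10
    batch_size: int = 16
    optimizer: str = "AdamW"
    train_fraction: float = 0.7


# Learning rates as reported: 2e-5 for BERT and RoBERTa, 3e-5 for XLNet.
CONFIGS = {
    "BERT": FineTuneConfig("BERT", 2e-5),
    "RoBERTa": FineTuneConfig("RoBERTa", 2e-5),
    "XLNet": FineTuneConfig("XLNet", 3e-5),
}


def split_70_30(examples, train_fraction=0.7, seed=0):
    """Shuffle and split examples into a training part and a held-out part.

    Shuffling with a fixed seed is an assumption made for reproducibility;
    the abstract does not say how the split was drawn.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


if __name__ == "__main__":
    train, held_out = split_70_30(range(10))
    print(len(train), len(held_out))
    print(CONFIGS["XLNet"])
```

In an actual fine-tuning run these values would be passed to whatever training framework is used; the abstract does not name one.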

    Credibility analysis of textual claims with explainable evidence

    Despite being a vast resource of valuable information, the Web has been polluted by the spread of false claims. Increasing hoaxes, fake news, and misleading information on the Web have given rise to many fact-checking websites that manually assess these doubtful claims. However, the rapid speed and large scale of misinformation spread have become the bottleneck for manual verification. This calls for credibility assessment tools that can automate this verification process. Prior works in this domain make strong assumptions about the structure of the claims and the communities where they are made. Most importantly, black-box techniques proposed in prior works lack the ability to explain why a certain statement is deemed credible or not. To address these limitations, this dissertation proposes a general framework for automated credibility assessment that does not make any assumption about the structure or origin of the claims. Specifically, we propose a feature-based model, which automatically retrieves relevant articles about the given claim and assesses its credibility by capturing the mutual interaction between the language style of the relevant articles, their stance towards the claim, and the trustworthiness of the underlying web sources. We further enhance our credibility assessment approach and propose a neural-network-based model. Unlike the feature-based model, this model does not rely on feature engineering and external lexicons. Both our models make their assessments interpretable by extracting explainable evidence from judiciously selected web sources. We utilize our models and develop a Web interface, CredEye, which enables users to automatically assess the credibility of a textual claim and dissect the assessment by browsing through judiciously and automatically selected evidence snippets.
In addition, we study the problem of stance classification and propose a neural-network-based model for predicting the stance of diverse user perspectives regarding controversial claims. Given a controversial claim and a user comment, our stance classification model predicts whether the user comment supports or opposes the claim.

    An NLP Analysis of Health Advice Giving in the Medical Research Literature

    Health advice – clinical and policy recommendations – plays a vital role in guiding medical practices and public health policies. Whether or not authors should give health advice in medical research publications is a controversial issue. The proponents of actionable research advocate for the more efficient and effective transmission of scientific evidence into practice. The opponents are concerned about the quality of health advice in individual research papers, especially in observational studies. Arguments both for and against giving advice in individual studies indicate a strong need for identifying and accessing health advice, for either practical use or quality evaluation purposes. However, current information services do not support the direct retrieval of health advice. Nor, in contrast to other natural language processing (NLP) applications, has health advice been computationally modeled as a language construct. A new information service for directly accessing health advice should be able to reduce information barriers and to provide external assessment in science communication. This dissertation work built an annotated corpus of scientific claims that distinguishes health advice according to its occurrence and strength. The study developed NLP-based prediction models to identify health advice in the PubMed literature. Using the annotated corpus and prediction models, the study answered research questions regarding the practice of advice giving in medical research literature. To test and demonstrate the potential use of the prediction model, it was used to retrieve health advice regarding the use of hydroxychloroquine (HCQ) as a treatment for COVID-19 from LitCovid, a large COVID-19 research literature database curated by the National Institutes of Health. An evaluation of sentences extracted from both abstracts and discussions showed that BERT-based pre-trained language models performed well at detecting health advice.
The health advice prediction model may be combined with existing health information service systems to provide more convenient navigation of a large volume of health literature. Findings from the study also show researchers are careful not to give advice solely in abstracts. They also tend to give weaker and non-specific advice in abstracts than in discussions. In addition, the study found that health advice has appeared consistently in the abstracts of observational studies over the past 25 years. In the sample, 41.2% of the studies offered health advice in their conclusions, which is lower than earlier estimations based on analyses of much smaller samples processed manually. In the abstracts of observational studies, journals with a lower impact are more likely to give health advice than those with a higher impact, suggesting the significance of the role of journals as gatekeepers of science communication. For the communities of natural language processing, information science, and public health, this work advances knowledge of the automated recognition of health advice in scientific literature. The corpus and code developed for the study have been made publicly available to facilitate future efforts in health advice retrieval and analysis. Furthermore, this study discusses the ways in which researchers give health advice in medical research articles, knowledge of which could be an essential step towards curbing potential exaggeration in the current global science communication. It also contributes to ongoing discussions of the integrity of scientific output. This study calls for caution in advice-giving in medical research literature, especially in abstracts alone. It also calls for open access to medical research publications, so that health researchers and practitioners can fully review the advice in scientific outputs and its implications. 
More evaluative strategies that can increase the overall quality of health advice in research articles are needed by journal editors and reviewers, given their gatekeeping role in science communication.

    Utility-Preserving Anonymization of Textual Documents

    Every day, people post a significant amount of data on the Internet, such as tweets, reviews, photos, and videos. Organizations collecting these types of data use them to extract information in order to improve their services or for commercial purposes. Yet, if the collected data contain sensitive personal information, they cannot be shared with third parties or released publicly without consent or adequate protection of the data subjects. Privacy-preserving mechanisms provide ways to sanitize data so that identities and/or confidential attributes are not disclosed. A great variety of mechanisms have been proposed to anonymize structured databases with numerical and categorical attributes; however, automatically protecting unstructured textual data has received much less attention. In general, textual data anonymization requires first detecting pieces of text that may disclose sensitive information and then masking those pieces via suppression or generalization. In this work, we leverage several technologies to anonymize textual documents. We first improve state-of-the-art techniques based on sequence labeling. After that, we extend them to align them more closely with the notion of disclosure risk and with privacy requirements. Finally, we propose a complete framework based on word embedding models that captures a broader notion of data protection and provides flexible protection driven by privacy requirements. We also leverage ontologies to preserve the utility of the masked text, that is, its semantics and readability. Extensive experimental results show that our methods outperform the state of the art by providing more robust anonymization while reasonably preserving the utility of the protected outcome.
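The two-step scheme described in this abstract, detecting sensitive pieces of text and then masking them by suppression or generalization, can be illustrated with a toy sketch. It is not the thesis's method: detection here is a plain dictionary lookup standing in for the sequence-labeling and word-embedding models, and the term lists, the `GENERALIZATIONS` mapping (a stand-in for an ontology), and the `[REDACTED]` marker are all hypothetical.

```python
import re

# Hypothetical toy "ontology": sensitive term -> more general concept.
GENERALIZATIONS = {
    "barcelona": "a city",
    "diabetes": "a chronic disease",
}

# Hypothetical terms with no safe generalization; these are suppressed.
SUPPRESS = {"john smith"}


def anonymize(text):
    """Detect listed sensitive terms, then generalize or suppress them."""
    # Longest terms first so multi-word terms win over shorter overlaps.
    terms = sorted(set(GENERALIZATIONS) | SUPPRESS, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    def mask(match):
        # Generalize when a broader concept is known, otherwise suppress.
        return GENERALIZATIONS.get(match.group(0).lower(), "[REDACTED]")

    return pattern.sub(mask, text)


if __name__ == "__main__":
    print(anonymize("John Smith from Barcelona has diabetes."))
```

Generalization keeps more of the text readable than suppression does, which is the utility-preservation trade-off the abstract refers to.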

    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques for which, often, digital traces in the form of textual data are available. Cyber Threat Intelligence (CTI) covers all the solutions for data collection, processing, and analysis that help in understanding a threat actor's targets and attack behavior. Currently, CTI is assuming an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, Relation Extraction from cybersecurity data, CTI sharing and collaboration, and security threats of CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand the state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed