
    SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

    This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than model-generated summaries of news -- in contrast with human evaluators' judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality corpus of chat dialogues, manually annotated with abstractive summaries, which can be used by the research community for further studies.
    Comment: the attachment contains the described dataset archived in 7z format. Please see the attached readme and licence. Update of the previous version: changed formats of the train/val/test files in corpus.7
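The ROUGE comparison described above can be illustrated with a minimal, dependency-free sketch of ROUGE-1 F1 (unigram overlap between a reference and a candidate summary). Real evaluations use the full ROUGE toolkit with stemming and multiple variants; the example summaries below are made up, not taken from the corpus.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall,
    with matches clipped to the reference counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Made-up reference/candidate pair for illustration only.
score = rouge1_f1("john is late for the meeting", "john will be late")
```

As the paper argues, a high unigram-overlap score like this can diverge from human judgements of summary quality, which is why dedicated measures are called for.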

    Detecting Deceptive Utterances Using Deep Pre-Trained Neural Networks

    Lying is an integral part of everyday communication in both written and oral forms. Detecting lies is therefore essential and has many possible applications. Our study investigates the performance of automated lie-detection methods, namely the most recent breed of pre-trained transformer neural networks capable of processing the Polish language. We used a dataset of nearly 1500 true and false statements, half of which were transcripts and the other half written statements, originating from possibly the largest study of deception in the Polish language. Technically, the problem was posed as text classification. We found that models perform better on typed than on spoken utterances. The best-performing model achieved an accuracy of 0.69, much higher than the average human performance of 0.56. For transcribed utterances, human performance was 0.58 and the models reached 0.62. We also explored model interpretability based on integrated gradients to shed light on classifier decisions. Our observations highlight the role of first words and phrases in model decisions, but more work is needed to systematically explore the observed patterns.
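Integrated gradients, the attribution method mentioned above, can be sketched without any deep-learning framework: attributions are computed by integrating the gradient along the straight-line path from a baseline input to the actual input. The `score` function below is a made-up stand-in for a classifier score, not the paper's transformer, and gradients are estimated by central finite differences.

```python
def integrated_gradients(f, x, baseline, steps=100, eps=1e-6):
    """Approximate integrated gradients of scalar function f at x relative
    to a baseline: a midpoint Riemann sum over the straight-line path, with
    central finite differences standing in for exact gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            hi, lo = point[:], point[:]
            hi[i] += eps
            lo[i] -= eps
            avg_grad[i] += (f(hi) - f(lo)) / (2 * eps) / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

# Toy "classifier score" (assumption for illustration only).
def score(v):
    return v[0] * 2.0 + v[1] ** 2

x, baseline = [1.0, 2.0], [0.0, 0.0]
attr = integrated_gradients(score, x, baseline)
# Completeness axiom: the attributions sum to score(x) - score(baseline).
```

The completeness property is what makes the method attractive for inspecting which input features (here, which words of an utterance, after embedding) drive a classifier's decision.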

    Can a Computer Recognize Hate Speech? Machine Learning (ML) in Qualitative Data Analysis

    The purpose of this article is to present the process of automatic coding of texts from social media. Implementing this process allows qualitative content-analysis methods to be treated quantitatively: analyses can be run on corpora of hundreds of thousands of texts, coded according to their meaning. This is made possible by machine learning (ML) algorithms. The coding method is presented using the example of a project tagging "hate speech" in texts from Polish online forums. The key issue is the precise conceptualization and operationalization of the category "hate speech," which allows detailed coding instructions to be prepared and the coding team to be trained; the result is a higher rate of inter-coder agreement. The tagged texts will be used as training data for automated categorization methods based on ML algorithms. We then describe the automatic coding methods applied, discuss problems associated with automatic coding of hate speech, and propose solutions. In summary, we point out the factors that are crucial for a research process that uses machine learning.
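The inter-coder agreement the authors aim to raise is commonly quantified with a chance-corrected coefficient such as Cohen's kappa. A minimal sketch (the two coders' label sequences below are made-up examples):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement between two coders,
    corrected for the agreement expected by chance."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of four forum posts by two coders.
kappa = cohens_kappa(["hate", "ok", "ok", "hate"],
                     ["hate", "ok", "hate", "hate"])
```

Training the coding team against detailed instructions, as described above, pushes kappa upward, which in turn yields cleaner training data for the ML categorization step.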

    Social language in autism spectrum disorder: A computational analysis of sentiment and linguistic abstraction.

    Individuals with autism spectrum disorder (ASD) demonstrate impairments with pragmatic (social) language, including narrative skills and conversational abilities. We aimed to quantitatively characterize narrative performance in ASD using natural language processing techniques: sentiment and language-abstraction analyses based on the Linguistic Category Model. Individuals with ASD and with typical development, matched for age, gender, ethnicity, and verbal and nonverbal intelligence quotients, produced language samples during two standardized tasks from the Autism Diagnostic Observation Schedule, Second Edition assessment: Telling a Story from a Book and Description of a Picture. Only the narratives produced during the Book Task differed between the ASD and control groups in terms of emotional polarity and language abstraction. Participants with typical development used words with positive sentiment more often than individuals with ASD. For words with negative sentiment, the differences were marginally significant (participants with typical development used words with negative sentiment more often). The Book Task narratives of individuals with ASD were also characterized by a lower level of language abstraction than narratives of peers with typical development. Linguistic abstraction was strongly positively correlated with the number of words with emotional polarity. Neither linguistic abstraction nor emotional polarity correlated with participants' age or verbal and nonverbal IQ. The results support the promise of sentiment and language-abstraction analyses as a useful tool for the quantitative, fully automated assessment of narrative abilities among individuals with ASD.
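The sentiment side of such an analysis boils down, at its simplest, to counting lexicon hits in a narrative. The tiny word lists and function below are assumptions for illustration; real analyses use full, validated sentiment lexicons and the Linguistic Category Model for abstraction.

```python
# Hypothetical mini-lexicon (illustration only; not the study's lexicon).
POSITIVE = {"happy", "nice", "fun", "love", "good"}
NEGATIVE = {"sad", "angry", "bad", "scary", "hate"}

def emotional_polarity(narrative: str):
    """Count positive and negative lexicon hits in a narrative sample."""
    tokens = narrative.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos, neg
```

Counts like these, normalized by narrative length and compared across groups, are what yield the positive- and negative-sentiment differences reported above.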


    “Pi of the Sky” Detector

    The “Pi of the Sky” experiment has been designed for continuous observations of a large part of the sky, in search of astrophysical phenomena characterized by short timescales, especially prompt optical counterparts of Gamma Ray Bursts (GRBs). Other scientific goals include searching for novae and supernovae and monitoring the activity of blazars and AGNs. “Pi of the Sky” is a fully autonomous, robotic detector that can operate for long periods of time without human supervision. A crucial element of the detector is advanced software for real-time data analysis and identification of short optical transients. The most important result so far has been an independent detection and observation of the prompt optical emission of the “naked-eye” GRB080319B.
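The core idea behind real-time identification of short optical transients can be sketched as frame differencing: flag pixels that brighten sharply between two co-registered exposures of the same field. This is a heavily simplified stand-in; the actual pipeline must also reject hot pixels, cosmic rays, satellites, and known variable stars against catalogs.

```python
def detect_transients(previous, current, threshold):
    """Return (x, y) positions of pixels that brightened by more than
    `threshold` between two co-registered frames (toy sketch only)."""
    candidates = []
    for y, (row_prev, row_cur) in enumerate(zip(previous, current)):
        for x, (p, c) in enumerate(zip(row_prev, row_cur)):
            if c - p > threshold:
                candidates.append((x, y))
    return candidates

# Two toy 2x3 "frames" of pixel brightness values.
prev_frame = [[10, 10, 10], [10, 10, 10]]
cur_frame = [[10, 80, 10], [10, 10, 12]]
flashes = detect_transients(prev_frame, cur_frame, threshold=50)
```

A new bright source appearing between frames, surviving such filtering across consecutive exposures, is exactly the signature of a prompt optical counterpart like that of GRB080319B.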