9 research outputs found
SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
This paper introduces the SAMSum Corpus, a new dataset with abstractive
dialogue summaries. We investigate the challenges it poses for automated
summarization by testing several models and comparing their results with those
obtained on a corpus of news articles. We show that model-generated summaries
of dialogues achieve higher ROUGE scores than the model-generated summaries of
news -- in contrast with human evaluators' judgement. This suggests that the
challenging task of abstractive dialogue summarization requires dedicated
models and non-standard quality measures. To our knowledge, our study is the
first attempt to introduce a high-quality chat-dialogue corpus, manually
annotated with abstractive summaries, which can be used by the research
community for further studies.
Comment: Attachment contains the described dataset archived in 7z format.
Please see the attached readme and licence. Update of the previous version:
changed formats of train/val/test files in corpus.7
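Since the paper's headline comparison rests on ROUGE, a minimal sketch of ROUGE-1 F1 (unigram overlap between a generated and a reference summary) shows what the metric rewards. This is a toy illustration with invented example strings; real evaluations use the full ROUGE suite (ROUGE-2, ROUGE-L) with proper tokenization and stemming.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each shared word counts at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("john met mary for coffee",
                      "john and mary met for coffee"), 2))  # → 0.91
```

Because ROUGE counts only surface overlap, a dialogue summary can score well while still misattributing who did what, which is consistent with the paper's finding that ROUGE disagrees with human judgement.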
Detecting Deceptive Utterances Using Deep Pre-Trained Neural Networks
Lying is an integral part of everyday communication in both written and oral forms. Detecting lies is therefore essential and has many possible applications. Our study investigates the performance of automated lie-detection methods, namely the most recent breed of pre-trained transformer neural networks capable of processing the Polish language. We used a dataset of nearly 1500 true and false statements, half of which were transcripts and the other half written statements, originating from possibly the largest study of deception in the Polish language. Technically, the problem was posed as text classification. We found that models perform better on typed than on spoken utterances. The best-performing model achieved an accuracy of 0.69, much higher than the human average of 0.56. For transcribed utterances, human performance was 0.58 and the models reached 0.62. We also explored model interpretability based on integrated gradients to shed light on classifier decisions. Our observations highlight the role of first words and phrases in model decisions, but more work is needed to systematically explore the observed patterns.
Czy komputer rozpozna hejtera? Wykorzystanie uczenia maszynowego (ML) w jakościowej analizie danych
The aim of this article is to present the process of automating the coding of texts from social media. Implementing this process allows a quantitative treatment of qualitative content-analysis methods: analyses can be run on corpora of hundreds of thousands of texts, coded according to their meaning. This is made possible by machine learning (ML) algorithms. We present the coding method using the example of a project annotating “hate speech” in texts from Polish online forums. The key issue is the precise conceptualization and operationalization of this category, which allows a detailed coding manual to be prepared and the coding team to be trained; the result is a higher inter-coder agreement rate. The annotated texts will be used as training data for automatic categorization methods based on ML algorithms. We then describe the automatic coding methods applied. The article also identifies problems associated with the automatic coding of hate speech and proposes solutions. We conclude by pointing out the factors that are crucial for a research process that uses machine learning.
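The inter-coder agreement rate discussed above is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch follows; the two-label rater lists are invented examples, not the project's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum(ca[label] * cb[label] for label in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

coder_1 = ["hate", "ok", "ok", "hate", "ok", "ok"]
coder_2 = ["hate", "ok", "hate", "hate", "ok", "ok"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # raw agreement 5/6, kappa → 0.67
```

Kappa near 1 indicates agreement well above chance, which is why a detailed coding manual and coder training, as described in the abstract, matter before annotations are used as ML training data.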
Social language in autism spectrum disorder: A computational analysis of sentiment and linguistic abstraction.
Individuals with autism spectrum disorder (ASD) demonstrate impairments in pragmatic (social) language, including narrative skills and conversational abilities. We aimed to quantitatively characterize narrative performance in ASD using natural language processing techniques: sentiment analysis and language abstraction analysis based on the Linguistic Category Model. Individuals with ASD and peers with typical development, matched for age, gender, ethnicity, and verbal and nonverbal intelligence quotients, produced language samples during two standardized tasks from the Autism Diagnostic Observation Schedule, Second Edition assessment: Telling a Story from a Book and Description of a Picture. Only the narratives produced during the Book Task differed between the ASD and control groups in emotional polarity and language abstraction. Participants with typical development used words with positive sentiment more often than individuals with ASD. For words with negative sentiment, the differences were marginally significant (participants with typical development used words with negative sentiment more often). The Book Task narratives of individuals with ASD were also characterized by a lower level of language abstraction than narratives of peers with typical development. Linguistic abstraction was strongly positively correlated with a higher number of words with emotional polarity. Neither linguistic abstraction nor emotional polarity correlated with participants' age or verbal and nonverbal IQ. The results support the promise of sentiment and language abstraction analyses as a useful tool for the quantitative, fully automated assessment of narrative abilities among individuals with ASD.
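The sentiment side of such an analysis can be sketched as a lexicon lookup over narrative tokens. The word lists below are tiny invented placeholders; the study itself would rely on a validated sentiment lexicon, and the language abstraction measure would additionally require Linguistic Category Model coding of verb and adjective classes.

```python
# Toy polarity lexicons; placeholders, not the study's actual word lists.
POSITIVE = {"happy", "fun", "nice", "love", "good"}
NEGATIVE = {"sad", "angry", "bad", "scared", "hate"}

def polarity_counts(narrative: str) -> tuple[int, int]:
    """Count positive- and negative-polarity words in a language sample."""
    tokens = narrative.lower().split()
    return (sum(t in POSITIVE for t in tokens),
            sum(t in NEGATIVE for t in tokens))

print(polarity_counts("the boy was happy and the dog was fun but then sad"))
# → (2, 1)
```

Normalizing such counts by sample length would allow comparison across participants who produce narratives of different lengths.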
Can a Computer Recognize Hate Speech? Machine Learning (ML) in Qualitative Data Analysis
Spiral of hatred: social effects in Internet auctions. Between informativity and emotion
“Pi of the Sky” Detector
The “Pi of the Sky” experiment has been designed for continuous observations of
a large part of the sky, in search of astrophysical phenomena characterized
by short timescales, especially prompt optical counterparts of Gamma-Ray
Bursts (GRBs). Other scientific goals include searching for novae and
supernovae and monitoring blazar and AGN activity. “Pi of the Sky” is a fully
autonomous, robotic detector which can operate for long periods of time
without human supervision. A crucial element of the detector is advanced
software for real-time data analysis and identification of short optical
transients. The most important result so far has been an independent detection
and observation of the prompt optical emission of the “naked-eye” GRB 080319B.