Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT
Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum input length of 512 tokens, which makes them non-trivial to apply in practical settings with long inputs. Various complex methods claim to overcome this limit, but recent research questions their efficacy across different classification tasks: these complex architectures, evaluated on carefully curated long-text datasets, perform at par with or worse than simple baselines.
In this work, we propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows any pretrained BERT model to be finetuned for inference on arbitrarily long text. The proposed method is based on chunking token representations and CNN layers, making it compatible with any pre-trained BERT. We evaluate ChunkBERT exclusively on a benchmark for comparing long-text classification models across a variety of tasks (including binary, multi-class, and multi-label classification). A BERT model finetuned with the ChunkBERT method performs consistently across the long samples in the benchmark while using only a fraction (6.25%) of the original memory footprint. These findings suggest that efficient finetuning and inference can be achieved through simple modifications to pre-trained BERT models.
Comment: 11 pages, 6 figures, submitted to NeurIPS 2
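The chunking step behind a ChunkBERT-style pipeline can be sketched as follows. This is an illustrative example only: the chunk size, overlap, and special-token handling are assumptions, not the paper's exact configuration.

```python
# Sketch of the chunking idea behind ChunkBERT-style long-text models:
# a token sequence longer than BERT's 512-token limit is split into
# fixed-size chunks (here with optional overlap), each of which can be
# encoded independently before a CNN aggregates the chunk representations.

def chunk_tokens(token_ids, chunk_size=512, overlap=64):
    """Split token_ids into chunks of at most chunk_size tokens, with
    `overlap` tokens shared between consecutive chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break
    return chunks

# A 1300-token document becomes three chunks: two full 512-token chunks
# (overlapping by 64 tokens) and a shorter tail.
doc = list(range(1300))
chunks = chunk_tokens(doc)
```

Each chunk can then be passed through the frozen or finetuned encoder separately, keeping per-forward-pass memory bounded regardless of document length.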
MPTopic: Improving topic modeling via Masked Permuted pre-training
Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently come to the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of the derived topic clusters. To illustrate, Top2Vec designates the centroids of its clustering results to represent topics, whereas BERTopic harnesses c-TF-IDF for its topic extraction. In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assessing the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Through comprehensive evaluation, it is evident that the topic keywords identified by the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec.
Comment: 12 pages, will submit to ECIR 202
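The abstract names TF-RDF but does not give its formula; the exact definition is in the paper. The sketch below is one plausible reading, assuming a term's score in a document is its term frequency weighted by the fraction of documents in the collection that contain the term (its relative document frequency).

```python
# Assumed TF-RDF scoring: tf(t, d) * (df(t) / N). This is a hypothetical
# instantiation for illustration, not the paper's verified formula.

from collections import Counter

def tf_rdf(docs):
    """docs: list of tokenized documents (lists of terms).
    Returns one dict per document mapping term -> assumed TF-RDF score."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({
            t: (tf[t] / total) * (doc_freq[t] / n_docs)
            for t in tf
        })
    return scores

docs = [["topic", "model", "topic"], ["model", "cluster"], ["cluster", "topic"]]
scores = tf_rdf(docs)
```

Unlike TF-IDF, which penalizes terms that appear in many documents, this reading rewards them, which matches the intuition of surfacing keywords shared across a topic cluster.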
Early Detection of Rumor Veracity in Social Media
Rumor spread has become a significant issue in online social networks (OSNs). To mitigate and limit the spread of rumors and their detrimental effects, analyzing, detecting, and better understanding rumor dynamics is required. One of the critical steps in studying rumor spread is to identify the level of a rumor's truthfulness at an early stage. Understanding and identifying the level of rumor truthfulness helps prevent its viral spread and minimizes the damage a rumor may cause. In this research, we aim to debunk rumors by analyzing, visualizing, and classifying the level of rumor truthfulness from a large number of users that actively engage in rumor spread. First, we create a dataset of rumors that belong to one of five categories: False, Mostly False, True, Mostly True, and Half True. This dataset provides intrinsic characteristics of a rumor: topics, users' sentiment, and network structural and content features. Second, we analyze and visualize the characteristics of each rumor category to better understand its features. Third, using theories from social science and psychology, we build a feature set to classify those rumors and identify their truthfulness. The evaluation results on our new dataset show that the approach can effectively detect the truth of rumors as early as seven days. The proposed approach could serve as a valuable tool for existing fact-checking websites, such as Snopes.com or Politifact.com, to automatically detect the veracity of rumors at an early stage and help OSN users reach well-informed decisions.
Using Molecular Embeddings in QSAR Modeling: Does it Make a Difference?
With the consolidation of deep learning in drug discovery, several novel
algorithms for learning molecular representations have been proposed. Despite
the interest of the community in developing new methods for learning molecular
embeddings and their theoretical benefits, comparing molecular embeddings with
each other and with traditional representations is not straightforward, which
in turn hinders the process of choosing a suitable representation for QSAR
modeling. A reason behind this issue is the difficulty of conducting a fair and
thorough comparison of the different existing embedding approaches, which
requires numerous experiments on various datasets and training scenarios. To
close this gap, we reviewed the literature on methods for molecular embeddings
and reproduced three unsupervised and two supervised molecular embedding
techniques recently proposed in the literature. We compared these five methods
concerning their performance in QSAR scenarios using different classification
and regression datasets. We also compared these representations to traditional molecular representations, namely molecular descriptors and fingerprints. Contrary to the expected outcome, our experimental setup, consisting of over 25,000 trained models and statistical tests, revealed that the predictive performance of molecular embeddings did not significantly surpass that of traditional representations. While supervised embeddings yielded results competitive with traditional molecular representations,
unsupervised embeddings tended to perform worse than traditional
representations. Our results highlight the need for a careful comparison and analysis of the different embedding techniques prior to using them in drug design tasks, and motivate a discussion about the potential of molecular embeddings in computer-aided drug design.
Causal graph extraction from news: a comparative study of time-series causality learning techniques
Causal graph extraction from news has the potential to aid in the understanding of complex scenarios. In particular, it can help explain and predict events, as well as conjecture about possible cause-effect connections. However, limited work has addressed the problem of large-scale extraction of causal graphs from news articles. This article presents a novel framework for extracting causal graphs from digital text media. The framework relies on topic-relevant variables representing terms and ongoing events that are selected from a domain under analysis by applying specially developed information retrieval and natural language processing methods. Events are represented as event-phrase embeddings, which make it possible to group similar events into semantically cohesive clusters. A time series of the selected variables is given as input to causal structure learning techniques to learn a causal graph associated with the topic being examined. The complete framework is applied to the New York Times dataset, which covers news over a period of 246 months (roughly 20 years), and is illustrated through a case study. An initial evaluation based on synthetic data is carried out to gain insight into the most effective time-series causality learning techniques. This evaluation comprises a systematic analysis of nine state-of-the-art causal structure learning techniques and two novel ensemble methods derived from the most effective ones. Subsequently, the complete framework, based on the most promising causal structure learning technique, is evaluated with domain experts in a real-world scenario through the presented case study. The proposed analysis offers valuable insights into the problems of identifying topic-relevant variables from large volumes of news and learning causal graphs from time series.
Authors: Mariano Maisonnave (CONICET - Universidad Nacional del Sur; Dalhousie University), Fernando Andrés Delbianco (CONICET - Universidad Nacional del Sur), Fernando Abel Tohmé (CONICET - Universidad Nacional del Sur), Evangelos Milios (Dalhousie University), Ana Gabriela Maguitman (CONICET - Universidad Nacional del Sur).
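The step of grouping event-phrase embeddings into semantically cohesive clusters can be illustrated with a minimal greedy scheme. The paper's actual clustering algorithm and similarity threshold are not given in the abstract; the leader-clustering approach and the 0.8 threshold below are assumptions for illustration only.

```python
# Hypothetical sketch: greedy leader clustering of event-phrase
# embeddings by cosine similarity. Each embedding joins the first
# existing cluster whose leader is similar enough, else founds a
# new cluster.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def leader_cluster(embeddings, threshold=0.8):
    """Return a cluster label for each embedding."""
    leaders, labels = [], []
    for emb in embeddings:
        for idx, leader in enumerate(leaders):
            if cosine(emb, leader) >= threshold:
                labels.append(idx)
                break
        else:  # no leader was similar enough: start a new cluster
            leaders.append(emb)
            labels.append(len(leaders) - 1)
    return labels

# Two pairs of near-duplicate event embeddings yield two clusters.
events = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = leader_cluster(events)
```

The cluster-level time series (e.g. counts of events per cluster per month) would then serve as the variables fed to the causal structure learning step.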
GGNN@Causal News Corpus 2022: Gated graph neural networks for causal event classification from social-political news articles
The discovery of causality mentions in text is a core cognitive concept and appears in many natural language processing (NLP) applications. In this paper, we study the task of Event Causality Identification (ECI) from social-political news. The aim of the task is to detect causal relationships between event mention pairs in text. Although deep learning models have recently achieved state-of-the-art performance on many NLP tasks and applications, most of them still fail to capture the rich semantic and syntactic structures within sentences that are key for causality classification. We present a solution for causal event detection from social-political news that captures semantic and syntactic information based on gated graph neural networks (GGNN) and contextualized language embeddings. Experimental results show that our proposed method outperforms the baseline model, BERT (Bidirectional Encoder Representations from Transformers), in terms of F1-score and accuracy.
Visual analysis of interactive document clustering streams
Interactive clustering techniques play a key role by putting the user in the clustering loop, allowing them to interact with document group abstractions instead of full-length documents and to focus on corpus exploration as an incremental task. To explore the incremental aspect of information discovery, this article proposes a visual component to depict clustering membership changes throughout a clustering iteration loop on both static and dynamic data sets. The visual component is evaluated with an expert user and through an experiment with data streams.
Assessing Causality Structures learned from Digital Text Media
In this paper we describe a framework to uncover potential causal relations between event mentions in streaming text from news media. The framework relies on a dataset of manually labeled events to train a recurrent neural network for event detection. It then creates a time series of event clusters, where clusters are based on BERT contextual word embedding representations of the identified events. Using this time-series dataset, we assess four methods based on Granger causality for inferring causal relations. Granger causality is a statistical concept of causality based on forecasting: a cause occurs before its effect, and the cause produces unique changes in the effect, so past values of the cause help predict future values of the effect. The four analyzed methods are the pairwise Granger test, VAR(1), BigVar, and SiMoNe. The framework is applied to the New York Times dataset, which covers news over a period of 246 months. This preliminary analysis delivers important insights into the nature of each method, identifies differences and commonalities, and points out some of their strengths and weaknesses.
Authors: Mariano Maisonnave (CONICET - Universidad Nacional del Sur), Fernando Andrés Delbianco (CONICET - Universidad Nacional del Sur), Fernando Abel Tohmé (CONICET - Universidad Nacional del Sur), Ana Gabriela Maguitman (CONICET - Universidad Nacional del Sur), Evangelos E. Milios (Dalhousie University, Faculty of Computer Science).
Published at DocEng '20: ACM Symposium on Document Engineering 2020, New York, United States, Association for Computing Machinery.
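The pairwise Granger test mentioned above can be sketched in a few lines. This is a minimal lag-1 version for illustration; the paper's actual lag orders and test implementation may differ. A variable x "Granger-causes" y when adding lagged x to an autoregression of y significantly reduces the residual sum of squares, measured by an F-statistic.

```python
# Minimal pairwise Granger test at lag 1, built on ordinary least
# squares solved via normal equations (fine for these tiny systems).

import random

def _ols_rss(X, y):
    """Residual sum of squares of an OLS fit (Gaussian elimination)."""
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):  # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= f * A[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):  # back substitution
        beta[i] = (A[i][k] - sum(A[i][j] * beta[j]
                                 for j in range(i + 1, k))) / A[i][i]
    return sum((yi - sum(b * xij for b, xij in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def granger_f(x, y):
    """F-statistic for 'lagged x helps predict y' at lag 1."""
    n = len(y) - 1  # number of usable observations
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    target = y[1:]
    rss_r = _ols_rss(restricted, target)
    rss_u = _ols_rss(unrestricted, target)
    return (rss_r - rss_u) / (rss_u / (n - 3))

# Synthetic check: x drives y with a one-step delay, so the F-statistic
# for x -> y should dwarf the one for y -> x.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.0]
for t in range(1, 500):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1))
```

In practice one would use an existing implementation (e.g. statsmodels' `grangercausalitytests`) and compare the F-statistic against the appropriate F-distribution quantile rather than inspecting raw values.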