14 research outputs found
PSU at CLEF-2020 ARQMath Track: Unsupervised Re-Ranking Using Pretraining
This paper elaborates on our submission to the ARQMath track at CLEF 2020. Our primary run for the main Task-1 (Question Answering) uses a two-stage retrieval technique: the first stage fuses traditional BM25 scoring with tf-idf cosine-similarity retrieval, while the second stage is a finer re-ranking step using contextualized embeddings. For the re-ranking we pre-train a RoBERTa-base model (110 million parameters) to make the language model more math-aware. Our approach achieves a higher nDCG′ score than the baseline, while our MAP and P@10 scores are competitive, performing better than the best submission (MathDowsers) for text and text+formula dependent topics.
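As a rough illustration of the two-stage design this abstract describes, the Python snippet below fuses BM25 with tf-idf cosine scores and then re-ranks the fused candidates with contextualized embeddings. The toy corpus, equal fusion weights, cut-off, and the all-distilroberta-v1 checkpoint are illustrative assumptions, not the authors' actual configuration (they pre-train RoBERTa-base to be math-aware).

```python
# Sketch of a two-stage retrieval pipeline (illustrative, not the authors'
# exact system): stage 1 fuses BM25 with tf-idf cosine similarity, stage 2
# re-ranks the fused candidates with contextualized embeddings.
import numpy as np
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

docs = [
    "To integrate x^2, raise the power and divide: x^3/3 + C.",
    "Use the quadratic formula to solve ax^2 + bx + c = 0.",
    "The derivative of x^2 is 2x by the power rule.",
]
query = "how do I integrate x^2"

# Stage 1: fuse BM25 and tf-idf cosine scores (equal weights assumed here).
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_scores = np.array(bm25.get_scores(query.lower().split()))
tfidf = TfidfVectorizer().fit(docs)
cos_scores = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]
fused = 0.5 * bm25_scores / (bm25_scores.max() + 1e-9) + 0.5 * cos_scores
candidates = fused.argsort()[::-1][:2]  # top-k candidates passed to stage 2

# Stage 2: re-rank candidates with contextualized embeddings. A generic
# checkpoint stands in for the paper's math-aware RoBERTa-base.
encoder = SentenceTransformer("all-distilroberta-v1")
q_emb = encoder.encode([query])
d_emb = encoder.encode([docs[i] for i in candidates])
order = cosine_similarity(q_emb, d_emb)[0].argsort()[::-1]
print([docs[candidates[i]] for i in order])  # final ranked answers
```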
Ranked List Fusion and Re-Ranking With Pre-Trained Transformers for ARQMath Lab
This paper elaborates on our submission to the ARQMath track at CLEF 2021. For our submission this year we use a collection of methods to retrieve and re-rank the answers in Math Stack Exchange, in addition to our two-stage model, which was comparable to the best model last year in terms of nDCG′. We also provide a detailed analysis of what the transformers are learning and why it is hard to train a math language model using transformers. This year's submission to Task-1 includes summarizing long question-answer pairs to augment and index documents, using byte-pair encoding to tokenize formulas and then re-rank them, and finally extracting important keywords from posts. Using an ensemble of these methods, our approach shows a 20% improvement over our ARQMath 2020 Task-1 submission.
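The formula tokenization step mentioned above can be pictured with the HuggingFace tokenizers library: train a byte-pair-encoding vocabulary on LaTeX formula strings, then encode formulas as sub-word sequences for downstream re-ranking. The toy corpus, vocabulary size, and special tokens below are assumptions, not the authors' setup.

```python
# Illustrative sketch (not the authors' exact setup): train a byte-pair
# encoding tokenizer on LaTeX formula strings so that formulas can be
# embedded and re-ranked as token sequences.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

formulas = [r"\frac{a}{b} + c^2", r"\int_0^1 x^2 \, dx", r"x^2 + y^2 = r^2"]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(formulas, trainer)

# Encode an unseen formula into sub-word pieces learned by BPE.
print(tokenizer.encode(r"\int x^2 dx").tokens)
```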
Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks
Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category information can be used for building faceted search for digital library search engines. This can significantly assist users in narrowing down their search space of relevant documents. Unfortunately, many academic papers do not have such information as part of their metadata. Existing methods for solving this task usually focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using 9 million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76 with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
Comment: submitted to "Frontiers Mining Scientific Papers Volume II: Knowledge Discovery and Data Exploitation"
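A minimal PyTorch sketch of the shape this abstract describes: two stacked bi-directional recurrent layers followed by an attention layer and a 104-way classifier. The GRU cells, hidden sizes, and additive attention form are assumptions; the paper only specifies the high-level architecture.

```python
# Minimal sketch of the DANN shape described above: two stacked
# bi-directional GRUs, an additive attention layer, and a 104-way
# softmax classifier over WoS subject categories.
import torch
import torch.nn as nn

class DANN(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128, n_classes=104):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # "two bi-directional recurrent neural networks" (GRU assumed)
        self.rnn = nn.GRU(emb_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # additive attention scores
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):                     # (batch, seq)
        h, _ = self.rnn(self.embed(token_ids))        # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over tokens
        context = (weights * h).sum(dim=1)            # weighted abstract vector
        return self.out(context)                      # logits over 104 classes

# Forward pass on a random batch of 8 abstracts of 200 tokens each.
logits = DANN(vocab_size=50_000)(torch.randint(0, 50_000, (8, 200)))
print(logits.shape)  # torch.Size([8, 104])
```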
Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks
Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves micro-F1 measure of 0.76 with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
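For readers unfamiliar with the two figures quoted above, the snippet below shows, with synthetic labels, how micro-averaged F1 (one aggregate score) and per-category F1 (one score per subject category) are computed with scikit-learn.

```python
# Toy illustration of the two metrics reported above: micro-averaged F1
# aggregates over all predictions, while per-class F1 scores each subject
# category separately. Labels here are synthetic.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2]   # gold subject-category ids
y_pred = [0, 1, 1, 1, 2, 2, 0]   # model predictions

print(f1_score(y_true, y_pred, average="micro"))  # single aggregate score
print(f1_score(y_true, y_pred, average=None))     # one F1 per category
```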
The ACL OCL Corpus: Advancing Open Science in Computational Linguistics
We present a scholarly corpus from the ACL Anthology to assist open scientific research in the Computational Linguistics domain, named ACL OCL. Compared with previous ARC and AAN versions, ACL OCL includes structured full-texts with logical sections, references to figures, and links to a large knowledge resource (Semantic Scholar). ACL OCL contains 74k scientific papers, together with 210k figures extracted up to September 2022. To observe the development of the computational linguistics domain, we detect the topics of all OCL papers with a supervised neural model. We observe that the "Syntax: Tagging, Chunking and Parsing" topic is significantly shrinking while "Natural Language Generation" is resurging. Our dataset is open and available to download from HuggingFace at https://huggingface.co/datasets/ACL-OCL/ACL-OCL-Corpus
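A minimal loading sketch with the HuggingFace datasets library, with the repository id taken from the URL above; the split name and field names are assumptions, so consult the dataset card for the actual schema.

```python
# Minimal sketch for loading the corpus from the HuggingFace hub. The
# "train" split and the field names are assumptions; check the dataset
# card at the URL above for the actual schema.
from datasets import load_dataset

ocl = load_dataset("ACL-OCL/ACL-OCL-Corpus", split="train")
print(ocl[0].keys())  # inspect available fields (title, full text, etc.)
```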
Fighting Fire with Fire: The Dual Role of LLMs in Crafting and Detecting Elusive Disinformation
Recent ubiquity and disruptive impacts of large language models (LLMs) have raised concerns about their potential to be misused (i.e., generating large-scale harmful and misleading content). To combat this emerging risk of LLMs, we propose a novel "Fighting Fire with Fire" (F3) strategy that harnesses modern LLMs' generative and emergent reasoning capabilities to counter human-written and LLM-generated disinformation. First, we leverage GPT-3.5-turbo to synthesize authentic and deceptive LLM-generated content through paraphrase-based and perturbation-based prefix-style prompts, respectively. Second, we apply zero-shot in-context semantic reasoning techniques with cloze-style prompts to discern genuine from deceptive posts and news articles. In our extensive experiments, we observe GPT-3.5-turbo's zero-shot superiority for both in-distribution and out-of-distribution datasets, where GPT-3.5-turbo consistently achieved accuracies of 68-72%, unlike the decline observed in previous customized and fine-tuned disinformation detectors. Our codebase and dataset are available at https://github.com/mickeymst/F3
Comment: Accepted at EMNLP 2023
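The zero-shot, cloze-style detection step can be pictured with the OpenAI chat API as below; the prompt template and answer parsing are illustrative assumptions, not the paper's exact prompts (those are in the linked repository).

```python
# Illustrative sketch of zero-shot, cloze-style disinformation detection
# with GPT-3.5-turbo via the OpenAI chat API. The cloze prompt wording is
# an assumption, not the paper's exact template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify(post: str) -> str:
    cloze = (f'Article: "{post}"\n'
             'Fill in the blank with "true" or "false":\n'
             "The claims in this article are ____.")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": cloze}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(classify("Scientists confirm the moon is made of cheese."))
```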