
    Extracting Mathematical Concepts with Large Language Models

    We extract mathematical concepts from mathematical text using generative large language models (LLMs) like ChatGPT, contributing to the field of automatic term extraction (ATE) and mathematical text processing, and also to the study of LLMs themselves. Our work builds on that of others in that we aim for automatic extraction of terms (keywords) in one mathematical field, category theory, using as a corpus the 755 abstracts from a snapshot of the online journal "Theory and Applications of Categories", circa 2020. Where our study diverges from previous work is in (1) providing a more thorough analysis of what makes mathematical term extraction a difficult problem to begin with; (2) paying close attention to inter-annotator disagreements; (3) providing a set of guidelines which both human and machine annotators could use to standardize the extraction process; (4) introducing a new annotation tool to help humans with ATE, applicable to any mathematical field and even beyond mathematics; (5) using prompts to ChatGPT as part of the extraction process, and proposing best practices for such prompts; and (6) raising the question of whether ChatGPT could be used as an annotator on the same level as human experts. Our overall finding is that mathematical ATE is an interesting problem which can benefit from participation by LLMs, but LLMs themselves cannot at this time surpass human performance on it.
    Comment: 13 pages, 4 figures, presented to the 14th MathUI Workshop 202
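
    A minimal sketch of the kind of extraction step described above, i.e. prompting a chat LLM to list the mathematical terms found in an abstract. The model name, prompt wording, and use of the `openai` Python client are illustrative assumptions, not the authors' actual setup.

    ```python
    # Sketch: prompting a chat LLM to extract mathematical terms from an abstract.
    # Prompt wording and model name are illustrative, not the paper's protocol.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def extract_terms(abstract: str) -> list[str]:
        prompt = (
            "Extract the mathematical terms (keywords) from the following "
            "category-theory abstract. Return one term per line.\n\n" + abstract
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        text = response.choices[0].message.content
        return [line.strip() for line in text.splitlines() if line.strip()]
    ```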

    The High Throughput Sequence Annotation Service (HT-SAS) – the shortcut from sequence to true Medline words

    Background: Advances in high-throughput technologies available to modern biology have created an increasing flood of experimentally determined facts. Ordering, managing and describing these raw results is the first step which allows facts to become knowledge. Currently there are limited ways to automatically annotate such data, especially utilizing information deposited in the published literature.

    Results: To aid researchers in describing results from high-throughput experiments we developed HT-SAS, a web service for automatic annotation of proteins using general English words. For each protein, a pool of Medline abstracts connected to homologous proteins is gathered using the UniProt-Medline link. Overrepresented words are detected using a binomial statistics approximation. We tested our automatic approach with a protein test set from SGD to determine the accuracy and usefulness of our approach. We also applied the automatic annotation service to improve annotations of proteins from Plasmodium berghei expressed exclusively during the blood stage.

    Conclusion: Using HT-SAS we created new, or enriched already established, annotations for over 20% of proteins from Plasmodium berghei expressed in the blood stage deposited in PlasmoDB. Our tests show that this approach to information extraction provides highly specific keywords, often even when the number of abstracts is limited. Our service should be useful for manual curators, as a complement to manually curated information sources, and for researchers working with protein datasets, especially from poorly characterized organisms.
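
    A rough sketch of the overrepresentation idea described in the Results: compare a word's count in the abstracts linked to a protein against its background frequency with a binomial tail probability. The exact statistic and thresholds used by HT-SAS may differ.

    ```python
    # Sketch of detecting overrepresented words with a binomial tail test.
    from scipy.stats import binom

    def overrepresentation_pvalue(count_in_set: int, words_in_set: int,
                                  background_freq: float) -> float:
        """P(X >= count_in_set) under Binomial(words_in_set, background_freq)."""
        return binom.sf(count_in_set - 1, words_in_set, background_freq)

    # Example: a word seen 12 times among 2,000 tokens, background frequency 0.1%.
    print(overrepresentation_pvalue(12, 2000, 0.001))  # small p-value -> keyword
    ```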

    Keywords given by authors of scientific articles in database descriptors

    This paper analyses the keywords given by authors of scientific articles and the descriptors assigned to the articles in order to ascertain the presence of the keywords in the descriptors. 640 records from the INSPEC, CAB Abstracts, ISTA and LISA databases were consulted. After detailed comparisons it was found that keywords provided by authors have an important presence in the database descriptors studied, since nearly 25% of all the keywords appeared in exactly the same form as descriptors, and another 21%, once normalized, were still detected among the descriptors. This means that almost 46% of keywords appear in the descriptors, either as such or after normalization. In addition, three distinct indexing policies emerge: one represented by INSPEC and LISA, where indexers seem free to assign the descriptors they deem necessary; another represented by CAB, where no record has fewer than four descriptors and, in general, a large number of descriptors is employed; and, in contrast, ISTA, which follows a certain institutional code of economy in indexing, since 84% of its records contain only four descriptors.
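
    A small sketch of the comparison logic behind these figures: counting author keywords that match a record's descriptors exactly versus only after normalization. The normalization used here (lowercasing plus naive plural stripping) is an illustrative assumption, not the study's actual procedure.

    ```python
    # Sketch: exact vs. normalized matches between author keywords and descriptors.
    def normalize(term: str) -> str:
        term = term.lower().strip()
        return term[:-1] if term.endswith("s") else term  # naive plural stripping

    def match_counts(keywords: list[str], descriptors: list[str]) -> tuple[int, int]:
        exact = sum(1 for k in keywords if k in descriptors)
        norm_desc = {normalize(d) for d in descriptors}
        normalized_only = sum(
            1 for k in keywords if k not in descriptors and normalize(k) in norm_desc
        )
        return exact, normalized_only

    # Example record: one exact match, one match only after normalization.
    print(match_counts(["text mining", "Databases"], ["text mining", "database"]))
    ```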

    Generating indicative-informative summaries with SumUM

    We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of these topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts. It is a first step in exploring the issue of dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated the indicativeness, informativeness, and text acceptability of the automatic summaries. The results thus far indicate good performance when compared with other summarization technologies.

    Analyzing the Semantic Relatedness of Paper Abstracts: An Application to the Educational Research Field

    Each domain, along with its knowledge base, changes over time, and every timeframe is centered on specific topics that emerge from different ongoing research projects. As searching for relevant resources is a time-consuming process, the automatic extraction of the most important and relevant articles from a domain becomes essential in supporting researchers in their day-to-day activities. The proposed analysis extends previous research focused on extracting co-citations between papers, with the purpose of comparing their overall importance within the domain from a semantic perspective. Our method focuses on the semantic analysis of paper abstracts using Natural Language Processing (NLP) techniques such as Latent Semantic Analysis, Latent Dirichlet Allocation, and ontology-based distances (e.g., WordNet). Moreover, the defined mechanisms are applied to two subdomains drawn from the corpora generated around the keywords "e-learning" and "computer". Graph visual representations are used to highlight the keywords of each subdomain, links among concepts and between articles, as well as specific document similarity views and scores reflecting the keyword-abstract overlaps. In the end, conclusions and future improvements are presented, emphasizing the key elements of our research support framework.
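
    As a concrete illustration of one of the measures mentioned above, the sketch below computes LSA-based relatedness between abstracts using TF-IDF followed by truncated SVD. The toy corpus, component count, and preprocessing are assumptions for illustration only.

    ```python
    # Sketch: Latent Semantic Analysis similarity between paper abstracts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    abstracts = [
        "Adaptive e-learning platforms personalize content for each learner.",
        "We evaluate intelligent tutoring systems in computer science courses.",
        "A new sorting algorithm with improved cache behaviour is presented.",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(abstracts)
    lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
    print(cosine_similarity(lsa))  # pairwise abstract-to-abstract relatedness
    ```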

    Template Mining for Information Extraction from Digital Documents


    Is automatic detection of hidden knowledge an anomaly?

    Background: The quantity of documents being published requires researchers to specialize in ever narrower fields, meaning that inferable connections between publications (particularly from different domains) can be missed. This has given rise to automatic literature-based discovery (LBD). However, unless heavily filtered, LBD generates more potential new knowledge than can be manually verified, and another form of selection is required before the results can be passed on to a user. Since a large proportion of the automatically generated hidden knowledge is valid but generally known, we investigate the hypothesis that non-trivial, interesting hidden knowledge can be treated as an anomaly and identified using anomaly detection approaches.

    Results: Two experiments are conducted: (1) to avoid errors arising from incorrect extraction of relations, the hypothesis is validated using manually annotated relations appearing in a thesaurus, and (2) automatically extracted relations are used to investigate the hypothesis on publication abstracts. These allow an investigation of a potential upper bound and the detection of limitations introduced by automatic relation extraction.

    Conclusion: We apply one-class SVM and isolation forest anomaly detection algorithms to a set of hidden connections, ranking connections by identifying outlying (interesting) ones, and show that the approach increases the F1 measure by a factor of 10 while greatly reducing the quantity of hidden knowledge to verify manually. We also demonstrate the statistical significance of this result.

    Keywords: literature-based discovery; anomaly detection; Unified Medical Language System
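
    A brief sketch of the ranking step described in the Conclusion: fit one-class SVM and isolation forest models on connection features and surface the most outlying candidates first. The two-dimensional feature vectors here are purely illustrative stand-ins for whatever features the study derived from hidden connections.

    ```python
    # Sketch: ranking candidate hidden connections by anomaly scores.
    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    known = rng.normal(0.0, 1.0, size=(200, 2))  # mostly "generally known" links
    candidates = np.vstack([known[:50], [[4.0, 4.0], [5.0, -4.5]]])  # two outliers

    ocsvm_scores = OneClassSVM(nu=0.05).fit(known).decision_function(candidates)
    iforest_scores = IsolationForest(random_state=0).fit(known).score_samples(candidates)

    # Lower scores = more anomalous; rank candidates by either score, ascending.
    ranking = np.argsort(iforest_scores)
    print(ranking[:5])  # indices of the most anomalous (potentially interesting) links
    ```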