Search CORE

12,166 research outputs found

Classification of protein interaction sentences via gaussian processes

Author: A. Aizerman
A.M. Cohen
C.D. Manning
C.D. Manning
C.E. Rasmussen
C.H. Ding
D.D. Lewis
E.M. Marcotte
H. Chen
J. Huang
J.C. Platt
J.D. Kim
J.H. Albert
K. Crammer
K. Sugiyama
K.M.A. Chai
M. Girolami
M. Girolami
N. Lama
N. Lawrence
R. Bunescu
S. Rogers
S.S. Keerthi
Silva
T. Joachims
V. Vapnik
W. Chu
W. Chu
Y. Hao
Y. Lee
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

The increase in the availability of protein interaction studies in textual format coupled with the demand for easier access to the key results has lead to a need for text mining solutions. In the text processing pipeline, classification is a key step for extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier which has rarely been applied to text, the Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and na\"ive Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of the margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework make GPs an appealing alternative worth of further adoption

Comparative Analysis of Word Embeddings for Capturing Word Similarities

Author: Kalajdjieski Jovan
Stojanovska Frosina
Toshevska Martina
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 07/05/2020
Field of study

Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most of the natural language processing models that are based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the most qualitative word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed with existing benchmark datasets for word pairs similarities. The research in this paper conducts a correlation analysis between ground truth word similarities and similarities obtained by different word embedding methods.Comment: Part of the 6th International Conference on Natural Language Processing (NATP 2020

arXiv.org e-Print Archive

Crossref

Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

Author: A Roberts
A Shah
Aleksandar Savkov
B Efron
G Hripcsak
G Savova
J Cohen
J Foster
J-W Fan
Jackie Cassell
John Carroll
K Verspoor
KH Krippendorff
LK Tanabe
M Bada
MP Marcus
Rob Koeling
S Abney
W Sun
Ö Uzuner
Ö Uzuner
Ö Uzuner
Ö Uzuner
Ö Uzuner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning

Crossref

Springer - Publisher Connector

PubMed Central

Sussex Research Online