Semi-supervised prediction of protein interaction sentences exploiting semantically encoded metrics
Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
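The abstract does not give the exact kernel transformation, so the following is only a minimal sketch of one common way such word-similarity information can re-weight features: a semantic smoothing kernel k(x, y) = x^T S y, where S is a word-by-word similarity matrix. The vocabulary, the matrix values, and the function name `semantic_kernel` are illustrative assumptions, not taken from the paper.

```python
def semantic_kernel(x, y, similarity):
    """Compute k(x, y) = x^T S y for two bag-of-words vectors.

    `similarity` is a square word-by-word matrix S. It should be
    symmetric and positive semi-definite for k to be a valid kernel.
    """
    total = 0.0
    for i, xi in enumerate(x):
        if xi == 0:
            continue  # words absent from x contribute nothing
        for j, yj in enumerate(y):
            total += xi * similarity[i][j] * yj
    return total


# Hypothetical three-word vocabulary: ["binds", "interacts", "cell"].
# The 0.8 entry is a made-up similarity between "binds" and "interacts".
S = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
sentence_a = [1, 0, 1]  # contains "binds" and "cell"
sentence_b = [0, 1, 1]  # contains "interacts" and "cell"

# Plain linear kernel: only the shared word "cell" counts.
linear = sum(a * b for a, b in zip(sentence_a, sentence_b))
# Smoothed kernel: "binds" and "interacts" now also contribute.
smoothed = semantic_kernel(sentence_a, sentence_b, S)
print(linear, smoothed)
```

With the identity matrix in place of S, the function reduces to the ordinary linear kernel, which makes the effect of the similarity information easy to isolate.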
Measuring Semantic Similarity: Representations and Methods
This dissertation investigates and proposes ways to quantify and measure semantic similarity between texts. The general approach is to rely on linguistic information at various levels, including lexical, lexico-semantic, and syntactic. The approach starts by mapping texts onto structured representations that include lexical, lexico-semantic, and syntactic information. The representation is then used as input to methods designed to measure the semantic similarity between texts based on the available linguistic information. While world knowledge is needed to properly assess semantic similarity of texts, our approach does not use it, which is a weakness. We limit ourselves to answering the question of how successfully one can measure the semantic similarity of texts using just linguistic information. The lexical information in the original texts is retained by using the words in the corresponding representations of the texts. Syntactic information is encoded using dependency relation trees, which explicitly represent the syntactic relations between words. Word-level semantic information is relatively encoded through the use of semantic similarity measures like WordNet Similarity or explicitly encoded using vectorial representations such as Latent Semantic Analysis (LSA). Several methods are studied to compare the representations, ranging from simple lexical overlap to more complex methods such as comparing semantic representations in vector spaces as well as syntactic structures. Furthermore, a few powerful kernel models are proposed for use in combination with Support Vector Machine (SVM) classifiers for the case in which the semantic similarity problem is modeled as a classification task.
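The two ends of the range of methods mentioned above, lexical overlap and vector-space comparison, can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization, Jaccard overlap as the lexical measure, and raw term-count vectors standing in for LSA vectors. The dissertation's own measures may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def lexical_overlap(a, b):
    """Jaccard overlap between the word sets of two texts."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


overlap = lexical_overlap("the cat sat", "the cat ran")
cosine = cosine_similarity("the cat sat", "the cat ran")
print(overlap, cosine)
```

Either score could serve as a feature for a classifier when similarity is framed as a classification task, as in the last sentence of the abstract.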
End-to-End Supervised Multilabel Contrastive Learning
Multilabel representation learning is recognized as a challenging problem
that can be associated with either label dependencies between object categories
or data-related issues such as the inherent imbalance of positive/negative
samples. Recent advances address these challenges from model- and data-centric
viewpoints. In model-centric approaches, the label correlation is obtained by
external model designs (e.g., graph CNN) that incorporate an inductive bias for training.
However, they fail to design an end-to-end training framework, leading to high
computational complexity. In data-centric approaches, by contrast, the realistic
nature of the dataset is considered for improving the classification while
ignoring the label dependencies. In this paper, we propose a new end-to-end
training framework -- dubbed KMCL (Kernel-based Multilabel Contrastive
Learning) -- to address the shortcomings of both model- and data-centric
designs. The KMCL first transforms the embedded features into a mixture of
exponential kernels in Gaussian RKHS. It is then followed by encoding an
objective loss that is comprised of (a) reconstruction loss to reconstruct
kernel representation, (b) asymmetric classification loss to address the
inherent imbalance problem, and (c) contrastive loss to capture label
correlation. The KMCL models the uncertainty of the feature encoder while
maintaining a low computational footprint. Extensive experiments are conducted
on image classification tasks to showcase the consistent improvements of KMCL
over the SOTA methods. A PyTorch implementation is provided at
https://github.com/mahdihosseini/KMCL
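Of the three loss components listed above, the asymmetric classification loss is the easiest to illustrate in isolation. The abstract does not state its exact form, so the sketch below assumes the focal-style asymmetric loss commonly used for multilabel imbalance, with separate focusing exponents for positives and negatives and a probability margin for negatives. The function name and the default values of `gamma_pos`, `gamma_neg`, and `margin` are assumptions, and the code is plain Python rather than the authors' PyTorch implementation.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-label asymmetric losses.

    probs   -- predicted probabilities, one per label
    targets -- 1 for a positive label, 0 for a negative label
    Negatives are down-weighted more strongly (gamma_neg > gamma_pos),
    and negatives predicted below `margin` contribute nothing.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        if t == 1:
            # Positive term: (1 - p)^gamma_pos * log(p)
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term uses the margin-shifted probability.
            p_shift = max(p - margin, 0.0)
            total -= p_shift ** gamma_neg * math.log(max(1.0 - p_shift, eps))
    return total


easy_negative = asymmetric_loss([0.03], [0])  # below the margin
hard_negative = asymmetric_loss([0.90], [0])  # confidently wrong
positive = asymmetric_loss([0.50], [1])
print(easy_negative, hard_negative, positive)
```

The design intent is that the many easy negatives in a multilabel dataset stop dominating the gradient, while positives keep an ordinary cross-entropy penalty when `gamma_pos` is zero.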
Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
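Task (i) for links, predicting their existence, can be illustrated with a very simple baseline: score each unlinked node pair by the number of neighbors the two nodes share. This common-neighbors heuristic is chosen here only as an example; the abstract does not name specific methods, and the toy graph and the function name `common_neighbor_scores` are assumptions.

```python
from itertools import combinations


def common_neighbor_scores(adjacency):
    """Score every currently unlinked node pair by shared-neighbor count.

    `adjacency` maps each node to the set of its neighbors in an
    undirected graph. A higher score suggests a more plausible link.
    """
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # the link already exists, nothing to predict
        scores[(u, v)] = len(adjacency[u] & adjacency[v])
    return scores


# Toy undirected graph: every pair is linked except a and d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
scores = common_neighbor_scores(graph)
print(scores)
```

Adding the top-scoring pairs as new links is one form of the link transformation described above, and the same scores could instead be used as link weights for task (iii).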