Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task for bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data in order to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods have to make strong assumptions about the nature of the training data, and thus the performance of the predictor is highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform the traditional supervised methods and that the highest performance is reached when applying the cluster assumption. Moreover, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be predicted correctly under each assumption. Postprint (published version)
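As a rough illustration of the cluster assumption mentioned in this abstract (points in the same cluster tend to share a label), the sketch below clusters labelled and unlabelled points together and then gives each cluster the majority label of its labelled members. This "cluster-then-label" scheme is a generic textbook technique and is not claimed to be the method used in the paper; the function names, the toy 2-D points, and the labels `has_term` / `lacks_term` are all hypothetical.

```python
from collections import Counter

def kmeans(points, centroids, iters=20):
    """Plain Lloyd's k-means on tuples; returns the cluster index of each point."""
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign

def cluster_then_label(points, labels, init_centroids):
    """labels[i] is a class label, or None when point i is unlabelled."""
    assign = kmeans(points, list(init_centroids))
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    # Each cluster takes the majority label of its labelled members;
    # clusters without any labelled member stay unlabelled (None).
    return [votes[a].most_common(1)[0][0] if a in votes else None for a in assign]

# Toy data (hypothetical): two well-separated groups, one labelled point in each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["has_term", None, None, "lacks_term", None, None]
predicted = cluster_then_label(points, labels, [points[0], points[3]])
print(predicted)
```

The sketch only works when the clusters really do align with the labels, which is exactly the kind of assumption whose applicability the abstract says must be checked per GO term.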
Practice with Graph-based ANN Algorithms on Sparse Data: Chi-square Two-tower model, HNSW, Sign Cauchy Projections
Sparse data are common. The traditional ``handcrafted'' features are often
sparse. Embedding vectors from trained models can also be very sparse, for
example, embeddings trained via the ``ReLu'' activation function. In this
paper, we report our exploration of efficient search in sparse data with
graph-based ANN algorithms (e.g., HNSW, or SONG which is the GPU version of
HNSW), which are popular in industrial practice, e.g., search and ads
(advertising).
We experiment with the proprietary ads targeting application, as well as
benchmark public datasets. For ads targeting, we train embeddings with the
standard ``cosine two-tower'' model and we also develop the ``chi-square
two-tower'' model. Both models produce (highly) sparse embeddings when they are
integrated with the ``ReLu'' activation function. In EBR (embedding-based
retrieval) applications, after the embeddings are trained, the next crucial
task is the approximate near neighbor (ANN) search for serving. While there are
many ANN algorithms we can choose from, in this study, we focus on the
graph-based ANN algorithm (e.g., HNSW-type).
Sparse embeddings should help improve the efficiency of EBR. One benefit is
the reduced memory cost for the embeddings. The other obvious benefit is the
reduced computational time for evaluating similarities, because, for
graph-based ANN algorithms such as HNSW, computing similarities is often the
dominating cost. In addition to the effort on leveraging data sparsity for
storage and computation, we also integrate ``sign cauchy random projections''
(SignCRP) to hash vectors to bits, to further reduce the memory cost and speed
up the ANN search. In NIPS'13, SignCRP was proposed to hash the chi-square
similarity, which is a well-adopted nonlinear kernel in NLP and computer
vision. Therefore, the chi-square two-tower model, SignCRP, and HNSW are now
tightly integrated.
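The two ingredients named in this abstract, similarity computation on sparse vectors and hashing to bits with sign Cauchy random projections, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sparse vectors are stored as `dict` (index to positive value) and assumed L1-normalised, the chi-square similarity is taken as the sum of 2·u_i·v_i/(u_i+v_i), and the helper names (`chi_square_similarity`, `cauchy_matrix`, `sign_crp`, `hamming`) are hypothetical. The exact relation between bit disagreement and chi-square similarity is given in the NIPS'13 paper and is not reproduced here.

```python
import math
import random

def chi_square_similarity(u, v):
    """Chi-square similarity of two sparse non-negative vectors.

    Only indices that are non-zero in both vectors contribute, so the cost
    scales with the number of non-zeros rather than the full dimension.
    """
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(2.0 * x * v[i] / (x + v[i]) for i, x in u.items() if i in v)

def cauchy_matrix(dim, n_bits, seed=0):
    """dim x n_bits matrix of i.i.d. standard Cauchy samples."""
    rng = random.Random(seed)
    # Inverse-CDF sampling: tan(pi * (U - 0.5)) with U ~ Uniform[0, 1).
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n_bits)]
            for _ in range(dim)]

def sign_crp(u, R):
    """Project a sparse vector with R and keep only the sign of each projection."""
    n_bits = len(R[0])
    proj = [0.0] * n_bits
    for i, x in u.items():  # zero entries are never touched
        row = R[i]
        for j in range(n_bits):
            proj[j] += x * row[j]
    return [1 if p >= 0 else 0 for p in proj]

def hamming(a, b):
    """Number of positions where two bit lists disagree."""
    return sum(x != y for x, y in zip(a, b))

# Toy vectors (hypothetical): v overlaps with u, w shares no index with u.
u = {0: 0.5, 3: 0.5}
v = {0: 0.5, 3: 0.25, 7: 0.25}
w = {5: 0.5, 9: 0.5}
R = cauchy_matrix(dim=10, n_bits=512)
bits_u, bits_v, bits_w = sign_crp(u, R), sign_crp(v, R), sign_crp(w, R)
print(chi_square_similarity(u, v), hamming(bits_u, bits_v))
print(chi_square_similarity(u, w), hamming(bits_u, bits_w))
```

In this toy setup the pair with the higher chi-square similarity should disagree on fewer bits, which is what lets Hamming distance on the bit codes stand in for the similarity during graph search.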
Fast Single-Class Classification and the Principle of Logit Separation
We consider neural network training, in applications in which there are many
possible classes, but at test-time, the task is a binary classification task of
determining whether the given example belongs to a specific class, where the
class of interest can be different each time the classifier is applied. For
instance, this is the case for real-time image search. We define the Single
Logit Classification (SLC) task: training the network so that at test-time, it
would be possible to accurately identify whether the example belongs to a given
class in a computationally efficient manner, based only on the output logit for
this class. We propose a natural principle, the Principle of Logit Separation,
as a guideline for choosing and designing losses suitable for the SLC. We show
that the cross-entropy loss function is not aligned with the Principle of Logit
Separation. In contrast, there are known loss functions, as well as novel batch
loss functions that we propose, which are aligned with this principle. In
total, we study seven loss functions. Our experiments show that indeed in
almost all cases, losses that are aligned with the Principle of Logit
Separation obtain at least 20% relative accuracy improvement in the SLC task
compared to losses that are not aligned with it, and sometimes considerably
more. Furthermore, we show that fast SLC does not cause any drop in binary
classification accuracy, compared to standard classification in which all
logits are computed, and yields a speedup which grows with the number of
classes. For instance, we demonstrate a 10x speedup when the number of classes
is 400,000. Tensorflow code for optimizing the new batch losses is publicly
available at https://github.com/cruvadom/Logit_Separation.
Comment: Published as a conference paper in ICDM 201
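The test-time shortcut behind the SLC task can be sketched as below: with a linear output layer, the logit of one class needs only that class's weight row, so the cost does not grow with the number of classes. The sketch also shows one intuition for why a softmax-based loss may not constrain individual logits: softmax output is unchanged when every logit is shifted by the same constant. This is an illustration under assumptions, not the paper's code; the toy weights, the threshold of 0.0, and all function names are hypothetical.

```python
import math

def all_logits(W, b, h):
    """Standard classification: one dot product per class."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def single_logit(W, b, h, c):
    """SLC-style shortcut: compute only the logit of class c."""
    return sum(w * x for w, x in zip(W[c], h)) + b[c]

def belongs_to_class(W, b, h, c, threshold=0.0):
    """Binary decision from one logit; the threshold is an assumed value."""
    return single_logit(W, b, h, c) > threshold

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy output layer (hypothetical): 3 classes, 2-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, -0.5, 0.0]
h = [2.0, 0.25]

z = all_logits(W, b, h)
print(z, single_logit(W, b, h, 0))
print(belongs_to_class(W, b, h, 0), belongs_to_class(W, b, h, 1))

# Shifting every logit by the same constant leaves the softmax unchanged,
# so softmax probabilities alone do not pin down the scale of a single logit.
shifted = [v + 10.0 for v in z]
print(softmax(z), softmax(shifted))
```

A loss suited to this use has to make the value of each logit meaningful on its own, which is the role the abstract assigns to the Principle of Logit Separation.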