Beyond Accuracy: Measuring Representation Capacity of Embeddings to Preserve Structural and Contextual Information
Effective representation of data is crucial in various machine learning
tasks, as it captures the underlying structure and context of the data.
Embeddings have emerged as a powerful technique for data representation, but
evaluating their quality and capacity to preserve structural and contextual
information remains a challenge. In this paper, we address this need by
proposing a method to measure the representation capacity of
embeddings. The motivation behind this work stems from the importance of
understanding the strengths and limitations of embeddings, enabling researchers
and practitioners to make informed decisions in selecting appropriate embedding
models for their specific applications. By combining extrinsic evaluation
methods, such as classification and clustering, with t-SNE-based neighborhood
analysis, such as neighborhood agreement and trustworthiness, we provide a
comprehensive assessment of the representation capacity. Additionally, using
Bayesian optimization to weight the four metrics (classification, clustering,
neighborhood agreement, and trustworthiness) ensures an objective and
data-driven approach in selecting the optimal combination of metrics. The
proposed method not only contributes to advancing
the field of embedding evaluation but also empowers researchers and
practitioners with a quantitative measure to assess the effectiveness of
embeddings in capturing structural and contextual information. For the
evaluation, we use real-world biological sequence (protein and nucleotide)
datasets and perform representation capacity analysis of embedding
methods from the literature, namely Spike2Vec, Spaced k-mers, PWM2Vec, and
AutoEncoder.
Comment: Accepted at ISBRA 202
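As a rough illustration of the neighborhood-based part of the assessment described above, the sketch below computes a simple neighborhood agreement score: the average fraction of k nearest neighbors that a point keeps when moving from one representation to another. This is a minimal sketch under assumptions, not the paper's implementation; the function names, the Euclidean distance, and the default k=5 are all hypothetical choices.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbors of every row (the row itself excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]

def neighborhood_agreement(original, projected, k=5):
    """Mean fraction of k nearest neighbors shared by the two representations."""
    nn_original = knn_indices(np.asarray(original, dtype=float), k)
    nn_projected = knn_indices(np.asarray(projected, dtype=float), k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_original, nn_projected)]
    return float(np.mean(overlaps))
```

A score of 1.0 means every point keeps exactly the same k neighbors in both spaces; lower values indicate that the second representation distorts local structure.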
Anderson Acceleration For Bioinformatics-Based Machine Learning
Anderson acceleration (AA) is a well-known method for accelerating the
convergence of iterative algorithms, with applications in various fields
including deep learning and optimization. Despite its popularity in these
areas, the effectiveness of AA in classical machine learning classifiers has
not been thoroughly studied. Tabular data, in particular, presents a unique
challenge for deep learning models, and classical machine learning models are
known to perform better in these scenarios. However, the convergence analysis
of these models has received limited attention. To address this gap in
research, we implement a support vector machine (SVM) classifier variant that
incorporates AA to speed up convergence. We evaluate the performance of our SVM
with and without Anderson acceleration on several datasets from the biology
domain and demonstrate that the use of AA significantly improves convergence
and reduces the training loss as the number of iterations increases. Our
findings provide a promising perspective on the potential of Anderson
acceleration in the training of simple machine learning classifiers and
underscore the importance of further research in this area. By showing the
effectiveness of AA in this setting, we aim to inspire more studies that
explore the applications of AA in classical machine learning.
Comment: Accepted in KDH-2023: Knowledge Discovery in Healthcare Data (IJCAI
Workshop)
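For readers unfamiliar with Anderson acceleration, the following is a minimal sketch of the generic technique applied to a fixed-point iteration x = g(x). It is not the SVM variant from the paper: the function name, the memory size, and the tolerance are assumptions chosen for illustration only.

```python
import numpy as np

def anderson_fixed_point(g, x0, memory=5, max_iter=100, tol=1e-10):
    """Solve x = g(x) with Anderson acceleration; returns (x, number of g evaluations)."""
    x = np.asarray(x0, dtype=float)
    g_hist, f_hist = [], []
    for evaluations in range(1, max_iter + 1):
        gx = g(x)
        f = gx - x  # residual of the fixed-point equation
        if np.linalg.norm(f) < tol:
            return x, evaluations
        # Keep only the most recent (memory + 1) iterates and residuals.
        g_hist = (g_hist + [gx])[-(memory + 1):]
        f_hist = (f_hist + [f])[-(memory + 1):]
        if len(f_hist) == 1:
            x = gx  # plain fixed-point step until there is some history
        else:
            dF = np.diff(np.array(f_hist), axis=0).T  # residual differences
            dG = np.diff(np.array(g_hist), axis=0).T  # iterate differences
            # Least-squares mixing coefficients that minimize the combined residual.
            gamma = np.linalg.lstsq(dF, f, rcond=None)[0]
            x = gx - dG @ gamma
    return x, max_iter
```

On a slowly contracting linear map, this kind of extrapolation typically needs far fewer evaluations of g than the plain iteration x ← g(x), which is the effect the abstract reports for training loss.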
T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification
Cancer is a complex disease characterized by uncontrolled cell growth and
proliferation. T cell receptors (TCRs) are essential proteins for the adaptive
immune system, and their specific recognition of antigens plays a crucial role
in the immune response against diseases, including cancer. The diversity and
specificity of TCRs make them ideal for targeting cancer cells, and recent
advancements in sequencing technologies have enabled the comprehensive
profiling of TCR repertoires. This has led to the discovery of TCRs with potent
anti-cancer activity and the development of TCR-based immunotherapies. In this
study, we investigate the use of sparse coding for the multi-class
classification of TCR protein sequences with cancer categories as target
labels. Sparse coding is a popular technique in machine learning that enables
the representation of data with a set of informative features and can capture
complex relationships between amino acids and identify subtle patterns in the
sequence that might be missed by low-dimensional methods. We first compute the
k-mers from the TCR sequences and then apply sparse coding to capture the
essential features of the data. To improve the predictive performance of the
final embeddings, we integrate domain knowledge regarding different types of
cancer properties. We then train different machine learning (linear and
non-linear) classifiers on the embeddings of TCR sequences for the purpose of
supervised analysis. Our proposed embedding method on a benchmark dataset of
TCR sequences significantly outperforms the baselines in terms of predictive
performance, achieving an accuracy of 99.8%. Our study highlights the
potential of sparse coding for the analysis of TCR protein sequences in cancer
research and other related fields.
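The first step of the pipeline above, computing k-mers from a sequence, can be sketched as follows. This covers only the k-mer extraction; the sparse coding and domain-knowledge steps are not specified in enough detail here to reproduce. The function names, the default k=3, and the example sequence are hypothetical.

```python
from collections import Counter

def kmers(sequence, k=3):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k=3):
    """Frequency of each k-mer, a common starting point for sequence embeddings."""
    return Counter(kmers(sequence, k))

# Example with a short, made-up amino acid string.
print(kmers("CASSLG", k=3))
```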
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented
amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and
counting. This amount of data, while being orders of magnitude beyond the
capacity of traditional approaches to understanding the diversity, dynamics,
and evolution of viruses, is nonetheless a rich resource for machine learning
(ML) approaches as alternatives for extracting such important information from
these data. It is hence of utmost importance to design a framework for testing
and benchmarking the robustness of these ML models.
This paper makes the first effort (to our knowledge) to benchmark the
robustness of ML models by simulating biological sequences with errors. In this
paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to
mimic the error profiles of common sequencing platforms such as Illumina and
PacBio. We show from experiments on a wide array of ML models that, for
specific embedding methods, some simulation-based approaches are more robust
(and accurate) than others under certain adversarial attacks on the input
sequences. Our benchmarking framework may assist researchers in properly
assessing different ML models and help them understand the behavior of the
SARS-CoV-2 virus or avoid possible future pandemics.
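To make the idea of perturbing sequences concrete, here is a minimal sketch that injects random substitution errors at a fixed per-base rate. It is a deliberately simplified assumption: it does not model the Illumina or PacBio error profiles mentioned above, and the function name and parameters are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def substitute_errors(sequence, error_rate, seed=None):
    """Replace each nucleotide with a different one with probability error_rate."""
    rng = random.Random(seed)
    perturbed = []
    for base in sequence:
        if base in NUCLEOTIDES and rng.random() < error_rate:
            # Pick uniformly among the three other nucleotides.
            perturbed.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            perturbed.append(base)  # unchanged (also covers symbols such as N)
    return "".join(perturbed)
```

A robustness benchmark in this spirit would train a model on clean sequences and compare its accuracy on clean versus perturbed test sequences at several error rates.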
Computing Graph Descriptors on Edge Streams
Feature extraction is an essential task in graph analytics. These feature
vectors, called graph descriptors, are used in downstream vector-space-based
graph analysis models. This idea has proved fruitful in the past, with
spectral-based graph descriptors providing state-of-the-art classification
accuracy. However, known algorithms to compute meaningful descriptors do not
scale to large graphs since: (1) they require storing the entire graph in
memory, and (2) the end-user has no control over the algorithm's runtime. In
this paper, we present streaming algorithms to approximately compute three
different graph descriptors capturing the essential structure of graphs.
Operating on edge streams allows us to avoid storing the entire graph in
memory, and controlling the sample size enables us to keep the runtime of our
algorithms within desired bounds. We demonstrate the efficacy of the proposed
descriptors by analyzing the approximation error and classification accuracy.
Our scalable algorithms compute descriptors of graphs with millions of edges
within minutes. Moreover, these descriptors yield predictive accuracy
comparable to the state-of-the-art methods but can be computed using only 25%
as much memory.
Comment: Extension of work accepted to PAKDD 202
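One standard way to bound memory on an edge stream with a controllable sample size is reservoir sampling, sketched below. Whether the paper's algorithms use this exact scheme is an assumption; the function name and parameters are hypothetical, and the sketch only shows the sampling step, not the computation of any descriptor.

```python
import random

def reservoir_sample_edges(edge_stream, sample_size, seed=None):
    """Keep a uniform random sample of at most sample_size edges from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for seen, edge in enumerate(edge_stream, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(edge)  # fill the reservoir first
        else:
            slot = rng.randrange(seen)  # uniform integer in [0, seen)
            if slot < sample_size:
                reservoir[slot] = edge  # replace with probability sample_size / seen
    return reservoir
```

Memory use depends only on `sample_size`, not on the number of edges in the stream, which is the property that makes a single pass over very large graphs feasible.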
Efficient Data Analytics on Augmented Similarity Triplets
Many machine learning methods (classification, clustering, etc.) start with a
known kernel that provides similarity or distance measure between two objects.
Recent work has extended this to situations where the information about objects
is limited to comparisons of distances between three objects (triplets). Humans
find the comparison task much easier than the estimation of absolute
similarities, so this kind of data can be easily obtained using crowd-sourcing.
In this work, we first give an efficient method of augmenting the triplets
data by utilizing additional implicit information inferred from the existing
data. Triplets augmentation improves the quality of kernel-based and
kernel-free data analytics tasks. Second, we propose a novel set of algorithms
for common supervised and unsupervised machine learning tasks based on
triplets. These
methods work directly with triplets, avoiding kernel evaluations. Experimental
evaluation on real and synthetic datasets shows that our methods are more
accurate than the current best-known techniques.
Short-Term Load Forecasting Using AMI Data
Accurate short-term load forecasting is essential for efficient operation of
the power sector. Predicting load at a fine granularity such as individual
households or buildings is challenging due to higher volatility and uncertainty
in the load. For aggregate loads, such as at the grid level, the inherent
stochasticity and fluctuations are averaged out, so the problem becomes
substantially easier. We propose an approach for short-term load forecasting at
the individual consumer (household) level, called Forecasting using Matrix
Factorization (FMF). FMF does not use any consumer demographic or
activity-pattern information. Therefore, it can be applied to any locality
using readily available smart meter and weather data. We perform extensive
experiments on three benchmark datasets and demonstrate that FMF significantly
outperforms the computationally expensive state-of-the-art methods for this
problem. We achieve up to 26.5% and 24.4% improvement in RMSE over Regression
Tree and Support Vector Machine, respectively, and up to 36% and 73.2%
improvement in MAPE over Random Forest and Long Short-Term Memory neural
network, respectively.
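The two error measures quoted above have standard definitions, sketched here together with a relative improvement helper. The assumption that "improvement" means the relative reduction in error against a baseline is ours, as are the function names; the numbers in the example are made up for illustration and are not results from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Made-up hourly loads (kWh) and forecasts, for illustration only.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
print(rmse(actual, forecast), mape(actual, forecast))
```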