Beyond Accuracy: Measuring Representation Capacity of Embeddings to Preserve Structural and Contextual Information
Effective representation of data is crucial in various machine learning
tasks, as it captures the underlying structure and context of the data.
Embeddings have emerged as a powerful technique for data representation, but
evaluating their quality and capacity to preserve structural and contextual
information remains a challenge. In this paper, we address this need by
proposing a method to measure the \textit{representation capacity} of
embeddings. The motivation behind this work stems from the importance of
understanding the strengths and limitations of embeddings, enabling researchers
and practitioners to make informed decisions in selecting appropriate embedding
models for their specific applications. By combining extrinsic evaluation
methods, such as classification and clustering, with t-SNE-based neighborhood
analyses, namely neighborhood agreement and trustworthiness, we provide a
comprehensive assessment of representation capacity. Additionally, the use of
Bayesian optimization to tune the weights of these components (classification,
clustering, neighborhood agreement, and trustworthiness) ensures an objective
and data-driven approach to selecting the optimal
combination of metrics. The proposed method not only contributes to advancing
the field of embedding evaluation but also empowers researchers and
practitioners with a quantitative measure to assess the effectiveness of
embeddings in capturing structural and contextual information. For the
evaluation, we use real-world biological sequence (protein and nucleotide)
datasets and perform a representation capacity analysis of embedding
methods from the literature, namely Spike2Vec, Spaced $k$-mers, PWM2Vec, and
AutoEncoder.
Comment: Accepted at ISBRA 202
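A minimal sketch of the metric-aggregation idea, on synthetic data: the PCA stand-in embedding, the specific metric choices, and the uniform weights below are illustrative assumptions; the paper tunes the weights with Bayesian optimization and also includes neighborhood agreement, omitted here for brevity.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.manifold import trustworthiness
    from sklearn.metrics import silhouette_score
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=50, random_state=0)
    Z = PCA(n_components=10, random_state=0).fit_transform(X)  # stand-in embedding

    # Extrinsic scores (classification, clustering) plus a neighborhood score.
    clf = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5).mean()
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
    clu = silhouette_score(Z, labels)
    tru = trustworthiness(X, Z, n_neighbors=10)

    # Weighted combination; the paper selects these weights via Bayesian
    # optimization, whereas uniform weights serve as a placeholder here.
    scores = np.array([clf, clu, tru])
    weights = np.ones_like(scores) / scores.size
    print("representation capacity:", float(weights @ scores))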
Learning heterogeneous subgraph representations for team discovery
The team discovery task is concerned with finding a group of experts from a collaboration network who would collectively cover a desirable set of skills. Most prior work on team discovery adopts either graph-based or neural mapping approaches. Graph-based approaches are computationally intractable, often leading to sub-optimal team selection. Neural mapping approaches perform better, but are still limited as they learn individual representations for skills and experts and are often prone to overfitting given the sparsity of collaboration networks. Thus, we define the team discovery task as one of learning subgraph representations from a heterogeneous collaboration network, where the subgraphs represent teams, which are then used to identify relevant teams for a given set of skills. As such, our approach captures local (node interactions within each team) and global (subgraph interactions between teams) characteristics of the representation network and allows us to easily map between any homogeneous and heterogeneous subgraphs in the network to effectively discover teams. Our experiments over two real-world datasets from different domains, namely the DBLP bibliographic dataset with 10,647 papers and IMDB with 4,882 movies, illustrate that our approach outperforms the state-of-the-art baselines on a range of ranking and quality metrics. More specifically, in terms of ranking metrics, we are superior to the best baseline by approximately 15% on the DBLP dataset and by approximately 20% on the IMDB dataset. Further, our findings illustrate that our approach consistently shows a robust performance improvement over the baselines.
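As a toy illustration of scoring team subgraphs against a skill query, the sketch below mean-pools node embeddings into a subgraph representation and ranks teams by cosine similarity; the random embeddings, the pooling operator, and the team/skill names are hypothetical simplifications, not the paper's learned heterogeneous subgraph model.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical pre-trained node embeddings for skills (expert nodes omitted).
    skill_emb = {s: rng.normal(size=16) for s in ["ml", "nlp", "ir", "db"]}
    teams = {"team_a": ["ml", "nlp"], "team_b": ["ir", "db"]}

    def subgraph_repr(nodes):
        # Mean-pooling as a crude stand-in for a learned subgraph representation.
        return np.mean([skill_emb[n] for n in nodes], axis=0)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    query = subgraph_repr(["ml", "ir"])  # desired skill set
    ranked = sorted(teams, key=lambda t: cosine(subgraph_repr(teams[t]), query),
                    reverse=True)
    print(ranked)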
The Missing Indicator Method: From Low to High Dimensions
Missing data is common in applied data science, particularly for tabular data
sets found in healthcare, social sciences, and natural sciences. Most
supervised learning methods only work on complete data, thus requiring
preprocessing such as missing value imputation to work on incomplete data sets.
However, imputation alone does not encode useful information about the missing
values themselves. For data sets with informative missing patterns, the Missing
Indicator Method (MIM), which adds indicator variables to indicate the missing
pattern, can be used in conjunction with imputation to improve model
performance. While commonly used in data science, MIM is surprisingly
understudied from an empirical and especially theoretical perspective. In this
paper, we show empirically and theoretically that MIM improves performance for
informative missing values, and we prove that MIM does not hurt linear models
asymptotically for uninformative missing values. Additionally, we find that for
high-dimensional data sets with many uninformative indicators, MIM can induce
model overfitting and thus degrade test performance. To address this issue, we
introduce Selective MIM (SMIM), a novel MIM extension that adds missing
indicators only for features that have informative missing patterns. We show
empirically that SMIM performs at least as well as MIM in general, and improves
MIM for high-dimensional data. Lastly, to illustrate the utility of MIM on
real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM
on clinical tasks derived from the MIMIC-III database of electronic health
records.
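MIM itself maps directly onto scikit-learn's add_indicator option; the selection rule sketched for SMIM below (keep an indicator only when its missingness pattern is predictive of the label, measured by a univariate AUC with an assumed 0.05 margin) is an illustrative criterion, not necessarily the paper's.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    # Column 0 is informatively missing (depends on y); column 1 is missing at random.
    X[(y == 1) & (rng.uniform(size=500) < 0.4), 0] = np.nan
    X[rng.uniform(size=500) < 0.3, 1] = np.nan

    # MIM: impute, then append one missingness indicator per incomplete column.
    X_mim = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

    # SMIM-style selection (assumed criterion): keep an indicator only if its
    # missingness pattern is predictive of the target.
    indicators = np.isnan(X)
    keep = [j for j in range(X.shape[1])
            if indicators[:, j].any()
            and abs(roc_auc_score(y, indicators[:, j]) - 0.5) > 0.05]
    X_imp = SimpleImputer(strategy="mean").fit_transform(X)
    X_smim = np.hstack([X_imp, indicators[:, keep].astype(float)])
    print(X_mim.shape, X_smim.shape)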
Efficient Sketching Algorithm for Sparse Binary Data
Recent advancements in the WWW, IoT, social networks, e-commerce, etc. have
generated a large volume of data. These datasets are mostly high-dimensional
and sparse. Many fundamental subroutines of common data
analytic tasks such as clustering, classification, ranking, nearest neighbour
search, etc. scale poorly with the dimension of the dataset. In this work, we
address this problem and propose a sketching (alternatively, dimensionality
reduction) algorithm -- BinSketch (Binary Data Sketch) -- for sparse binary
datasets. BinSketch preserves the binary nature of the dataset after
sketching and maintains estimates for multiple similarity measures such as
Jaccard, Cosine, Inner-Product similarities, and Hamming distance, on the same
sketch. We present a theoretical analysis of our algorithm and complement it
with extensive experimentation on several real-world datasets. We compare the
performance of our algorithm with state-of-the-art algorithms in terms of
mean squared error and ranking. Our proposed algorithm offers comparable
accuracy while achieving a significant speedup in dimensionality reduction
time relative to the other candidate algorithms. Our proposal is simple and
easy to implement, and can therefore be adopted in practice.
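A hedged sketch of the core compression step, assuming a bucketing construction in which each coordinate hashes to one sketch bucket and buckets are OR-ed; the naive inner-product estimate at the end ignores hash collisions, for which the paper derives corrected estimators of Jaccard, Cosine, Inner-Product, and Hamming.

    import numpy as np

    def binsketch(x, sketch_dim, seed=0):
        # Compress a sparse binary vector by OR-ing coordinates that hash
        # to the same bucket; the same seed gives the same hash for all vectors.
        rng = np.random.default_rng(seed)
        buckets = rng.integers(0, sketch_dim, size=x.shape[0])
        s = np.zeros(sketch_dim, dtype=np.uint8)
        np.maximum.at(s, buckets, x.astype(np.uint8))  # OR via max on {0, 1}
        return s

    dim, sketch_dim = 10_000, 512
    rng = np.random.default_rng(1)
    a = (rng.uniform(size=dim) < 0.01).astype(np.uint8)
    b = (rng.uniform(size=dim) < 0.01).astype(np.uint8)
    # Naive (collision-uncorrected) inner-product estimate on the sketches.
    print(int(a @ b), int(binsketch(a, sketch_dim) @ binsketch(b, sketch_dim)))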
Improving financial investment by deep learning method: predicting stock returns of Tehran stock exchange companies
Safe investment can be achieved by combining human experience with modern predictive science. Artificial Intelligence (AI) plays a vital role in reducing errors in this combination. This study analyzes the performance of Deep Learning (DL) and Machine Learning (ML) methods in modelling and predicting stock-return time series based on the return rates of previous periods and a set of exogenous variables. The data comprise the weekly stock return index of 200 companies listed on the Tehran Stock Exchange from 2016 to 2021. Two DL models, Long Short-Term Memory (LSTM) and Deep Q-Network (DQN), and two ML models, Random Forest (RF) and Support Vector Machine (SVM), were selected. The results showed the superiority of DL algorithms over ML, which may indicate the existence of strong dependence patterns in these time series, as well as relatively complex nonlinear relationships with uncertainty between the determinant variables. Meanwhile, LSTM, with an R-squared of 87 percent and according to five other evaluation metrics, showed the highest accuracy and the lowest prediction error. On the other hand, the RF model yielded the lowest prediction accuracy, with the highest error.
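A minimal Keras sketch of the LSTM setup on a synthetic weekly-return series; the window length, layer sizes, and training budget are assumptions for illustration, not the study's configuration.

    import numpy as np
    import tensorflow as tf

    # Toy weekly-return series; window past returns to predict the next one.
    rng = np.random.default_rng(0)
    returns = rng.normal(scale=0.02, size=500).astype("float32")
    window = 8
    X = np.stack([returns[i:i + window]
                  for i in range(len(returns) - window)])[..., None]
    y = returns[window:]

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),  # next-period return
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)
    print("in-sample MSE:", model.evaluate(X, y, verbose=0))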
Modeling Events and Interactions through Temporal Processes -- A Survey
In real-world scenarios, many phenomena produce collections of events that
occur in continuous time. Point Processes provide a natural mathematical
framework for modeling these sequences of events. In this survey, we
investigate probabilistic models for modeling event sequences through temporal
processes. We review the notion of event modeling and provide the mathematical
foundations that characterize the literature on the topic. We define an
ontology to categorize the existing approaches in terms of three families:
simple, marked, and spatio-temporal point processes. For each family, we
systematically review the existing approaches based on deep learning.
Finally, we analyze the scenarios where the proposed techniques can be used for
addressing prediction and modeling aspects.
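As a concrete instance of the simple-point-process family, the sketch below simulates a self-exciting Hawkes process with an exponential kernel via Ogata's thinning algorithm; the parameter values are arbitrary, and the example is illustrative rather than drawn from the survey.

    import numpy as np

    def simulate_hawkes(mu, alpha, beta, T, seed=0):
        # Intensity: lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
        rng = np.random.default_rng(seed)
        events, t = [], 0.0
        while True:
            # The intensity just after t upper-bounds lambda until the next
            # event, since the exponential kernel only decays between events.
            lam_bar = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
            t += rng.exponential(1.0 / lam_bar)
            if t >= T:
                return np.array(events)
            lam_t = mu + sum(alpha * np.exp(-beta * (t - ti)) for ti in events)
            if rng.uniform() <= lam_t / lam_bar:  # thinning: accept with prob lam_t/lam_bar
                events.append(t)

    print(simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=20.0))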