
    Beyond Accuracy: Measuring Representation Capacity of Embeddings to Preserve Structural and Contextual Information

    Effective representation of data is crucial in various machine learning tasks, as it captures the underlying structure and context of the data. Embeddings have emerged as a powerful technique for data representation, but evaluating their quality and capacity to preserve structural and contextual information remains a challenge. In this paper, we address this need by proposing a method to measure the representation capacity of embeddings. The motivation behind this work stems from the importance of understanding the strengths and limitations of embeddings, enabling researchers and practitioners to make informed decisions in selecting appropriate embedding models for their specific applications. By combining extrinsic evaluation methods, such as classification and clustering, with t-SNE-based neighborhood analysis, such as neighborhood agreement and trustworthiness, we provide a comprehensive assessment of the representation capacity. Additionally, the use of optimization techniques (Bayesian optimization) for weight optimization (over classification, clustering, neighborhood agreement, and trustworthiness) ensures an objective and data-driven approach to selecting the optimal combination of metrics. The proposed method not only contributes to advancing the field of embedding evaluation but also empowers researchers and practitioners with a quantitative measure to assess the effectiveness of embeddings in capturing structural and contextual information. For the evaluation, we use 3 real-world biological sequence (protein and nucleotide) datasets and perform a representation capacity analysis of 4 embedding methods from the literature, namely Spike2Vec, Spaced k-mers, PWM2Vec, and AutoEncoder. Comment: Accepted at ISBRA 202
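
    As a rough illustration of how such a composite score could be assembled, the sketch below combines classification accuracy, clustering agreement, t-SNE neighborhood agreement, and trustworthiness into one weighted score. The dataset, metric choices, and fixed weights are placeholders (the paper tunes the weights with Bayesian optimization), not the authors' exact pipeline.

    # Hedged sketch: a weighted "representation capacity" score from extrinsic
    # and neighborhood-based metrics. Weights are fixed here for illustration.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score
    from sklearn.manifold import TSNE, trustworthiness
    from sklearn.neighbors import NearestNeighbors

    X, y = load_digits(return_X_y=True)      # stand-in for an embedding matrix
    X, y = X[:500], y[:500]

    clf_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    clu_ari = adjusted_rand_score(y, KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X))

    X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
    trust = trustworthiness(X, X_2d, n_neighbors=10)

    def neighborhood_agreement(A, B, k=10):
        """Average overlap of k-nearest-neighbour sets before/after projection."""
        nn_a = NearestNeighbors(n_neighbors=k + 1).fit(A).kneighbors(A, return_distance=False)[:, 1:]
        nn_b = NearestNeighbors(n_neighbors=k + 1).fit(B).kneighbors(B, return_distance=False)[:, 1:]
        return np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)])

    agree = neighborhood_agreement(X, X_2d)

    # Weighted combination; the paper selects these weights via Bayesian optimization.
    w = np.array([0.25, 0.25, 0.25, 0.25])
    capacity = float(w @ np.array([clf_acc, clu_ari, agree, trust]))
    print(f"classification={clf_acc:.3f} clustering={clu_ari:.3f} "
          f"agreement={agree:.3f} trustworthiness={trust:.3f} capacity={capacity:.3f}")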

    Learning heterogeneous subgraph representations for team discovery

    The team discovery task is concerned with finding a group of experts from a collaboration network who would collectively cover a desirable set of skills. Most prior work on team discovery adopts either graph-based or neural mapping approaches. Graph-based approaches are computationally intractable, often leading to sub-optimal team selection. Neural mapping approaches perform better, but they are still limited as they learn individual representations for skills and experts and are often prone to overfitting given the sparsity of collaboration networks. Thus, we define the team discovery task as one of learning subgraph representations from a heterogeneous collaboration network, where the subgraphs represent teams, which are then used to identify relevant teams for a given set of skills. As such, our approach captures local (node interactions within each team) and global (subgraph interactions between teams) characteristics of the representation network and allows us to easily map between any homogeneous and heterogeneous subgraphs in the network to effectively discover teams. Our experiments over two real-world datasets from different domains, namely the DBLP bibliographic dataset with 10,647 papers and the IMDB dataset with 4,882 movies, illustrate that our approach outperforms the state-of-the-art baselines on a range of ranking and quality metrics. More specifically, in terms of ranking metrics, we are superior to the best baseline by approximately 15% on the DBLP dataset and by approximately 20% on the IMDB dataset. Further, our findings illustrate that our approach consistently shows a robust performance improvement over the baselines.
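
    The toy sketch below illustrates only the retrieval step implied above: once team subgraphs and skills share an embedding space, discovery reduces to ranking team vectors against an aggregated query-skill vector. The embeddings here are random placeholders, not the paper's learned heterogeneous subgraph representations.

    # Hypothetical retrieval step: rank team subgraph vectors against a query
    # built from the required skills (embeddings are random stand-ins).
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 16
    skill_emb = {s: rng.normal(size=dim) for s in ["ml", "nlp", "databases", "vision"]}
    team_emb = {f"team_{i}": rng.normal(size=dim) for i in range(5)}  # subgraph embeddings

    def rank_teams(required_skills, team_emb, skill_emb, top_k=3):
        """Score each team vector by cosine similarity to the mean query-skill vector."""
        q = np.mean([skill_emb[s] for s in required_skills], axis=0)
        q /= np.linalg.norm(q)
        scores = {t: float(v @ q / np.linalg.norm(v)) for t, v in team_emb.items()}
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

    print(rank_teams(["ml", "nlp"], team_emb, skill_emb))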

    The Missing Indicator Method: From Low to High Dimensions

    Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus hurt test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves on MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we evaluate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
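
    A minimal sketch of MIM with scikit-learn follows, together with a simplified stand-in for the selective variant: the indicator-selection rule used here (an absolute correlation threshold between the missingness mask and the target) is an assumption for illustration, not necessarily the paper's SMIM criterion.

    # MIM via SimpleImputer(add_indicator=True), plus a simplified selective
    # variant that keeps only indicators whose mask correlates with the target.
    import numpy as np
    from sklearn.impute import SimpleImputer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)
    X[y == 1, 0] = np.where(rng.random(y.sum()) < 0.5, np.nan, X[y == 1, 0])  # informative missingness
    X[rng.random(200) < 0.2, 3] = np.nan                                      # uninformative missingness

    # MIM: impute and append one indicator column per feature with missing values.
    X_mim = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

    # Simplified SMIM stand-in: keep only indicators whose mask correlates with y.
    mask = np.isnan(X)
    informative = [j for j in range(X.shape[1])
                   if mask[:, j].any() and abs(np.corrcoef(mask[:, j].astype(float), y)[0, 1]) > 0.15]
    X_smim = np.hstack([SimpleImputer(strategy="mean").fit_transform(X),
                        mask[:, informative].astype(float)])
    print(X_mim.shape, X_smim.shape, "informative indicator columns:", informative)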

    Efficient Sketching Algorithm for Sparse Binary Data

    Recent advancements in the WWW, IoT, social networks, e-commerce, etc. have generated a large volume of data. These datasets are mostly high-dimensional and sparse. Many fundamental subroutines of common data analytic tasks, such as clustering, classification, ranking, and nearest neighbour search, scale poorly with the dimension of the dataset. In this work, we address this problem and propose a sketching (alternatively, dimensionality reduction) algorithm, BinSketch (Binary Data Sketch), for sparse binary datasets. BinSketch preserves the binary nature of the dataset after sketching and maintains estimates for multiple similarity measures, such as Jaccard, Cosine, and Inner-Product similarities, and Hamming distance, on the same sketch. We present a theoretical analysis of our algorithm and complement it with extensive experimentation on several real-world datasets. We compare the performance of our algorithm with the state-of-the-art algorithms in terms of mean-square error and ranking. Our proposed algorithm offers comparable accuracy while achieving a significant speedup in dimensionality reduction time with respect to the other candidate algorithms. Our proposal is simple, easy to implement, and therefore can be adopted in practice.
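
    The sketch below shows one plausible reading of the compression step: hash each original dimension into one of k buckets and OR the bits, so the sketch itself stays binary. The paper's similarity estimators include corrections that are not reproduced here, so the sketch-level Jaccard is only a rough proxy.

    # Assumed bucketing-and-OR compression for sparse binary vectors; the
    # resulting sketch is itself binary.
    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 10_000, 512
    bucket = rng.integers(0, k, size=d)      # random mapping: dimension -> bucket

    def sketch(x):
        s = np.zeros(k, dtype=bool)
        s[bucket[x]] = True                  # OR together the bits landing in each bucket
        return s

    # Two sparse binary vectors with overlapping support.
    x = np.zeros(d, dtype=bool); x[rng.choice(d, 300, replace=False)] = True
    z = x.copy(); z[rng.choice(d, 100, replace=False)] = True

    sx, sz = sketch(x), sketch(z)
    jac_true = (x & z).sum() / (x | z).sum()
    jac_sketch = (sx & sz).sum() / (sx | sz).sum()   # rough proxy, no collision correction
    print(f"true Jaccard={jac_true:.3f}  sketch Jaccard (proxy)={jac_sketch:.3f}")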

    Improving financial investment by deep learning method: predicting stock returns of Tehran stock exchange companies

    Safe investment can be achieved by incorporating human experience and modern predictive science. Artificial Intelligence (AI) plays a vital role in reducing errors in this combination. This study aims at a performance analysis of Deep Learning (DL) and Machine Learning (ML) methods in modelling and predicting the stock return time series based on the return rate of previous periods and a set of exogenous variables. The data used include the weekly stock return index of 200 companies listed on the Tehran Stock Exchange from 2016 to 2021. Two DL models, Long Short-Term Memory (LSTM) and Deep Q-Network (DQN), and two ML models, Random Forest (RF) and Support Vector Machine (SVM), were selected. The results showed the superiority of DL algorithms over ML, which can indicate the existence of strong dependence patterns in these time series, as well as relatively complex nonlinear relationships with uncertainty between the determinant variables. Meanwhile, LSTM, with an R-squared of 87 percent and based on the analysis of five other evaluation metrics, showed the highest accuracy and the lowest prediction error. On the other hand, the RF model yielded the lowest prediction accuracy, with the highest error.
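
    For readers unfamiliar with the setup, the generic sketch below trains an LSTM to predict one-step-ahead returns from a window of past returns on synthetic data; it is not the study's architecture, variable set, or hyperparameters.

    # Generic one-step-ahead LSTM forecaster on synthetic weekly returns.
    import numpy as np
    import tensorflow as tf

    rng = np.random.default_rng(0)
    returns = rng.normal(0, 0.02, size=500).astype("float32")   # synthetic weekly returns

    window = 8
    X = np.stack([returns[i:i + window] for i in range(len(returns) - window)])[..., None]
    y = returns[window:]

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window, 1)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X[:400], y[:400], epochs=5, batch_size=32, verbose=0)
    print("held-out MSE:", model.evaluate(X[400:], y[400:], verbose=0))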

    Modeling Events and Interactions through Temporal Processes -- A Survey

    In real-world scenarios, many phenomena produce a collection of events that occur in continuous time. Point processes provide a natural mathematical framework for modeling these sequences of events. In this survey, we investigate probabilistic models for modeling event sequences through temporal processes. We revise the notion of event modeling and provide the mathematical foundations that characterize the literature on the topic. We define an ontology to categorize the existing approaches in terms of three families: simple, marked, and spatio-temporal point processes. For each family, we systematically review the existing approaches based on deep learning. Finally, we analyze the scenarios where the proposed techniques can be used for addressing prediction and modeling aspects.
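
    As a concrete example of the simplest family above, the sketch below simulates an unmarked Hawkes point process with an exponential excitation kernel via Ogata's thinning algorithm; the parameter values are arbitrary.

    # Simulate a Hawkes process with conditional intensity
    # lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))
    import numpy as np

    def intensity(t, events, mu, alpha, beta):
        events = np.asarray(events, dtype=float)
        return mu + alpha * np.exp(-beta * (t - events)).sum()

    def simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, t_end=50.0, seed=0):
        rng = np.random.default_rng(seed)
        events, t = [], 0.0
        while t < t_end:
            lam_bar = intensity(t, events, mu, alpha, beta)  # upper bound: intensity decays until the next event
            t += rng.exponential(1.0 / lam_bar)              # candidate inter-event time
            if t >= t_end:
                break
            if rng.random() <= intensity(t, events, mu, alpha, beta) / lam_bar:
                events.append(t)                             # accept with prob lambda(t)/lam_bar
        return np.array(events)

    ts = simulate_hawkes()
    print(f"{len(ts)} events in [0, 50]; first five: {np.round(ts[:5], 2)}")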