Generalized network-based dimensionality analysis
Network analysis opens new horizons for data analysis, as the results of ever-developing network science can be integrated into classical data analysis techniques. This paper presents a generalized version of network-based dimensionality reduction and analysis (NDA). The main contributions of this paper are as follows: (1) The proposed generalized network-based dimensionality reduction and analysis (GNDA) method handles low-dimensional, high-sample-size (LDHSS) and high-dimensional, low-sample-size (HDLSS) data at the same time; moreover, we show that, compared with existing methods, only the proposed GNDA method adequately estimates the number of latent variables (LVs). (2) The proposed GNDA admits any symmetric or nonsymmetric similarity function between indicators (i.e., variables or observations) to specify LVs. (3) The proposed prefiltering and resolution parameters provide a hierarchical version of GNDA for checking the robustness of LVs. The proposed GNDA method is compared with traditional dimensionality reduction methods on various simulated and real-world datasets.
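To make the flavour of network-based dimensionality reduction concrete, here is a minimal toy sketch, not the authors' GNDA implementation: variables whose pairwise absolute correlation exceeds a threshold are linked in a network, and each connected component is read off as one latent variable. The `corr` helper, the choice of correlation as the similarity function, and the 0.7 threshold are all illustrative assumptions.

```python
import math

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def latent_groups(columns, threshold=0.7):
    """Link variables whose |correlation| >= threshold, then return the
    connected components; each component plays the role of one latent
    variable in this toy version of the network-based approach."""
    n = len(columns)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr(columns[i], columns[j])) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, groups = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], []
        while stack:  # depth-first search over the similarity network
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v] - seen)
        groups.append(sorted(comp))
    return groups
```

On four toy variables where columns 0 and 1 are multiples of each other (as are 2 and 3), the sketch recovers two latent groups, `[[0, 1], [2, 3]]`.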
Learning recommender systems from biased user interactions
Recommender systems have been widely deployed to help users quickly find what they need from a collection of items. Predominant recommendation methods rely on supervised learning models to predict user ratings on items or the probabilities of users interacting with items. In addition, reinforcement learning models are crucial for improving long-term user engagement within recommender systems. In practice, both of these recommendation methods are commonly trained on logged user interactions and are therefore subject to biases present in those interactions. This thesis concerns complex forms of bias in real-world user behaviors and aims to mitigate their effect on reinforcement learning-based recommendation methods. The first part of the thesis consists of two research chapters, each dedicated to tackling a specific form of bias: dynamic selection bias and multifactorial bias. To mitigate the effect of each, we propose a corresponding bias propensity estimation method. By incorporating the results of these propensity estimation methods, the widely used inverse propensity scoring-based debiasing method can be extended to correct for the corresponding bias. The second part of the thesis consists of two chapters that concern the effect of bias on reinforcement learning-based recommendation methods. Its first chapter focuses on mitigating the effect of bias on simulators, which enable the learning and evaluation of reinforcement learning-based recommendation methods. Its second chapter further explores different state encoders for reinforcement learning-based recommendation methods when learning and evaluating with the proposed debiased simulator.
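The inverse propensity scoring idea that these debiasing methods extend can be stated in a few lines. This is the generic IPS estimator, not the thesis's dynamic or multifactorial propensity models; the toy ratings and `n_total` below are assumptions for illustration only.

```python
def ips_estimate(observed, n_total):
    """Inverse-propensity-scored estimate of the mean rating over ALL
    user-item pairs, from the logged (biased) subset only.

    observed: list of (rating, propensity) pairs, where propensity is
              the probability that this pair was observed/logged.
    n_total:  number of user-item pairs in the full matrix.

    Each logged rating is reweighted by 1/propensity, so rarely
    observed pairs count more, correcting the selection bias."""
    return sum(r / p for r, p in observed) / n_total
```

For example, with 4 total pairs and two logged ratings, (4, 0.8) and (2, 0.2), the IPS estimate is (4/0.8 + 2/0.2)/4 = 3.75, whereas the naive logged mean would be 3.0.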
Quantum Algorithm for Maximum Biclique Problem
Identifying a biclique with the maximum number of edges has considerable
implications for numerous fields of application, such as detecting anomalies in
e-commerce transactions, discerning protein-protein interactions in biology,
and refining the efficacy of social network recommendation algorithms. However,
the inherent NP-hardness of this problem significantly complicates the matter.
The prohibitive time complexity of existing algorithms is the primary
bottleneck constraining their application. To address this challenge, we
present the first exploration of a quantum computing approach to this problem.
Efficient quantum algorithms, a crucial future direction for handling NP-hard
problems, are presently under intensive investigation, and their potential has
already been proven in practical arenas such as cybersecurity. However, in the
field of quantum algorithms for graph databases,
little work has been done due to the challenges presented by the quantum
representation of complex graph topologies. In this study, we delve into the
intricacies of encoding a bipartite graph on a quantum computer. Given a
bipartite graph with n vertices, we propose an algorithm, qMBS, with time
complexity O^*(2^(n/2)), a quadratic speed-up over the state-of-the-art.
Furthermore, we detail two
variants tailored for the maximum vertex biclique problem and the maximum
balanced biclique problem. To corroborate the practical performance and
efficacy of our proposed algorithms, we have conducted proof-of-principle
experiments on IBM quantum simulators, whose results provide substantial
validation of our approach to the extent currently possible.
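For intuition, the problem qMBS accelerates can be stated via a classical brute-force baseline (an illustrative sketch, not qMBS itself): enumerating subsets of one side of the bipartition and taking common neighbourhoods costs O^*(2^n) classically, which the quantum algorithm quadratically improves.

```python
from itertools import combinations

def max_edge_biclique(left, adj):
    """Classical brute-force baseline for the maximum edge biclique.

    left: vertices of the left side of the bipartite graph.
    adj:  maps each left vertex to the set of its right neighbours.

    For any subset S of the left side, the best matching right side is
    the common neighbourhood of S, yielding |S| * |common(S)| edges."""
    best = (0, set(), set())
    for k in range(1, len(left) + 1):
        for S in combinations(left, k):
            common = set.intersection(*(adj[u] for u in S))
            edges = len(S) * len(common)
            if edges > best[0]:
                best = (edges, set(S), common)
    return best
```

On a K_{2,2} with one extra pendant edge, the maximum edge biclique is the K_{2,2} itself with 4 edges, even though a larger but sparser vertex set exists.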
Machine Learning and Natural Language Processing in Stock Prediction
In this thesis, we first study two ill-posed natural language processing tasks related to stock prediction, i.e., stock movement prediction and financial document-level event extraction. While implementing stock prediction and event extraction, we encountered difficulties that could be resolved by out-of-distribution detection. Consequently, we present a new approach for out-of-distribution detection, which is the third focus of this thesis. First, we systematically build a platform to study NLP-aided stock auto-trading algorithms. Our platform is characterized by three features: (1) We provide financial news for each specific stock. (2) We provide various stock factors for each stock. (3) We evaluate performance using more financially relevant metrics. Such a design allows us to develop and evaluate NLP-aided stock auto-trading algorithms in a more realistic setting. We also propose a system to automatically learn a good feature representation from various input information. The key to our algorithm is a method called Semantic Role Labelling Pooling (SRLP), which leverages Semantic Role Labelling (SRL) to create a compact representation of each news paragraph. Based on SRLP, we further incorporate other stock factors to make the stock movement prediction. In addition, we propose a self-supervised learning strategy based on SRLP to enhance the out-of-distribution generalization performance of our system. Through our experimental study, we show that the proposed method outperforms all strong baselines in both annualized rate of return and maximum drawdown in back-testing. Second, we propose a generative solution for document-level event extraction that builds on recent developments in generative event extraction, which have been successful at the sentence level but have not yet been explored for document-level extraction.
Our proposed solution includes an encoding scheme to capture entity-to-document-level information and a decoding scheme that takes into account all relevant contexts. Extensive experimental results demonstrate that our generative solution can perform as well as state-of-the-art methods that use specialized structures for document event extraction. This allows our method to serve as an easy-to-use and strong baseline for future research in this area. Finally, we propose a new unsupervised OOD detection model that separates, extracts, and learns semantic-role-labelling-guided fine-grained local feature representations from different sentence arguments and the full sentence using a margin-based contrastive loss. We then demonstrate the benefit of applying a self-supervised approach that enhances such global-local feature learning by predicting the SRL-extracted role. We conduct our experiments and achieve state-of-the-art performance on out-of-distribution benchmarks.
Thesis (Ph.D.) -- University of Adelaide, School of Computer and Mathematical Sciences, 202
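The margin-based contrastive loss mentioned for the OOD model has a standard per-pair form; the sketch below is the textbook formulation, not the thesis's exact loss, and `margin=1.0` is an assumed default.

```python
def contrastive_loss(d, same, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings at
    distance d: pull same-class pairs together (loss = d^2) and push
    different-class pairs at least `margin` apart (hinge penalty)."""
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

Same-class pairs are penalized for any separation, while different-class pairs incur no loss once their distance exceeds the margin, which is what shapes the global-local feature space described above.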
TriSig: Assessing the statistical significance of triclusters
Tensor data analysis allows researchers to uncover novel patterns and
relationships that cannot be obtained from matrix data alone. The information
inferred from the patterns provides valuable insights into disease progression,
bioproduction processes, weather fluctuations, and group dynamics. However,
spurious and redundant patterns hamper this process. This work proposes a
statistical framework to assess the probability that patterns in tensor data
deviate from null expectations, extending well-established principles for
assessing the statistical significance of patterns in matrix data. A
comprehensive discussion of binomial testing for false positive discoveries is
provided in light of variable dependencies, temporal dependencies and
misalignments, and p-value corrections under the Benjamini-Hochberg procedure.
Results gathered from the application of state-of-the-art triclustering
algorithms over distinct real-world case studies in biochemical and
biotechnological domains support the validity of the proposed statistical
framework while revealing vulnerabilities of some triclustering searches. The proposed
assessment can be incorporated into existing triclustering algorithms to
mitigate false positive/spurious discoveries and further prune the search
space, reducing their computational complexity.
Availability: The code is freely available at
https://github.com/JupitersMight/TriSig under the MIT license.
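The binomial testing and Benjamini-Hochberg steps underlying this kind of assessment can be sketched with the standard formulas. This is an illustrative sketch under simplifying independence assumptions, not the TriSig code; `p` stands for an assumed per-observation probability of matching the pattern by chance.

```python
from math import comb

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p): the chance that a pattern
    covers at least k of n observations under the null model."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    hypotheses rejected while controlling the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank  # largest rank passing its threshold
    return sorted(order[:k])
```

A tricluster covering 8 of 10 samples when each matches by chance with probability 0.5 gets a tail probability of 56/1024 ≈ 0.055; correcting a batch of such p-values with BH then prunes the discoveries that do not survive multiple testing.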
Quantum-inspired algorithm for direct multi-class classification
Over the last few decades, quantum machine learning has emerged as a groundbreaking discipline. Harnessing the peculiarities of quantum computation for machine learning tasks offers promising
advantages. Quantum-inspired machine learning has shown how relevant benefits for machine learning problems can be obtained using quantum information theory even without employing
quantum computers. In the recent past, experiments have demonstrated how to design an algorithm for binary classification inspired by the method of quantum state discrimination, which exhibits high performance compared with several standard classifiers. However, a generalization of this quantum-inspired binary classifier to a multi-class scenario remains nontrivial. Typically, a simple solution in machine learning decomposes multi-class classification into a combinatorial number of binary classifications, with a concomitant increase in computational resources. In this study, we introduce a quantum-inspired classifier that avoids this problem. Inspired by quantum state discrimination, our classifier performs multi-class classification directly without using binary classifiers. We first compared the performance of the quantum-inspired multi-class classifier with eleven standard classifiers. The comparison revealed an excellent performance of the quantum-inspired classifier. Comparing these results with those obtained by decomposition into binary classifiers shows that our method improves accuracy and reduces time complexity. Therefore, the quantum-inspired machine learning algorithm proposed in this work is an effective and efficient framework for multi-class classification. Finally, although these advantages can be attained without employing any quantum component in the hardware, we discuss how the model could be implemented on quantum hardware.
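A toy version of the state-discrimination flavour of such classifiers can be written directly: encode the sample and each class prototype as unit vectors and assign the class with the largest squared overlap. This is an illustrative sketch, not the paper's classifier; the prototype vectors and the squared-overlap rule are simplifying assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit norm, mimicking a quantum state."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def classify(sample, class_states):
    """Assign `sample` to the class whose prototype 'state' has the
    largest squared overlap |<sample|class>|^2 with it, discriminating
    all classes in one shot rather than via binary sub-classifiers."""
    s = normalize(sample)
    def overlap(label):
        c = normalize(class_states[label])
        return sum(a * b for a, b in zip(s, c)) ** 2
    return max(class_states, key=overlap)
```

Because every class is scored in a single pass, the rule is directly multi-class: no combinatorial decomposition into pairwise binary classifiers is needed.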
Quantum Multi-Model Fitting
Geometric model fitting is a challenging but fundamental computer vision
problem. Recently, quantum optimization has been shown to enhance robust
fitting for the case of a single model, while leaving the question of
multi-model fitting open. In response to this challenge, this paper shows that
the latter case can significantly benefit from quantum hardware and proposes
the first quantum approach to multi-model fitting (MMF). We formulate MMF as a
problem that can be efficiently sampled by modern adiabatic quantum computers
without the relaxation of the objective function. We also propose an iterative
and decomposed version of our method, which supports real-world-sized problems.
The experimental evaluation demonstrates promising results on a variety of
datasets. The source code is available at:
https://github.com/FarinaMatteo/qmmf.
Comment: In Computer Vision and Pattern Recognition (CVPR) 2023; Highlight.
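The kind of objective an adiabatic quantum computer samples in such formulations is a QUBO (quadratic unconstrained binary optimization) problem. The brute-force solver below stands in for the quantum sampler on toy instances; the specific Q matrix is a hypothetical example, not the paper's MMF formulation.

```python
from itertools import product

def solve_qubo(Q):
    """Exhaustively minimize x^T Q x over binary vectors x.

    Q: dict {(i, j): weight} over variable indices; diagonal entries
    (i, i) are linear terms, off-diagonal entries are couplings.
    An adiabatic quantum computer samples low-energy x directly; this
    classical loop plays that role for tiny instances."""
    n = 1 + max(max(i, j) for i, j in Q)
    best_x, best_e = None, float('inf')
    for x in product((0, 1), repeat=n):
        e = sum(w * x[i] * x[j] for (i, j), w in Q.items())
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e
```

For instance, with rewards of -1 for selecting either of two candidate models and a +2 penalty for selecting both (e.g. two redundant hypotheses), the minimum-energy assignment selects exactly one model.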
Multiway clustering of 3-order tensor via affinity matrix
We propose a new method of multiway clustering for 3-order tensors via
affinity matrix (MCAM). Based on a notion of similarity between the tensor
slices and the spread of information of each slice, our model builds an
affinity/similarity matrix on which we apply advanced clustering methods. The
combination of all clusters of the three modes delivers the desired multiway
clustering. Finally, MCAM achieves competitive results compared with other
known algorithms on synthetic and real datasets.
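The affinity-matrix step at the heart of MCAM can be sketched as follows. Cosine similarity between flattened slices is an assumed, illustrative choice; the paper's actual notion of similarity and slice "spread of information" may differ.

```python
import math

def slice_affinity(slices):
    """Affinity matrix over the slices of one mode of a 3-order tensor:
    entry (i, j) is the cosine similarity between the flattened slices.
    Clustering methods are then applied to this matrix, and combining
    the clusters of the three modes gives the multiway clustering."""
    def flat(s):
        return [x for row in s for x in row]
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return num / (nu * nv)
    vs = [flat(s) for s in slices]
    return [[cos(u, v) for v in vs] for u in vs]
```

Two slices that are scalar multiples of each other get affinity 1, so they end up in the same cluster regardless of their magnitudes.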
Uncovering the complex genetic architecture of human plasma lipidome using machine learning methods
The genetic architecture of the plasma lipidome provides insights into the regulation of lipid metabolism and related diseases. We applied an unsupervised machine learning method, PGMRA, to discover many-to-many relations between genotype and plasma lipidome (phenotype) in order to identify the genetic architecture of the plasma lipidome profiled from 1,426 Finnish individuals aged 30-45 years. PGMRA involves biclustering genotype and lipidome data independently, followed by their inter-domain integration based on hypergeometric tests of the number of shared individuals. Pathway enrichment analysis was performed on the SNP sets to identify their associated biological processes. We identified 93 statistically significant (hypergeometric p-value < 0.01) lipidome-genotype relations. Genotype biclusters in these 93 relations contained 5977 SNPs across 3164 genes. Twenty-nine of the 93 relations contained genotype biclusters with more than 50% unique SNPs and participants, thus representing the most distinct subgroups. We identified 30 significantly enriched biological processes among the SNPs involved in 21 of these 29 most distinct genotype-lipidome subgroups, through which the identified genetic variants can influence and regulate plasma lipid-related metabolism and profiles. This study identified 29 distinct genotype-lipidome subgroups in the studied Finnish population that may have distinct disease trajectories and could therefore be useful in precision medicine research.
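The hypergeometric test used for the inter-domain integration has a standard closed form. The sketch below is the textbook tail probability, not the PGMRA code; the toy sizes are assumptions for illustration.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P[X >= k] for a hypergeometric variable X: the chance that a
    genotype bicluster of K individuals and a lipidome bicluster of n
    individuals, drawn from a population of N, share at least k
    individuals by chance. Small values flag significant relations."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)
```

For example, two biclusters of 5 individuals each in a population of 10 share all 5 members with probability 1/C(10,5) = 1/252 ≈ 0.004, which would pass the p < 0.01 threshold used above.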
Onset of an outline map to get a hold on the wildwood of clustering methods
The domain of cluster analysis is a meeting point for a very rich
multidisciplinary encounter, with cluster-analytic methods being studied and
developed in discrete mathematics, numerical analysis, statistics, data
analysis and data science, and computer science (including machine learning,
data mining, and knowledge discovery), to name but a few. The other side of the
coin, however, is that the domain suffers from a major accessibility problem, as
well as from the fact that it is rife with division across many rather isolated
islands. As a way out, the present paper offers an outline map for the
clustering domain as a whole, which takes the form of an overarching conceptual
framework and a common language. With this framework we wish to contribute to
structuring the domain, to characterizing methods that have often been
developed and studied in quite different contexts, to identifying links between
them, and to introducing a frame of reference for optimally setting up cluster
analyses in data-analytic practice.
Comment: 33 pages, 4 figures.