Adaptive grid based localized learning for multidimensional data
Rapid advances in data-rich domains of science, technology, and business have amplified the computational challenges of Big Data synthesis necessary to slow the widening gap between the rate at which data is collected and the rate at which it is analyzed for knowledge. This has led to a renewed need for efficient and accurate algorithms, frameworks, and algorithmic mechanisms essential for knowledge discovery, especially in the domains of clustering, classification, dimensionality reduction, feature ranking, and feature selection. However, data mining algorithms are frequently challenged by the sparseness that results from the high dimensionality of the datasets in such domains, which is particularly detrimental to the performance of unsupervised learning algorithms.
The motivation for the research presented in this dissertation is to develop novel data mining algorithms that address the challenges of high dimensionality, sparseness, and large volumes of data by using a unique grid-based localized learning paradigm for data movement clustering and classification schema. Grid-based learning is recognized in data mining because these algorithms are inherently efficient: they reduce the search space by partitioning the feature space into effective partitions. However, these approaches have not been successfully devised for supervised learning algorithms or sparseness reduction algorithms, as they require careful estimation of grid sizes, partitions, and data movement error calculations. Grid-based localized learning algorithms can scale well with an increase in dimensionality and the size of the datasets.
To fulfill the goal of designing and developing learning algorithms that can handle data sparseness, high data dimensionality, and large data size concurrently, so as to avoid feature selection biases, a set of novel data mining algorithms using grid-based localized learning principles is developed and presented. The first algorithm is a unique computational framework for feature ranking that employs adaptive grid-based data shrinking. This method addresses the limitations of existing feature ranking methods by using a scoring function that discovers and exploits dependencies from all the features in the data. Data shrinking principles are established and metricized to capture and exploit dependencies between features. The second core algorithmic contribution is a novel supervised learning algorithm that utilizes grid-based localized learning to build a nonparametric classification model. In this classification model, the feature space is divided using uniform/non-uniform partitions, and data space subdivision is performed using a grid structure, which is then used to build a classification model using grid-based nearest-neighbor learning. The third algorithm is an unsupervised clustering algorithm that is augmented with data shrinking to enhance its clustering performance. This algorithm addresses the limitations of the existing grid-based data shrinking and clustering algorithms by using adaptive grid-based learning. Multiple experiments on a diversified set of datasets evaluate and discuss the effectiveness of dimensionality reduction, feature selection, unsupervised and supervised learning, and the scalability of the proposed methods compared to the established methods in the literature.
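The idea of classifying through a grid over the feature space can be illustrated with a minimal sketch. This is not the dissertation's algorithm: it assumes a fixed uniform cell size, a majority label per occupied cell, and a fallback to the nearest occupied cell for queries that land in an empty cell. The names `cell_of`, `fit_grid`, `predict`, and `cell_size` are hypothetical.

```python
from collections import Counter, defaultdict

def cell_of(point, cell_size):
    # Map a point to the integer index of the uniform grid cell containing it.
    return tuple(int(x // cell_size) for x in point)

def fit_grid(points, labels, cell_size):
    # Count labels per occupied cell, then keep the majority label of each cell.
    cells = defaultdict(Counter)
    for p, y in zip(points, labels):
        cells[cell_of(p, cell_size)][y] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in cells.items()}

def predict(model, point, cell_size):
    # Use the query's own cell if occupied; otherwise fall back to the
    # occupied cell whose index is closest (squared Euclidean distance).
    target = cell_of(point, cell_size)
    if target in model:
        return model[target]
    nearest = min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, target)))
    return model[nearest]

points = [(0.1, 0.2), (0.3, 0.4), (2.1, 2.2), (2.4, 2.9)]
labels = ["a", "a", "b", "b"]
model = fit_grid(points, labels, cell_size=1.0)
```

Only occupied cells are stored, so memory grows with the number of occupied cells rather than with the full grid, which matters when the feature space is sparse.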
ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space
Studying the function of proteins is important for understanding the
molecular mechanisms of life. The number of publicly available protein
structures has become extremely large. Still, the determination of
the function of a protein structure remains a difficult, costly, and
time-consuming task. The difficulties are often due to the essential role of spatial
and topological structures in the determination of protein functions in living
cells. In this paper, we propose ProtNN, a novel approach for protein function
prediction. Given an unannotated protein structure and a set of annotated
proteins, ProtNN finds the nearest neighbor annotated structures based on
protein-graph pairwise similarities. Given a query protein, ProtNN finds the
nearest neighbor reference proteins based on a graph representation model and a
pairwise similarity between vector embedding of both query and reference
protein-graphs in structural and topological spaces. ProtNN assigns to the
query protein the function with the highest number of votes across the set of k
nearest neighbor reference proteins, where k is a user-defined parameter.
Experimental evaluation demonstrates that ProtNN is able to accurately classify
several datasets in an extremely fast runtime compared to state-of-the-art
approaches. We further show that ProtNN is able to scale up to a whole PDB
dataset in a single-process mode with no parallelization, with a runtime
gain on the order of thousands of times compared to state-of-the-art
approaches.
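The voting step described in this abstract can be sketched as follows. This is a minimal illustration, not ProtNN itself: it assumes each protein graph has already been embedded as a numeric vector, and it uses cosine similarity as a stand-in because the abstract does not name the similarity measure. The names `cosine_similarity` and `knn_vote` are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Assumed similarity measure; the abstract does not specify one.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_vote(query, references, k=3):
    # references: list of (embedding, function_label) pairs for annotated proteins.
    ranked = sorted(references,
                    key=lambda ref: cosine_similarity(query, ref[0]),
                    reverse=True)
    # Majority vote over the labels of the k most similar references.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

references = [
    ([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.8, 0.2], "A"),
    ([0.0, 1.0], "B"), ([0.1, 0.9], "B"),
]
```

Here `k` plays the role of the user-defined parameter mentioned in the abstract.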
Analysis of group evolution prediction in complex networks
In a world in which acceptance and identification with social
communities are highly desired, the ability to predict evolution of groups over
time appears to be a vital but very complex research problem. Therefore, we
propose a new, adaptable, generic and multi-stage method for Group Evolution
Prediction (GEP) in complex networks, that facilitates reasoning about the
future states of the recently discovered groups. The precise GEP modularity
enabled us to carry out extensive and versatile empirical studies on many
real-world complex / social networks to analyze the impact of numerous setups
and parameters like time window type and size, group detection method,
evolution chain length, prediction models, etc. Additionally, many new
predictive features reflecting the group state at a given time have been
identified and tested. Some other research problems like enriching learning
evolution chains with external data have been analyzed as well.
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the
learner works directly in the high dimensional predictor space of all
subsequences in the training set. This is possible by employing a new
coordinate-descent algorithm coupled with bounding the magnitude of the
gradient for selecting discriminative subsequences fast. We characterize the
loss functions for which our generic learning algorithm can be applied and
present concrete implementations for logistic regression (binomial
log-likelihood loss) and support vector machines (squared hinge loss).
Application of our algorithm to protein remote homology detection and remote
fold recognition results in performance comparable to that of state-of-the-art
methods (e.g., kernel support vector machines). Unlike state-of-the-art
classifiers, the resulting classification models are simply lists of weighted
discriminative subsequences and can thus be interpreted and related to the
biological problem.
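For reference, the two losses named in this abstract have standard textbook forms when written in terms of the margin m = y * f(x), with labels y in {-1, +1}. The sketch below shows only those standard definitions; it does not reproduce the paper's coordinate-descent or gradient-bounding procedure, and the function names are hypothetical.

```python
import math

def binomial_log_likelihood_loss(margin):
    # Logistic regression loss: log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

def squared_hinge_loss(margin):
    # Squared hinge loss used for SVMs: max(0, 1 - m)^2.
    return max(0.0, 1.0 - margin) ** 2
```

Both losses are differentiable in the margin, which is what allows a gradient-based coordinate selection rule to be applied to them.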
Identifying networks with common organizational principles
Many complex systems can be represented as networks, and the problem of
network comparison is becoming increasingly relevant. There are many techniques
for network comparison, from simply comparing network summary statistics to
sophisticated but computationally costly alignment-based approaches. Yet it
remains challenging to accurately cluster networks that are of a different size
and density, but hypothesized to be structurally similar. In this paper, we
address this problem by introducing a new network comparison methodology that
is aimed at identifying common organizational principles in networks. The
methodology is simple, intuitive and applicable in a wide variety of settings
ranging from the functional classification of proteins to tracking the
evolution of a world trade network.
Comment: 26 pages, 7 figures
Filter-wrapper combination and embedded feature selection for gene expression data
Biomedical and bioinformatics datasets are generally large in terms of their number of features, and include redundant and irrelevant features, which affect the effectiveness and efficiency of classification of these datasets. Several different feature selection methods have been utilised in various fields, including bioinformatics, to reduce the number of features. This study utilised Filter-Wrapper combination and embedded (LASSO) feature selection methods on both high and low dimensional datasets before classification was performed. The results illustrate that combining filter and wrapper feature selection into a hybrid form of feature selection provides better performance than using filter only. In addition, LASSO performed better on high dimensional data.
Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection
High-dimensional feature selection arises in many areas of modern science.
For example, in genomic research we want to find the genes that can be used to
separate tissues of different classes (e.g. cancer and normal) from tens of
thousands of genes that are active (expressed) in certain tissue cells. To this
end, we wish to fit regression and classification models with a large number of
features (also called variables, predictors). In the past decade, penalized
likelihood methods for fitting regression models based on hyper-LASSO
penalization have received increasing attention in the literature. However,
fully Bayesian methods that use Markov chain Monte Carlo (MCMC) remain
underdeveloped in the literature. In this paper we introduce an MCMC
(fully Bayesian) method for learning severely multi-modal posteriors of
logistic regression models based on hyper-LASSO priors (non-convex penalties).
Our MCMC algorithm uses Hamiltonian Monte Carlo in a restricted Gibbs sampling
framework; we call our method Bayesian logistic regression with hyper-LASSO
(BLRHL) priors. We have used simulation studies and real data analysis to
demonstrate the superior performance of hyper-LASSO priors, and to investigate
the issues of choosing heaviness and scale of hyper-LASSO priors.
Comment: 33 pages. arXiv admin note: substantial text overlap with
arXiv:1308.469