DID: Distributed Incremental Block Coordinate Descent for Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) has attracted much attention in the
last decade as a dimension reduction method in many applications. Due to the
explosion in the size of data, naturally the samples are collected and stored
distributively in local computational nodes. Thus, there is a growing need to
develop algorithms in a distributed memory architecture. We propose a novel
distributed algorithm, called distributed incremental block coordinate
descent (DID), to solve the problem. By adapting the block coordinate descent
framework, closed-form update rules are obtained in DID. Moreover, DID performs
updates incrementally based on the most recently updated residual matrix. As a
result, only one communication step per iteration is required. The correctness,
efficiency, and scalability of the proposed algorithm are verified in a series
of numerical experiments.
Comment: Accepted by AAAI 201
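The paper's distributed algorithm is not reproduced here, but the serial idea it builds on — block coordinate descent over rank-one factors with closed-form nonnegative updates against the current residual matrix — can be sketched as follows (a HALS-style illustration, not the DID algorithm itself; all names are illustrative):

```python
import numpy as np

def nmf_bcd(A, k, n_iter=200, seed=0):
    """Block coordinate descent for NMF: minimize ||A - WH||_F with
    W, H >= 0, updating one rank-one block at a time against the
    most recently updated residual matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    R = A - W @ H                              # residual matrix
    eps = 1e-12                                # guards against zero columns
    for _ in range(n_iter):
        for j in range(k):
            R += np.outer(W[:, j], H[j, :])    # add the j-th block back in
            # closed-form nonnegative least-squares updates for block j
            W[:, j] = np.maximum(R @ H[j, :], 0) / (H[j, :] @ H[j, :] + eps)
            H[j, :] = np.maximum(W[:, j] @ R, 0) / (W[:, j] @ W[:, j] + eps)
            R -= np.outer(W[:, j], H[j, :])    # subtract the updated block
    return W, H
```

Because each block update touches only the residual, the residual can be kept current incrementally rather than recomputed, which is the property the distributed version exploits to get by with one communication step per iteration.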
Hybrid classification approach for imbalanced datasets
The research area of imbalanced datasets has attracted increasing attention from both academia and industry, because class imbalance poses a serious issue for many supervised learning problems. Since the majority class vastly outnumbers the minority class, a classic classifier fitted on the full training dataset tends to assign all data to the majority class, ignoring minority samples as noise. It is therefore important to select an appropriate training dataset in the preprocessing stage when classifying imbalanced data. We propose a combination of SMOTE (Synthetic Minority Over-sampling Technique) and instance selection approaches. The numerical results show that the proposed combination approach helps classifiers achieve better performance.
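The oversampling half of the proposed combination is the standard SMOTE idea: synthesize new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal self-contained sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between each picked sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # a sample is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    out = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = nn[i, rng.integers(min(k, n - 1))] # one of its neighbours
        gap = rng.random()                     # interpolation factor in [0, 1]
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Since every synthetic point is a convex combination of two real minority samples, the new data stays inside the minority class's local geometry, which is what distinguishes SMOTE from naive duplication.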
On the optimization and generalization of overparameterized implicit neural networks
Implicit neural networks have become increasingly attractive in the machine
learning community since they can achieve competitive performance while using
far fewer computational resources. Recently, a line of theoretical works
established global convergence of first-order methods such as gradient descent when the
implicit networks are over-parameterized. However, as they train all layers
together, their analyses are equivalent to only studying the evolution of the
output layer. It is unclear how the implicit layer contributes to the training.
Thus, in this paper, we restrict ourselves to only training the implicit layer.
We show that global convergence is guaranteed, even if only the implicit layer
is trained. On the other hand, the theoretical understanding of when and how
the training performance of an implicit neural network can be generalized to
unseen data is still under-explored. Although this problem has been studied in
standard feed-forward networks, the case of implicit neural networks is still
intriguing since implicit networks theoretically have infinitely many layers.
Therefore, this paper investigates the generalization error for implicit neural
networks. Specifically, we study the generalization of an implicit network
activated by the ReLU function over random initialization. We provide a
generalization bound that is initialization sensitive. As a result, we show
that gradient flow with proper random initialization can train a sufficiently
over-parameterized implicit network to achieve arbitrarily small generalization
error.
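The sense in which an implicit network "theoretically has infinitely many layers" can be made concrete: the implicit layer is defined by an equilibrium z* = relu(W z* + U x), which fixed-point iteration approaches one weight-tied layer at a time. A minimal sketch, assuming W is scaled to be a contraction so the iteration converges (all names are illustrative):

```python
import numpy as np

def implicit_layer(x, W, U, n_iter=100):
    """Approximate the equilibrium z* = relu(W z* + U x) by fixed-point
    iteration. Each step applies the same weights once, so the limit
    behaves like an infinitely deep weight-tied ReLU network."""
    z = np.zeros(W.shape[0])
    for _ in range(n_iter):
        z = np.maximum(W @ z + U @ x, 0.0)   # one weight-tied ReLU layer
    return z
```

When the spectral norm of W is below 1, the map is a contraction and the iterate converges to the unique equilibrium regardless of the starting point, which is what lets analyses treat the layer as a well-defined function of x.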
Optimal Charging Strategy for EVs with Batteries at Different States of Health
The electric vehicle (EV) is targeted as an efficient means of decreasing CO2 emissions and reducing dependence on fossil fuels. Compared with refueling an internal combustion engine (ICE) vehicle, EV charging times are usually long. However, to the best of our knowledge, current charging strategies do not consider the battery state of health (SOH). It is noted that a high charging current rate may damage battery life. Motivated by this, an optimal charging strategy is proposed in the present paper, providing several optimal charging options that take the EV battery's health into account and aim to prevent abusive battery utilization.
Extracting information from deep learning models for computational biology
The advances in deep learning technologies in this decade are providing powerful tools for many machine learning tasks. Deep learning models, in contrast to traditional linear models, can learn nonlinear functions and high-order features, which enables exceptional performance. In the field of computational biology, the rapid growth of data scale and complexity increases the demand for powerful deep learning based tools. Despite the success of deep learning methods, understanding of the reasons for their effectiveness and the interpretation of these models remain elusive.
This dissertation aims to provide several different approaches to extract information from deep models. This information could be used to address the problems of model complexity and model interpretability.
The amount of data needed to train a model depends on the complexity of the model. The cost of generating data in biology is typically large. Hence, collecting data on a scale comparable to other deep learning application areas, such as computer vision and speech understanding, is prohibitively expensive, and datasets are, consequently, small. Training models of high complexity on small datasets can result in overfitting -- the model over-explains the observed data and predicts poorly on unobserved data. The number of parameters in a model is often regarded as its complexity. However, deep learning models usually have thousands to millions of parameters, and they are still capable of yielding meaningful results and avoiding overfitting even on modest datasets. To explain this phenomenon, I propose a method to estimate the degrees of freedom -- a proper estimate of the complexity -- in deep learning models. My results show that the actual complexity of a deep learning model is much smaller than its number of parameters. Using this measure of complexity, I propose a new model selection score that obviates the need for cross-validation.
Another concern for deep learning models is the ability to extract comprehensible knowledge from the model. In linear models, a coefficient corresponding to an input variable represents that variable’s influence on the prediction. However, in a deep neural network, the relationship between input and output is much more complex. In biological and medical applications, this lack of interpretability prevents deep neural networks from being a source of new scientific knowledge. To address this problem, I provide 1) a framework to select hypotheses about perturbations that lead to the largest phenotypic change, and 2) a novel auto-encoder with guided training that selects a representation of a biological system informative of a target phenotype. Computational biology case studies are provided to illustrate the success of both methods.
Doctor of Philosophy
A Forest from the Trees: Generation through Neighborhoods
In this work, we propose to learn a generative model using both learned
features (through a latent space) and memories (through neighbors). Although
human learning makes seamless use of both learned perceptual features and
instance recall, current generative learning paradigms only make use of one of
these two components. Take, for instance, flow models, which learn a latent
space of invertible features that follow a simple distribution. Conversely,
kernel density techniques use instances to shift a simple distribution into an
aggregate mixture model. Here we propose multiple methods to enhance the latent
space of a flow model with neighborhood information. Not only does our proposed
framework represent a more human-like approach by leveraging both learned
features and memories, but it may also be viewed as a step forward in
non-parametric methods. The efficacy of our model is shown empirically with
standard image datasets. We observe compelling results and a significant
improvement over baselines.
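The kernel density side of the contrast above is easy to make concrete: each stored instance becomes the centre of a simple kernel, and sampling the aggregate mixture amounts to recalling a random memory and perturbing it. A minimal Gaussian-kernel sketch (illustrative only, not the proposed model):

```python
import numpy as np

def kde_sample(data, bandwidth, n, seed=0):
    """Kernel density sketch: place a Gaussian of width `bandwidth` at
    every stored instance, then draw n samples from the resulting
    equal-weight mixture."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(data), size=n)    # recall a random memorized instance
    noise = rng.standard_normal((n, data.shape[1]))
    return data[idx] + bandwidth * noise     # perturb it with the kernel
```

This pure instance-recall generator uses memories but no learned features; the proposal in the abstract is precisely to combine such neighborhood information with the learned latent space of a flow model.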
Spatial genetic subdivision among populations of Pampus chinensis between China and Pakistan: testing the barrier effect of the Malay Peninsula
Tissue samples from 84 Pampus chinensis individuals were collected from four geographic regions within the Indo–Pacific Ocean and analyzed using mitochondrial and nuclear DNA markers. Distinct genetic heterogeneity was found for both types of markers between the Chinese and Pakistani populations, while the diversity of this species was high in all populations. Combined with published information on marine species with similar distributions, these results suggest that the Malay Peninsula, acting as a barrier or at least a partial filter, played a role in shaping the contemporary genetic structure. This structure presumably reflects populations of P. chinensis that were genetically isolated during Pleistocene glaciations and did not subsequently experience secondary contact between the former refuge populations. Within China and within Pakistan, however, P. chinensis showed genetic continuity, indicating that the populations in each of these regions constitute a single panmictic stock with high gene flow. The spatial genetic subdivision evident among populations indicates that P. chinensis in this Indo–Pacific region should be managed as separate independent stocks to support the sustainability of this fisheries resource.
Population genetics and molecular phylogeography of Thamnaconus modestus (Tetraodontiformes, Monachanthidae) in Northwestern Pacific inferred from variation of the mtDNA control region
In order to study the genetic diversity of Thamnaconus modestus, a species of great commercial importance in Southeast Asia, the 5′-end hypervariable region (423 bp) of the mitochondrial control region was sequenced and analysed for nine geographical populations (248 individuals). The target sequence fragment contained a large number of polymorphic sites (87), yielding high levels of haplotype diversity (h = 0.97 ± 0.01) and nucleotide diversity (π = 0.0285 ± 0.0143). Genetic variation within populations (92.71%) was significantly larger than that among populations (7.29%). No significant genetic divergence was detected among the wild populations, owing to their gregarious habits, strong dispersal ability, and r-selection strategy. Significant genetic divergence was found between the cultured and wild populations, probably resulting from kin selection and the aquaculture environment. Three significant phylogenetic lineages were identified, and the variation among lineages (56.90%) was greater than that among individuals within lineages (43.10%), with a significant ΦST value (ΦST = 0.57, P = 0.0000). These results show large and significant genetic differentiation among the three lineages, indicating that they may have independent phylogenetic dynamics. Dominant shared haplotypes included individuals from each population, and the median-joining network of haplotypes presented a star-like structure. Historic demographic analysis of each lineage showed that population expansion occurred after the Pleistocene glacial period. At the last glacial maximum, T. modestus in the China seas was scattered across various refuges, including the central South China Sea and the Okinawa Trough.