60 research outputs found
Divergent estimation error in portfolio optimization and in linear regression
The problem of estimation error in portfolio optimization is discussed, in
the limit where the portfolio size N and the sample size T go to infinity such
that their ratio is fixed. The estimation error strongly depends on the ratio
N/T and diverges for a critical value of this parameter. This divergence is the
manifestation of an algorithmic phase transition; it is accompanied by a number
of critical phenomena and displays universality. As the structure of a large
number of multidimensional regression and modelling problems is very similar to
portfolio optimization, the scope of the above observations extends far beyond
finance, and covers a large number of problems in operations research, machine
learning, bioinformatics, medical science, economics, and technology.
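A minimal Monte Carlo sketch (my illustration, not taken from the paper) of the divergence: for i.i.d. returns with identity covariance, the out-of-sample risk of the sample-based minimum-variance portfolio is inflated by a factor of roughly 1/(1 - N/T), which blows up as N/T approaches 1.

```python
# Sketch: out-of-sample risk inflation of a sample-based minimum-variance
# portfolio as q = N/T -> 1. True covariance is the identity, so the truly
# optimal portfolio is equal weights with variance 1/N.
import numpy as np

rng = np.random.default_rng(0)

def risk_inflation(N, T, n_trials=50):
    """Ratio of the realized (true) risk of the estimated portfolio
    to the risk of the truly optimal portfolio."""
    ratios = []
    for _ in range(n_trials):
        R = rng.standard_normal((T, N))      # iid returns, true cov = I
        S = np.cov(R, rowvar=False)          # sample covariance estimate
        w = np.linalg.solve(S, np.ones(N))   # minimum-variance direction
        w /= w.sum()                         # budget constraint sum(w) = 1
        ratios.append((w @ w) / (1.0 / N))   # w' I w vs optimal risk 1/N
    return np.mean(ratios)

N = 500
for T in [2000, 1000, 600, 520]:
    q = N / T
    print(f"q = N/T = {q:.2f}: inflation ~ {risk_inflation(N, T):.1f} "
          f"(roughly 1/(1-q) = {1/(1-q):.1f})")
```

Running this shows the inflation factor climbing steeply as q approaches the critical value 1, where the sample covariance becomes singular.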
Regularizing Portfolio Optimization
The optimization of large portfolios displays an inherent instability to
estimation error. This poses a fundamental problem, because solutions that are
not stable under sample fluctuations may look optimal for a given sample, but
are, in effect, very far from optimal with respect to the average risk. In this
paper, we approach the problem from the point of view of statistical learning
theory. The occurrence of the instability is intimately related to over-fitting
which can be avoided using known regularization methods. We show how
regularized portfolio optimization with the expected shortfall as a risk
measure is related to support vector regression. The budget constraint dictates
a modification. We present the resulting optimization problem and discuss the
solution. The L2 norm of the weight vector is used as a regularizer, which
corresponds to a diversification "pressure". This means that diversification,
besides counteracting downward fluctuations in some assets by upward
fluctuations in others, is also crucial because it improves the stability of
the solution. The approach we provide here allows for the simultaneous
treatment of optimization and diversification in one framework that enables the
investor to trade off between the two, depending on the size of the available
data set.
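A hedged sketch of the kind of optimization described: the standard Rockafellar-Uryasev linear reformulation of expected shortfall, with an L2 penalty on the weights and the budget constraint. The use of cvxpy and the toy return sample are my assumptions; the paper's exact derivation via support vector regression is not reproduced here.

```python
# Sketch: minimize expected shortfall at level beta with an L2
# "diversification pressure" on the weights and a budget constraint.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
T, N = 500, 20
returns = rng.standard_normal((T, N)) * 0.01   # toy return sample

beta = 0.95        # ES confidence level
lam = 0.1          # regularization strength (the trade-off knob)

w = cp.Variable(N)                 # portfolio weights
alpha = cp.Variable()              # VaR-like auxiliary variable
u = cp.Variable(T, nonneg=True)    # tail losses beyond alpha

losses = -returns @ w
objective = cp.Minimize(alpha + cp.sum(u) / ((1 - beta) * T)
                        + lam * cp.sum_squares(w))
constraints = [u >= losses - alpha,    # u_t >= loss_t - alpha
               cp.sum(w) == 1]         # budget constraint
cp.Problem(objective, constraints).solve()
print("weights:", np.round(w.value, 3))
```

Larger lam pushes the solution toward equal weights (maximum diversification); lam = 0 recovers the unregularized, unstable problem.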
Replica theory for learning curves for Gaussian processes on random graphs
Statistical physics approaches can be used to derive accurate predictions for
the performance of inference methods learning from potentially noisy data, as
quantified by the learning curve defined as the average error versus number of
training examples. We analyse a challenging problem in the area of
non-parametric inference where an effectively infinite number of parameters has
to be learned, specifically Gaussian process regression. When the inputs are
vertices on a random graph and the outputs noisy function values, we show that
replica techniques can be used to obtain exact performance predictions in the
limit of large graphs. The covariance of the Gaussian process prior is defined
by a random walk kernel, the discrete analogue of squared exponential kernels
on continuous spaces. Conventionally this kernel is normalised only globally,
so that the prior variance can differ between vertices; as a more principled
alternative we consider local normalisation, where the prior variance is
uniform.
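A minimal sketch (assuming networkx; the parametrization follows one common form of the random walk kernel and may differ in detail from the paper's) contrasting the two normalisations on a Poisson random graph, where degrees, and hence globally normalised prior variances, differ between vertices.

```python
# Sketch: random walk kernel K = ((1 - 1/a) I + (1/a) D^{-1/2} A D^{-1/2})^p,
# with global normalisation (one overall scale) vs local normalisation
# (unit prior variance at every vertex).
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=200, p=0.03, seed=0)
G.remove_nodes_from(list(nx.isolates(G)))   # avoid zero-degree vertices
A = nx.to_numpy_array(G)
Dinv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))

a, p = 2.0, 10                      # kernel hyperparameters
M = (1 - 1/a) * np.eye(len(A)) + (1/a) * Dinv_sqrt @ A @ Dinv_sqrt
K = np.linalg.matrix_power(M, p)    # p-step lazy random walk kernel

K_global = K / K.diagonal().mean()  # global: prior variances still differ
d = np.sqrt(K.diagonal())
K_local = K / np.outer(d, d)        # local: uniform unit prior variance

print("global: std of prior variances =", K_global.diagonal().std().round(4))
print("local : std of prior variances =", K_local.diagonal().std().round(4))
```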
Atomic-scale representation and statistical learning of tensorial properties
This chapter discusses the importance of incorporating three-dimensional
symmetries in the context of statistical learning models geared towards the
interpolation of the tensorial properties of atomic-scale structures. We focus
on Gaussian process regression, and in particular on the construction of
structural representations, and the associated kernel functions, that are
endowed with the geometric covariance properties compatible with those of the
learning targets. We summarize the general formulation of such a
symmetry-adapted Gaussian process regression model, and how it can be
implemented based on a scheme that generalizes the popular smooth overlap of
atomic positions representation. We give examples of the performance of this
framework when learning the polarizability and the ground-state electron
density of a molecule.
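As a toy illustration (not the SOAP construction itself), the geometric covariance requirement for a vector-valued (lambda = 1) target can be checked numerically: rotating one input structure must rotate the 3x3 kernel block, K(R A, B) = R K(A, B).

```python
# Sketch: a toy covariant vector kernel between two point sets, built from
# outer products of unit vectors weighted by a rotation-invariant function
# of the radii, and a numerical check of its covariance under rotation.
import numpy as np
from scipy.spatial.transform import Rotation

def vector_kernel(A, B, sigma=1.0):
    """Toy covariant 3x3 kernel between point sets A, B (n x 3 arrays)."""
    K = np.zeros((3, 3))
    for ri in A:
        for rj in B:
            w = np.exp(-(np.linalg.norm(ri) - np.linalg.norm(rj))**2 / sigma)
            K += w * np.outer(ri, rj) / (np.linalg.norm(ri) * np.linalg.norm(rj))
    return K

rng = np.random.default_rng(2)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((5, 3))
R = Rotation.random(random_state=3).as_matrix()

lhs = vector_kernel(A @ R.T, B)     # rotate structure A by R
rhs = R @ vector_kernel(A, B)       # rotate the kernel block instead
print("covariant:", np.allclose(lhs, rhs))   # True
```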
Security analyst networks, performance and career outcomes
Using a sample of 42,376 board directors and 10,508 security analysts, we construct a social
network, mapping the connections between analysts and directors, between directors, and
between analysts. We use social capital theory and techniques developed in social network
analysis to measure the analyst’s level of connectedness and investigate whether these
connections provide any information advantage to the analyst. We find that better-connected
(better-networked) analysts make more accurate, timely, and bold forecasts. Moreover, analysts
with better network positions are less likely to lose their job, suggesting that these analysts are
more valuable to their brokerage houses. We do not find evidence that an analyst's innate
forecasting ability predicts their future network position. In contrast, past forecast optimism has a
positive association with building a better network of connections.
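For illustration, a minimal sketch (toy graph, hypothetical names, assuming networkx) of the standard connectedness measures used in social network analysis of this kind:

```python
# Sketch: build a graph of analyst-director and analyst-analyst ties and
# compute common centrality measures of an analyst's network position.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("analyst_1", "director_A"), ("analyst_1", "director_B"),
    ("analyst_2", "director_B"), ("analyst_2", "analyst_3"),
    ("director_A", "director_C"), ("analyst_3", "director_C"),
])

degree = nx.degree_centrality(G)        # breadth of direct ties
between = nx.betweenness_centrality(G)  # brokerage position
eigen = nx.eigenvector_centrality(G, max_iter=1000)  # ties to the well-tied

for node in ("analyst_1", "analyst_2", "analyst_3"):
    print(node, round(degree[node], 3), round(between[node], 3),
          round(eigen[node], 3))
```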
A graph-search framework for associating gene identifiers with documents
BACKGROUND: One step in the model organism database curation process is to find, for each article, the identifier of every gene discussed in the article. We consider a relaxation of this problem suitable for semi-automated systems, in which each article is associated with a ranked list of possible gene identifiers, and experimentally compare methods for solving this geneId ranking problem. In addition to baseline approaches based on combining named entity recognition (NER) systems with a "soft dictionary" of gene synonyms, we evaluate a graph-based method which combines the outputs of multiple NER systems, as well as other sources of information, and a learning method for reranking the output of the graph-based method. RESULTS: We show that named entity recognition (NER) systems with similar F-measure performance can have significantly different performance when used with a soft dictionary for geneId ranking. The graph-based approach can outperform any of its component NER systems, even without learning, and learning can further improve the performance of the graph-based ranking approach. CONCLUSION: The utility of a named entity recognition (NER) system for geneId finding may not be accurately predicted by its entity-level F1 performance, the most common performance measure. GeneId ranking systems are best implemented by combining several NER systems. With appropriate combination methods, usefully accurate geneId ranking systems can be constructed based on easily available resources, without resorting to problem-specific, engineered components.
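A hedged sketch of the graph-based idea (a hypothetical toy graph, not the paper's actual system or data): link an article to NER mentions, mentions to soft-dictionary synonyms, and synonyms to gene identifiers, then rank identifiers by personalized PageRank from the article node.

```python
# Sketch: rank candidate gene identifiers for an article by proximity in a
# heterogeneous graph of mentions, synonyms, and identifiers.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("article", "mention:sonic hedgehog"), ("article", "mention:shh"),
    ("mention:sonic hedgehog", "synonym:Shh"), ("mention:shh", "synonym:Shh"),
    ("mention:shh", "synonym:SHH"),            # an ambiguous mention
    ("synonym:Shh", "geneid:MGI:98297"), ("synonym:SHH", "geneid:HGNC:10848"),
])

# Personalized PageRank concentrated on the article node; undirected so
# relevance flows through shared synonyms and mentions.
scores = nx.pagerank(G.to_undirected(), personalization={"article": 1.0})
ranked = sorted((n for n in G if n.startswith("geneid:")),
                key=scores.get, reverse=True)
for g in ranked:
    print(g, round(scores[g], 4))
```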
Classification of heterogeneous microarray data by maximum entropy kernel
BACKGROUND: A large amount of microarray data is accumulating in public databases, providing diverse datasets waiting to be analyzed jointly. Powerful kernel-based methods are commonly used in microarray analyses with support vector machines (SVMs) to approach a wide range of classification problems. However, the standard family of vectorial data kernels (linear, RBF, etc.), which takes vectorial data as input, often fails in prediction if the data come from different platforms or laboratories, due to the low gene overlap or consistency between the different datasets. RESULTS: We introduce a new type of kernel, the maximum entropy (ME) kernel, into the field of SVM classification of microarray data; it has no pre-defined functional form but is generated by kernel entropy maximization with sample distance matrices as constraints. We assessed the performance of the ME kernel on three different datasets: heterogeneous kidney carcinoma, noise-introduced leukemia, and heterogeneous oral cavity carcinoma metastasis data. The results clearly show that the ME kernel is very robust for heterogeneous data containing missing values and high noise, and gives higher prediction accuracies than the standard kernels, namely linear, polynomial, and RBF. CONCLUSION: The results demonstrate the ME kernel's utility in effectively analyzing promiscuous microarray data from rare specimens, e.g., minor diseases or species, for which it is difficult to compile homogeneous data in a single laboratory.
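The entropy-maximization step that produces the ME kernel is an optimization in its own right and is not reproduced here; the sketch below only shows how any precomputed Gram matrix, such as a learned ME kernel, plugs into an SVM (assuming scikit-learn, with an RBF Gram matrix standing in for the ME kernel).

```python
# Sketch: SVM classification with a precomputed kernel matrix, the
# interface a learned maximum-entropy kernel would use.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
X_train, X_test = rng.standard_normal((80, 30)), rng.standard_normal((20, 30))
y_train = (X_train[:, 0] > 0).astype(int)   # toy labels

K_train = rbf_kernel(X_train, X_train)   # stand-in for the ME Gram matrix
K_test = rbf_kernel(X_test, X_train)     # rows: test samples, cols: train

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```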
Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data
Inference of protein functions is one of the most important aims of modern
biology. To fully exploit the large volumes of genomic data typically produced
in modern-day genomic experiments, automated computational methods for protein
function prediction are urgently needed. Established methods use sequence or
structure similarity to infer functions but those types of data do not suffice
to determine the biological context in which proteins act. Current
high-throughput biological experiments produce large amounts of data on the
interactions between proteins. Such data can be used to infer interaction
networks and to predict the biological process that the protein is involved in.
Here, we develop a probabilistic approach for protein function prediction using
network data, such as protein-protein interaction measurements. We take a
Bayesian approach to an existing Markov Random Field method by performing
simultaneous estimation of the model parameters and prediction of protein
functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to
more accurate parameter estimates and consequently to improved prediction
performance compared to the standard Markov Random Fields method. We tested our
method using a high-quality S. cerevisiae validation network
with 1622 proteins against 90 Gene Ontology terms of different levels of
abstraction. Compared to three other protein function prediction methods, our
approach shows very good prediction performance. Our method can be directly
applied to protein-protein interaction or coexpression networks, but can also be
extended to use multiple data sources. We apply our method to physical protein
interaction data from S. cerevisiae and provide novel
predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we
evaluate the predictions using the available literature.
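A hedged sketch of the core Markov Random Field computation (with fixed parameters and a toy network; the paper's adaptive MCMC jointly samples the model parameters, which is not reproduced here): Gibbs-sample binary function labels, with each protein's conditional depending on its neighbors' labels.

```python
# Sketch: Gibbs sampling of binary protein-function labels on an
# interaction network under a simple autologistic MRF.
import numpy as np
import networkx as nx

rng = np.random.default_rng(5)
G = nx.erdos_renyi_graph(100, 0.05, seed=5)   # toy interaction network
alpha, beta = -1.0, 0.8      # MRF parameters (fixed here, not sampled)

labels = rng.integers(0, 2, size=100)
known = {i: 1 for i in range(10)}    # proteins with known annotation
labels[list(known)] = 1

for sweep in range(200):
    for i in G.nodes:
        if i in known:
            continue                 # keep observed annotations fixed
        n1 = sum(labels[j] for j in G.neighbors(i))
        p1 = 1 / (1 + np.exp(-(alpha + beta * n1)))   # logistic conditional
        labels[i] = rng.random() < p1

print("predicted positives:", int(labels.sum()))
```

Averaging the sampled labels over sweeps, rather than keeping the final state, would give posterior probabilities per protein.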
Disease-Aging Network Reveals Significant Roles of Aging Genes in Connecting Genetic Diseases
One of the challenging problems in biology and medicine is exploring the underlying mechanisms of genetic diseases. Recent studies suggest that the relationship between genetic diseases and the aging process is important in understanding the molecular mechanisms of complex diseases. Although some intricate associations have been investigated for a long time, the studies are still in their early stages. In this paper, we construct a human disease-aging network to study the relationships among aging genes and genetic disease genes. Specifically, we integrate human protein-protein interactions (PPIs), disease-gene associations, aging-gene associations, and physiological system-based genetic disease classification information in a single graph-theoretic framework and find that (1) human disease genes are much closer to aging genes than expected by chance; (2) diseases can be categorized into two types according to their relationships with aging: type I diseases have their genes significantly close to aging genes, while type II diseases do not, and examining the topological characteristics of the disease-aging network from a systems perspective reveals that the genes of type I diseases occupy central positions in the PPI network while those of type II diseases do not; and (3) more importantly, when we define an asymmetric closeness based on the PPI network to describe relationships between diseases, aging genes make a significant contribution to associations among diseases, especially among type I diseases. In conclusion, this network-based study provides not only evidence for the intricate relationship between the aging process and genetic diseases, but also biological implications for probing the nature of human diseases.
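A minimal sketch of one plausible asymmetric closeness between gene sets on a network (the paper's exact definition may differ): average, over genes in set A, the inverse distance to the nearest gene in set B. Swapping A and B generally changes the value, hence the asymmetry.

```python
# Sketch: directed closeness C(A->B) between two gene sets on a graph.
import networkx as nx

def closeness(G, A, B):
    total = 0.0
    for a in A:
        dists = nx.single_source_shortest_path_length(G, a)
        d = min((dists[b] for b in B if b in dists and b != a), default=None)
        total += 1.0 / d if d else 0.0   # unreachable genes contribute 0
    return total / len(A)

G = nx.karate_club_graph()     # toy stand-in for a PPI network
A, B = {0, 1, 2}, {33, 32}
print("C(A->B) =", round(closeness(G, A, B), 3))
print("C(B->A) =", round(closeness(G, B, A), 3))
```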
Candidate gene prioritization by network analysis of differential expression using machine learning approaches
BACKGROUND: Discovering novel disease genes is still challenging for diseases for which no prior knowledge, such as known disease genes or disease-related pathways, is available. Genetic studies frequently result in large lists of candidate genes, of which only a few can be followed up for further investigation. We have recently developed a computational method for constitutional genetic disorders that identifies the most promising candidate genes by replacing prior knowledge with experimental data on differential gene expression between affected and healthy individuals. To improve the performance of our prioritization strategy, we have extended our previous work by applying different machine learning approaches that identify promising candidate genes by determining whether a gene is surrounded by highly differentially expressed genes in a functional association or protein-protein interaction network. RESULTS: We have proposed three strategies for scoring disease candidate genes that rely on network-based machine learning approaches: kernel ridge regression, heat kernel, and Arnoldi kernel approximation. For comparison, a local measure based on the expression of the direct neighbors is also computed. We benchmarked these strategies on 40 publicly available knockout experiments in mice, and performance was assessed against results obtained using a standard procedure in genetics that ranks candidate genes based solely on their differential expression levels (Simple Expression Ranking). Our results showed that our four strategies could outperform this standard procedure, and that the best results were obtained using Heat Kernel Diffusion Ranking, leading to an average ranking position of 8 out of 100 genes, an AUC value of 92.3%, and an error reduction of 52.8% relative to the standard procedure, which ranked the knockout gene on average at position 17 with an AUC value of 83.7%. CONCLUSION: In this study we could identify promising candidate genes using network-based machine learning approaches even when no knowledge is available about the disease or phenotype.
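A hedged sketch of heat kernel diffusion ranking (a generic version; the paper's implementation may differ in the kernel and normalization used): diffuse differential-expression scores through a network via the matrix exponential of the graph Laplacian, then rank candidate genes by their diffused scores.

```python
# Sketch: rank candidate genes by heat-diffused differential expression.
import numpy as np
import networkx as nx
from scipy.linalg import expm

rng = np.random.default_rng(6)
G = nx.erdos_renyi_graph(50, 0.1, seed=6)   # toy functional network
L = nx.laplacian_matrix(G).toarray().astype(float)

expression = rng.exponential(size=50)   # |differential expression| scores
t = 0.5                                 # diffusion time
diffused = expm(-t * L) @ expression    # heat kernel applied to the scores

candidates = list(range(10))            # hypothetical candidate gene ids
ranked = sorted(candidates, key=lambda g: diffused[g], reverse=True)
print("top candidates:", ranked[:5])
```

A gene surrounded by highly expressed neighbors receives a high diffused score even if its own expression is modest, which is exactly the network effect the prioritization strategy exploits.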