248 research outputs found

    A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

    BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, flexibility to incorporate different data types and adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems. RESULTS: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
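
The idea of inferring a similarity metric as a weighted combination of base similarity measures can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the ordinary least-squares fit, the function names `fit_similarity_weights` and `knn_predict`, and the use of a simple vote fraction as the confidence score are all assumptions made for the example.

```python
import numpy as np

def fit_similarity_weights(base_sims, labels):
    """Fit one weight per base similarity matrix by least squares.

    Assumption: the regression target is 1 when two training items share
    a class and 0 otherwise; the abstract does not specify the target.
    """
    n = len(labels)
    i, j = np.triu_indices(n, k=1)               # all unordered training pairs
    X = np.stack([s[i, j] for s in base_sims], axis=1)
    y = (labels[i] == labels[j]).astype(float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def knn_predict(query_sims, weights, train_labels, k=3):
    """Classify one query from its base similarities to every training item.

    query_sims holds one 1-D array per base similarity measure.
    Returns the majority label and the fraction of the k votes it received
    (a hypothetical stand-in for the paper's confidence score).
    """
    combined = sum(w * s for w, s in zip(weights, query_sims))
    nearest = np.argsort(combined)[::-1][:k]     # largest similarity first
    classes, counts = np.unique(train_labels[nearest], return_counts=True)
    best = np.argmax(counts)
    return classes[best], counts[best] / k
```

Under these assumptions, a base similarity that tracks class membership receives a large weight while an uninformative one is pushed toward zero, so the neighbor search is driven by the informative data sources.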

    L2-norm multiple kernel learning and its application to biomedical data fusion

    Background: This paper introduces the notion of optimizing different norms in the dual problem of support vector machines with multiple kernels. The selection of norms yields different extensions of multiple kernel learning (MKL) such as L∞, L1, and L2 MKL. In particular, L2 MKL is a novel method that leads to non-sparse optimal kernel coefficients, in contrast to the sparse kernel coefficients optimized by the existing L∞ MKL method. In real biomedical applications, L2 MKL may have advantages over sparse integration methods for thoroughly combining complementary information in heterogeneous data sources. Results: We provide a theoretical analysis of the relationship between the L2 optimization of kernels in the dual problem and the L2 coefficient regularization in the primal problem. Understanding the dual L2 problem grants a unified view on MKL and enables us to extend the L2 method to a wide range of machine learning problems. We implement L2 MKL for ranking and classification problems and compare its performance with the sparse L∞ and the averaging L1 MKL methods. The experiments are carried out on six real biomedical data sets and two large scale UCI data sets. L2 MKL yields better performance on most of the benchmark data sets. In particular, we propose a novel L2 MKL least squares support vector machine (LSSVM) algorithm, which is shown to be an efficient and promising classifier for processing large scale data sets. Conclusions: This paper extends the statistical framework of genomic data fusion based on MKL. Allowing non-sparse weights on the data sources is an attractive option in settings where we believe most data sources to be relevant to the problem at hand and want to avoid the "winner-takes-all" effect seen in L∞ MKL, which can be detrimental to the performance in prospective studies. The notion of optimizing L2 kernels can be straightforwardly extended to ranking, classification, regression, and clustering algorithms. To tackle the computational burden of MKL, this paper proposes several novel LSSVM based MKL algorithms. Systematic comparison on real data sets shows that LSSVM MKL has performance comparable to the conventional SVM MKL algorithms. Moreover, large scale numerical experiments indicate that when cast as semi-infinite programming, LSSVM MKL can be solved more efficiently than SVM MKL. Availability: The MATLAB code of the algorithms implemented in this paper is downloadable from http://homes.esat.kuleuven.be/~sistawww/bioi/syu/l2lssvm.html.

    Enhanced protein fold recognition through a novel data integration approach

    Background: Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources. Results: In this paper we consider the problem of integrating multiple data sources using a kernel-based approach. We propose a novel information-theoretic approach based on a Kullback-Leibler (KL) divergence between the output kernel matrix and the input kernel matrix so as to integrate heterogeneous data sources. One of the most appealing properties of this approach is that it can easily cope with multi-class classification and multi-task learning by an appropriate choice of the output kernel matrix. Based on the position of the output and input kernel matrices in the KL-divergence objective, there are two formulations, which we refer to as MKLdiv-dc and MKLdiv-conv respectively. We propose to efficiently solve MKLdiv-dc by a difference of convex (DC) programming method and MKLdiv-conv by a projected gradient descent algorithm. The effectiveness of the proposed approaches is evaluated on a benchmark dataset for protein fold recognition and a yeast protein function prediction problem. Conclusion: Our proposed methods MKLdiv-dc and MKLdiv-conv are able to achieve state-of-the-art performance on the SCOP PDB-40D benchmark dataset for protein fold prediction and provide useful insights into the relative significance of informative data sources. In particular, MKLdiv-dc further improves the fold discrimination accuracy to 75.19%, which is a more than 5% improvement over competitive Bayesian probabilistic and SVM margin-based kernel learning methods. Furthermore, we report a competitive performance on the yeast protein function prediction problem.

    “A long-term mortality analysis of subsidized firms in rural areas: an empirical study in the Portuguese Alentejo region”

    Studies have demonstrated that public policies to support private firms’ investment have the ability to promote entrepreneurship, but the sustainability of subsidized firms has not often been analysed. This paper aims to examine this dimension specifically by evaluating the long-term mortality of subsidized firms. The analysis focuses on a case study of the LEADER+ Programme in the Alentejo region of Portugal. To this end, the paper examines the activity status (active or not active) of 154 private, rural, for-profit firms in Alentejo that had received a subsidy to support investment between 2002 and 2008 under the LEADER+ Programme. The methodology is based on binary choice models in order to study the probability of these firms still being active. The explanatory variables used are the following: (1) the characteristics of entrepreneurs and managers’ strategic decisions, (2) firm profile and characteristics, (3) regional economic environment. Data assessment showed that the cumulative mortality rate of firms on 31st December 2013 was over 20%. Interpretation of the regression model revealed that the probability of firms’ survival increases with higher investment, firm age and regional business concentration, whereas the number of applications made by firms has a negative impact on their survival. So it seems that for subsidized firms the amount of investment is as important as its frequency.

    Identidad étnica y redes personales entre jóvenes de Sarajevo

    After fieldwork conducted among young people in Sarajevo, we found a relation between the discourses they sustain and the ethnic categories they use to classify people and to identify themselves. We also found that people who self-identify as "Bosnians" play an important role in the network of multiethnic relationships, in which strong ties, surprisingly, are still very important. Finally, we found a relationship between the composition of personal networks and the ethnic discourses that are maintained. We live, or believe we live, in multiple "communities", imagined or not. At the same time, the individual, rather than the place, the family or the group, stands at the center of social life and communication (cf. Wellman, 2001). In this context, brought about by the advance of flexible capitalism (Castells, 1996), we believe that to properly understand the identity or identities claimed by individuals it is necessary to study personal networks and their dynamics. From this perspective we cannot speak of "ethnic groups" or "multiethnicity" without further qualification, since these concepts rest on an essentialist and static conception of individual identity. The concept of a "multiethnic society" is used in a deceptively progressive and objective way, since what it actually legitimizes is the existence of essential differences between people, driving them apart rather than bringing them together. However, we are fully aware that essentialist discourses of ethnic identity are omnipresent, with enormous political and individual effects. Arguing that the essentialist conception of identity is inappropriate from an academic point of view does not mean that it is not used politically and therefore has no formidable consequences for social relations. It is precisely the study of personal networks that lets us adopt a perspective that does not use "folk" concepts such as "ethnic group", "people" or "nation" with analytical pretensions, but instead places them in the realm of the discourses sustained by actors (and by states and the media) and lets us contextualize them through etic concepts, that is, concepts imposed by the researchers. Only in this way can we overcome the tautologies that abound in ethnic discourses.

    A new pairwise kernel for biological network inference with support vector machines

    BACKGROUND: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). RESULTS: Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel. This new kernel for pairs can easily be used by most SVM implementations to solve problems of supervised classification and inference of pairwise relationships from heterogeneous data. We demonstrate, using several real biological networks and genomic datasets, that this approach often improves upon the state-of-the-art SVM for indirect inference with another pairwise kernel, and that the combination of both kernels always improves upon each individual kernel. CONCLUSION: The metric learning pairwise kernel is a new formulation to infer pairwise relationships with SVM, which provides state-of-the-art results for the inference of several biological networks from heterogeneous genomic data.
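
The abstract does not spell out the kernel, so the following is only a sketch of the form in which the metric learning pairwise kernel is commonly written: for pairs (a, b) and (c, d) and a base kernel K over single nodes, the value is (K(a,c) - K(a,d) - K(b,c) + K(b,d))². The "other pairwise kernel" used for indirect inference is assumed here to be the tensor product pairwise kernel; that identification, and the function names `mlpk` and `tppk`, are assumptions of this example.

```python
import numpy as np

def mlpk(K, pair1, pair2):
    """Metric learning pairwise kernel between two node pairs.

    K is a precomputed base kernel matrix over individual nodes;
    pair1 = (a, b) and pair2 = (c, d) are index pairs into K.
    """
    a, b = pair1
    c, d = pair2
    return (K[a, c] - K[a, d] - K[b, c] + K[b, d]) ** 2

def tppk(K, pair1, pair2):
    """Tensor product pairwise kernel (assumed to be the kernel the
    abstract uses for indirect inference)."""
    a, b = pair1
    c, d = pair2
    return K[a, c] * K[b, d] + K[a, d] * K[b, c]

def combined_pairwise_kernel(K, pair1, pair2):
    """Unweighted sum of both pairwise kernels; the sum of two valid
    kernels is itself a valid kernel."""
    return mlpk(K, pair1, pair2) + tppk(K, pair1, pair2)
```

With a linear base kernel, `mlpk` reduces to the squared inner product of the two difference vectors, which is why it compares how the members of each pair differ rather than how similar the pairs are as wholes. A Gram matrix built from either function over the training pairs could then be passed to any SVM implementation that accepts precomputed kernels.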

    Scoring Protein Relationships in Functional Interaction Networks Predicted from Sequence Data

    The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. Exploiting it requires computational methods that integrate these data effectively and elucidate local and genome-wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Protein sequences and domains can therefore be used to predict pairwise functional relationships between proteins, and so contribute to predicting the function of uncharacterized proteins and to ensuring that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with high confidence and coverage. We use the network for predicting functions of uncharacterized proteins.

    Disease-Aging Network Reveals Significant Roles of Aging Genes in Connecting Genetic Diseases

    One of the challenging problems in biology and medicine is exploring the underlying mechanisms of genetic diseases. Recent studies suggest that the relationship between genetic diseases and the aging process is important in understanding the molecular mechanisms of complex diseases. Although some intricate associations have been investigated for a long time, the studies are still in their early stages. In this paper, we construct a human disease-aging network to study the relationship between aging genes and genetic disease genes. Specifically, we integrate human protein-protein interactions (PPIs), disease-gene associations, aging-gene associations, and physiological system–based genetic disease classification information in a single graph-theoretic framework and find that (1) human disease genes are much closer to aging genes than expected by chance; and (2) diseases can be categorized into two types according to their relationships with aging. Type I diseases have their genes significantly close to aging genes, while type II diseases do not. Furthermore, we examine the topological characteristics of the disease-aging network from a systems perspective. Theoretical results reveal that the genes of type I diseases occupy a central position in the PPI network while those of type II diseases do not. More importantly, we define an asymmetric closeness based on the PPI network to describe relationships between diseases, and find that aging genes make a significant contribution to associations among diseases, especially among type I diseases. In conclusion, the network-based study provides not only evidence for the intricate relationship between the aging process and genetic diseases, but also biological implications for probing the nature of human diseases.

    Internationalisation speed and MNE performance: A study of the market-seeking expansion of retail MNEs

    Existing research is divided on whether firms that rapidly expand their overseas operations perform better than firms that internationalize slowly. Drawing on Penrose’s theory of the growth of the firm we argue that the positive effects of rapid internationalization give way to negative effects with increasing internationalization speed, leading to an inverted U-shaped association between internationalization speed and firm performance. We analyse the market-seeking expansion of 110 retailers over a 10-year period (2003–2012) and find support for a curvilinear relationship between internationalization speed and firm performance that is moderated by the geographic scope of firms’ internationalization path and firms’ international experience. Our study contributes to resolving conflicting views on the link between internationalization speed and firm performance
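
An inverted U-shaped association of this kind is often checked by adding a squared term to the regression: a negative coefficient on speed² with a turning point inside the observed range of speeds is consistent with the pattern. The sketch below only illustrates that arithmetic; it is not the study's model, it ignores the moderators the abstract mentions, and the function name `fit_inverted_u` is hypothetical.

```python
import numpy as np

def fit_inverted_u(speed, performance):
    """Fit performance = b0 + b1*speed + b2*speed**2 by least squares.

    Returns the coefficients, whether the curve opens downward (b2 < 0),
    and the speed at which fitted performance peaks, -b1 / (2 * b2).
    """
    b2, b1, b0 = np.polyfit(speed, performance, deg=2)  # highest power first
    return {
        "b0": float(b0),
        "b1": float(b1),
        "b2": float(b2),
        "is_inverted_u": bool(b2 < 0),
        "turning_point": float(-b1 / (2 * b2)),
    }
```

The turning point only supports an inverted U if it falls within the range of speeds actually observed; otherwise the fitted relationship is effectively monotonic over the data.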