
    Adaptive Evolutionary Clustering

    In many practical applications of clustering, the objects to be clustered evolve over time, and a clustering result is desired at each time step. In such applications, evolutionary clustering typically outperforms traditional static clustering by producing clustering results that reflect long-term trends while being robust to short-term variations. Several evolutionary clustering algorithms have recently been proposed, often by adding a temporal smoothness penalty to the cost function of a static clustering method. In this paper, we introduce a different approach to evolutionary clustering: accurately tracking the time-varying proximities between objects and then applying static clustering. We present an evolutionary clustering framework that adaptively estimates the optimal smoothing parameter using shrinkage estimation, a statistical approach that improves a naive estimate using additional information. The proposed framework can be used to extend a variety of static clustering algorithms, including hierarchical, k-means, and spectral clustering, into evolutionary clustering algorithms. Experiments on synthetic and real data sets indicate that the proposed framework outperforms static clustering and existing evolutionary clustering algorithms in many scenarios. Comment: To appear in Data Mining and Knowledge Discovery; MATLAB toolbox available at http://tbayes.eecs.umich.edu/xukevin/affec
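
    The idea of tracking time-varying proximities before clustering can be illustrated with a minimal sketch. It assumes the smoothed proximity matrix is a convex combination of the previous smoothed matrix and the current observed one, with a fixed smoothing parameter `alpha`. Both the function names and the fixed `alpha` are illustrative assumptions: the abstract states that the framework estimates the smoothing parameter adaptively by shrinkage estimation, which this sketch does not implement.

```python
def smooth_proximities(previous, current, alpha):
    """Blend the previous smoothed proximity matrix with the current one.

    alpha is a fixed smoothing parameter in [0, 1] (an assumption of this
    sketch; the paper estimates it adaptively).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def track(proximity_sequence, alpha):
    """Return the smoothed proximity matrix at every time step."""
    smoothed = proximity_sequence[0]  # nothing to smooth at the first step
    history = [smoothed]
    for current in proximity_sequence[1:]:
        smoothed = smooth_proximities(smoothed, current, alpha)
        history.append(smoothed)
    return history


# Two time steps for two objects whose similarity jumps from 0.2 to 0.8.
w1 = [[1.0, 0.2], [0.2, 1.0]]
w2 = [[1.0, 0.8], [0.8, 1.0]]
history = track([w1, w2], alpha=0.5)
```

    Any static clustering algorithm could then be run on each matrix in `history`, so the result at each time step reflects both the current data and the recent past.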

    Markov clustering versus affinity propagation for the partitioning of protein interaction graphs

    Background: Genome-scale data on protein interactions are generally represented as large networks, or graphs, where hundreds or thousands of proteins are linked to one another. Since proteins tend to function in groups, or complexes, an important goal has been to reliably identify protein complexes from these graphs. This task is commonly executed using clustering procedures, which aim at detecting densely connected regions within the interaction graphs. There exists a wealth of clustering algorithms, some of which have been applied to this problem. One of the most successful clustering procedures in this context has been the Markov Cluster algorithm (MCL), which was recently shown to outperform a number of other procedures, some of which were specifically designed for partitioning protein interaction graphs. A novel, promising clustering procedure termed Affinity Propagation (AP) was recently shown to be particularly effective, and much faster than other methods, for a variety of problems, but has not yet been applied to partition protein interaction graphs.
    Results: In this work we compare the performance of the Affinity Propagation (AP) and Markov Clustering (MCL) procedures. To this end we derive an unweighted network of protein-protein interactions from a set of 408 protein complexes from S. cerevisiae, hand-curated in-house, and evaluate the performance of the two clustering algorithms in recalling the annotated complexes. In doing so, the parameter space of each algorithm is sampled in order to select optimal values for these parameters, and the robustness of the algorithms is assessed by quantifying the level of complex recall as interactions are randomly added to or removed from the network to simulate noise. To evaluate the performance on a weighted protein interaction graph, we also apply the two algorithms to the consolidated protein interaction network of S. cerevisiae, derived from genome-scale purification experiments, and to versions of this network in which varying proportions of the links have been randomly shuffled.
    Conclusion: Our analysis shows that the MCL procedure is significantly more tolerant to noise and behaves more robustly than the AP algorithm. The advantage of MCL over AP is dramatic for unweighted protein interaction graphs, as AP displays severe convergence problems on the majority of the unweighted graph versions that we tested, whereas MCL continues to identify meaningful clusters, albeit fewer of them, as the level of noise in the graph increases. MCL thus remains the method of choice for identifying protein complexes from binary interaction networks.
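
    For readers unfamiliar with MCL, the following is a minimal pure-Python sketch of its general scheme: add self-loops, column-normalise the adjacency matrix, then alternate expansion (matrix squaring) and inflation (element-wise powering followed by renormalisation) until the matrix stops changing. The inflation value of 2.0, the tolerance, and the function names are assumptions made for illustration; the abstract does not report the settings selected in the study, and this is not the authors' implementation.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalise."""
    return normalize_columns([[v ** r for v in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add a self-loop to every node before normalising.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Each non-empty row lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if row[j] > 1e-6) for row in m}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Toy graph: two disconnected triangles, nodes 0-2 and 3-5.
triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(triangles)
```

    On this toy graph the two triangles come out as separate clusters. On noisy graphs the number and size of clusters depend strongly on the inflation parameter, which is why the study samples the parameter space.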

    Discriminative structural approaches for enzyme active-site prediction

    Background: Predicting enzyme active-sites in proteins is an important issue not only for protein sciences but also for a variety of practical applications such as drug design. Because enzyme reaction mechanisms are based on the local structures of enzyme active-sites, various template-based methods that compare local structures in proteins have been developed to date. In comparing such local sites, a simple measurement, RMSD, has been used so far.
    Results: This paper introduces new machine learning algorithms that refine the similarity/deviation for comparison of local structures. The similarity/deviation is applied to two types of applications, single template analysis and multiple template analysis. In the single template analysis, a single template is used as a query to search proteins for active sites, whereas a protein structure is examined as a query to discover the possible active-sites using a set of templates in the multiple template analysis.
    Conclusions: This paper experimentally illustrates that the machine learning algorithms effectively improve the similarity/deviation measurements for both the analyses.
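
    As a reminder of the baseline measure mentioned in the Background, the sketch below computes the root-mean-square deviation (RMSD) between two equally sized sets of 3D atom coordinates. It assumes the atoms are already paired and superposed; no optimal superposition is performed, and the function name is illustrative rather than taken from the paper.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates.

    Assumes both structures are already superposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (xa - xb) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for xa, xb in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(coords_a))


# Two atoms, each displaced by exactly one unit along z.
template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
candidate = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(template, candidate)
```

    Because every atom contributes equally, RMSD cannot emphasise the positions that matter most for catalysis, which is the kind of limitation a learned similarity/deviation measure aims to address.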

    Making Informed Choices about Microarray Data Analysis

    This article describes the typical stages in the analysis of microarray data for non-specialist researchers in systems biology and medicine. Particular attention is paid to significant data analysis issues that are commonly encountered among practitioners, some of which need wider airing. The issues addressed include experimental design, quality assessment, normalization, and summarization of multiple-probe data. This article is based on the ISMB 2008 tutorial on microarray data analysis. An expanded version of the material in this article and the slides from the tutorial can be found at http://www.people.vcu.edu/~mreimers/OGMDA/index.html

    Predicting global invasion risks: a management tool to prevent future introductions

    Predicting regions at risk from introductions of non-native species and the subsequent invasions is a fundamental aspect of horizon scanning activities that enable the development of more effective preventative actions and planning of management measures. The Asian cyprinid fish topmouth gudgeon Pseudorasbora parva has proved highly invasive across Europe since its introduction in the 1960s. In addition to direct negative impacts on native fish populations, P. parva has potential for further damage through transmission of an emergent infectious disease, known to cause mortality in other species. To quantify its invasion risk, in regions where it has yet to be introduced, we trained 900 ecological niche models and constructed an Ensemble Model predicting suitability, then integrated a proxy for introduction likelihood. This revealed high potential for P. parva to invade regions well beyond its current invasive range. These included areas in all modelled continents, with several hotspots of climatic suitability and risk of introduction. We believe that these methods are easily adapted for a variety of other invasive species and that such risk maps could be used by policy-makers and managers in hotspots to formulate increased surveillance and early-warning systems that aim to prevent introductions and subsequent invasions

    Spatial point analysis based on dengue surveys at household level in central Brazil

    Background: Dengue virus (DENV) affects non-immune human populations in tropical and subtropical regions. In the Americas, dengue has drastically increased in the last two decades, and Brazil is considered one of the most affected countries. The high frequency of asymptomatic infection makes it difficult to estimate the prevalence of infection using registered cases and to locate high-risk intra-urban areas at the population level. The goal of this spatial point analysis was to identify potential high-risk intra-urban areas of dengue, using data collected at household level from surveys.
    Methods: Two household surveys took place in the city of Goiania (~1.1 million population), Central Brazil, in 2001 and 2002. The first survey screened 1,586 asymptomatic individuals older than 5 years of age. For the second survey, 2,906 asymptomatic volunteers of the same age groups were selected by multistage sampling (census tracts; blocks; households) using available digital maps. Sera from participants were tested for dengue virus-specific IgM/IgG by EIA. A Generalized Additive Model (GAM) was used to detect the spatially varying risk over the region: initially without any fixed covariates, to depict the overall risk map, and then with the main covariates and the year included, so that the resulting maps show the risk associated with place of residence, controlled for the individual risk factors. This method has the advantage of generating smoothed risk-factor maps adjusted for socio-demographic covariates.
    Results: The prevalence of antibody against dengue infection was 37.3% (95% CI [35.5–39.1]) in 2002, a 7.8% increase over a one-year interval. The spatial variation in risk of dengue infection changed significantly between 2001 and 2002 (adjusted OR = 1.35; p < 0.001), while controlling for potential confounders using the GAM. Increasing age and low education levels were also associated with dengue infection.
    Conclusion: This study showed spatial heterogeneity in the risk areas of dengue when using a spatial multivariate approach over a short time interval. Data from household surveys indicated that low-prevalence areas in the 2001 survey shifted to high-risk areas in the following year. This mapping of dengue risks should give insights for control interventions in urban areas.
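
    The reported confidence interval for the 2002 prevalence can be checked with a short calculation. The sketch below uses a normal-approximation (Wald) interval and takes the 2,906 participants of the second survey as the denominator. Both choices are assumptions: the abstract does not say how the interval was computed, and a multistage sample would normally call for a design-based variance.

```python
import math


def wald_interval(proportion, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    standard_error = math.sqrt(proportion * (1.0 - proportion) / n)
    return proportion - z * standard_error, proportion + z * standard_error


# Prevalence 37.3% among an assumed n = 2,906 participants.
low, high = wald_interval(0.373, 2906)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")  # prints "35.5% to 39.1%"
```

    Under these assumptions the simple interval agrees with the reported 35.5–39.1 to one decimal place.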

    Massive mortality of invasive bivalves as a potential resource subsidy for the adjacent terrestrial food web

    Large-scale mortality of invasive bivalves was observed in the River Danube basin in the autumn of 2011 due to a particularly low water discharge. The aim of this study was to quantify and compare the biomass of invasive and native bivalve die-offs amongst eight different sites and to assess the potential role of invasive bivalve die-offs as a resource subsidy for the adjacent terrestrial food web. Invasive bivalve die-offs dominated half of the study sites and their highest density and biomass were recorded at the warm water effluent. The density and biomass values recorded in this study are amongst the highest values recorded for aquatic ecosystems and show that a habitat affected by heated water can sustain an extremely high biomass of invasive bivalves. These mortalities highlight invasive bivalves as a major resource subsidy, possibly contributing remarkable amounts of nutrients and energy to the adjacent terrestrial ecosystem. Given the widespread occurrence of these invasive bivalves and the predicted increase in the frequency and intensity of extreme climatic events, the ecological impacts generated by their massive mortalities should be taken into account in other geographical areas as well. The authors are grateful to David Strayer for valuable comments on a previous version of the manuscript. Special thanks to the Danube-Ipoly National Park for the help in field work. Ronaldo Sousa was supported by the project "ECOIAS" funded by the Portuguese Foundation for the Science and the Technology and COMPETE funds (contract: PTDC/AAC-AMB/116685/2010).

    The Ascomycete Verticillium longisporum Is a Hybrid and a Plant Pathogen with an Expanded Host Range

    Hybridization plays a central role in plant evolution, but its overall importance in fungi is unknown. New plant pathogens are thought to arise by hybridization between formerly separated fungal species. Evolution of hybrid plant pathogens from non-pathogenic ancestors in the fungal-like protist Phytophthora has been demonstrated, but in fungi, the most important group of plant pathogens, there are few well-characterized examples of hybrids. We focused our attention on the hybrid and plant pathogen Verticillium longisporum, the causal agent of the Verticillium wilt disease in crucifer crops. In order to address questions related to the evolutionary origin of V. longisporum, we used phylogenetic analyses of seven nuclear loci and a dataset of 203 isolates of V. longisporum, V. dahliae and related species. We confirmed that V. longisporum was diploid, and originated three different times, involving four different lineages and three different parental species. All hybrids shared a common parent, species A1, that hybridized respectively with species D1, V. dahliae lineage D2 and V. dahliae lineage D3, to give rise to three different lineages of V. longisporum. Species A1 and species D1 constituted as yet unknown taxa. Verticillium longisporum likely originated recently, as each V. longisporum lineage was genetically homogenous, and comprised species A1 alleles that were identical across lineages

    Assessing the impact of a health intervention via user-generated Internet content

    Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.