
    Medoid-based shadow value validation and visualization

    The silhouette index is a well-known internal criterion for validating clustering results. It is medoid-based, whereas a centroid-based counterpart, called the centroid-based shadow value (CSV), has also been developed. Although the two are similar, the CSV has an additional unique property: it admits a 2-dimensional neighborhood-graph visualization. This article proposes a new internal validation index that provides medoid-based validation together with the ability to visualize the results in a 2-dimensional plot. The proposed index behaves similarly to the silhouette index and produces a network visualization comparable to the neighborhood graph of the CSV. The network visualization has a multiplicative parameter (c) to adjust the visibility of its edges. Moreover, because it is medoid-based, it is a more appropriate visualization technique for any type of data than the neighborhood graph of the CSV.
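
    The silhouette index the article builds on is straightforward to compute. The following is a minimal sketch using scikit-learn; the dataset and parameters are illustrative, and k-means stands in for the medoid-based clustering the article targets (a k-medoids implementation would slot in the same way).

```python
# Silhouette validation of a clustering result (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, so the silhouette should be high.
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per point: s = (b - a) / max(a, b), where a is the mean intra-cluster
# distance and b the mean distance to the nearest other cluster.
# silhouette_score averages s over all points (range -1 to 1).
score = silhouette_score(X, labels)
print(round(score, 2))
```

    A score near 1 indicates compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.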

    Tailoring the Implementation of New Biomarkers Based on Their Added Predictive Value in Subgroups of Individuals

    Background
    The value of new biomarkers or imaging tests, when added to a prediction model, is currently evaluated using reclassification measures, such as the net reclassification improvement (NRI). However, these measures only provide an estimate of improved reclassification at the population level. We present a straightforward approach to characterize subgroups of reclassified individuals in order to tailor implementation of a new prediction model to individuals expected to benefit from it.

    Methods
    In a large Dutch population cohort (n = 21,992) we classified individuals as low (<5%) or high (≥5%) fatal cardiovascular disease risk by the Framingham risk score (FRS) and reclassified them based on the systematic coronary risk evaluation (SCORE). Subsequently, we characterized the reclassified individuals and, in case of heterogeneity, applied cluster analysis to identify and characterize subgroups. These characterizations were used to select individuals expected to benefit from implementation of SCORE.

    Results
    Reclassification after applying SCORE in all individuals resulted in an NRI of 5.00% (95% CI [-0.53%; 11.50%]) within the events, 0.06% (95% CI [-0.08%; 0.22%]) within the nonevents, and a total NRI of 0.051 (95% CI [-0.004; 0.116]). Among the correctly downward reclassified individuals, cluster analysis identified three subgroups. Using the characterizations of the typically correctly reclassified individuals, implementing SCORE only in individuals expected to benefit (n = 2,707; 12.3%) improved the NRI to 5.32% (95% CI [-0.13%; 12.06%]) within the events, 0.24% (95% CI [0.10%; 0.36%]) within the nonevents, and a total NRI of 0.055 (95% CI [0.001; 0.123]). Overall, the risk levels for individuals reclassified by tailored implementation of SCORE were more accurate.

    Discussion
    In our empirical example the presented approach successfully characterized subgroups of reclassified individuals that could be used to improve reclassification and reduce implementation burden. In particular, when newly added biomarkers or imaging tests are costly or burdensome, such a tailored implementation strategy may save resources and improve (cost-)effectiveness.
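
    The NRI decomposition used above can be made concrete with a small sketch of the standard formula (net proportion correctly reclassified among events plus net proportion among nonevents). The counts below are invented for illustration, not the cohort data from the study.

```python
# Net reclassification improvement (NRI) from reclassification counts.
# All counts are illustrative, not from the study.

def net_proportion(correct, incorrect, total):
    """Net proportion reclassified in the correct direction."""
    return (correct - incorrect) / total

# Events: moving up (low -> high risk) is the correct direction.
events_up, events_down, n_events = 30, 18, 240
# Nonevents: moving down (high -> low risk) is the correct direction.
nonevents_down, nonevents_up, n_nonevents = 62, 50, 20000

nri_events = net_proportion(events_up, events_down, n_events)
nri_nonevents = net_proportion(nonevents_down, nonevents_up, n_nonevents)
total_nri = nri_events + nri_nonevents
print(nri_events, nri_nonevents, total_nri)
```

    Note the asymmetry this exposes: the event and nonevent components are computed on very different denominators, which is why the abstract reports them separately before summing.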

    Merged consensus clustering to assess and improve class discovery with microarray data

    Background
    One of the most commonly performed tasks when analysing high-throughput gene expression data is to use clustering methods to classify the data into groups. There are a large number of methods available to perform clustering, but it is often unclear which method is best suited to the data and how to quantify the quality of the classifications produced.

    Results
    Here we describe an R package containing methods to analyse the consistency of clustering results from any number of different clustering methods using resampling statistics. These methods allow the identification of the best supported clusters and additionally rank cluster members by their fidelity within the cluster. These metrics allow us to compare the performance of different clustering algorithms under different experimental conditions and to select those that produce the most reliable clustering structures. We show the application of this method to simulated data, canonical gene expression experiments and our own novel analysis of genes involved in the specification of the peripheral nervous system in the fruitfly, Drosophila melanogaster.

    Conclusions
    Our package enables users to apply the merged consensus clustering methodology conveniently within the R programming environment, providing both analysis and graphical display functions for exploring clustering approaches. It extends the basic principle of consensus clustering by allowing the merging of results between different methods to provide an averaged clustering robustness. We show that this extension is useful in correcting for the tendency of clustering algorithms to treat outliers differently within datasets. The R package, clusterCons, is freely available at CRAN and SourceForge under the GNU public licence.
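
    The merging step described above can be sketched outside R as well. Below is a hedged Python illustration (not the clusterCons implementation): each algorithm gets a co-association consensus matrix from resampled runs, and the matrices are then averaged across algorithms. The dataset and all parameters are illustrative.

```python
# Merged consensus clustering sketch: per-algorithm consensus matrices
# from resampling, averaged across algorithms.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_matrix(X, model_factory, n_runs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))   # times i and j fell in the same cluster
    sampled = np.zeros((n, n))    # times i and j were both in the subsample
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = model_factory().fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1
    # Consensus = co-clustering frequency, conditioned on co-sampling.
    return np.divide(together, sampled, out=np.zeros_like(together),
                     where=sampled > 0)

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.4, random_state=1)
m_km = consensus_matrix(X, lambda: KMeans(n_clusters=3, n_init=5,
                                          random_state=0))
m_hc = consensus_matrix(X, lambda: AgglomerativeClustering(n_clusters=3))
merged = (m_km + m_hc) / 2  # averaged robustness across methods
```

    Entries of `merged` near 1 mark pairs that every method places together across resamples; intermediate values flag the outliers whose treatment differs between algorithms.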

    Family Name Origins and Intergenerational Demographic Change in Great Britain

    We develop bespoke geospatial routines to typify 88,457 surnames by their likely ancestral geographic origins within Great Britain. Linking this taxonomy to both historic and contemporary population data sets, we characterize regional populations using surnames that indicate whether their bearers are likely to be long-settled. We extend this approach in a case study application, in which we summarize intergenerational change in local populations across Great Britain over a period of 120 years. We also analyze much shorter-term demographic dynamics and chart likely recent migratory flows within the country. Our research demonstrates the value of family names in characterizing long-term population change at regional and local scales. We find evidence of selective migratory flows in both time periods, alongside increasing demographic diversity and distinctiveness between regions in Great Britain.

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Background
    In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms rely largely on network communication protocols connecting, and requiring, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

    Results
    We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

    Conclusions
    Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
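
    The parallelization idea can be sketched in a few lines. This is an illustrative Python sketch, not the authors' transactional-memory Java implementation: the assignment step of k-means is embarrassingly parallel, so chunks of points are assigned to their nearest centroid concurrently (NumPy releases the GIL in the heavy arithmetic, so threads suffice for the sketch).

```python
# Sketch of k-means with a parallelized assignment step.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import make_blobs

def assign_chunk(chunk, centroids):
    # Squared Euclidean distance from each point to each centroid,
    # then pick the nearest centroid per point.
    d = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def parallel_kmeans(X, k, n_iter=20, n_workers=4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    chunks = np.array_split(X, n_workers)
    for _ in range(n_iter):
        # Assignment step: each worker handles one chunk of points.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            labels = np.concatenate(list(
                pool.map(assign_chunk, chunks, [centroids] * n_workers)))
        # Update step (serial): recompute centroids as cluster means.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
labels, centroids = parallel_kmeans(X, k=3)
```

    The update step stays serial here; the authors' transactional-memory design also coordinates concurrent updates, which this sketch deliberately avoids.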

    Text mining without document context

    We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning methods (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.

    Clustering with missing data: which equivalent for Rubin's rules?

    Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way to apply clustering after MI remains unclear: how should partitions be pooled? How should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering, while, based on bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and for instability assessment are theoretically justified and extensively studied by simulation. Partition pooling improves accuracy, while measuring instability with missing data enlarges the possibilities of data analysis: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
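
    One plausible reading of the pooling rule can be sketched as follows (an assumption-laden illustration, not the authors' code): average a co-association matrix over the partitions obtained on each imputed dataset, then cut the resulting consensus with hierarchical clustering.

```python
# Pooling partitions from multiply imputed datasets via consensus.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pool_partitions(partitions, k):
    partitions = np.asarray(partitions)  # shape (n_imputations, n_points)
    n = partitions.shape[1]
    consensus = np.zeros((n, n))
    for labels in partitions:
        # 1 where two points share a cluster in this partition.
        consensus += labels[:, None] == labels[None, :]
    consensus /= len(partitions)
    # Distance = disagreement rate across imputations; cut the
    # average-linkage tree into k pooled clusters.
    dist = squareform(1.0 - consensus, checks=False)
    return fcluster(linkage(dist, method="average"), k, criterion="maxclust")

# Three imputed datasets mostly agree on two groups of points;
# label values differ across partitions, which the consensus ignores.
parts = [[0, 0, 0, 1, 1, 1],
         [1, 1, 1, 0, 0, 0],   # same partition, relabelled
         [0, 0, 1, 1, 1, 1]]   # one point assigned differently
pooled = pool_partitions(parts, k=2)
```

    Working on the co-association matrix sidesteps the label-switching problem that makes naive averaging of cluster labels across imputations meaningless.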

    Shifting Patterns of Summer Lake Color Phenology in Over 26,000 US Lakes

    Get PDF
    Lakes are often defined by seasonal cycles. The seasonal timing, or phenology, of many lake processes is changing in response to human activities. However, long-term records exist for few lakes, and extrapolating patterns observed in these lakes to entire landscapes is exceedingly difficult using the limited number of available in situ observations. Limited landscape-level observations mean we do not know how common shifts in lake phenology are at macroscales. Here, we use a new remote sensing data set, LimnoSat-US, to analyze U.S. summer lake color phenology between 1984 and 2020 across more than 26,000 lakes. Our results show that summer lake color seasonality can be generalized into five distinct phenology groups that follow well-known patterns of phytoplankton succession. The frequency with which lakes transition from one phenology group to another is tied to lake- and landscape-level characteristics. Lakes with high inflows and low variation in their seasonal surface area are generally more stable, while lakes in areas with high interannual variations in climate and catchment population density show less stability. Our results reveal previously unexamined spatiotemporal patterns in lake seasonality and demonstrate the utility of LimnoSat-US, which, with over 22 million remote sensing observations of lakes, creates novel opportunities to examine changing lake ecosystems at a national scale.