
    Medoid-based shadow value validation and visualization

    The silhouette index is a well-known internal criterion for validating clustering results. It is medoid-based, whereas a centroid-based counterpart, called the centroid-based shadow value (CSV), has also been developed. Although the two are similar, the CSV has an additional unique property: it admits a 2-dimensional neighborhood-graph visualization. This article proposes a new internal validation index that provides medoid-based validation together with the ability to visualize the results in a 2-dimensional plot. The proposed index behaves similarly to the silhouette index and produces a network visualization comparable to the neighborhood graph of the CSV. The network visualization has a multiplicative parameter (c) to adjust the visibility of its edges. Moreover, because it is medoid-based, it is a more appropriate visualization technique for any type of data than the neighborhood graph of the CSV.
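
    The silhouette index the article builds on is straightforward to compute. The following is a minimal sketch using scikit-learn; the dataset and parameters are illustrative, and k-means stands in for the medoid-based clustering the article targets (a k-medoids implementation would slot in the same way).

```python
# Silhouette validation of a clustering result (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, so the silhouette should be high.
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per point: s = (b - a) / max(a, b), where a is the mean intra-cluster
# distance and b the mean distance to the nearest other cluster.
# silhouette_score averages s over all points (range -1 to 1).
score = silhouette_score(X, labels)
print(round(score, 2))
```

    A score near 1 indicates compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned points.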

    Tailoring the Implementation of New Biomarkers Based on Their Added Predictive Value in Subgroups of Individuals

    Background
    The value of new biomarkers or imaging tests, when added to a prediction model, is currently evaluated using reclassification measures, such as the net reclassification improvement (NRI). However, these measures only provide an estimate of improved reclassification at the population level. We present a straightforward approach to characterize subgroups of reclassified individuals in order to tailor implementation of a new prediction model to individuals expected to benefit from it.

    Methods
    In a large Dutch population cohort (n = 21,992) we classified individuals as low (<5%) or high (≥5%) fatal cardiovascular disease risk by the Framingham risk score (FRS) and reclassified them based on the systematic coronary risk evaluation (SCORE). Subsequently, we characterized the reclassified individuals and, in case of heterogeneity, applied cluster analysis to identify and characterize subgroups. These characterizations were used to select individuals expected to benefit from implementation of SCORE.

    Results
    Reclassification after applying SCORE in all individuals resulted in an NRI of 5.00% (95% CI [-0.53%; 11.50%]) within the events, 0.06% (95% CI [-0.08%; 0.22%]) within the nonevents, and a total NRI of 0.051 (95% CI [-0.004; 0.116]). Among the correctly downward reclassified individuals, cluster analysis identified three subgroups. Using the characterizations of the typically correctly reclassified individuals, implementing SCORE only in individuals expected to benefit (n = 2,707; 12.3%) improved the NRI to 5.32% (95% CI [-0.13%; 12.06%]) within the events, 0.24% (95% CI [0.10%; 0.36%]) within the nonevents, and a total NRI of 0.055 (95% CI [0.001; 0.123]). Overall, the risk levels for individuals reclassified by tailored implementation of SCORE were more accurate.

    Discussion
    In our empirical example the presented approach successfully characterized subgroups of reclassified individuals that could be used to improve reclassification and reduce implementation burden. In particular, when newly added biomarkers or imaging tests are costly or burdensome, such a tailored implementation strategy may save resources and improve (cost-)effectiveness.
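
    The NRI decomposition used above can be made concrete with a small sketch of the standard formula (net proportion correctly reclassified among events plus net proportion among nonevents). The counts below are invented for illustration, not the cohort data from the study.

```python
# Net reclassification improvement (NRI) from reclassification counts.
# All counts are illustrative, not from the study.

def net_proportion(correct, incorrect, total):
    """Net proportion reclassified in the correct direction."""
    return (correct - incorrect) / total

# Events: moving up (low -> high risk) is the correct direction.
events_up, events_down, n_events = 30, 18, 240
# Nonevents: moving down (high -> low risk) is the correct direction.
nonevents_down, nonevents_up, n_nonevents = 62, 50, 20000

nri_events = net_proportion(events_up, events_down, n_events)
nri_nonevents = net_proportion(nonevents_down, nonevents_up, n_nonevents)
total_nri = nri_events + nri_nonevents
print(nri_events, nri_nonevents, total_nri)
```

    Note the asymmetry this exposes: the event and nonevent components are computed on very different denominators, which is why the abstract reports them separately before summing.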

    Merged consensus clustering to assess and improve class discovery with microarray data

    Background
    One of the most commonly performed tasks when analysing high-throughput gene expression data is to use clustering methods to classify the data into groups. There are a large number of methods available to perform clustering, but it is often unclear which method is best suited to the data and how to quantify the quality of the classifications produced.

    Results
    Here we describe an R package containing methods to analyse the consistency of clustering results from any number of different clustering methods using resampling statistics. These methods allow the identification of the best supported clusters and additionally rank cluster members by their fidelity within the cluster. These metrics allow us to compare the performance of different clustering algorithms under different experimental conditions and to select those that produce the most reliable clustering structures. We show the application of this method to simulated data, canonical gene expression experiments and our own novel analysis of genes involved in the specification of the peripheral nervous system in the fruitfly, Drosophila melanogaster.

    Conclusions
    Our package enables users to apply the merged consensus clustering methodology conveniently within the R programming environment, providing both analysis and graphical display functions for exploring clustering approaches. It extends the basic principle of consensus clustering by allowing the merging of results between different methods to provide an averaged clustering robustness. We show that this extension is useful in correcting for the tendency of clustering algorithms to treat outliers differently within datasets. The R package, clusterCons, is freely available at CRAN and SourceForge under the GNU public licence.
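
    The merging step described above can be sketched outside R as well. Below is a hedged Python illustration (not the clusterCons implementation): each algorithm gets a co-association consensus matrix from resampled runs, and the matrices are then averaged across algorithms. The dataset and all parameters are illustrative.

```python
# Merged consensus clustering sketch: per-algorithm consensus matrices
# from resampling, averaged across algorithms.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

def consensus_matrix(X, model_factory, n_runs=20, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))   # times i and j fell in the same cluster
    sampled = np.zeros((n, n))    # times i and j were both in the subsample
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = model_factory().fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1
    # Consensus = co-clustering frequency, conditioned on co-sampling.
    return np.divide(together, sampled, out=np.zeros_like(together),
                     where=sampled > 0)

X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.4, random_state=1)
m_km = consensus_matrix(X, lambda: KMeans(n_clusters=3, n_init=5,
                                          random_state=0))
m_hc = consensus_matrix(X, lambda: AgglomerativeClustering(n_clusters=3))
merged = (m_km + m_hc) / 2  # averaged robustness across methods
```

    Entries of `merged` near 1 mark pairs that every method places together across resamples; intermediate values flag the outliers whose treatment differs between algorithms.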

    Family Name Origins and Intergenerational Demographic Change in Great Britain

    We develop bespoke geospatial routines to typify 88,457 surnames by their likely ancestral geographic origins within Great Britain. Linking this taxonomy to both historic and contemporary population data sets, we characterize regional populations using surnames that indicate whether their bearers are likely to be long-settled. We extend this approach in a case study application, in which we summarize intergenerational change in local populations across Great Britain over a period of 120 years. We also analyze much shorter-term demographic dynamics and chart likely recent migratory flows within the country. Our research demonstrates the value of family names in characterizing long-term population change at regional and local scales. We find evidence of selective migratory flows in both time periods, alongside increasing demographic diversity and distinctiveness between regions in Great Britain.

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Background
    In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms rely largely on network communication protocols connecting, and requiring, multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

    Results
    We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

    Conclusions
    Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
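
    The parallelization idea can be sketched in a few lines. This is an illustrative Python sketch, not the authors' transactional-memory Java implementation: the assignment step of k-means is embarrassingly parallel, so chunks of points are assigned to their nearest centroid concurrently (NumPy releases the GIL in the heavy arithmetic, so threads suffice for the sketch).

```python
# Sketch of k-means with a parallelized assignment step.
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from sklearn.datasets import make_blobs

def assign_chunk(chunk, centroids):
    # Squared Euclidean distance from each point to each centroid,
    # then pick the nearest centroid per point.
    d = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def parallel_kmeans(X, k, n_iter=20, n_workers=4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    chunks = np.array_split(X, n_workers)
    for _ in range(n_iter):
        # Assignment step: each worker handles one chunk of points.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            labels = np.concatenate(list(
                pool.map(assign_chunk, chunks, [centroids] * n_workers)))
        # Update step (serial): recompute centroids as cluster means.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)
labels, centroids = parallel_kmeans(X, k=3)
```

    The update step stays serial here; the authors' transactional-memory design also coordinates concurrent updates, which this sketch deliberately avoids.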

    Text mining without document context

    We consider a challenging clustering task: the clustering of multi-word terms without document co-occurrence information in order to form coherent groups of topics. For this task, we developed a methodology taking as input multi-word terms and the lexico-syntactic relations between them. Our clustering algorithm, named CPCL, is implemented in the TermWatch system. We compared CPCL to other existing clustering algorithms, namely hierarchical and partitioning methods (k-means, k-medoids). This out-of-context clustering task led us to adapt the multi-word term representation for statistical methods and also to refine an existing cluster evaluation metric, the editing distance, in order to evaluate the methods. Evaluation was carried out on a list of multi-word terms from the genomic field which comes with a hand-built taxonomy. Results showed that while k-means and k-medoids obtained good scores on the editing distance, they were very sensitive to term length. CPCL, on the other hand, obtained a better cluster homogeneity score and was less sensitive to term length. CPCL also showed good adaptability for handling very large and sparse matrices.

    Clustering with missing data: which equivalent for Rubin's rules?

    Multiple imputation (MI) is a popular method for dealing with missing values. However, the suitable way to apply clustering after MI remains unclear: how should partitions be pooled? How should clustering instability be assessed when data are incomplete? By answering both questions, this paper proposes a complete view of clustering with missing data using MI. The problem of pooling partitions is addressed using consensus clustering, while, based on bootstrap theory, we explain how to assess the instability related to observed and missing data. The new rules for pooling partitions and for instability assessment are theoretically justified and extensively studied by simulation. Partition pooling improves accuracy, while measuring instability with missing data enlarges the possibilities of data analysis: it allows assessment of the dependence of the clustering on the imputation model, as well as a convenient way of choosing the number of clusters when data are incomplete, as illustrated on a real data set.
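
    One plausible reading of the pooling rule can be sketched as follows (an assumption-laden illustration, not the authors' code): average a co-association matrix over the partitions obtained on each imputed dataset, then cut the resulting consensus with hierarchical clustering.

```python
# Pooling partitions from multiply imputed datasets via consensus.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pool_partitions(partitions, k):
    partitions = np.asarray(partitions)  # shape (n_imputations, n_points)
    n = partitions.shape[1]
    consensus = np.zeros((n, n))
    for labels in partitions:
        # 1 where two points share a cluster in this partition.
        consensus += labels[:, None] == labels[None, :]
    consensus /= len(partitions)
    # Distance = disagreement rate across imputations; cut the
    # average-linkage tree into k pooled clusters.
    dist = squareform(1.0 - consensus, checks=False)
    return fcluster(linkage(dist, method="average"), k, criterion="maxclust")

# Three imputed datasets mostly agree on two groups of points;
# label values differ across partitions, which the consensus ignores.
parts = [[0, 0, 0, 1, 1, 1],
         [1, 1, 1, 0, 0, 0],   # same partition, relabelled
         [0, 0, 1, 1, 1, 1]]   # one point assigned differently
pooled = pool_partitions(parts, k=2)
```

    Working on the co-association matrix sidesteps the label-switching problem that makes naive averaging of cluster labels across imputations meaningless.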

    Shifting Patterns of Summer Lake Color Phenology in Over 26,000 US Lakes

    Get PDF
    Lakes are often defined by seasonal cycles. The seasonal timing, or phenology, of many lake processes is changing in response to human activities. However, long-term records exist for few lakes, and extrapolating patterns observed in these lakes to entire landscapes is exceedingly difficult using the limited number of available in situ observations. Limited landscape-level observations mean we do not know how common shifts in lake phenology are at macroscales. Here, we use a new remote sensing data set, LimnoSat-US, to analyze U.S. summer lake color phenology between 1984 and 2020 across more than 26,000 lakes. Our results show that summer lake color seasonality can be generalized into five distinct phenology groups that follow well-known patterns of phytoplankton succession. The frequency with which lakes transition from one phenology group to another is tied to lake- and landscape-level characteristics. Lakes with high inflows and low variation in their seasonal surface area are generally more stable, while lakes in areas with high interannual variations in climate and catchment population density show less stability. Our results reveal previously unexamined spatiotemporal patterns in lake seasonality and demonstrate the utility of LimnoSat-US, which, with over 22 million remote sensing observations of lakes, creates novel opportunities to examine changing lake ecosystems at a national scale.