
    Constrained Clustering with Minkowski Weighted K-Means

    In this paper we introduce the Constrained Minkowski Weighted K-Means. This algorithm calculates cluster-specific feature weights that can be interpreted as feature rescaling factors thanks to the use of the Minkowski distance. Here, we use a small amount of labelled data to select a Minkowski exponent and to generate clustering constraints based on pairwise must-link and cannot-link rules. We validate our new algorithm on a total of 12 datasets, most of which contain features with uniformly distributed noise, running the algorithm numerous times on each dataset. These experiments confirm the general superiority of using feature weighting in K-Means, particularly when applying the Minkowski distance. We have also found that the use of constrained clustering rules has little effect on the average proportion of correctly clustered entities; however, constrained clustering considerably improves the maximum of this proportion.
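
    The two key ingredients described above, a Minkowski distance with cluster-specific feature weights and an assignment step that respects must-link and cannot-link rules, can be sketched as follows. This is a minimal illustration rather than the authors' full algorithm (which also updates the weights and centroids and uses the labelled data to choose the exponent p); the function names and the simple "nearest feasible centroid" rule are our own.

```python
import numpy as np

def mwk_distance(x, centroid, weights, p):
    """Minkowski-weighted dissimilarity between a point and a centroid.

    The cluster-specific weights act as feature rescaling factors because
    they are raised to the same exponent p as the per-feature differences.
    """
    return np.sum((weights ** p) * np.abs(x - centroid) ** p)

def violates_constraints(i, cluster, assignments, must_link, cannot_link):
    """Check whether putting point i into `cluster` breaks any pairwise rule."""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and assignments[j] is not None and assignments[j] != cluster:
            return True
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and assignments[j] == cluster:
            return True
    return False

def assign_points(X, centroids, weights, p, must_link, cannot_link):
    """One constrained assignment pass: each point goes to the nearest
    feasible centroid under the weighted Minkowski distance; points with
    no feasible centroid are left unassigned (None)."""
    assignments = [None] * len(X)
    for i, x in enumerate(X):
        order = np.argsort([mwk_distance(x, c, w, p)
                            for c, w in zip(centroids, weights)])
        for k in order:
            if not violates_constraints(i, int(k), assignments, must_link, cannot_link):
                assignments[i] = int(k)
                break
    return assignments
```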

    Improving cluster recovery with feature rescaling factors

    The data preprocessing stage is crucial in clustering. Features may describe entities using different scales. To rectify this, one usually applies feature normalisation, aiming to rescale features so that none of them overpowers the others in the objective function of the selected clustering algorithm. In this paper, we argue that the rescaling procedure should not treat all features identically. Instead, it should favour the features that are more meaningful for clustering. With this in mind, we introduce a feature rescaling method that takes into account the within-cluster degree of relevance of each feature. Our comprehensive simulation study, carried out on real and synthetic data, with and without noise features, demonstrates that clustering methods using the proposed data normalisation strategy clearly outperform those using traditional data normalisation.
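
    As a rough illustration of this strategy, rescaling factors can be derived from an initial clustering: features that vary little within clusters, and therefore help to separate them, receive larger factors. The sketch below assumes scikit-learn's KMeans and uses inverse within-cluster dispersion as the weight; the paper's exact rescaling factors differ, so treat this only as a sketch of the idea.

```python
import numpy as np
from sklearn.cluster import KMeans

def relevance_rescaling(X, n_clusters, eps=1e-9):
    """Sketch of relevance-based feature rescaling: features with low
    within-cluster dispersion (more cluster-discriminative) receive larger
    rescaling factors than features that are equally spread everywhere."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    weights = np.zeros((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        members = X[labels == k]
        dispersion = np.sum((members - members.mean(axis=0)) ** 2, axis=0) + eps
        weights[k] = (1.0 / dispersion) / np.sum(1.0 / dispersion)
    # Aggregate the cluster-specific weights into a single factor per feature
    factors = weights.mean(axis=0)
    return X * factors, factors
```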

    Feature weighting as a tool for unsupervised feature selection

    Feature selection is a popular data pre-processing step. The aim is to remove some of the features in a data set with minimum information loss, leading to a number of benefits including faster running time and easier data visualisation. In this paper we introduce two unsupervised feature selection algorithms. These make use of a cluster-dependent feature-weighting mechanism reflecting the within-cluster degree of relevance of a given feature. Features with a relatively low weight are removed from the data set. We compare our algorithms to two other popular alternatives in a number of experiments on both synthetic and real-world data sets, with and without added noisy features. These experiments demonstrate that our algorithms clearly outperform the alternatives.
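
    To illustrate the selection step, the sketch below takes a matrix of cluster-dependent feature weights (for example, produced by a run of a feature-weighted K-Means) and keeps only the features whose best within-cluster weight is among the highest. The keep_ratio threshold and the max-over-clusters relevance score are our own simplifications, not the exact criteria of the two proposed algorithms.

```python
import numpy as np

def select_features(weights, keep_ratio=0.8):
    """Weight-based unsupervised feature selection (sketch).

    `weights` is a (clusters x features) matrix of cluster-dependent
    feature weights; a feature is kept if its largest weight over all
    clusters is among the top `keep_ratio` fraction of features."""
    relevance = weights.max(axis=0)                  # best-case relevance per feature
    n_keep = max(1, int(round(keep_ratio * weights.shape[1])))
    keep_idx = np.argsort(relevance)[::-1][:n_keep]  # highest-relevance features first
    return np.sort(keep_idx)                         # indices of retained features
```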

    Minkowski distances and standardisation for clustering and classification of high dimensional data

    There are many distance-based methods for classification and clustering, and for data with a high number of dimensions and a lower number of observations, processing distances is computationally advantageous compared to the raw data matrix. Euclidean distances are used as a default for continuous multivariate data, but there are alternatives. Here the so-called Minkowski distances, L_1 (city block), L_2 (Euclidean), L_3, L_4, and maximum distances, are combined with different schemes of standardisation of the variables before aggregating them. The boxplot transformation is proposed, a new transformation method for a single variable that standardises the majority of observations but brings outliers closer to the main bulk of the data. Distances are compared in simulations for clustering by partitioning around medoids, complete and average linkage, and classification by nearest neighbours, for data with a low number of observations but high dimensionality. The L_1 distance and the boxplot transformation show good results.
    Comment: Preliminary version; final version to be published by Springer, using Springer's svmult LaTeX style.
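
    The boxplot transformation is the novel ingredient here: a single-variable standardisation that keeps the bulk of the observations on a robust linear scale while pulling outliers towards that bulk. The sketch below is a simplified rendition of that idea, using median/IQR standardisation with logarithmic compression beyond the whiskers, and is not the exact transformation defined in the paper.

```python
import numpy as np

def boxplot_transform(x):
    """Simplified boxplot-style standardisation of one variable (sketch):
    the central bulk is standardised linearly with median and IQR, while
    observations beyond the boxplot whiskers are compressed logarithmically
    so that outliers end up closer to the main bulk of the data."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = (q3 - q1) if q3 > q1 else 1.0
    z = (x - med) / iqr                          # robust linear standardisation
    lo = (q1 - 1.5 * iqr - med) / iqr            # lower whisker on the z scale
    hi = (q3 + 1.5 * iqr - med) / iqr            # upper whisker on the z scale
    upper, lower = z > hi, z < lo
    z[upper] = hi + np.log1p(z[upper] - hi)      # compress upper outliers
    z[lower] = lo - np.log1p(lo - z[lower])      # compress lower outliers
    return z
```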

    A survey on feature weighting based K-Means algorithms

    This is a pre-copyedited, author-produced PDF of an article accepted for publication in the Journal of Classification [de Amorim, R. C., 'A survey on feature weighting based K-Means algorithms', Journal of Classification, Vol. 33(2): 210-242, August 25, 2016]. The final publication is available at Springer via http://dx.doi.org/10.1007/s00357-016-9208-4. © Classification Society of North America 2016.
    In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means-based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
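
    For orientation, many of the algorithms surveyed minimise a criterion of the following general form; the notation is ours and describes a common formulation in this literature rather than any single algorithm's exact objective:

```latex
W(S, C, w) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} w_v^{\beta} \, (y_{iv} - c_{kv})^2,
\qquad \text{subject to } \sum_{v=1}^{V} w_v = 1, \; w_v \ge 0,
```

    where S_k is the k-th cluster, c_kv the v-th component of its centroid, w_v the weight of feature v, and beta a user-chosen exponent. Cluster-specific variants replace w_v with w_kv, and Minkowski-based variants replace the squared difference with |y_iv - c_kv|^p, typically setting beta = p.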

    Extracellular Hsp72 concentration relates to a minimum endogenous criteria during acute exercise-heat exposure

    Extracellular heat-shock protein 72 (eHsp72) concentration increases during exercise-heat stress when conditions elicit physiological strain. Differences in the severity of environmental and exercise stimuli have elicited varied responses to stress. The present study aimed to quantify the extent of increased eHsp72 with increased exogenous heat stress, and to determine related endogenous markers of strain in an exercise-heat model. Ten males cycled for 90 min at 50% V̇O2peak in three conditions (TEMP, 20°C/63% RH; HOT, 30.2°C/51% RH; VHOT, 40.0°C/37% RH). Plasma was analysed for eHsp72 pre, immediately post and 24 h post each trial using a commercially available ELISA. Increased eHsp72 concentration was observed post VHOT trial (+172.4%) (P<0.05), but not in the TEMP (-1.9%) or HOT (+25.7%) conditions. eHsp72 returned to baseline values within 24 h in all conditions. Changes were observed in rectal temperature (Trec), rate of Trec increase, area under the curve for Trec of 38.5°C and 39.0°C, duration Trec ≥ 38.5°C and ≥ 39.0°C, and change in muscle temperature between VHOT and both TEMP and HOT, but not between TEMP and HOT. Each condition also elicited significantly increasing physiological strain, described by sweat rate, heart rate, physiological strain index, rating of perceived exertion and thermal sensation. Stepwise multiple regression identified rate of Trec increase and change in Trec as predictors of increased eHsp72 concentration. The data suggest that eHsp72 concentration increases once systemic temperature and sympathetic activity exceed a minimum endogenous criterion, elicited here during VHOT conditions, and is likely to be modulated by large, rapid changes in core temperature.

    Heterogeneities in Leishmania infantum infection: using skin parasite burdens to identify highly infectious dogs

    Background: The relationships between heterogeneities in host infection and infectiousness (transmission to arthropod vectors) can provide important insights for disease management. Here, we quantify heterogeneities in Leishmania infantum parasite numbers in reservoir and non-reservoir host populations, and relate these to their infectiousness during natural infection. Tissue parasite number was evaluated as a potential surrogate marker of host transmission potential. Methods: Parasite numbers were measured by qPCR in bone marrow and ear skin biopsies of 82 dogs and 34 crab-eating foxes collected during a longitudinal study in Amazon Brazil, for which previous data were available on infectiousness (by xenodiagnosis) and severity of infection. Results: Parasite numbers were highly aggregated both between samples and between individuals. In dogs, total parasite abundance and relative numbers in ear skin compared to bone marrow increased with the duration and severity of infection. Infectiousness to the sandfly vector was associated with high parasite numbers; parasite number in the skin was the best predictor of being infectious. Crab-eating foxes, which typically present asymptomatic infection and are non-infectious, had parasite numbers comparable to those of non-infectious dogs. Conclusions: Skin parasite number provides an indirect marker of infectiousness and could allow targeted control, particularly of highly infectious dogs.