
    Solving k-means on High-dimensional Big Data

    In recent years, there have been major efforts to develop data stream algorithms that process inputs in one pass over the data with low memory requirements. For the k-means problem, this has led to the development of several (1+ε)-approximations (under the assumption that k is a constant), but also to the design of algorithms that are extremely fast in practice and compute solutions of high accuracy. However, when the stream is not only long but the input points are also high-dimensional, current methods reach their limits. We propose two algorithms, piecy and piecy-mr, based on the recently developed data stream algorithm BICO, that can process high-dimensional data in one pass and output a solution of high quality. While piecy is suited for high-dimensional data with a medium number of points, piecy-mr is meant for high-dimensional data that comes in a very long stream. We provide an extensive experimental study of piecy and piecy-mr that shows the strength of the new algorithms. Comment: 23 pages, 9 figures, published at the 14th International Symposium on Experimental Algorithms - SEA 201
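
    As a rough illustration of this kind of pipeline, the sketch below processes a stream chunk by chunk, reducing each chunk's dimensionality before feeding it to an incremental clusterer. It is a minimal sketch in the spirit of piecy, not the authors' implementation: the chunk size, the projection dimension m, and the use of scikit-learn's IncrementalPCA and MiniBatchKMeans in place of the paper's SVD step and BICO are all assumptions.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans

def stream_kmeans(chunks, k, m=20):
    """One pass over an iterable of (chunk_size, d) arrays."""
    ipca = IncrementalPCA(n_components=m)   # one-pass dimensionality reduction
    km = MiniBatchKMeans(n_clusters=k)      # one-pass clustering in reduced space
    for X in chunks:
        ipca.partial_fit(X)                 # update the projection with this chunk
        km.partial_fit(ipca.transform(X))   # cluster the projected chunk
        # note: the projection drifts slightly as chunks arrive; a real pipeline
        # would fix a basis per piece, as piecy does with its SVD step
    return km

# usage: a stream of 100 chunks of 1,000 points in 500 dimensions
rng = np.random.default_rng(0)
model = stream_kmeans((rng.normal(size=(1000, 500)) for _ in range(100)), k=5)
```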

    Privacy Preserving Multi-Server k-means Computation over Horizontally Partitioned Data

    k-means clustering is one of the most popular clustering algorithms in data mining. Recently, much research has concentrated on the setting where the dataset is divided among multiple parties, or where the dataset is too large to be handled by the data owner. In the latter case, servers are usually hired to perform the clustering: the data owner divides the dataset among the servers, who together perform k-means and return the cluster labels to the owner. The major challenge in this method is to prevent the servers from gaining substantial information about the owner's actual data. Several algorithms designed in the past provide cryptographic solutions to privacy-preserving k-means. We provide a new method to perform k-means over a large dataset using multiple servers. Our technique avoids heavy cryptographic computations; instead, we use a simple randomization technique to preserve the privacy of the data. The k-means result has exactly the same efficiency and accuracy as k-means computed over the original dataset without any randomization. We argue that our algorithm is secure against an honest-but-curious, passive adversary. Comment: 19 pages, 4 tables. International Conference on Information Systems Security. Springer, Cham, 201
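
    The key invariant claimed above, that randomization changes nothing about the k-means result, can be illustrated with a distance-preserving mask. The sketch below applies a random rotation and translation before splitting the rows across servers; this particular randomization is a stand-in chosen for illustration, not necessarily the paper's scheme.

```python
# The owner masks X with a random rigid transform (rotation Q + translation t),
# which preserves all pairwise Euclidean distances, then splits rows across
# servers. This randomization is an illustrative stand-in, not the paper's scheme.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 8))                 # the owner's private data

Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))     # random orthogonal matrix
t = rng.normal(size=8)
X_masked = X @ Q + t                             # distance-preserving mask

parts = np.array_split(X_masked, 3)              # horizontal split across 3 servers
# (each server would cluster its share; here we only check the invariant)
lab0 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
lab1 = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_masked)
# identical distances => identical k-means behaviour, so the partitions match
# (up to floating-point effects and cluster relabelling)
```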

    Algorithms for Stable Matching and Clustering in a Grid

    We study a discrete version of a geometric stable marriage problem originally proposed in a continuous setting by Hoffman, Holroyd, and Peres, in which points in the plane are stably matched to cluster centers, as prioritized by their distances, so that each cluster center is apportioned a set of points of equal area. We show that, for a discretization of the problem to an n×n grid of pixels with k centers, the problem can be solved in time O(n^2 log^5 n), and we experiment with two slower but more practical algorithms and a hybrid method that switches from one of these algorithms to the other to gain greater efficiency than either algorithm alone. We also show how to combine geometric stable matchings with a k-means clustering algorithm, so as to provide a geometric political-districting algorithm that views distance in economic terms, and we experiment with weighted versions of stable k-means in order to improve the connectivity of the resulting clusters. Comment: 23 pages, 12 figures. To appear (without the appendices) at the 18th International Workshop on Combinatorial Image Analysis, June 19-21, 2017, Plovdiv, Bulgaria
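
    For intuition, a simple (much slower) baseline computes the distance-stable assignment directly: scan all (pixel, center) pairs in increasing distance order and assign greedily, which is stable because distances give both sides the same priorities. The sketch below is this baseline, not the paper's O(n^2 log^5 n) algorithm, and it assumes k divides n^2 so the quotas are exact.

```python
# Greedy distance-stable assignment of an n-by-n pixel grid to k centers with
# equal quotas (assumes k divides n*n). Pairs are scanned nearest-first; a
# pixel is assigned to a center if both still have room, which yields the
# stable matching because both sides rank each other by the same distances.
import numpy as np

def stable_grid_matching(n, centers):
    k = len(centers)
    quota = (n * n) // k                               # equal-area apportionment
    ys, xs = np.mgrid[0:n, 0:n]
    pix = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    d = np.linalg.norm(pix[:, None, :] - np.asarray(centers, float)[None, :, :], axis=2)
    label = np.full(n * n, -1)
    remaining = np.full(k, quota)
    for idx in np.argsort(d, axis=None):               # all pairs, nearest first
        p, c = divmod(idx, k)                          # d is row-major (n*n, k)
        if label[p] == -1 and remaining[c] > 0:
            label[p] = c
            remaining[c] -= 1
    return label.reshape(n, n)

districts = stable_grid_matching(8, [(1.0, 1.0), (6.0, 2.0), (3.0, 6.0), (6.0, 6.0)])
```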

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points, and are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly. Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
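
    As one concrete example of this class of initializers, the sketch below is a simplified rendition of a hierarchical, PCA-based splitting scheme in the spirit of Su and Dy's methods: deterministically split the cluster with the largest sum of squared errors along its principal axis until k clusters remain, then seed k-means with their means. Details such as tie handling are simplifications, not the authors' exact procedure.

```python
# Simplified PCA-splitting initializer: deterministic and order-invariant
# (the seed *set* does not depend on the row order of X). Not the authors'
# exact method; ties and degenerate clusters are ignored for brevity.
import numpy as np

def pca_split_init(X, k):
    clusters = [np.asarray(X, float)]
    while len(clusters) < k:
        sse = [((C - C.mean(axis=0)) ** 2).sum() for C in clusters]
        C = clusters.pop(int(np.argmax(sse)))      # split the highest-SSE cluster
        Cc = C - C.mean(axis=0)
        _, _, vt = np.linalg.svd(Cc, full_matrices=False)
        proj = Cc @ vt[0]                          # coordinates along principal axis
        clusters += [C[proj <= 0], C[proj > 0]]    # cut at the projected mean
    return np.array([C.mean(axis=0) for C in clusters])

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
seeds = pca_split_init(X, k=6)   # pass as init= to a k-means implementation
```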

    Use of Oral Cholera Vaccines in an Outbreak in Vietnam: A Case Control Study

    Simple measures such as adequate sanitation and clean water stop the spread of cholera; however, in areas where these are not available, cholera spreads quickly and may lead to death within a few hours if treatment is not initiated immediately. Life-saving rehydration therapy is the mainstay of cholera control; however, the rapidity of the disease and the limited access to appropriate healthcare units in far-flung areas together result in an unacceptable number of deaths. The WHO has recommended the use of oral cholera vaccines as a preventive measure against cholera outbreaks since 2001, but this recommendation was recently updated so that vaccine use may also be considered once a cholera outbreak has begun. The findings from this study suggest that reactive use of killed oral cholera vaccines provides protection against the disease and may be a potential tool in times of outbreaks. Further studies must be conducted to confirm these findings.

    Identifying Prototypical Components in Behaviour Using Clustering Algorithms

    Quantitative analysis of animal behaviour is a prerequisite for understanding the task-solving strategies of animals and the underlying control mechanisms. The identification of repeatedly occurring behavioural components is thereby a key element of a structured quantitative description. However, the complexity of most behaviours makes the identification of such behavioural components a challenging problem. We propose an automatic and objective approach for determining and evaluating prototypical behavioural components. Behavioural prototypes are identified using clustering algorithms and then evaluated with respect to their ability to represent the whole behavioural data set. The prototypes allow for a meaningful segmentation of behavioural sequences. We applied our clustering approach to identify prototypical movements of the head of blowflies during cruising flight. The results confirm the previously established saccadic gaze strategy: the set of prototypes divides into predominantly translational and predominantly rotational movements. The prototypes also reveal additional details about the saccadic and intersaccadic flight sections that could not previously be unravelled. The successful application of the proposed approach to behavioural data shows its ability to automatically identify prototypical behavioural components within a large and noisy database and to evaluate them with respect to their quality and stability. Hence, this approach might be applied to a broad range of behavioural and neural data obtained from different animals and in different contexts.
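
    In outline, the recipe reduces to: cluster per-timestep movement features into prototypes, then score the prototype set for representativeness and stability. The sketch below is a generic rendition of that recipe with invented feature dimensions, not the authors' pipeline.

```python
# Generic prototype extraction and evaluation: k-means prototypes scored by
# mean quantization error (representativeness) and by center drift under a
# bootstrap refit (stability). The feature layout is invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
features = rng.normal(size=(5000, 6))    # e.g., 3 rotational + 3 translational velocities

def prototype_quality(F, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(F)
    quant_err = km.inertia_ / len(F)                 # lower = more representative
    boot = F[rng.integers(0, len(F), len(F))]        # bootstrap resample
    km2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(boot)
    drift = np.linalg.norm(km.cluster_centers_[:, None] - km2.cluster_centers_[None],
                           axis=2).min(axis=1).mean()  # lower = more stable
    return quant_err, drift

for k in (4, 8, 12):
    print(k, prototype_quality(features, k))
```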

    Gray zones around diffuse large B cell lymphoma. Conclusions based on the workshop of the XIV meeting of the European Association for Hematopathology and the Society of Hematopathology in Bordeaux, France

    The term “gray-zone” lymphoma has been used to denote a group of lymphomas with overlapping histological, biological, and clinical features between various types of lymphomas. It has been used in the context of Hodgkin lymphomas (HL) and non-Hodgkin lymphomas (NHL), covering cases with overlapping features between classical HL (CHL) and primary mediastinal large B cell lymphoma, between nodular lymphocyte predominant Hodgkin lymphoma and T-cell/histiocyte-rich large B cell lymphoma, between CHL and Epstein–Barr-virus-positive lymphoproliferative disorders, and peripheral T cell lymphomas simulating CHL. A second group of gray-zone lymphomas includes B cell NHL with features intermediate between diffuse large B cell lymphoma and classical Burkitt lymphoma. In order to review controversial issues in gray-zone lymphomas, a joint Workshop of the European Association for Hematopathology and the Society for Hematopathology was held in Bordeaux, France, in September 2008. The panel members reviewed and discussed 145 submitted cases and reached consensus diagnoses. This Workshop summary focuses on the most controversial aspects of gray-zone lymphomas and describes the panel’s proposals regarding diagnostic criteria, terminology, and new prognostic and diagnostic parameters.

    In quest of a systematic framework for unifying and defining nanoscience

    This article proposes a systematic framework for unifying and defining nanoscience based on historic first principles and the step logic that led to a “central paradigm” (i.e., unifying framework) for traditional elemental/small-molecule chemistry. As such, a Nanomaterials classification roadmap is proposed, which divides all nanomatter into Category I: discrete, well-defined and Category II: statistical, undefined nanoparticles. We consider only Category I, well-defined nanoparticles that are >90% monodisperse as a function of Critical Nanoscale Design Parameters (CNDPs) defined according to: (a) size, (b) shape, (c) surface chemistry, (d) flexibility, and (e) elemental composition. Classified as either hard (H) (i.e., inorganic-based) or soft (S) (i.e., organic-based), these nanoparticles were found to manifest pervasive atom mimicry features that included: (1) a dominance of zero-dimensional (0D) core–shell nanoarchitectures, (2) the ability to self-assemble or chemically bond as discrete, quantized nanounits, and (3) well-defined nanoscale valencies and stoichiometries reminiscent of atom-based elements. These discrete nanoparticle categories are referred to as hard or soft particle nanoelements. Many examples describing chemical bonding/assembly of these nanoelements have been reported in the literature. We refer to these hard:hard (H-n:H-n), soft:soft (S-n:S-n), or hard:soft (H-n:S-n) nanoelement combinations as nanocompounds. Due to their quantized features, many nanoelement and nanocompound categories are reported to exhibit well-defined nanoperiodic property patterns. These periodic property patterns depend on their quantized nanofeatures (CNDPs) and dramatically influence intrinsic physicochemical properties (i.e., melting points, reactivity/self-assembly, sterics, and nanoencapsulation), as well as important functional/performance properties (i.e., magnetic, photonic, electronic, and toxicologic properties). We propose this perspective as a modest first step toward more clearly defining synthetic nanochemistry and providing a systematic framework for unifying nanoscience. With further progress, one should anticipate the evolution of future nanoperiodic table(s) suitable for predicting important risk/benefit boundaries in the field of nanoscience.

    Typhoid Fever and Its Association with Environmental Factors in the Dhaka Metropolitan Area of Bangladesh: A Spatial and Time-Series Approach

    Typhoid fever is a major cause of death worldwide, with a major part of the disease burden in developing regions such as the Indian subcontinent. Bangladesh is part of this highly endemic region, yet little is known about the spatial and temporal distribution of the disease at a regional scale. This research used a Geographic Information System to explore, spatially and temporally, the prevalence of typhoid in the Dhaka Metropolitan Area (DMA) of Bangladesh over the period 2005-2009. This paper provides the first study of the spatio-temporal epidemiology of typhoid for this region. The aims of the study were: (i) to analyse the epidemiology of cases from 2005 to 2009; (ii) to identify spatial patterns of infection based on two spatial hypotheses; and (iii) to determine the hydro-climatological factors associated with typhoid prevalence. Case occurrence data were collected from 11 major hospitals in the DMA, geocoded to census tract level, and used in a spatio-temporal analysis with a range of demographic, environmental, and meteorological variables. Analyses revealed distinct seasonality as well as age and gender differences, with males and very young children being disproportionately infected. The male-female ratio of typhoid cases was found to be 1.36, and the median age of the cases was 14 years. Typhoid incidence was higher in the male population than in the female population (χ² = 5.88, p < 0.05). A statistically significant inverse association was found between typhoid incidence and distance to major waterbodies. Spatial pattern analysis showed significant clustering of typhoid distribution in the study area. Moran's I was highest (0.879; p < 0.01) in 2008 and lowest (0.075; p < 0.05) in 2009. Incidence rates were found to form three large, multi-centred spatial clusters with no significant difference between urban and rural rates. Temporally, typhoid incidence was seen to increase with temperature, rainfall, and river level at time lags ranging from three to five weeks. For example, for a 0.1 metre rise in river levels, the number of typhoid cases increased by 4.6% (95% CI: 2.4-2.8) above the threshold of 4.0 metres (95% CI: 2.4-4.3). On the other hand, with a 1°C rise in temperature, the number of typhoid cases could increase by 14.2% (95% CI: 4.4-25.0).
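
    The clustering statistic quoted above is global Moran's I, I = (N / Σij wij) · Σij wij zi zj / Σi zi², where zi are mean-centred incidence rates and W is a spatial weights matrix over the census tracts. A short sketch with toy data:

```python
# Global Moran's I from area rates and a binary adjacency matrix W
# (w_ij = 1 for neighbouring tracts). The data here are toy values for
# illustration, not the study's incidence rates.
import numpy as np

def morans_i(x, W):
    z = np.asarray(x, float) - np.mean(x)          # mean-centred rates
    num = (W * np.outer(z, z)).sum()               # sum_ij w_ij * z_i * z_j
    return len(z) / W.sum() * num / (z ** 2).sum()

# five tracts on a line, each adjacent to its immediate neighbours
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
rates = [2.0, 2.5, 2.4, 0.5, 0.4]                  # clustered high/low incidence
print(morans_i(rates, W))                          # ~0.45: positive => clustering
```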