Solving k-means on High-dimensional Big Data
In recent years, there have been major efforts to develop data stream
algorithms that process inputs in one pass over the data with little memory
requirement. For the k-means problem, this has led to the development of
several (1+ε)-approximations (under the assumption that k is a
constant), but also to the design of algorithms that are extremely fast in
practice and compute solutions of high accuracy. However, when not only the
length of the stream but also the dimensionality of the input points is high,
current methods reach their limits.
We propose two algorithms, piecy and piecy-mr, based on the recently
developed data stream algorithm BICO, that process high dimensional data in
one pass and output a solution of high quality. While piecy is suited for high
dimensional data with a medium number of points, piecy-mr is meant for high
dimensional data that comes in a very long stream. We provide an extensive
experimental study to evaluate piecy and piecy-mr that shows the strength of
the new algorithms.
Comment: 23 pages, 9 figures, published at the 14th International Symposium on
Experimental Algorithms - SEA 201
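The one-pass, low-memory setting this abstract describes can be illustrated with a much simpler streaming heuristic than BICO itself. The following sketch is a generic MacQueen-style online k-means, not the piecy or BICO algorithms; it only shows what "one pass with O(k·d) memory" means in practice.

```python
import numpy as np

def online_kmeans(stream, k, dim):
    """One-pass (MacQueen-style) k-means: each point moves its nearest
    center by the running-mean rule.  Memory is O(k * dim), independent
    of the stream length."""
    centers = np.empty((k, dim))
    counts = np.zeros(k, dtype=int)
    for i, x in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if i < k:                                  # seed with the first k points
            centers[i], counts[i] = x, 1
        else:
            j = int(np.argmin(((centers - x) ** 2).sum(axis=1)))
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]   # running mean update
    return centers

# Demo: a stream alternating between two well-separated clusters.
rng = np.random.default_rng(0)
stream = []
for _ in range(200):
    stream.append(rng.normal(0.0, 0.5, 2))    # cluster A around (0, 0)
    stream.append(rng.normal(10.0, 0.5, 2))   # cluster B around (10, 10)
centers = online_kmeans(stream, k=2, dim=2)
```

Unlike coreset-based methods such as BICO, this heuristic carries no quality guarantee; it is only a baseline for the streaming model the abstract refers to.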
Privacy Preserving Multi-Server k-means Computation over Horizontally Partitioned Data
The k-means clustering is one of the most popular clustering algorithms in
data mining. Recently, much research has focused on settings in which the
dataset is divided among multiple parties or is too large to be handled by the
data owner. In the latter case, servers are usually hired to perform the
clustering: the dataset is divided by the data
owner among the servers who together perform the k-means and return the cluster
labels to the owner. The major challenge in this method is to prevent the
servers from gaining substantial information about the actual data of the
owner. Several algorithms have been designed in the past that provide
cryptographic solutions to perform privacy preserving k-means. We provide a new
method to perform k-means over a large dataset using multiple servers. Our
technique avoids heavy cryptographic computations and instead we use a simple
randomization technique to preserve the privacy of the data. The k-means
computed has exactly the same efficiency and accuracy as the k-means computed
over the original dataset without any randomization. We argue that our
algorithm is secure against an honest-but-curious, passive adversary.
Comment: 19 pages, 4 tables. International Conference on Information Systems
Security. Springer, Cham, 201
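The abstract does not spell out the randomization scheme, but the property it claims (identical accuracy on masked and original data) is exactly what any distance-preserving transformation provides. As a generic illustration, not necessarily the authors' technique, rotating the data by a random orthogonal matrix leaves every pairwise Euclidean distance, and hence every Lloyd iteration, unchanged:

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Plain Lloyd's k-means with a fixed set of initial centers."""
    centers = centers.copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)),      # cluster around 0
               rng.normal(6, 1, (50, 3))])     # cluster around 6
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix

init = X[[0, 50]]                              # same init in both views
labels_plain  = lloyd(X, init, 20)             # k-means on the raw data
labels_masked = lloyd(X @ Q, init @ Q, 20)     # k-means on the masked data
```

Because ||xQ - yQ|| = ||x - y|| for orthogonal Q, the server clustering the masked data recovers the exact same labels while never seeing the raw coordinates; the actual paper's protocol additionally handles horizontal partitioning across multiple servers.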
Algorithms for Stable Matching and Clustering in a Grid
We study a discrete version of a geometric stable marriage problem originally
proposed in a continuous setting by Hoffman, Holroyd, and Peres, in which
points in the plane are stably matched to cluster centers, as prioritized by
their distances, so that each cluster center is apportioned a set of points of
equal area. We show that, for a discretization of the problem to an n × n
grid of pixels with k centers, the problem can be solved in time O(n^2 log^5 n), and we experiment with two slower but more practical algorithms and
a hybrid method that switches from one of these algorithms to the other to gain
greater efficiency than either algorithm alone. We also show how to combine
geometric stable matchings with a k-means clustering algorithm, so as to
provide a geometric political-districting algorithm that views distance in
economic terms, and we experiment with weighted versions of stable k-means in
order to improve the connectivity of the resulting clusters.
Comment: 23 pages, 12 figures. To appear (without the appendices) at the 18th
International Workshop on Combinatorial Image Analysis, June 19-21, 2017,
Plovdiv, Bulgaria
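Because both pixels and centers rank their partners by the same Euclidean distance, the stable matching can be computed by a simple greedy scan of all (pixel, center) pairs in increasing distance order. This is far slower than the algorithms in the paper, but it is a compact way to see what the matching looks like:

```python
from math import hypot

def stable_grid_matching(n, centers):
    """Match the pixels of an n x n grid to centers so that every center
    receives an equal quota of pixels.  With both sides ranking partners
    by Euclidean distance, assigning (pixel, center) pairs greedily in
    increasing distance order yields the stable matching."""
    k = len(centers)
    assert (n * n) % k == 0
    quota = n * n // k
    pairs = sorted(
        (hypot(px - cx, py - cy), (px, py), c)
        for px in range(n) for py in range(n)
        for c, (cx, cy) in enumerate(centers)
    )
    owner, load = {}, [0] * k
    for _, pixel, c in pairs:
        if pixel not in owner and load[c] < quota:
            owner[pixel] = c
            load[c] += 1
    return owner

# 4 x 4 grid, two centers in opposite corners: each is apportioned 8 pixels.
match = stable_grid_matching(4, [(0, 0), (3, 3)])
```

The greedy scan costs O(n^2 k log(n^2 k)) time, which makes the paper's near-quadratic bound for the grid case a substantial improvement.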
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
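The hierarchical methods of Su and Dy studied in the chapter are not reproduced here, but the three properties in the title are easy to make concrete. The following is a hypothetical maximin-style seeder, not one of the six methods evaluated: it is deterministic (no randomness), order-invariant up to exact ties (the grand centroid and farthest-point steps do not depend on input order), and linear in the number of points per seed.

```python
import numpy as np

def deterministic_maximin_init(X, k):
    """Deterministic, order-invariant seeding sketch: start from the
    point closest to the grand centroid, then repeatedly add the point
    farthest from its nearest seed.  O(n * k) distance computations."""
    X = np.asarray(X, dtype=float)
    first = int(np.argmin(((X - X.mean(0)) ** 2).sum(1)))
    seeds = [first]
    d = ((X - X[first]) ** 2).sum(1)          # squared dist to nearest seed
    for _ in range(k - 1):
        nxt = int(np.argmax(d))               # farthest remaining point
        seeds.append(nxt)
        d = np.minimum(d, ((X - X[nxt]) ** 2).sum(1))
    return X[seeds]

# Three well-separated blobs: the seeder should land one seed in each.
rng = np.random.default_rng(2)
blob_centers = [(0, 0), (10, 0), (0, 10)]
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in blob_centers])
seeds = deterministic_maximin_init(X, 3)
```

Running the same data in any shuffled order returns the same seed set, which is exactly the reliability property the chapter argues random linear initializers lack.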
Use of Oral Cholera Vaccines in an Outbreak in Vietnam: A Case Control Study
Simple measures such as adequate sanitation and clean water stop the spread of cholera; however, in areas where these are not available, cholera spreads quickly and may lead to death within a few hours if treatment is not initiated immediately. The use of life-saving rehydration therapy is the mainstay of cholera control; however, the rapidity of the disease and the limited access to appropriate healthcare units in far-flung areas together result in an unacceptable number of deaths. The WHO has recommended the use of oral cholera vaccines as a preventive measure against cholera outbreaks since 2001, but this was recently updated so that vaccine use may also be considered once a cholera outbreak has begun. The findings from this study suggest that reactive use of killed oral cholera vaccines provides protection against the disease and may be a potential tool in times of outbreaks. Further studies must be conducted to confirm these findings
Identifying Prototypical Components in Behaviour Using Clustering Algorithms
Quantitative analysis of animal behaviour is required to understand the task-solving strategies of animals and the underlying control mechanisms. The identification of repeatedly occurring behavioural components is therefore a key element of a structured quantitative description. However, the complexity of most behaviours makes the identification of such behavioural components a challenging problem. We propose an automatic and objective approach for determining and evaluating prototypical behavioural components. Behavioural prototypes are identified using clustering algorithms and finally evaluated with respect to their ability to represent the whole behavioural data set. The prototypes allow for a meaningful segmentation of behavioural sequences. We applied our clustering approach to identify prototypical movements of the head of blowflies during cruising flight. The results confirm the previously established saccadic gaze strategy, with the set of prototypes being divided into predominantly translational or predominantly rotational movements. The prototypes reveal additional details about the saccadic and intersaccadic flight sections that could not previously be unravelled. Successful application of the proposed approach to behavioural data shows its ability to automatically identify prototypical behavioural components within a large and noisy database and to evaluate these with respect to their quality and stability. Hence, this approach might be applied to a broad range of behavioural and neural data obtained from different animals and in different contexts
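The pipeline the abstract describes (segment a behavioural trace, cluster the segments, read off prototypes) can be sketched on synthetic data. The windowing, the single mean-absolute-velocity feature, and the saccade-like bursts below are all illustrative assumptions, not the paper's actual features or blowfly data:

```python
import numpy as np

def window_prototypes(velocity, win, k, iters=30):
    """Cut a velocity trace into windows, describe each window by its
    mean absolute velocity, and cluster the windows with 1-D k-means.
    Returns the prototype values and each window's prototype label."""
    v = np.asarray(velocity, dtype=float)
    feats = np.abs(v[: len(v) // win * win]).reshape(-1, win).mean(1)
    protos = np.linspace(feats.min(), feats.max(), k)  # deterministic init
    for _ in range(iters):
        labels = np.abs(feats[:, None] - protos[None, :]).argmin(1)
        for j in range(k):
            if (labels == j).any():
                protos[j] = feats[labels == j].mean()
    return protos, labels

# Synthetic "head yaw velocity": low-amplitude noise with brief,
# high-velocity saccade-like bursts every 100 samples.
rng = np.random.default_rng(3)
v = rng.normal(0.0, 0.1, 1000)
for s in range(50, 1000, 100):
    v[s:s + 5] += 5.0
protos, labels = window_prototypes(v, win=10, k=2)
```

Here the two recovered prototypes separate the burst windows from the smooth ones, which is the 1-D analogue of the translational/rotational split the study reports.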
Gray zones around diffuse large B cell lymphoma. Conclusions based on the workshop of the XIV meeting of the European Association for Hematopathology and the Society of Hematopathology in Bordeaux, France
The term “gray-zone” lymphoma has been used to denote a group of lymphomas with overlapping histological, biological, and clinical features between various types of lymphomas. It has been used in the context of Hodgkin lymphomas (HL) and non-Hodgkin lymphomas (NHL), including classical HL (CHL), and primary mediastinal large B cell lymphoma, cases with overlapping features between nodular lymphocyte predominant Hodgkin lymphoma and T-cell/histiocyte-rich large B cell lymphoma, CHL, and Epstein–Barr-virus-positive lymphoproliferative disorders, and peripheral T cell lymphomas simulating CHL. A second group of gray-zone lymphomas includes B cell NHL with intermediate features between diffuse large B cell lymphoma and classical Burkitt lymphoma. In order to review controversial issues in gray-zone lymphomas, a joint Workshop of the European Association for Hematopathology and the Society for Hematopathology was held in Bordeaux, France, in September 2008. The panel members reviewed and discussed 145 submitted cases and reached consensus diagnoses. This Workshop summary is focused on the most controversial aspects of gray-zone lymphomas and describes the panel’s proposals regarding diagnostic criteria, terminology, and new prognostic and diagnostic parameters
In quest of a systematic framework for unifying and defining nanoscience
This article proposes a systematic framework for unifying and defining nanoscience based on historic first principles and step logic that led to a “central paradigm” (i.e., unifying framework) for traditional elemental/small-molecule chemistry. As such, a Nanomaterials classification roadmap is proposed, which divides all nanomatter into Category I: discrete, well-defined and Category II: statistical, undefined nanoparticles. We consider only Category I, well-defined nanoparticles which are >90% monodisperse as a function of Critical Nanoscale Design Parameters (CNDPs) defined according to: (a) size, (b) shape, (c) surface chemistry, (d) flexibility, and (e) elemental composition. Classified as either hard (H) (i.e., inorganic-based) or soft (S) (i.e., organic-based) categories, these nanoparticles were found to manifest pervasive atom mimicry features that included: (1) a dominance of zero-dimensional (0D) core–shell nanoarchitectures, (2) the ability to self-assemble or chemically bond as discrete, quantized nanounits, and (3) exhibited well-defined nanoscale valencies and stoichiometries reminiscent of atom-based elements. These discrete nanoparticle categories are referred to as hard or soft particle nanoelements. Many examples describing chemical bonding/assembly of these nanoelements have been reported in the literature. We refer to these hard:hard (H-n:H-n), soft:soft (S-n:S-n), or hard:soft (H-n:S-n) nanoelement combinations as nanocompounds. Due to their quantized features, many nanoelement and nanocompound categories are reported to exhibit well-defined nanoperiodic property patterns. These periodic property patterns are dependent on their quantized nanofeatures (CNDPs) and dramatically influence intrinsic physicochemical properties (i.e., melting points, reactivity/self-assembly, sterics, and nanoencapsulation), as well as important functional/performance properties (i.e., magnetic, photonic, electronic, and toxicologic properties). 
We propose this perspective as a modest first step toward more clearly defining synthetic nanochemistry as well as providing a systematic framework for unifying nanoscience. With further progress, one should anticipate the evolution of future nanoperiodic table(s) suitable for predicting important risk/benefit boundaries in the field of nanoscience
Typhoid Fever and Its Association with Environmental Factors in the Dhaka Metropolitan Area of Bangladesh: A Spatial and Time-Series Approach
Typhoid fever is a major cause of death worldwide with a major part of the disease burden in developing regions such as the Indian sub-continent. Bangladesh is part of this highly endemic region, yet little is known about the spatial and temporal distribution of the disease at a regional scale. This research used a Geographic Information System to explore, spatially and temporally, the prevalence of typhoid in Dhaka Metropolitan Area (DMA) of Bangladesh over the period 2005-9. This paper provides the first study of the spatio-temporal epidemiology of typhoid for this region. The aims of the study were: (i) to analyse the epidemiology of cases from 2005 to 2009; (ii) to identify spatial patterns of infection based on two spatial hypotheses; and (iii) to determine the hydro-climatological factors associated with typhoid prevalence. Case occurrence data were collected from 11 major hospitals in DMA, geocoded to census tract level, and used in a spatio-temporal analysis with a range of demographic, environmental and meteorological variables. Analyses revealed distinct seasonality as well as age and gender differences, with males and very young children being disproportionately infected. The male-female ratio of typhoid cases was found to be 1.36, and the median age of the cases was 14 years. Typhoid incidence was higher in the male population than the female (χ² = 5.88, p < 0.05). A statistically significant inverse association was found between typhoid incidence and distance to major waterbodies. Spatial pattern analysis showed that there was a significant clustering of typhoid distribution in the study area. Moran's I was highest (0.879; p < 0.01) in 2008 and lowest (0.075; p < 0.05) in 2009. Incidence rates were found to form three large, multi-centred, spatial clusters with no significant difference between urban and rural rates. Temporally, typhoid incidence was seen to increase with temperature, rainfall and river level at time lags ranging from three to five weeks. 
For example, for a 0.1 metre rise in river levels, the number of typhoid cases increased by 4.6% (95% CI: 2.4-2.8) above the threshold of 4.0 metres (95% CI: 2.4-4.3). On the other hand, with a 1°C rise in temperature, the number of typhoid cases could increase by 14.2% (95% CI: 4.4-25.0)
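The clustering statistic quoted above, Moran's I, is straightforward to compute. A minimal sketch with a toy rook-contiguity weights matrix (the study's actual census-tract weights are not reproduced here):

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I: positive when similar values cluster in space,
    negative when neighbours tend to differ.  w is a spatial weights
    matrix with w[i, j] > 0 for neighbouring regions and zero diagonal."""
    z = np.asarray(x, dtype=float) - np.mean(x)
    return len(z) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

# Six regions in a row; neighbours share an edge (rook contiguity).
w = np.zeros((6, 6))
for i in range(5):
    w[i, i + 1] = w[i + 1, i] = 1

clustered = [1, 1, 1, 0, 0, 0]     # high incidence grouped together
alternating = [1, 0, 1, 0, 1, 0]   # high incidence dispersed
```

On the clustered pattern I is positive, and on the alternating pattern it is negative, matching the interpretation of the positive values (0.879, 0.075) the study reports for 2008 and 2009; significance testing would additionally require a permutation or normal-approximation null.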