Characterizing and Subsetting Big Data Workloads
Big data benchmark suites must include a diversity of data and workloads to
be useful in fairly evaluating big data systems and architectures. However,
using truly comprehensive benchmarks poses great challenges for the
architecture community. First, we need to thoroughly understand the behaviors
of a variety of workloads. Second, our usual simulation-based research methods
become prohibitively expensive for big data. As big data is an emerging field,
more and more software stacks are being proposed to facilitate the development
of big data applications, which aggravates these challenges. In this paper, we
first use Principal Component Analysis (PCA) to identify the most important
characteristics from 45 metrics to characterize big data workloads from
BigDataBench, a comprehensive big data benchmark suite. Second, we apply a
clustering technique to the principal components obtained from the PCA to
investigate the similarity among big data workloads, and we verify the
importance of including different software stacks for big data benchmarking.
Third, we select seven representative big data workloads by removing redundant
ones and release the BigDataBench simulation version, which is publicly
available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload
Characterization
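The pipeline this abstract describes (PCA on workload metrics, clustering on the principal components, then keeping one workload per cluster) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the clustering technique, so k-means with farthest-point initialisation is an assumption here, as are all function names and the rule of picking the workload closest to each centroid.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(rows):
    """Scale each metric (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def principal_components(rows, k, iters=500):
    """Top-k principal directions of centred data (power iteration + deflation)."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    comps = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-zero start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        # Deflate: remove the variance already explained by this component.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
        comps.append(v)
    return comps

def project(rows, comps):
    """Coordinates of each row in the space spanned by the components."""
    return [[sum(x * c for x, c in zip(r, comp)) for comp in comps] for r in rows]

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations; k-means itself is an assumption of this sketch."""
    centers = [list(points[0])]
    while len(centers) < k:  # farthest-point initialisation (deterministic)
        centers.append(list(max(points, key=lambda p: min(euclid(p, c) for c in centers))))
    def assign():
        return [min(range(k), key=lambda c: euclid(p, centers[c])) for p in points]
    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(), centers

def representatives(points, labels, centers):
    """Index of the point closest to each cluster centroid."""
    reps = []
    for c, center in enumerate(centers):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps.append(min(members, key=lambda i: euclid(points[i], center)))
    return sorted(reps)
```

With a matrix of one row per workload and one column per metric, the calls would be chained as `standardize`, `principal_components`, `project`, `kmeans`, and finally `representatives`, whose result is the list of row indices to keep.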
Clustering Methods for Electricity Consumers: An Empirical Study in Hvaler-Norway
The development of the Smart Grid in Norway in particular, and in Europe and the
US in general, will shortly lead to the availability of massive amounts of
fine-grained spatio-temporal consumption data from domestic households. This
enables the application of data mining techniques to traditional problems in
power systems.
Clustering customers into appropriate groups is extremely useful for operators
or retailers to address each group differently through dedicated tariffs or
customer-tailored services. Currently, the task is done based on demographic
data collected through questionnaires, which is error-prone. In this paper, we
used three different clustering techniques (together with their variants) to
automatically segment electricity consumers based on their consumption
patterns. We also proposed a method for extracting consumption patterns for each
consumer. The grouping results were assessed using four common internal
validity indexes. We found that the combination of Self Organizing Map (SOM)
and k-means algorithms produces the most insightful and useful grouping. We also
discovered that grouping quality cannot be measured effectively by automatic
indicators, which goes against common suggestions in the literature.
Comment: 12 pages, 3 figures
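As an illustration of what an internal validity index measures, the sketch below computes the mean silhouette coefficient for a labelled set of load profiles. The abstract does not name its four indexes, so the choice of silhouette is an assumption, and the function names and data layout (one list of readings per consumer) are hypothetical.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette(profiles, labels):
    """Mean silhouette coefficient; values near 1 mean compact, well-separated groups."""
    n = len(profiles)
    clusters = set(labels)
    total = 0.0
    for i in range(n):
        own = [euclid(profiles[i], profiles[j])
               for j in range(n) if j != i and labels[j] == labels[i]]
        if not own or len(clusters) < 2:
            continue  # singletons (or a single cluster) contribute 0 by convention
        a = sum(own) / len(own)  # mean distance to the consumer's own group
        b = min(                 # mean distance to the nearest other group
            sum(euclid(profiles[i], profiles[j])
                for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n
```

An index like this scores only geometric compactness and separation, which is consistent with the abstract's observation that automatic indicators do not necessarily capture how useful a grouping is to an operator.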
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees that, as we prove, must terminate for a fixed data set
in a chain of nested partitions that satisfies a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection.
Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a
conference
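For context, the standard batch approach that this abstract positions itself against can be sketched as a naive agglomerative procedure with a pluggable linkage function. This is the textbook baseline, not the anytime method proposed in the paper; the function and variable names are illustrative.

```python
import math

def euclid(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Each linkage reduces the list of pairwise distances between two clusters.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def agglomerate(points, linkage="single"):
    """Naive batch clustering; returns merges as (cluster_a, cluster_b, distance)."""
    combine = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                ds = [euclid(points[i], points[j])
                      for i in clusters[x] for j in clusters[y]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges
```

The whole data set has to be available up front and every merge scans all cluster pairs, which is the kind of cost that motivates incremental or decentralized alternatives.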
PhylOTU: a high-throughput procedure quantifies microbial community diversity and resolves novel taxa from metagenomic data.
Microbial diversity is typically characterized by clustering ribosomal RNA (SSU-rRNA) sequences into operational taxonomic units (OTUs). Targeted sequencing of environmental SSU-rRNA markers via PCR may fail to detect OTUs due to biases in priming and amplification. Analysis of shotgun sequenced environmental DNA, known as metagenomics, avoids amplification bias but generates fragmentary, non-overlapping sequence reads that cannot be clustered by existing OTU-finding methods. To circumvent these limitations, we developed PhylOTU, a computational workflow that identifies OTUs from metagenomic SSU-rRNA sequence data through the use of phylogenetic principles and probabilistic sequence profiles. Using simulated metagenomic data, we quantified the accuracy with which PhylOTU clusters reads into OTUs. Comparisons of PCR and shotgun sequenced SSU-rRNA markers derived from the global open ocean revealed that while PCR libraries identify more OTUs per sequenced residue, metagenomic libraries recover a greater taxonomic diversity of OTUs. In addition, we discover novel species, genera and families in the metagenomic libraries, including OTUs from phyla missed by analysis of PCR sequences. Taken together, these results suggest that PhylOTU enables characterization of part of the biosphere currently hidden from PCR-based surveys of diversity.
Relation between Financial Market Structure and the Real Economy: Comparison between Clustering Methods
We quantify the amount of information filtered by different hierarchical
clustering methods on correlations between stock returns, comparing it with the
underlying industrial activity structure. Specifically, we apply, for the first
time to financial data, a novel hierarchical clustering approach, the Directed
Bubble Hierarchical Tree and we compare it with other methods including the
Linkage and k-medoids. In particular, by taking the industrial sector
classification of stocks as a benchmark partition, we evaluate how the
different methods retrieve this classification. The results show that the
Directed Bubble Hierarchical Tree can outperform other methods, being able to
retrieve more information with fewer clusters. Moreover, we show that the
economic information is hidden at different levels of the hierarchical
structures depending on the clustering method. The dynamical analysis on a
rolling window also reveals that the different methods show different degrees
of sensitivity to events affecting financial markets, like crises. These
results can be of interest for all the applications of clustering methods to
portfolio optimization and risk hedging.
Comment: 31 pages, 17 figures
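One common way to score how well a clustering retrieves a benchmark partition such as industrial sectors is the adjusted Rand index. The abstract does not state which measure was used, so treating the comparison this way is an assumption; the sketch below and its names are illustrative only.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(benchmark, clusters):
    """Agreement between two partitions, corrected for chance (1.0 = identical)."""
    n = len(benchmark)
    if n < 2:
        return 1.0
    # Pairs of items that fall together in both partitions, in each alone, and overall.
    together = sum(comb(c, 2) for c in Counter(zip(benchmark, clusters)).values())
    in_bench = sum(comb(c, 2) for c in Counter(benchmark).values())
    in_clust = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = in_bench * in_clust / comb(n, 2)
    maximum = (in_bench + in_clust) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together - expected) / (maximum - expected)
```

For a hierarchical method, the index would be evaluated at each cut of the tree, which gives one way to see at which level of the hierarchy the sector information is recovered.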