192 research outputs found
MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!
Hadoop is currently the large-scale data analysis "hammer" of choice, but
there exist classes of algorithms that aren't "nails", in the sense that they
are not particularly amenable to the MapReduce programming model. To address
this, researchers have proposed MapReduce extensions or alternative programming
models in which these algorithms can be elegantly expressed. This essay
espouses a very different position: that MapReduce is "good enough", and that
instead of trying to invent screwdrivers, we should simply get rid of
everything that's not a nail. To be more specific, much discussion in the
literature surrounds the fact that iterative algorithms are a poor fit for
MapReduce: the simple solution is to find alternative non-iterative algorithms
that solve the same problem. This essay captures my personal experiences as an
academic researcher as well as a software engineer in a "real-world" production
analytics environment. From this combined perspective I reflect on the current
state and future of "big data" research
From Social Data Mining to Forecasting Socio-Economic Crisis
Socio-economic data mining has a great potential in terms of gaining a better
understanding of problems that our economy and society are facing, such as
financial instability, shortages of resources, or conflicts. Without
large-scale data mining, progress in these areas seems hard or impossible.
Therefore, a suitable, distributed data mining infrastructure and research
centers should be built in Europe. It also appears appropriate to build a
network of Crisis Observatories. They can be imagined as laboratories devoted
to the gathering and processing of enormous volumes of data on both natural
systems such as the Earth and its ecosystem, as well as on human
techno-socio-economic systems, so as to gain early warnings of impending
events. Reality mining provides the chance to adapt more quickly and more
accurately to changing situations. Further opportunities arise by individually
customized services, which however should be provided in a privacy-respecting
way. This requires the development of novel ICT (such as a self- organizing
Web), but most likely new legal regulations and suitable institutions as well.
As long as such regulations are lacking on a world-wide scale, it is in the
public interest that scientists explore what can be done with the huge data
available. Big data do have the potential to change or even threaten democratic
societies. The same applies to sudden and large-scale failures of ICT systems.
Therefore, dealing with data must be done with a large degree of responsibility
and care. Self-interests of individuals, companies or institutions have limits,
where the public interest is affected, and public interest is not a sufficient
justification to violate human rights of individuals. Privacy is a high good,
as confidentiality is, and damaging it would have serious side effects for
society.Comment: 65 pages, 1 figure, Visioneer White Paper, see
http://www.visioneer.ethz.c
Algorithms and Software for the Analysis of Large Complex Networks
The work presented intersects three main areas, namely graph algorithmics, network science and applied software engineering. Each computational method discussed relates to one of the main tasks of data analysis: to extract structural features from network data, such as methods for community detection; or to transform network data, such as methods to sparsify a network and reduce its size while keeping essential properties; or to realistically model networks through generative models
From social data mining to forecasting socio-economic crises
Abstract.: The purpose of this White Paper of the EU Support Action "Visioneer”(see www.visioneer.ethz.ch) is to address the following goals: 1. Develop strategies to quickly increase the objective knowledge about social and economic systems. 2. Describe requirements for efficient large-scale scientific data mining of anonymized social and economic data. 3. Formulate strategies how to collect stylized facts extracted from large data set. 4. Sketch ways how to successfully build up centers for computational social science. 5. Propose plans how to create centers for risk analysis and crisis forecasting. 6. Elaborate ethical standards regarding the storage, processing, evaluation, and publication of social and economic dat
Real-Time intelligence
Dissertação de mestrado em Computer ScienceOver the past 20 years, data has increased in a large scale in various fields. This explosive
increase of global data led to the coin of the term Big Data. Big data is mainly used to describe
enormous datasets that typically includes masses of unstructured data that may need
real-time analysis. This paradigm brings important challenges on tasks like data acquisition,
storage and analysis. The ability to perform these tasks efficiently got the attention
of researchers as it brings a lot of oportunities for creating new value. Another topic with
growing importance is the usage of biometrics, that have been used in a wide set of application
areas as, for example, healthcare and security. In this work it is intended to handle
the data pipeline of data generated by a large scale biometrics application providing basis
for real-time analytics and behavioural classification. The challenges regarding analytical
queries (with real-time requirements, due to the need of monitoring the metrics/behavior)
and classifiers’ training are particularly addressed.Nos os últimos 20 anos, a quantidade de dados armazenados e passíveis de serem processados,
tem vindo a aumentar em áreas bastante diversas. Este aumento explosivo, aliado
às potencialidades que surgem como consequência do mesmo, levou ao aparecimento do
termo Big Data. Big Data abrange essencialmente grandes volumes de dados, possivelmente
com pouca estrutura e com necessidade de processamento em tempo real. As especificidades
apresentadas levaram ao aparecimento de desafios nas diversas tarefas do pipeline
típico de processamento de dados como, por exemplo, a aquisição, armazenamento e a
análise. A capacidade de realizar estas tarefas de uma forma eficiente tem sido alvo de estudo
tanto pela indústria como pela comunidade académica, abrindo portas para a criação
de valor. Uma outra área onde a evolução tem sido notória é a utilização de biométricas comportamentais
que tem vindo a ser cada vez mais acentuada em diferentes cenários como,
por exemplo, na área dos cuidados de saúde ou na segurança. Neste trabalho um dos objetivos
passa pela gestão do pipeline de processamento de dados de uma aplicação de larga
escala, na área das biométricas comportamentais, de forma a possibilitar a obtenção de
métricas em tempo real sobre os dados (viabilizando a sua monitorização) e a classificação
automática de registos sobre fadiga na interação homem-máquina (em larga escala)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)
Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to
present their current research, and to discuss topics with other students in order to look for synergies and common research
topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to
achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable
solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big
data management, training, contributing to glue disparate researchers working across different areas and provide a meeting
ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in
research topics such as sustainable software solutions (applications and system software stack), data management, energy
efficiency, and resilience.European Cooperation in Science and Technology. COS
Research and Education in Computational Science and Engineering
Over the past two decades the field of computational science and engineering
(CSE) has penetrated both basic and applied research in academia, industry, and
laboratories to advance discovery, optimize systems, support decision-makers,
and educate the scientific and engineering workforce. Informed by centuries of
theory and experiment, CSE performs computational experiments to answer
questions that neither theory nor experiment alone is equipped to answer. CSE
provides scientists and engineers of all persuasions with algorithmic
inventions and software systems that transcend disciplines and scales. Carried
on a wave of digital technology, CSE brings the power of parallelism to bear on
troves of data. Mathematics-based advanced computing has become a prevalent
means of discovery and innovation in essentially all areas of science,
engineering, technology, and society; and the CSE community is at the core of
this transformation. However, a combination of disruptive
developments---including the architectural complexity of extreme-scale
computing, the data revolution that engulfs the planet, and the specialization
required to follow the applications to new frontiers---is redefining the scope
and reach of the CSE endeavor. This report describes the rapid expansion of CSE
and the challenges to sustaining its bold advances. The report also presents
strategies and directions for CSE research and education for the next decade.Comment: Major revision, to appear in SIAM Revie
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 considered baselines, and it
facilitates developing complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported with
a broad concurrency analysis for portability in performance insights, and a
novel performance metric to assess the throughput of graph mining algorithms,
enabling more insightful evaluation. As use cases, we harness GMS to rapidly
redesign and accelerate state-of-the-art baselines of core graph mining
problems: degeneracy reordering (by up to >2x), maximal clique listing (by up
to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x),
also obtaining better theoretical performance bounds
- …