Some Contribution of Statistical Techniques in Big Data: A Review
Big Data is a popular topic in research. It is widely believed that science, business, industry, government, and society will undergo a thorough change under its impact. Big data refers to very large, complex data sets, structured and unstructured, with hidden patterns, that are difficult to collect, store, and analyse for processing or results. Proper advanced techniques are therefore needed to gain knowledge from big data, and major challenges arise in storage, processing, search, sharing, transfer, analysis, and visualization. This paper discusses the fundamentals of big data, its issues and management, and the techniques used for it, and presents a review of various advanced statistical techniques for handling key big data applications with large data sets. These advanced techniques handle both structured and unstructured big data in different areas.
From Social Data Mining to Forecasting Socio-Economic Crisis
Socio-economic data mining has a great potential in terms of gaining a better
understanding of problems that our economy and society are facing, such as
financial instability, shortages of resources, or conflicts. Without
large-scale data mining, progress in these areas seems hard or impossible.
Therefore, a suitable, distributed data mining infrastructure and research
centers should be built in Europe. It also appears appropriate to build a
network of Crisis Observatories. They can be imagined as laboratories devoted
to the gathering and processing of enormous volumes of data on both natural
systems such as the Earth and its ecosystem, as well as on human
techno-socio-economic systems, so as to gain early warnings of impending
events. Reality mining provides the chance to adapt more quickly and more
accurately to changing situations. Further opportunities arise by individually
customized services, which however should be provided in a privacy-respecting
way. This requires the development of novel ICT (such as a self-organizing
Web), but most likely new legal regulations and suitable institutions as well.
As long as such regulations are lacking on a world-wide scale, it is in the
public interest that scientists explore what can be done with the huge data
available. Big data do have the potential to change or even threaten democratic
societies. The same applies to sudden and large-scale failures of ICT systems.
Therefore, dealing with data must be done with a large degree of responsibility
and care. Self-interests of individuals, companies or institutions have limits,
where the public interest is affected, and public interest is not a sufficient
justification to violate human rights of individuals. Privacy is a high good,
as confidentiality is, and damaging it would have serious side effects for
society.
Safe Automated Refactoring for Intelligent Parallelization of Java 8 Streams
Streaming APIs are becoming more pervasive in mainstream Object-Oriented programming languages and platforms. For example, the Stream API introduced in Java 8 allows for functional-like, MapReduce-style operations in processing both finite, e.g., collections, and infinite data structures. However, using this API efficiently involves subtle considerations, such as determining when it is best for stream operations to run in parallel, when running operations in parallel can be less efficient, and when it is safe to run in parallel given possible lambda expression side effects. Also, streams may not run all operations in parallel depending on the particular collectors used in reductions. In this paper, we present an automated refactoring approach that assists developers in writing efficient stream code in a semantics-preserving fashion. The approach, based on a novel data ordering and typestate analysis, consists of preconditions and transformations for automatically determining when it is safe and possibly advantageous to convert sequential streams to parallel, unorder or de-parallelize already-parallel streams, and optimize streams involving complex reductions. The approach was implemented as a plug-in to the popular Eclipse IDE, uses the WALA and SAFE analysis frameworks, and was evaluated on 11 Java projects consisting of ∼642K lines of code. We found that 57 of 157 candidate streams (36.31%) were refactorable, and an average speedup of 3.49 on performance tests was observed. The results indicate that the approach is useful in optimizing stream code to its full potential.
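To make the parallelization trade-off concrete, the following is a minimal, hypothetical Java sketch (not drawn from the paper or its evaluation subjects): a side-effect-free pipeline can be switched from stream() to parallelStream() safely, while the final, stateful lambda is exactly the kind of pattern a safe refactoring must reject.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamParallelizationSketch {
    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(3, 1, 4, 1, 5, 9, 2, 6);

        // Sequential pipeline: always safe, but single-threaded.
        List<Integer> doubledSequential = numbers.stream()
                .map(n -> n * 2)
                .collect(Collectors.toList());

        // Parallel pipeline: potentially faster, and safe here only because
        // the lambda is stateless and the collector handles merging.
        List<Integer> doubledParallel = numbers.parallelStream()
                .map(n -> n * 2)
                .collect(Collectors.toList());

        // Unsafe pattern: a stateful lambda mutating shared, non-thread-safe
        // state from a parallel stream; this is a data race.
        List<Integer> shared = new ArrayList<>();
        numbers.parallelStream().forEach(n -> shared.add(n * 2));

        System.out.println(doubledSequential);
        System.out.println(doubledParallel);
    }
}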
PRINS: Scalable Model Inference for Component-based System Logs
Behavioral software models play a key role in many software engineering tasks; unfortunately, these models either are not available during software development or, if available, quickly become outdated as implementations evolve. Model inference techniques have been proposed as a viable solution to extract finite state models from execution logs. However, existing techniques do not scale well when processing very large logs that can be commonly found in practice. In this paper, we address the scalability problem of inferring the model of a component-based system from large system logs, without requiring any extra information. Our model inference technique, called PRINS, follows a divide-and-conquer approach. The idea is to first infer a model of each system component from the corresponding logs; then, the individual component models are merged together taking into account the flow of events across components, as reflected in the logs. We evaluated PRINS in terms of scalability and accuracy, using nine datasets composed of logs extracted from publicly available benchmarks and a personal computer running desktop business applications. The results show that PRINS can process large logs much faster than a publicly available and well-known state-of-the-art tool, without significantly compromising the accuracy of inferred models
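The divide-and-conquer idea can be sketched in Java as follows. This is a hypothetical illustration, not the PRINS implementation: the LogEntry type and the inferComponentModel and stitch placeholders are assumptions standing in for the real per-component inference and merging steps.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PrinsStyleSketch {

    /** Placeholder for a component-level finite state model. */
    interface Model { }

    /** Assumed log entry: which component emitted which event, and when. */
    record LogEntry(String component, String event, long timestamp) { }

    /** Hypothetical per-component inference step (any off-the-shelf FSM miner). */
    static Model inferComponentModel(List<LogEntry> componentLog) {
        throw new UnsupportedOperationException("plug in a model inference tool");
    }

    /** Hypothetical merge step guided by the cross-component event flow. */
    static Model stitch(Map<String, Model> componentModels, List<LogEntry> systemLog) {
        throw new UnsupportedOperationException("merge models following the log order");
    }

    static Model inferSystemModel(List<LogEntry> systemLog) {
        // 1. Split the system log per component (the "divide" step).
        Map<String, List<LogEntry>> perComponent = systemLog.stream()
                .collect(Collectors.groupingBy(LogEntry::component));

        // 2. Infer one model per component, each from a much smaller log.
        Map<String, Model> componentModels = new HashMap<>();
        perComponent.forEach((name, log) ->
                componentModels.put(name, inferComponentModel(log)));

        // 3. Merge the component models according to the flow of events in the log.
        return stitch(componentModels, systemLog);
    }
}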
Quality of Information in Mobile Crowdsensing: Survey and Research Challenges
Smartphones have become the most pervasive devices in people's lives, and are
clearly transforming the way we live and perceive technology. Today's
smartphones benefit from almost ubiquitous Internet connectivity and come
equipped with a plethora of inexpensive yet powerful embedded sensors, such as
accelerometer, gyroscope, microphone, and camera. This unique combination has
enabled revolutionary applications based on the mobile crowdsensing paradigm,
such as real-time road traffic monitoring, air and noise pollution measurement,
crime control, and wildlife monitoring, to name a few. Unlike prior sensing
paradigms, humans are now the primary actors of the sensing process, as they
are essential to retrieving reliable and up-to-date information about the
monitored event. As humans may behave unreliably or
maliciously, assessing and guaranteeing Quality of Information (QoI) becomes
more important than ever. In this paper, we provide a new framework for
defining and enforcing the QoI in mobile crowdsensing, and analyze in depth the
current state-of-the-art on the topic. We also outline novel research
challenges, along with possible directions of future work.
Data-driven modelling and probabilistic analysis of interactive software usage
This paper answers the research question: how can we model and understand the ways in which users actually interact with software, given that usage styles vary from user to user, and even from use to use for an individual user? Our first contribution is to introduce two new probabilistic admixture models, inferred from sets of logged user traces, which include observed and latent states. The models encapsulate the temporal and stochastic aspects of usage, the heterogeneous and dynamic nature of users, and the temporal aspects of the time interval over which the data was collected (e.g. one day, one month, etc.). A key concept is activity patterns, which encapsulate common observed temporal behaviours shared across a set of logged user traces. Each activity pattern is a discrete-time Markov chain in which observed variables label the states; latent states specify the activity patterns. The second contribution is how we use parametrised, probabilistic, temporal logic properties to reason about hypothesised behaviours within an activity pattern, and between activity patterns. Different combinations of inferred model and hypothesised property afford a rich set of techniques for understanding software usage. The third contribution is a demonstration of the models and temporal logic properties by application to user traces from a software application that has been used by tens of thousands of users worldwide.
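As a concrete, purely illustrative reading of an activity pattern, the Java sketch below represents one pattern as a discrete-time Markov chain over observed event labels and scores a logged trace by its probability under that pattern. The event names, transition values, and the ActivityPattern class itself are hypothetical and not taken from the paper.

import java.util.Arrays;
import java.util.List;

public class ActivityPattern {
    private final String[] states;        // observed event labels
    private final double[] initial;       // initial state distribution
    private final double[][] transition;  // row-stochastic transition matrix

    ActivityPattern(String[] states, double[] initial, double[][] transition) {
        this.states = states;
        this.initial = initial;
        this.transition = transition;
    }

    private int index(String state) {
        return Arrays.asList(states).indexOf(state);
    }

    /** Probability that this pattern generates the given logged trace. */
    double traceProbability(List<String> trace) {
        if (trace.isEmpty()) return 1.0;
        int prev = index(trace.get(0));
        double p = initial[prev];
        for (int t = 1; t < trace.size(); t++) {
            int cur = index(trace.get(t));
            p *= transition[prev][cur];
            prev = cur;
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy pattern over three UI events; the numbers are illustrative only.
        ActivityPattern browse = new ActivityPattern(
                new String[] {"open", "search", "export"},
                new double[] {1.0, 0.0, 0.0},
                new double[][] {
                        {0.1, 0.8, 0.1},
                        {0.3, 0.5, 0.2},
                        {0.0, 0.0, 1.0}});
        // Prints 0.16 = 1.0 * P(open->search) * P(search->export).
        System.out.println(browse.traceProbability(
                Arrays.asList("open", "search", "export")));
    }
}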
MapReduce based RDF assisted distributed SVM for high throughput spam filtering
This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University.
Electronic mail has become deeply embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability, and its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing their respective strengths and weaknesses.
Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, a M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
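The general partition-and-train structure behind such a distributed SVM can be sketched with the standard Hadoop MapReduce Java API. The sketch below is an assumption-laden illustration rather than the actual MRSMO algorithm: the partitioning scheme is a simple hash, and trainLocalSvm is a placeholder for whatever sequential SVM solver (e.g. SMO) is plugged in on each training subset.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistributedSvmSketch {

    static final int NUM_PARTITIONS = 8; // assumed number of training subsets

    /** Route each labelled training example (one per input line) to a partition. */
    public static class PartitionMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text example, Context context)
                throws IOException, InterruptedException {
            int partition = Math.floorMod(example.hashCode(), NUM_PARTITIONS);
            context.write(new IntWritable(partition), example);
        }
    }

    /** Train one local SVM per partition; the trainer itself is a placeholder. */
    public static class TrainReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable partition, Iterable<Text> examples, Context context)
                throws IOException, InterruptedException {
            List<String> subset = new ArrayList<>();
            for (Text example : examples) {
                subset.add(example.toString());
            }
            // Placeholder: any sequential SVM trainer run on this subset only.
            String serializedModel = trainLocalSvm(subset);
            context.write(partition, new Text(serializedModel));
        }

        private String trainLocalSvm(List<String> subset) {
            throw new UnsupportedOperationException("plug in a sequential SVM solver");
        }
    }
}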
Effectively exploiting large-scale, ‘Cloud’-based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneity-aware task-to-node matching and allocation scheme, is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the-box Hadoop counterpart in a typical Cloud-based infrastructure.
The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF-based, end-user feedback.