Evaluating espresso coffee quality by means of time-series feature engineering
Espresso quality attracts the interest of many stakeholders: from consumers to local business activities, from coffee-machine vendors to international coffee industries. So far, it has been mostly addressed by means of human experts, electronic noses, and chemical approaches. The current work, instead, proposes a data-driven analysis exploiting time-series feature engineering. We analyze a real-world dataset of espresso brewing by professional coffee-making machines. The novelty of the proposed work lies in its focus on the brewing time series, from which we propose to engineer features able to improve previous data-driven metrics determining the quality of the espresso. Thanks to the proposed features, better quality-evaluation predictions are achieved with respect to previous data-driven approaches that relied solely on metrics describing each brewing as a whole (e.g., average flow, total amount of water). Yet, the engineered features are simple to compute and add a very limited workload to the coffee-machine sensor-data collection device, hence being suitable for large-scale IoT installations on board professional coffee machines, such as those typically installed in consumer-oriented business activities, shops, and workplaces. To the best of the authors' knowledge, this is the first attempt to perform a data-driven analysis of real-world espresso-brewing time series. The presented results show a three-fold improvement in the classification accuracy of high-quality espresso coffees with respect to current data-driven approaches (from 30% to 100%), exploiting simple threshold-based quality evaluations defined in the newly proposed feature space.
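The contrast drawn in this abstract, between whole-brew metrics and features engineered from the brewing time series, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature names `peak_flow` and `time_to_peak`, the sampling interval `dt`, the units, and both threshold values.

```python
from statistics import mean


def brewing_features(flow, dt=0.1):
    """Summarise one brewing flow series (hypothetical: ml/s, one sample every dt seconds)."""
    if not flow:
        raise ValueError("flow series must not be empty")
    peak = max(flow)
    return {
        # Whole-brew metrics of the kind the abstract attributes to earlier approaches.
        "avg_flow": mean(flow),
        "total_volume": sum(flow) * dt,
        # Illustrative time-series features (assumed, not the paper's feature set).
        "peak_flow": peak,
        "time_to_peak": flow.index(peak) * dt,
    }


def is_high_quality(features, max_time_to_peak=1.0, min_peak_flow=2.0):
    """Threshold rule in the feature space; both thresholds are placeholder values."""
    return (
        features["time_to_peak"] <= max_time_to_peak
        and features["peak_flow"] >= min_peak_flow
    )


if __name__ == "__main__":
    feats = brewing_features([0.0, 1.0, 3.0, 2.0, 2.0], dt=0.5)
    print(feats, is_high_quality(feats))
```

Each feature needs only a single pass over the samples, which is consistent with the abstract's point that such features add little workload to the data-collection device.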
In-Network Outlier Detection in Wireless Sensor Networks
To address the problem of unsupervised outlier detection in wireless sensor
networks, we develop an approach that (1) is flexible with respect to the
outlier definition, (2) computes the result in-network to reduce both bandwidth
and energy usage, (3) only uses single hop communication thus permitting very
simple node failure detection and message reliability assurance mechanisms
(e.g., carrier-sense), and (4) seamlessly accommodates dynamic updates to data.
We examine performance using simulation with real sensor data streams. Our
results demonstrate that our approach is accurate and imposes a reasonable
communication load and level of power consumption. Comment: Extended version of a paper appearing in the Int'l Conference on
Distributed Computing Systems 200
Challenges in managing real-time data in health information system (HIS)
© Springer International Publishing Switzerland 2016. In this paper, we discuss the challenges in handling real-time medical big data collection and storage in health information systems (HIS). Based on these challenges, we propose a model for real-time analysis of medical big data. We exemplify the approach through Spark Streaming and Apache Kafka, using the processing of a health big data stream. Apache Kafka works very well in transporting data among different systems such as relational databases, Apache Hadoop and non-relational databases. However, Apache Kafka lacks the ability to analyze the stream, whereas the Spark Streaming framework can perform some operations on the stream. We have identified the challenges in current real-time systems and proposed our solution to cope with medical big data streams.
A feature selection method for classification within functional genomics experiments based on the proportional overlapping score
Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
Forecasting: theory and practice
Forecasting has always been at the forefront of decision making and planning. The uncertainty that surrounds the future is both exciting and challenging, with individuals and organisations seeking to minimise risks and maximise utilities. The large number of forecasting applications calls for a diverse set of forecasting methods to tackle real-life challenges. This article provides a non-systematic review of the theory and the practice of forecasting. We provide an overview of a wide range of theoretical, state-of-the-art models, methods, principles, and approaches to prepare, produce, organise, and evaluate forecasts. We then demonstrate how such theoretical concepts are applied in a variety of real-life contexts. We do not claim that this review is an exhaustive list of methods and applications. However, we hope that our encyclopedic presentation will offer a point of reference for the rich work that has been undertaken over the last decades, with some key insights for the future of forecasting theory and practice. Given its encyclopedic nature, the intended mode of reading is non-linear. We offer cross-references to allow readers to navigate through the various topics. We complement the theoretical concepts and applications covered with large lists of free or open-source software implementations and publicly available databases.
Business analytics in industry 4.0: a systematic review
Recently, the term “Industry 4.0” has emerged to characterize several Information and Communication Technology (ICT) adoptions in production processes (e.g., Internet-of-Things, implementation of digital production support information technologies). Business Analytics is often used within Industry 4.0, thus incorporating its data intelligence (e.g., statistical analysis, predictive modelling, optimization) expert system component. In this paper, we perform a Systematic Literature Review (SLR) on the usage of Business Analytics within the Industry 4.0 concept, covering a selection of 169 papers obtained from six major scientific publication sources from 2010 to March 2020. The selected papers were first classified into three major types, namely, Practical Application, Reviews and Framework Proposal. Then, we analysed in more detail the practical application studies, which were further divided into three main categories of the Gartner analytical maturity model: Descriptive Analytics, Predictive Analytics and Prescriptive Analytics. In particular, we characterized the distinct analytics studies in terms of the industry application and data context used, impact (in terms of their Technology Readiness Level) and selected data modelling method. Our SLR analysis provides a mapping of how data-based Industry 4.0 expert systems are currently used, disclosing also research gaps and future research opportunities. The work of P. Cortez was supported by FCT - Fundação para a Ciência e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020. We would like to thank the three anonymous reviewers for their helpful suggestions
A Survey of Bayesian Statistical Approaches for Big Data
The modern era is characterised as an era of information or Big Data. This
has motivated a huge literature on new methods for extracting information and
insights from these data. A natural question is how these approaches differ
from those that were available prior to the advent of Big Data. We present a
review of published studies that present Bayesian statistical approaches
specifically for Big Data and discuss the reported and perceived benefits of
these approaches. We conclude by addressing the question of whether focusing
only on improving computational algorithms and infrastructure will be enough to
face the challenges of Big Data
Exploiting scalable machine-learning distributed frameworks to forecast power consumption of buildings
The pervasive and increasing deployment of smart meters allows collecting a huge amount of fine-grained energy data in different urban scenarios.
The analysis of such data is challenging and opens up a variety of interesting and new research issues across energy and computer science research areas.
The key role of computer scientists is providing energy researchers and practitioners with cutting-edge and scalable analytics engines to effectively support their daily research activities, hence fostering and leveraging data-driven approaches.
This paper presents SPEC, a scalable and distributed engine to predict building-specific power consumption.
SPEC addresses the full analytic stack and exploits a data stream approach over sliding time windows to train a prediction model tailored to each building. The model allows us to predict the upcoming power consumption at a time instant in the near future.
SPEC integrates different machine learning approaches, specifically
ridge regression, artificial neural networks, and random forest regression, to predict fine-grained values of power consumption, and
a classification model, the random forest classifier, to forecast a coarse consumption level.
SPEC exploits state-of-the-art distributed computing frameworks to address the big data challenges in harvesting energy data: the current implementation runs on Apache Spark, the most widespread high-performance data-processing platform, and can natively scale to huge datasets.
As a case study, SPEC has been tested on real data of a heating distribution network and power consumption data collected in a major Italian city.
Experimental results demonstrate the effectiveness of SPEC in forecasting both fine-grained values and coarse levels of power consumption of buildings.
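The sliding-window idea described for SPEC, training a per-building model on the most recent readings and predicting the upcoming value, can be sketched in a few lines. This is a deliberately reduced illustration under stated assumptions, not the SPEC implementation: it runs in plain Python rather than on Apache Spark, it uses only the previous reading as a predictor, and the function names `fit_ridge_1d` and `forecast_next` as well as the defaults `window` and `lam` are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam=1.0):
    """Closed-form ridge regression with one predictor and an unpenalised intercept."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)  # the penalty lam shrinks the slope towards zero
    b = y_mean - w * x_mean
    return w, b


def forecast_next(series, window=24, lam=1.0):
    """Fit on the last `window` readings only, then predict the next reading."""
    recent = series[-window:]
    if len(recent) < 3:
        raise ValueError("need at least three readings in the window")
    xs, ys = recent[:-1], recent[1:]  # pairs (previous reading, next reading)
    w, b = fit_ridge_1d(xs, ys, lam)
    return w * recent[-1] + b


if __name__ == "__main__":
    readings = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up values for one building
    print(forecast_next(readings, window=5, lam=1e-9))
```

Refitting on each new window lets the model follow changes in a building's consumption pattern, at the cost of discarding older history.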