
1,324 research outputs found

Scalable aggregation predictive analytics: a query-driven machine learning approach

Author: Anagnostopoulos Christos
Savva Fotis
Triantafillou Peter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2018

We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology uses historical queries and their answers to accurately predict the answers to ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.

Warwick Research Archives Portal Repository

Enlighten

Query-driven learning for predictive analytics of data subspace cardinality

Author: Anagnostopoulos Christos
Triantafillou Peter
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/06/2017

Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly financially, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain/maintain, or (iii) infeasible, e.g., because of privacy issues. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate in terms of prediction and accommodates the well-known selection queries: multi-dimensional range and distance-nearest neighbors (radius) queries. Our function estimation model: (i) quantizes the vectorial query space, by learning the analysts’ access patterns over a data space, (ii) associates query vectors with their corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers a performance that is superior to that of data-driven approaches.
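
As a rough illustration of steps (i) to (iii) in the abstract above, the following is a minimal sketch and not the authors' model. It assumes queries are encoded as fixed-length numeric vectors, uses a simple online vector quantizer with a distance threshold, and predicts with inverse-distance weighting; the class name, the `radius` and `lr` parameters, and the weighting scheme are all hypothetical choices made for this example. The change-detection step (iv) based on optimal stopping is not covered.

```python
import math


class QueryDrivenCardinalityEstimator:
    """Toy estimator that learns only from (query vector, COUNT answer) pairs."""

    def __init__(self, radius=1.0, lr=0.1):
        self.radius = radius      # hypothetical: max distance for reusing a prototype
        self.lr = lr              # hypothetical: step size of the online updates
        self.prototypes = []      # quantized query vectors
        self.cardinalities = []   # running cardinality estimate per prototype

    def _nearest(self, q):
        """Return (index, distance) of the closest prototype, or (None, inf)."""
        best_i, best_d = None, math.inf
        for i, p in enumerate(self.prototypes):
            d = math.dist(p, q)
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d

    def observe(self, q, count):
        """Learn from one executed query and its observed cardinality."""
        i, d = self._nearest(q)
        if i is None or d > self.radius:
            # Unexplored region of the query space: open a new prototype.
            self.prototypes.append(list(q))
            self.cardinalities.append(float(count))
            return
        # Otherwise move the prototype and its cardinality toward the sample.
        p = self.prototypes[i]
        for k in range(len(p)):
            p[k] += self.lr * (q[k] - p[k])
        self.cardinalities[i] += self.lr * (count - self.cardinalities[i])

    def predict(self, q):
        """Predict the cardinality of an unseen query without touching the data."""
        if not self.prototypes:
            raise ValueError("no queries observed yet")
        weights = [1.0 / (math.dist(p, q) + 1e-9) for p in self.prototypes]
        total = sum(weights)
        return sum(w * c for w, c in zip(weights, self.cardinalities)) / total
```

A query vector here might be, for instance, the center coordinates and radius of a range query. Calling `observe` after each executed query and `predict` for a new one never accesses the underlying dataset, which is the property the abstract emphasizes.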

Warwick Research Archives Portal Repository

Enlighten

Statistique et Big Data Analytics; Volumétrie, L'Attaque des Clones

Author: Besse Philippe
Vialaneix Nathalie
Publication venue: HAL CCSD
Publication date: 03/10/2014

This article assumes that the reader has already acquired the skills and expertise of a statistician in unsupervised (NMF, k-means, SVD) and supervised learning (regression, CART, random forest). What skills and knowledge must a statistician acquire to reach the "Volume" scale of big data? After a quick overview of the different strategies available and especially of those imposed by Hadoop, the algorithms of some available learning methods are outlined in order to understand how they are adapted to the strong constraints of the Map-Reduce functionalities.

Scientific Publications of the University of Toulouse II Le Mirail

HAL Descartes

HAL-INSA Toulouse

Hal-Diderot

FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

Author: Basgall María José
Fernández Alberto
Naiouf Ricardo Marcelo
Publication venue: 'MDPI AG'
Publication date: 01/08/2021

In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.

Affiliation: Basgall, María José. Universidad de Granada; Spain. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Affiliation: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina
Affiliation: Fernández, Alberto. Universidad de Granada; Spain

CONICET Digital

StreamApprox: Approximate Computing for Stream Analytics

Author: Bhatotia Pramod
Chen Ruichuan
Fetzer Christof
Hilt Volker
Quoc Do Le
Strufe Thorsten
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/12/2017

Edinburgh Research Explorer

Infrastructure for Detector Research and Development towards the International Linear Collider

Author: Abramowicz H.
Aguilar J.
Alozy J.
Ambalathankandy P.
Andricek L.
Anduze M.
Aplin S.
Apostolakis J.
Aspell P.
Attie D.
Bachynska O.
Bailey D.S.
Bamberger A.
Bartsch V.
Bassignana D.
Behnke T.
Behr J.
Ben-Hamu Y.
Benyamna M.
Bergauer T.
Bergsma F.
Besson A.
Beyer E.
Bilevych Y.
Boisvert V.
Bonis J.
Bonnard J.
Bonnemaison A.
Boudry V.
Brezina Ch.
Brient J.C.
Bryngemark L.
Bulgheroni A.
Caccia M.
Calderone A.
Callier S.
Calvet D.
Campbell M.
Carballo V.M.Blanco
Carloganu C.
Cauchois A.
Charpy A.
Chefdeville M.
Christiansen P.
Claus G.
Clerc C.
Colas P.
Coppolani X.
Cornat R.
Cornebise P.
Corrin E.
Cotta-Ramusino A.
Cudie X.Llopart
Cussans D.G.
Cvach J.
Da Silva W.
Daniluk W.
David J.
de Freitas P.Mora
de Gaspari M.
de la Taille Ch.
De Lentdecker G.
de Masi R.
de Nooij L.
Degerli Y.
Dehmelt K.
Delagnes E.
Desch K.
Dewulf J.P.
Dhellot M.
Diener R.
Dolezal Z.
Doziere G.
Dragicevic M.
Drasal Z.
Dulinski W.
Dulucq F.
Dzahini D.
Eigen G.
Engels J.
Fehr F.
Fernandez M.
Fischer P.
Fiutowski T.
Fleury J.
Formenti F.
Fransen M.
Friedl M.
Frotin M.
Furletova J.
Gadow K.
Gaede F.
Garcia E.Garcia
Garutti E.
Gastaldi F.
Gay P.
Gelin M.
Ghislain P.
Giannelli M.Faucci
Giomataris I.
Giraud J.
Giudice P.A.
Goffe M.
Goodrick M.J.
Gottlicher P.
Green B.
Green M.G.
Grefe Ch.
Gregor I.M.
Grichine V.
Grondin D.
Gross P.
Guilhem G.
Haas D.
Haas T.
Haensel S.
Hartjes F.
Hauschild M.
Heath H.F.
Henschel H.
Himmi A.
Hommels L.B.A.
Hostachy J.Y.
Hu-Guo Ch.
Idzik M.
Imbault D.
Irmler C.
Ivantchenko V.
Janata M.
Janssen X.
Jaramillo R.
Jastrzab M.
Jauffret C.
Jeans D.
Jikhleb I.
Jonsson L.
Kalliopuska J.
Kaminski J.
Kananov S.
Kapusta F.
Karar A.
Kaukher A.
Kehrli A.
Kelly M.
Kielar E.
Kiesenhofer W.
Killenberg M.
Kloukinas K.
Kockner F.
Kodys P.
Koetz U.
Koffmane Ch.
Kohli M.
Kotula J.
Krammer M.
Krautscheid T.
Kruger H.
Kulis Sz.
Kvasnicka J.
Kvasnicka P.
Lange W.
Levy A.
Levy I.
Libov V.
Linssen L.
Ljunggren M.
Lohmann W.
Lozano M.
Lundberg B.
Lupberger M.
Lutz B.
Lutz P.
Mandry S.
Mannen S.
Marchioro A.
Marcisovsky M.
Martin-Chassard G.
Mathieu A.
Mehtaelae P.
Misiejuk A.
Mjornmark U.
Mnich J.
Morel F.
Morin L.
Moser H.G.
Moszczynski A.
Muhl C.
Munoz F.J.
Musa L.
Musat G.
Ninkovich J.
Ohlerich M.
Oliwa K.
Orava R.
Orsini F.
Oskarsson A.
Osterman L.
Page R.F.
Pawlik B.
Pellegrini G.
Peric I.
Pham T.Hung
Piemontese L.
Pohl M.
Polak I.
Poschl R.
Postranecky M.
Potylitsina-Kube N.
Prahl V.
Przyborowski D.
Quirion D.
Ratti L.
Raux L.
Re V.
Reinecke M.
Renz U.
Reuen L.
Rialot M.
Ribon A.
Richert T.
Richter R.
Roloff P.
Rosemann Ch.
Rouge A.
Royer L.
Ruan M.
Rubinski Igor
Rummel S.
Sadeh I.
Santos H.Franca
Savoy-Navarro A.
Schade P.
Schafer O.
Schroder H.
Schumacher M.
Schuwalov S.
Schwartz R.
Sefkow F.
Sefri R.
Seguin-Moreau N.
Senee F.
Shaw R.
Sicho P.
Smolik J.
Stenlund E.
Stern A.
Swientek K.
Terwort M.
Timmermans J.
Trampitsch G.
Traversi G.
Uzhinskiy V.
Valentan M.
Valin I.
van der Graaf H.
van Remortel N.
Vanel J.C.
Velthuis J.J.
Videau H.
Vila I.
Volkenborn R.
Vrba V.
Wang W.
Ward D.R.
Warren M.
Wicek F.
Wienemann P.
Wierba W.
Wing M.
Winter M.
Wu T.
Wurth R.
Yang Y.
Zalesak J.
Zarnecki A.F.
Zawiejski L.
Zimmermann R.
Zimmermann S.
Zwerger Andreas
Publication date: 23/01/2012

The EUDET project was launched to create an infrastructure for developing and testing new and advanced detector technologies to be used at a future linear collider. The aim was to enable institutes to carry out experiments and data analysis that they otherwise could not have realized for lack of resources. The infrastructure comprised an analysis and software network, and instrumentation infrastructures for tracking detectors as well as for calorimetry.

Comment: 54 pages, 48 pictures

arXiv.org e-Print Archive

DESY Publication Database

DESY

CERN Document Server

Explore Bristol Research

Efficient Large-scale Distance-Based Join Queries in SpatialHadoop

Author: Corral Liria Antonio Leopoldo
García García Francisco
Iribarne Martínez Luis Fernando
Manolopoulos Yannis
Vassilakopoulos Michael
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017
Field of study

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and best-known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.
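
To make the K Closest Pairs Query mentioned in the abstract above more concrete, here is a minimal single-machine sketch of a plane-sweep style KCPQ. It is not the paper's SpatialHadoop algorithm and involves no partitioning or parallelism; it assumes 2D points given as `(x, y)` tuples and Euclidean distance, and the function name `k_closest_pairs` is hypothetical. The idea shown is only the pruning step: once K candidate pairs are known, any pair whose x-distance alone exceeds the current K-th best distance can be skipped.

```python
import bisect
import heapq
import math


def k_closest_pairs(P, Q, k):
    """Return the k closest (distance, p, q) pairs with p in P and q in Q."""
    if k <= 0:
        return []
    Q_sorted = sorted(Q)                  # sweep order: by x, then y
    qx = [q[0] for q in Q_sorted]
    heap = []                             # max-heap of the best pairs (negated distance)

    def current_bound():
        # Distance of the k-th best pair so far, or infinity while not full.
        return -heap[0][0] if len(heap) == k else math.inf

    for p in sorted(P):
        # Skip points of Q that lie too far to the left of p to matter.
        start = bisect.bisect_left(qx, p[0] - current_bound())
        for j in range(start, len(Q_sorted)):
            q = Q_sorted[j]
            delta = current_bound()
            if q[0] - p[0] > delta:
                break                     # all remaining q are even farther right
            d = math.dist(p, q)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p, q))
            elif d < delta:
                heapq.heapreplace(heap, (-d, p, q))
    return sorted((-nd, p, q) for nd, p, q in heap)
```

The ε Distance Join Query can be sketched the same way by replacing the shrinking bound with the fixed threshold ε and collecting every pair within it instead of keeping a heap.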

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional de la Universidad de Almería (Spain)