A unified view of data-intensive flows in business intelligence systems : a survey
Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry must therefore have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing the challenges that are still to be addressed and how current solutions can be applied to address them. Peer reviewed. Postprint (author's final draft).
NOSQL design for analytical workloads: Variability matters
Big Data has recently gained popularity and has strongly challenged relational databases as universal storage systems, especially in the presence of analytical workloads. As a result, co-relational alternatives, commonly known as NOSQL (Not Only SQL) databases, are extensively used for Big Data. As the primary focus of NOSQL is on performance, NOSQL databases are designed directly at the physical level, and consequently the resulting schema is tailored to the dataset and access patterns of the problem at hand. However, we believe that NOSQL design can also benefit from traditional design approaches. In this paper we present a method to design databases for analytical workloads. Starting from the conceptual model and adopting the classical 3-phase design used for relational databases, we propose a novel design method that considers the new features brought by NOSQL and encompasses both relational and co-relational design. Peer reviewed. Postprint (author's final draft).
H-WorD: Supporting job scheduling in Hadoop with workload-driven data redistribution
The final publication is available at http://link.springer.com/chapter/10.1007/978-3-319-44039-2_21
Today’s distributed data processing systems typically follow a query shipping approach and exploit data locality to reduce network traffic. In such systems the distribution of data over the cluster resources plays a significant role, and when skewed, it can harm the performance of executing applications. In this paper, we address the challenges of automatically adapting the distribution of data in a cluster to the workload imposed by the input applications. We propose a generic algorithm, named H-WorD, which, based on the estimated workload over resources, suggests alternative execution scenarios of tasks, and hence identifies the required transfers of input data a priori, so that data is brought close to the execution in time. We exemplify our algorithm in the context of MapReduce jobs in a Hadoop ecosystem. Finally, we evaluate our approach and demonstrate the performance gains of automatic data redistribution. Peer reviewed. Postprint (author's final draft).
Requirement-driven creation and deployment of multidimensional and ETL designs
We present our tool for assisting designers in the error-prone and time-consuming tasks carried out at the early stages of a data warehousing project. Our tool semi-automatically produces multidimensional (MD) and ETL conceptual designs from a given set of business requirements (like SLAs) and data source descriptions. Subsequently, our tool translates both the MD and ETL conceptual designs produced into physical designs, so they can be further deployed on a DBMS and an ETL engine. In this paper, we describe the system architecture and present our demonstration proposal by means of an example. Peer reviewed. Postprint (author's final draft).
High-pressure optical and vibrational properties of Ga2O3 nanocrystals
Master's in Nanoscience and Nanotechnology, Faculty of Physics, Universitat de Barcelona, Academic year: 2016-2017. Supervisors: Jordi Ibáñez, Sergi Hernández
In this project the optical and vibrational properties of monoclinic gallium oxide (β-Ga2O3) nanocrystals (NCs) are studied by Raman scattering spectroscopy under high hydrostatic pressure, from ambient pressure up to 21.6 GPa. Phonon pressure coefficients and Grüneisen parameters are obtained for different optical phonon modes of nanocrystalline β-Ga2O3. In the first part of the work, the investigated material is characterized by means of different techniques, namely X-ray diffraction (XRD), scanning electron microscopy (SEM) and Raman scattering. While XRD and SEM confirm the nanocrystalline nature of the investigated sample, the Raman spectra allow us to identify the Raman-active modes of β-Ga2O3 at ambient pressure. By monitoring their peak positions at different pressures, phonon pressure coefficients for several of the optical Raman-active modes of β-Ga2O3 have been determined, with values significantly lower than those reported in previous works for bulk β-Ga2O3. This suggests that the compressibility of the NCs could be reduced with respect to the bulk material. In order to test the validity of the experimental data, density functional theory calculations of the structural properties of bulk β-Ga2O3 have also been performed as a function of pressure. From the ab initio calculations we obtain a bulk modulus of 160.7 ± 5.0 GPa for bulk β-Ga2O3, which is comparable to, and even lower than, the value measured in previous works for bulk material by means of synchrotron XRD as a function of pressure (~200 GPa). Our theoretical results thus confirm that the lower compressibility of the β-Ga2O3 NCs studied in this work may be a consequence of the nanocrystalline nature of the investigated material. The possible physical mechanisms giving rise to this observation are discussed in terms of similar results reported in the literature. It is concluded that more work dealing with the high-pressure structural and vibrational properties of β-Ga2O3 samples of different origin (i.e., bulk vs. NCs, and doped vs. undoped material) should be performed in order to fully understand the origin of the lower compressibility displayed by the nanocrystalline β-Ga2O3 sample studied in this work.
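For readers unfamiliar with the quantities named in this abstract, the textbook relation between a phonon pressure coefficient and the corresponding mode Grüneisen parameter is sketched below. The symbols ($\omega_i$ for the frequency of mode $i$, $P$ for pressure, $V$ for volume, $B_0$ for the bulk modulus) are notation assumed here and are not taken from the thesis itself.

```latex
% Mode Grüneisen parameter of phonon mode i (standard definition),
% expressed through the measured pressure coefficient d(omega_i)/dP
\gamma_i \;=\; -\,\frac{\partial \ln \omega_i}{\partial \ln V}
        \;=\; \frac{B_0}{\omega_i}\,\frac{\partial \omega_i}{\partial P},
\qquad
B_0 \;=\; -\,V\,\frac{\partial P}{\partial V}
```

Under this relation, the value chosen for $B_0$ scales every $\gamma_i$ derived from the measured pressure coefficients, which is why the bulk modulus matters for the interpretation.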
Towards information profiling: data lake content metadata management
There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently no systematic approach for this kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to handle this effectively. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case study from the OpenML DL, which showcases the value and feasibility of our approach. Peer reviewed. Postprint (author's final draft).
Automatically configuring parallelism for hybrid layouts
Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always determined by the total file size. However, this can lead to launching more tasks than needed in the case of hybrid layouts, because they help to read less data for certain operations (i.e., projection, selection). The over-provisioning of tasks may increase the job execution time and induce a significant waste of computing resources. The latter is due to the fact that each task introduces extra overhead (e.g., initialization, garbage collection).
To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be used in a multi-objective approach to decide both the number of tasks and the number of machines for execution. Peer reviewed. Postprint (author's final draft).
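The contrast described in this abstract, sizing the task count by total file size versus by the data actually read, can be illustrated with a minimal sketch. This is not the paper's cost model: the function names, the fixed split size, and the `column_fraction` and `row_fraction` parameters are hypothetical, introduced only to show the idea.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def tasks_by_file_size(total_bytes, split_bytes):
    """Baseline: one task per split of the whole file."""
    return max(1, math.ceil(total_bytes / split_bytes))


def tasks_by_data_read(total_bytes, split_bytes, column_fraction, row_fraction):
    """Illustrative alternative: size the task count by the bytes a
    hybrid layout would actually read after projection and selection.

    column_fraction and row_fraction are hypothetical estimates (0..1)
    of the share of columns and rows the job touches.
    """
    bytes_read = total_bytes * column_fraction * row_fraction
    return max(1, math.ceil(bytes_read / split_bytes))


baseline = tasks_by_file_size(10 * GIB, 128 * MIB)
reduced = tasks_by_data_read(10 * GIB, 128 * MIB,
                             column_fraction=0.25, row_fraction=0.5)
print(baseline, reduced)
```

With these assumed inputs, the job reads one eighth of the file, so the sketch launches one eighth of the baseline tasks and avoids the per-task overhead of the rest.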
New records for three hybrids of "Asplenium" ("Aspleniaceae", Pteridophyta) in the Iberian Peninsula
New localities in the Iberian Peninsula are reported for the following Asplenium hybrids: Asplenium x protomajoricum nothosubsp. protomajoricum (= A. fontanum subsp. fontanum x A. petrarchae subsp. bivalens), A. x recoderi nothosubsp. recoderi (= A. fontanum subsp. fontanum x A. ruta-muraria subsp. ruta-muraria) and A. x sleepiae nothosubsp. sleepiae (= A. obovatum subsp. lanceolatum x A. foreziense). Data from the cytological analysis are provided and the main morphological characters of the fronds are described.
Keeping the data lake in form: DS-kNN datasets categorization using proximity mining
With the growth of the number of datasets stored in data repositories, there has been a trend of using Data Lakes (DLs) to store such data. DLs store datasets in their raw formats without any transformations or preprocessing, and the data is accessed using schema-on-read. This makes it difficult for analysts to find datasets that can be crossed and that belong to the same topic. To support them in this DL governance challenge, we propose in this paper an algorithm for categorizing datasets in the DL into pre-defined topic-wise categories of interest. For this task we utilise a k-NN approach, which uses a proximity score to compute similarities between datasets based on metadata. We test our algorithm on a real-life DL with a known ground-truth categorization. Our approach is successful in detecting the correct categories for datasets and outliers, with a precision of more than 90% and recall rates exceeding 75% in specific settings. Peer reviewed. Postprint (author's final draft).
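The general shape of k-NN categorization over a metadata-based proximity score can be sketched as follows. This is an illustration under stated assumptions, not the DS-kNN algorithm itself: the proximity score here is plain Jaccard similarity over attribute names, and `proximity`, `categorize`, `k` and `min_score` are hypothetical names and parameters.

```python
from collections import Counter


def proximity(attrs_a, attrs_b):
    """Assumed proximity score: Jaccard similarity of attribute-name sets."""
    a, b = set(attrs_a), set(attrs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(new_attrs, labeled, k=3, min_score=0.2):
    """Return the majority category among the k most similar labeled
    datasets, or None (an outlier) if no neighbour reaches min_score."""
    scored = sorted(
        ((proximity(new_attrs, attrs), category) for attrs, category in labeled),
        key=lambda pair: pair[0],
        reverse=True,
    )
    neighbours = [category for score, category in scored[:k] if score >= min_score]
    if not neighbours:
        return None
    return Counter(neighbours).most_common(1)[0][0]


# Hypothetical data lake with a known categorization.
labeled = [
    (["age", "sex", "diagnosis"], "health"),
    (["age", "blood_pressure", "diagnosis"], "health"),
    (["price", "volume", "ticker"], "finance"),
    (["price", "ticker", "date"], "finance"),
]

print(categorize(["age", "diagnosis", "weight"], labeled))
print(categorize(["lat", "lon"], labeled))
```

The similarity threshold is what lets the sketch flag outliers: a dataset whose metadata resembles nothing in the lake is left uncategorized rather than forced into the nearest topic.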
ResilientStore: a heuristic-based data format selector for intermediate results
The final publication is available at link.springer.com
Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. However, if materialized, these results become reusable, so subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results, but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format. Peer reviewed. Postprint (author's final draft).
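A rule-based selector of the kind this abstract describes might look like the sketch below. The rules, the 0.5 threshold, and the `Access` and `choose_format` names are hypothetical and are not ResilientStore's actual heuristics. The only facts relied on are that Parquet is a columnar format, while Avro and SequenceFile are row-oriented.

```python
from dataclasses import dataclass


@dataclass
class Access:
    """Hypothetical description of one subsequent operator's access pattern."""
    columns_fraction: float      # share of columns the operator reads (0..1)
    needs_schema: bool = False   # whether the operator relies on an embedded schema


def choose_format(accesses):
    """Pick a storage format for a materialization point.

    Illustrative rules only:
    1. mostly narrow projections  -> columnar layout (Parquet)
    2. wide reads needing a schema -> Avro
    3. otherwise                   -> SequenceFile
    """
    if not accesses:
        return "SequenceFile"
    avg_columns = sum(a.columns_fraction for a in accesses) / len(accesses)
    if avg_columns < 0.5:
        return "Parquet"
    if any(a.needs_schema for a in accesses):
        return "Avro"
    return "SequenceFile"


print(choose_format([Access(0.1), Access(0.3)]))
print(choose_format([Access(1.0, needs_schema=True)]))
print(choose_format([Access(1.0)]))
```

The sketch keys its rules on the consumers of an intermediate result rather than on the data itself, so the same result could be stored differently depending on which operators read it next.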