53 research outputs found

    A Logistic Based Mathematical Model to Optimize Duplicate Elimination Ratio in Content Defined Chunking Based Big Data Storage System

    Deduplication is an efficient data reduction technique used to mitigate the problem of huge data volumes in big data storage systems. Content defined chunking (CDC) is the most widely used algorithm in deduplication systems. The expected chunk size is an important parameter of CDC, and it significantly influences the duplicate elimination ratio (DER). We collected two realistic datasets to perform an experiment. The experimental results showed that the current practice of empirically setting the expected chunk size to 4 KB or 8 KB cannot optimize DER. Therefore, we present a logistic-based mathematical model to reveal the hidden relationship between the expected chunk size and the DER. This model provides a theoretical basis for optimizing DER by setting the expected chunk size reasonably. We used the collected datasets to verify this model. The experimental results showed that the R2 values, which describe the goodness of fit, are above 0.9, validating the correctness of this mathematical model. Based on the DER model, we discussed how to bring DER close to the optimum by setting the expected chunk size reasonably.
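
    To make the role of the expected chunk size concrete, the sketch below is a minimal illustrative CDC chunker and DER computation in Python. It is not the paper's implementation: the toy rolling hash, the function names, and the min/max chunk bounds are assumptions (production systems typically use a Rabin or Gear rolling hash). What it shows is that the width of the boundary mask directly sets the average chunk length, which is the parameter whose effect on DER the model describes.

```python
import hashlib

def cdc_chunks(data: bytes, expected_size: int = 8192,
               min_size: int = 2048, max_size: int = 65536):
    """Yield content-defined chunks: a boundary is declared where the low
    bits of a rolling hash are all zero, so chunks average expected_size."""
    mask = expected_size - 1              # expected_size must be a power of two
    start, h = 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF   # toy hash; real CDC uses Rabin/Gear
        length = i - start + 1
        if length >= min_size and ((h & mask) == 0 or length >= max_size):
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

def duplicate_elimination_ratio(chunks) -> float:
    """DER computed as original bytes divided by bytes kept after deduplication."""
    seen, total, stored = set(), 0, 0
    for c in chunks:
        total += len(c)
        digest = hashlib.sha1(c).digest()  # chunk fingerprint for duplicate lookup
        if digest not in seen:
            seen.add(digest)
            stored += len(c)
    return total / stored if stored else 1.0
```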

    Efficient Learning Machines

    Computer science

    Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

    Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various science domains. To this day, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, external statistics packages and custom analysis programs that often run on single workstations are incapable of keeping up with the vast increase in data volume and size. In particular, there is an increasing demand from scientists for large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build upon linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data or be restricted by hard disk latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are: firstly, we show that the columnar storage layer of an in-memory DBMS yields an easy adoption of efficient sparse matrix data types and algorithms. Furthermore, we show that the execution of linear algebra expressions significantly benefits from different techniques that are inspired by database technology. In a novel way, we implemented several of these optimization strategies in LAPEG’s optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, to obviate the need for scientists to select appropriate matrix representations. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation, where we propose a matrix manipulation API and present different mutable matrix types to enable fast inserts and deletes. We conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes columnar in-memory DBMSs attractive as efficient, scalable ad-hoc analysis platforms for scientists.
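
    As an illustration of the density-aware idea, the sketch below estimates the density of a sparse matrix product before choosing an execution kernel. It uses the standard independence-based estimate, which is only an assumption here and is not necessarily how SpProdest works; the function names and the threshold are likewise hypothetical.

```python
def estimate_product_density(d_a: float, d_b: float, k: int) -> float:
    """Expected density of C = A @ B for A (m x k, density d_a) and
    B (k x n, density d_b), assuming independently placed nonzeros."""
    return 1.0 - (1.0 - d_a * d_b) ** k

def choose_kernel(d_a: float, d_b: float, k: int,
                  dense_threshold: float = 0.3) -> str:
    """Pick a storage/kernel combination based on the predicted result density."""
    d_c = estimate_product_density(d_a, d_b, k)
    return "dense" if d_c >= dense_threshold else "sparse (CSR)"

# Example: two 1%-dense operands with a shared dimension of 10,000 already
# yield a fairly dense product (~0.63), so a dense kernel is the better fit.
print(choose_kernel(0.01, 0.01, 10_000))   # -> "dense"
```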

    Big data-driven multimodal traffic management: trends and challenges


    Analysis, Modeling, and Algorithms for Scalable Web Crawling

    This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well-known, yet lack accurate models of produced data volume. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable and high-performance external sorting methods were used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, and spam avoidance, all monumental tasks, especially when limited to the resources of a single machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem: sorting arbitrarily large-scale data using limited resources and accurately capturing the required cost (e.g., time and disk usage). To solve these problems, we present an accurate model of uniqueness probability, i.e., the probability of encountering previously unseen data, and use it to analyze the amount of intermediate data generated by the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary based on the input properties (e.g., frequency distribution), hardware configuration (e.g., main memory size, CPU and disk speed), and the choice of sorting method, and show that our proposed models accurately capture such variation. Furthermore, we propose a novel hash-based method for replacement selection sort, together with its model for the case of duplicate data, where the existing literature is limited to random or mostly-unique data. Note that the classic replacement selection method has the ability to increase the length of sorted runs and reduce their number, both of which directly benefit the merge step of external sorting. However, because its priority-queue-assisted sort operation is inherently slow, the application of replacement selection has been limited. Our hash-based design solves this problem by making the sort phase significantly faster than existing methods, making it a preferred choice. The presented models also enable exact analysis of the hit rates of Least-Recently-Used (LRU) and Random Replacement caches that are used as part of the algorithms presented here. These cache models are more accurate than the ones in the existing literature, since the existing ones mostly assume an infinite stream of data, while our models also work accurately on finite streams (e.g., sampled web graphs, click streams). In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of crawl experience based on the graph properties (e.g., degree distribution). All of these models are presented under a unified umbrella to analyze a set of large-scale information processing algorithms that are streamlined for high performance and scalability.
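
    For reference, the sketch below is the classic priority-queue-based replacement selection that the dissertation's hash-based method is designed to outperform. It is a minimal illustrative Python version, not the dissertation's algorithm; records that cannot extend the current run are tagged for the next one, which is why runs grow beyond the memory size on random input.

```python
import heapq
from typing import Iterable, Iterator, List

def replacement_selection_runs(stream: Iterable[int],
                               memory_slots: int) -> Iterator[List[int]]:
    """Yield sorted runs from an input stream using classic replacement
    selection; with random input, runs average about twice the memory size."""
    it = iter(stream)
    # Heap entries are (run_id, key), so current-run records sort first.
    heap = [(0, x) for _, x in zip(range(memory_slots), it)]
    heapq.heapify(heap)
    current_run, run = 0, []
    while heap:
        run_id, key = heapq.heappop(heap)
        if run_id != current_run:          # all current-run records emitted
            yield run
            run, current_run = [], run_id
        run.append(key)
        try:
            nxt = next(it)
        except StopIteration:
            continue
        # A record smaller than the last output must wait for the next run.
        heapq.heappush(heap, (run_id if nxt >= key else run_id + 1, nxt))
    if run:
        yield run

# Example: with 3 memory slots this input yields two sorted runs,
# [1, 2, 5, 8, 9] and [3, 4, 7].
print(list(replacement_selection_runs([5, 1, 9, 2, 8, 3, 7, 4], 3)))
```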

    Quantity and Quality: Not a Zero-Sum Game

    Quantification of existing theories is a great challenge but also a great chance for the study of language in the brain. While quantification is necessary for the development of precise theories, it demands new methods and new perspectives. In light of this, four complementary methods were introduced to provide a quantitative and computational account of the extended Argument Dependency Model (eADM) from Bornkessel-Schlesewsky and Schlesewsky. First, a computational model of human language comprehension was introduced on the basis of dependency parsing. This model provided an initial comparison of two potential mechanisms for human language processing: the traditional "subject" strategy, based on grammatical relations, and the "actor" strategy, based on prominence and adopted from the eADM. Initial results showed an advantage for the traditional "subject" model in a restricted context; however, the "actor" model demonstrated behavior in a test run that was more similar to human behavior than that of the "subject" model. Next, a computational-quantitative implementation of the "actor" strategy as weighted feature comparison between memory units was used to compare it to other memory-based models from the literature on the basis of EEG data. The "actor" strategy clearly provided the best model, showing a better global fit as well as a better match in all details. Building upon the success in modeling EEG data, the feasibility of estimating free parameters from empirical data was demonstrated. Both the procedure for doing so and the necessary software were introduced and applied at the level of individual participants. Using empirically estimated parameters, the models from the previous EEG experiment were recalculated and yielded similar results, thus reinforcing the previous work. In a final experiment, the feasibility of analyzing EEG data from a naturalistic auditory stimulus was demonstrated, which conventional wisdom says is not possible. The analysis suggested a new perspective on the nature of event-related potentials (ERPs), which does not contradict existing theory yet nonetheless goes against previous intuition. Using this new perspective as a basis, a preliminary attempt at a parsimonious neurocomputational theory of cognitive ERP components was developed.
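
    A minimal sketch of what "weighted feature comparison between memory units" can look like in code is given below. The feature names, weights, and softmax scoring rule are assumptions for illustration only, not the model that was actually fitted to the EEG data.

```python
from math import exp

def match_score(cue: dict, unit: dict, weights: dict) -> float:
    """Weighted count of features on which the retrieval cue and a memory
    unit agree (e.g., prominence features such as animacy or case)."""
    return sum(w for f, w in weights.items()
               if cue.get(f) is not None and cue.get(f) == unit.get(f))

def retrieval_probabilities(cue, units, weights):
    """Softmax over match scores: better-matching units are retrieved more often."""
    scores = [match_score(cue, u, weights) for u in units]
    z = sum(exp(s) for s in scores)
    return [exp(s) / z for s in scores]

# Hypothetical features and weights, purely for illustration.
weights = {"animate": 1.0, "nominative": 0.5, "first_argument": 0.8}
cue = {"animate": True, "nominative": True, "first_argument": True}
units = [{"animate": True, "nominative": False, "first_argument": True},
         {"animate": False, "nominative": True, "first_argument": False}]
print(retrieval_probabilities(cue, units, weights))
```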

    Advanced Automation for Space Missions

    The feasibility of using machine intelligence, including automation and robotics, in future space missions was studied.