
    Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes

    Master's thesis in Computer Science.

    INTRODUCTION: Although hash functions are nothing new, they are not limited to cryptographic purposes. One important field is data fingerprinting, where the purpose is to generate a digest that serves as a fingerprint (or a license plate) uniquely identifying a file. More recently, fuzzy fingerprinting schemes, which sacrifice the avalanche effect in favour of detecting local changes, have come into the spotlight. The main purpose of this project is to find ways to classify text tables, and to discover where potential changes or inconsistencies have happened.

    METHODS: Large parts of this report can be considered applied discrete mathematics, and finite fields and combinatorics have played an important part. Rabin's fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash (NFHash) was created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes and uses a sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables (with partial seeds) was also explored. The fuzzy hashing algorithm was also combined with a k-NN classifier to get an overview of its ability to classify files. In addition to NFHash, Bloom filters combined with Merkle trees have been the most important part of this report: this combination allows a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project have dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep, as well as how bitcoins work. Optimizations have played a crucial role as well; different approaches to a problem might lead to the same solution, but resource consumption can be very different.

    RESULTS: The results have shown that the Merkle tree-based approach can track changes to a table very quickly and efficiently, as it is conservative with CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifier.

    CONCLUSION: Hash functions refer to a very diverse set of algorithms, not just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still be considered to be in their infancy, but much has happened in the last ten years. This project has introduced two new ways to create and compare hashes of similar, yet not necessarily identical, files, and to detect if (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes, and may still need some large-scale testing to sort out potential flaws.
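    The abstract names only NFHash's raw ingredients (a Mersenne-prime modulus and a sliding window), so the sketch below shows a generic rolling fuzzy hash built from those ingredients; it is not the thesis's actual algorithm, and the base, window size, trigger condition, and digest alphabet are illustrative assumptions in the spirit of Spamsum.

```python
# Sketch of a sliding-window fuzzy hash over a Mersenne prime field.
# NOT the NFHash algorithm; all parameters below are assumptions.

def rolling_hashes(data: bytes, window: int = 7):
    """Yield a Rabin-Karp style hash for every window position."""
    m = (1 << 31) - 1                 # Mersenne prime 2^31 - 1 (assumed)
    base = 257                        # assumed polynomial base
    if len(data) < window:
        return
    top = pow(base, window - 1, m)    # weight of the byte sliding out
    h = 0
    for b in data[:window]:           # hash of the first window
        h = (h * base + b) % m
    yield h
    for i in range(window, len(data)):
        h = ((h - data[i - window] * top) * base + data[i]) % m
        yield h

def fuzzy_digest(data: bytes, window: int = 7, trigger: int = 64) -> str:
    """Emit one digest character whenever the rolling hash hits a trigger
    value, so a local edit only perturbs nearby characters of the digest."""
    alphabet = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz0123456789+/")
    return "".join(alphabet[h % 64]
                   for h in rolling_hashes(data, window)
                   if h % trigger == trigger - 1)
```

    Digests produced this way can be compared with an edit distance: similar files share long digest substrings, whereas a cryptographic hash of the same two files would differ completely.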

    A survey of parallel algorithms for fractal image compression

    This paper presents a short survey of the key research work that has been undertaken in the application of parallel algorithms to fractal image compression. The interest in fractal image compression techniques stems from their ability to achieve high compression ratios whilst maintaining a very high quality in the reconstructed image. The main drawback of this compression method is the very high computational cost associated with the encoding phase. Consequently, there has been significant interest in exploiting parallel computing architectures in order to speed up this phase, whilst still maintaining the advantageous features of the approach. This paper presents a brief introduction to fractal image compression, including the iterated function system theory upon which it is based, and then reviews the different techniques that have been, and can be, applied in order to parallelize the compression algorithm.
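    To make the cost argument concrete, here is a hedged sketch of the brute-force search at the heart of a partitioned iterated function system (PIFS) encoder: every range block is tested against every downsampled domain block, and it is exactly this embarrassingly parallel loop that the surveyed approaches distribute. The block size, the plain affine fit, and the omission of rotations and flips are simplifying assumptions.

```python
import numpy as np

def encode_pifs(img: np.ndarray, r: int = 8):
    """Toy PIFS encoder: for each r x r range block, find the 2r x 2r
    domain block whose downsampled, affinely adjusted copy s*D + o best
    matches it. The O(#ranges * #domains) search below is the costly,
    easily parallelised part targeted by the surveyed algorithms."""
    img = img.astype(float)
    h, w = img.shape
    domains = []                      # (position, 2x2-averaged block)
    for y in range(0, h - 2 * r + 1, r):
        for x in range(0, w - 2 * r + 1, r):
            d = img[y:y + 2 * r, x:x + 2 * r]
            domains.append(((y, x), d.reshape(r, 2, r, 2).mean(axis=(1, 3))))
    code = []
    for y in range(0, h - r + 1, r):
        for x in range(0, w - r + 1, r):
            rng = img[y:y + r, x:x + r]
            best = None
            for pos, dom in domains:
                dx = dom.ravel() - dom.mean()
                s = dx @ (rng.ravel() - rng.mean()) / max(dx @ dx, 1e-9)
                o = rng.mean() - s * dom.mean()   # least-squares offset
                err = float(np.sum((s * dom + o - rng) ** 2))
                if best is None or err < best[0]:
                    best = (err, pos, s, o)
            code.append(((y, x), *best[1:]))      # (range, domain, s, o)
    return code
```

    Parallel schemes in the survey typically shard either the range blocks or the domain pool across processors, since each best-match search is independent of the others.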

    Distributed Time Series Analytics

    In recent years time series data has become ubiquitous thanks to affordable sensors and advances in embedded technology. Large amounts of time series data are continuously produced in a wide spectrum of applications, such as sensor networks, medical monitoring and so on. The availability of such large scale time series data highlights the importance of scalable data management, efficient querying and analysis. Meanwhile, in the online setting time series carry invaluable information and knowledge about the real-time status of involved entities or monitored phenomena, which calls for online time series data mining to serve timely decision making or event detection. In this thesis we aim to address these important issues pertaining to scalable and distributed analytics techniques for massive time series data. Concretely, this thesis is centered around the following three topics:

    As the number of sensors that pervade our lives significantly increases (e.g., environmental sensors, mobile phone sensors, IoT applications, etc.), the efficient management of massive amounts of time series from such sensors is becoming increasingly important. The infinite nature of sensor data poses a serious challenge for query processing even in a cloud infrastructure. Traditional raw sensor data management systems based on relational databases lack the scalability to accommodate large scale sensor data efficiently. Thus, distributed key-value stores in the cloud are becoming a prime tool to manage sensor data. However, currently there are no techniques for indexing and/or query optimization of model-view sensor time series data in the cloud. In Chapter 2, we propose an innovative index for modeled segments in key-value stores, namely the KVI-index. The KVI-index consists of two interval indices on the time and sensor value dimensions respectively, each of which has an in-memory search tree and a secondary list materialized in the key-value store.

    The dramatic increase in the availability of data streams fuels the development of many distributed real-time computation engines (e.g., Storm, Samza, Spark Streaming, S4, etc.). In Chapter 3, we focus on a fundamental time series mining task in this new computation paradigm, namely continuously mining dynamic (lagged) correlations in time series via a distributed real-time computation engine. Correlations reveal the hidden and temporal interactions across time series and are widely used in scientific data analysis, data-driven event detection, finance markets and so on. We propose the P2H framework, consisting of a parallelism-partitioning based data shuffling and a hypercube structure based computation pruning method, to enhance both the communication and computation efficiency of mining correlations in the distributed context.

    In numerous real-world applications large datasets collected from observations and measurements of physical entities are inevitably noisy and contain outliers. The outliers in such large and noisy datasets can dramatically degrade the performance of standard distributed machine learning approaches such as regression trees. In Chapter 4 we present a novel distributed regression tree approach that utilizes robust regression statistics, statistics that are more robust to outliers, for handling large and noisy datasets. We then present an adaptive gradient learning method for recurrent neural networks (RNNs) to forecast streaming time series in the presence of both outliers and change points.
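    The P2H framework itself is only named in the abstract, but the primitive it accelerates, evaluating lagged correlations between series, can be sketched in a few lines. The lag range and the single-pair loop below are assumptions; a real deployment would shard the independent (pair, lag) work units across a stream engine such as Storm rather than loop locally.

```python
import numpy as np

def lagged_correlations(x: np.ndarray, y: np.ndarray, max_lag: int) -> dict:
    """Pearson correlation between x shifted by `lag` and y, for every
    lag in [-max_lag, max_lag]. Each (series pair, lag) evaluation is
    independent, which makes the task easy to distribute; the hard parts
    addressed by P2H are shuffling data so each worker sees the windows
    it needs, and pruning lags that cannot yield high correlation."""
    assert len(x) == len(y)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:lag], y[-lag:]
        out[lag] = float(np.corrcoef(a, b)[0, 1])
    return out
```

    The lag with the largest absolute correlation then indicates how far one series leads or trails the other.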

    Constant-Weight Gray Codes for Local Rank Modulation

    We consider the local rank-modulation scheme in which a sliding window going over a sequence of real-valued variables induces a sequence of permutations. Local rank-modulation is a generalization of the rank-modulation scheme, which has recently been suggested as a way of storing information in flash memory. We study constant-weight Gray codes for the local rank-modulation scheme in order to simulate conventional multi-level flash cells while retaining the benefits of rank modulation. We provide necessary conditions for the existence of cyclic and cyclic optimal Gray codes. We then specifically study codes of weight 2 and upper bound their efficiency, thus proving that there are no such asymptotically-optimal cyclic codes. In contrast, we study codes of weight 3 and efficiently construct codes which are asymptotically optimal. We conclude with a construction of codes with asymptotically-optimal rate and weight asymptotically half the length, thus having an asymptotically-optimal charge difference between adjacent cells.
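    The opening definition is concrete enough to turn into code: slide a window of size s over the real-valued sequence and record, at each position, the rank permutation of the values in that window. The sketch below assumes a window size s and unit step; the Gray-code constructions themselves are the paper's contribution and are not reproduced.

```python
def local_rank_modulation(values, s: int = 3, step: int = 1):
    """Demodulate a sequence of real values into the sequence of
    permutations induced by a sliding window of size s: position i maps
    to the ranks of values[i:i+s], with 0 marking the smallest value."""
    perms = []
    for i in range(0, len(values) - s + 1, step):
        window = values[i:i + s]
        order = sorted(range(s), key=lambda j: window[j])
        ranks = [0] * s
        for r, j in enumerate(order):
            ranks[j] = r
        # e.g. window (2.5, 0.1, 7.0) -> permutation (1, 0, 2)
        perms.append(tuple(ranks))
    return perms
```

    Reading information from relative ranks rather than absolute charge levels is what lets rank modulation tolerate drift in cell charges.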

    Inverse regression in MR Fingerprinting: reducing dictionary size while increasing parameters accuracy

    Purpose: To reduce dictionary size and increase parameter estimate accuracy in MR Fingerprinting (MRF).

    Methods: A dictionary-based learning (DBL) method is investigated to bypass inherent MRF limitations in high dimension: reconstruction time and memory requirement. The DBL method is a 3-step procedure: (1) a quasi-random sampling strategy to produce the dictionary, (2) a statistical inverse regression model to learn from the dictionary a probabilistic mapping between MR fingerprints and parameters, and (3) this mapping to provide both parameter estimates and their confidence levels.

    Results: On synthetic data, experiments show that quasi-random sampling outperforms the grid when designing the dictionary for inverse regression. Dictionaries up to 100 times smaller than usually employed in MRF yield more accurate parameter estimates, with a 500-fold time gain. Estimates are supplied with a confidence index that is well correlated with the estimation bias (r ≥ 0.89). On microvascular MRI data, results show that dictionary-based methods (MRF and DBL) yield more accurate estimates than the conventional closed-form equation method. On MRI signals from tumor-bearing rats, the DBL method shows very little sensitivity to the dictionary size, in contrast to the MRF method.

    Conclusion: The proposed method efficiently reduces the number of simulations required to produce the dictionary, speeds up parameter estimation, and improves estimate accuracy. The DBL method also introduces a confidence index for each parameter estimate.
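    The paper's sampling claim, that a quasi-random dictionary covers parameter space better than a grid of the same size, is easy to illustrate with an off-the-shelf Sobol sequence. The two-parameter (T1, T2) space, its ranges, and the matcher below are placeholder assumptions; the inverse regression model itself is not sketched here.

```python
import numpy as np
from scipy.stats import qmc

# Placeholder 2-parameter space, e.g. (T1, T2) relaxation times in ms.
LOW, HIGH = [100.0, 10.0], [3000.0, 300.0]

def sobol_dictionary(n: int, seed: int = 0) -> np.ndarray:
    """Quasi-random (Sobol) parameter samples; n is best a power of two.
    These fill the space more evenly than i.i.d. random draws and avoid
    the rigid axis alignment of a grid."""
    sampler = qmc.Sobol(d=2, scramble=True, seed=seed)
    return qmc.scale(sampler.random(n), LOW, HIGH)

def grid_dictionary(n_per_axis: int) -> np.ndarray:
    """Conventional grid baseline with n_per_axis**2 entries."""
    t1 = np.linspace(LOW[0], HIGH[0], n_per_axis)
    t2 = np.linspace(LOW[1], HIGH[1], n_per_axis)
    return np.stack(np.meshgrid(t1, t2), axis=-1).reshape(-1, 2)

def mrf_match(signal: np.ndarray, fingerprints: np.ndarray,
              params: np.ndarray) -> np.ndarray:
    """Conventional MRF matching (the step DBL replaces with regression):
    return the parameters of the best-correlated dictionary entry."""
    f = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    s = signal / np.linalg.norm(signal)
    return params[int(np.argmax(f @ s))]
```

    Swapping `grid_dictionary` for `sobol_dictionary` at equal size is the kind of comparison the synthetic experiments report on.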

    A Novel Flexible and Steerable Probe for Minimally Invasive Soft Tissue Intervention

    Current trends in surgical intervention favour a minimally invasive (MI) approach, in which complex procedures are performed through increasingly small incisions. Specifically, in neurosurgery, there is a need for minimally invasive keyhole access, which conflicts with the lack of maneuverability of conventional rigid instruments. In an attempt to address this fundamental shortcoming, this thesis describes the concept design, implementation and experimental validation of a novel flexible and steerable probe, named “STING” (Soft Tissue Intervention and Neurosurgical Guide), which is able to steer along curvilinear trajectories within a compliant medium. The underlying mechanism of motion of the flexible probe, based on the reciprocal movement of interlocked probe segments, is biologically inspired and was designed around the unique features of the ovipositor of certain parasitic wasps. Such insects are able to lay eggs by penetrating different kinds of “host” (e.g. wood, larva) with a very thin and flexible multi-part channel, thanks to a micro-toothed surface topography, coupled with a reciprocating “push and pull” motion of each segment. This thesis starts by exploring these foundations, where the “microtexturing” of the surface of a rigid probe prototype is shown to facilitate probe insertion into soft tissue (porcine brain), while gaining tissue purchase when the probe is tensioned outwards. Based on these findings, forward motion into soft tissue via a reciprocating mechanism is then demonstrated through a focused set of experimental trials in gelatine and agar gel. A flexible probe prototype (10 mm diameter), composed of four interconnected segments, is then presented and shown to be able to steer in a brain-like material along multiple curvilinear trajectories on a plane. The geometry and certain key features of the probe are optimised through finite element models, and a suitable actuation strategy is proposed, where the approach vector of the tip is found to be a function of the offset between interlocked segments. This concept of a “programmable bevel”, which enables the steering angle to be chosen with virtually infinite resolution, represents a world-first in percutaneous soft tissue surgery. The thesis concludes with a description of the integration and validation of a fully functional prototype within a larger neurosurgical robotic suite (EU FP7 ROBOCAST), which is followed by a summary of the corresponding implications for future work.

    Formal Verification of Input-Output Mappings of Tree Ensembles

    Recent advances in machine learning and artificial intelligence are now being considered in safety-critical autonomous systems where software defects may cause severe harm to humans and the environment. Design organizations in these domains are currently unable to provide convincing arguments that their systems are safe to operate when machine learning algorithms are used to implement their software. In this paper, we present an efficient method to extract equivalence classes from decision trees and tree ensembles, and to formally verify that their input-output mappings comply with requirements. The idea is that, given that safety requirements can be traced to desirable properties of system input-output patterns, we can use positive verification outcomes in safety arguments. This paper presents the implementation of the method in the tool VoTE (Verifier of Tree Ensembles), and evaluates its scalability on two case studies from the current literature. We demonstrate that our method is practical for tree ensembles trained on low-dimensional data with up to 25 decision trees and tree depths of up to 20. Our work also studies the limitations of the method with high-dimensional data and preliminarily investigates the trade-off between the number of trees and the time taken for verification.
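    The abstract's key observation, that a decision tree partitions its input space into finitely many equivalence classes with one output each, can be sketched directly: every root-to-leaf path corresponds to a hyperrectangle, and checking a property once per hyperrectangle verifies it for all inputs in that class. The tree encoding and the toy property below are assumptions, not VoTE's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Node:
    feature: int = -1          # -1 marks a leaf
    threshold: float = 0.0
    left: "Node" = None        # inputs with x[feature] <= threshold
    right: "Node" = None
    value: float = 0.0         # leaf output

def equivalence_classes(node: Node, box: dict):
    """Enumerate (hyperrectangle, output) pairs. Every input inside a
    box reaches the same leaf, so one check per box covers infinitely
    many concrete inputs."""
    if node.feature < 0:
        yield box, node.value
        return
    lo, hi = box[node.feature]
    left = dict(box);  left[node.feature] = (lo, min(hi, node.threshold))
    right = dict(box); right[node.feature] = (max(lo, node.threshold), hi)
    yield from equivalence_classes(node.left, left)
    yield from equivalence_classes(node.right, right)

def verify(tree: Node, box: dict, prop) -> bool:
    """Formally check prop(output) over every equivalence class."""
    return all(prop(v) for _, v in equivalence_classes(tree, box))

if __name__ == "__main__":
    # x[0] <= 5 -> 0.2, else 0.9; verify outputs stay within [0, 1].
    tree = Node(0, 5.0, Node(value=0.2), Node(value=0.9))
    print(verify(tree, {0: (0.0, 10.0)}, lambda v: 0.0 <= v <= 1.0))  # True
```

    For an ensemble, the classes of individual trees must be intersected, which is where the combinatorial growth studied in the scalability evaluation comes from.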