3 research outputs found

    Adaptive Normalization in Streaming Data

    In today's digital era, data are everywhere, from the Internet of Things to health-care and financial applications. This leads to potentially unbounded, ever-growing Big Data streams that need to be utilized effectively. Data normalization is an important preprocessing technique for data analytics: it helps prevent mismodeling and reduces the complexity inherent in the data, especially for data integrated from multiple sources and contexts. Normalizing Big Data streams is challenging because of evolving inconsistencies, time and memory constraints, and the unavailability of the whole dataset beforehand. This paper proposes a distributed approach to adaptive normalization for Big Data streams. Using sliding windows of fixed size, it provides a simple mechanism to adapt the statistics used to normalize the changing data in each window. Implemented on Apache Storm, a distributed real-time stream-processing framework, our approach exploits distributed data processing for efficient normalization. Unlike existing adaptive approaches that normalize data for a specific use (e.g., classification), ours is not tied to any particular task. Moreover, our adaptive mechanism allows flexible control, via user-specified thresholds, of the trade-off between normalization time and precision. The paper illustrates the proposed approach alongside a few other techniques, with experiments on both synthesized and real-world data. On a stream of 160,000 instances, the normalized data obtained from our approach improves over the baseline by 89%, with a root-mean-square error of 0.0041 relative to the actual data.
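
    To make the idea concrete, below is a minimal Python sketch of sliding-window adaptive normalization as the abstract describes it in general terms: per-window statistics are reused until incoming values drift past a user-specified threshold, trading precision for time. The class, parameter names (window_size, drift_threshold), and min-max scaling choice are illustrative assumptions, not the paper's actual implementation.

    from collections import deque

    class SlidingWindowNormalizer:
        def __init__(self, window_size=1000, drift_threshold=0.05):
            self.window = deque(maxlen=window_size)  # fixed-size sliding window
            self.drift_threshold = drift_threshold   # controls time/precision trade-off
            self.lo = None                           # current minimum used for scaling
            self.hi = None                           # current maximum used for scaling

        def _refresh_stats(self):
            # Recompute min/max from the data currently in the window.
            self.lo, self.hi = min(self.window), max(self.window)

        def update(self, value):
            """Add one stream element and return its normalized value in [0, 1]."""
            self.window.append(value)
            if self.lo is None:
                self._refresh_stats()
            # Refresh statistics only when the new value drifts outside the
            # current range by more than the threshold.
            span = (self.hi - self.lo) or 1.0
            if value < self.lo - self.drift_threshold * span or \
               value > self.hi + self.drift_threshold * span:
                self._refresh_stats()
            span = (self.hi - self.lo) or 1.0
            clipped = min(max(value, self.lo), self.hi)
            return (clipped - self.lo) / span

    # Example: normalize a small synthetic stream.
    norm = SlidingWindowNormalizer(window_size=100, drift_threshold=0.05)
    print([round(norm.update(x), 3) for x in [10, 12, 11, 50, 49, 52, 9, 100]])

    In a distributed setting such as Apache Storm, one such normalizer instance would typically run inside each processing bolt, so each partition of the stream adapts its own statistics independently.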

    Machine Learning and Software Solutions for Data Quality Assessment in CERN's ATLAS Experiment

    The Large Hadron Collider (LHC) is home to multiple particle-physics experiments designed to test the Standard Model and push our understanding of the universe to its limits. The ATLAS detector is one of the large general-purpose experiments that use the LHC, and it generates a significant amount of data as part of its regular operations. Prior to physics analysis, these data are cleaned through a data-quality assessment process that requires significant operator resources. With the evolution of machine learning and anomaly detection, there is a great opportunity to upgrade the ATLAS Data Quality Monitoring Framework with automated, machine-learning-based solutions that reduce operator requirements and improve data quality for physics analysis. This thesis provides an infrastructure, a theoretical foundation, and a unique machine learning approach to automate this process. It does so by combining two well-documented algorithms (autoencoders and DBSCAN) and organizing the dataset around geometric descriptor features. The results of this work are released as code and software solutions for the benefit of current and future data quality assessment, research, and collaboration in the ATLAS experiment.
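
    The following Python sketch illustrates the general pattern of pairing an autoencoder with DBSCAN for anomaly flagging on tabular descriptor features, as the abstract outlines. The synthetic data, thresholds, and the use of scikit-learn's MLPRegressor as a stand-in autoencoder are assumptions for illustration only and are not the thesis code.

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))   # mostly "good" descriptor features
    X[:10] += 5.0                   # a few anomalous entries

    X_std = StandardScaler().fit_transform(X)

    # Stand-in autoencoder: an MLP trained to reconstruct its own input through
    # a narrow hidden layer; high reconstruction error suggests an anomaly.
    ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
    ae.fit(X_std, X_std)
    recon_error = np.mean((ae.predict(X_std) - X_std) ** 2, axis=1)

    # DBSCAN on the same features: points labelled -1 are density outliers.
    db_labels = DBSCAN(eps=1.5, min_samples=10).fit_predict(X_std)

    # Flag samples that look anomalous to both methods (threshold is an assumption).
    flagged = (recon_error > np.percentile(recon_error, 95)) & (db_labels == -1)
    print(f"flagged {flagged.sum()} of {len(X)} samples")

    Requiring agreement between the reconstruction-error and density criteria is one simple way to combine the two algorithms; other combination rules are equally possible.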