305 research outputs found

    Correlation-based methods for data cleaning, with application to biological databases

    Ph.D. (Doctor of Philosophy)

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and it remains a challenging problem. While previous works have studied specific aspects of ER (mostly in traditional settings), this survey provides, for the first time, an end-to-end view of modern ER workflows and of the novel aspects of entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches and conclude with a detailed presentation of open research directions.

    ACTS: Extracting Android App Topological Signature through Graphlet Sampling

    Android systems are widely used in mobile and wireless distributed systems. In the near future, Android is believed to dominate the mobile distributed environment. However, with the popularity of Android-based smartphones and tablets comes the rampancy of Android-based malware. In this paper, we propose a novel topological signature of Android apps based on the function call graphs (FCGs) extracted from their Android App Packages (APKs). Specifically, by leveraging recent advances in graphlet sampling, the proposed method fully captures the invocator-invocatee relationship at local neighborhoods in an FCG without exponentially inflating the state space. Using real benign app and malware samples, we demonstrate that our method, ACTS (App topologiCal signature through graphleT Sampling), can detect malware and identify malware families robustly and efficiently. More importantly, we demonstrate that, without augmenting the FCG with any semantic features such as bytecode-based vertex typing, local topological information captured by ACTS alone can achieve a high malware detection accuracy. Since ACTS only uses structural features, which are orthogonal to semantic features, it is expected that combining them would give a greater improvement in malware detection accuracy than combining non-orthogonal semantic features.

    Potable Water Leakage Prediction and Detection using Geospatial Analysis

    Due to increasing water treatment costs and conservation needs, traditional water loss analysis and acoustic leak detection methods are becoming heavily scrutinized by water utilities. This study explores water loss in Johnson City, Tennessee, and how geospatial data analysis techniques improve water loss mitigation. This project uses sample water system pressure data and ordinary kriging spatial interpolation methods to identify leakage areas for further investigation. Analysis of existing geographic information system (GIS) water utility datasets with interpolated hydraulic grade values at sample water pressure points produces manageable survey areas that pinpoint areas with possible water leakage. Field detection methods, including ground-penetrating radar (GPR) and traditional acoustic methods, are employed to verify leakage predictions. Ten leakage areas are identified and verified using traditional acoustic detection methods, work order research, and GPR. The resulting data show that spatial analysis coupled with geospatial analysis of field pressure information improves water loss mitigation.

    Exploring the Potentials of Using Crowdsourced Waze Data in Traffic Management: Characteristics and Reliability

    Real-time traffic information is essential to a variety of practical applications. To obtain traffic data, various traffic monitoring devices, such as loop detectors, infrastructure-mounted sensors, and cameras, have been installed on road networks. However, transportation agencies have sought alternative data sources to monitor traffic, due to the high installation and maintenance costs of conventional data collection methods. Recently, crowdsourced traffic data has become available and is widely considered to have great potential in intelligent transportation systems. Waze is a crowdsourcing traffic application that enables users to share real-time traffic information. Waze data, including passively collected speed data and actively submitted user reports, is valuable for traffic management but has not been explored or evaluated extensively. This dissertation evaluated and explored the potential of Waze data in traffic management from different perspectives. First, it examined Waze traffic speed data to understand their characteristics and reliability. Second, a calibration-free incident detection algorithm using traffic speed data on freeways was proposed, and the results were compared with other commonly used algorithms. Third, a spatial and temporal quality analysis of Waze accident reports was performed to better understand their quality and accuracy. Last, the dissertation proposed a network-based clustering algorithm to identify secondary crashes from Waze user reports, and a case study was performed to demonstrate the applicability of the method and the potential of crowdsourced Waze user reports.

    Approximate Matching of Hierarchical Data
