11 research outputs found

    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is a major kind of multi-dimensional dataset in which values are missing at random across its dimensions. Retrieving information from such a dataset becomes very difficult as it grows, and finding top-k dominant values in it is challenging. Some algorithms exist to speed up this process, but most are efficient only on small incomplete datasets. One algorithm that makes the top-k dominance (TKD) query feasible is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed for, and cannot handle, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed for the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply the TKD query to incomplete data; however, all of them performed weakly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to improve the performance of top-k dominance queries on huge incomplete datasets. The framework divides the work among several computing nodes that operate independently and simultaneously to find the result. The method achieves up to two times faster processing in finding the TKD query result than previously presented algorithms.
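
To make the idea in the abstract above concrete, here is a minimal Python sketch of a top-k dominance query over incomplete data, with the scoring split into map-style partial counts that are then summed. It is not the MRBIG algorithm (it uses no bitmap index). It assumes the usual definition in which an object dominates another if it is no worse on every dimension observed in both and strictly better on at least one, with smaller values preferred. The names `dominates`, `map_partial_scores`, `top_k_dominating`, and the sample `data` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value (assumption)


def dominates(a: Row, b: Row) -> bool:
    """True if a dominates b on the dimensions observed in both (smaller is better)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return (
        bool(shared)
        and all(x <= y for x, y in shared)
        and any(x < y for x, y in shared)
    )


def map_partial_scores(data, chunk):
    """Map step: for every object, count the objects in this chunk that it dominates."""
    return Counter(
        {i: sum(dominates(a, data[j]) for j in chunk) for i, a in enumerate(data)}
    )


def top_k_dominating(data, k, n_chunks=2):
    """Return the k (index, score) pairs with the highest dominance scores."""
    chunks = [range(start, len(data), n_chunks) for start in range(n_chunks)]
    totals = Counter()
    for chunk in chunks:  # each chunk could be handled by a separate node
        totals.update(map_partial_scores(data, chunk))  # reduce step: sum partial counts
    # Sort by score (descending), breaking ties by index.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


data = [
    (1, None, 2),
    (2, 3, None),
    (None, 1, 3),
    (3, 4, 4),
]
print(top_k_dominating(data, 2))  # [(0, 3), (2, 2)]
```

The loop over chunks runs sequentially here; in a MapReduce deployment each chunk would be scored by an independent task, which is the source of the parallel speed-up the abstract describes.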

    Outlier Detection Ensemble with Embedded Feature Selection

    Feature selection plays an important role in improving the performance of outlier detection, especially for noisy data. Existing methods usually perform feature selection and outlier scoring separately, which may select feature subsets that do not optimally serve outlier detection, leading to unsatisfactory performance. In this paper, we propose an outlier detection ensemble framework with embedded feature selection (ODEFS) to address this issue. Specifically, for each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection into a pairwise ranking formulation to learn feature subsets that are tailored to the outlier detection method. Moreover, we adopt thresholded self-paced learning to simultaneously optimize feature selection and example selection, which helps improve the reliability of the training set. We then design an alternating algorithm with proven convergence to solve the resulting optimization problem. In addition, we analyze the generalization error bound of the proposed framework, which provides a theoretical guarantee for the method and insightful practical guidance. Comprehensive experimental results on 12 real-world datasets from diverse domains validate the superiority of the proposed ODEFS. Comment: 10 pages, AAAI202

    Crowdsourcing for Top-K Query Processing over Uncertain Data

    Querying uncertain data has become a prominent application due to the proliferation of user-generated content from social media and of data streams from sensors. When data ambiguity cannot be reduced algorithmically, crowdsourcing proves a viable approach: tasks are posted to humans, and their judgment is harnessed to improve confidence in data values or relationships. This paper tackles the problem of processing top-K queries over uncertain data with the help of crowdsourcing, so as to converge quickly to the real ordering of relevant results. Several offline and online approaches for addressing questions to a crowd are defined and contrasted on both synthetic and real data sets, with the aim of minimizing the crowd interactions necessary to find the real ordering of the result set.

    Fake News Detection Based on Subjective Opinions

    Fake news disrupts social media, leading to harmful consequences. Several types of information can be used to detect fake news, such as news content features and news propagation features. In this study, we focus on users' news-spreading behavior on social media platforms and aim to detect fake news more effectively through more accurate assessment of data reliability. We introduce Subjective Opinions into reliability evaluation and propose two new methods. Experiments on two popular real-world datasets, BuzzFeed and PolitiFact, validate that our proposed Subjective Opinions based method detects fake news more accurately than all existing methods, and that the other proposed method, which is probability based, achieves state-of-the-art performance.

    Infinite Probabilistic Databases

    Probabilistic databases (PDBs) model uncertainty in data in a quantitative way. In the established formal framework, probabilistic (relational) databases are finite probability spaces over relational database instances. This finiteness can clash with intuitive query behavior (Ceylan et al., KR 2016), and with application scenarios that are better modeled by continuous probability distributions (Dalvi et al., CACM 2009). We formally introduced infinite PDBs in (Grohe and Lindner, PODS 2019) with a primary focus on countably infinite spaces. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerning the measurability of events and queries and, ultimately, the question of whether queries have a well-defined semantics. We argue that finite point processes are an appropriate model from probability theory for dealing with general probabilistic databases. This allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and Datalog queries. Comment: This is the full version of the paper "Infinite Probabilistic Databases" presented at ICDT 2020 (arXiv:1904.06766)

    Processing Uncertain RFID Data in Traceability Supply Chains

    Radio Frequency Identification (RFID) is widely used to track and trace objects in traceability supply chains. However, the massive uncertain data produced by RFID readers cannot be used effectively or efficiently in RFID application systems. Following an analysis of the key features of RFID objects, this paper proposes a new framework for effectively and efficiently processing uncertain RFID data and for supporting a variety of queries for tracking and tracing RFID objects. We adjust the smoothing windows according to the rate of uncertain data, employ different strategies to process uncertain readings, and distinguish ghost, missing, and incomplete data according to their apparent positions. We propose a comprehensive data model that is suitable for different application scenarios. In addition, a path coding scheme is proposed to significantly compress massive data by aggregating the path sequence, the position, and the time intervals. The scheme is suitable for cyclic or long paths. Moreover, we propose a processing algorithm for group and independent objects. Experimental evaluations show that our approach is effective and efficient in terms of compression and traceability queries.

    ๋น…๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์ ์ธ ์Šค์นด์ด๋ผ์ธ ์งˆ์˜ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2017. Kyuseok Shim.
    The skyline operator and its variants, such as the dynamic skyline, reverse skyline and probabilistic skyline operators, have attracted considerable attention recently due to their broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this dissertation, we propose efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition the data based on the regions divided by the histograms and compute candidate skyline points for each partition using MapReduce. Finally, in every partition, we check whether each skyline candidate point is actually a skyline point, again using MapReduce. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that are not probabilistic skyline points. Then, we build a quadtree-based histogram and split the data into partitions according to the regions divided by the quadtree. We finally compute the probabilistic skyline points for each partition using MapReduce. We also develop workload balancing methods so that the estimated execution times of all available machines are similar. We ran experiments to compare our algorithms with the state-of-the-art algorithms using MapReduce and confirmed the effectiveness as well as the scalability of our proposed skyline algorithms.
    Contents:
    1 Introduction: 1.1 Motivation; 1.2 Contributions of This Dissertation; 1.3 Dissertation Overview
    2 Related Work: 2.1 Skyline Queries; 2.2 Reverse Skyline Queries; 2.3 Probabilistic Skyline Queries
    3 Background: 3.1 Skyline and Its Variants; 3.2 MapReduce Framework
    4 Parallel Skyline Query Processing: 4.1 SKY-MR: Our Skyline Computation Algorithm (4.1.1 SKY-QTREE: The Sky-Quadtree Building Algorithm; 4.1.2 L-SKY-MR: The Local Skyline Computation Algorithm; 4.1.3 G-SKY-MR: The Global Skyline Computation Algorithm); 4.2 Experiment (4.2.1 Performance Results for Skylines; 4.2.2 Performance Results in Other Environments)
    5 Parallel Reverse Skyline Query Processing: 5.1 RSKY-MR: Our Reverse Skyline Computation Algorithm (5.1.1 RSKY-QTREE: The Rsky-Quadtree Building Algorithm; 5.1.2 Computations of Reverse Skylines using Rsky-Quadtrees; 5.1.3 L-RSKY-MR: The Local Reverse Skyline Computation Algorithm; 5.1.4 G-RSKY-MR: The Global Reverse Skyline Computation Algorithm); 5.2 Experiment (5.2.1 Performance Results for Reverse Skylines)
    6 Parallel Probabilistic Skyline Query Processing: 6.1 Early Pruning Techniques (6.1.1 Upper-bound Filtering; 6.1.2 Zero-probability Filtering; 6.1.3 Dominance-Power Filtering); 6.2 Utilization of a PS-QTREE for Pruning (6.2.1 Generating a PS-QTREE; 6.2.2 Exploiting a PS-QTREE for Filtering; 6.2.3 Partitioning Objects by a PS-QTREE); 6.3 PS-QPF-MR: Our Algorithm with Quadtree Partitioning and Filtering (6.3.1 Optimizations of PS-QPF-MR; 6.3.2 Sample Size and Split Threshold of a PSQtree); 6.4 PS-BRF-MR: Our Algorithm with Random Partitioning and Filtering; 6.5 Experiments (6.5.1 Performance Results for Probabilistic Skylines)
    7 Conclusion
    Bibliography
    Abstract (In Korean)
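
The local-then-global structure described in this abstract can be illustrated with a small Python sketch. It is only a simplified illustration, not the dissertation's SKY-MR algorithm: it replaces the quadtree-based histogram with a plain round-robin partition, runs the two phases sequentially instead of as MapReduce jobs, and assumes smaller values are preferred in every dimension. The function names and sample `points` are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points, n_parts=2):
    """Two-phase skyline: local candidates per partition, then a global check."""
    # Round-robin partitioning stands in for the histogram-based partitioning.
    parts = [points[i::n_parts] for i in range(n_parts)]
    # Phase 1: each partition independently yields its local skyline (candidates).
    candidates = [p for part in parts for p in skyline(part)]
    # Phase 2: a candidate is a true skyline point only if no other candidate dominates it.
    return skyline(candidates)


points = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4), (2, 6)]
print(sorted(partitioned_skyline(points)))  # [(1, 5), (2, 2), (5, 1)]
```

The two-phase result matches the direct skyline because any point dominated within its own partition is also dominated in the full dataset, so discarding it early is safe; pruning with a histogram, as the abstract describes, would reduce the candidate set further before the global phase.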