
    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is a major kind of multi-dimensional dataset with randomly distributed missing nodes in its dimensions. Retrieving information from such a dataset becomes very difficult as it grows large, and finding top-k dominant values in it is challenging. Some algorithms exist to improve this process, but most are efficient only on small incomplete datasets. One algorithm that makes the top-k dominance (TKD) query practical is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed for, and is not capable of, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed to answer the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply the TKD query to incomplete data; however, all of them either performed poorly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to improve the performance of top-k dominance queries on huge incomplete datasets. The framework divides the work among several computing nodes that operate independently and simultaneously to find the result. The method achieved up to two times faster processing in answering TKD queries compared with previously presented algorithms.
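
To make the query in the abstract above concrete, here is a minimal brute-force sketch of a top-k dominance query over incomplete data. It is not the BIG or MRBIG algorithm. It assumes the dominance definition commonly used for incomplete data (compare only the dimensions observed in both objects), which the abstract does not spell out, and it assumes smaller values are better. The names `dominates`, `map_partition`, and `top_k_dominating` are hypothetical, and the partitions are processed sequentially as a stand-in for independent map tasks.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value


def dominates(a: Row, b: Row) -> bool:
    """Assumed definition: on dimensions observed in both rows, a is never
    worse than b and is strictly better at least once (smaller is better)."""
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either row
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def map_partition(partition: List[int], dataset: List[Row]) -> List[Tuple[int, int]]:
    """Map step: score each object in this partition against the whole dataset.
    The score is the number of other objects it dominates."""
    return [
        (i, sum(dominates(dataset[i], other)
                for j, other in enumerate(dataset) if j != i))
        for i in partition
    ]


def top_k_dominating(dataset: List[Row], k: int,
                     n_partitions: int = 2) -> List[Tuple[int, int]]:
    """Split object indices into partitions, score each partition, then merge
    (reduce step) and keep the k highest-scoring (index, score) pairs."""
    indices = list(range(len(dataset)))
    partitions = [indices[p::n_partitions] for p in range(n_partitions)]
    scored = [pair for part in partitions for pair in map_partition(part, dataset)]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by index
    return scored[:k]
```

Each partition only reads the dataset and emits its own scores, so the map calls could run on separate workers; the final sort plays the role of the reduce step. The pairwise comparison is quadratic in the number of objects, which is the cost that index-based methods aim to reduce.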

    Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support

    In this paper, we describe our work in progress in the scope of web-scale information extraction and information retrieval utilizing distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. Proposed architecture is focused on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of extracted information and finally on indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases, where the first is search in job offers retrieved from the LinkedIn portal and the second is search in BBC news feeds and discuss several problems we had to face during the implementation. We also discuss spatial search applications for both cases because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.

    Doctor of Philosophy

    We are living in an age where data are being generated faster than anyone has previously imagined across a broad range of application domains, including customer studies, social media, sensor networks, and the sciences, among many others. In some cases, data are generated in massive quantities, as terabytes or petabytes. Numerous challenges have emerged in dealing with massive data, including: (1) the explosion in the size of data; (2) data have increasingly complex structures and rich semantics, such as representing temporal data as a piecewise linear representation; (3) uncertain data are becoming a common occurrence in numerous applications, e.g., scientific measurements or observations such as meteorological measurements; and (4) data are becoming increasingly distributed, e.g., data collected and integrated from distributed locations as well as data stored in a distributed file system within a cluster. Due to the massive nature of modern data, it is oftentimes infeasible for computers to efficiently manage and query them exactly. An attractive alternative is to use data summarization techniques to construct data summaries, although even constructing data summaries efficiently is a challenging task given the enormous size of the data. The data summaries we focus on in this thesis include the histogram and the ranking operator. Both enable us to summarize a massive dataset into a more succinct representation, which can then be used to make queries orders of magnitude more efficient while still allowing approximation guarantees on query answers. Our study has focused on the critical task of designing efficient algorithms to summarize, query, and manage massive data.

    Analyzing social media data and performance comparison with traditional database, data warehouse, and MapReduce approaches

    Data warehouse, OLAP technology and distributed analysis show great potential in improving business analysis, tendency prediction and decision making. With the assistance of data mining techniques, databases can also be a useful tool for analyzing societal trends by gathering data from social media networks. As these networks can contain huge amounts of text data, they can serve as a perfect platform for testing text mining technologies and for discovering which trends or topics concern people the most during a certain time period. This project utilizes a data set of tweets generated from May to June 2019, which contains more than 2 million tweets with content and location data. After applying some data cleaning techniques, we were able to establish a data cube and provide various analyses based on location. Our results show that Twitter users' preferences and usage frequency vary significantly based on their locations. Ultimately, this project provides a case study about utilizing database, data warehouse and distributed analysis technology to analyze social media, and provides some insight regarding trending topics of interest. This work could be applied by those interested in gaining a better understanding of social media users.

    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volume has grown so large over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively prosperous stage, and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges and discuss potential solutions and further research potential in this field.

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.