
    Query Optimization for On-Demand Information Extraction Tasks over Text Databases

    Many modern applications involve analyzing large amounts of data that come from unstructured text documents. In its raw form, this data contains information that, if extracted, can provide more insight and support the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is to use the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse; the extracted data is then queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches were developed to perform extraction on the fly; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries. In this work, we propose an online approach that integrates the database management system's engine with IE systems through a new type of view called an extraction view. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply all well-defined optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that reflects the latest changes in the data while avoiding unnecessary extraction from irrelevant text documents.
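
    To make the extraction-view idea concrete, here is a minimal sketch (not the authors' implementation; the relation `CompanyMention`, the extractor `extract_entities`, and the toy corpus are all hypothetical): a view over a text corpus that is populated lazily at query time and can skip documents that a filter deems irrelevant.

```python
# Hypothetical sketch of an "extraction view": rows are extracted from raw
# documents only when the view is queried, not in an ETL pre-pass.
from dataclasses import dataclass

@dataclass
class CompanyMention:          # assumed target relation of the extractor
    doc_id: int
    company: str
    revenue: float

def extract_entities(doc_id: int, text: str) -> list[CompanyMention]:
    """Placeholder IE step; a real system would call an IE library here."""
    rows = []
    for line in text.splitlines():
        if "revenue" in line.lower():
            # Toy pattern: "<Company> revenue <number>"
            parts = line.split()
            try:
                rows.append(CompanyMention(doc_id, parts[0], float(parts[-1])))
            except ValueError:
                pass
    return rows

class ExtractionView:
    """Populated lazily: documents are processed only when queried."""
    def __init__(self, corpus: dict[int, str]):
        self.corpus = corpus
        self.cache: dict[int, list[CompanyMention]] = {}

    def scan(self, doc_filter=None):
        for doc_id, text in self.corpus.items():
            if doc_filter and not doc_filter(doc_id, text):
                continue                      # skip irrelevant documents entirely
            if doc_id not in self.cache:      # extract on demand, once per document
                self.cache[doc_id] = extract_entities(doc_id, text)
            yield from self.cache[doc_id]

corpus = {1: "Acme revenue 12.5", 2: "Weather was sunny."}
view = ExtractionView(corpus)
print([m.company for m in view.scan() if m.revenue > 10])   # -> ['Acme']
```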

    Collusion in Peer-to-Peer Systems

    Peer-to-peer systems have reached widespread use, ranging from academic and industrial applications to home entertainment. The key advantage of this paradigm lies in its scalability and flexibility, consequences of the participants sharing their resources for the common welfare. Security in such systems is a desirable goal. For example, when mission-critical operations or bank transactions are involved, their effectiveness strongly depends on the perception that users have of the system's dependability and trustworthiness. A major threat to the security of these systems is the phenomenon of collusion. Peers can be selfish colluders, when they try to fool the system to gain unfair advantages over other peers, or malicious, when their purpose is to subvert the system or disturb other users. The problem, however, has so far received only marginal attention from the research community. While several solutions exist to counter attacks in peer-to-peer systems, very few of them are meant to directly counter colluders and their attacks. Reputation, micro-payments, and concepts from game theory are currently the main means used to obtain fairness in the usage of resources. Our goal is to provide an overview of the topic by examining the key issues involved. We measure the relevance of the problem in the current literature and the effectiveness of existing philosophies against it, in order to suggest fruitful directions for the further development of the field.

    Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space

    In entity matching classification, we are given two sets R and S of objects, where for each pair (r, s) in R x S it is known whether r and s form a match. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) -> {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match. Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R x S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R x S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to |R|+|S|+|T|, thereby achieving a large performance gain in the (typical) scenario where |T| << |R||S|. This paper provides evidence of the feasibility of the new direction by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multi-dimensional plane separating the matching and non-matching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for large-scale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
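
    For intuition, the toy sketch below illustrates only the problem setup and the quadratic baseline that the paper argues against, not the proposed |R|+|S|+|T| MPC algorithm: every pair in R x S is materialized and labeled, and a perceptron (an assumed stand-in for any linear-classification routine) searches for a separating hyperplane over the pair vectors.

```python
# Toy illustration of the problem setup only -- NOT the paper's |R|+|S|+|T|
# MPC algorithm. It materializes every labeled pair in R x S (exactly the
# quadratic cost the paper avoids) and fits a separating hyperplane with a
# perceptron: a pair (r, s) is a point in 2d-dimensional space, with matches
# on one side of the plane and non-matches on the other.
import numpy as np

R = np.array([[0, 0], [0, 1], [1, 2], [2, 0], [2, 2]], dtype=float)  # objects from D(R)
S = np.array([[0, 1], [1, 1], [2, 1], [0, 2]], dtype=float)          # objects from D(S)

def pair_vec(r, s):
    return np.concatenate([r, s, [1.0]])      # bias term appended

# Ground-truth matches T, generated here by a hidden linear rule for the demo.
w_true = np.array([1.0, 1.0, 1.0, 1.0, -2.5])
T = {(i, j) for i in range(len(R)) for j in range(len(S))
     if w_true @ pair_vec(R[i], S[j]) > 0}

# Quadratic step: label ALL |R||S| pairs -- fine for a toy, ruinous at scale.
X = np.array([pair_vec(r, s) for r in R for s in S])
y = np.array([1 if (i, j) in T else -1
              for i in range(len(R)) for j in range(len(S))])

# Perceptron: find w with sign(w . pair_vec(r, s)) = +1 exactly on the matches.
w = np.zeros(X.shape[1])
for _ in range(200):
    mistakes = 0
    for x_k, y_k in zip(X, y):
        if y_k * (w @ x_k) <= 0:
            w += y_k * x_k
            mistakes += 1
    if mistakes == 0:
        break

print(f"|R||S| = {len(X)} labeled pairs materialized, |T| = {len(T)} matches")
print("remaining training errors:", int(np.sum(np.sign(X @ w) * y <= 0)))
```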

    Processing SPARQL Queries Over Distributed RDF Graphs

    We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a "partial evaluation and assembly" framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over the RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial matches as partial answers in each fragment of the RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories with billions of triples confirm that our method is superior to the state-of-the-art methods in both performance and scalability.
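
    As a rough illustration of the "partial evaluation and assembly" framework (a simplified sketch, not the paper's algorithms; the query, fragments, and helper names are made up), each fragment below computes local partial matches for the triple patterns it can satisfy, and a centralized assembler joins partial matches that agree on shared variable bindings and together cover the whole query.

```python
# Sketch of "partial evaluation and assembly" over a partitioned RDF graph.
from itertools import combinations

# Query graph Q as triple patterns; '?'-prefixed terms are variables.
QUERY = [("?x", "knows", "?y"), ("?y", "worksAt", "?z")]

# RDF graph G partitioned into two fragments (toy data).
FRAGMENTS = [
    [("alice", "knows", "bob"), ("alice", "worksAt", "initech")],
    [("bob", "worksAt", "acme")],
]

def match_pattern(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if binding.get(p_term, t_term) != t_term:
                return None
            binding[p_term] = t_term
        elif p_term != t_term:
            return None
    return binding

def local_partial_matches(fragment):
    """Partial evaluation: bindings per single query pattern within one fragment."""
    out = []
    for i, pattern in enumerate(QUERY):
        for triple in fragment:
            b = match_pattern(pattern, triple)
            if b is not None:
                out.append((frozenset([i]), b))   # (covered patterns, bindings)
    return out

def compatible(b1, b2):
    return all(b2.get(k, v) == v for k, v in b1.items())

def assemble(all_partials):
    """Centralized assembly: pairwise join suffices for this two-pattern query."""
    answers = []
    for (c1, b1), (c2, b2) in combinations(all_partials, 2):
        if c1.isdisjoint(c2) and compatible(b1, b2) and len(c1 | c2) == len(QUERY):
            answers.append({**b1, **b2})
    return answers

partials = [m for frag in FRAGMENTS for m in local_partial_matches(frag)]
print(assemble(partials))   # -> [{'?x': 'alice', '?y': 'bob', '?z': 'acme'}]
```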

    Query processing in temporal object-oriented databases

    This PhD thesis is concerned with historical data management in the context of object-oriented databases. An extensible approach has been explored to processing temporal object queries within a uniform query framework. By the uniform framework, we mean that temporal queries can be processed within the existing object-oriented framework that is extended from the relational framework, by extending the existing query processing techniques and strategies developed for OODBs and RDBs. The unified model of OODBs and RDBs in UmSQL/X has been adopted as a basis for this purpose. A temporal object data model is thereby defined by incorporating a time dimension into this unified model of OODBs and RDBs, forming temporal relational-like cubes but with the addition of aggregation and inheritance hierarchies. A query algebra, which accesses objects through these associations of aggregation, inheritance and time-reference, is then defined as a general query model/language. Owing to the extensive features of our data model and the reducibility of the algebra, a layered query-processor structure is presented that provides a uniform framework for processing temporal object queries. Within this uniform framework, query transformation is carried out based on an identified set of transformation rules that includes the known relational and object rules plus those pertaining to the time dimension. To evaluate a temporal query involving a path with time-reference, a decomposition strategy is proposed. That is, evaluation of an enhanced path, which is defined to extend a path with time-reference, is decomposed by initially dividing the path into two sub-paths: one containing the time-stamped class, which can be optimized by making use of the ordering information of temporal data, and the other an ordinary sub-path (without time-stamped classes), which can be further decomposed and evaluated using different algorithms. The intermediate results of traversing the two sub-paths are then joined together to create the query output. Algorithms for processing the decomposed query components, i.e., time-related operation algorithms and four join algorithms (nested-loop forward join, sort-merge forward join, nested-loop reverse join and sort-merge reverse join) and their modifications, have been presented with cost analysis and implemented with stream processing techniques using C++. Simulation results are also provided. Both the cost analysis and the simulation show the effects of time on the query processing algorithms: the join time cost increases linearly with the number of time-epochs (the time dimension in the case of a regular TS). It is also shown that heuristics that make use of time information can lead to significant savings in time cost. Query processing with incomplete temporal data has also been discussed.
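
    As background for the join algorithms named above, the following is a generic sort-merge equi-join sketch in Python (the thesis implements its forward and reverse variants in C++ with stream processing; the exact path and time-reference semantics are not reproduced here). Both inputs are assumed to be sorted on the join key, so each stream is scanned once.

```python
# Generic sort-merge equi-join over two inputs sorted on the join key,
# e.g. a time-stamp or object identifier.
def sort_merge_join(left, right, key_l, key_r):
    """Yield pairs (l, r) with key_l(l) == key_r(r)."""
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_l(left[i]), key_r(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Collect the run of equal keys on the right, then pair it with the
            # run of equal keys on the left (handles duplicate join keys).
            j_start = j
            while j < len(right) and key_r(right[j]) == kl:
                j += 1
            while i < len(left) and key_l(left[i]) == kl:
                for r in right[j_start:j]:
                    yield left[i], r
                i += 1

# Toy usage: employee versions joined to departments on department id,
# with both lists already sorted on that key.
emps  = [(1, "ann", 10), (2, "bob", 10), (3, "eve", 20)]
depts = [(10, "db lab"), (20, "ai lab")]
print(list(sort_merge_join(emps, depts, key_l=lambda e: e[2], key_r=lambda d: d[0])))
```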

    Data Warehousing Modernization: Big Data Technology Implementation

    Considering the challenges posed by Big Data, the cost to scale traditional data warehouses is high and their performance would be inadequate to meet the growing volume, variety and velocity of data. The Hadoop ecosystem addresses both of these shortcomings. Hadoop can store and analyze large data sets in parallel in a distributed environment, but it cannot replace existing data warehouses and RDBMS systems due to its own limitations, which are explained in this paper. In this paper, I identify the reasons why many enterprises fail and struggle to adapt to Big Data technologies. A brief outline of two different technologies for handling Big Data is presented: IBM's PureData System for Analytics (Netezza), usually used for reporting, and Hadoop with Hive, which is used for analytics. This paper also covers an enterprise architecture, adopted by successful companies, in which Hadoop runs alongside a massively parallel processing data warehouse to analyze, filter, process, and store the data. Despite having the technology to support and process Big Data, industries are still struggling to meet their goals due to the lack of skilled personnel to study and analyze the data, in short, data scientists and statisticians.

    A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

    Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication, and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems, not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of the area, through which researchers can potentially identify new issues for investigation. We also hope that the proposed taxonomy and mapping provide an easy way for new practitioners to understand this complex area of research.

    Towards an internet-scale stream processing service with loosely-coupled entities

    Master's thesis (Master of Science)

    Application of overlay techniques to network monitoring

    Measurement and monitoring are important for the correct and efficient operation of a network, since these activities provide reliable information and accurate analysis for characterizing and troubleshooting a network's performance. The focus of network measurement is to measure the volume and types of traffic on a particular network and to record the raw measurement results. The focus of network monitoring is to initiate measurement tasks, collect raw measurement results, and report aggregated outcomes. Network systems are continuously evolving: besides incremental change to accommodate new devices, more drastic changes occur to accommodate new applications, such as overlay-based content delivery networks. As a consequence, a network can experience significant increases in size and significant levels of long-range, coordinated, distributed activity; furthermore, heterogeneous network technologies, services and applications coexist and interact. Reliance upon traditional, point-to-point, ad hoc measurements to manage such networks is becoming increasingly tenuous. In particular, correlated, simultaneous one-way measurements are needed, as is the ability to access measurement information stored throughout the network of interest. To address these new challenges, this dissertation proposes OverMon, a new paradigm for edge-to-edge network monitoring systems based on overlay techniques. Of particular interest, the significant network overhead caused by conventional overlay techniques is addressed by constructing overlay networks with topology awareness: the topology information is derived from interior gateway protocol (IGP) traffic, i.e., OSPF traffic, thus eliminating all overlay-maintenance network overhead. Through a prototype that uses overlays to initiate measurement tasks and to retrieve measurement results, a systematic evaluation has been conducted to demonstrate the feasibility and functionality of OverMon. The measurement results show that OverMon achieves good scalability, flexibility and extensibility, which are important in addressing the new challenges arising from network system evolution. This work therefore contributes an innovative approach of applying overlay techniques to solve realistic network monitoring problems, and provides valuable first-hand experience in building and evaluating such a distributed system.
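
    The topology-aware overlay construction can be pictured with a small sketch (not OverMon's code; the link list, costs, and router names are invented): given adjacencies and costs learned passively from OSPF link-state advertisements, a shortest-path tree rooted at the monitoring station can serve as the overlay for dispatching measurement tasks and collecting results, with no extra maintenance traffic generated.

```python
# Toy sketch of topology-aware overlay construction. Router adjacencies and
# costs are assumed to have been parsed from OSPF link-state advertisements,
# so building the overlay itself adds no probing or maintenance traffic.
import heapq
from collections import defaultdict

# (router_a, router_b, ospf_cost) as they might be read from router LSAs.
OSPF_LINKS = [
    ("r1", "r2", 10), ("r2", "r3", 10), ("r1", "r3", 30),
    ("r3", "r4", 5), ("r2", "r4", 40),
]

def shortest_path_tree(links, root):
    """Dijkstra over the OSPF-derived graph; returns child -> parent edges.

    The resulting tree mirrors the IGP's own shortest paths, so overlay hops
    between the monitoring root and edge nodes follow the underlying topology.
    """
    adj = defaultdict(list)
    for a, b, cost in links:
        adj[a].append((cost, b))
        adj[b].append((cost, a))
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                           # stale heap entry
        for cost, v in adj[u]:
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                parent[v] = u
                heapq.heappush(heap, (dist[v], v))
    return parent

# Overlay rooted at the monitoring station attached to r1.
print(shortest_path_tree(OSPF_LINKS, "r1"))
# -> {'r2': 'r1', 'r3': 'r2', 'r4': 'r3'}
```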

    Tunable Security for Deployable Data Outsourcing

    Security mechanisms like encryption negatively affect other software quality characteristics such as efficiency. To cope with such trade-offs, it is preferable to build approaches that allow the trade-offs to be tuned after the design and implementation phases. This book introduces a methodology that can be used to build such tunable approaches. The book shows how the proposed methodology can be applied in the domains of database outsourcing, identity management, and credential management.