    Understanding, Estimating, and Incorporating Output Quality Into Join Algorithms For Information Extraction

    Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process the documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers a variety of join algorithms from relational query optimization, and predicts the output quality (and, of course, the execution time) of the alternative execution plans. We establish the accuracy of our analytical models, as well as study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.

    Efficient querying of Linked Data by distributing workload

    Online data is presented in different ways and in various forms that are not mutually compatible. The same problem appears in Web APIs, because we usually have to implement a specialised client suited to the kind of data the Web service provides. Linked Data addresses this problem, but has problems of its own: query performance and the availability of remote SPARQL endpoints. With Triple Pattern Fragments we can execute SPARQL queries by transferring some of the workload to the client, at the cost of transferring more data. The existing AMF extension reduces the number of HTTP requests, and consequently the amount of transferred data, for some queries, while increasing the amount of transferred data for others. In this thesis we present our extension, in which we try to lower both the number of HTTP requests and the amount of transferred data by extending the metadata with a Bloom filter containing data linked to the triples on the current page of the Triple Pattern Fragment. We have compared our extension with the AMF extension and achieved encouraging results. We have also proposed a fix for the AMF extension, which has already been included in the official repository. Finally, we have developed a simple graphical user interface that enables the composition of SPARQL queries and their execution using our extension.
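
The idea of shipping a Bloom filter with page metadata can be illustrated with a minimal sketch. This is not the thesis's implementation or the AMF wire format; the class `BloomFilter`, the helper `candidates_worth_requesting`, and the sizing parameters are hypothetical names chosen for illustration. The sketch only shows why such a filter can save HTTP requests: a client can skip any candidate term the filter rules out, since a Bloom filter never reports a false negative.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive one bit position per hash by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def candidates_worth_requesting(page_filter, candidates):
    """Keep only the candidate terms the filter cannot rule out (assumed client-side step)."""
    return [term for term in candidates if term in page_filter]
```

In this sketch a server would call `add` for each term related to the triples on a page and send the filter along with the page; a client would then call `candidates_worth_requesting` before issuing follow-up requests. Terms that pass the filter may still turn out to be absent, so the filter reduces requests rather than replacing them.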

    Content And Multimedia Database Management Systems

    A database management system is a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications. The main characteristic of the ‘database approach’ is that it increases the value of data by its emphasis on data independence. DBMSs, and in particular those based on the relational data model, have been very successful at the management of administrative data in the business domain. This thesis has investigated data management in multimedia digital libraries, and its implications for the design of database management systems. The main problem of multimedia data management is providing access to the stored objects. The content structure of administrative data is easily represented in alphanumeric values. Thus, database technology has primarily focused on handling the objects’ logical structure. In the case of multimedia data, however, representing content is far from trivial and is not supported by current database management systems.