77 research outputs found

    Controlling Risk of Web Question Answering

    Full text link
    Web question answering (QA) has become an indispensable component in modern search systems, which can significantly improve users' search experience by providing a direct answer to users' information need. This could be achieved by applying machine reading comprehension (MRC) models over the retrieved passages to extract answers with respect to the search query. With the development of deep learning techniques, state-of-the-art MRC performances have been achieved by recent deep methods. However, existing studies on MRC seldom address the predictive uncertainty issue, i.e., how likely the prediction of an MRC model is wrong, leading to uncontrollable risks in real-world Web QA applications. In this work, we first conduct an in-depth investigation over the risk of Web QA. We then introduce a novel risk control framework, which consists of a qualify model for uncertainty estimation using the probe idea, and a decision model for selectively output. For evaluation, we introduce risk-related metrics, rather than the traditional EM and F1 in MRC, for the evaluation of risk-aware Web QA. The empirical results over both the real-world Web QA dataset and the academic MRC benchmark collection demonstrate the effectiveness of our approach.Comment: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieva

    Using English information in Non-English web search

    Get PDF

    Neo: A Learned Query Optimizer

    Full text link
    Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query executions plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers similar performance to state-of-the-art commercial optimizers, and in some cases even surpass them

    Understanding and modeling users of modern search engines

    Get PDF

    Statistical Extraction of Multilingual Natural Language Patterns for RDF Predicates: Algorithms and Applications

    Get PDF
    The Data Web has undergone a tremendous growth period. It currently consists of more then 3300 publicly available knowledge bases describing millions of resources from various domains, such as life sciences, government or geography, with over 89 billion facts. In the same way, the Document Web grew to the state where approximately 4.55 billion websites exist, 300 million photos are uploaded on Facebook as well as 3.5 billion Google searches are performed on average every day. However, there is a gap between the Document Web and the Data Web, since for example knowledge bases available on the Data Web are most commonly extracted from structured or semi-structured sources, but the majority of information available on the Web is contained in unstructured sources such as news articles, blog post, photos, forum discussions, etc. As a result, data on the Data Web not only misses a significant fragment of information but also suffers from a lack of actuality since typical extraction methods are time-consuming and can only be carried out periodically. Furthermore, provenance information is rarely taken into consideration and therefore gets lost in the transformation process. In addition, users are accustomed to entering keyword queries to satisfy their information needs. With the availability of machine-readable knowledge bases, lay users could be empowered to issue more specific questions and get more precise answers. In this thesis, we address the problem of Relation Extraction, one of the key challenges pertaining to closing the gap between the Document Web and the Data Web by four means. First, we present a distant supervision approach that allows finding multilingual natural language representations of formal relations already contained in the Data Web. We use these natural language representations to find sentences on the Document Web that contain unseen instances of this relation between two entities. Second, we address the problem of data actuality by presenting a real-time data stream RDF extraction framework and utilize this framework to extract RDF from RSS news feeds. Third, we present a novel fact validation algorithm, based on natural language representations, able to not only verify or falsify a given triple, but also to find trustworthy sources for it on the Web and estimating a time scope in which the triple holds true. The features used by this algorithm to determine if a website is indeed trustworthy are used as provenance information and therewith help to create metadata for facts in the Data Web. Finally, we present a question answering system that uses the natural language representations to map natural language question to formal SPARQL queries, allowing lay users to make use of the large amounts of data available on the Data Web to satisfy their information need

    Guest editorial

    Get PDF

    Merging Queries in OLTP Workloads

    Get PDF
    OLTP applications are usually executed by a high number of clients in parallel and are typically faced with high throughput demand as well as a constraint latency requirement for individual statements. In enterprise scenarios, they often face the challenge to deal with overload spikes resulting from events such as Cyber Monday or Black Friday. The traditional solution to prevent running out of resources and thus coping with such spikes is to use a significant over-provisioning of the underlying infrastructure. In this thesis, we analyze real enterprise OLTP workloads with respect to statement types, complexity, and hot-spot statements. Interestingly, our findings reveal that workloads are often read-heavy and comprise similar query patterns, which provides a potential to share work of statements belonging to different transactions. In the past, resource sharing has been extensively studied for OLAP workloads. Naturally, the question arises, why studies mainly focus on OLAP and not on OLTP workloads? At first sight, OLTP queries often consist of simple calculations, such as index look-ups with little sharing potential. In consequence, such queries – due to their short execution time – may not have enough potential for the additional overhead. In addition, OLTP workloads do not only execute read operations but also updates. Therefore, sharing work needs to obey transactional semantics, such as the given isolation level and read-your-own-writes. This thesis presents THE LEVIATHAN, a novel batching scheme for OLTP workloads, an approach for merging read statements within interactively submitted multi-statement transactions consisting of reads and updates. Our main idea is to merge the execution of statements by merging their plans, thus being able to merge the execution of not only complex, but also simple calculations, such as the aforementioned index look-up. We identify mergeable statements by pattern matching of prepared statement plans, which comes with low overhead. For obeying the isolation level properties and providing read-your-own-writes, we first define a formal framework for merging transactions running under a given isolation level and provide insights into a prototypical implementation of merging within a commercial database system. Our experimental evaluation shows that, depending on the isolation level, the load in the system, and the read-share of the workload, an improvement of the transaction throughput by up to a factor of 2.5x is possible without compromising the transactional semantics. Another interesting effect we show is that with our strategy, we can increase the throughput of a real enterprise workload by 20%.:1 INTRODUCTION 1.1 Summary of Contributions 1.2 Outline 2 WORKLOAD ANALYSIS 2.1 Analyzing OLTP Benchmarks 2.1.1 YCSB 2.1.2 TATP 2.1.3 TPC Benchmark Scenarios 2.1.4 Summary 2.2 Analyzing OLTP Workloads from Open Source Projects 2.2.1 Characteristics of Workloads 2.2.2 Summary 2.3 Analyzing Enterprise OLTP Workloads 2.3.1 Overview of Reports about OLTP Workload Characteristics 2.3.2 Analysis of SAP Hybris Workload 2.3.3 Summary 2.4 Conclusion 3 RELATED WORK ON QUERY MERGING 3.1 Merging the Execution of Operators 3.2 Merging the Execution of Subplans 3.3 Merging the Results of Subplans 3.4 Merging the Execution of Full Plans 3.5 Miscellaneous Works on Merging 3.6 Discussion 4 MERGING STATEMENTS IN MULTI STATEMENT TRANSACTIONS 4.1 Overview of Our Approach 4.1.1 Examples 4.1.2 Why Naïve Merging Fails 4.2 THE LEVIATHAN Approach 4.3 Formalizing THE LEVIATHAN Approach 4.3.1 Transaction Theory 4.3.2 Merging Under MVCC 4.4 Merging Reads Under Different Isolation Levels 4.4.1 Read Uncommitted 4.4.2 Read Committed 4.4.3 Repeatable Read 4.4.4 Snapshot Isolation 4.4.5 Serializable 4.4.6 Discussion 4.5 Merging Writes Under Different Isolation Levels 4.5.1 Read Uncommitted 4.5.2 Read Committed 4.5.3 Snapshot Isolation 4.5.4 Serializable 4.5.5 Handling Dependencies 4.5.6 Discussion 5 SYSTEM MODEL 5.1 Definition of the Term “Overload” 5.2 Basic Queuing Model 5.2.1 Option (1): Replacement with a Merger Thread 5.2.2 Option (2): Adding Merger Thread 5.2.3 Using Multiple Merger Threads 5.2.4 Evaluation 5.3 Extended Queue Model 5.3.1 Option (1): Replacement with a Merger Thread 5.3.2 Option (2): Adding Merger Thread 5.3.3 Evaluation 6 IMPLEMENTATION 6.1 Background: SAP HANA 6.2 System Design 6.2.1 Read Committed 6.2.2 Snapshot Isolation 6.3 Merger Component 6.3.1 Overview 6.3.2 Dequeuing 6.3.3 Merging 6.3.4 Sending 6.3.5 Updating MTx State 6.4 Challenges in the Implementation of Merging Writes 6.4.1 SQL String Implementation 6.4.2 Update Count 6.4.3 Error Propagation 6.4.4 Abort and Rollback 7 EVALUATION 7.1 Benchmark Settings 7.2 System Settings 7.2.1 Experiment I: End-to-end Response Time Within a SAP Hybris System 7.2.2 Experiment II: Dequeuing Strategy 7.2.3 Experiment III: Merging Improvement on Different Statement, Transaction and Workload Types 7.2.4 Experiment IV: End-to-End Latency in YCSB 7.2.5 Experiment V: Breakdown of Execution in YCSB 7.2.6 Discussion of System Settings 7.3 Merging in Interactive Transactions 7.3.1 Experiment VI: Merging TATP in Read Uncommitted 7.3.2 Experiment VII: Merging TATP in Read Committed 7.3.3 Experiment VIII: Merging TATP in Snapshot Isolation 7.4 Merging Queries in Stored Procedures Experiment IX: Merging TATP Stored Procedures in Read Committed 7.5 Merging SAP Hybris 7.5.1 Experiment X: CPU-time Breakdown on HANA Components 7.5.2 Experiment XI: Merging Media Query in SAP Hybris 7.5.3 Discussion of our Results in Comparison with Related Work 8 CONCLUSION 8.1 Summary 8.2 Future Research Directions REFERENCES A UML CLASS DIAGRAM
    • …
    corecore