663 research outputs found

    Scheduling queries to improve the freshness of a website

    The WWW has become a new advertising medium in recent years, which corporations use to increase their exposure to consumers. For a very large website whose content is derived from a source database, it is important to maintain its freshness in response to changes to the base data. This issue is particularly significant for websites presenting fast-changing information such as stock exchange and product information. In this paper, we formally define and study the freshness of a website that is refreshed by scheduling a set of queries to fetch fresh data from the source databases. We then propose several online scheduling algorithms and compare their performance on the freshness metric. Our conclusions are verified by empirical results. Keywords: Internet Data Management, View Maintenance, Query Optimization, Hard Real-Time Scheduling 1 Introduction The popularity of the World-Wide Web (WWW) has made it a prime vehicle for disseminating information. More and ..
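    The scheduling idea can be made concrete with a small sketch. The Python fragment below is an illustrative greedy policy, not one of the paper's algorithms: it tracks per-query staleness and, at each step, runs the stale refresh query with the highest staleness-to-cost ratio. All class and function names are hypothetical.

```python
class RefreshQuery:
    """A query that pulls fresh data from the source database for one part of the site."""
    def __init__(self, name, cost, last_refreshed, last_source_update):
        self.name = name                              # view / page fragment it refreshes
        self.cost = cost                              # estimated execution time (seconds)
        self.last_refreshed = last_refreshed          # when the website copy was rebuilt
        self.last_source_update = last_source_update  # when the base data last changed

    def staleness(self, now):
        """How long the published copy has been out of date (zero if still current)."""
        if self.last_source_update <= self.last_refreshed:
            return 0.0
        return now - self.last_source_update


def pick_next_query(queries, now):
    """Greedy online policy: run the stale query with the largest staleness per unit cost."""
    stale = [q for q in queries if q.staleness(now) > 0]
    if not stale:
        return None
    return max(stale, key=lambda q: q.staleness(now) / q.cost)
```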

    Implementation Steps to Optimize Search Engine Marketing (SEM) Results for Small and Medium Sized E-Commerce Companies

    In terms of Internet marketing, search engines are the channel of choice for most Internet advertisers in today's online market. Given their growth as an online advertising medium, online marketers exploit this channel using Search Engine Marketing (SEM), together with its strategies and implementation steps. This paper suggests implementation steps for SEM that help startup websites become visible and competitive through this medium. The search engine mechanisms of indexing and web crawling shape SEM into implementation steps that include short-term and long-term marketing strategies. These mechanisms, together with the author's experience in organizing SEM, contribute to the conceptualization of the steps. In the light of the implementation steps, marketers are able to continually advance their websites with the direction

    LiveRank: How to Refresh Old Datasets

    This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (Web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call LiveRank a ranking of the old pages such that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks, for Web graphs as well as for online social networks
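    A minimal sketch of the static setting described above, assuming the old graph is available as a networkx structure: rank the old nodes by PageRank, then measure the LiveRank's cost as the number of liveness queries issued before a target fraction of the active nodes is found. The function names and the use of networkx are illustrative choices, not the paper's implementation.

```python
import networkx as nx

def liverank_static(graph):
    """Order the old nodes by PageRank (descending): a simple static LiveRank."""
    scores = nx.pagerank(graph)
    return sorted(graph.nodes, key=lambda n: scores[n], reverse=True)

def queries_to_find_fraction(liverank, is_alive, target_fraction):
    """Cost of a LiveRank: liveness queries issued, in ranked order, until a
    target fraction of the still-active nodes has been identified."""
    total_alive = sum(1 for n in liverank if is_alive(n))
    if total_alive == 0:
        return 0
    needed = target_fraction * total_alive
    found = queries = 0
    for node in liverank:
        queries += 1              # one online liveness test per candidate
        if is_alive(node):
            found += 1
            if found >= needed:
                break
    return queries
```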

    Scaling Up Concurrent Analytical Workloads on Multi-Core Servers

    Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduced reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single strategy for task scheduling and data placement that is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets introduces overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and providing a more responsive user experience
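    As a rough illustration of the scheduling side of NUMA-awareness (in Python rather than a real execution engine), the toy scheduler below keeps one task queue per socket and lets idle workers steal only tasks that are not memory-intensive, so bandwidth-bound work stays near its data. The class and field names are hypothetical, and the sketch omits the data placement strategies that the thesis treats alongside scheduling.

```python
from collections import deque

class Task:
    def __init__(self, run, memory_intensive, home_socket):
        self.run = run                          # callable doing the actual work
        self.memory_intensive = memory_intensive
        self.home_socket = home_socket          # socket where the task's data lives

class NumaScheduler:
    """Toy NUMA-aware scheduler: workers prefer tasks from their own socket's
    queue and only steal compute-bound tasks from other sockets."""
    def __init__(self, num_sockets):
        self.queues = [deque() for _ in range(num_sockets)]

    def submit(self, task):
        self.queues[task.home_socket].append(task)

    def next_task(self, socket):
        if self.queues[socket]:
            return self.queues[socket].popleft()
        # Steal only tasks that are not memory-intensive, so bandwidth-bound
        # work is not dragged across the interconnect.
        for other, queue in enumerate(self.queues):
            if other == socket:
                continue
            for i, task in enumerate(queue):
                if not task.memory_intensive:
                    del queue[i]
                    return task
        return None
```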

    Improved User News Feed Customization for an Open Source Search Engine

    Yioop is an open source search engine project hosted on the site of the same name. It offers several features outside of searching, one such feature being a news feed. The current news feed system aggregates articles from a curated list of news sites determined by the owner. However, in its current state, the feed list is limited in size, constrained by the hardware that the aggregator runs on. The goal of my project was to overcome this limit by improving the current storage method. The solution was derived by making use of IndexArchiveBundles and IndexShards, both of which are abstract data structures designed to handle large indexes. An additional capability needed to accommodate the news feed was the ability to traverse these data structures in decreasing order of recency of addition. New methods were added to the preexisting WordIterator to handle this need. The result is a system with two new advantages: the capacity to store more feed items than before, and the ability to move through indexes from the end back to the start. Our findings also indicate that the new process is much faster, with insertions taking one-tenth of the time at their fastest. Additionally, whereas the old system stored only around 37,500 items at most, the new system allows a potentially unlimited number of news items to be stored. The methodology detailed in this project can also be applied to any information retrieval system to construct an index and read from it
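    The reverse-traversal idea can be sketched independently of Yioop's actual PHP classes. The toy index below (hypothetical names, not the real IndexArchiveBundle/IndexShard/WordIterator API) appends feed items to per-term posting lists and walks a posting list from the end back to the start, yielding the most recently added items first.

```python
class FeedIndex:
    """Toy append-only index of feed items keyed by term, illustrating
    newest-first traversal in the spirit of the modified WordIterator."""
    def __init__(self):
        self.postings = {}          # term -> list of item ids in insertion order
        self.items = []             # item id -> stored feed item

    def add(self, item, terms):
        item_id = len(self.items)
        self.items.append(item)
        for term in terms:
            self.postings.setdefault(term, []).append(item_id)

    def iterate_newest_first(self, term):
        """Walk a term's posting list from the end back to the start, so the
        most recently added feed items come out first."""
        for item_id in reversed(self.postings.get(term, [])):
            yield self.items[item_id]
```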

    A simulation model for evaluating national patient record networks in South Africa.

    This study has shown that modelling and simulation is a feasible approach for evaluating NPR solutions in the developing context. The model can represent different network models, patient types and performance metrics to aid in the evaluation of NPR solutions. Using the current model, more case studies can be investigated for various public health issues, such as the impact of disease or regional services planning

    Recurring Query Processing on Big Data

    The advances in hardware, software, and networks have enabled applications from business enterprises, scientific and engineering disciplines, to social networks, to generate data at unprecedented volume, variety, velocity, and veracity not possible before. Innovation in these domains is thus now limited by the ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop is one of the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease-of-use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, repeatedly executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data. We first propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to the state-of-the-art MapReduce/Hadoop system, which experiences significant challenges when faced with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop retains the fault tolerance of MapReduce via automatic cache recovery and task re-execution support as well. Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called Helix. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them to maximize SLA satisfaction. Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. In this dissertation, we propose the first fast approximate query engine for recurring workloads in the MapReduce infrastructure, called Faro. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response time requirements specified in recurring queries. In our comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, and robustness
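    A minimal sketch of deadline-aware sampling in the spirit of Faro, with hypothetical names and a crude constant per-record cost model: pick the largest sample the deadline affords, then scale the sampled aggregate back up. The actual engine's sampling and resource-allocation strategies are considerably more involved.

```python
import random

def deadline_aware_sample_size(total_records, secs_per_record, deadline_secs):
    """Largest sample the estimated per-record cost allows within the deadline,
    capped at the full data set size."""
    affordable = int(deadline_secs / secs_per_record)
    return min(total_records, max(affordable, 1))

def approximate_sum(records, deadline_secs, secs_per_record=1e-4):
    """Approximate SUM over `records` under a deadline by sampling without
    replacement and scaling the sample total back up."""
    n = len(records)
    k = deadline_aware_sample_size(n, secs_per_record, deadline_secs)
    if k >= n:
        return float(sum(records))          # deadline allows an exact answer
    sample = random.sample(records, k)
    return sum(sample) * (n / k)            # scale the sample total to the full data
```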

    Improving Data Delivery in Wide Area and Mobile Environments

    The popularity of the Internet has dramatically increased the diversity of clients and applications that access data across wide area networks and mobile environments. Data delivery in these environments presents several challenges. First, applications often have diverse requirements with respect to the latency of their requests and recency of data. Traditional data delivery architectures do not provide interfaces to express these requirements. Second, it is difficult to accurately estimate when objects are updated. Existing solutions either require servers to notify clients (push-based), which adds overhead at servers and may not scale, or require clients to contact servers (pull-based), which rely on estimates that are often inaccurate in practice. Third, cache managers need a flexible and scalable way to determine whether an object in the cache meets a client's latency and recency preferences. Finally, mobile clients who access data on wireless networks share limited wireless bandwidth and typically have different QoS requirements for different applications. In this dissertation we address these challenges using two complementary techniques: client profiles and server cooperation. Client profiles are a set of parameters that enable clients to communicate application-specific latency and recency preferences to caches and wireless base stations. Profiles are used by cache managers to determine whether to deliver a cached object to the client or to validate the object at a remote server, and for scheduling data delivery to mobile clients. Server cooperation enables servers to provide resource information to cache managers, which enables cache managers to estimate the recency of cached objects. The main contributions of this dissertation are as follows. First, we present a flexible and scalable architecture to support client profiles that is straightforward to implement at a cache or wireless base station. Second, we present techniques to improve estimates of the recency of cached objects using server cooperation, by increasing the amount of information servers provide to caches. Third, for mobile clients, we present a framework for incorporating profiles into the cache utilization, downloading, and scheduling decisions at a wireless base station. We evaluate client profiles and server cooperation using synthetic and trace data. Finally, we present an implementation of profiles and experimental results
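    A small sketch of how a cache manager might consult a client profile, using hypothetical field names (`max_staleness_secs`, `estimated_update_interval`) rather than the dissertation's actual interfaces: if the cached copy's age is within both the client's recency tolerance and the server-reported update interval, serve it from the cache; otherwise revalidate at the origin server.

```python
import time

class ClientProfile:
    """Application-specific preferences a client attaches to its requests."""
    def __init__(self, max_latency_secs, max_staleness_secs):
        self.max_latency_secs = max_latency_secs      # tolerated response delay
        self.max_staleness_secs = max_staleness_secs  # tolerated data age

def serve_from_cache_or_validate(entry, profile, now=None):
    """Decide whether a cached object satisfies the client's recency preference
    or must be revalidated at the origin server.

    `entry` is assumed to expose `cached_at` (when it was fetched) and
    `estimated_update_interval` (a server-provided hint about how often the
    object changes); both are hypothetical fields for illustration.
    """
    now = time.time() if now is None else now
    age = now - entry.cached_at
    if age <= profile.max_staleness_secs and age <= entry.estimated_update_interval:
        return "serve_cached"     # meets the recency preference at the lowest latency
    return "validate_at_server"   # too old for this client; pay the validation latency
```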