Scheduling queries to improve the freshness of a website
In recent years, the WWW has become an advertising medium that corporations use to increase their exposure to consumers. For a very large website whose content is derived from some source database, it is important to maintain its freshness in response to changes to the base data. This issue is particularly significant for websites presenting fast-changing information such as stock exchange information and product information. In this paper, we formally define and study the freshness of a website that is refreshed by scheduling a set of queries to fetch fresh data from the databases. We then propose several online scheduling algorithms and compare their performance on the freshness metric. Our conclusion is verified by empirical results. Keywords: Internet Data Management, View Maintenance, Query Optimization, Hard Real-Time Scheduling 1 Introduction The popularity of the World-Wide Web (WWW) has made it a prime vehicle for disseminating information. More and ..
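As a minimal sketch of one plausible policy from the family this abstract describes (not the paper's actual algorithm), the Python below refreshes, at each scheduling slot, the query with the largest weighted staleness; the query names and importance weights are illustrative assumptions.

    class FreshnessScheduler:
        def __init__(self, queries):
            # queries: dict mapping query id -> importance weight (assumed)
            self.weights = dict(queries)
            self.last_refresh = {q: 0.0 for q in queries}

        def pick(self, now):
            # Choose the query whose weighted staleness is largest.
            return max(self.weights,
                       key=lambda q: self.weights[q] * (now - self.last_refresh[q]))

        def refresh(self, q, now):
            self.last_refresh[q] = now

    sched = FreshnessScheduler({"stock_quotes": 5.0, "product_list": 1.0})
    for t in range(1, 6):
        q = sched.pick(t)
        sched.refresh(q, t)   # fetch fresh data for q from the source database
        print(t, q)

Under this weighting, the fast-changing stock view is refreshed far more often than the slow-changing product view, which is the kind of trade-off the freshness metric is meant to capture.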
Implementation Steps to Optimize Search Engine Marketing (SEM) Results for Small and Medium Sized E-Commerce Companies
In Internet marketing, search engines are the channel of choice for most advertisers in today's online market. Given their growth as an online advertising medium, marketers exploit this channel through Search Engine Marketing (SEM) and its associated strategies and implementation steps. This paper suggests implementation steps for SEM that help startup websites become visible and competitive through this medium. The mechanics of search engines, in particular indexing and web crawling, shape SEM into implementation steps comprising short-term and long-term marketing strategies. These mechanisms, combined with the author's experience in organizing SEM, inform the formulation and conceptualization of the steps. In light of these implementation steps, marketers are able to continually advance their websites with the direction
LiveRank: How to Refresh Old Datasets
This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (Web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call LiveRank a ranking of the old pages so that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks, for Web graphs as well as for online social networks.
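Since the abstract defines LiveRank quality operationally, a short sketch can make the metric concrete. In the Python below, is_alive stands in for the online liveness query and the PageRank scores are illustrative placeholders, not the paper's data.

    def queries_needed(order, is_alive, target_fraction):
        # Walk pages in the candidate order, querying liveness one page at
        # a time, and count the queries needed to find the target fraction
        # of live pages.
        total_alive = sum(1 for p in order if is_alive(p))
        found, queries = 0, 0
        for page in order:                 # LiveRank order: likely-live first
            queries += 1
            if is_alive(page):
                found += 1
                if found >= target_fraction * total_alive:
                    return queries
        return queries

    pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}    # stand-in PageRank scores
    static_liverank = sorted(pagerank, key=pagerank.get, reverse=True)
    alive = {"a", "c"}
    print(queries_needed(static_liverank, alive.__contains__, 0.5))

A good LiveRank minimizes this count; a dynamic variant would additionally reorder the remaining pages as query answers arrive.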
Scaling Up Concurrent Analytical Workloads on Multi-Core Servers
Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that no single task scheduling and data placement strategy is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets incurs overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and delivering a more responsive user experience.
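As a minimal sketch of the scheduling idea summarized above (socket-local queues plus restricted inter-socket stealing), the Python below refuses to steal memory-intensive tasks across sockets, since remote memory accesses would otherwise dominate; the Task fields and two-socket layout are illustrative assumptions, not the thesis prototype.

    from collections import deque
    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        memory_intensive: bool   # bandwidth-bound tasks stay NUMA-local

    queues = {0: deque([Task("scan_lineitem", True)]),
              1: deque()}

    def next_task(socket_id):
        if queues[socket_id]:                    # prefer NUMA-local work
            return queues[socket_id].popleft()
        for other, q in queues.items():          # otherwise try to steal...
            if other != socket_id:
                for t in list(q):
                    if not t.memory_intensive:   # ...but never steal tasks
                        q.remove(t)              # that are bandwidth-bound
                        return t
        return None

    print(next_task(1))  # socket 1 is idle but refuses the memory-bound scan

Letting an idle socket go unused here is deliberate: per the thesis finding, completing the scan with local memory beats saturating the interconnect.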
Improved User News Feed Customization for an Open Source Search Engine
Yioop is an open source search engine project hosted on the site of the same name. It offers several features beyond searching, one of which is a news feed. The current news feed system aggregates articles from a curated list of news sites determined by the owner. However, in its current state, the feed list is limited in size, constrained by the hardware that the aggregator runs on. The goal of my project was to overcome this limit by improving the current storage method. The solution makes use of IndexArchiveBundles and IndexShards, both of which are abstract data structures designed to handle large indexes. An additional requirement for the news feed was the ability to traverse these data structures in decreasing order of recency. New methods were added to the preexisting WordIterator to handle this need. The result is a system with two new advantages: the capacity to store more feed items than before, and the ability to move through indexes from the end back to the start. Our findings also indicate that the new process is much faster, with insertions taking as little as one-tenth of the time. Additionally, whereas the old system stored at most around 37500 items, the new system allows a potentially unlimited number of news items to be stored. The methodology detailed in this project can also be applied to any information retrieval system to construct an index and read from it.
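A short sketch may help illustrate the reverse traversal the abstract describes: if feed items are appended to shards in arrival order, reading shards, and the items within each shard, back to front yields newest-first order. The shard layout below is a stand-in assumption for Yioop's IndexShards, not its actual code.

    shards = [["item1", "item2"], ["item3"], ["item4", "item5"]]  # oldest first

    def newest_first(shards):
        # Walk shards from the end, and items within each shard in reverse,
        # so items come out in decreasing order of recency.
        for shard in reversed(shards):
            for item in reversed(shard):
                yield item

    print(list(newest_first(shards)))  # item5, item4, item3, item2, item1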
A simulation model for evaluating national patient record networks in South Africa.
This study has shown that modelling and simulation is a feasible approach for evaluating national patient record (NPR) solutions in the developing context. The model can represent different network models, patient types, and performance metrics to aid in the evaluation of NPR solutions. Using the current model, more case studies can be investigated for various public health issues, such as the impact of disease or regional services planning.
Recurring Query Processing on Big Data
The advances in hardware, software, and networks have enabled applications from business enterprises, scientific and engineering disciplines, to social networks, to generate data at unprecedented volume, variety, velocity, and veracity. Innovation in these domains is thus now hindered by the ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop is one of the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease-of-use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, repeatedly executed for long periods of time over rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data.
In this dissertation, we first propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to state-of-the-art MapReduce/Hadoop systems, which experience significant challenges when faced with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop also retains the fault tolerance of MapReduce via automatic cache recovery and task re-execution support.
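As a minimal sketch of the inter-window caching idea (assuming pane-level granularity and a simple sum aggregate, neither of which is claimed to match Redoop's implementation), the Python below recomputes only the panes that are new in each window slide.

    pane_cache = {}

    def window_result(pane_ids, compute_pane):
        # Assemble a window's aggregate from cached per-pane results,
        # computing only the panes not seen in a previous window.
        total = 0
        for pid in pane_ids:
            if pid not in pane_cache:        # only new panes hit MapReduce
                pane_cache[pid] = compute_pane(pid)
            total += pane_cache[pid]
        return total

    compute = lambda pid: pid * 10               # stand-in per-pane computation
    print(window_result([0, 1, 2, 3], compute))  # cold window: 4 panes computed
    print(window_result([2, 3, 4, 5], compute))  # slide by 2: reuses panes 2, 3

The larger the overlap between consecutive windows, the more redundant computation such caching avoids, which is the motivation for treating recurring queries as first-class citizens.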
Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., the latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called Helix. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them that maximizes SLA satisfaction.
Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response-time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times. In this dissertation, we propose the first fast approximate query engine for recurring workloads in the MapReduce infrastructure, called Faro. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response-time requirements specified in recurring queries.
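As a minimal sketch of deadline-aware sampling (with a deliberately naive linear cost model that is an assumption, not Faro's strategy), the Python below shrinks the sample until the estimated processing time fits the deadline, then scales the sampled aggregate back up.

    import random

    def deadline_aware_sum(data, deadline_s, secs_per_item):
        # Derive how many items the deadline affords, sample that many,
        # and scale the sampled sum into an estimate of the full sum.
        budget = int(deadline_s / secs_per_item)
        n = min(len(data), budget)
        sample = random.sample(data, n)
        return sum(sample) * len(data) / n

    data = list(range(1_000_000))
    print(deadline_aware_sum(data, deadline_s=0.5, secs_per_item=1e-5))

Tighter deadlines yield smaller samples and wider error bounds; the adaptive resource allocation the abstract mentions would then spend any remaining slack on refining the estimate.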
In our comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, and robustness.
Improving Data Delivery in Wide Area and Mobile Environments
The popularity of the Internet has dramatically increased the diversity of clients and applications that access data across wide area networks and mobile environments. Data delivery in these environments presents several challenges. First, applications often have diverse requirements with respect to the latency of their requests and the recency of data. Traditional data delivery architectures do not provide interfaces to express these requirements. Second, it is difficult to accurately estimate when objects are updated. Existing solutions either require servers to notify clients (push-based), which adds overhead at servers and may not scale, or require clients to contact servers (pull-based), which relies on estimates that are often inaccurate in practice. Third, cache managers need a flexible and scalable way to determine whether an object in the cache meets a client's latency and recency preferences. Finally, mobile clients who access data over wireless networks share limited wireless bandwidth and typically have different QoS requirements for different applications.
In this dissertation we address these challenges using two complementary techniques: client profiles and server cooperation. Client profiles are a set of parameters that enable clients to communicate application-specific latency and recency preferences to caches and wireless base stations. Profiles are used by cache managers to determine whether to deliver a cached object to the client or to validate the object at a remote server, and for scheduling data delivery to mobile clients. Server cooperation enables servers to provide resource information to cache managers, which the cache managers use to estimate the recency of cached objects.
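A minimal sketch of such a profile-driven cache decision, assuming hypothetical field names and a precomputed age estimate rather than the dissertation's exact interface: serve the cached copy when it meets the client's recency preference, or when validating at the server would exceed the latency preference.

    from dataclasses import dataclass

    @dataclass
    class Profile:
        max_latency_s: float     # how long the client will wait
        max_staleness_s: float   # how old a cached object may be

    def serve_from_cache(profile, object_age_s, validate_cost_s):
        fresh_enough = object_age_s <= profile.max_staleness_s
        too_slow_to_check = validate_cost_s > profile.max_latency_s
        # Deliver the cached object if it satisfies the recency preference,
        # or if a round trip to the server would violate the latency one.
        return fresh_enough or too_slow_to_check

    news_reader = Profile(max_latency_s=0.2, max_staleness_s=60.0)
    print(serve_from_cache(news_reader, object_age_s=30.0, validate_cost_s=0.5))

The object's age estimate is where server cooperation enters: the more resource information servers share, the more accurate this input becomes.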
The main contributions of this dissertation are as follows. First, we present a flexible and scalable architecture to support client profiles that is straightforward to implement at a cache or wireless base station. Second, we present techniques to improve estimates of the recency of cached objects using server cooperation, by increasing the amount of information servers provide to caches. Third, for mobile clients, we present a framework for incorporating profiles into the cache utilization, downloading, and scheduling decisions at a wireless base station. We evaluate client profiles and server cooperation using synthetic and trace data. Finally, we present an implementation of profiles and experimental results.