    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present day. We introduce criteria to evaluate the relative performance of web crawlers; based on these criteria, we plot the evolution of web crawlers and compare their performance.
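
    The abstract summarizes the basic crawl loop only in one sentence (visit a page, collect data, discover new links). As a rough illustration of that loop, and not of any specific crawler from the survey, a minimal breadth-first sketch might look like the following; the queue discipline, the fixed politeness delay, and the use of requests/BeautifulSoup are illustrative assumptions.

```python
# Minimal breadth-first crawler sketch (illustrative; not from the survey).
# Assumes the requests and beautifulsoup4 packages; robots.txt handling
# and politeness are reduced to a fixed delay for brevity.
import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=100, delay=1.0):
    frontier = deque([seed_url])        # URLs discovered but not yet visited
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                    # skip unreachable pages
        soup = BeautifulSoup(page.text, "html.parser")
        yield url, soup                 # hand the page to indexing/analysis
        for link in soup.find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))
        time.sleep(delay)               # crude politeness between requests
```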

    Active caching for recommender systems

    Web users are often overwhelmed by the amount of information available while carrying out browsing and searching tasks. Recommender systems substantially reduce this information overload by suggesting a list of similar documents that users might find interesting. However, generating these ranked lists requires an enormous amount of resources, which often results in access latency. Caching frequently accessed data has been a useful technique for reducing stress on limited resources and improving response time. Traditional passive caching techniques, where the focus is on answering queries based on temporal locality or popularity, achieve a very limited performance gain. In this dissertation, we propose an ‘active caching’ technique for recommender systems as an extension of the caching model. In this approach, estimation is used to generate an answer for queries whose results are not explicitly cached, where the estimation makes use of the partial order lists cached for related queries. By answering non-cached queries along with cached queries, the active caching system acts as a form of query processor and offers substantial improvement over traditional caching methodologies. Test results for several data sets and recommendation techniques show substantial improvement in the cache hit rate, byte hit rate and CPU costs, while achieving reasonable recall rates. To improve the performance of the proposed active caching solution, a shared neighbor similarity measure is introduced, which improves the recall rates by eliminating the dependence on monotonicity in the partial order lists. Finally, a greedy balancing cache selection policy is also proposed to select the most appropriate data objects for the cache, which helps to improve the cache hit rate and recall further.
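
    As a rough illustration of the ‘active caching’ idea, synthesizing an answer for a non-cached query from the ranked lists cached for related queries, here is a minimal sketch. The term-overlap similarity, the rank-discounted weighting, and the data shapes are assumptions made for illustration, not the dissertation's actual design.

```python
# Sketch of active caching: answer a cache miss by merging the ranked
# lists cached for related queries. The similarity function and the
# weighting scheme below are illustrative assumptions.
from collections import defaultdict

def jaccard(a, b):
    """Simple term-overlap similarity between two query strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def answer(query, cache, k=10, min_sim=0.3):
    """cache maps query string -> ranked list of item ids."""
    if query in cache:                       # ordinary cache hit
        return cache[query][:k]
    scores = defaultdict(float)              # estimated relevance per item
    for cached_query, ranked_items in cache.items():
        sim = jaccard(query, cached_query)
        if sim < min_sim:
            continue
        # Items near the top of a related list contribute more.
        for rank, item in enumerate(ranked_items):
            scores[item] += sim / (rank + 1)
    estimated = sorted(scores, key=scores.get, reverse=True)
    return estimated[:k]                     # synthesized answer for the miss
```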

    DNS to the rescue: Discerning Content and Services in a Tangled Web

    A careful perusal of the Internet's evolution reveals two major trends: the explosion of cloud-based services and of video streaming applications. In both cases, the owner of the content (e.g., CNN, YouTube, or Zynga) and the organization serving it (e.g., Akamai, Limelight, or Amazon EC2) are decoupled, making it harder to understand the association between the content, the owner, and the host where the content resides. This has created a tangled world wide web that is very hard to unwind, impairing ISPs' and network administrators' capabilities to control the traffic flowing on the network. In this paper, we present DN-Hunter, a system that leverages the information provided by DNS traffic to discern the tangle. Parsing DNS queries, DN-Hunter tags traffic flows with the associated domain name. This association has several applications and reveals a large amount of useful information: (i) it provides fine-grained traffic visibility even when the traffic is encrypted (i.e., TLS/SSL flows), thus enabling more effective policy controls; (ii) it identifies flows even before the flows begin, thus providing superior network management capabilities to administrators; (iii) it understands and tracks (over time) the different CDNs and cloud providers that host content for a particular resource; (iv) it discerns all the services/content hosted by a given CDN or cloud provider in a particular geography and time; and (v) it provides insights into all applications/services running on any given layer-4 port number. We conduct extensive experimental analysis on real traffic traces, ranging from FTTH to 4G ISPs, and show results that support our hypothesis. Simply put, the information provided by DNS traffic is one of the key components required to unveil the tangled web and bring the capability of controlling the traffic back to the network carrier.
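
    The core mechanism the abstract describes, building an IP-to-domain map from observed DNS answers and using it to tag later flows before their first data packet, can be sketched roughly as follows. The record shapes, the fixed TTL, and the per-(client, server) keying are simplified assumptions; the paper's actual system additionally deals with timing, TTL churn, and scale.

```python
# Sketch of DNS-based flow tagging in the spirit of DN-Hunter: observed
# DNS answers build a (client_ip, server_ip) -> domain map, and later
# flows are labelled by looking up their server IP. Record shapes and
# TTL handling are simplified assumptions.
import time

class FlowTagger:
    def __init__(self):
        self._table = {}      # (client_ip, server_ip) -> (domain, expiry)

    def on_dns_response(self, client_ip, domain, answers, ttl=300):
        """Record every A-record answer observed for this client."""
        expiry = time.time() + ttl
        for server_ip in answers:
            self._table[(client_ip, server_ip)] = (domain, expiry)

    def tag_flow(self, client_ip, server_ip):
        """Return the domain for a new flow, or None if unknown/expired."""
        entry = self._table.get((client_ip, server_ip))
        if entry and entry[1] > time.time():
            return entry[0]
        return None

# The DNS answer arrives before the (possibly encrypted) flow it enables,
# so the flow can be tagged the moment it starts.
tagger = FlowTagger()
tagger.on_dns_response("10.0.0.5", "cdn.example.com", ["93.184.216.34"])
print(tagger.tag_flow("10.0.0.5", "93.184.216.34"))  # -> cdn.example.com
```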

    Quality of experience aware adaptive hypermedia system

    The research reported in this thesis proposes, designs and tests a novel Quality of Experience layer (QoE layer) for the classic Adaptive Hypermedia System (AHS) architecture. Its goal is to improve the end-user perceived Quality of Service in different operational environments suitable for residential users. While the AHS's main role of delivering personalised content is not altered, its functionality and performance are improved, and with them the user's satisfaction with the service provided. The QoE layer takes into account multiple factors that affect Quality of Experience (QoE), such as Web components and the network connection. It uses a novel Perceived Performance Model that takes into consideration a variety of performance metrics in order to learn about the characteristics of the Web user's operational environment, about changes in the network connection, and about the consequences of these changes for the user's quality of experience. This model also considers the user's subjective opinion about his/her QoE, increasing its effectiveness, and suggests strategies for tailoring Web content in order to improve QoE. The user-related information is modelled using a stereotype-based technique that makes use of probability and distribution theory. The QoE layer has been assessed through both simulations and qualitative evaluation in the educational area (mainly distance learning), with users interacting with the system in a low-bit-rate operational environment. The simulations assessed the “learning” and “adaptability” behaviour of the proposed layer over different and variable home connections while a learning task was performed. The correctness of the Perceived Performance Model (PPM) suggestions, the access time of the learning process and the quantity of transmitted data were analysed. The results show that the QoE layer significantly improves performance in terms of the access time of the learning process, with a reduction in the quantity of data sent achieved through image compression and/or elimination. A visual quality assessment confirmed that this reduction in image quality does not significantly affect the viewers' perceived quality, which remained close to the “good” perceptual level. For the qualitative evaluation, the QoE layer was deployed on the open-source AHA! system. The goal of this evaluation was to compare the learning outcome, system usability and user satisfaction when the AHA! and QoE-aware AHA systems were used. The assessment was performed in terms of learner achievement, learning performance and usability. The results indicate that the QoE-aware AHA system did not affect the learning outcome (the students had similar learning achievements), but the learning performance was improved in terms of study time. Most significantly, QoE-aware AHA provides an important improvement in system usability, as indicated by users' opinions about their satisfaction related to QoE.
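
    The thesis describes the adaptation mechanism only at a high level. The flavour of the content-tailoring strategy it mentions (image compression and/or elimination driven by perceived performance) might look like the sketch below, which picks the highest image-quality level whose estimated load time fits a perceptual budget. The quality ladder, the 10-second budget, and the throughput figure are invented assumptions, not the thesis's actual model.

```python
# Illustrative sketch of QoE-driven content tailoring: choose an image
# quality level so the estimated download time fits a perceptual budget.
# The ladder and the budget are invented assumptions.
QUALITY_LADDER = [
    ("original", 1.00),            # fraction of original image bytes kept
    ("compressed", 0.40),
    ("heavily_compressed", 0.15),
    ("images_removed", 0.00),
]

def choose_quality(page_bytes, image_bytes, throughput_bps, budget_s=10.0):
    """Return the highest quality whose estimated load time fits the budget."""
    base = page_bytes - image_bytes            # HTML/CSS always sent in full
    for label, fraction in QUALITY_LADDER:
        estimated_s = (base + image_bytes * fraction) * 8 / throughput_bps
        if estimated_s <= budget_s:
            return label
    return "images_removed"                    # last resort on slow links

# A 1 MB page, 800 KB of it images, over a 256 kbit/s home connection:
print(choose_quality(1_000_000, 800_000, 256_000))  # -> heavily_compressed
```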

    Transaction-filtering data mining and a predictive model for intelligent data management

    This thesis, first of all, proposes a new data mining paradigm (transaction-filtering association rule mining) that addresses the time consumption caused by repeated scans of the original transaction database in conventional association rule mining algorithms. An in-memory transaction filter is designed to discard infrequent items during the pruning steps; this filter is a data structure that is updated at the end of each iteration. Results based on an IBM benchmark show that an execution time reduction of 10% - 19% is achieved compared with the base case. Next, a data mining-based predictive model is established to contribute to intelligent data management within the context of the Centre for Grid Computing. The capability of discovering unseen rules, patterns and correlations makes data mining techniques favourable in areas where massive amounts of data are generated. The past behaviours of two typical scenarios (network file systems and Data Grids) have been analyzed to build the model. The future popularity of files can be forecast with an accuracy of 90% by deploying the above predictor on the given real system traces. A further step towards intelligent policy design is achieved by analyzing the prediction results for files' future popularity. Real system trace-based simulations show improvements of 2-4 times in data response time in the network file system scenario and a 24% mean job time reduction in Data Grids compared with conventional cases.
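
    A rough sketch of the transaction-filtering idea follows: keep an in-memory filter of items that can still be frequent and shrink every transaction after each pruning pass, so that later scans touch less data. The plain set used as the filter and the direct k-subset counting are simplifications standing in for the thesis's actual data structure and candidate generation.

```python
# Sketch of transaction-filtering for Apriori-style mining: after each
# pass, items that can no longer appear in a frequent itemset are
# dropped from the in-memory transactions. A plain set stands in for
# the thesis's filter structure; items are assumed to be strings.
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    transactions = [set(t) for t in transactions]   # in-memory working copy
    # Pass 1: count single items and build the first filter.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    results, k = [frozenset([i]) for i in frequent], 2
    while True:
        # Filter step: discard infrequent items from every transaction.
        transactions = [t & frequent for t in transactions]
        counts = Counter()
        for t in transactions:
            for cand in combinations(sorted(t), k):
                counts[frozenset(cand)] += 1
        survivors = {s for s, c in counts.items() if c >= min_support}
        if not survivors:
            return results
        results.extend(survivors)
        frequent = {i for s in survivors for i in s}  # next filter
        k += 1
```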

    Building high-performance web-caching servers


    Wiki-health: from quantified self to self-understanding

    Today, healthcare providers are experiencing explosive growth in data, and medical imaging represents a significant portion of that data. Meanwhile, the pervasive use of mobile phones and the rising adoption of sensing devices, which enable people to collect data independently at any time or place, are producing a torrent of sensor data. The scale and richness of the sensor data currently being collected and analysed are rapidly growing. The key challenge we face is how to effectively manage and make use of this abundance of easily generated and diverse health data. This thesis investigates the challenges posed by the explosive growth of available healthcare data and proposes a number of potential solutions. As a result, a big data service platform, named Wiki-Health, is presented to provide a unified solution for collecting, storing, tagging, retrieving, searching and analysing personal health sensor data. Additionally, it allows users to reuse and remix data, along with analysis results and analysis models, to make health-related knowledge discovery more available to individual users on a massive scale. To tackle the challenge of efficiently managing the high volume and diversity of big data, Wiki-Health introduces a hybrid data storage approach capable of storing structured, semi-structured and unstructured sensor data and sensor metadata separately. A multi-tier cloud storage system, CACSS, has been developed and serves as a component of the Wiki-Health platform, managing the storage of unstructured and semi-structured data such as medical imaging files. CACSS provides comprehensive features such as global data de-duplication, performance awareness and data caching services. The design of this hybrid approach gives Wiki-Health the potential to handle heterogeneous formats of sensor data. To evaluate the proposed approach, we have developed an ECG-based health monitoring service and a virtual sensing service on top of the Wiki-Health platform. The two services demonstrate the feasibility and potential of using the Wiki-Health framework to enable better utilisation and comprehension of the vast amounts of sensor data available from different sources, and both show significant potential for real-world applications.
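
    The hybrid-storage idea the abstract describes, routing structured readings, semi-structured metadata, and unstructured blobs to different backends behind one interface, can be sketched as below. The backend stubs and the type-based routing rule are illustrative assumptions; CACSS's real features (de-duplication, performance awareness, caching) are not modelled.

```python
# Sketch of a hybrid storage dispatcher in the spirit of Wiki-Health:
# one put() interface routes each record to a backend suited to its
# shape. Backends are stubbed with dicts; the routing rule is an
# illustrative assumption, not the platform's actual logic.
structured_store = {}       # e.g. a relational table of sensor readings
semi_structured_store = {}  # e.g. a document store for JSON metadata
blob_store = {}             # e.g. object storage for images/ECG files

def put(key, value):
    if isinstance(value, bytes):       # raw files: medical images, ECG dumps
        blob_store[key] = value
    elif isinstance(value, dict):      # JSON-like sensor metadata
        semi_structured_store[key] = value
    else:                              # fixed-schema numeric readings
        structured_store[key] = value

put("ecg/patient42/2024-01-01", b"\x00\x01")            # unstructured blob
put("meta/patient42", {"device": "ecg-v2", "hz": 250})  # semi-structured
put("reading/patient42/hr", 72)                         # structured
```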