9 research outputs found

    Scaling Causality Analysis for Production Systems.

    Causality analysis reveals how program values influence each other. It is important for debugging, optimizing, and understanding the execution of programs. This thesis scales causality analysis to production systems consisting of desktop and server applications as well as large-scale Internet services. This enables developers to employ causality analysis to debug and optimize complex, modern software systems. This thesis shows that it is possible to scale causality analysis both to fine-grained, instruction-level analysis and to analysis of Internet-scale distributed systems with thousands of discrete software components, by developing and employing automated methods to observe and reason about causality. First, we observe causality at a fine-grained instruction level by developing the first taint tracking framework to support tracking millions of input sources. We also introduce flexible taint tracking to allow for scoping different queries and dynamic filtering of inputs, outputs, and relationships. Next, we introduce the Mystery Machine, which uses a "big data" approach to discover causal relationships between software components in a large-scale Internet service. We leverage the fact that large-scale Internet services receive a large number of requests in order to observe counterexamples to hypothesized causal relationships. Using the discovered causal relationships, we identify the critical path for request execution and use critical path analysis to explore potential scheduling optimizations. Finally, we explore using causality to make data-quality tradeoffs in Internet services. A data-quality tradeoff is an explicit decision by a software component to return lower-fidelity data in order to improve response time or minimize resource usage. We perform a study of data-quality tradeoffs in a large-scale Internet service to show the pervasiveness of these tradeoffs. We develop DQBarge, a system that enables better data-quality tradeoffs by propagating critical information along the causal path of request processing. Our evaluation shows that DQBarge helps Internet services mitigate load spikes, improve utilization of spare resources, and implement dynamic capacity planning.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135888/1/mcchow_1.pd
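
    As a rough illustration of the counterexample-elimination idea behind the Mystery Machine described above, the sketch below starts from the hypothesis that every pair of observed events is causally ordered and discards any ordering contradicted by at least one trace; the events, timestamps, and trace format are invented for illustration and are not the system's actual implementation.

```python
from itertools import permutations

# Each trace is a list of (event, start_time) pairs observed for one request.
# The events and timestamps below are purely illustrative.
traces = [
    [("parse", 0), ("fetch_a", 5), ("fetch_b", 6), ("render", 20)],
    [("parse", 0), ("fetch_b", 4), ("fetch_a", 9), ("render", 25)],
]

events = {e for trace in traces for e, _ in trace}

# Hypothesize that every ordered pair (a, b) satisfies "a happens before b" ...
hypotheses = set(permutations(events, 2))

# ... then discard any hypothesis for which some trace is a counterexample.
for trace in traces:
    seen_at = dict(trace)
    for a, b in list(hypotheses):
        if a in seen_at and b in seen_at and seen_at[a] >= seen_at[b]:
            hypotheses.discard((a, b))

# Whatever survives every trace is the inferred happens-before relation,
# which could then feed a critical-path analysis over each request.
print(sorted(hypotheses))
```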

    Information Retrieval Based on DOM Trees

    For several years, the amount of information available on the Web has been growing exponentially. Every day, a huge amount of data is generated and made immediately available on the Web. Indexers and crawlers browse the Web daily to find the new information that has been added and make it available to answer users' search queries. However, the amount of information is so huge that it must be preprocessed. Given that users are only interested in the relevant information, it is not necessary for indexers and crawlers to process other boilerplate, redundant, or useless elements of web pages. Processing such irrelevant elements leads to an unnecessary waste of resources, such as storage space, runtime, bandwidth, etc. Different studies have shown that between 40% and 50% of the data on the Web are noisy elements. For this reason, several techniques focused on the detection of both relevant and irrelevant data have been developed over the last 20 years. The problems of identifying the relevant content of a web page, its template, its menu, etc. can be faced in various ways, and for this reason there exist completely different techniques to address them. This thesis is focused on the development of information retrieval techniques based on DOM trees. Its goal is to detect different parts of a web page, such as the main content, the template, and the main menu. Most of the existing techniques are focused on the detection of text inside the main content of web pages, mainly by removing the template of the web page or by inferring the main content. The techniques proposed in this thesis not only extract text by eliminating the template or inferring the main content, but also extract any other relevant information from web pages, such as images, animations, videos, etc. Our techniques are useful not only for indexers and crawlers but also for users browsing the Web. For instance, in the case of users with functional diversity (such as blindness), removing noisy elements can make it easier for them to read (or listen to) web pages. To make the techniques broadly accessible, we have implemented them as browser extensions compatible with Mozilla-based and Chromium-based browsers. In addition, these tools are publicly available, so any interested person can access them and continue with the research if they wish to do so.
    Alarte Aleixandre, J. (2023). Information Retrieval Based on DOM Trees [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19667
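
    As a toy illustration of the DOM-tree approach described above (not the thesis's actual algorithms), the sketch below scores container nodes of a DOM tree by a simple text-density heuristic and takes the best-scoring subtree as the candidate main content; the node structure and the scoring rule are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DomNode:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def text_length(node):
    return len(node.text) + sum(text_length(c) for c in node.children)

def node_count(node):
    return 1 + sum(node_count(c) for c in node.children)

def main_content(root):
    # Consider only container nodes and pick the one with the best
    # text-to-node ratio: a crude stand-in for real content-detection heuristics.
    containers = [n for n in iter_nodes(root) if n.children]
    return max(containers, key=lambda n: text_length(n) / node_count(n))

# Illustrative page: template-like navigation plus an article holding most of the text.
page = DomNode("body", children=[
    DomNode("nav", children=[DomNode("a", "Home"), DomNode("a", "About")]),
    DomNode("article", children=[
        DomNode("p", "A long paragraph carrying the actual relevant content of the page."),
        DomNode("p", "Another long paragraph that also belongs to the main content."),
    ]),
])

print(main_content(page).tag)  # "article" under this toy density score
```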

    AH 2003 : workshop on adaptive hypermedia and adaptive web-based systems

    Search Interfaces on the Web: Querying and Characterizing

    Current-day web search engines (e.g., Google) do not crawl and index a significant portion of the Web and, hence, web users relying on search engines alone are unable to discover and access a large amount of information from the non-indexable part of the Web. Specifically, dynamic pages generated based on parameters provided by a user via web search forms (or search interfaces) are not indexed by search engines and cannot be found in search results. Such search interfaces provide web users with online access to myriads of databases on the Web. In order to obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results, a set of dynamic pages that embed the required information from a database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary object of study is a huge portion of the Web (hereafter referred to as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterization of the deep Web, finding and classifying deep web resources, and querying web databases.
    Characterizing the deep Web: Though the term deep Web was coined in 2000, which is a long time ago for any web-related concept or technology, we still do not know many important characteristics of the deep Web. Another concern is that existing surveys of the deep Web are predominantly based on studies of deep web sites in English. One can therefore expect that findings from these surveys may be biased, especially owing to a steady increase in non-English web content. In this way, surveying national segments of the deep Web is of interest not only to national communities but to the whole web community as well. In this thesis, we propose two new methods for estimating the main parameters of the deep Web. We use the suggested methods to estimate the scale of one specific national segment of the Web and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from the national segment of the Web.
    Finding deep web resources: The deep Web has been growing at a very fast pace. It has been estimated that there are hundreds of thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been significant interest in approaches that allow users and computer applications to leverage this information. Most approaches have assumed that search interfaces to web databases of interest are already discovered and known to query systems. However, such assumptions do not generally hold, mostly because of the large scale of the deep Web: for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. Specifically, the I-Crawler is intentionally designed to be used in deep Web characterization studies and for constructing directories of deep web resources. Unlike almost all existing approaches to the deep Web, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms.
    Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user, all the more so as the interfaces of conventional search engines are also web forms. At present, a user needs to manually provide input values to search interfaces and then extract the required data from the result pages. Manually filling out forms is cumbersome and infeasible for complex queries, yet such queries are essential for many web searches, especially in e-commerce. Thus, automating the querying and retrieval of data behind search interfaces is desirable and essential for tasks such as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and extracting and integrating information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts, and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store the results of form queries. Finally, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and component design.
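
    As a hedged illustration of what a data model for search interfaces and form queries might look like (the field names, example endpoint, and mapping are invented and are not the thesis's actual I-Crawler or query-language design), the sketch below represents a search interface as labelled fields and turns label-based input into the GET request a form submission would issue.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

@dataclass
class FormField:
    name: str           # parameter name the form submits
    label: str          # human-readable label extracted from the page
    default: str = ""

@dataclass
class SearchInterface:
    action_url: str     # endpoint the form submits its query to
    fields: list

def build_query(interface, values):
    """Map label-based user input onto the GET request a form submission would issue."""
    params = {f.name: values.get(f.label, f.default) for f in interface.fields}
    return interface.action_url + "?" + urlencode(params)

# Purely illustrative description of a hypothetical book-search interface.
books = SearchInterface(
    action_url="http://example.com/search",
    fields=[FormField("q", "Title"), FormField("yr", "Year", "any")],
)

print(build_query(books, {"Title": "deep web"}))
# http://example.com/search?q=deep+web&yr=any
```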

    Practice and Evaluation of Pagelet-Based Client-Side Rendering Mechanism

    The rendering mechanism plays an indispensable role in browser-based Web applications. It generates active webpages dynamically and provides human-readable layout through template engines, which are used as a standard programming model to separate business logic and data computation from webpage presentation. Owing to the advances of rich application technologies, the client-side rendering mechanism has been widely adopted. The adoption of client-side rendering brings not only various merits but also new problems. In this paper, we propose and construct “pagelet”, a segment-based template engine for developing flexible and extensible Web applications. By presenting the principles, practice, and usage experience of pagelet, we conduct a comprehensive analysis of the possible advantages and disadvantages brought by the client-side rendering mechanism from the viewpoints of both developers and end-users.
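
    As a rough sketch of the segment-based idea rather than the paper's pagelet engine, the example below splits a page into independently rendered segments, each combining its own template fragment with its own data, so a client could fill segments in separately; the templates and data are placeholders.

```python
from string import Template

# Each pagelet pairs a template fragment with the data needed to render it.
# The fragments and data here are illustrative placeholders.
pagelets = {
    "header": (Template("<header>$title</header>"), {"title": "My Site"}),
    "feed": (Template("<ul><li>$first</li><li>$second</li></ul>"),
             {"first": "Post A", "second": "Post B"}),
}

def render_pagelet(name):
    template, data = pagelets[name]
    return template.substitute(data)

def render_page(layout, names):
    # The layout only marks where each segment goes; every segment renders
    # independently, so a client could re-render one segment without touching the rest.
    return layout.format(**{name: render_pagelet(name) for name in names})

layout = "<body>{header}{feed}</body>"
print(render_page(layout, ["header", "feed"]))
```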