45 research outputs found

    Exploiting interclass rules for focused crawling

    A baseline crawler was developed at Bilkent University based on a focused-crawling approach. A focused crawler is an agent that targets a particular topic, visiting and gathering only a relevant, narrow segment of the Web while trying not to waste resources on irrelevant material. The rule-based Web-crawling approach uses linkage statistics among topics to improve the baseline focused crawler's harvest rate and coverage. The crawler also employs a canonical topic taxonomy to train a naïve Bayesian classifier, which then helps determine the relevance of crawled pages.
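
The abstract does not give the scoring formula, so the following is only a minimal sketch of how interclass linkage statistics could prioritize a crawl frontier. The `RULES` table, the class names, the URLs, and the `link_score` and `enqueue` helpers are all hypothetical; the max-over-paths scoring is an assumption, not the authors' method.

```python
import heapq

# Hypothetical interclass rules: RULES[src][dst] is an estimated probability
# that a page classified as `src` links to a page classified as `dst`.
RULES = {
    "sports": {"sports": 0.7, "news": 0.3},
    "news": {"sports": 0.2, "news": 0.5, "cycling": 0.3},
    "cycling": {"cycling": 0.8, "sports": 0.2},
}

def link_score(src_class, target, rules, hops=2):
    """Score the chance of reaching `target` from a `src_class` page within `hops` links."""
    direct = rules.get(src_class, {}).get(target, 0.0)
    if hops <= 1:
        return direct
    # Tunneling (assumed form): allow a path through one intermediate class.
    via = max(
        (p * link_score(mid, target, rules, hops - 1)
         for mid, p in rules.get(src_class, {}).items() if mid != target),
        default=0.0,
    )
    return max(direct, via)

def enqueue(frontier, url, src_class, target, rules):
    """Push a URL with priority derived from the class of the page that linked to it."""
    # heapq is a min-heap, so the score is negated to pop the best link first.
    heapq.heappush(frontier, (-link_score(src_class, target, rules), url))

frontier = []
enqueue(frontier, "http://example.org/a", "sports", "cycling", RULES)
enqueue(frontier, "http://example.org/b", "news", "cycling", RULES)
enqueue(frontier, "http://example.org/c", "cycling", "cycling", RULES)
order = [heapq.heappop(frontier)[1] for _ in range(3)]
```

In this toy table a "sports" page never links directly to "cycling", yet its links still receive a nonzero score through "news", which is the kind of off-topic path a rule-based crawler can tunnel through.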

    Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph.D.) -- Bilkent University, 2009. Includes bibliographical references, leaves 157-169. Search engines are the primary means of retrieval for text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks: crawling the Web, indexing the crawled content, and processing queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks. Most of the proposed strategies are applicable when a grouping of documents in its broadest sense (i.e., automatically obtained classes or clusters, or manually edited categories) is readily available or can be constructed feasibly. We also introduce static index pruning strategies that are based on query views. For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rule-based crawler can tunnel toward on-topic pages by following a path of off-topic pages, and thus yields a higher harvest rate for on-topic pages. For the indexing and query processing tasks, we concentrate on efficient search, again using document groups, i.e., clusters or categories. In typical cluster-based retrieval (CBR), the clusters most similar to a given free-text query are determined first, and documents from these clusters are then selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies.
Next, we introduce a new index organization, the cluster-skipping inverted index structure (CS-IIS). We show that typical CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters. We further propose an enhanced version of CS-IIS that stores all the information needed to compute query-cluster similarities during query evaluation. We introduce an incremental-CBR strategy that operates on top of this latter index structure and demonstrate its search efficiency in different scenarios. Finally, we exploit query views obtained from search engine query logs to tailor more effective static pruning techniques, which also relates to the indexing task. In particular, the query view approach is incorporated into a set of existing pruning strategies, as well as some new variants that we propose. We show that query-view-based strategies significantly outperform the existing approaches in terms of query output quality, for both disjunctive and conjunctive evaluation of queries. Altıngövde, İsmail Sengör. Ph.D.
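
The abstract names the cluster-skipping index but does not describe its layout. As a rough sketch under the assumption that each term's postings are grouped into per-cluster blocks, query evaluation can skip every block whose cluster was not selected. The `INDEX` contents, the `search` helper, and the term-frequency scoring are hypothetical simplifications.

```python
from collections import defaultdict

# Assumed layout: term -> list of (cluster_id, [(doc_id, term_frequency), ...]) blocks.
INDEX = {
    "crawler": [(0, [(1, 2), (4, 1)]), (2, [(9, 3)])],
    "index": [(0, [(1, 1)]), (1, [(5, 2)]), (2, [(9, 1), (10, 4)])],
}

def search(query_terms, best_clusters, index):
    """Score documents using only the posting blocks of the selected clusters."""
    scores = defaultdict(int)
    for term in query_terms:
        for cluster_id, postings in index.get(term, []):
            if cluster_id not in best_clusters:
                continue  # skip the whole block without reading its postings
            for doc_id, tf in postings:
                scores[doc_id] += tf  # toy scoring: sum of term frequencies
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

results = search(["crawler", "index"], {2}, INDEX)
```

Here only cluster 2 is selected, so the blocks for clusters 0 and 1 are never traversed. An on-disk structure would store skip pointers between blocks rather than Python lists.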

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Vulnerability Assessment of IPv6 Websites to SQL Injection and Other Application Level Attacks

    Given the proliferation of internet-connected devices, IPv6 has been proposed to replace IPv4. Aside from providing a larger address space that can be assigned to internet-enabled devices, the IPv6 protocol has been suggested to offer increased security because, with so many addresses available, standard IP scanning attacks will no longer be feasible. However, given the interest in attacking organizations rather than individual devices, most initial points of entry onto an organization's network and its attendant devices are visible and reachable through web crawling techniques, and attacks on the visible application layer may therefore offer ways to compromise the overall network. In this evaluation, we provide a straightforward implementation of a web crawler in conjunction with a benign black-box penetration testing system and analyze the ease with which SQL injection attacks can be carried out.

    The Age of Innocence: The First 25 Years of the National Collegiate Athletic Association, 1906 to 1931

    The article traces the history of the most powerful body in amateur sports, the NCAA, discussing the regulation of amateur sports before it arose, the factors that led to its creation, early definitions of amateurism, key issues facing the early body, its promotion of University amateur sports as a training ground for soldiers during World War I, emerging conflicts among members, its treatment of collegiate segregation policies and campus neglect of women's sports opportunities, and how past problems in amateur sports regulation were prologue for the issues facing intercollegiate athletics regulators and participants today.

    Models and algorithms for parallel text retrieval

    Cataloged from PDF version of article. In the last decade, search engines became an integral part of our lives. The current state of the art in search engine technology relies on parallel text retrieval. A parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and the storage loads of processors. Second, we propose models for document- and term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms on our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models. Cambazoğlu, Berkant Barla. Ph.D.
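
To illustrate the difference between the two partitioning schemes mentioned in this abstract, here is a minimal sketch that splits a toy inverted index across `k` processors. The round-robin assignment and the helper names are assumptions made for illustration only; the thesis proposes partitioning models that balance storage and minimize disk accesses or communication, which this sketch does not attempt.

```python
def partition_by_document(index, k):
    """Document-based: each processor holds the postings of its own documents for every term."""
    parts = [dict() for _ in range(k)]
    for term, docs in index.items():
        for doc_id in docs:
            # Naive assumption: documents are assigned round-robin by id.
            parts[doc_id % k].setdefault(term, []).append(doc_id)
    return parts

def partition_by_term(index, k):
    """Term-based: each term's complete posting list lives on exactly one processor."""
    parts = [dict() for _ in range(k)]
    for i, term in enumerate(sorted(index)):
        # Naive assumption: terms are assigned round-robin in alphabetical order.
        parts[i % k][term] = list(index[term])
    return parts

# Toy inverted index: term -> sorted list of document ids (hypothetical data).
TOY_INDEX = {"crawl": [0, 1, 3], "index": [1, 2], "query": [0, 2, 3]}
doc_parts = partition_by_document(TOY_INDEX, 2)
term_parts = partition_by_term(TOY_INDEX, 2)
```

With document-based partitioning every processor may be involved in answering a query, each over a shorter list. With term-based partitioning only the processors owning the query terms are involved, but partial results must be communicated between them.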

    Smartphone-based human activity recognition

    Joint supervision (cotutelle): Universitat Politècnica de Catalunya and Università degli Studi di Genova. Human Activity Recognition (HAR) is a multidisciplinary research field that aims to gather data regarding people's behavior and their interaction with the environment in order to deliver valuable context-aware information. It has contributed to the development of human-centered areas of study such as Ambient Intelligence and Ambient Assisted Living, which concentrate on improving people's quality of life. The first stage of HAR requires making observations with ambient or wearable sensor technologies. In the second case, however, the search for pervasive, unobtrusive, low-power, and low-cost devices for this challenging task has still not been fully addressed. In this thesis, we explore the use of smartphones as an alternative approach for identifying physical activities. These self-contained devices, which are widely available on the market, are equipped with embedded sensors, powerful computing capabilities, and wireless communication technologies that make them highly suitable for this application. This work presents a series of contributions regarding the development of HAR systems with smartphones. First, we propose a fully operational system that recognizes six physical activities in real time while also taking into account the effects of postural transitions that may occur between them. To achieve this, we cover research topics ranging from signal processing and feature selection of inertial data to Machine Learning approaches for classification. We employ two sensors (the accelerometer and the gyroscope) for collecting inertial data. Their raw signals are the input of the system and are conditioned through filtering in order to reduce noise and allow the extraction of informative activity features.
We also emphasize the study of Support Vector Machines (SVMs), which are one of the state-of-the-art Machine Learning techniques for classification, and reformulate several of the standard multiclass linear and non-linear methods to find the best trade-off between recognition performance, computational cost, and energy requirements, which are essential aspects in battery-operated devices such as smartphones. In particular, we propose two multiclass SVMs for activity classification: a linear algorithm that allows control over dimensionality reduction and system accuracy; and a non-linear, hardware-friendly algorithm that uses only fixed-point arithmetic in the prediction phase and enables a reduction in model complexity while maintaining system performance. The efficiency of the proposed system is verified through extensive experimentation on a HAR dataset that we have generated and made publicly available. It is composed of inertial data collected from a group of 30 participants who performed a set of common daily activities while carrying a smartphone as a wearable device. The results achieved in this research show that it is possible to perform HAR in real time with a precision near 97% with smartphones. In this way, the proposed methodology can be employed in several higher-level applications that require HAR, such as ambulatory monitoring of the disabled and the elderly for periods above five days without the need for a battery recharge. Moreover, the proposed algorithms can be adapted to other commercial wearable devices recently introduced on the market (e.g., smartwatches, phablets, and glasses).
This will open up new opportunities for developing practical and innovative HAR applications.
Human Activity Recognition (HAR) is a multidisciplinary research field that seeks to gather information about people's behavior and their interaction with the environment in order to offer highly meaningful contextual information about the actions they perform. Recently, HAR has contributed to the development of fields of study focused on improving quality of life, such as Ambient Intelligence and Ambient Assisted Living for dependent people. The first step in achieving HAR is to make observations using fixed sensors located in the environment or portable sensors worn on the body. In the second case, however, it is still difficult to find devices that are unobtrusive, low-power, low-cost, and able to be carried anywhere. In this thesis, we explore the use of smartphones as an alternative for HAR. These everyday devices, readily available on the market, come with embedded sensors, powerful computing capabilities, and various wireless communication technologies that make them suitable for this application. Our work presents a series of contributions to the development of smartphone-based HAR systems. First, we propose a system that detects six physical activities in real time and also takes into account the postural transitions that may occur between them. To this end, we have contributed in areas ranging from signal processing and feature selection to Machine Learning (ML) algorithms.
We use two inertial sensors (the accelerometer and the gyroscope) to capture users' motion signals. These are processed with filtering techniques for noise reduction, segmentation, and extraction of features relevant to activity detection. We also emphasize the study of Support Vector Machines (SVMs), one of the most widely used ML algorithms today. We reformulate several of their standard methods (linear and non-linear) in order to find the best combination of variables that ensures good system performance in terms of precision, computational cost, and energy requirements, which are essential aspects in portable battery-powered devices. Specifically, we propose two multiclass SVMs for activity classification: a linear algorithm that allows a trade-off between dimensionality reduction and system precision; and a non-linear algorithm suited to hardware-constrained devices that uses only fixed-point arithmetic in the prediction phase and reduces the complexity of the learned model while maintaining system performance. The effectiveness of the proposed system is verified through extensive experimentation on the HAR dataset that we have generated and made publicly available online. It contains inertial data obtained from a group of 30 participants who performed a series of daily-life activities in a controlled environment while wearing a waist-mounted smartphone that captured their motion. The results obtained in this research show that HAR can be performed in real time with a precision close to 97%.
In this way, the proposed methodology can be employed in high-level applications that require HAR, such as ambulatory monitoring of dependent people (e.g., the elderly or disabled) for periods longer than five days without the need to recharge batteries.
Postprint (published version)

    Website-Klassifikation und Informationsextraktion aus Informationsseiten einer Firmenwebsite
