Search CORE

45 research outputs found

Exploiting interclass rules for focused crawling

Author: Altingövde I.S.
Ulusoy Ö.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2004
Field of study

A baseline crawler was developed at the Bilkent University based on a focused-crawling approach. The focused crawler is an agent that targets a particular topic and visits and gathers only a relevant, narrow Web segment while trying not to waste resources on irrelevant materials. The rule-based Web-crawling approach uses linkage statistics among topics to improve a baseline focused crawler's harvest rate and coverage. The crawler also employs a canonical topic taxonomy to train a naïve-Bayesian classifier, which then helps determine the relevancy of crawled pages

Bilkent University Institutional Repository

Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning

Author: Altıngövde İsmail Sengör
Publication venue: Bilkent University
Publication date: 01/01/2009
Field of study

Ankara : The Department of Computer Engineering and the Instıtute of Engineering and Science of Bilkent University, 2009.Thesis (Ph. D.) -- Bilkent University, 2009.Includes bibliographical references leaves 157-169.Search engines are the primary means of retrieval for text data that is abundantly available on the Web. A standard search engine should carry out three fundamental tasks, namely; crawling the Web, indexing the crawled content, and finally processing the queries using the index. Devising efficient methods for these tasks is an important research topic. In this thesis, we introduce efficient strategies related to all three tasks involved in a search engine. Most of the proposed strategies are essentially applicable when a grouping of documents in its broadest sense (i.e., in terms of automatically obtained classes/clusters, or manually edited categories) is readily available or can be constructed in a feasible manner. Additionally, we also introduce static index pruning strategies that are based on the query views. For the crawling task, we propose a rule-based focused crawling strategy that exploits interclass rules among the document classes in a topic taxonomy. These rules capture the probability of having hyperlinks between two classes. The rulebased crawler can tunnel toward the on-topic pages by following a path of off-topic pages, and thus yields higher harvest rate for crawling on-topic pages. In the context of indexing and query processing tasks, we concentrate on conducting efficient search, again, using document groups; i.e., clusters or categories. In typical cluster-based retrieval (CBR), first, clusters that are most similar to a given free-text query are determined, and then documents from these clusters are selected to form the final ranked output. For efficient CBR, we first identify and evaluate some alternative query processing strategies. Next, we introduce a new index organization, so-called cluster-skipping inverted index structure (CS-IIS). It is shown that typical-CBR with CS-IIS outperforms previous CBR strategies (with an ordinary index) for a number of datasets and under varying search parameters. In this thesis, an enhanced version of CS-IIS is further proposed, in which all information to compute query-cluster similarities during query evaluation is stored. We introduce an incremental-CBR strategy that operates on top of this latter index structure, and demonstrate its search efficiency for different scenarios. Finally, we exploit query views that are obtained from the search engine query logs to tailor more effective static pruning techniques. This is also related to the indexing task involved in a search engine. In particular, query view approach is incorporated into a set of existing pruning strategies, as well as some new variants proposed by us. We show that query view based strategies significantly outperform the existing approaches in terms of the query output quality, for both disjunctive and conjunctive evaluation of queries.Altıngövde, İsmail SengörPh.D

Bilkent University Institutional Repository

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication venue
Publication date: 25/10/2013
Field of study

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Vulnerability Assessment of IPv6 Websites to SQL Injection and Other Application Level Attacks

Author: Jen-Yi Pan
Ying-Chiang Cho
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2013
Field of study

Given the proliferation of internet connected devices, IPv6 has been proposed to replace IPv4. Aside from providing a larger address space which can be assigned to internet enabled devices, it has been suggested that the IPv6 protocol offers increased security due to the fact that with the large number of addresses available, standard IP scanning attacks will no longer become feasible. However, given the interest in attacking organizations rather than individual devices, most initial points of entry onto an organization's network and their attendant devices are visible and reachable through web crawling techniques, and, therefore, attacks on the visible application layer may offer ways to compromise the overall network. In this evaluation, we provide a straightforward implementation of a web crawler in conjunction with a benign black box penetration testing system and analyze the ease at which SQL injection attacks can be carried out

Crossref

Directory of Open Access Journals

Assessing gross motor function, functional skills, and caregiver assistance in children with cerebral palsy (CP) and cerebral visual impairment (CVI)

Author: Salavati Abul
Publication venue: Rijksuniversiteit Groningen
Publication date: 01/01/2016
Field of study

Dissertations of the University of Groningen

The Age of Innocence: The First 25 Years of the National Collegiate Athletic Association, 1906 to 1931

Author: Carter W. Burlette
Publication venue: Scholarly Commons
Publication date: 01/01/2012
Field of study

The article traces the history of the most powerful body in amateur sports, the NCAA, discussing the regulation of amateur sports before it arose, the factors that led to its creation, early definitions of amateurism, key issues facing the early body, its promotion of University amateur sports as a training ground for soldiers during World War I, emerging conflicts among members, its treatment of collegiate segregation policies and campus neglect of women\u27s sports opportunities, and how past problems in amateur sports regulation were prologue for the issues facing intercollegiate athletics regulators and participants today

George Washington University Law School

Models and algorithms for parallel text retrieval

Author: Cambazoğlu Berkant Barla
Publication venue: Bilkent University
Publication date: 01/01/2006
Field of study

Cataloged from PDF version of article.In the last decade, search engines became an integral part of our lives. The current state-of-the-art in search engine technology relies on parallel text retrieval. Basically, a parallel text retrieval system is composed of three components: a crawler, an indexer, and a query processor. The crawler component aims to locate, fetch, and store the Web pages in a local document repository. The indexer component converts the stored, unstructured text into a queryable form, most often an inverted index. Finally, the query processing component performs the search over the indexed content. In this thesis, we present models and algorithms for efficient Web crawling and query processing. First, for parallel Web crawling, we propose a hybrid model that aims to minimize the communication overhead among the processors while balancing the number of page download requests and storage loads of processors. Second, we propose models for documentand term-based inverted index partitioning. In the document-based partitioning model, the number of disk accesses incurred during query processing is minimized while the posting storage is balanced. In the term-based partitioning model, the total amount of communication is minimized while, again, the posting storage is balanced. Finally, we develop and evaluate a large number of algorithms for query processing in ranking-based text retrieval systems. We test the proposed algorithms over our experimental parallel text retrieval system, Skynet, currently running on a 48-node PC cluster. In the thesis, we also discuss the design and implementation details of another, somewhat untraditional, grid-enabled search engine, SE4SEE. Among our practical work, we present the Harbinger text classification system, used in SE4SEE for Web page classification, and the K-PaToH hypergraph partitioning toolkit, to be used in the proposed models.Cambazoğlu, Berkant BarlaPh.D

Bilkent University Institutional Repository

Smartphone-based human activity recognition

Author: Reyes Ortiz Jorge Luis
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2014
Field of study

Cotutela Universitat Politècnica de Catalunya i Università degli Studi di GenovaHuman Activity Recognition (HAR) is a multidisciplinary research field that aims to gather data regarding people's behavior and their interaction with the environment in order to deliver valuable context-aware information. It has nowadays contributed to develop human-centered areas of study such as Ambient Intelligence and Ambient Assisted Living, which concentrate on the improvement of people's Quality of Life. The first stage to accomplish HAR requires to make observations from ambient or wearable sensor technologies. However, in the second case, the search for pervasive, unobtrusive, low-powered, and low-cost devices for achieving this challenging task still has not been fully addressed. In this thesis, we explore the use of smartphones as an alternative approach for performing the identification of physical activities. These self-contained devices, which are widely available in the market, are provided with embedded sensors, powerful computing capabilities and wireless communication technologies that make them highly suitable for this application. This work presents a series of contributions regarding the development of HAR systems with smartphones. In the first place we propose a fully operational system that recognizes in real-time six physical activities while also takes into account the effects of postural transitions that may occur between them. For achieving this, we cover some research topics from signal processing and feature selection of inertial data, to Machine Learning approaches for classification. We employ two sensors (the accelerometer and the gyroscope) for collecting inertial data. Their raw signals are the input of the system and are conditioned through filtering in order to reduce noise and allow the extraction of informative activity features. We also emphasize on the study of Support Vector Machines (SVMs), which are one of the state-of-the-art Machine Learning techniques for classification, and reformulate various of the standard multiclass linear and non-linear methods to find the best trade off between recognition performance, computational costs and energy requirements, which are essential aspects in battery-operated devices such as smartphones. In particular, we propose two multiclass SVMs for activity classification:one linear algorithm which allows to control over dimensionality reduction and system accuracy; and also a non-linear hardware-friendly algorithm that only uses fixed-point arithmetic in the prediction phase and enables a model complexity reduction while maintaining the system performance. The efficiency of the proposed system is verified through extensive experimentation over a HAR dataset which we have generated and made publicly available. It is composed of inertial data collected from a group of 30 participants which performed a set of common daily activities while carrying a smartphone as a wearable device. The results achieved in this research show that it is possible to perform HAR in real-time with a precision near 97\% with smartphones. In this way, we can employ the proposed methodology in several higher-level applications that require HAR such as ambulatory monitoring of the disabled and the elderly during periods above five days without the need of a battery recharge. Moreover, the proposed algorithms can be adapted to other commercial wearable devices recently introduced in the market (e.g. smartwatches, phablets, and glasses). This will open up new opportunities for developing practical and innovative HAR applications.El Reconocimiento de Actividades Humanas (RAH) es un campo de investigación multidisciplinario que busca recopilar información sobre el comportamiento de las personas y su interacción con el entorno con el propósito de ofrecer información contextual de alta significancia sobre las acciones que ellas realizan. Recientemente, el RAH ha contribuido en el desarrollo de áreas de estudio enfocadas a la mejora de la calidad de vida del hombre tales como: la inteligència ambiental (Ambient Intelligence) y la vida cotidiana asistida por el entorno para personas dependientes (Ambient Assisted Living). El primer paso para conseguir el RAH consiste en realizar observaciones mediante el uso de sensores fijos localizados en el ambiente, o bien portátiles incorporados de forma vestible en el cuerpo humano. Sin embargo, para el segundo caso, aún se dificulta encontrar dispositivos poco invasivos, de bajo consumo energético, que permitan ser llevados a cualquier lugar, y de bajo costo. En esta tesis, nosotros exploramos el uso de teléfonos móviles inteligentes (Smartphones) como una alternativa para el RAH. Estos dispositivos, de uso cotidiano y fácilmente asequibles en el mercado, están dotados de sensores embebidos, potentes capacidades de cómputo y diversas tecnologías de comunicación inalámbrica que los hacen apropiados para esta aplicación. Nuestro trabajo presenta una serie de contribuciones en relación al desarrollo de sistemas para el RAH con Smartphones. En primera instancia proponemos un sistema que permite la detección de seis actividades físicas en tiempo real y que, además, tiene en cuenta las transiciones posturales que puedan ocurrir entre ellas. Con este fin, hemos contribuido en distintos ámbitos que van desde el procesamiento de señales y la selección de características, hasta algoritmos de Aprendizaje Automático (AA). Nosotros utilizamos dos sensores inerciales (el acelerómetro y el giroscopio) para la captura de las señales de movimiento de los usuarios. Estas han de ser procesadas a través de técnicas de filtrado para la reducción de ruido, segmentación y obtención de características relevantes en la detección de actividad. También hacemos énfasis en el estudio de Máquinas de soporte vectorial (MSV) que son uno de los algoritmos de AA más usados en la actualidad. Para ello reformulamos varios de sus métodos estándar (lineales y no lineales) con el propósito de encontrar la mejor combinación de variables que garanticen un buen desempeño del sistema en cuanto a precisión, coste computacional y requerimientos de energía, los cuales son aspectos esenciales en dispositivos portátiles con suministro de energía mediante baterías. En concreto, proponemos dos MSV multiclase para la clasificación de actividad: un algoritmo lineal que permite el balance entre la reducción de la dimensionalidad y la precisión del sistema; y asimismo presentamos un algoritmo no lineal conveniente para dispositivos con limitaciones de hardware que solo utiliza aritmética de punto fijo en la fase de predicción y que permite reducir la complejidad del modelo de aprendizaje mientras mantiene el rendimiento del sistema. La eficacia del sistema propuesto es verificada a través de una experimentación extensiva sobre la base de datos RAH que hemos generado y hecho pública en la red. Esta contiene la información inercial obtenida de un grupo de 30 participantes que realizaron una serie de actividades de la vida cotidiana en un ambiente controlado mientras tenían sujeto a su cintura un smartphone que capturaba su movimiento. Los resultados obtenidos en esta investigación demuestran que es posible realizar el RAH en tiempo real con una precisión cercana al 97%. De esta manera, podemos emplear la metodología propuesta en aplicaciones de alto nivel que requieran el RAH tales como monitorizaciones ambulatorias para personas dependientes (ej. ancianos o discapacitados) durante periodos mayores a cinco días sin la necesidad de recarga de baterías.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Website-Klassifikation und Informationsextraktion aus Informationsseiten einer Firmenwebsite

Author: Lee Yeong Su
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 31/01/2008
Field of study

Digitale Hochschulschriften der LMU