4,031 research outputs found

    Adaptive content mapping for internet navigation

    Get PDF
    The Internet as the biggest human library ever assembled keeps on growing. Although all kinds of information carriers (e.g. audio/video/hybrid file formats) are available, text based documents dominate. It is estimated that about 80% of all information worldwide stored electronically exists in (or can be converted into) text form. More and more, all kinds of documents are generated by means of a text processing system and are therefore available electronically. Nowadays, many printed journals are also published online and may even discontinue to appear in print form tomorrow. This development has many convincing advantages: the documents are both available faster (cf. prepress services) and cheaper, they can be searched more easily, the physical storage only needs a fraction of the space previously necessary and the medium will not age. For most people, fast and easy access is the most interesting feature of the new age; computer-aided search for specific documents or Web pages becomes the basic tool for information-oriented work. But this tool has problems. The current keyword based search machines available on the Internet are not really appropriate for such a task; either there are (way) too many documents matching the specified keywords are presented or none at all. The problem lies in the fact that it is often very difficult to choose appropriate terms describing the desired topic in the first place. This contribution discusses the current state-of-the-art techniques in content-based searching (along with common visualization/browsing approaches) and proposes a particular adaptive solution for intuitive Internet document navigation, which not only enables the user to provide full texts instead of manually selected keywords (if available), but also allows him/her to explore the whole database

    Behaviour modelling with data obtained from the Internet and contributions to cluster validation

    Get PDF
    [EN]This PhD thesis makes contributions in modelling behaviours found in different types of data acquired from the Internet and in the field of clustering evaluation. Two different types of Internet data were processed, on the one hand, internet traffic with the objective of attack detection and on the other hand, web surfing activity with the objective of web personalization, both data being of sequential nature. To this aim, machine learning techniques were applied, mostly unsupervised techniques. Moreover, contributions were made in cluster evaluation, in order to make easier the selection of the best partition in clustering problems. With regard to network attack detection, first, gureKDDCup database was generated which adds payload data to KDDCup99 connection attributes because it is essential to detect non-flood attacks. Then, by modelling this data a network Intrusion Detection System (nIDS) was proposed where context-independent payload processing was done obtaining satisfying detection rates. In the web mining context web surfing activity was modelled for web personalization. In this context, generic and non-invasive systems to extract knowledge were proposed just using the information stored in webserver log files. Contributions were done in two senses: in problem detection and in link suggestion. In the first application a meaningful list of navigation attributes was proposed for each user session to group and detect different navigation profiles. In the latter, a general and non-invasive link suggestion system was proposed which was evaluated with satisfactory results in a link prediction context. With regard to the analysis of Cluster Validity Indices (CVI), the most extensive CVI comparison found up to a moment was carried out using a partition similarity measure based evaluation methodology. Moreover, we analysed the behaviour of CVIs in a real web mining application with elevated number of clusters in which they tend to be unstable. We proposed a procedure which automatically selects the best partition analysing the slope of different CVI values.[EU]Doktorego-tesi honek internetetik eskuratutako datu mota ezberdinetan aurkitutako portaeren modelugintzan eta multzokatzeen ebaluazioan egiten ditu bere ekarpenak. Zehazki, bi mota ezberdinetako interneteko datuak prozesatu dira: batetik, interneteko trafikoa, erasoak hautemateko helburuarekin; eta bestetik, web nabigazioen jarduera, weba pertsonalizatzeko helburuarekin; bi datu motak izaera sekuentzialekoak direlarik. Helburu hauek lortzeko, ikasketa automatikoko teknikak aplikatu dira, nagusiki gainbegiratu-gabeko teknikak. Testuinguru honetan, multzokatzeen partizio onenaren aukeraketak dakartzan arazoak gutxitzeko multzokatzeen ebaluazioan ere ekarpenak egin dira. Sareko erasoen hautemateari dagokionez, lehenik gureKDDCup datubasea eratu da KDDCup99-ko konexio atributuei payload-ak (sareko paketeen datu eremuak) gehituz, izan ere, ez-flood erasoak (pakete gutxi erabiltzen dituzten erasoak) hautemateko ezinbestekoak baitira. Ondoren, datu hauek modelatuz testuinguruarekiko independenteak diren payload prozesaketak oinarri dituen sareko erasoak hautemateko sistema (network Intrusion Detection System (nIDS)) bat proposatu da maila oneko eraso hautemate-tasak lortuz. Web meatzaritzaren testuinguruan, weba pertsonalizatzeko helburuarekin web nabigazioen jarduera modelatu da. Honetarako, web zerbizarietako lorratz fitxategietan metatutako informazioa soilik erabiliz ezagutza erabilgarria erauziko duen sistema orokor eta ez-inbasiboak proposatu dira. Ekarpenak bi zentzutan eginaz: arazoen hautematean eta esteken iradokitzean. Lehen aplikazioan sesioen nabigazioa adierazteko atributu esanguratsuen zerrenda bat proposatu da, gero nabigazioak multzokatu eta nabigazio profil ezberdinak hautemateko. Bigarren aplikazioan, estekak iradokitzeko sistema orokor eta ez-inbasibo bat proposatu da, eta berau, estekak aurresateko testuinguruan ebaluatu da emaitza onak lortuz. Multzokatzeak balioztatzeko indizeen (Cluster Validity Indices (CVI)) azterketari dagokionez, gaurdaino aurkitu den CVI-en konparaketa zabalena burutu da partizioen antzekotasun neurrian oinarritutako ebaluazio metodologia erabiliz. Gainera, CVI-en portaera aztertu da egiazko web meatzaritza aplikazio batean normalean baino multzo kopuru handiagoak dituena, non CVI-ek ezegonkorrak izateko joera baitute. Arazo honi aurre eginaz, CVI ezberdinek partizio ezberdinetarako lortzen dituzten balioen maldak aztertuz automatikoki partiziorik onena hautatzen duen prozedura proposatu da.[ES]Esta tesis doctoral hace contribuciones en el modelado de comportamientos encontrados en diferentes tipos de datos adquiridos desde internet y en el campo de la evaluación del clustering. Dos tipos de datos de internet han sido procesados: en primer lugar el tráfico de internet con el objetivo de detectar ataques; y en segundo lugar la actividad generada por los usuarios web con el objetivo de personalizar la web; siendo los dos tipos de datos de naturaleza secuencial. Para este fin, se han aplicado técnicas de aprendizaje automático, principalmente técnicas no-supervisadas. Además, se han hecho aportaciones en la evaluación de particiones de clusters para facilitar la selección de la mejor partición de clusters. Respecto a la detección de ataques en la red, primero, se generó la base de datos gureKDDCup que añade el payload (la parte de contenido de los paquetes de la red) a los atributos de la conexión de KDDCup99 porque el payload es esencial para la detección de ataques no-flood (ataques que utilizan pocos paquetes). Después, se propuso un sistema de detección de intrusos (network Intrusion Detection System (IDS)) modelando los datos de gureKDDCup donde se propusieron varios preprocesos del payload independientes del contexto obteniendo resultados satisfactorios. En el contexto de la minerı́a web, se ha modelado la actividad de la navegación web para la personalización web. En este contexto se propondrán sistemas genéricos y no-invasivos para la extracción del conocimiento, utilizando únicamente la información almacenada en los ficheros log de los servidores web. Se han hecho aportaciones en dos sentidos: en la detección de problemas y en la sugerencia de links. En la primera aplicación, se propuso una lista de atributos significativos para representar las sesiones de navegación web para después agruparlos y detectar diferentes perfiles de navegación. En la segunda aplicación, se propuso un sistema general y no-invasivo para sugerir links y se evaluó en el contexto de predicción de links con resultados satisfactorios. Respecto al análisis de ı́ndices de validación de clusters (Cluster Validity Indices (CVI)), se ha realizado la más amplia comparación encontrada hasta el momento que utiliza la metodologı́a de evaluación basada en medidas de similitud de particiones. Además, se ha analizado el comportamiento de los CVIs en una aplicación real de minerı́a web con un número elevado de clusters, contexto en el que los CVIs tienden a ser inestables, ası́ que se propuso un procedimiento para la selección automática de la mejor partición en base a la pendiente de los valores de diferentes CVIs.Grant of the Basque Government (ref.: BFI08.226); Grant of Ministry of Economy and Competitiveness of the Spanish Government (ref.: BES-2011-045989); Research stay grant of Spanish Ministry of Economy and Competitiveness (ref.: EEBB-I-14-08862); University of the Basque Country UPV/EHU (BAILab, grant UFI11/45); Department of Education, Universities and Research of the Basque Government (grant IT-395-10); Ministry of Economy and Competitiveness of the Spanish Government and by the European Regional Development Fund - ERDF (eGovernAbility, grant TIN2014-52665-C2-1-R)

    XML Matchers: approaches and challenges

    Full text link
    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.Comment: 34 pages, 8 tables, 7 figure

    Analysis Of DNA Motifs In The Human Genome

    Full text link
    DNA motifs include repeat elements, promoter elements and gene regulator elements, and play a critical role in the human genome. This thesis describes a genome-wide computational study on two groups of motifs: tandem repeats and core promoter elements. Tandem repeats in DNA sequences are extremely relevant in biological phenomena and diagnostic tools. Computational programs that discover tandem repeats generate a huge volume of data, which can be difficult to decipher without further organization. A new method is presented here to organize and rank detected tandem repeats through clustering and classification. Our work presents multiple ways of expressing tandem repeats using the n-gram model with different clustering distance measures. Analysis of the clusters for the tandem repeats in the human genome shows that the method yields a well-defined grouping in which similarity among repeats is apparent. Our new, alignment-free method facilitates the analysis of the myriad of tandem repeats replete in the human genome. We believe that this work will lead to new discoveries on the roles, origins, and significance of tandem repeats. As with tandem repeats, promoter sequences of genes contain binding sites for proteins that play critical roles in mediating expression levels. Promoter region binding proteins and their co-factors influence timing and context of transcription. Despite the critical regulatory role of these non-coding sequences, computational methods to identify and predict DNA binding sites are extremely limited. The work reported here analyzes the relative occurrence of core promoter elements (CPEs) in and around transcription start sites. We found that out of all the data sets 49\%-63\% upstream regions have either TATA box or DPE elements. Our results suggest the possibility of predicting transcription start sites through combining CPEs signals with other promoter signals such as CpG islands and clusters of specific transcription binding sites
    corecore