85 research outputs found

    A Large Visual, Qualitative, and Quantitative Dataset for Web Intelligence Applications

    The Web is the communication platform and source of information par excellence. The volume and complexity of its content have grown enormously, and organizing, retrieving, and cleaning Web information has become a challenge for traditional techniques. Web intelligence is a novel research area that aims to improve Web-based services and applications using artificial intelligence and machine learning algorithms, for which large amounts of Web-related data are essential. Current datasets are, however, limited and do not combine visual representations and attributes of Web pages. Our work provides a large dataset of 49,438 Web pages, composed of webshots along with qualitative and quantitative attributes. The dataset covers all the countries of the world and a wide range of topics, such as art, entertainment, economics, business, education, government, news, media, science, and the environment, capturing different cultural characteristics and varied design preferences. We use this dataset to develop three Web intelligence applications: knowledge extraction on Web design using statistical analysis, recognition of error Web pages using a customized convolutional neural network (CNN) to eliminate invalid pages, and Web categorization based solely on screenshots using a CNN with transfer learning to assist search engines, indexers, and Web directories. This work has been funded by the grant awarded by the Central University of Ecuador through budget certification No. 34 of March 25, 2022, for the development of the research project with code DOCT-DI-2020-37.
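
    A short sketch can make the transfer-learning application concrete. Below is a minimal, hypothetical Python/PyTorch example of categorizing webshots with a pretrained CNN backbone; the backbone choice, folder layout, class count, and hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of screenshot categorization via CNN transfer
# learning. Paths, class count, and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_CLASSES = 10  # e.g. art, business, education, government, news, ...

# Start from an ImageNet-pretrained backbone and replace the classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False          # freeze the pretrained features
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new trainable head

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),       # webshots resized to the backbone input
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes webshots are stored one folder per category, e.g. webshots/train/news/
train_set = datasets.ImageFolder("webshots/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:            # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```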

    Autonomous Consolidation of Heterogeneous Record-Structured HTML Data in Chameleon

    While progress has been made in querying digital information contained in XML and HTML documents, success in retrieving information from the so-called hidden Web (data behind Web forms) has been modest. There has been a nascent trend of developing autonomous tools for extracting information from the hidden Web. Automatic tools for ontology generation, wrapper generation, Web form querying, response gathering, etc., have been reported in recent research. This thesis presents a system called Chameleon for automatic querying of, and response gathering from, the hidden Web. The approach to response gathering is based on automatic table structure identification: since most information repositories of the hidden Web are structured databases, the information returned in response to a query exhibits regularities. Information extraction from the identified record structures is performed using domain knowledge corresponding to the domain specified in the query. So-called domain plug-ins make the dynamically generated wrappers domain-specific, rather than document-specific as in conventional approaches.
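
    The abstract does not show Chameleon's internals, but the core idea of record-structure identification can be sketched: among a page's tables, pick the one whose rows repeat a dominant cell shape and treat each such row as a record. The toy Python example below (BeautifulSoup assumed) illustrates that general technique, not Chameleon's actual implementation.

```python
# Toy sketch of record-structure identification: find the HTML table whose
# rows repeat the same cell "shape" and emit each matching row as a record.
from collections import Counter

from bs4 import BeautifulSoup

def extract_records(html: str) -> list[list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    best_rows: list[list[str]] = []
    for table in soup.find_all("table"):
        rows = [[td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
                for tr in table.find_all("tr")]
        # A record-structured table repeats one dominant row shape
        # (same number of cells) many times.
        shapes = Counter(len(r) for r in rows if r)
        if not shapes:
            continue
        shape, count = shapes.most_common(1)[0]
        records = [r for r in rows if len(r) == shape]
        if count >= 3 and count > len(best_rows):   # heuristic threshold
            best_rows = records
    return best_rows

# A domain plug-in would then map cell positions to domain attributes,
# e.g. {"title": 0, "price": 2} for a shopping domain.
```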

    Extending information retrieval system model to improve interactive web searching.

    The research set out with the broad objective of developing new tools to support Web information searching. A survey showed that a substantial number of interactive search tools were being developed, but that little work had been done on how these new developments fitted into the general aim of helping people find information. As a result, it proved difficult to compare and analyse how tools help and affect users, and where they belong in a general scheme of information search tools. A key reason for the lack of better information searching tools was identified in the ill-suited nature of existing information retrieval system models. The traditional information retrieval model is extended by synthesising work in information retrieval and information seeking research. The purpose of this new holistic search model is to assist information system practitioners in identifying, hypothesising, designing and evaluating Web information searching tools. Using the model, a term relevance feedback tool called ‘Tag and Keyword’ (TKy) was developed in a Web browser, and it was hypothesised that it could improve query reformulation and reduce unnecessary browsing. The tool was evaluated in a laboratory experiment, and quantitative analysis showed statistically significant increases in query reformulation and reductions in Web browsing (per query). Subjects were interviewed after the experiment, and qualitative analysis revealed that they found the tool useful and time-saving. Interestingly, exploratory analysis of the collected data identified three different ways in which subjects had used the TKy tool. The research developed a holistic search model for Web searching and demonstrated that it can be used to hypothesise, design and evaluate information searching tools. Information system practitioners using it can better understand the context in which their search tools are developed and how these relate to users’ search processes and other search tools.
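
    The abstract does not specify how TKy turns tagged pages into keywords; a common way to implement term relevance feedback is to score terms by how much more often they occur in user-tagged (relevant) pages than in other viewed pages, and suggest the top scorers for query reformulation. The Python sketch below illustrates that generic approach under those assumptions; it is not TKy's code.

```python
# Generic term relevance feedback sketch: rank terms that are frequent in
# user-tagged (relevant) pages but rare in other browsed pages.
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]{3,}", text.lower())

def suggest_terms(relevant: list[str], others: list[str], k: int = 5) -> list[str]:
    # Document frequency of each term within each set of pages.
    rel_counts = Counter(t for doc in relevant for t in set(tokenize(doc)))
    oth_counts = Counter(t for doc in others for t in set(tokenize(doc)))
    scores = {
        term: rel_counts[term] / len(relevant)
              - oth_counts.get(term, 0) / max(len(others), 1)
        for term in rel_counts
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# Example: pages the user tagged as relevant vs. pages merely browsed.
tagged = ["holistic search model for web retrieval", "search model evaluation"]
browsed = ["online shopping deals", "web retrieval latency"]
print(suggest_terms(tagged, browsed))   # e.g. ['model', 'search', ...]
```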

    Developing an in house vulnerability scanner for detecting Template Injection, XSS, and DOM-XSS vulnerabilities

    Web applications are becoming an essential part of today's digital world. However, with the increase in the usage of web applications, security threats have also become more prevalent. Cyber attackers can exploit vulnerabilities in web applications to steal sensitive information or take control of the system. To prevent these attacks, web application security must be given due consideration. Existing vulnerability scanners fail to detect Template Injection, XSS, and DOM-XSS vulnerabilities effectively. To bridge this gap in web application security, a customized in-house scanner is needed to quickly and accurately identify these vulnerabilities, enhancing manual security assessments of web applications. This thesis focused on developing a modular and extensible vulnerability scanner to detect Template Injection, XSS, and DOM-based XSS vulnerabilities in web applications. Testing the scanner against other free and open-source solutions on the market showed that it outperformed them on Template Injection vulnerabilities and nearly all of them on XSS-type vulnerabilities. While the scanner has limitations, focusing on specific injection vulnerabilities can result in better performance.
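
    As an illustration of the kind of check such a scanner performs, the sketch below probes a parameter for server-side template injection by submitting expressions that common template engines evaluate, then looking for the evaluated result in the response. The probe set, URL, and parameter name are hypothetical; this is not the thesis scanner's code.

```python
# Minimal server-side template injection (SSTI) probe: inject an expression
# and check whether its evaluated result is reflected in the response.
import requests

PROBES = {
    "{{7*7}}": "49",      # Jinja2 / Twig style
    "${7*7}": "49",       # JSP EL / FreeMarker style
    "<%= 7*7 %>": "49",   # ERB style
}

def check_ssti(url: str, param: str) -> list[str]:
    findings = []
    for payload, marker in PROBES.items():
        resp = requests.get(url, params={param: payload}, timeout=10)
        # Reflection of the raw payload is harmless; the marker appearing
        # without the payload suggests the expression was evaluated.
        if marker in resp.text and payload not in resp.text:
            findings.append(payload)
    return findings

if __name__ == "__main__":
    # Hypothetical target; only scan systems you are authorized to test.
    print(check_ssti("http://testsite.example/search", "q"))
```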

    Link-based similarity search to fight web spam

    www.ilab.sztaki.hu/websearch
    We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances built for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, Companion, nearest neighbours in low-dimensional projections, and SimRank. We test our method on two datasets previously used to measure spam filtering algorithms.
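
    Co-citation, the simplest of the listed measures, is easy to illustrate: two pages are similar when many of the same pages link to both, and a page whose top co-cited neighbours are mostly known spam is itself suspect. The Python sketch below works through that idea on a made-up link graph; the graph and threshold are illustrative, not from the paper's datasets.

```python
# Co-citation similarity on a tiny, made-up link graph, with a simple
# spam heuristic over the similarity top list of an unknown page.
from collections import defaultdict
from itertools import combinations

links = {  # source page -> pages it links to
    "a": {"spam1", "spam2"},
    "b": {"spam1", "spam2", "unknown"},
    "c": {"spam2", "unknown"},
    "d": {"honest1"},
}

# Co-citation count: number of common in-linking pages for each target pair.
cocitation = defaultdict(int)
for src, targets in links.items():
    for u, v in combinations(sorted(targets), 2):
        cocitation[(u, v)] += 1

def top_similar(page: str, k: int = 3) -> list[tuple[str, int]]:
    scores = {(u if v == page else v): c
              for (u, v), c in cocitation.items() if page in (u, v)}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

known_spam = {"spam1", "spam2"}
neighbours = top_similar("unknown")
spam_ratio = sum(1 for p, _ in neighbours if p in known_spam) / max(len(neighbours), 1)
print(neighbours, spam_ratio)   # a high ratio suggests the page is spam
```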

    Mobile HTML5: Implementing a high-performance, platform-independent application

    In twenty years, the Web has become an integral part of our everyday lives. The rapid growth of the smartphone market has brought the Web from our home desks to wherever we are, and has enabled us to access this vast source of information at any time. However, the proliferation of mobile devices and platforms has raised new problems for application development. The growing number of platforms and their distinct native technologies make it hard to develop applications that can be accessed from all of these devices. The only element these platforms share is the browser, which is becoming the universal application platform. We can no longer afford to build applications for the silos and walled gardens of single platforms, and building cross-platform applications is essential in the modern mobile market. In this work, I introduce the HTML5 (Hypertext Markup Language, version 5) specification, as well as several related specifications and specification drafts for modern web development. I also present several tools and libraries for mobile web development. I implemented a mobile web application and a network utility library, and assessed the practical performance of the modern tools and APIs (Application Programming Interfaces). Finally, I present tools and techniques for performance optimization of mobile web applications.

    Web collaboration for software engineering

    Integrated Master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 200