5 research outputs found

    Crawling Deep Web using a GA-based set covering algorithm

    An ever-increasing amount of information on the web today is available only through search interfaces: users have to type a set of keywords into a search form in order to access the pages of certain web sites. These pages are often referred to as the Hidden Web or the Deep Web. According to recent studies, the content provided by hidden web sites is often of very high quality and can be extremely valuable to many users. This calls for deep web crawlers to excavate the data so that it can be reused, indexed, and searched in an integrated environment. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. It often requires selecting an appropriate set of queries so that they cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied in graph theory. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. In particular, most set covering algorithms do not take into account the distribution of the elements being covered, whereas in deep web crawling both the sizes of the documents and the document frequencies of the queries follow a power-law distribution. This thesis introduces a new GA-based algorithm that targets deep web crawling of databases with this power-law distribution. Experiments show that it outperforms the straightforward greedy algorithm previously introduced in the literature.
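
    The greedy baseline mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the query names, document IDs, and unit costs are hypothetical, and each query is modeled simply as the set of documents it returns.

```python
# Minimal sketch of the greedy set-covering baseline for query selection.
# Queries and their result sets are hypothetical; a real crawler would
# estimate result sets by probing the search interface.

def greedy_query_selection(queries, documents, cost=None):
    """Pick queries until all documents are covered, always choosing the
    query with the best (newly covered documents) / cost ratio."""
    cost = cost or {q: 1.0 for q in queries}           # unit cost by default
    uncovered = set(documents)
    selected = []
    while uncovered:
        # Query that covers the most still-uncovered documents per unit cost.
        best = max(queries, key=lambda q: len(queries[q] & uncovered) / cost[q])
        newly_covered = queries[best] & uncovered
        if not newly_covered:                          # remaining docs are unreachable
            break
        selected.append(best)
        uncovered -= newly_covered
    return selected

# Hypothetical toy data: query -> set of document IDs it returns.
queries = {
    "java":    {1, 2, 3, 4},
    "python":  {3, 4, 5},
    "crawler": {5, 6},
    "web":     {1, 6, 7},
}
print(greedy_query_selection(queries, documents=range(1, 8)))  # ['java', 'crawler', 'web']
```

    A GA-based variant would instead evolve a population of candidate query subsets, with a fitness function that trades coverage against cost and can weight documents by their power-law-distributed sizes and frequencies.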

    Ontology learning for the semantic deep web

    Ontologies could play an important role in assisting users in their search for Web pages. This dissertation considers the problem of constructing natural ontologies that support users in their Web search efforts and increase the number of relevant Web pages that are returned. To achieve this goal, the thesis suggests combining Deep Web information, which consists of dynamically generated Web pages that cannot be indexed by existing automated Web crawlers, with ontologies, resulting in the Semantic Deep Web. The Deep Web information is exploited in three different ways: extracting attributes from Deep Web data sources automatically, generating domain ontologies from the Deep Web automatically, and extracting instances from the Deep Web to enhance the domain ontologies. Several algorithms for these tasks are presented. Experimental results suggest that the proposed methods help users find more relevant Web sites. Another contribution of this dissertation is a methodology for evaluating existing general-purpose ontologies using the Web as a corpus. The quality of ontologies (QoO) is quantified by analyzing existing ontologies to obtain numeric measures of how natural their concepts and relationships are. This methodology was first applied to several major, popular ontologies, such as WordNet, OpenCyc, and the UMLS; subsequently, the domain ontologies developed in this research were evaluated from the same naturalness perspective.
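
    One plausible reading of such a web-as-corpus naturalness measure is a pointwise-mutual-information style score over search-engine hit counts. The sketch below is an assumption about the general approach, not the dissertation's actual QoO metric; WEB_SIZE and all hit counts are hypothetical placeholders.

```python
import math

# Hedged sketch of a web-as-corpus "naturalness" score for a concept pair,
# in the spirit of the QoO measures described above (not the actual metric).
# Hit counts would come from a search engine; here they are hypothetical.

WEB_SIZE = 1e10   # assumed total number of indexed pages (rough placeholder)

def naturalness(hits_a, hits_b, hits_ab):
    """PMI-style score: how much more often two concepts co-occur on the
    Web than chance would predict. Higher means a more 'natural' relation."""
    p_a, p_b = hits_a / WEB_SIZE, hits_b / WEB_SIZE
    p_ab = hits_ab / WEB_SIZE
    if p_ab == 0:
        return float("-inf")       # never co-occur: relation looks unnatural
    return math.log(p_ab / (p_a * p_b))

# Hypothetical counts for a related pair vs. an unrelated pair of concepts.
print(naturalness(5e8, 8e8, 2e8))  # related pair -> large positive score
print(naturalness(5e8, 3e7, 1e3))  # unrelated pair -> strongly negative score
```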

    Internet anonymization technologies

    The internet is an indispensable and central part of modern life: individuals use it to find information, read news, communicate, shop, and use e-government services. In recent years, concern about the anonymity of users on the internet has also grown, as users largely want to prevent abuse and various forms of harassment. All this points to an increased need for anonymisation tools and techniques. One such tool is the open-source Tor software, which is currently the most popular among users and provides a high level of anonymity. As part of this master's thesis, we conducted a survey questionnaire and found that anonymity on the internet is a very important property for users and that they use anonymity techniques primarily to protect their personal data and increase their security. To this end, we built a plug-in for Google Chrome that warns users while they browse about inadequate browser settings related to their privacy.

    Taming web data: exploiting linked data for integrating medical educational content

    Open data are playing a vital role in different communities, including government, business, and education. This revolution has had a high impact on the education field. Recently, new practices known as "Linked Data" have been adopted for publishing and connecting data on the web; they are used to expose and connect data that were not previously linked. In the context of education, applying Linked Data practices to the growing amount of open data used for learning is potentially highly beneficial. The work presented in this thesis tackles the challenges of acquiring and integrating data from distributed web data sources into one linked dataset. The application domain is medical education, and the focus is on bridging the gap between articles published in online educational libraries and content published on Web 2.0 platforms that can be used for education. Integrating such a collection of heterogeneous resources means creating links between data collected from distributed web data sources. To address these challenges, a system is proposed that exploits Linked Data practices to build a metadata schema in RDF/XML format for describing resources and to enrich it with external datasets that add semantics to the metadata. The proposed system collects resources from distributed data sources on the web and enriches their metadata with concepts from biomedical ontologies, such as SNOMED CT, that enable linking. The result of building this system is a linked dataset of more than 10,000 resources collected from the PubMed library, YouTube channels, and blogging platforms. The effectiveness of the proposed system is evaluated by validating the content of the linked dataset when it is accessed and retrieved. Ontology-based techniques have been developed for browsing and querying the resulting linked dataset, and experiments have been conducted to simulate users' access to it and validate its content. The results were promising and show the effectiveness of using SNOMED CT to integrate distributed resources from diverse web data sources.
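
    As an illustration of the kind of SNOMED CT enrichment described above, the following sketch builds a tiny RDF description of one educational resource and links it to a SNOMED CT concept URI using the rdflib library. It is a hedged example, not the system's actual schema: the mededu namespace, property choices, and resource URI are hypothetical (the SNOMED CT code 44054006, "Diabetes mellitus type 2", is real).

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

# Hypothetical namespaces: the real system's schema may differ.
SCHEMA = Namespace("http://example.org/mededu/")
SNOMED = Namespace("http://snomed.info/id/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("mededu", SCHEMA)

# One resource harvested from a distributed source (e.g. a PubMed article).
resource = URIRef("http://example.org/mededu/resource/12345")
g.add((resource, RDF.type, SCHEMA.EducationalResource))
g.add((resource, DCTERMS.title, Literal("Managing type 2 diabetes")))
g.add((resource, DCTERMS.source, Literal("PubMed")))

# Enrichment: link the resource to a SNOMED CT concept so that resources
# from different sources that share a concept become connected.
g.add((resource, DCTERMS.subject, SNOMED["44054006"]))

print(g.serialize(format="xml"))   # RDF/XML, matching the format named above
```

    Serializing the graph as RDF/XML mirrors the metadata format described above; resources from PubMed, YouTube, and blogs that share the same SNOMED CT subject become linked through that concept URI.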