5 research outputs found

    Crawling Deep Web using a GA-based set covering algorithm

    An ever-increasing amount of information on the web today is available only through search interfaces: users have to type a set of keywords into a search form in order to access the pages of certain web sites. These pages are often referred to as the Hidden Web or the Deep Web. According to recent studies, the content provided by hidden web sites is often of very high quality and can be extremely valuable to many users. This calls for deep web crawlers to excavate the data so that it can be reused, indexed, and searched in an integrated environment. Crawling the deep web is the process of collecting data from search interfaces by issuing queries. It often requires selecting an appropriate set of queries so that they cover most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied in graph theory. Conventional set covering algorithms, however, do not work well when applied to deep web crawling because of several special features of this application domain. In particular, most set covering algorithms do not take into account the distribution of the elements being covered, whereas in deep web crawling both the sizes of the documents and the document frequencies of the queries follow a power-law distribution. This thesis introduces a new GA-based algorithm that targets deep web crawling of databases with this power-law distribution. Experiments show that it outperforms the straightforward greedy algorithm previously introduced in the literature.
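
    The greedy baseline mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the query names, document IDs, and unit costs are hypothetical, and each query is modeled simply as the set of documents it returns.

```python
# Minimal sketch of the greedy set-covering baseline for query selection.
# Queries and their result sets are hypothetical; a real crawler would
# estimate result sets by probing the search interface.

def greedy_query_selection(queries, documents, cost=None):
    """Pick queries until all documents are covered, always choosing the
    query with the best (newly covered documents) / cost ratio."""
    cost = cost or {q: 1.0 for q in queries}           # unit cost by default
    uncovered = set(documents)
    selected = []
    while uncovered:
        # Query that covers the most still-uncovered documents per unit cost.
        best = max(queries, key=lambda q: len(queries[q] & uncovered) / cost[q])
        newly_covered = queries[best] & uncovered
        if not newly_covered:                          # remaining docs are unreachable
            break
        selected.append(best)
        uncovered -= newly_covered
    return selected

# Hypothetical toy data: query -> set of document IDs it returns.
queries = {
    "java":    {1, 2, 3, 4},
    "python":  {3, 4, 5},
    "crawler": {5, 6},
    "web":     {1, 6, 7},
}
print(greedy_query_selection(queries, documents=range(1, 8)))  # ['java', 'crawler', 'web']
```

    A GA-based variant would instead evolve a population of candidate query subsets, with a fitness function that trades coverage against cost and can weight documents by their power-law-distributed sizes and frequencies.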

    Ontology learning for the semantic deep web

    Ontologies could play an important role in assisting users in their search for Web pages. This dissertation considers the problem of constructing natural ontologies that support users in their Web search efforts and increase the number of relevant Web pages that are returned. To achieve this goal, the thesis suggests combining Deep Web information, which consists of dynamically generated Web pages that cannot be indexed by existing automated Web crawlers, with ontologies, resulting in the Semantic Deep Web. The Deep Web information is exploited in three different ways: extracting attributes from Deep Web data sources automatically, generating domain ontologies from the Deep Web automatically, and extracting instances from the Deep Web to enhance the domain ontologies. Several algorithms for these tasks are presented. Experimental results suggest that the proposed methods help users find more relevant Web sites. Another contribution of this dissertation is a methodology for evaluating existing general-purpose ontologies using the Web as a corpus. The quality of ontologies (QoO) is quantified by analyzing existing ontologies to obtain numeric measures of how natural their concepts and relationships are. This methodology was first applied to several major, popular ontologies, such as WordNet, OpenCyc, and the UMLS; subsequently, the domain ontologies developed in this research were evaluated from the same naturalness perspective.
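
    One plausible reading of such a web-as-corpus naturalness measure is a pointwise-mutual-information style score over search-engine hit counts. The sketch below is an assumption about the general approach, not the dissertation's actual QoO metric; WEB_SIZE and all hit counts are hypothetical placeholders.

```python
import math

# Hedged sketch of a web-as-corpus "naturalness" score for a concept pair,
# in the spirit of the QoO measures described above (not the actual metric).
# Hit counts would come from a search engine; here they are hypothetical.

WEB_SIZE = 1e10   # assumed total number of indexed pages (rough placeholder)

def naturalness(hits_a, hits_b, hits_ab):
    """PMI-style score: how much more often two concepts co-occur on the
    Web than chance would predict. Higher means a more 'natural' relation."""
    p_a, p_b = hits_a / WEB_SIZE, hits_b / WEB_SIZE
    p_ab = hits_ab / WEB_SIZE
    if p_ab == 0:
        return float("-inf")       # never co-occur: relation looks unnatural
    return math.log(p_ab / (p_a * p_b))

# Hypothetical counts for a related pair vs. an unrelated pair of concepts.
print(naturalness(5e8, 8e8, 2e8))  # related pair -> large positive score
print(naturalness(5e8, 3e7, 1e3))  # unrelated pair -> strongly negative score
```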

    Internet anonymization technologies

    The internet is an indispensable and central part of modern life: individuals use it to find information, read news, communicate, shop, and use e-government services. In recent years, concern about the anonymity of users on the internet has also grown, as users largely want to prevent abuse and various forms of harassment. All this points to an increased need for anonymisation tools and techniques. One such tool is the open-source Tor software, which is currently the most popular among users and provides a high level of anonymity. As part of this master's thesis, we conducted a survey questionnaire and found that anonymity on the internet is a very important property for users and that they use anonymity techniques primarily to protect their personal data and increase their security. To this end, we built a plug-in for Google Chrome that warns users while they browse about inadequate browser settings related to their privacy.

    Taming web data: exploiting linked data for integrating medical educational content

    Open data are playing a vital role in different communities, including government, business, and education. This revolution has had a high impact on the education field. Recently, new practices known as "Linked Data" have been adopted for publishing and connecting data on the web; they are used to expose and connect data that were not previously linked. In the context of education, applying Linked Data practices to the growing amount of open data used for learning is potentially highly beneficial. The work presented in this thesis tackles the challenges of acquiring and integrating data from distributed web data sources into one linked dataset. The application domain is medical education, and the focus is on bridging the gap between articles published in online educational libraries and content published on Web 2.0 platforms that can be used for education. Integrating such a collection of heterogeneous resources means creating links between data collected from distributed web data sources. To address these challenges, a system is proposed that exploits Linked Data practices to build a metadata schema in RDF/XML format for describing resources and to enrich it with external datasets that add semantics to the metadata. The proposed system collects resources from distributed data sources on the web and enriches their metadata with concepts from biomedical ontologies, such as SNOMED CT, that enable linking. The result of building this system is a linked dataset of more than 10,000 resources collected from the PubMed library, YouTube channels, and blogging platforms. The effectiveness of the proposed system is evaluated by validating the content of the linked dataset when it is accessed and retrieved. Ontology-based techniques have been developed for browsing and querying the resulting linked dataset, and experiments have been conducted to simulate users' access to it and validate its content. The results were promising and show the effectiveness of using SNOMED CT to integrate distributed resources from diverse web data sources.
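
    As an illustration of the kind of SNOMED CT enrichment described above, the following sketch builds a tiny RDF description of one educational resource and links it to a SNOMED CT concept URI using the rdflib library. It is a hedged example, not the system's actual schema: the mededu namespace, property choices, and resource URI are hypothetical (the SNOMED CT code 44054006, "Diabetes mellitus type 2", is real).

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

# Hypothetical namespaces: the real system's schema may differ.
SCHEMA = Namespace("http://example.org/mededu/")
SNOMED = Namespace("http://snomed.info/id/")

g = Graph()
g.bind("dcterms", DCTERMS)
g.bind("mededu", SCHEMA)

# One resource harvested from a distributed source (e.g. a PubMed article).
resource = URIRef("http://example.org/mededu/resource/12345")
g.add((resource, RDF.type, SCHEMA.EducationalResource))
g.add((resource, DCTERMS.title, Literal("Managing type 2 diabetes")))
g.add((resource, DCTERMS.source, Literal("PubMed")))

# Enrichment: link the resource to a SNOMED CT concept so that resources
# from different sources that share a concept become connected.
g.add((resource, DCTERMS.subject, SNOMED["44054006"]))

print(g.serialize(format="xml"))   # RDF/XML, matching the format named above
```

    Serializing the graph as RDF/XML mirrors the metadata format described above; resources from PubMed, YouTube, and blogs that share the same SNOMED CT subject become linked through that concept URI.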