1,412 research outputs found
WAQS : a web-based approximate query system
The Web is often viewed as a gigantic database holding vast stores of information and provides ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language being the Structural Query Language is not suitable for the Web content retrieval.
In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow a detailed retrieval and hence reduce the amount of matches returned by typical search engines. The main objective of this technique is to allow the query to be based on not just keywords but also the location of the keywords within the logical structure of a document. In addition, the technique also provides approximate search capabilities based on the notion of Distance and Variable Length Don\u27t Cares. The proposed techniques have been implemented in a system, called Web-Based Approximate Query System, which contains an SQL-like query language called Web-Based Approximate Query Language.
Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain specific search engine. It provides EnviroDaemon with more detailed searching capabilities than just keyword-based search. Implementation details, technical results and future work are presented in this dissertation
DART: the distributed agent based retrieval toolkit
The technology of search engines is evolving from indexing and classification of web resources based on keywords to more sophisticated techniques which take into account the meaning and the context of textual information and usage. Replying to query, commercial search engines face the user requests with a large amount of results, mostly useless or only partially related to the request; the subsequent refinement, operated downloading and examining as much pages as possible and simply ignoring whatever stays behind the first few pages, is left up to the user.
Furthermore, architectures based on centralized indexes, allow commercial search engines to control the advertisement of online information, in contrast to P2P architectures that focus the attention on user requirements involving the end
user in search engine maintenance and operation. To address such wishes, new search engines should focus on three key aspects: semantics, geo-referencing, collaboration/distribution. Semantic analysis lets to increase the results
relevance. The geo-referencing of catalogued resources allows contextualisation based on user position. Collaboration distributes storage, processing, and trust on a world-wide network of nodes running on users’ computers, getting rid of bottlenecks and central points of failures. In this paper, we describe the studies, the concepts and the solutions developed in the DART project to introduce these three key features in a novel search engine architecture
A collaborative, semantic and context-aware search engine
Search engines help people to find information in the largest public knowledge system of the world: the Web. Unfortunately its size makes very complex to discover the right information. The users are faced lots of useless results forcing them to select one by one the most suitable. The new generation of search engines evolve from keyword-based indexing and classification to more sophisticated techniques considering the
meaning, the context and the usage of information. We argue about the three key aspects: collaboration, geo-referencing and semantics. Collaboration distributes storage, processing and trust on a world-wide network of nodes running on users’ computers, getting rid of bottlenecks and central points of failures. The
geo-referencing of catalogued resources allows contextualisation based on user position. Semantic analysis lets to increase the results relevance. In this paper, we expose the studies, the concepts and the solutions of a research project to introduce these three key features in a novel search engine architecture.213-21
Hiding in Plain Sight: A Longitudinal Study of Combosquatting Abuse
Domain squatting is a common adversarial practice where attackers register
domain names that are purposefully similar to popular domains. In this work, we
study a specific type of domain squatting called "combosquatting," in which
attackers register domains that combine a popular trademark with one or more
phrases (e.g., betterfacebook[.]com, youtube-live[.]com). We perform the first
large-scale, empirical study of combosquatting by analyzing more than 468
billion DNS records---collected from passive and active DNS data sources over
almost six years. We find that almost 60% of abusive combosquatting domains
live for more than 1,000 days, and even worse, we observe increased activity
associated with combosquatting year over year. Moreover, we show that
combosquatting is used to perform a spectrum of different types of abuse
including phishing, social engineering, affiliate abuse, trademark abuse, and
even advanced persistent threats. Our results suggest that combosquatting is a
real problem that requires increased scrutiny by the security community.Comment: ACM CCS 1
A Bandwidth-Conserving Architecture for Crawling Virtual Worlds
A virtual world is a computer-based simulated environment intended for its users to inhabit via avatars. Content in virtual worlds such as Second Life or OpenSimulator is increasingly presented using three-dimensional (3D) dynamic presentation technologies that challenge traditional search technologies. As 3D environments become both more prevalent and more fragmented, the need for a data crawler and distributed search service will continue to grow. By increasing the visibility of content across virtual world servers in order to better collect and integrate the 3D data we can also improve the crawling and searching efficiency and accuracy by avoiding crawling unchanged regions or downloading unmodified objects that already exist in our collection. This will help to save bandwidth resources and Internet traffic during the content collection and indexing and, for a fixed amount of bandwidth, maximize the freshness of the collection. This work presents a new services paradigm for virtual world crawler interaction that is co-operative and exploits information about 3D objects in the virtual world. Our approach supports analyzing redundant information crawled from virtual worlds in order to decrease the amount of data collected by crawlers, keep search engine collections up to date, and provide an efficient mechanism for collecting and searching information from multiple virtual worlds. Experimental results with data crawled from Second Life servers demonstrate that our approach provides the ability to save crawling bandwidth consumption, to explore more hidden objects and new regions to be crawled that facilitate the search service in virtual worlds
Recommended from our members
The Corpus Expansion Toolkit: finding what we want on the web
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis and the development of the CET is to provide a solution to discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collected, store, filter and perform information extraction on collected documents. Using a small number of example ‘seed’ documents the CET is able to expand the original corpus by finding more relevant documents from the web and provide a number of tools to support their analysis. This thesis presents a case study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies covering a wide variety of domains and requirements that the CET has been applied. These case studies hope to illustrate three main use cases, listed below, where the CET is applicable:
1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown
First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but where to find relevant documents on the web is unknown. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic and iteratively build a clear conception of a domain and translate that to a computational system
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provided a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques allow
to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We discuss also the potential of cross-fertilization, i.e., on the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.Comment: Knowledge-based System
- …