WAQS : a web-based approximate query system
The Web is often viewed as a gigantic database holding vast stores of information and providing ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and in the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard Structured Query Language (SQL) is not suitable for Web content retrieval.
In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow detailed retrieval and hence reduce the number of matches returned by typical search engines. Its main objective is to allow a query to be based not just on keywords but also on the location of those keywords within the logical structure of a document. In addition, the technique provides approximate search capabilities based on the notions of Distance and Variable-Length Don't Cares. The proposed techniques have been implemented in a system, called the Web-Based Approximate Query System, which contains an SQL-like query language called the Web-Based Approximate Query Language.
Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine. It provides EnviroDaemon with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results, and future work are presented in this dissertation.
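The approximate matching described above rests on the notion of Variable-Length Don't Cares. As a minimal illustrative sketch (the dissertation's own pattern syntax and Distance measure are not shown here, so the `*`/`?` syntax below is an assumption), a VLDC pattern can be translated into an anchored regular expression:

```python
import re

def vldc_match(pattern, text):
    """Variable-Length Don't Care matching sketch: '*' in the pattern
    matches any (possibly empty) substring, '?' any single character.
    The pattern is translated into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == '*':
            parts.append('.*')     # variable-length don't care
        elif ch == '?':
            parts.append('.')      # single-character don't care
        else:
            parts.append(re.escape(ch))
    return re.fullmatch(''.join(parts), text, re.S) is not None
```

For example, `vldc_match('ab*cd', 'abXYZcd')` holds because `*` absorbs the substring `XYZ`.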
Independent task assignment for heterogeneous systems
Ankara : The Department of Computer Engineering and the Graduate School of Engineering and Science of Bilkent University, 2013. Thesis (Ph. D.) -- Bilkent University, 2013. Includes bibliographical references, leaves 136-150.
We study the problem of assigning nonuniform tasks onto heterogeneous systems.
We investigate two distinct problems in this context. The first problem is the
one-dimensional partitioning of nonuniform workload arrays with optimal load
balancing. The second problem is the assignment of nonuniform independent
tasks onto heterogeneous systems.
For one-dimensional partitioning of nonuniform workload arrays, we investigate
two cases: chain-on-chain partitioning (CCP), where the order of the processors
is specified, and chain partitioning (CP), where processor permutation
is allowed. We present polynomial-time algorithms to solve the CCP problem
optimally, while we prove that the CP problem is NP-complete. Our empirical
studies show that our proposed exact algorithms for the CCP problem produce
substantially better results than the state-of-the-art heuristics while the solution
times remain comparable.
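The thesis presents its own exact CCP algorithms; as a minimal illustrative sketch (not the thesis's algorithm, and restricted to the homogeneous case), chain-on-chain partitioning of a workload array into K contiguous parts can be solved by binary search over the bottleneck value with a greedy feasibility probe:

```python
def probe(w, k, b):
    """Greedy feasibility check: can workload array w be split into at
    most k contiguous parts, each with sum <= b?"""
    parts, load = 1, 0
    for x in w:
        if x > b:
            return False
        if load + x > b:
            parts += 1          # start a new part
            load = x
        else:
            load += x
    return parts <= k

def ccp_bottleneck(w, k):
    """Binary search on the optimal bottleneck (integer workloads)."""
    lo, hi = max(w), sum(w)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(w, k, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For `w = [1..9]` and `k = 3`, the optimal contiguous split is {1..5}, {6, 7}, {8, 9}, giving a bottleneck of 17.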
For the independent task assignment problem, we investigate improving the
performance of the well-known and widely used constructive heuristics MinMin,
MaxMin and Sufferage. All three heuristics are known to run in O(KN²) time in
assigning N tasks to K processors. In this thesis, we present our work on an algorithmic
improvement that asymptotically decreases the running time complexity
of MinMin to O(KN log N) without affecting its solution quality. Furthermore,
we combine the newly proposed MinMin algorithm with MaxMin as well as Sufferage,
obtaining two hybrid algorithms. The motivation behind the former hybrid
algorithm is to address the drawback of MaxMin in solving problem instances
with highly skewed cost distributions while also improving the running time performance
of MaxMin. The latter hybrid algorithm improves the running time
performance of Sufferage without degrading its solution quality. The proposed
algorithms are easy to implement, and we illustrate them through detailed pseudocode.
The experimental results over a large number of real-life datasets show
that the proposed fast MinMin algorithm and the proposed hybrid algorithms
perform significantly better than their traditional counterparts as well as more
recent state-of-the-art assignment heuristics. For the large datasets used in the
experiments, MinMin, MaxMin, and Sufferage, as well as recent state-of-the-art
heuristics, require days, weeks, or even months to produce a solution, whereas all
of the proposed algorithms produce solutions within only two or three minutes.
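For reference, the classic O(KN²) MinMin heuristic that the thesis accelerates can be sketched as follows (a plain Python sketch; the `cost[i][j]` cost-matrix layout is an assumption):

```python
def min_min(cost):
    """Classic MinMin: repeatedly pick the (task, processor) pair with the
    minimum completion time and assign it. cost[i][j] is the execution
    time of task i on processor j. Runs in O(K * N^2)."""
    n, k = len(cost), len(cost[0])
    ready = [0.0] * k              # current load of each processor
    unassigned = set(range(n))
    assign = [None] * n
    while unassigned:
        best = None
        for i in unassigned:
            for j in range(k):
                ct = ready[j] + cost[i][j]   # completion time
                if best is None or ct < best[0]:
                    best = (ct, i, j)
        ct, i, j = best
        assign[i] = j
        ready[j] = ct
        unassigned.remove(i)
    return assign, max(ready)
```

Each of the N rounds scans all remaining task-processor pairs, which is exactly the quadratic cost the thesis's O(KN log N) variant avoids.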
For the independent task assignment problem, we also investigate adopting
the multi-level framework which was successfully utilized in several applications
including graph and hypergraph partitioning. For the coarsening phase of the
multi-level framework, we present an efficient matching algorithm which runs in
O(KN) time in most cases. For the uncoarsening phase, we present two refinement
algorithms: an efficient O(KN)-time move-based refinement and an efficient
O(K²N log N)-time swap-based refinement. Our results indicate that the multi-level
approach improves the quality of task assignments, while also improving the running
time performance, especially for large datasets.
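A move-based refinement pass in the spirit of the O(KN)-time refinement mentioned above can be sketched as follows (an illustrative single-pass version, not the thesis's algorithm):

```python
def refine_moves(cost, assign):
    """One pass of move-based refinement: for each task, move it to
    another processor if the larger of the two affected loads shrinks.
    cost[i][j] is the cost of task i on processor j; assign[i] is the
    current processor of task i. One pass costs O(K * N)."""
    k = len(cost[0])
    load = [0.0] * k
    for i, j in enumerate(assign):
        load[j] += cost[i][j]
    improved = False
    for i, j in enumerate(assign):
        for t in range(k):
            if t == j:
                continue
            # the move helps if it lowers the local bottleneck
            if max(load[j] - cost[i][j], load[t] + cost[i][t]) < max(load[j], load[t]):
                load[j] -= cost[i][j]
                load[t] += cost[i][t]
                assign[i] = t
                j = t
                improved = True
    return assign, max(load), improved
```

In practice such passes are repeated until no move improves the makespan.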
As a realistic distributed application of the independent task assignment problem,
we introduce the site-to-crawler assignment problem, where a large number
of geographically distributed web servers are crawled by a multi-site distributed
crawling system and the objective is to minimize the duration of the crawl. We
show that this problem can be modeled as an independent task assignment problem.
As a solution to the problem, we evaluate a large number of state-of-the-art
task assignment heuristics selected from the literature as well as the improved
versions and the newly developed multi-level task assignment algorithm. We
compare the performance of different approaches through simulations on very
large, real-life web datasets. Our results indicate that multi-site web crawling
efficiency can be considerably improved using the independent task assignment
approach, when compared to relatively easy-to-implement yet naive baselines.
Tabak, E. Kartal. Ph.D.
Big Data Computing for Geospatial Applications
The convergence of big data and geospatial computing has brought forth challenges and opportunities to Geographic Information Science with regard to geospatial data management, processing, analysis, modeling, and visualization. This book highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, while also demonstrating opportunities for using big data in geospatial applications. Crucial to the advancements highlighted in this book is the integration of computational thinking and spatial thinking, and the transformation of abstract ideas and models into concrete data structures and algorithms.
Improving the efficiency of search engines : strategies for focused crawling, searching, and index pruning
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Ph. D.) -- Bilkent University, 2009. Includes bibliographical references, leaves 157-169.
Search engines are the primary means of retrieval for text data that is abundantly
available on the Web. A standard search engine should carry out three
fundamental tasks, namely; crawling the Web, indexing the crawled content, and
finally processing the queries using the index. Devising efficient methods for these
tasks is an important research topic. In this thesis, we introduce efficient strategies
related to all three tasks involved in a search engine. Most of the proposed
strategies are essentially applicable when a grouping of documents in its broadest
sense (i.e., in terms of automatically obtained classes/clusters, or manually
edited categories) is readily available or can be constructed in a feasible manner.
Additionally, we also introduce static index pruning strategies that are based on
the query views.
For the crawling task, we propose a rule-based focused crawling strategy that
exploits interclass rules among the document classes in a topic taxonomy. These
rules capture the probability of having hyperlinks between two classes. The rule-based
crawler can tunnel toward on-topic pages by following a path of off-topic
pages, and thus yields a higher harvest rate when crawling for on-topic pages.
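A best-first focused crawler driven by interclass rule probabilities can be sketched as follows. The `fetch`/`classify` callables and the shape of `rules` are assumptions for illustration, not the thesis's implementation; the point is that an off-topic page with a promising class still receives a positive priority, which is what enables tunneling:

```python
import heapq

def crawl(seed_urls, fetch, classify, rules, target, budget):
    """Best-first focused crawl. rules[c] is the (assumed) probability
    that a page of class c leads, directly or via a path, to a page of
    the target class."""
    frontier = [(-1.0, u) for u in seed_urls]   # seeds get top priority
    heapq.heapify(frontier)
    seen, harvested = set(seed_urls), []
    while frontier and len(seen) < budget:
        _, url = heapq.heappop(frontier)
        links, cls = fetch(url), classify(url)
        if cls == target:
            harvested.append(url)
        for v in links:
            if v not in seen:
                seen.add(v)
                # priority of a link = rule probability of its parent's class
                heapq.heappush(frontier, (-rules.get(cls, 0.0), v))
    return harvested
```

With a high rule probability for an off-topic class that frequently links to the target class, the crawler follows the off-topic path instead of abandoning it.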
In the context of indexing and query processing tasks, we concentrate on conducting
efficient search, again, using document groups; i.e., clusters or categories.
In typical cluster-based retrieval (CBR), first, clusters that are most similar to a
given free-text query are determined, and then documents from these clusters are
selected to form the final ranked output. For efficient CBR, we first identify and
evaluate some alternative query processing strategies. Next, we introduce a new
index organization, called the cluster-skipping inverted index structure (CS-IIS).
It is shown that typical CBR with CS-IIS outperforms previous CBR strategies
(with an ordinary index) for a number of datasets and under varying search parameters.
In this thesis, an enhanced version of CS-IIS is further proposed, in
which all information to compute query-cluster similarities during query evaluation
is stored. We introduce an incremental-CBR strategy that operates on top
of this latter index structure, and demonstrate its search efficiency for different
scenarios.
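The key idea of cluster skipping can be sketched with postings grouped into per-cluster blocks, so that the blocks of unselected clusters are bypassed during scoring (a toy tf-only sketch, not the actual CS-IIS on-disk layout, which stores explicit skip pointers):

```python
def build_cs_postings(doc_cluster, term_docs):
    """Group a term's postings by cluster:
    [(cluster_id, [(doc, tf), ...]), ...], sorted by cluster id.
    A real CS-IIS stores a skip pointer per cluster block; grouping by
    cluster gives the same ability to jump over unselected clusters."""
    groups = {}
    for doc, tf in term_docs:
        groups.setdefault(doc_cluster[doc], []).append((doc, tf))
    return sorted(groups.items())

def score_selected(postings, selected, scores):
    """Accumulate tf scores only for documents in selected clusters,
    skipping whole blocks of unselected clusters."""
    for cluster, block in postings:
        if cluster not in selected:   # skip the entire block
            continue
        for doc, tf in block:
            scores[doc] = scores.get(doc, 0) + tf
    return scores
```

During query evaluation, only the blocks of the best-matching clusters are scored; the rest of the posting list is never touched.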
Finally, we exploit query views that are obtained from the search engine query
logs to tailor more effective static pruning techniques. This is also related to the
indexing task involved in a search engine. In particular, the query view approach
is incorporated into a set of existing pruning strategies, as well as some new
variants proposed by us. We show that query view based strategies significantly
outperform the existing approaches in terms of the query output quality, for both
disjunctive and conjunctive evaluation of queries.
Altıngövde, İsmail Sengör. Ph.D.
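A query-view-based static pruning rule can be sketched as follows: postings of documents that previously answered queries containing a term are always kept, and any remaining slots are filled by term frequency. The scoring and cut-off details here are illustrative assumptions, not the thesis's exact strategies:

```python
def prune_by_query_views(index, query_views, keep_top):
    """Static index pruning sketch. index: term -> [(doc, tf), ...].
    query_views: term -> set of docs that answered past queries
    containing the term. Query-view postings are always retained;
    the rest compete for the remaining keep_top slots by tf."""
    pruned = {}
    for term, postings in index.items():
        qv = query_views.get(term, set())
        kept = [p for p in postings if p[0] in qv]
        rest = sorted((p for p in postings if p[0] not in qv),
                      key=lambda p: -p[1])
        pruned[term] = kept + rest[:max(0, keep_top - len(kept))]
    return pruned
```

The pruned index stays small, yet the documents users actually retrieve for a term survive pruning regardless of their static scores.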
Attacking DoH and ECH: Does Server Name Encryption Protect Users’ Privacy?
Privacy on the Internet has become a priority, and several efforts have been devoted to limit the leakage of personal information. Domain names, both in the TLS Client Hello and DNS traffic, are among the last pieces of information still visible to an observer in the network. The Encrypted Client Hello extension for TLS, DNS over HTTPS or over QUIC protocols aim to further increase network confidentiality by encrypting the domain names of the visited servers. In this article, we check whether an attacker able to passively observe the traffic of users could still recover the domain name of websites they visit even if names are encrypted. By relying on large-scale network traces, we show that simplistic features and off-the-shelf machine learning models are sufficient to achieve surprisingly high precision and recall when recovering encrypted domain names. We consider three attack scenarios, i.e., recovering the per-flow name, rebuilding the set of visited websites by a user, and checking which users visit a given target website. We next evaluate the efficacy of padding-based mitigation, finding that all three attacks are still effective, despite resources wasted with padding. We conclude that current proposals for domain encryption may produce a false sense of privacy, and more robust techniques should be envisioned to offer protection to end users
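The per-flow name-recovery attack can be illustrated with a deliberately simple model: fit per-(server IP, domain) centroids of flow-size features on traffic where names were still visible, then assign an encrypted flow to the nearest centroid on the same IP. The feature set and classifier below are toy assumptions for illustration; the article relies on off-the-shelf machine-learning models over large-scale traces:

```python
from collections import defaultdict

def fit_centroids(flows):
    """flows: ((server_ip, bytes_up, bytes_down), domain) pairs observed
    while names were visible (e.g. pre-ECH traces). Returns per-(ip,
    domain) mean upload/download volumes."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (ip, up, down), domain in flows:
        s = sums[(ip, domain)]
        s[0] += up
        s[1] += down
        s[2] += 1
    return {key: (s[0] / s[2], s[1] / s[2]) for key, s in sums.items()}

def recover_name(centroids, flow):
    """Guess the encrypted domain of a flow: among domains previously
    seen on the same server IP, pick the nearest centroid by squared
    distance on the volume features."""
    ip, up, down = flow
    best, best_d = None, None
    for (cip, domain), (cu, cd) in centroids.items():
        if cip != ip:
            continue
        d = (up - cu) ** 2 + (down - cd) ** 2
        if best_d is None or d < best_d:
            best, best_d = domain, d
    return best
```

Even this crude model separates co-hosted domains whose typical transfer sizes differ, which is the intuition behind the article's finding that padding alone does not defeat the attack.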