Substring filtering for low-cost linked data interfaces
Recently, Triple Pattern Fragments (TPFs) were introduced as a low-cost server-side interface for scenarios in which high numbers of clients need to evaluate SPARQL queries. Scalability is achieved by moving part of the query execution to the client, at the cost of longer query times. Since the TPF interface purposely does not support complex constructs such as SPARQL filters, queries that use them must be executed mostly on the client, resulting in long execution times. We therefore investigated the impact of adding a literal substring matching feature to the TPF interface, with the goal of improving query performance while maintaining a low server cost. In this paper, we discuss the client/server setup and compare the performance of SPARQL queries across multiple implementations, including Elasticsearch and a case-insensitive FM-index. Our evaluations indicate that these improvements allow for faster query execution without significantly increasing the load on the server. Offering the substring feature on TPF servers allows users to obtain faster responses for filter-based SPARQL queries. Furthermore, substring matching can be used to support other filters, such as complete regular expressions or range queries.
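The trade-off described above can be sketched in a few lines: with plain TPF, the client downloads all candidate triples and applies the FILTER locally, while the proposed substring extension lets the server ship only matching triples. The data and function names below are illustrative, not the paper's actual interface.

```python
# Hypothetical sketch of server-side substring filtering for a TPF-style
# interface; triples and names are invented for illustration.

triples = [
    ("ex:s1", "rdfs:label", "Triple Pattern Fragments"),
    ("ex:s2", "rdfs:label", "Linked Data Fragments"),
    ("ex:s3", "rdfs:label", "SPARQL endpoint"),
]

def tpf_server(substring=None):
    """Simulated server: plain TPF returns every matching triple; the
    substring extension also filters literal objects server-side."""
    results = triples
    if substring is not None:  # the proposed extension
        results = [t for t in results if substring.lower() in t[2].lower()]
    return results

# Plain TPF: the client downloads all triples and filters locally.
client_side = [t for t in tpf_server() if "fragments" in t[2].lower()]
# Extended TPF: the server ships only the matching triples.
server_side = tpf_server(substring="fragments")
assert client_side == server_side  # same answer, less data transferred
```

Both paths return the same two triples; the difference is how much data crosses the wire, which is exactly where the query-time savings come from.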
c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches
Given a dynamic set K of k strings of total length n whose characters are drawn from an alphabet of size σ, a keyword dictionary is a data structure built on K that provides locate, prefix search, and update operations on K. Under the assumption that α = w / lg σ characters fit into a single machine word of w bits, we propose a keyword dictionary that represents K in n lg σ + Θ(k lg n) bits of space, supporting all operations in O(m / α + lg α) expected time on an input string of length m in the word RAM model. This data structure is complemented by an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches, one of the most elementary keyword dictionary operations.
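The keyword-dictionary operations named above (insert, locate, prefix search) can be illustrated with a plain pointer-based trie; this minimal sketch makes no attempt at c-trie++'s compressed, word-packed representation and is only meant to show what the operations do.

```python
# Minimal dynamic trie sketch: insert, locate (exact lookup), and
# prefix search. Not the c-trie++ data structure, just the interface.

class Trie:
    def __init__(self):
        self.children = {}   # one child per character
        self.terminal = False  # True if a stored string ends here

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def locate(self, word):
        node = self._walk(word)
        return node is not None and node.terminal

    def prefix_search(self, prefix):
        node = self._walk(prefix)
        if node is None:
            return []
        out, stack = [], [(node, prefix)]
        while stack:  # depth-first collection of all completions
            n, s = stack.pop()
            if n.terminal:
                out.append(s)
            for ch, child in n.children.items():
                stack.append((child, s + ch))
        return sorted(out)

    def _walk(self, s):
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

t = Trie()
for w in ["trie", "tree", "trip", "top"]:
    t.insert(w)
print(t.prefix_search("tr"))  # → ['tree', 'trie', 'trip']
```

A pointer-based trie like this costs Θ(1) machine words per node; the point of compressed designs such as c-trie++ is to bring that down toward the n lg σ information-theoretic bound while keeping operations fast.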
Large-scale Generative Query Autocompletion
Query Autocompletion (QAC) systems are interactive tools that assist a searcher in entering a query given a partial query prefix. Existing QAC research -- with a number of notable exceptions -- relies upon large existing query logs from which to extract historical queries. These queries are then ordered by some ranking algorithm as candidate completions, given the query prefix. Given the numerous search environments (e.g. enterprises, personal or secured data repositories) in which large query logs are unavailable, the need for synthetic -- or generative -- QAC systems will become increasingly important. Generative QAC systems may be used to augment traditional query-based approaches, and/or entirely replace them in certain privacy-sensitive applications. Even in commercial Web search engines, a significant proportion (up to 15%) of queries issued daily have never been seen previously, meaning there will always be opportunity to assist users in formulating queries which have not occurred historically. In this paper, we describe a system that can construct generative QAC suggestions within a user-acceptable timeframe (~58 ms), and report on a series of experiments over three publicly available, large-scale question sets that investigate different aspects of the system's performance.
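The log-based baseline that generative QAC aims to complement can be sketched as most-popular-completion (MPC): rank historical queries matching the prefix by frequency. The tiny log below is invented for illustration; its failure on an unseen prefix is exactly the gap the abstract describes.

```python
# Most-popular-completion (MPC) sketch over an invented query log,
# showing why purely log-based QAC fails on unseen prefixes.

from collections import Counter

log = Counter({"weather today": 50, "weather radar": 30, "web search": 20})

def mpc(prefix, k=2):
    ranked = sorted(log.items(), key=lambda x: -x[1])  # by frequency
    return [q for q, _ in ranked if q.startswith(prefix)][:k]

print(mpc("wea"))   # → ['weather today', 'weather radar']
print(mpc("webm"))  # → [] : no historical query matches, so a purely
                    #        log-based system has nothing to suggest
```

A generative system would synthesize a plausible completion for "webm" instead of returning nothing, which is the motivation for the hybrid designs discussed above.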
An enterprise search paradigm based on extended query auto-completion: do we still need search and navigation?
Enterprise query auto-completion (QAC) can allow website or intranet visitors to satisfy a need more efficiently than traditional searching and browsing. The limited scope of an enterprise makes it possible to satisfy a high proportion of information needs through completion. Further, the availability of structured sources of completions, such as product catalogues, compensates for sparsity of log data. Extended forms (X-QAC) can give access to information that is inaccessible via a conventional crawled index.

We show that it can be guaranteed that for every suggestion there is a prefix which causes it to appear in the top k suggestions. Using university query logs and structured lists, we quantify the significant keystroke savings attributable to this guarantee (worst case). Such savings may be of particular value for mobile devices. A user experiment showed that a staff lookup task took an average of 61% longer with a conventional search interface than with an X-QAC system.

Using wine catalogue data, we demonstrate a further extension which allows a user to home in on desired items in faceted-navigation style. We also note that advertisements can be triggered from QAC. Given the advantages and power of X-QAC systems, we envisage that websites and intranets of the [near] future will provide less navigation and rely less on conventional search.
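The top-k guarantee above implies a worst-case keystroke-savings measure: for each suggestion, find the shortest prefix under which it enters the top k. The sketch below uses an invented score table and a frequency ranking, not the paper's actual X-QAC ranking.

```python
# Sketch of worst-case keystroke savings under the top-k guarantee.
# Suggestions and scores are invented for illustration.

suggestions = {"staff directory": 90, "staff parking": 40, "student portal": 70}

def top_k(prefix, k):
    hits = [s for s in suggestions if s.startswith(prefix)]
    return sorted(hits, key=lambda s: -suggestions[s])[:k]

def shortest_prefix(target, k):
    """Shortest prefix that puts `target` in the top-k suggestions."""
    for i in range(1, len(target) + 1):
        if target in top_k(target[:i], k):
            return target[:i]
    return target  # typing the full string always works (the guarantee)

for s in suggestions:
    p = shortest_prefix(s, k=1)
    print(s, "->", repr(p), "saves", len(s) - len(p), "keystrokes")
```

Even the lowest-ranked suggestion eventually becomes unambiguous ("staff p" suffices for "staff parking" here), which is what makes the worst-case savings quantifiable across a whole suggestion set.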
Top-k String Auto-Completion with Synonyms
Auto-completion is one of the most prominent features of modern information systems. Existing auto-completion solutions provide suggestions based on the beginning of the currently entered character sequence (i.e. the prefix). However, in many real applications, an entity often has synonyms or abbreviations. For example, "DBMS" is an abbreviation of "Database Management Systems". In this paper, we study a novel type of auto-completion that uses synonyms and abbreviations. We propose three trie-based algorithms to solve top-k auto-completion with synonyms, each with different space and time complexity trade-offs. Experiments on large-scale datasets show that it is possible to support effective and efficient synonym-based retrieval of completions of a million strings with thousands of synonym rules at about a microsecond per completion, while incurring only a small space overhead (160-200 bytes per string).
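The core idea of synonym-aware completion can be sketched by indexing each string under the variants its synonym/abbreviation rules produce, so a prefix of either form finds it. The rules and strings below are illustrative; the paper's three trie-based algorithms are not reproduced here.

```python
# Hedged sketch of synonym-aware completion via variant expansion;
# rules and strings are invented for illustration.

rules = {"database management systems": "dbms"}  # long form -> abbreviation
strings = ["database management systems course", "db tuning guide"]

index = {}  # variant string -> canonical string
for s in strings:
    variants = {s}
    for long_form, abbrev in rules.items():
        variants.add(s.replace(long_form, abbrev))  # abbreviate
        variants.add(s.replace(abbrev, long_form))  # expand
    for v in variants:
        index[v] = s

def complete(prefix, k=5):
    return sorted({index[v] for v in index if v.startswith(prefix)})[:k]

print(complete("dbms"))  # → ['database management systems course']
```

Naive expansion multiplies the index size by the number of applicable rules per string, which is why the paper's trie-based algorithms trade space against lookup time more carefully.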
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is, and whether we will run out of unique high-quality data soon. Contrary to previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B- and 7.5B-parameter language models trained on it.
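The deduplication step mentioned above can be sketched in its simplest exact form: normalize each document and keep only the first copy of each hash. RefinedWeb's actual pipeline (fuzzy deduplication, multi-stage quality filtering) is far more involved; this only illustrates the basic idea.

```python
# Illustrative exact-deduplication pass of the kind used in web-corpus
# pipelines; not RefinedWeb's actual (fuzzy, multi-stage) pipeline.

import hashlib
import re

def normalize(text):
    """Collapse whitespace and case so trivial variants hash equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(docs):
    seen, kept = set(), []
    for d in docs:
        h = hashlib.sha256(normalize(d).encode()).hexdigest()
        if h not in seen:  # keep only the first copy of each document
            seen.add(h)
            kept.append(d)
    return kept

docs = ["Hello  world", "hello world", "Fresh content"]
print(deduplicate(docs))  # → ['Hello  world', 'Fresh content']
```

Exact hashing only catches near-identical copies after normalization; catching paraphrased or partially overlapping documents requires fuzzy techniques such as MinHash, which is one reason deduplication at trillion-token scale is a substantial engineering effort.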
A file-based linked data fragments approach to prefix search
Text fields that need to look up specific entities in a dataset can be equipped with autocompletion functionality. When a dataset becomes too large to be embedded in the page, setting up a full-text search API is not the only alternative: alternate API designs that balance different trade-offs, such as archivability, cacheability, and privacy, may not require setting up a new back-end architecture. In this paper, we propose to perform prefix search over a fragmentation of the dataset, enabling the client to take part in the query execution by navigating through the fragmented dataset. Our proposal consists of (i) a self-describing fragmentation strategy, (ii) a client-side search algorithm, and (iii) an evaluation of the proposed solution, based on a small dataset of 73k entities and a large dataset of 3.87M entities. We found that the server cache hit ratio is three times higher compared to a server-side prefix search API, at the cost of higher bandwidth consumption. Nevertheless, acceptable user-perceived performance was measured: assuming 150 ms as an acceptable waiting time between keystrokes, this approach allows 15 entities per prefix to be retrieved within that interval. We conclude that an alternate set of trade-offs has been established for specific prefix search use cases: having added more choice to the spectrum of Web APIs for autocompletion, a file-based approach enables more datasets to afford prefix search.
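The client-side role in this design can be sketched as follows: the dataset is split into static, cacheable fragments keyed by short prefixes, and the client fetches the fragment for the typed prefix and filters it locally. The fragment layout and names below are invented, not the paper's self-describing fragmentation strategy.

```python
# Sketch of client-side prefix search over a prefix-keyed fragmentation.
# In practice each fragment would be a static, cacheable file on a
# plain web server; here it is simulated with a dict.

fragments = {
    "a": ["aalst", "aarhus", "antwerp"],
    "b": ["brussels", "bruges"],
}

def client_prefix_search(prefix, k=15):
    fragment = fragments.get(prefix[:1], [])  # one cheap "file fetch"
    return [e for e in fragment if e.startswith(prefix)][:k]

print(client_prefix_search("br"))  # → ['brussels', 'bruges']
```

Because many different prefixes map to the same fragment, the same static file serves many keystrokes, which is consistent with the higher server cache hit ratio reported above, at the cost of shipping some non-matching entities to the client.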