24 research outputs found

    Top-k String Auto-Completion with Synonyms

    Auto-completion is one of the most prominent features of modern information systems. Existing auto-completion solutions suggest completions based on the beginning of the character sequence typed so far (i.e., the prefix). However, in many real applications, an entity often has synonyms or abbreviations. For example, "DBMS" is an abbreviation of "Database Management Systems". In this paper, we study a novel type of auto-completion that uses synonyms and abbreviations. We propose three trie-based algorithms to solve top-k auto-completion with synonyms, each with a different space and time complexity trade-off. Experiments on large-scale datasets show that it is possible to support effective and efficient synonym-based retrieval of completions of a million strings with thousands of synonym rules at about a microsecond per completion, with a small space overhead (i.e., 160-200 bytes per string). Peer reviewed.
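
The idea of completing through synonyms can be illustrated with a naive sketch. This is not one of the paper's three algorithms: it simply indexes every string together with its synonym-rewritten variants in a single trie, and each terminal node points back to the original string. The names `Node`, `variants`, `build`, and `complete`, the rule format, and the scores are all assumptions made for illustration.

```python
import heapq


class Node:
    def __init__(self):
        self.children = {}
        self.targets = []  # (score, original string) pairs ending at this node


def variants(s, rules):
    """Yield s plus each variant obtained by applying one rule once."""
    yield s
    for full, short in rules:
        if full in s:
            yield s.replace(full, short, 1)


def build(scored_strings, rules):
    """Insert every string and its synonym variants into one trie."""
    root = Node()
    for s, score in scored_strings:
        for v in variants(s, rules):
            node = root
            for ch in v:
                node = node.children.setdefault(ch, Node())
            node.targets.append((score, s))
    return root


def complete(root, prefix, k):
    """Return up to k original strings reachable from prefix, best score first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    best = {}
    stack = [node]
    while stack:  # exhaustive subtree scan; a real index would prune by score
        n = stack.pop()
        for score, s in n.targets:
            best[s] = max(score, best.get(s, score))
        stack.extend(n.children.values())
    top = heapq.nlargest(k, best.items(), key=lambda kv: (kv[1], kv[0]))
    return [s for s, _ in top]


strings = [("Database Management Systems", 10),
           ("Database Design", 7),
           ("Data Mining", 5)]
rules = [("Database Management Systems", "DBMS")]
root = build(strings, rules)
print(complete(root, "DB", 2))
print(complete(root, "Data", 2))
```

Typing "DB" reaches "Database Management Systems" only through the abbreviation rule, because no original string starts with that prefix. Storing every variant is the simplest option but also the most space-hungry one, which is the kind of trade-off the abstract refers to.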

    An enterprise search paradigm based on extended query auto-completion: do we still need search and navigation?

    Enterprise query auto-completion (QAC) can allow website or intranet visitors to satisfy a need more efficiently than traditional searching and browsing. The limited scope of an enterprise makes it possible to satisfy a high proportion of information needs through completion. Further, the availability of structured sources of completions such as product catalogues compensates for sparsity of log data. Extended forms (X-QAC) can give access to information that is inaccessible via a conventional crawled index. We show that it can be guaranteed that for every suggestion there is a prefix which causes it to appear in the top k suggestions. Using university query logs and structured lists, we quantify the significant keystroke savings attributable to this guarantee (worst case). Such savings may be of particular value for mobile devices. A user experiment showed that a staff lookup task took an average of 61% longer with a conventional search interface than with an X-QAC system. Using wine catalogue data, we demonstrate a further extension which allows a user to home in on desired items in faceted-navigation style. We also note that advertisements can be triggered from QAC. Given the advantages and power of X-QAC systems, we envisage that websites and intranets of the [near] future will provide less navigation and rely less on conventional search.

    Efficient Methods for Knowledge Base Construction and Query

    Recently, knowledge bases have been widely used in search engines, question-answering systems, and many other applications. The abundant entity profiles and relational information in knowledge bases help downstream applications learn more about user queries. However, in automated knowledge base construction, ambiguity in data sources is one of the main challenges. Given a constructed knowledge base, it is hard to efficiently find entities of interest and extract their relatedness information because of the knowledge base's large size. In this thesis, we adopt natural language processing tools, machine learning and graph/text query techniques to deal with such challenges. First, we introduce a machine-learning-based framework for efficient entity linking to deal with the ambiguity issue in documents. For entity linking, deep-learning-based methods have outperformed traditional machine-learning-based ones but demand a large amount of data and are costly to train. We propose a lightweight, customisable and time-efficient method, which is based on traditional machine learning techniques. Our approach achieves performance comparable to the state-of-the-art deep-learning-based ones while being significantly faster to train. Second, we adopt deep learning to deal with the Entity Resolution (ER) problem, which aims to reduce data ambiguity in structured data sources. The existing BERT-based method has set new state-of-the-art performance on the ER task, but it suffers from high computational cost due to the large cardinality to match. We propose to use BERT in a siamese network to encode the entities separately and adopt the blocking-matching scheme in a multi-task learning framework. The blocking module filters out candidate entity pairs that are unlikely to be matched, while the matching module uses an enhanced alignment network to decide if a pair is a match. Experiments show that our approach outperforms state-of-the-art models in both efficiency and effectiveness. Third, we propose a flexible query auto-completion (QAC) framework to support efficient error-tolerant QAC for entity queries in the knowledge base. Most existing works overlook the quality of the suggested completions, and their efficiency needs to be improved. Our framework is designed on the basis of a noisy channel model, which consists of a language model and an error model. Thus, many QAC ranking methods and spelling correction methods can be easily plugged into the framework. To address the efficiency issue, we devise a neighbourhood generation method accompanied by a trie index to quickly find candidates for the error model. The experiments show that our method improves the state of the art of error-tolerant QAC. Last but not least, we design a visualisation system to facilitate efficient relatedness queries in a large-scale knowledge graph. Given a pair of entities, we aim to efficiently extract a succinct sub-graph to explain the relatedness of the pair of entities. Existing methods, either graph-based or list-based, all have some limitations when dealing with large complex graphs. We propose to use bisimulation to summarise the sub-graph, where semantically similar entities are combined. Our method exhibits the most prominent patterns while keeping them in an integrated graph.
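
The noisy channel formulation mentioned for error-tolerant QAC can be sketched roughly as follows. This is a toy illustration, not the thesis's framework. The language model is assumed to be a plain dictionary of candidate priors. The error model is assumed to charge a fixed `error_rate` per edit between the typed query and the closest prefix of a candidate. All candidates are scanned linearly, with no neighbourhood generation or trie index. The function names and numbers are hypothetical.

```python
import math


def prefix_edit_distance(query, candidate):
    """Smallest edit distance between query and any prefix of candidate."""
    prev = list(range(len(candidate) + 1))
    for i, cq in enumerate(query, 1):
        cur = [i]
        for j, cc in enumerate(candidate, 1):
            cur.append(min(prev[j] + 1,                # delete from query
                           cur[j - 1] + 1,             # insert into query
                           prev[j - 1] + (cq != cc)))  # match or substitute
        prev = cur
    return min(prev)  # best alignment over all candidate prefixes


def rank(query, language_model, k, error_rate=0.1):
    """Rank candidates by log P(candidate) + log P(query | candidate)."""
    scored = []
    for cand, prior in language_model.items():
        d = prefix_edit_distance(query, cand)
        score = math.log(prior) + d * math.log(error_rate)
        scored.append((score, cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:k]]


lm = {"albert einstein": 0.6, "alan turing": 0.3, "ada lovelace": 0.1}
print(rank("albrt", lm, 1))
print(rank("ala", lm, 2))
```

Because the two models are combined only through a sum of log-probabilities, either one can be swapped out independently, which is the pluggability the abstract describes.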

    Efficient String Edit Similarity Join Algorithm

    String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is their support for similarity joins on short strings. Their main weakness appears when joining long and distant strings: they generate and maintain many similar prefixes, called active nodes, which must be removed in a subsequent pruning phase. With a large edit distance threshold, the number of active nodes becomes quite large. We propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method, which minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. PreJoin is scaled in two ways. First, a dynamic reordering of the trie index is used to accelerate the search for similar string pairs. Second, a partitioning of the string space is used to improve performance on large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art trie-based join approaches by a factor of five.
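
For readers unfamiliar with the problem, a brute-force baseline makes the task concrete. The sketch below is not PreJoin and uses no trie. It compares strings pairwise with a thresholded edit-distance check and a length filter. The names `within_edit_distance` and `similarity_join` and the threshold parameter `tau` are assumptions made for illustration.

```python
def within_edit_distance(a, b, tau):
    """Return True if the edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:  # length filter
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        if min(cur) > tau:  # row minima never decrease, so stop early
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_join(strings, tau):
    """Return all pairs of strings whose edit distance is at most tau."""
    pairs = []
    ordered = sorted(strings, key=len)
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if len(b) - len(a) > tau:  # all later strings are even longer
                break
            if within_edit_distance(a, b, tau):
                pairs.append((a, b))
    return pairs


print(similarity_join(["cat", "cart", "dog", "cast"], 1))
```

The early exit inside `within_edit_distance` is a small-scale analogue of pruning early rather than late, but the join itself remains quadratic in the number of strings, which is what trie-based methods aim to avoid.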

    Boosting the Quality of Approximate String Matching by Synonyms

    A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., the number of common words or q-grams. While this is indeed an indicator of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William", and "Database Management Systems" can be abbreviated as "DBMS". Given a collection of predefined synonyms, the purpose of this article is to exploit such existing knowledge to effectively evaluate the similarity between two strings and efficiently perform similarity searches and joins, thereby boosting the quality of approximate string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. We then study efficient algorithms for similarity searches and joins by proposing two novel indexes, called SI-tree and QP-tree, which combine signature filtering and length filtering strategies. To improve the efficiency of our algorithms, we develop an estimator of the number of candidates to enable an online selection of signature filters. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the experimental results from a comprehensive study of the algorithms with three real datasets verify the effectiveness and efficiency of our approaches. Peer reviewed.
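
One way to picture an expansion-based measure is the simplified sketch below. It is an illustration under stated assumptions, not the article's exact definition. Strings are split into lower-cased word tokens, every rule is treated as bidirectional, each applicable rule adds the other side's tokens to the token set, and similarity is the Jaccard coefficient of the two expanded sets. All function names are hypothetical.

```python
def tokens(s):
    return s.lower().split()


def contains_phrase(toks, phrase):
    """True if phrase occurs as a contiguous run of tokens in toks."""
    n = len(phrase)
    return any(toks[i:i + n] == phrase for i in range(len(toks) - n + 1))


def expand(s, rules):
    """Token set of s, enlarged with the other side of every applicable rule."""
    toks = tokens(s)
    expanded = set(toks)
    for lhs, rhs in rules:
        left, right = tokens(lhs), tokens(rhs)
        if contains_phrase(toks, left):
            expanded.update(right)
        if contains_phrase(toks, right):
            expanded.update(left)
    return expanded


def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def synonym_similarity(a, b, rules):
    return jaccard(expand(a, rules), expand(b, rules))


rules = [("Bill", "William"), ("Database Management Systems", "DBMS")]
plain = jaccard(set(tokens("Bill Gates")), set(tokens("William Gates")))
print(plain)
print(synonym_similarity("Bill Gates", "William Gates", rules))
```

Without expansion, the two names share only one of three distinct tokens. With the "Bill"/"William" rule applied to both sides, the expanded token sets coincide.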

    ClickINC: In-network Computing as a Service in Heterogeneous Programmable Data-center Networks

    In-Network Computing (INC) has found many applications for performance boosts or cost reduction. However, given heterogeneous devices, diverse applications, and multi-path network topologies, it is cumbersome and error-prone for application developers to effectively utilize the available network resources and gain predictable benefits without impeding normal network functions. Previous work is oriented more towards network operators than application developers. We develop ClickINC to streamline INC programming and deployment using a unified and automated workflow. ClickINC provides INC developers with modular programming abstractions, so that they need not be concerned with the states of the devices or the network topology. We describe the ClickINC framework, model, language, workflow, and corresponding algorithms. Experiments on both an emulator and a prototype system demonstrate its feasibility and benefits.

    Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web

    The subject of the dissertation is an information alignment experiment of two cultural heritage information systems (ALAP): The Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks such as business decision making or even catastrophe management. It is beyond doubt that the information available in digital form can offer users new ways of interaction. Also, in the humanities and cultural heritage communities, more and more information is being published online. But in many situations the way that information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Therefore, integrated information will be a key factor in pursuing successful research, and the need for information alignment is widely recognized. ALAP is an attempt to integrate information from Perseus and Arachne, not only on a schema level, but also by performing entity resolution. To that end, technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are discussed and evaluated. The methodology that is used to implement ALAP is mainly rooted in the fields of information retrieval and knowledge discovery. First, an exploratory analysis was performed on both information systems to get a first impression of the data. After that, (semi-)structured information from both systems was extracted and normalized. Then, a clustering algorithm was used to reduce the number of needed entity comparisons. Finally, a thorough matching was performed on the different clusters. ALAP helped with identifying challenges and highlighted the opportunities that arise during the attempt to align cultural heritage information systems.