Back to the basics: a quantitative analysis of statistical and graph-based term weighting schemes for keyword extraction
Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as the default, despite the existence of other suitable alternatives, including graph-based models. In this paper, we perform an exhaustive and large-scale empirical comparison of both statistical and graph-based term weighting methods in the context of keyword extraction. Our analysis reveals some interesting findings, such as the advantages of the lesser-known lexical specificity with respect to tf-idf, and the qualitative differences between statistical and graph-based methods. Finally, based on our findings we discuss and devise some suggestions for practitioners
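As a point of reference for the tf-idf baseline mentioned in this abstract, the following is a minimal Python sketch of one common tf-idf variant (relative term frequency times log inverse document frequency) used to rank candidate keywords. The function names `tf_idf` and `top_keywords`, the toy documents, and the exact weighting variant are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Return the k highest-weighted terms of one document."""
    return sorted(doc_weights, key=doc_weights.get, reverse=True)[:k]


# Hypothetical toy corpus, for illustration only.
docs = [
    "graph based term weighting for keyword extraction".split(),
    "term weighting in information retrieval".split(),
    "graph matching with neural networks".split(),
]
weights = tf_idf(docs)
print(top_keywords(weights[0], k=4))
```

Terms that occur in every document receive a weight of zero under this variant, which is one reason alternative schemes can rank keywords differently.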
Sub-GMN: The Neural Subgraph Matching Network Model
As one of the most fundamental tasks in graph theory, subgraph matching is
crucial in many fields, ranging from information retrieval, computer
vision, biology, and chemistry to natural language processing. Yet the subgraph
matching problem remains NP-complete. This study proposes an
end-to-end learning-based approximate method for the subgraph matching task, called
the subgraph matching network (Sub-GMN). The proposed Sub-GMN first uses graph
representation learning to map nodes to node-level embeddings. It then combines
metric learning and attention mechanisms to model the relationship between
matched nodes in the data graph and query graph. To test the performance of the
proposed method, we applied it to two datasets. We used two existing
methods, GNN and FGNN, as baselines for comparison. Our experiments show that, on
dataset 1, the accuracy of Sub-GMN is on average 12.21% and 3.2% higher than
that of GNN and FGNN, respectively. Sub-GMN also runs on average 20-40
times faster than FGNN. In addition, the average F1-score of Sub-GMN on all
experiments with dataset 2 reached 0.95, which demonstrates that Sub-GMN
outputs more correct node-to-node matches.
Compared with previous GNN-based methods for the subgraph matching task,
our proposed Sub-GMN allows varying query and data graphs in the
test/application stage, while most previous GNN-based methods can only find a
matched subgraph in the data graph during the test/application stage for the same
query graph used in the training stage. Another advantage of our proposed
Sub-GMN is that it can output a list of node-to-node matches, while most
existing end-to-end GNN-based methods cannot provide the matched node pairs
CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs
Approximate Nearest Neighbor Search (ANNS) plays a critical role in various
disciplines spanning data mining and artificial intelligence, from information
retrieval and computer vision to natural language processing and recommender
systems. Data volumes have soared in recent years and the computational cost of
an exhaustive exact nearest neighbor search is often prohibitive, necessitating
the adoption of approximate techniques. The balanced performance and recall of
graph-based approaches have more recently garnered significant attention in
ANNS algorithms; however, only a few studies have explored harnessing the power
of GPUs and multi-core processors despite the widespread use of massively
parallel and general-purpose computing. To bridge this gap, we introduce a
novel parallel computing hardware-based proximity graph and search algorithm.
By leveraging the high-performance capabilities of modern hardware, our
approach achieves remarkable efficiency gains. In particular, our method
surpasses existing CPU and GPU-based methods in constructing the proximity
graph, demonstrating higher throughput in both large- and small-batch searches
while maintaining comparable accuracy. In graph construction time, our method,
CAGRA, is 2.2-27x faster than HNSW, which is one of the CPU SOTA
implementations. In large-batch query throughput in the 90% to 95% recall
range, our method is 33-77x faster than HNSW, and is 3.8-8.8x faster than the
SOTA implementations for GPU. For a single query, our method is 3.4-53x faster
than HNSW at 95% recall
Queensland University of Technology at TREC 2005
The Information Retrieval and Web Intelligence (IR-WI) research group is a research team at the Faculty of Information Technology, QUT, Brisbane, Australia. The IR-WI group participated in the Terabyte and Robust tracks at TREC 2005, both for the first time. For the Robust track we applied our existing information retrieval system, which was originally designed for structured (XML) retrieval, to the domain of document retrieval. For the Terabyte track we experimented with an open source IR system, Zettair, and performed two types of experiments. First, we compared Zettair’s performance on a high-powered supercomputer and on a distributed system across seven midrange personal computers. Second, we compared Zettair’s performance with a standard TREC title query, a natural language query, and a query expanded with synonyms. We compare the systems in terms of both efficiency and retrieval performance. Our results indicate that the distributed system is faster than the supercomputer, while slightly decreasing retrieval performance, and that natural language queries also slightly decrease retrieval performance, while our query expansion technique significantly decreased performance
Graph-Embedding Empowered Entity Retrieval
In this research, we improve upon the current state of the art in entity
retrieval by re-ranking the result list using graph embeddings. The paper shows
that graph embeddings are useful for entity-oriented search tasks. We
demonstrate empirically that encoding information from the knowledge graph into
(graph) embeddings contributes to a higher increase in effectiveness of entity
retrieval results than using plain word embeddings. We analyze the impact of
the accuracy of the entity linker on the overall retrieval effectiveness. Our
analysis further deploys the cluster hypothesis to explain the observed
advantages of graph embeddings over the more widely used word embeddings, for
user tasks involving ranking entities
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online
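To make the exact brute-force k-NN baseline described in this abstract concrete, here is a minimal Python sketch that scores every record against the query and keeps the k most similar as candidates for re-ranking. Cosine similarity over dense vectors is only a stand-in assumption: the abstract's similarity function, which accounts for term associations, is not specified here, and the names `cosine` and `knn_candidates` and the toy vectors are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_candidates(query, records, k, similarity=cosine):
    """Exact brute-force k-NN: score every record, return the k most similar ids."""
    scored = ((similarity(query, vec), rec_id) for rec_id, vec in records.items())
    return [rec_id for _, rec_id in heapq.nlargest(k, scored)]


# Hypothetical toy collection, for illustration only.
records = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(knn_candidates([1.0, 0.1], records, k=2))
```

Every query touches every record, so the cost grows linearly with collection size; that linear scan is the cost an approximate k-NN index aims to avoid, at some loss in accuracy.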