Formal concept matching and reinforcement learning in adaptive information retrieval
The superiority of the human brain in information retrieval (IR) tasks seems to come firstly
from its ability to read and understand the concepts, ideas or meanings central to documents, in
order to reason out the usefulness of documents to information needs, and secondly from its
ability to learn from experience and be adaptive to the environment. In this work we attempt to
incorporate these properties into the development of an IR model to improve document
retrieval. We investigate the applicability of concept lattices, which are based on the theory of
Formal Concept Analysis (FCA), to the representation of documents. This allows the use of
more elegant representation units, as opposed to keywords, in order to better capture
concepts/ideas expressed in natural language text. We also investigate the use of a
reinforcement learning strategy to learn and improve document representations, based on the
information present in query statements and user relevance feedback. Features or concepts of
each document/query, formulated using FCA, are weighted separately with respect to the
documents they are in, and organised into separate concept lattices according to a subsumption
relation. Furthermore, each concept lattice is encoded in a two-layer neural network structure
known as a Bidirectional Associative Memory (BAM), for efficient manipulation of the
concepts in the lattice representation. This avoids implementation drawbacks faced by other
FCA-based approaches. Retrieval of a document for an information need is based on concept
matching between concept lattice representations of a document and a query. The learning
strategy works by making the similarity of relevant documents stronger and non-relevant
documents weaker for each query, depending on the relevance judgements of the users on
retrieved documents. Our approach is radically different to existing FCA-based approaches in
the following respects: concept formulation; weight assignment to object-attribute pairs; the
representation of each document in a separate concept lattice; and encoding concept lattices in
BAM structures. Furthermore, in contrast to the traditional relevance feedback mechanism, our
learning strategy makes use of relevance feedback information to enhance document
representations, thus making the document representations dynamic and adaptive to the user
interactions. The results obtained on the CISI, CACM and ASLIB Cranfield collections are
presented and compared with published results. In particular, the performance of the system is
shown to improve significantly as the system learns from experience.
The School of Computing, University of Plymouth, UK
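The document representation above rests on the derivation operators of Formal Concept Analysis. As a minimal sketch (the toy document-term context below is illustrative, not data from the thesis), a formal concept is a pair (A, B) where B is the set of terms shared by the documents in A, and A is exactly the set of documents containing every term in B:

```python
# A minimal sketch of the FCA derivation operators that give rise to the
# formal concepts (lattice nodes) used for document representation.
# The tiny document-term context below is hypothetical.

def intent(docs, context):
    """Attributes (terms) shared by every document in `docs`."""
    return {t for t in set().union(*context.values())
            if all(t in context[d] for d in docs)}

def extent(terms, context):
    """Documents that contain every term in `terms`."""
    return {d for d, ts in context.items() if terms <= ts}

# Toy formal context: document -> set of index terms (hypothetical).
context = {
    "d1": {"retrieval", "lattice", "concept"},
    "d2": {"retrieval", "learning"},
    "d3": {"lattice", "concept"},
}

# A formal concept is a pair (A, B) with B = intent(A) and A = extent(B).
A = {"d1", "d3"}
B = intent(A, context)          # shared terms of d1 and d3
assert extent(B, context) == A  # closure holds: (A, B) is a formal concept
print(sorted(B))                # the intent of the concept
```

The closure property checked by the assertion is what makes the set of all such pairs, ordered by extent inclusion, form the concept lattice.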
A Submodular Optimization Framework for Imbalanced Text Classification with Data Augmentation
In the domain of text classification, imbalanced datasets are a common occurrence. The skewed distribution of the labels of these datasets poses a great challenge to the performance of text classifiers. One popular way to mitigate this challenge is to augment underrepresented labels with synthesized items. The synthesized items are generated by data augmentation methods that can typically generate an unbounded number of items. To select the synthesized items that maximize the performance of text classifiers, we introduce a novel method that selects items that jointly maximize the likelihood of the items belonging to their respective labels and the diversity of the selected items. Our proposed method formulates the joint maximization as a monotone submodular objective function, whose solution can be approximated by a tractable and efficient greedy algorithm. We evaluated our method on multiple real-world datasets with different data augmentation techniques and text classifiers, and compared results with several baselines. The experimental results demonstrate the effectiveness and efficiency of our method.
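The greedy scheme described above can be sketched as follows. This is an illustrative formulation, not the paper's exact objective: the likelihoods, features, and the coverage-style diversity term are hypothetical, but coverage is monotone submodular, so greedy marginal-gain selection applies:

```python
# A minimal sketch of greedy selection for a monotone submodular objective
# trading off label likelihood against diversity. The objective
# f(S) = sum of likelihoods + lam * coverage is illustrative only.

def diversity(selected, features):
    """Coverage-style diversity: distinct features covered (submodular)."""
    return len(set().union(*(features[i] for i in selected)) if selected else set())

def greedy_select(likelihood, features, budget, lam=1.0):
    """Pick `budget` items, each step taking the largest marginal gain of f."""
    selected = []
    def f(S):
        return sum(likelihood[i] for i in S) + lam * diversity(S, features)
    for _ in range(budget):
        rest = [i for i in range(len(likelihood)) if i not in selected]
        best = max(rest, key=lambda i: f(selected + [i]) - f(selected))
        selected.append(best)
    return selected

# Toy synthesized items: classifier likelihoods and token-set features
# (hypothetical values).
likelihood = [0.9, 0.85, 0.2, 0.6]
features = [{"a", "b"}, {"a", "b"}, {"c"}, {"c", "d"}]
print(greedy_select(likelihood, features, budget=2))
```

Note that item 1 is nearly as likely as item 0 but duplicates its features, so after item 0 is chosen the diversity term steers the second pick toward item 3 instead.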
Three real-world datasets and neural computational models for classification tasks in patent landscaping
Patent Landscaping, one of the central tasks of intellectual property management, includes selecting and grouping patents according to user-defined technical or application-oriented criteria. While recent transformer-based models have been shown to be effective for classifying patents into taxonomies such as CPC or IPC, there is yet little research on how to support real-world Patent Landscape Studies (PLSs) using natural language processing methods. With this paper, we release three labeled datasets for PLS-oriented classification tasks covering two diverse domains. We provide a qualitative analysis and report detailed corpus statistics. Most research on neural models for patents has been restricted to leveraging titles and abstracts. We compare strong neural and non-neural baselines, proposing a novel model that takes into account textual information from the patents’ full texts as well as embeddings created based on the patents’ CPC labels. We find that for PLS-oriented classification tasks, going beyond title and abstract is crucial, CPC labels are an effective source of information, and combining all features yields the best results.
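The core idea of combining textual and CPC-label features can be sketched as below. Everything here is a stand-in for the paper's neural components: a bag-of-words text embedding, a multi-hot CPC embedding, and a nearest-centroid classifier over the concatenated vector (all names and data hypothetical):

```python
# A minimal sketch of combining full-text and CPC-label features:
# embed each source separately, concatenate, then classify.
# The embeddings and classifier are simple stand-ins (hypothetical).

import math

def embed_text(text, vocab):
    """Bag-of-words text embedding over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def embed_cpc(labels, all_labels):
    """Multi-hot embedding of a patent's CPC labels."""
    return [1.0 if c in labels else 0.0 for c in all_labels]

def combined(text, labels, vocab, all_labels):
    return embed_text(text, vocab) + embed_cpc(labels, all_labels)

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vocab = ["battery", "electrode", "antenna"]
all_labels = ["H01M", "H01Q"]

# Toy class centroids built from labeled patents (hypothetical).
centroids = {
    "energy":   combined("battery electrode", {"H01M"}, vocab, all_labels),
    "wireless": combined("antenna", {"H01Q"}, vocab, all_labels),
}

query = combined("battery antenna", {"H01M"}, vocab, all_labels)
pred = min(centroids, key=lambda c: dist(query, centroids[c]))
print(pred)
```

The query's text alone is ambiguous between the two classes; the CPC feature breaks the tie, illustrating why the label embeddings are an effective additional signal.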
Resource dimensioning through buffer sampling
Link dimensioning, i.e., selecting a (minimal) link capacity such that the users’ performance requirements are met, is a crucial component of network design. It requires insight into the interrelationship among the traffic offered (in terms of the mean offered load M, but also its fluctuation around the mean, i.e., ‘burstiness’), the envisioned performance level, and the capacity needed. We first derive, for different performance criteria, theoretical dimensioning formulas that estimate the required capacity C as a function of the input traffic and the performance target. For the special case of Gaussian input traffic, these formulas reduce to C = M + α√V, where α directly relates to the performance requirement (as agreed upon in a service level agreement) and V reflects the burstiness (at the timescale of interest). We also observe that Gaussianity applies for virtually all realistic scenarios; notably, already for a relatively low aggregation level, the Gaussianity assumption is justified.
As estimating M is relatively straightforward, the remaining open issue concerns the estimation of V. We argue that, particularly if V corresponds to small time-scales, it may be inaccurate to estimate it directly from the traffic traces. Therefore, we propose an indirect method that samples the buffer content, estimates the buffer content distribution, and ‘inverts’ this to the variance. We validate the inversion through extensive numerical experiments (using a sizeable collection of traffic traces from various representative locations); the resulting estimate of V is then inserted in the dimensioning formula. These experiments show that both the inversion and the dimensioning formula are remarkably accurate.
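The Gaussian dimensioning formula can be sketched numerically as follows. The traffic samples are hypothetical, and the mapping from the performance target ε to α uses the standard Gaussian tail approximation α = √(−2 ln ε); this sketch estimates V directly from samples rather than via the paper's buffer-sampling inversion:

```python
# A minimal sketch of the Gaussian dimensioning formula C = M + alpha*sqrt(V):
# estimate the mean load M and variance V at the timescale of interest from
# traffic samples, then add a burstiness-dependent safety margin.
# Sample values are illustrative, not data from the paper.

import math

def required_capacity(samples, epsilon):
    """Capacity C such that P(demand > C) ~ epsilon, under Gaussianity."""
    n = len(samples)
    M = sum(samples) / n                          # mean offered load
    V = sum((x - M) ** 2 for x in samples) / n    # variance ('burstiness')
    alpha = math.sqrt(-2.0 * math.log(epsilon))   # Gaussian tail inversion
    return M + alpha * math.sqrt(V)

# Toy per-interval traffic measurements (Mbit/s), hypothetical.
traffic = [98.0, 105.0, 101.0, 110.0, 96.0, 102.0]
C = required_capacity(traffic, epsilon=0.01)
print(round(C, 1))
```

A stricter target ε shrinks the tail probability and grows α, so the provisioned headroom above the mean load increases while M itself is unchanged.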
Approximate Closest Community Search in Networks
Recently, there has been significant interest in the study of the community
search problem in social and information networks: given one or more query
nodes, find densely connected communities containing the query nodes. However,
most existing studies do not address the "free rider" issue, that is, nodes far
away from query nodes and irrelevant to them are included in the detected
community. Some state-of-the-art models have attempted to address this issue,
but not only are their formulated problems NP-hard, they do not admit any
approximations without restrictive assumptions, which may not always hold in
practice.
In this paper, given an undirected graph G and a set of query nodes Q, we
study community search using the k-truss based community model. We formulate
our problem of finding a closest truss community (CTC), as finding a connected
k-truss subgraph with the largest k that contains Q, and has the minimum
diameter among such subgraphs. We prove this problem is NP-hard. Furthermore,
it is NP-hard to approximate the problem within a factor (2 − ε), for
any ε > 0. However, we develop a greedy algorithmic framework,
which first finds a CTC containing Q, and then iteratively removes the furthest
nodes from Q, from the graph. The method achieves 2-approximation to the
optimal solution. To further improve the efficiency, we make use of a compact
truss index and develop efficient algorithms for k-truss identification and
maintenance as nodes get eliminated. In addition, using bulk deletion
optimization and local exploration strategies, we propose two more efficient
algorithms. One of them trades some approximation quality for efficiency while
the other is a very efficient heuristic. Extensive experiments on 6 real-world
networks show the effectiveness and efficiency of our community model and
search algorithms.
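The greedy framework above can be sketched in simplified form: start from a connected subgraph containing the query nodes Q and iteratively delete the node furthest from Q, as long as Q stays connected. For clarity, this sketch omits the k-truss identification and maintenance that the actual algorithms perform, and the graph and stopping rule (stop once every remaining non-query node is within one hop of Q) are illustrative:

```python
# A simplified sketch of the greedy removal step for closest community
# search: repeatedly drop the node with the largest query distance while
# keeping the query nodes Q mutually reachable. k-truss maintenance is
# omitted; the toy graph below is hypothetical.

from collections import deque

def bfs_dist(adj, src, nodes):
    """Shortest hop-distance from `src` to every reachable node in `nodes`."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v in nodes and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def shrink_community(adj, Q):
    nodes = set(adj)
    while True:
        # Query distance of v: max over q in Q of dist(q, v).
        dists = [bfs_dist(adj, q, nodes) for q in Q]
        qdist = {v: max(d.get(v, float("inf")) for d in dists) for v in nodes}
        far = max((v for v in nodes - set(Q)), key=qdist.get, default=None)
        if far is None or qdist[far] <= 1:   # simplified stopping rule
            return nodes
        trial = nodes - {far}
        # Keep the removal only if Q stays mutually reachable.
        reach = bfs_dist({u: adj[u] & trial for u in trial}, Q[0], trial)
        if all(q2 in reach for q2 in Q):
            nodes = trial
        else:
            return nodes

# Toy undirected graph as adjacency sets (hypothetical).
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
print(sorted(shrink_community(adj, ["a", "b"])))
```

Here the chain d–e hangs far from the query nodes a and b, so the greedy removal prunes it; this is exactly the "free rider" effect that distance-minimizing removal is meant to suppress.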