15,836 research outputs found
TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these were not so far linked to general purpose, signature file based,
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state of the art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings and from the theoretical perspective it positions the file
signatures model in the class of Vector Space retrieval models.Comment: 12 pages, 8 figures, CIKM 201
Linguistic Geometries for Unsupervised Dimensionality Reduction
Text documents are complex high dimensional objects. To effectively visualize
such data it is important to reduce its dimensionality and visualize the low
dimensional embedding as a 2-D or 3-D scatter plot. In this paper we explore
dimensionality reduction methods that draw upon domain knowledge in order to
achieve a better low dimensional embedding and visualization of documents. We
consider the use of geometries specified manually by an expert, geometries
derived automatically from corpus statistics, and geometries computed from
linguistic resources.Comment: 13 pages, 15 figure
Classification of damage in structural systems using time series analysis and supervised and unsupervised pattern recognition techniques
Peer reviewedPostprin
Visualising the structure of document search results: A comparison of graph theoretic approaches
This is the post-print of the article - Copyright @ 2010 Sage PublicationsPrevious work has shown that distance-similarity visualisation or ‘spatialisation’ can provide a potentially useful context in which to browse the results of a query search, enabling the user to adopt a simple local foraging or ‘cluster growing’ strategy to navigate through the retrieved document set. However, faithfully mapping feature-space models to visual space can be problematic owing to their inherent high dimensionality and non-linearity. Conventional linear approaches to dimension reduction tend to fail at this kind of task, sacrificing local structural in order to preserve a globally optimal mapping. In this paper the clustering performance of a recently proposed algorithm called isometric feature mapping (Isomap), which deals with non-linearity by transforming dissimilarities into geodesic distances, is compared to that of non-metric multidimensional scaling (MDS). Various graph pruning methods, for geodesic distance estimation, are also compared. Results show that Isomap is significantly better at preserving local structural detail than MDS, suggesting it is better suited to cluster growing and other semantic navigation tasks. Moreover, it is shown that applying a minimum-cost graph pruning criterion can provide a parameter-free alternative to the traditional K-neighbour method, resulting in spatial clustering that is equivalent to or better than that achieved using an optimal-K criterion
Patent Analytics Based on Feature Vector Space Model: A Case of IoT
The number of approved patents worldwide increases rapidly each year, which
requires new patent analytics to efficiently mine the valuable information
attached to these patents. Vector space model (VSM) represents documents as
high-dimensional vectors, where each dimension corresponds to a unique term.
While originally proposed for information retrieval systems, VSM has also seen
wide applications in patent analytics, and used as a fundamental tool to map
patent documents to structured data. However, VSM method suffers from several
limitations when applied to patent analysis tasks, such as loss of
sentence-level semantics and curse-of-dimensionality problems. In order to
address the above limitations, we propose a patent analytics based on feature
vector space model (FVSM), where the FVSM is constructed by mapping patent
documents to feature vectors extracted by convolutional neural networks (CNN).
The applications of FVSM for three typical patent analysis tasks, i.e., patents
similarity comparison, patent clustering, and patent map generation are
discussed. A case study using patents related to Internet of Things (IoT)
technology is illustrated to demonstrate the performance and effectiveness of
FVSM. The proposed FVSM can be adopted by other patent analysis studies to
replace VSM, based on which various big data learning tasks can be performed
- …