    Semantic Retrieval and Automatic Annotation: Linear Transformations, Correlation and Semantic Spaces

    This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each of the techniques are discussed and contrasted. The paper also discusses how these techniques might scale to a real-world retrieval problem, and demonstrates this through a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library.

    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
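
As a rough illustration of the term-document class of VSMs named in the abstract above, the following sketch builds a small term-document count matrix and compares two documents by the cosine of their column vectors. The toy documents, the whitespace tokenizer, and the helper names (`term_document_matrix`, `column`, `cosine`) are assumptions made for this example and are not taken from the survey.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer (assumption)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Rows are terms, columns are documents.
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab, M = term_document_matrix(docs)
sim_01 = cosine(column(M, 0), column(M, 1))  # documents sharing several terms
sim_02 = cosine(column(M, 0), column(M, 2))  # documents sharing no terms
```

Raw counts are used here only for brevity; weighting schemes and smoothing are outside the scope of this sketch.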

    High-dimensional indexing methods utilizing clustering and dimensionality reduction

    The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points. Searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between their feature vectors. Similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases as the number of objects and the number of features increase. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing; these work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This study addresses that inefficiency with three methods: Clustering and Singular Value Decomposition (CSVD) with indexing, the Persistent Main Memory (PMM) index, and the Stepwise Dimensionality Increasing (SDI-tree) index. CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied and the approximation to the distance in the original space is investigated. For a given Normalized Mean Square Error (NMSE), the higher the degree of clustering, the higher the recall. However, more clusters require more disk page accesses. A suitable number of clusters can be chosen to achieve higher recall while keeping the query processing cost relatively low.
The Clustering and Indexing using Persistent Main Memory (CIPMM) framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) disk transfer rate is improving 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main memory resident than for disk resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random, disk accesses. A specific instance of the CIPMM framework, CIPOP, which indexes using the Persistent Ordered Partition (OP-tree), is elaborated and compared with CISR, clustering and indexing using the SR-tree. The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains. The SDI-tree index is motivated by two observations: fanout decreases as dimensionality increases, and shorter vectors reduce cache misses. The index is built using feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and an increasing number of dimensions from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees and, in addition, less CPU time than VA-Files.
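
The sequential-scan baseline for k-NN queries mentioned in the abstract above can be sketched in a few lines. This is only the brute-force reference point that index structures try to beat, not CSVD, CIPMM, or the SDI-tree; the function name `knn_sequential_scan`, the Euclidean metric, and the sample points are assumptions for illustration.

```python
import heapq
import math


def knn_sequential_scan(points, query, k):
    """Return indices of the k points nearest to `query`, closest first.

    Every point is visited once, so the cost grows with both the number
    of points and the number of dimensions per point.
    """
    return heapq.nsmallest(
        k,
        range(len(points)),
        key=lambda i: math.dist(points[i], query),  # Euclidean distance (assumed metric)
    )


# Hypothetical 2-D feature vectors.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = knn_sequential_scan(points, query=(0.0, 0.0), k=2)
```

Here `nearest` holds the indices of the two closest points in ascending order of distance.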

    Pattern vectors from algebraic graph theory

    Graph structures have proven computationally cumbersome for pattern analysis. The reason for this is that, before graphs can be converted to pattern vectors, correspondences must be established between the nodes of structures which are potentially of different size. To overcome this problem, in this paper, we turn to the spectral decomposition of the Laplacian matrix. We show how the elements of the spectral matrix for the Laplacian can be used to construct symmetric polynomials that are permutation invariants. The coefficients of these polynomials can be used as graph features which can be encoded in a vectorial manner. We extend this representation to graphs in which there are unary attributes on the nodes and binary attributes on the edges by using the spectral decomposition of a Hermitian property matrix that can be viewed as a complex analogue of the Laplacian. To embed the graphs in a pattern space, we explore whether the vectors of invariants can be embedded in a low-dimensional space using a number of alternative strategies, including principal components analysis (PCA), multidimensional scaling (MDS), and locality preserving projection (LPP). Experimentally, we demonstrate that the embeddings result in well-defined graph clusters. Our experiments with the spectral representation involve both synthetic and real-world data. The experiments with synthetic data demonstrate that the distances between spectral feature vectors can be used to discriminate between graphs on the basis of their structure. The real-world experiments show that the method can be used to locate clusters of graphs.
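
To make the idea of permutation-invariant spectral features concrete, the sketch below uses a simpler, related construction rather than the paper's own: the coefficients of the characteristic polynomial of the graph Laplacian, which are (up to sign) the elementary symmetric polynomials of its eigenvalues and therefore do not change when nodes are relabeled. The abstract instead builds its polynomials from the elements of the spectral matrix, so treat this only as a stand-in. The helper names (`laplacian`, `char_poly`) and the use of the Faddeev-LeVerrier recurrence are choices made for this example.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian L = D - A of an undirected simple graph on nodes 0..n-1."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1
        L[v][v] += 1
        L[u][v] -= 1
        L[v][u] -= 1
    return L


def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly(A):
    """Coefficients [c_n, ..., c_0] of det(x*I - A), via Faddeev-LeVerrier.

    Exact rational arithmetic avoids floating-point noise in the features.
    """
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]
    M = [[Fraction(0)] * n for _ in range(n)]
    coeffs = [Fraction(1)]  # leading coefficient c_n
    for k in range(1, n + 1):
        M = matmul(A, M)            # M_k = A * M_{k-1} + c_{n-k+1} * I
        for i in range(n):
            M[i][i] += coeffs[-1]
        AM = matmul(A, M)
        trace = sum(AM[i][i] for i in range(n))
        coeffs.append(-trace / k)   # c_{n-k} = -tr(A * M_k) / k
    return coeffs


# The same 3-node path under two labelings, and a triangle for contrast.
path_a = char_poly(laplacian(3, [(0, 1), (1, 2)]))
path_b = char_poly(laplacian(3, [(0, 2), (2, 1)]))
triangle = char_poly(laplacian(3, [(0, 1), (1, 2), (0, 2)]))
```

The two labelings of the path give identical feature vectors, while the triangle gives a different one. Note that non-isomorphic graphs can share a Laplacian spectrum, so these coefficients alone do not separate all graphs.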