2,560 research outputs found

    Dynamicity and Durability in Scalable Visual Instance Search.

    Get PDF
    Visual instance search involves retrieving from a collection of images the ones that contain an instance of a visual query. Systems designed for visual instance search face the major challenge of scalability: a collection of a few million images used for instance search typically creates a few billion features that must be indexed. Furthermore, as real image collections grow rapidly, systems must also provide dynamicity, i.e., be able to handle on-line insertions while concurrently serving retrieval operations. Durability, which is the ability to recover correctly from software and hardware crashes, is the natural complement of dynamicity. Durability, however, has rarely been integrated within scalable and dynamic high-dimensional indexing solutions. This article addresses the issue of dynamicity and durability for scalable indexing of very large and rapidly growing collections of local features for instance retrieval. By extending the NV-tree, a scalable disk-based high-dimensional index, we show how to implement the ACID properties of transactions which ensure both dynamicity and durability. We present a detailed performance evaluation of the transactional NV-tree: (i) We show that the insertion throughput is excellent despite the overhead for enforcing the ACID properties; (ii) We also show that this transactional index is truly scalable using a standard image benchmark embedded in collections of up to 28.5 billion high-dimensional vectors; the largest single-server evaluations reported in the literature

    R-Forest for Approximate Nearest Neighbor Queries in High Dimensional Space

    Get PDF
    Searching high dimensional space has been a challenge and an area of intense research for many years. The dimensionality curse has rendered most existing index methods all but useless causing people to research other techniques. In my dissertation I will try to resurrect one of the best known index structures, R-Tree, which most have given up on as a viable method of answering high dimensional queries. I have pointed out the various advantages of R-Tree as a method for answering approximate nearest neighbor queries, and the advantages of locality sensitive hashing and locality sensitive B-Tree, which are the most successful methods today. I started by looking at improving the maintenance of R-Tree by the use of bulk loading and insertion. I proposed and implemented a new method that bulk loads the index which was an improvement of standard method. I then turned my attention to nearest neighbor queries, which is a much more challenging problem especially in high dimensional space. Initially I developed a set of heuristics, easily implemented in R-Tree, which improved the efficiency of high dimensional approximate nearest neighbor queries. To further refine my method I took another approach, by developing a new model, known as R-Forest, which takes advantage of space partitioning while still using R-Tree as its index structure. With this new approach I was able to implement new heuristics and can show that R-Forest, comprised of a set of R-Trees, is a viable solution tohigh dimensional approximate nearest neighbor queries when compared to established methods

    Flexible and scalable digital library search

    Get PDF
    In this report the development of a specialised search engine for a digital library is described. The proposed system architecture consists of three levels: the conceptual, the logical and the physical level. The conceptual level schema enables by its exposure of a domain specific schema semantically rich conceptual search. The logical level provides a description language to achieve a high degree of flexibility for multimedia retrieval. The physical level takes care of scalable and efficient persistent data storage. The role, played by each level, changes during the various stages of a search engine's lifecycle: (1) modeling the index, (2) populating and maintaining the index and (3) querying the index. The integration of all this functionality allows the combination of both conceptual and content-based querying in the query stage. A search engine for the Australian Open tennis tournament website is used as a running example, which shows the power of the complete architecture and its various component

    Segmenting broadcast news streams using lexical chains

    Get PDF
    In this paper we propose a course-grained NLP approach to text segmentation based on the analysis of lexical cohesion within text. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast our segmentation task requires the discovery of topical units of text i.e. distinct news stories from broadcast news programmes. Our system SeLeCT first builds a set of lexical chains, in order to model the discourse structure of the text. A boundary detector is then used to search for breaking points in this structure indicated by patterns of cohesive strength and weakness within the text. We evaluate this technique on a test set of concatenated CNN news story transcripts and compare it with an established statistical approach to segmentation called TextTiling

    R*-Tree index in Cassandra for Geospatial Processing

    Get PDF
    Geospatial data has garnered enough attention in recent times that it is being used everywhere right from simple applications such as booking a taxi ride to complex applications such as autonomous driving. Though the attention towards geospatial processing is something new, substantial research has been going on for years. With the evolution of NoSQL databases in recent times, geospatial processing has attained a new dimension concerning its applications and capability. The most popular NoSQL database to be used for geospatial processing is the MongoDB followed by Cassandra. It is the indexing process that is important concerning the data at hand irrespective of the type of the database. Some of the most common indexes used for the geospatial processing are R-tree, R*-tree, B-tree, Z-curve. R*-tree is the area of our study as it is one among the widely used indexes for geospatial querying. The database of our interest is Cassandra as it is one among the widely used NoSQL database that does not have native support for geospatial query processing. To support geospatial workload, Cassandra should interact with external libraries such as GeoMesa and Solr. In particular, we are interested in the working of the GeoMesa as it uses the Z-curve as the indexing mechanism for the geospatial processing. R*-tree is a dynamic structure capable of representing multi-dimensional data whereas Z-curves are capable of representing multi-dimensional data in a single dimension. In this study, we compare and contrast the performance of R*-tree and Z-curve for various geospatial operations in Cassandra

    An Efficient Scheme for a Distributed Video Retrieval System for Remote Users

    Get PDF
    The new era of digital video and multimedia technologies has created the potential for large libraries of digital video. With this new technology come the challenges of creating usable means by which such large and diverse depositories of digital information (digital libraries) can be efficiently queried and accessed so that (a) the response is fast, (b) the communication over the Internet is minimal and (c) the retrieval is characterized by high precision and recall. In this paper we discuss how existing digital video editing tools, together with data compression techniques, can be combined to create a fast, accurate and cost effective video retrieval system for remote users. The traditional approaches employed in text databases, such as keyword searching and volume browsing, are inadequate mechanisms for a video retrieval system for remote users because, (a) they don\u27t apply to video at all, or (b) they are not practical due to the amounts of data involved, or (c) they have insufficient resolution to be useful in a video archive. New techniques must be developed that facilitate the query and selection of digital video. This paper presents one such scheme

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective. The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research

    Efficient similarity search in high-dimensional data spaces

    Get PDF
    Similarity search in high-dimensional data spaces is a popular paradigm for many modern database applications, such as content based image retrieval, time series analysis in financial and marketing databases, and data mining. Objects are represented as high-dimensional points or vectors based on their important features. Object similarity is then measured by the distance between feature vectors and similarity search is implemented via range queries or k-Nearest Neighbor (k-NN) queries. Implementing k-NN queries via a sequential scan of large tables of feature vectors is computationally expensive. Building multi-dimensional indexes on the feature vectors for k-NN search also tends to be unsatisfactory when the dimensionality is high. This is due to the poor index performance caused by the dimensionality curse. Dimensionality reduction using the Singular Value Decomposition method is the approach adopted in this study to deal with high-dimensional data. Noting that for many real-world datasets, data distribution tends to be heterogeneous, dimensionality reduction on the entire dataset may cause a significant loss of information. More efficient representation is sought by clustering the data into homogeneous subsets of points, and applying dimensionality reduction to each cluster respectively, i.e., utilizing local rather than global dimensionality reduction. The thesis deals with the improvement of the efficiency of query processing associated with local dimensionality reduction methods, such as the Clustering and Singular Value Decomposition (CSVD) and the Local Dimensionality Reduction (LDR) methods. Variations in the implementation of CSVD are considered and the two methods are compared from the viewpoint of the compression ratio, CPU time, and retrieval efficiency. An exact k-NN algorithm is presented for local dimensionality reduction methods by extending an existing multi-step k-NN search algorithm, which is designed for global dimensionality reduction. Experimental results show that the new method requires less CPU time than the approximate method proposed original for CSVD at a comparable level of accuracy. Optimal subspace dimensionality reduction has the intent of minimizing total query cost. The problem is complicated in that each cluster can retain a different number of dimensions. A hybrid method is presented, combining the best features of the CSVD and LDR methods, to find optimal subspace dimensionalities for clusters generated by local dimensionality reduction methods. The experiments show that the proposed method works well for both real-world datasets and synthetic datasets
    corecore