3,588 research outputs found

    The NASA Astrophysics Data System: Architecture

    Full text link
    The powerful discovery capabilities available in the ADS bibliographic services are possible thanks to the design of a flexible search and retrieval system based on a relational database model. Bibliographic records are stored as a corpus of structured documents containing fielded data and metadata, while discipline-specific knowledge is segregated in a set of files independent of the bibliographic data itself. The creation and management of links to both internal and external resources associated with each bibliography in the database is made possible by representing them as a set of document properties and their attributes. To improve global access to the ADS data holdings, a number of mirror sites have been created by cloning the database contents and software on a variety of hardware and software platforms. The procedures used to create and manage the database and its mirrors have been written as a set of scripts that can be run in either an interactive or unsupervised fashion. The ADS can be accessed at http://adswww.harvard.eduComment: 25 pages, 8 figures, 3 table

    Multiclass Data Segmentation using Diffuse Interface Methods on Graphs

    Full text link
    We present two graph-based algorithms for multiclass segmentation of high-dimensional data. The algorithms use a diffuse interface model based on the Ginzburg-Landau functional, related to total variation compressed sensing and image processing. A multiclass extension is introduced using the Gibbs simplex, with the functional's double-well potential modified to handle the multiclass case. The first algorithm minimizes the functional using a convex splitting numerical scheme. The second algorithm is a uses a graph adaptation of the classical numerical Merriman-Bence-Osher (MBO) scheme, which alternates between diffusion and thresholding. We demonstrate the performance of both algorithms experimentally on synthetic data, grayscale and color images, and several benchmark data sets such as MNIST, COIL and WebKB. We also make use of fast numerical solvers for finding the eigenvectors and eigenvalues of the graph Laplacian, and take advantage of the sparsity of the matrix. Experiments indicate that the results are competitive with or better than the current state-of-the-art multiclass segmentation algorithms.Comment: 14 page

    Preserving Trustworthiness and Confidentiality for Online Multimedia

    Get PDF
    Technology advancements in areas of mobile computing, social networks, and cloud computing have rapidly changed the way we communicate and interact. The wide adoption of media-oriented mobile devices such as smartphones and tablets enables people to capture information in various media formats, and offers them a rich platform for media consumption. The proliferation of online services and social networks makes it possible to store personal multimedia collection online and share them with family and friends anytime anywhere. Considering the increasing impact of digital multimedia and the trend of cloud computing, this dissertation explores the problem of how to evaluate trustworthiness and preserve confidentiality of online multimedia data. The dissertation consists of two parts. The first part examines the problem of evaluating trustworthiness of multimedia data distributed online. Given the digital nature of multimedia data, editing and tampering of the multimedia content becomes very easy. Therefore, it is important to analyze and reveal the processing history of a multimedia document in order to evaluate its trustworthiness. We propose a new forensic technique called ``Forensic Hash", which draws synergy between two related research areas of image hashing and non-reference multimedia forensics. A forensic hash is a compact signature capturing important information from the original multimedia document to assist forensic analysis and reveal processing history of a multimedia document under question. Our proposed technique is shown to have the advantage of being compact and offering efficient and accurate analysis to forensic questions that cannot be easily answered by convention forensic techniques. The answers that we obtain from the forensic hash provide valuable information on the trustworthiness of online multimedia data. The second part of this dissertation addresses the confidentiality issue of multimedia data stored with online services. The emerging cloud computing paradigm makes it attractive to store private multimedia data online for easy access and sharing. However, the potential of cloud services cannot be fully reached unless the issue of how to preserve confidentiality of sensitive data stored in the cloud is addressed. In this dissertation, we explore techniques that enable confidentiality-preserving search of encrypted multimedia, which can play a critical role in secure online multimedia services. Techniques from image processing, information retrieval, and cryptography are jointly and strategically applied to allow efficient rank-ordered search over encrypted multimedia database and at the same time preserve data confidentiality against malicious intruders and service providers. We demonstrate high efficiency and accuracy of the proposed techniques and provide a quantitative comparative study with conventional techniques based on heavy-weight cryptography primitives

    Duplicate Table Detection with Xash

    Get PDF
    Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates

    Improved User News Feed Customization for an Open Source Search Engine

    Get PDF
    Yioop is an open source search engine project hosted on the site of the same name.It offers several features outside of searching, with one such feature being a news feed. The current news feed system aggregates articles from a curated list of news sites determined by the owner. However in its current state, the feed list is limited in size, constrained by the hardware that the aggregator is run on. The goal of my project was to overcome this limit by improving the current storage method used. The solution was derived by making use of IndexArchiveBundles and IndexShards, both of which are abstract data structures designed to handle large indexes. An additional aspect needed to accomodate for news feed was the ability to traverse said data structures in decreasing order of recently added. New methods were added to the preexisting WordIterator to handle this need. The result is a system with two new advantages, the capacity to store more feed items than before and the functionality of moving through indexes from the end back to the start. Our findings also indicate that the new process is much faster, with insertions taking one-tenth of the time at its fastest. Additionally, whereas the old system only stored around 37500 items at most, the new system allows for potentially unlimited news items to be stored. The methodology detailed in this project can also be applied to any information retrieval system to construct an index and read from it

    A survey on the use of relevance feedback for information access systems

    Get PDF
    Users of online search engines often find it difficult to express their need for information in the form of a query. However, if the user can identify examples of the kind of documents they require then they can employ a technique known as relevance feedback. Relevance feedback covers a range of techniques intended to improve a user's query and facilitate retrieval of information relevant to a user's information need. In this paper we survey relevance feedback techniques. We study both automatic techniques, in which the system modifies the user's query, and interactive techniques, in which the user has control over query modification. We also consider specific interfaces to relevance feedback systems and characteristics of searchers that can affect the use and success of relevance feedback systems
    • 

    corecore