
    Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees

    Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified against the similarity threshold. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine
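
    The baseline TA-style evaluation described in this abstract (build an inverted index, traverse it partially, then verify candidates) can be sketched roughly as follows. This is a minimal illustration, not the paper's method: it uses the plain TA stopping bound rather than the improved tight condition, a simple round-robin traversal rather than the skew-aware strategy, and it assumes sparse, non-negative, unit-normalized vectors and a threshold `theta > 0`. All function and variable names (`normalize`, `build_index`, `cosine_threshold_query`) are hypothetical.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> value) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {d: v / norm for d, v in vec.items()}


def build_index(db):
    """Build an inverted index: dimension -> [(value, id)] sorted by value, descending."""
    index = defaultdict(list)
    for vid, vec in db.items():
        for dim, val in vec.items():
            index[dim].append((val, vid))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def cosine_threshold_query(db, index, query, theta):
    """Return ids of unit vectors whose dot product with `query` is at least theta.

    Assumes non-negative values and theta > 0 (assumptions of this sketch).
    """
    dims = [d for d in query if d in index and query[d] > 0]
    pos = {d: 0 for d in dims}  # current read position in each posting list
    candidates = set()

    def upper_bound():
        # Any vector not yet seen has, in every list, a value no larger than
        # the entry at the current position (or 0 if the list is exhausted),
        # so this sum bounds its dot product with the query from above.
        total = 0.0
        for d in dims:
            postings = index[d]
            b = postings[pos[d]][0] if pos[d] < len(postings) else 0.0
            total += query[d] * b
        return total

    # Gathering phase: round-robin traversal until no unseen vector can qualify.
    while upper_bound() >= theta:
        advanced = False
        for d in dims:
            if pos[d] < len(index[d]):
                candidates.add(index[d][pos[d]][1])
                pos[d] += 1
                advanced = True
        if not advanced:
            break

    # Verification phase: compute the exact similarity for each candidate.
    results = []
    for vid in candidates:
        vec = db[vid]
        score = sum(q * vec.get(d, 0.0) for d, q in query.items())
        if score >= theta:
            results.append(vid)
    return sorted(results)
```

    A query would then be run as `cosine_threshold_query(db, build_index(db), normalize(q), 0.8)`, where `db` maps ids to normalized sparse vectors. The gathering loop is where the improvements named in the abstract would apply: a tighter stopping test that exploits normalization, and a traversal order that exploits data skew.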

    The HUPO-PSI standardized spectral library format

    More and more proteomics datasets are becoming available in public repositories. The knowledge embedded in these datasets can be used to improve peptide identification workflows. Spectral library searching provides a straightforward method to boost identification rates using previously identified spectra. Alternatively, machine learning methods can learn from these spectra to accurately predict the behavior of peptides in a liquid chromatography-mass spectrometry system. At the basis of both approaches are spectral libraries: unified collections of previously identified spectra. Organizations and projects such as the National Institute of Standards and Technology (NIST), the Global Proteome Machine, PeptideAtlas, PRIDE Archive and MassIVE have all compiled spectral libraries for a multitude of species and experimental setups. A large obstacle, however, is that each organization provides libraries in a different file format. At the software level the problem propagates (if not expands), as different software tools require different file formats. The solution is a standardized spectral library format that is sufficiently flexible to meet all users' demands, but that is also standardized enough to be usable across environments and software packages. This balance is achieved by setting up a standardized framework and a controlled vocabulary with metadata terms, and by allowing the format to be represented in different forms, such as plain text, JSON and HDF. So far, the required (and optional) metadata has been compiled and added to the PSI-MS ontology, and versions of the text and JSON representations have been drafted. The tabular and HDF representations of the format are in development, as well as converters and validators in various programming languages

    microbeMASST: A Taxonomically-informed Mass Spectrometry Search Tool for Microbial Metabolomics Data

    microbeMASST, a taxonomically informed mass spectrometry (MS) search tool, tackles limited microbial metabolite annotation in untargeted metabolomics experiments. Leveraging a curated database of >60,000 microbial monocultures, users can search known and unknown MS/MS spectra and link them to their respective microbial producers via MS/MS fragmentation patterns. Identification of microbe-derived metabolites and relative producers without a priori knowledge will vastly enhance the understanding of microorganisms’ role in ecology and human health

    A Taxonomically-informed Mass Spectrometry Search Tool for Microbial Metabolomics Data

    MicrobeMASST, a taxonomically-informed mass spectrometry (MS) search tool, tackles limited microbial metabolite annotation in untargeted metabolomics experiments. Leveraging a curated database of >60,000 microbial monocultures, users can search known and unknown MS/MS spectra and link them to their respective microbial producers via MS/MS fragmentation patterns. Identification of microbial-derived metabolites and relative producers, without a priori knowledge, will vastly enhance the understanding of microorganisms’ role in ecology and human health

    What can be learned from Repository-Scale Public Mass Spectrometry Data?

    High-throughput tandem mass spectrometry has enabled the detection and identification of over 75% of all human proteins predicted to result in translated gene products from an available tens of terabytes of public data in thousands of datasets. This thesis explores what we can learn from this, as well as the challenges that arise when considering proteomics data at a repository scale. First, we consider validating what is known, through resources to build, curate, and explore both FDR-controlled and user-submitted libraries. Second, we present a tool that automates the application of strict community guideline criteria to any set of search results, including peak quality and novel FDR controls. Third, we introduce a method to illuminate the extent of what is not yet known, using a new clustering approach designed to model peptide diversity by explicitly modeling spectrum coelutions. Fourth, we develop a method for extremely fast single-spectrum searches against spectrum repositories consisting of billions of spectra, both to confirm or refute knowledge base IDs and to discover spectra similar to those that remain consistently unidentified