11 research outputs found

    Efficiently finding the smallest k values in a large Cartesian product of lists

    If you are on a budget, how might you go about finding the best drink and entrée combination at a restaurant? You could simply choose the least expensive items, but a water and a side salad is not a great dinner. Instead, you might want to review the ten least expensive drink and entrée combinations and pick your favorite. If you create a list of drink prices and a list of entrée prices, then all possible combinations of a drink and an entrée form the Cartesian product of the two lists, and you would want to choose from the ten least expensive meals it produces. Finding the smallest k values in the Cartesian product X+Y, where X and Y are lists of values X = {x1, x2, ...}, Y = {y1, y2, ...}, is a well-studied fundamental problem in computer science. Several methods solve this problem with a runtime proportional to n + k, where n is the length of the lists. This is the best runtime possible, since every input and output value must be touched at least once. The generalization of the problem, where the Cartesian product is over many lists X1+X2+···+Xm, had never seen a fast algorithm. We present an algorithm for the generalization that is faster than m·n + k·m. This is remarkable because merely loading m lists of n values each takes runtime m·n, and looking up k values in m lists takes runtime k·m.
In computer science, there are many different structures used to store data. To achieve a fast runtime, we use a new data structure called a layer-ordered heap, which gives partial information about the ordering of the values in a list without completely sorting the data. It may seem intuitive to use sorting, since we want to find the smallest values; however, sorting a list of k values has a runtime of at least k·log(k). In the runtime of our method, we want the term that grows with k to be faster than k·log(k), so we cannot use sorting.
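The two-list version of this selection problem has a classic heap-based solution. The sketch below is only an illustration of what "the smallest k values of X+Y" means; it is the standard frontier-expansion approach, not the faster layer-ordered-heap algorithm the abstract describes.

```python
import heapq

def k_smallest_sums(X, Y, k):
    """Return the k smallest values of the Cartesian sum X + Y.

    Classic frontier expansion: start from the smallest pair (0, 0) and
    only push the two neighbors (i+1, j) and (i, j+1) of each popped pair.
    """
    X, Y = sorted(X), sorted(Y)
    heap = [(X[0] + Y[0], 0, 0)]
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        s, i, j = heapq.heappop(heap)
        out.append(s)
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(X) and nj < len(Y) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (X[ni] + Y[nj], ni, nj))
    return out

# hypothetical drink prices and entrée prices
print(k_smallest_sums([2, 3, 5], [7, 8, 12], 4))  # the 4 cheapest meal totals
```

Each popped value costs a logarithmic heap operation, which is exactly the k·log(k)-type term the abstract's layer-ordered-heap method is designed to avoid.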
Keeping the data organized so that it has some ordering, without being completely sorted, is the key to our algorithm. One important application of our algorithm is calculating the most abundant isotopes of a molecule. The isotopes of an element (e.g., oxygen) are all the ways in which the element may have a different number of neutrons. For example, carbon dioxide, CO2, is made up of one carbon and two oxygens. Carbon has two isotopes that appear in nature, 12C and 13C, while oxygen has three: 16O, 17O, and 18O. This means the isotope combinations carbon and oxygen may naturally form are given by the Cartesian product of three lists: {12C, 13C}, {16O, 17O, 18O}, and {16O, 17O, 18O}. A handful of possible isotope combinations may seem trivial, but very large molecules may have millions, so being able to efficiently compute only the top k is very helpful.
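As an illustration of the many-list generalization (the function name and the pairwise-reduction strategy are my own sketch, not the paper's algorithm), the k smallest total sums can be found naively by combining lists two at a time and keeping only the k smallest partial sums after each step:

```python
from functools import reduce

def k_smallest_msum(lists, k):
    """Naive k-smallest values of X1 + X2 + ... + Xm.

    Only the k smallest partial sums can contribute to the k smallest
    totals, so each pairwise combine may safely truncate to k values.
    """
    def combine(A, B):
        return sorted(a + b for a in A for b in B)[:k]
    return reduce(combine, lists)

# integer mass numbers for CO2 = one carbon and two oxygens
C = [12, 13]
O = [16, 17, 18]
print(k_smallest_msum([C, O, O], 5))  # 5 lightest isotope combinations
```

This naive version costs roughly k·n work per combine step; the abstract's contribution is doing the same selection faster than m·n + k·m overall.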

    ZERO-KNOWLEDGE DE NOVO ALGORITHMS FOR ANALYZING SMALL MOLECULES USING MASS SPECTROMETRY

    In the analysis of mass spectra, if a superset of the molecules thought to be in a sample is known a priori, then there are well-established techniques for identifying the molecules, such as database search and spectral libraries. Linear molecules are chains of subunits. For example, a peptide is a linear molecule with an “alphabet” of 20 possible amino acid subunits. A peptide of length six has 20^6 = 64,000,000 different possible outcomes. Small molecules, such as sugars and metabolites, are not constrained to linear structures and may branch. These molecules are encoded as undirected graphs rather than simple linear chains. An undirected graph with six subunits (each of which has 20 possible outcomes) has 20^6 · 2^(6 choose 2) = 2,097,152,000,000 possible outcomes. The vast number of complex graphs that small molecules can form can render databases and spectral libraries impossibly large to use, or incomplete, as many metabolites may still be unidentified.
In the absence of a usable database or spectral library, an alphabet of subunits may be used to connect peaks in the fragmentation spectra; each connection represents a neutral loss of an alphabet mass. This technique is called “de novo sequencing” and relies on the alphabet being known in advance. Often the alphabet of m/z difference values allowed by de novo analysis is not known or is incomplete. A method is proposed that, given fragmentation mass spectra, identifies an alphabet of m/z differences that can build large connected graphs from many intense peaks in each spectrum from a collection. Once an alphabet is obtained, it is informative to find common substructures among the peaks connected by the alphabet. This is equivalent to finding the largest isomorphic subgraphs of the de novo graphs from all pairs of fragmentation spectra.
This maximal subgraph isomorphism problem is a generalization of the subgraph isomorphism problem, which asks whether a graph G1 has a subgraph isomorphic to a graph G2. Subgraph isomorphism is NP-complete. A novel method for efficiently finding common substructures among the subspectra induced by the alphabet is proposed. This method is then combined with a novel form of hashing, eschewing evaluation of all pairs of fragmentation spectra. These methods are generalized to Euclidean graphs embedded in Z^n.
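The two counts above can be verified with a few lines of arithmetic: 20 choices for each of six subunits, and, for the branched case, an independent present/absent choice for each of the (6 choose 2) possible edges.

```python
from math import comb

alphabet = 20   # possible subunits per position
n = 6           # subunits in the molecule

linear = alphabet ** n                        # 20^6 linear chains
graphs = alphabet ** n * 2 ** comb(n, 2)      # times 2^(6 choose 2) edge subsets
print(linear, graphs)
```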

    Using Fundamental Measure Theory to Treat the Correlation Function of the Inhomogeneous Hard-Sphere Fluid

    We investigate the value of the correlation function of an inhomogeneous hard-sphere fluid at contact. This quantity plays a critical role in Statistical Associating Fluid Theory (SAFT), which is the basis of a number of recently developed classical density functionals. We define two averaged values for the correlation function at contact and derive formulas for each of them from the White Bear version of the Fundamental Measure Theory functional, using an assumption of thermodynamic consistency. We test these formulas, as well as two existing formulas, against Monte Carlo simulations, and find excellent agreement between the Monte Carlo data and one of our averaged correlation functions.

    The Alphabet Projection of Mass Spectrometry Data

    My presentation will be about finding small molecules in mass spectrometry (MS) data. There is a wide breadth of future applications for this technique, but the most impactful may be in drug testing. This method can be used to find what a drug has metabolized into (it could be a harmful poison or a therapeutic chemical) after it has interacted with a patient's physiology. MS is a technique used to find the mass of objects (e.g., molecules and amino acids) that are too small to be weighed through conventional means. The data are measured as mass vs. intensity; intensity can be thought of as the abundance of that mass in the sample. The data look like a series of peaks, where a peak is present if a mass is found at that value and its height is proportional to the intensity of that mass. We use the mass difference between peaks to find molecules that are either too small to be found by MS or that disappeared from the sample before the MS process began.
We identify a set of the most important mass differentials, which we call an alphabet, that connect many masses in the MS data. There are methods that use an already known alphabet to connect a graph, but such an alphabet may be so large as to be unusable for certain data sets, such as urine analysis. We are the first to propose a method that finds such an alphabet without any a priori knowledge of the data. To find the most important masses in the MS data, we represent the masses as vertices in a graph and connect two vertices if there is a mass in the alphabet equal to the difference between them. The larger the graphs an alphabet builds, the more important its masses are. This comes from the idea that chain reactions are important: if many masses all lose a sugar mass, then the sugar must be important to the sample; if those smaller masses then lose a water molecule, water and sugar combined are important.
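A minimal sketch of the graph-building step described above, assuming a simple absolute m/z tolerance (the peak values, tolerance, and alphabet masses here are illustrative, not from the presented work):

```python
def build_de_novo_graph(peaks, alphabet, tol=0.01):
    """Connect two peaks when their m/z difference matches an alphabet mass."""
    edges = []
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            diff = abs(peaks[i] - peaks[j])
            if any(abs(diff - a) <= tol for a in alphabet):
                edges.append((peaks[i], peaks[j]))
    return edges

# hypothetical peaks related by a water loss (18.011) and a hexose loss (162.053)
peaks = [100.0, 118.011, 280.064]
print(build_de_novo_graph(peaks, [18.011, 162.053]))
```

The two edges returned form a chain of three peaks, which is exactly the kind of large connected structure the method rewards.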
From the MS data, millions of mass differentials may be calculated; we project these millions down to the most important ones, usually between 32 and 128. The alphabets from which we build the graphs are determined randomly: each mass differential is picked by choosing two random masses and taking the difference between them. We then have a model that estimates the quality of the graph(s) made by an alphabet. If the set of graphs produced by one alphabet is better than that produced by another, we keep the first. After proposing random masses, building the graphs, and accepting the best alphabets many times, the best alphabet converges to a final answer, meaning we can no longer find an alphabet that produces better graphs.
As stated above, the most significant application may be in the field of drug testing, but this method can be used any time you are unsure of what a sample may contain. The TSA could use this method to find which potential bomb-making chemicals to look for: make a bomb, take MS data, then use our method to find which chemicals are prevalent in the sample. Another biological application may be in diagnosis: a urine sample from a patient may contain a molecule, surfaced in our alphabet, that can indicate whether the patient is diabetic, experiencing kidney failure, pregnant, etc.
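The propose-score-accept loop can be sketched as follows. The quality function here (a count of peak pairs connected by the alphabet) is a hypothetical stand-in for the actual graph-quality model, and all parameter values are illustrative:

```python
import random

def refine_alphabet(peaks, size=32, iters=1000, tol=0.01, seed=0):
    """Randomly propose mass-difference alphabets; keep the best-scoring one."""
    rng = random.Random(seed)

    def quality(alphabet):
        # stand-in score: number of peak pairs the alphabet connects
        return sum(
            1
            for i in range(len(peaks))
            for j in range(i + 1, len(peaks))
            if any(abs(abs(peaks[i] - peaks[j]) - a) <= tol for a in alphabet)
        )

    def propose():
        # each differential is the difference of two randomly chosen masses
        return [abs(rng.choice(peaks) - rng.choice(peaks)) for _ in range(size)]

    best = propose()
    for _ in range(iters):
        candidate = propose()
        if quality(candidate) > quality(best):
            best = candidate
    return best
```

Because every proposed differential is a difference of two observed masses, the search stays anchored to the data while the accept step drives it toward alphabets that build larger connected graphs.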