
    Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

    Motivation: The ever-growing size of sequencing data is a major bottleneck in bioinformatics, as advances in hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. Results: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.
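
    The abstract's core idea, binning k-mers into several Bloom filters by their multiplicity and answering quantification queries from those bins, can be illustrated with a toy, single-experiment sketch. The thresholds, hash scheme, and median-based estimate below are illustrative assumptions and not Needle's actual implementation, which uses interleaved Bloom filters spanning many experiments.

```python
# Toy sketch of a multiplicity-binned k-mer index (NOT the Needle implementation):
# k-mers are counted, then inserted into one Bloom filter per abundance bin.
import hashlib
from collections import Counter

class BloomFilter:
    def __init__(self, size=1 << 16, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def kmers(seq, k=5):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(reads, thresholds=(1, 4, 16), k=5):
    """One Bloom filter per abundance bin; a k-mer is added to every bin it reaches."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    filters = [BloomFilter() for _ in thresholds]
    for km, c in counts.items():
        for level, t in enumerate(thresholds):
            if c >= t:
                filters[level].add(km)
    return filters

def estimate_level(filters, transcript, k=5):
    """Rough abundance estimate: median abundance bin reached by the transcript's k-mers."""
    levels = sorted(sum(km in f for f in filters) for km in kmers(transcript, k))
    return levels[len(levels) // 2] if levels else 0

reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCAA"]
index = build_index(reads)
print(estimate_level(index, "ACGTACGT"))
```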

    Parallel Global Edge Switching for the Uniform Sampling of Simple Graphs with Prescribed Degrees

    The uniform sampling of simple graphs matching a prescribed degree sequence is an important tool in network science, e.g., to construct graph generators or null models. Here, the Edge Switching Markov Chain (ES-MC) is a common choice. Given an arbitrary simple graph with the required degree sequence, ES-MC carries out a large number of small changes, called edge switches, to eventually obtain a uniform sample. In practice, reasonably short runs efficiently yield approximately uniform samples. In this work, we study the problem of executing edge switches in parallel. We discuss parallelizations of ES-MC, but find that this approach suffers from complex dependencies between edge switches. For this reason, we propose the Global Edge Switching Markov Chain (G-ES-MC), an ES-MC variant with simpler dependencies. We show that G-ES-MC converges to the uniform distribution and design shared-memory parallel algorithms for ES-MC and G-ES-MC. In an empirical evaluation, we provide evidence that G-ES-MC requires no more switches than ES-MC (and often fewer), and demonstrate the efficiency and scalability of our parallel G-ES-MC implementation.
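
    For readers unfamiliar with edge switching, the following is a minimal sequential sketch of the classic ES-MC step the abstract builds on (not the parallel G-ES-MC variant): pick two edges uniformly at random, exchange their endpoints, and reject any switch that would create a self-loop or multi-edge, so the degree sequence is preserved by construction. Step count and seed are illustrative.

```python
# Sequential edge-switch Markov chain on an undirected simple graph.
import random

def edge_switch_chain(edges, steps, seed=0):
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    edge_set = set(frozenset(e) for e in edges)
    for _ in range(steps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        if i == j:
            continue
        (a, b), (c, d) = edges[i], edges[j]
        # Propose replacing {a, b} and {c, d} with {a, d} and {c, b}.
        if a == d or c == b:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # would create a multi-edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

# Example: randomize a small graph while keeping every node's degree fixed.
print(edge_switch_chain([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)], steps=1000))
```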

    Engineering Shared-Memory Parallel Shuffling to Generate Random Permutations In-Place

    Shuffling is the process of placing elements into a random order such that any permutation occurs with equal probability. It is an important building block in virtually all scientific areas. We engineer, to the best of our knowledge for the first time, a practically fast, parallel shuffling algorithm with O(√n log n) parallel depth that requires only poly-logarithmic auxiliary memory (with high probability). In an empirical evaluation, we compare our implementations with a number of existing solutions on various computer architectures. Our algorithms consistently achieve the highest throughput on all machines. Further, we demonstrate that the runtime of our parallel algorithm is comparable to the time other algorithms may need just to acquire memory from the operating system for copying the input.
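
    As a point of reference for what a parallel, in-place shuffler improves upon, here is the textbook sequential in-place Fisher-Yates shuffle; the paper's shared-memory algorithm and its poly-logarithmic-memory partitioning are not reproduced here.

```python
# Sequential in-place Fisher-Yates shuffle: O(n) work, uniform over all permutations.
import random

def fisher_yates(items, seed=None):
    rng = random.Random(seed)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)              # uniform position in [0, i]
        items[i], items[j] = items[j], items[i]
    return items

print(fisher_yates(list(range(10)), seed=42))
```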

    Choosing a Source of Randomness for Computer Simulation

    The main purpose of evolutionary optimization is to find a combination of parameters (independent variables) that maximizes or minimizes the qualitative, quantitative, and probabilistic characteristics of a problem. Integrated optimization methods that borrow their basic working principles from wildlife have recently become very common. Researchers experiment with different types of representations; evolutionary and genetic algorithms, for example, use selection methods and genetic operators, and a large number of swarm-based algorithms are known. The artificial bee colony is an optimization method that mimics the behavior of bees, a specific application of swarm intelligence; its main feature is that it needs no problem-specific information, only the objective to be optimized. Comparing candidate solutions through the local optimization behavior of each artificial bee eventually leads the group to a global optimum with a high rate of convergence. The paper considers a method for solving the optimization problem based on modeling the behavior of a bee colony: it describes the model of scout and forager agents, the search mechanisms, and the selection of positions in a given neighborhood, and gives the general structure of the optimization process. Graphical results are presented showing that the bee colony method can, by optimization, significantly narrow a large set of information sources down to a small range of sources that may carry false information, which in turn makes it possible to identify and block such sources more accurately.
    The article considers the problem of choosing a source of randomness for the computer modeling of stochastic processes, which is used to study the characteristics of security event flows in distributed computer networks, at the design stage of complex automated systems, and in processes occurring in the management of production and infrastructure facilities. A component of the computer model is the source of randomness, which produces a uniformly distributed stream of random integers or real numbers. It must produce a stream of uniformly distributed numbers while remaining economical in terms of computational resources. The paper analyzes simple pseudorandom number generators whose algorithms use only simple computer operations; these include the lagged Fibonacci generator and the Xorshift128 generator proposed by G. Marsaglia. It is noted that any non-uniformity in the distribution of numbers at the generator output significantly affects the quality of the process being modeled. Based on a study of existing ways to post-process the output sequences, it is concluded that, to keep the algorithm for producing a stream of uniformly distributed pseudorandom numbers efficient, the additional processing procedures must be sufficiently economical in terms of the computation involved. The non-uniformity of the distribution of the numeric stream was evaluated using Pearson's chi-squared statistic. To correct the output numeric stream, a method is proposed and justified for extracting from it the part with the highest entropy. The histogram parameters that give good estimates of the output distribution are also justified. It is shown that combining a simple and economical pseudorandom number generator with post-processing gives good results at minimal computational cost.
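
    A minimal sketch of the two ingredients named in the abstract, Marsaglia's xorshift128 generator and a Pearson chi-squared check of histogram uniformity, is given below. The bin count, sample size, and seeds are illustrative choices; the paper's entropy-based extraction and histogram-parameter tuning are not reproduced.

```python
# Marsaglia's xorshift128 plus a chi-squared uniformity statistic over a histogram.
MASK32 = 0xFFFFFFFF

def xorshift128(state):
    """One step of xorshift128; state = [x, y, z, w], 32-bit words, not all zero."""
    x, y, z, w = state
    t = (x ^ (x << 11)) & MASK32
    t ^= t >> 8
    new_w = (w ^ (w >> 19) ^ t) & MASK32
    state[:] = [y, z, w, new_w]
    return new_w

def chi_squared_uniformity(samples, bins=16):
    """Pearson chi-squared statistic for a histogram of 32-bit samples."""
    counts = [0] * bins
    for x in samples:
        counts[x * bins >> 32] += 1
    expected = len(samples) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

state = [123456789, 362436069, 521288629, 88675123]
stream = [xorshift128(state) for _ in range(100_000)]
# With 16 bins (15 degrees of freedom), values near 15 indicate a uniform-looking histogram.
print(chi_squared_uniformity(stream))
```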

    A Floating-Point Secure Implementation of the Report Noisy Max with Gap Mechanism

    The Noisy Max mechanism and its variations are fundamental private selection algorithms that are used to select items from a set of candidates (such as the most common diseases in a population), while controlling the privacy leakage in the underlying data. A recently proposed extension, Noisy Top-k with Gap, provides numerical information about how much better the selected items are compared to the non-selected items (e.g., how much more common are the selected diseases). This extra information comes at no privacy cost but crucially relies on infinite precision for the privacy guarantees. In this paper, we provide a finite-precision secure implementation of this algorithm that takes advantage of integer arithmetic.
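
    To make the mechanism concrete, here is a textbook floating-point sketch of Report Noisy Max with Gap using Laplace noise; the paper's point is precisely that this kind of floating-point sampling is unsafe, and its secure implementation uses finite-precision integer arithmetic instead. The parameter names and values below are illustrative.

```python
# Conceptual Report Noisy Max with Gap: add Laplace noise to each score,
# return the index of the largest noisy score and the gap to the runner-up.
import random

def report_noisy_max_with_gap(scores, epsilon, sensitivity=1.0, seed=None):
    rng = random.Random(seed)
    scale = 2.0 * sensitivity / epsilon              # Laplace scale used by the noisy-max mechanism
    # Laplace(0, scale) sampled as the difference of two exponentials.
    noisy = [s + rng.expovariate(1 / scale) - rng.expovariate(1 / scale) for s in scores]
    order = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    best, runner_up = order[0], order[1]
    gap = noisy[best] - noisy[runner_up]             # released "for free" in the Gap variant
    return best, gap

# Example: counts of some condition in five groups; larger epsilon means less noise.
print(report_noisy_max_with_gap([120, 98, 134, 87, 131], epsilon=0.5, seed=1))
```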

    Array programming with NumPy.

    Array programming provides a powerful, compact and expressive syntax for accessing, manipulating and operating on data in vectors, matrices and higher-dimensional arrays. NumPy is the primary array programming library for the Python language. It has an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves [1] and in the first imaging of a black hole [2]. Here we review how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring and analysing scientific data. NumPy is the foundation upon which the scientific Python ecosystem is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Owing to its central position in the ecosystem, NumPy increasingly acts as an interoperability layer between such array computation libraries and, together with its application programming interface (API), provides a flexible framework to support the next decade of scientific and industrial analysis.
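
    A small example of the array-programming style the abstract describes: whole-array operations, broadcasting, and aggregation replace explicit Python loops. The data and shapes are made up for illustration.

```python
# Vectorized per-channel RMS of four noisy signals, no explicit loops.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=(4, 1000))                       # four noisy channels
centered = signal - signal.mean(axis=1, keepdims=True)    # broadcasting removes each channel's mean
rms = np.sqrt((centered ** 2).mean(axis=1))               # per-channel root mean square
print(rms)
```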