37 research outputs found

Finding Skewed Subcubes Under a Distribution

Author: Gopalan Parikshit
Levin Roie
Wieder Udi
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 11th Innovations in Theoretical Computer Science Conference (ITCS 2020)
Publication date: 01/01/2020

Say that we are given samples from a distribution ψ over an n-dimensional space. We expect or desire ψ to behave like a product distribution (or a k-wise independent distribution over its marginals for small k). We propose the problem of enumerating/list-decoding all large subcubes where the distribution ψ deviates markedly from what we expect; we refer to such subcubes as skewed subcubes. Skewed subcubes are certificates of dependencies between small subsets of variables in ψ. We motivate this problem by showing that it arises naturally in the context of algorithmic fairness and anomaly detection. In this work we focus on the special but important case where the space is the Boolean hypercube, and the expected marginals are uniform. We show that the obvious definition of skewed subcubes can lead to intractable list sizes, and propose a better definition, that of a minimal skewed subcube: a subcube whose skew cannot be attributed to a larger subcube that contains it. Our main technical contribution is a list-size bound for this definition and an algorithm to efficiently find all such subcubes. Both the bound and the algorithm rely on Fourier-analytic techniques, especially the powerful hypercontractive inequality. On the lower bounds side, we show that finding skewed subcubes is as hard as the sparse noisy parity problem, and hence our algorithms cannot be improved on substantially without a breakthrough on this problem, which is believed to be intractable. Motivated by this, we study alternate models allowing query access to ψ where finding skewed subcubes might be easier.
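To make the notion of a skewed subcube concrete, here is a minimal brute-force sketch in Python. It is not the Fourier-analytic algorithm from the paper and it does not implement the minimality condition. It assumes Boolean samples, uniform expected marginals, and an additive notion of skew (empirical mass of the subcube minus 2^-k), which is an illustrative assumption rather than the authors' definition. The function name `skewed_subcubes` and its parameters are hypothetical.

```python
from itertools import combinations, product


def skewed_subcubes(samples, k, threshold):
    """Brute-force scan over all subcubes that fix exactly k coordinates.

    Assumption (not from the paper): skew is the additive difference between
    the empirical mass of the subcube and the uniform expectation 2**-k.
    Returns a list of (fixed_coordinates, fixed_values, skew) triples.
    """
    n = len(samples[0])
    m = len(samples)
    expected = 2.0 ** -k
    found = []
    for coords in combinations(range(n), k):
        # Count how often each assignment to the chosen coordinates occurs.
        counts = {}
        for x in samples:
            key = tuple(x[i] for i in coords)
            counts[key] = counts.get(key, 0) + 1
        # Every assignment defines one subcube, including ones never sampled.
        for values in product((0, 1), repeat=k):
            skew = counts.get(values, 0) / m - expected
            if abs(skew) > threshold:
                found.append((coords, values, skew))
    return found


# Toy data: bits 0 and 1 are always equal, bit 2 is independent of both.
samples = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 25
result = skewed_subcubes(samples, k=2, threshold=0.2)
for coords, values, skew in result:
    print(coords, values, skew)
```

On this toy input only subcubes that fix coordinates 0 and 1 are reported, which reflects the dependency between those two bits. The scan costs on the order of (n choose k) passes over the samples, which is the kind of blow-up that motivates a more careful algorithm.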

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Diamond Dicing

Author: Hazel Webb
Owen Kaser
Daniel Lemire
Publication venue: Elsevier BV
Publication date: 01/09/2013

In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least 100 000 dollars, or on shops having sales greater than 400 000 dollars. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations. Because of the interaction between dimensions, the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store).
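The interaction between dimensions can be illustrated with a naive fixed-point sketch in Python: dropping a product lowers the totals of the shops that sold it, which may push a shop below its threshold, and so on. This is only an illustration under stated assumptions (two dimensions, non-negative sales, thresholds applied to sums), not the authors' optimized implementation. The function name `diamond` and the toy data are hypothetical.

```python
def diamond(facts, min_product, min_shop):
    """Naive iterative pruning over a two-dimensional cube.

    facts maps (product, shop) -> sales amount (assumed non-negative).
    Repeatedly drops every cell whose product total or shop total is below
    its threshold, until no cell is dropped in a full pass.
    """
    facts = dict(facts)
    while True:
        by_product, by_shop = {}, {}
        for (p, s), v in facts.items():
            by_product[p] = by_product.get(p, 0) + v
            by_shop[s] = by_shop.get(s, 0) + v
        kept = {
            (p, s): v
            for (p, s), v in facts.items()
            if by_product[p] >= min_product and by_shop[s] >= min_shop
        }
        if len(kept) == len(facts):
            return kept  # fixed point: both thresholds hold simultaneously
        facts = kept


sales = {
    ("A", "X"): 100,
    ("A", "Y"): 20,
    ("B", "X"): 30,
    ("B", "Y"): 40,
    ("C", "Y"): 50,
}
diamond_result = diamond(sales, min_product=60, min_shop=100)
print(diamond_result)
```

In this toy input, removing product C pushes shop Y below its threshold, and removing shop Y then pushes product B below its threshold, so only the cell for product A in shop X survives. A single filter per dimension would not capture this cascade.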

arXiv.org e-Print Archive

R-libre

Crossref

Statistical properties of cosmological correlation functions

Author: Wilking Philipp
Publication venue: Universitäts- und Landesbibliothek Bonn

Correlation functions are an omnipresent tool in astrophysics, and they are routinely used to study phenomena as diverse as the large-scale structure of the Universe, time-dependent pulsar signals, and the cosmic microwave background. In many cases, measured correlation functions are analyzed in the framework of Bayesian statistics, which requires knowledge about the likelihood of the data. In the case of correlation functions, this probability distribution is usually approximated as a multivariate Gaussian, which is not necessarily a good approximation; hence, this work aims at finding a better description. To this end, we exploit fundamental mathematical constraints on correlation functions, which we use to construct a quasi-Gaussian likelihood. We explain how to compute the constraints, in particular for multi-dimensional random fields, where this can only be done numerically, check the quality of the quasi-Gaussian approximation, and compare it to alternative approaches. Most importantly, we test the new-found description of the likelihood in a toy-model Bayesian analysis. Finally, we compute correlation functions from the Millennium Simulation and show that they obey the constraints. By studying statistical properties of the measured correlation functions, we present further indications for the validity of the quasi-Gaussian approach.

bonndoc – Der Publikationsserver der Universität Bonn

Robust Scalable Sorting

Author: Axtmann Michael
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 23/08/2021

Sorting is one of the most important fundamental algorithmic problems. It is therefore not surprising that sorting algorithms are needed in a wide variety of applications. These applications run on a wide range of devices, from smartphones with power-efficient multi-core processors to supercomputers with thousands of machines connected by a high-performance network. Ever since single-core performance stopped increasing significantly, parallel applications have become an indispensable part of everyday life. Efficient and scalable algorithms are therefore essential to exploit this immense availability of (parallel) computing power. This thesis studies how sequential and parallel sorting algorithms can achieve maximum performance in as robust a manner as possible. We consider a large parameter range of input sizes, input distributions, machines, and data types. In the first part of this thesis, we study both sequential sorting and parallel sorting on shared-memory machines. We present In-place Parallel Super Scalar Samplesort (IPS⁴o), a new comparison-based algorithm that requires only a bounded amount of additional memory (the so-called "in-place" property). A key insight is that our in-place technique improves the sorting speed of IPS⁴o compared to similar algorithms without the in-place property. Until now, the property of requiring only a bounded amount of additional memory has tended to be associated with performance penalties. IPS⁴o is also cache-efficient and performs $O\left(\frac{n}{t} \log n\right)$ work per thread to sort an array of size $n$ with $t$ threads. In addition, IPS⁴o takes memory locality into account, uses a branch-free decision tree, and uses special partitions for elements with equal keys. For the special case in which only integer keys are to be sorted, we reused the algorithmic concept of IPS⁴o to implement In-place Parallel Super Scalar Radix Sort (IPS²Ra). We confirm the performance of our algorithms in an extensive experimental study with 21 state-of-the-art sorting algorithms, six data types, ten input distributions, four machines, four memory allocation strategies, and input sizes varying over seven orders of magnitude. On the one hand, the study shows the robust performance of our algorithms. On the other hand, it reveals that many competing algorithms have performance problems: with IPS⁴o we obtain a robust comparison-based sorting algorithm that outperforms other parallel in-place comparison-based sorting algorithms by almost a factor of three. In the vast majority of cases, IPS⁴o is the fastest comparison-based algorithm. This holds regardless of whether we compare IPS⁴o with algorithms that require only bounded additional memory or additional memory on the order of the input size, and whether they run in parallel or sequentially. In many cases, IPS⁴o even outperforms competing implementations of integer sorting algorithms. The remaining cases mainly comprise uniformly distributed inputs and inputs with keys that contain only a few bits. These inputs are usually "easy" for integer sorting algorithms. Our integer sorter IPS²Ra outperforms other integer sorting algorithms on these inputs in the vast majority of cases. Exceptions are some very small inputs, for which most algorithms are very inefficient.
However, algorithms that target these input sizes are usually considerably slower for all other inputs. In the second part of this thesis, we study scalable sorting algorithms for distributed systems that are robust with respect to the input size, frequently occurring sort keys, the distribution of sort keys over the processors, and the number of processors. The result of our work is essentially four robust scalable sorting algorithms with which we can cover the entire range of input sizes. Three of these four algorithms are new, fast algorithms implemented so that they incur only little overhead while scaling robustly regardless of "difficult" inputs. Inputs are "difficult", for example, when many equal elements occur or when the input elements are unfavorably distributed over the processors with respect to their sort keys. Previous algorithms for medium and larger input sizes have an unacceptably large communication volume or exchange messages disproportionately often. For these input sizes, we describe a robust multi-level generalization of samplesort that offers a workable compromise between communication volume and the number of messages exchanged. We reconcile these previously incompatible goals by means of scalable approximate splitter selection and a new data redistribution algorithm. As an alternative, we present a generalization of mergesort, which has the advantage of perfectly balanced output. For small inputs, we design a variant of quicksort. With little overhead, it avoids the problem of unfavorable element distributions and frequently occurring sort keys by quickly selecting high-quality splitters, assigning the elements randomly to the processors, and applying duplicate handling.
Previous practical approaches with polylogarithmic latency either have a logarithmic factor more communication volume or only consider uniformly distributed inputs without repeated sort keys. For very small inputs, we propose a simple and fast, yet work-inefficient, algorithm with logarithmic latency. For these inputs, previous efficient approaches are only theoretical algorithms, most of which have disproportionately large constant factors. For the smallest inputs, we recommend sorting the data while it is sent to a single processor. An important contribution of this thesis to the practical side of algorithm engineering is the communication library RangeBasedComm (RBC). RBC enables efficient implementation of recursive algorithms with sublinear running time by providing scalable and efficient communication functions for subsets of processors. Finally, we present an extensive experimental study on two supercomputers with up to 262144 processor cores, eleven algorithms, ten input distributions, and input sizes varying over nine orders of magnitude. With the exception of the largest input sizes, this thesis is the only one that performs sorting experiments on machines of this size at all. The RBC library speeds up the algorithms, in some cases drastically: one competing algorithm by more than two orders of magnitude. The study shows that our algorithms are robust and at the same time clearly outperform competing implementations. The competitors that one would normally have considered even crash on "difficult" inputs.
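As a rough illustration of the samplesort idea that the abstract builds on, the following Python sketch picks splitters from a random sample, distributes elements into buckets, and recurses. It is a plain sequential, out-of-place version with none of the in-place, parallel, or branch-free machinery of IPS⁴o. The function name `samplesort`, its parameters, and their default values are assumptions chosen for the example.

```python
import random
from bisect import bisect_right


def samplesort(items, num_buckets=4, oversampling=4, base_case=16):
    """Sequential, out-of-place samplesort sketch (illustrative only)."""
    if len(items) <= base_case:
        return sorted(items)
    # Draw a random sample and take evenly spaced elements as splitters.
    sample_size = min(len(items), num_buckets * oversampling)
    sample = sorted(random.sample(items, sample_size))
    step = len(sample) // num_buckets
    splitters = [sample[i * step] for i in range(1, num_buckets)]
    # Bucket i receives the elements x with splitters[i-1] <= x < splitters[i].
    buckets = [[] for _ in range(num_buckets)]
    for x in items:
        buckets[bisect_right(splitters, x)].append(x)
    # If no progress was made (e.g. many equal keys), fall back to sorted().
    if max(len(b) for b in buckets) == len(items):
        return sorted(items)
    out = []
    for b in buckets:
        out.extend(samplesort(b, num_buckets, oversampling, base_case))
    return out


random.seed(0)
data = [random.randrange(1000) for _ in range(500)]
sorted_data = samplesort(data)
print(sorted_data[:10])
```

The fallback for the case where every element lands in one bucket is a crude stand-in for the equal-key handling the abstract mentions; without some such guard, inputs with many duplicates would recurse without making progress.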

KITopen

Compact and indexed representation for LiDAR point clouds

Author: Ladra Susana
Paramá José R.
Rodríguez Luaces Miguel
Silva-Coira Fernando
Publication venue: Informa UK Limited
Publication date: 01/01/2022

LiDAR devices are capable of acquiring clouds of 3D points reflecting any object around them, and adding additional attributes to each point such as color, position, time, etc. LiDAR datasets are usually large, and compressed data formats (e.g. LAZ) have been proposed over the years. These formats are capable of transparently decompressing portions of the data, but they are not focused on solving general queries over the data. In contrast to that traditional approach, a recent line of research focuses on designing data structures that combine compression and indexing, allowing the compressed data to be queried directly. Compression is used to keep the data structure in main memory at all times, thus avoiding disk accesses, and indexing is used to query the compressed data as fast as the uncompressed data. In this paper, we present the first data structure capable of losslessly compressing point clouds that have attributes and jointly indexing all three dimensions of space and attribute values. Our method is able to run range queries and attribute queries up to 100 times faster than previous methods. Funding: Secretaría Xeral de Universidades [ED431G 2019/01]; Ministerio de Ciencia e Innovación [PID2020-114635RB-I00]; Ministerio de Ciencia e Innovación [PDC2021-120917C21]; Ministerio de Ciencia e Innovación [PDC2021-121239-C31]; Ministerio de Ciencia e Innovación [PID2019-105221RB-C41]; Xunta de Galicia [ED431C 2021/53]; Xunta de Galicia [IG240.2020.1.185

Repositorio da Universidade da Coruña

Capriccio For Strings: Collision-Mediated Parallel Transport in Curved Landscapes and Conifold-Enhanced Hierarchies Among Mirror Quintic Flux Vacua

Author: Eckerle Kate
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2017

This dissertation begins with a review of Calabi-Yau manifolds and their moduli spaces, flux compactification largely tailored to the case of type IIb supergravity, and Coleman-De Luccia vacuum decay. The three chapters that follow present the results of novel research conducted as a graduate student. Our first project is concerned with bubble collisions in single scalar field theories with multiple vacua. Lorentz boosted solitons traveling in one spatial dimension are used as a proxy to the colliding 3-dimensional spherical bubble walls. Recent work found that at sufficiently high impact velocities collisions between such bubble vacua are governed by "free passage" dynamics in which field interactions can be ignored during the collision, providing a systematic process for populating local minima without quantum nucleation. We focus on the time period that follows the bubble collision and provide evidence that, for certain potentials, interactions can drive significant deviations from the free passage bubble profile, thwarting the production of a new patch with different field value. However, for simple polynomial potentials a fine-tuning of vacuum locations is required to reverse the free passage kick enough that the field in the collision region returns to the original bubble vacuum. Hence we deem classical transitions mediated by free passage robust. Our second project continues with soliton collisions in the limit of relativistic impact velocity, but with the new feature of nontrivial field space curvature. We establish a simple geometrical interpretation of such collisions in terms of a double family of field profiles whose tangent vector fields stand in mutual parallel transport. This provides a generalization of the well-known limit in flat field space (free passage). We investigate the limits of this approximation and illustrate our analytical results with numerical simulations. 
In our third and final project we investigate the distribution of field theories that arise from the low energy limit of flux vacua built on type IIb string theory compactified on the mirror quintic. For a large collection of these models, we numerically determine the distribution of Taylor coefficients in a polynomial expansion of each model's scalar potential to fourth order. We provide an analytic explanation of the pronounced hierarchies exhibited by the random sample of masses and couplings generated numerically. The analytic argument is based on the structure of masses in no scale supergravity and the divergence of the Yukawa coupling at the conifold point in the moduli space of the mirror quintic. Our results cast the superpotential vev as a random element whose capacity to cloud structure vanishes as the conifold is approached.

Columbia University Academic Commons

Strengths and weaknesses of Dataparallel C

Author: S Seetharamakrishnan
Publication venue: Oregon State University

Dataparallel C is a SIMD-style data-parallel programming language for MIMD computers. Dataparallel C has been implemented on both shared memory (Sequent) and distributed memory (Intel and nCUBE) computers. Here we analyze the strengths and weaknesses of Dataparallel C by comparing the performance of compiled Dataparallel C programs with the performance of programs developed using other parallel programming environments.

ScholarsArchive@OSU

The Galaxy Velocity Function from MIGHTEE-HI Early Science Data

Author: Mulaudzi Wanga
Publication venue: Department of Astronomy
Publication date: 07/03/2022

The velocity function of MIGHTEE-H I Early Science data is presented. This is the first velocity function that is based on a blind radio interferometric survey. As a precursor, understanding the systematics that affect the Early Science velocity function will optimise the full survey's analysis. PYMULTINEST and the Busy Function are employed to estimate the linewidths of the low spectral resolution data. The performance of PYMULTINEST in estimating known linewidths of simulated H I profiles with varying spectral resolution is assessed. The simulation study shows that the estimated linewidths of the Early Science data, using this novel method, are robust and are recovered within the uncertainty. The effects of cosmic variance, instrumental linewidth broadening and Doppler linewidth broadening on the velocity function are quantified within the context of the limitations of the Early Science data. The MIGHTEE-H I Early Science velocity function is compared with the velocity functions from previous large-scale H I surveys, namely the Arecibo Legacy Fast ALFA (ALFALFA) survey and the H I Parkes All-Sky Survey (HIPASS). There is general agreement with the ALFALFA and HIPASS results, when taking linewidth broadening into account, given that the MIGHTEE-H I Early Science data is strongly affected by cosmic variance. In particular, cosmic variance introduces an average uncertainty of ∼ 24% in the measured Early Science volume densities. The larger effective area of the full survey will reduce the impact of cosmic variance. The full survey velocity function can be further optimised by estimating the rotational velocities using kinematic modelling, and correcting the measured linewidths for instrumental broadening, Doppler broadening, turbulent motion and inclination effects

Cape Town University OpenUCT