9 research outputs found
The analysis of enumerative source codes and their use in Burrows–Wheeler compression algorithms
In the late 20th century the reliable and efficient transmission, reception and storage of information proved to be central to the most successful economies all over the world. The Internet, once a classified project accessible to a select few, is now part of the everyday lives of a large part of the human population, and as such the efficient storage of information is an important part of the information economy. The improvement of the information storage density of optical and electronic media has been remarkable, but the elimination of redundancy in stored data and the reliable reconstruction of the original data is still a desired goal. The field of source coding is concerned with the compression of redundant data and its reliable decompression. The arithmetic source code, which was independently proposed by J. J. Rissanen and R. Pasco in 1976, revolutionized the field of source coding. Compression algorithms that use an arithmetic code to encode redundant data are typically more effective and computationally more efficient than compression algorithms that use earlier source codes such as extended Huffman codes. The arithmetic source code is also more flexible than earlier source codes, and is frequently used in adaptive compression algorithms. The arithmetic code remains the source code of choice, despite having been introduced more than 30 years ago. The problem of effectively encoding data from sources with known statistics (i.e. where the probability distribution of the source data is known) was solved with the introduction of the arithmetic code. The probability distribution of practical data is seldom available to the source encoder, however. The source coding of data from sources with unknown statistics is a more challenging problem, and remains an active research topic. Enumerative source codes were introduced by T. J. Lynch and L. D. Davisson in the 1960s. 
These lossless source codes have the remarkable property that they may be used to effectively encode source sequences from certain sources without requiring any prior knowledge of the source statistics. One drawback of these source codes is the computationally complex nature of their implementations. Several years after the introduction of enumerative source codes, J. G. Cleary and I. H. Witten proved that approximate enumerative source codes may be realized by using an arithmetic code. Approximate enumerative source codes are significantly less complex than the original enumerative source codes, but are less effective than the original codes. Researchers have become more interested in arithmetic source codes than enumerative source codes since the publication of the work by Cleary and Witten. This thesis concerns the original enumerative source codes and their use in Burrows–Wheeler compression algorithms. A novel implementation of the original enumerative source code is proposed. This implementation has a significantly lower computational complexity than the direct implementation of the original enumerative source code. Several novel enumerative source codes are introduced in this thesis. These codes include optimal fixed-to-fixed length source codes with manageable computational complexity. A generalization of the original enumerative source code, which includes more complex data sources, is proposed in this thesis. The generalized source code uses the Burrows–Wheeler transform, which is a low-complexity algorithm for converting the redundancy of sequences from complex data sources to a more accessible form. The generalized source code effectively encodes the transformed sequences using the original enumerative source code. It is demonstrated and proved mathematically that this source code is universal (i.e. the code has an asymptotic normalized average redundancy of zero bits). 
AFRIKAANS (translated): By the end of the twentieth century, the reliable and efficient transmission, reception and storage of information formed the core of the most successful economies in the world. The Internet, once a secret project accessible to only a small group of users, is today part of the everyday life of a large percentage of humanity, and the efficient storage of information is therefore an important part of the information economy. The improvement in the storage density of optical and electronic media has been remarkable, but the elimination of redundancy in stored data, together with the reliable recovery of the original data, remains a goal worth pursuing. Source coding is concerned with the compression of redundant data and the reliable decompression of that data. The arithmetic source code, proposed independently by J. J. Rissanen and R. Pasco in 1976, revolutionized the field of source coding. Compression algorithms that use arithmetic source codes to encode redundant data are typically more effective and computationally more efficient than compression algorithms that use earlier source codes, such as extended Huffman codes. Arithmetic source codes, which are frequently used in adaptive compression algorithms, are also more flexible than earlier source codes. After 30 years the arithmetic source code is still the source code of first choice. The problem of effectively encoding data from sources with known statistics (i.e. where the probability distribution of the source data is known) was solved by the introduction of arithmetic source codes. The source encoder, however, seldom has access to the probability distribution of practical data. The source coding of data from sources with unknown statistics is a greater challenge and remains an active research field. T. J. Lynch and L. D. Davisson proposed enumerative source codes in the 1960s.
Enumerative source codes have the remarkable property that source sequences from certain sources can be encoded effectively with these lossless codes without the source encoder needing any prior knowledge of the source statistics. One drawback of enumerative source codes is the high computational complexity of their implementations. Several years after the introduction of enumerative source codes, J. G. Cleary and I. H. Witten proved that approximate enumerative source codes can be realized by using arithmetic source codes. Approximate enumerative source codes have a lower computational complexity than enumerative source codes, but are less effective than the original enumerative source codes. Since the work of Cleary and Witten, researchers have shown more interest in arithmetic source codes than in enumerative source codes. This thesis is concerned with the original enumerative source codes and their use in Burrows–Wheeler compression algorithms. A new implementation of the original enumerative source code is proposed. The proposed implementation has a significantly lower computational complexity than the direct implementation of the original enumerative source code. Several new enumerative source codes, including optimal fixed-to-fixed length enumerative source codes with manageable computational complexity, are proposed. A generalization of the original enumerative source code, which covers more complex data sources than the original enumerative source code, is proposed in this thesis. The generalized enumerative source code makes use of the Burrows–Wheeler transform. The Burrows–Wheeler transform is a low-complexity algorithm that converts the redundancy of source sequences from complex data sources into a more accessible form. The generalized source code effectively encodes the transformed sequences by using the original enumerative source code. The universal nature of this source code is demonstrated and proved mathematically (i.e. it is proved that the code has an asymptotic normalized average redundancy of zero bits).
Copyright. Dissertation (MEng)--University of Pretoria, 2010. Electrical, Electronic and Computer Engineering. Unrestricted
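As a rough illustration of the enumerative idea described in the abstract above, the following sketch encodes a binary sequence as its length, its number of ones, and its lexicographic rank among all sequences with that length and number of ones. No source statistics are needed. This is a textbook-style direct implementation under simplifying assumptions (binary alphabet, whole-sequence ranking); it is not the lower-complexity implementation proposed in the thesis, and the function names `enum_encode` and `enum_decode` are hypothetical.

```python
from math import comb


def enum_encode(bits):
    """Return (n, k, index): length, number of ones, and the lexicographic
    rank of `bits` among all length-n binary sequences containing k ones."""
    n = len(bits)
    k = sum(bits)
    index = 0
    remaining = k  # ones not yet consumed, including the current one
    for i, b in enumerate(bits):
        if b == 1:
            # Count the sequences that share this prefix but have a 0 here:
            # they place all `remaining` ones in the n - i - 1 later positions.
            index += comb(n - i - 1, remaining)
            remaining -= 1
    return n, k, index


def enum_decode(n, k, index):
    """Invert enum_encode: rebuild the sequence from (n, k, index)."""
    bits = []
    remaining = k
    for i in range(n):
        zeros_first = comb(n - i - 1, remaining)
        if remaining > 0 and index >= zeros_first:
            bits.append(1)
            index -= zeros_first
            remaining -= 1
        else:
            bits.append(0)
    return bits


if __name__ == "__main__":
    print(enum_encode([0, 1, 1, 0]))  # (4, 2, 2)
    print(enum_decode(4, 2, 2))       # [0, 1, 1, 0]
```

The index always lies in the range 0 to C(n, k) - 1, so it can be stored in about log2 C(n, k) bits regardless of the unknown source probabilities. The binomial coefficients grow quickly with n, which hints at the computational cost the abstract mentions for direct implementations.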
Predictive data compression using adaptive arithmetic coding
The commonly used data compression techniques do not necessarily provide maximal compression and neither do they define the most efficient framework for transmission of data. In this thesis we investigate variants of the standard compression algorithms that use the strategy of partitioning of the data to be compressed. Doing so not only increases the compression ratio in many instances, it also reduces the maximum data block size for transmission. The partitioning of the data is made using a Markov model to predict if doing so would result in increased compression ratio. Experiments have been performed on text files comparing the new scheme to adaptive Huffman and arithmetic coding methods. The adaptive Huffman method has been implemented in a new way by combining the FGK method with Vitter's implicit ordering of nodes.
Data Discovery and Anomaly Detection using Atypicality.
Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017
Compression of DNA sequencing data
With the release of the latest generations of sequencing machines, the cost of sequencing a whole human genome has dropped to less than US$1,000. The potential applications in several fields lead to the forecast that the amount of DNA sequencing data will soon surpass the volume of other types of data, such as video data. In this dissertation, we present novel data compression technologies with the aim of enhancing storage, transmission, and processing of DNA sequencing data.
The first contribution in this dissertation is a method for the compression of aligned reads, i.e., read-out sequence fragments that have been aligned to a reference sequence. The method improves compression by implicitly assembling local parts of the underlying sequences. Compared to the state of the art, our method achieves the best trade-off between memory usage and compressed size.
Our second contribution is a method for the quantization and compression of quality scores, i.e., values that quantify the error probability of each read-out base. Specifically, we propose two Bayesian models that are used to precisely control the quantization. With our method it is possible to compress the data down to 0.15 bit per quality score. Notably, we can recommend a particular parametrization for one of our models which, by removing noise from the data as a side effect, does not lead to any degradation in the distortion metric. This parametrization achieves an average rate of 0.45 bit per quality score.
The third contribution is the first implementation of an entropy codec compliant to MPEG-G. We show that, compared to the state of the art, our method achieves the best compression ranks on average, and that adding our method to CRAM would be beneficial both in terms of achievable compression and speed.
Finally, we provide an overview of the standardization landscape, and in particular of MPEG-G, in which our contributions have been integrated.
With the introduction of the latest generations of sequencing machines, the cost of sequencing a human genome has fallen to less than 1,000 US dollars. It is forecast that the volume of sequencing data will soon exceed that of other data types, such as video data. This work therefore presents new data compression methods for improving the storage, transmission and processing of sequencing data.
The first contribution of this work is a method for compressing aligned reads, i.e. read-out sequence fragments that have been aligned to a reference sequence. The method improves compression by using the reads to implicitly estimate local parts of the underlying sequences. Compared with the state of the art, the method achieves the best result when memory usage and achieved compression are considered jointly.
The second contribution is a method for the quantization and compression of quality scores, which quantify the error probability of each read-out base. Specifically, two Bayesian models are proposed with which the quantization can be controlled precisely. With the proposed method the data can be compressed down to 0.15 bit per quality score. Particularly noteworthy is that a specific parametrization can be recommended for one of the models which, by removing noise from the data as a side effect, leads to no degradation in the distortion metric. This parametrization achieves an average rate of 0.45 bit per quality score.
The third contribution is the first implementation of an MPEG-G-compliant entropy codec. It is shown that the proposed codec achieves the best compression results on average compared with the state of the art, and that including the codec in CRAM would be advantageous in terms of both achievable compression and speed.
Finally, an overview of standards for the compression of sequencing data is given. In particular, MPEG-G is discussed, since all contributions of this work have been integrated into MPEG-G.
Neural function approximation on graphs: shape modelling, graph discrimination & compression
Graphs serve as a versatile mathematical abstraction of real-world phenomena in numerous scientific disciplines. This thesis is part of the Geometric Deep Learning subject area, a family of learning paradigms that capitalise on the increasing volume of non-Euclidean data so as to solve real-world tasks in a data-driven manner. In particular, we focus on the topic of graph function approximation using neural networks, which lies at the heart of many relevant methods. In the first part of the thesis, we contribute to the understanding and design of Graph Neural Networks (GNNs). Initially, we investigate the problem of learning on signals supported on a fixed graph. We show that treating graph signals as general graph spaces is restrictive and conventional GNNs have limited expressivity. Instead, we expose a more enlightening perspective by drawing parallels between graph signals and signals on Euclidean grids, such as images and audio. Accordingly, we propose a permutation-sensitive GNN based on an operator analogous to shifts in grids and instantiate it on 3D meshes for shape modelling (Spiral Convolutions). Next, we focus on learning on general graph spaces and in particular on functions that are invariant to graph isomorphism. We identify a fundamental trade-off between invariance, expressivity and computational complexity, which we address with a symmetry-breaking mechanism based on substructure encodings (Graph Substructure Networks). Substructures are shown to be a powerful tool that provably improves expressivity while controlling computational complexity, and a useful inductive bias in network science and chemistry. In the second part of the thesis, we discuss the problem of graph compression, where we analyse the information-theoretic principles and the connections with graph generative models. We show that another inevitable trade-off surfaces, now between computational complexity and compression quality, due to graph isomorphism. 
We propose a substructure-based dictionary coder - Partition and Code (PnC) - with theoretical guarantees that can be adapted to different graph distributions by estimating its parameters from observations. Additionally, contrary to the majority of neural compressors, PnC is parameter and sample efficient and is therefore of wide practical relevance. Finally, within this framework, substructures are further illustrated as a decisive archetype for learning problems on graph spaces. Open Access
Community computation
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Materials Science and Engineering, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 171-186). In this thesis we lay the foundations for a distributed, community-based computing environment that taps the resources of a community to better perform tasks that are computationally hard, economically prohibitive, or physically inconvenient, and that one individual is unable to accomplish efficiently. We introduce community coding, where information systems meet social networks, to tackle some of the challenges in this new paradigm of community computation. We design algorithms and protocols and build system prototypes to demonstrate the power of community computation to better deal with reliability, scalability and security issues, which are the main challenges in many emerging community-computing environments, in several application scenarios such as community storage, community sensing and community security. For example, we develop a community storage system that is based upon a distributed P2P (peer-to-peer) storage paradigm, where we take an array of small, periodically accessible, individual computers/peer nodes and create a secure, reliable and large distributed storage system. The goal is for each one of them to act as if they have immediate access to a pool of information that is larger than they could hold themselves, and into which they can contribute new content in a manner that is both open and secure. Such a contributory and self-scaling community storage system is particularly useful where reliable infrastructure is not readily available, in that such a system facilitates easy ad-hoc construction and easy portability. In another application scenario, we develop a novel framework of community sensing with a group of image sensors. 
The goal is to present a set of novel tools in which software, rather than humans, examines the collection of images sensed by a group of image sensors to determine what is happening in the field of view. We also present several design principles for community security. In one application example, we present a community-based email spam detection approach to deal with email spam more efficiently. by Fulu Li. Ph.D
Proceedings of the 26th International Symposium on Theoretical Aspects of Computer Science (STACS'09)
The Symposium on Theoretical Aspects of Computer Science (STACS) is held alternately in France and in Germany. The conference of February 26-28, 2009, held in Freiburg, is the 26th in this series. Previous meetings took place in Paris (1984), Saarbrücken (1985), Orsay (1986), Passau (1987), Bordeaux (1988), Paderborn (1989), Rouen (1990), Hamburg (1991), Cachan (1992), Würzburg (1993), Caen (1994), München (1995), Grenoble (1996), Lübeck (1997), Paris (1998), Trier (1999), Lille (2000), Dresden (2001), Antibes (2002), Berlin (2003), Montpellier (2004), Stuttgart (2005), Marseille (2006), Aachen (2007), and Bordeaux (2008). ..
Applications, challenges and new perspectives on the analysis of transcriptional regulation using epigenomic and transcriptomic data
The integrative analysis of epigenomics and transcriptomics data is an active research field in Bioinformatics. New methods are required to interpret and process large omics data sets, as generated within consortia such as the International Human Epigenomics Consortium. In this thesis, we present several approaches illustrating how combined epigenomics and transcriptomics datasets, e.g. for differential or time series analysis, can be used to derive new biological insights on transcriptional regulation. In this work we focus on regulatory proteins called transcription factors (TFs), which are essential for orchestrating cellular processes.
In our novel approaches, we combine epigenomics data, such as DNaseI-seq, predicted TF binding scores and gene-expression measurements in interpretable machine learning models. In joint work with our collaborators within and outside IHEC, we have shown that our methods lead to biologically meaningful results, which could be validated with wet-lab experiments. Aside from providing the community with new tools to perform integrative analysis of epigenomics and transcriptomics data, we have studied the characteristics of chromatin accessibility data and its relation to gene expression in detail to better understand the implications of both computational processing and of different experimental methods on data interpretation. Overall, we provide easy-to-use tools to enable researchers to benefit from the era of Biological Data Science.
In this dissertation we present several approaches for using the most common "omics" data, such as differential data sets or time series, to gain new insights into gene regulation at the transcriptional level. We have focused in particular on so-called transcription factors, proteins that are essential for controlling regulatory processes in the cell. In our new methods we combine epigenetic data, for example DNaseI-seq or ATAC-seq data, predicted transcription factor binding sites and gene expression data in interpretable machine learning models. Together with our collaboration partners we have shown that our methods lead to biologically meaningful results, which could be validated in the laboratory in exemplary cases. Furthermore, we have examined in detail the relationships between chromatin structure and gene expression. This is of immense importance for avoiding the influence of experimental characteristics, as well as of the modelling of the data, on the biological interpretation.