1,186 research outputs found

    A prefix encoding for a constructed language

    This work presents a formal and technical analysis of some aspects of a constructed language. The first part studies a possible coding for the language, with emphasis on prefix coding, for which an extension of the Huffman algorithm from binary to n-ary is implemented. Because the frequency of use of the words in the language cannot be known a priori, several strategies are proposed for an open word system, preceded by an analysis of the number of words in current natural languages. As a possible improvement to the coding, the work also examines the problem of synchronization loss and its solution, self-synchronization, including a study of T-codes with the number of possible words for the language, as well as other alternatives. Finally, from a less formal approach, several applications for the language have been developed: a voice synthesizer, a speech recognition system, and a system font for using the language in text processors. For each application, the construction process is detailed, along with the problems encountered and those still to be solved.
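
    The n-ary extension of Huffman's algorithm mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `nary_huffman`, the tuple-of-digits codeword format, and the sample frequencies are all assumptions. The one non-obvious step is padding with zero-weight dummy leaves so that every merge combines exactly n nodes.

```python
import heapq
from itertools import count


def nary_huffman(freqs, n):
    """Build an n-ary prefix code; returns {symbol: tuple of digits in 0..n-1}.

    Assumes symbols are hashable and are neither lists nor None
    (lists mark internal nodes, None marks padding leaves).
    """
    if n < 2:
        raise ValueError("alphabet size n must be at least 2")
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(weight, next(tie), sym) for sym, weight in freqs.items()]
    # Pad with zero-weight dummies until (leaves - 1) is a multiple of (n - 1),
    # which guarantees that every merge takes exactly n nodes.
    while (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]
        total = sum(weight for weight, _, _ in group)
        children = [node for _, _, node in group]
        heapq.heappush(heap, (total, next(tie), children))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, list):  # internal node: one branch per digit
            for digit, child in enumerate(node):
                walk(child, prefix + (digit,))
        elif node is not None:  # real symbol (dummies are skipped)
            codes[node] = prefix or (0,)

    walk(heap[0][2], ())
    return codes


if __name__ == "__main__":
    # Hypothetical word frequencies, ternary code alphabet.
    print(nary_huffman({"a": 5, "b": 2, "c": 1, "d": 1, "e": 1}, 3))
```

    With these sample frequencies the most frequent word receives a one-digit codeword and the three rarest receive two digits each. For an open word system, where frequencies are not known in advance, the weights would have to be estimated or updated over time.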

    Fast and Efficient Entropy Coding Architectures for Massive Data Compression

    The compression of data is fundamental to alleviating the costs of transmitting and storing the massive datasets employed in myriad fields of our society. Most compression systems employ an entropy coder in their coding pipeline to remove the redundancy of coded symbols. The entropy-coding stage needs to be efficient, to yield high compression ratios, and fast, to process large amounts of data rapidly. Despite their widespread use, entropy coders are commonly assessed only for some particular scenario or coding system. This work provides a general framework to assess and optimize different entropy coders. First, the paper describes three main families of entropy coders, namely those based on variable-to-variable length codes (V2VLC), arithmetic coding (AC), and tabled asymmetric numeral systems (tANS). Then, a low-complexity architecture for the most representative coder(s) of each family is presented: a general version of V2VLC; the MQ, M, and a fixed-length version of AC; and two different implementations of tANS. These coders are evaluated under different coding conditions in terms of compression efficiency and computational throughput. The results suggest that V2VLC and tANS achieve the highest compression ratios for most coding rates and that the AC coder that uses fixed-length codewords attains the highest throughput. The experimental evaluation discloses the advantages and shortcomings of each entropy-coding scheme, providing insights that may help to select this stage in forthcoming compression systems.

    TopSig: Topology Preserving Document Signatures

    Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but these were not so far linked to general purpose, signature file based, search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We are able to demonstrate significant improvements in the performance of signature file based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted file based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings and from the theoretical perspective it positions the file signatures model in the class of Vector Space retrieval models.
    Comment: 12 pages, 8 figures, CIKM 201

    Error-correction on non-standard communication channels

    Many communication systems are poorly modelled by the standard channels assumed in the information theory literature, such as the binary symmetric channel or the additive white Gaussian noise channel. Real systems suffer from additional problems including time-varying noise, cross-talk, synchronization errors and latency constraints. In this thesis, low-density parity-check codes and codes related to them are applied to non-standard channels. First, we look at time-varying noise modelled by a Markov channel. A low-density parity-check code decoder is modified to give an improvement of over 1 dB. Secondly, novel codes based on low-density parity-check codes are introduced which produce transmissions with Pr(bit = 1) ≠ Pr(bit = 0). These non-linear codes are shown to be good candidates for multi-user channels with crosstalk, such as optical channels. Thirdly, a channel with synchronization errors is modelled by random uncorrelated insertion or deletion events at unknown positions. Marker codes formed from low-density parity-check codewords with regular markers inserted within them are studied. It is shown that a marker code with iterative decoding has performance close to the bounds on the channel capacity, significantly outperforming other known codes. Finally, coding for a system with latency constraints is studied. For example, if a telemetry system involves a slow channel, some error correction is often needed quickly, whilst the code should be able to correct remaining errors later. A new code is formed from the intersection of a convolutional code with a high-rate low-density parity-check code. The convolutional code has good early decoding performance and the high-rate low-density parity-check code efficiently cleans up remaining errors after receiving the entire block. Simulations of the block code show a gain of 1.5 dB over a standard NASA code.

    STICKY THICKETS: LOCAL REGULATORY CHALLENGES FOR SMALL AND EMERGING SUSTAINABLE BUSINESSES


    A lossy, dictionary-based method for short message service (SMS) text compression

    Short message service (SMS) message compression allows either more content to be fitted into a single message or fewer individual messages to be sent as part of a concatenated (or long) message. Although SMS deals essentially only with plain text, many of the more popular compression methods do not bring about a massive reduction in size for short messages. The Global System for Mobile communications (GSM) specification suggests that untrained Huffman encoding is the only required compression scheme for SMS messaging, yet support for SMS compression is still not widely available on current handsets. This research shows that Huffman encoding might actually increase the size of very short messages and only modestly reduce the size of longer messages. While Huffman encoding yields better results for larger text sizes, handset users do not usually write very large messages consisting of thousands of characters. Instead, an alternative compression method called lossy dictionary-based (LD-based) compression is proposed here. In this method, the coder uses a dictionary tuned to the most frequently used English words and economically encodes white space. The encoding is lossy in that the original case is not preserved; instead, the resulting output is all lower case, a loss that might be acceptable to most users. The LD-based method has been shown to outperform Huffman encoding for the text sizes typically used when writing SMS messages, reducing the size of even very short messages and, for instance, cutting a long message down from five to two parts. Keywords: SMS, text compression, lossy compression, dictionary compression
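
    The general idea of LD-based compression described above (a frequent-word dictionary, cheap white space, case discarded) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the ten-word `WORDS` list, the byte layout (values from 0x80 upward denote a dictionary word with an implied trailing space), and the restriction to 7-bit ASCII input. The paper's actual dictionary and bit-level format are not given in the abstract.

```python
# Hypothetical dictionary; a real one would hold the most frequent English words.
WORDS = ["the", "to", "and", "you", "a", "i", "in", "is", "it", "for"]
INDEX = {word: i for i, word in enumerate(WORDS)}


def encode(message: str) -> bytes:
    """Lossy encode: case is dropped and runs of white space collapse to one space."""
    out = bytearray()
    for word in message.lower().split():
        if word in INDEX:
            # One byte with the high bit set: dictionary word plus implied space.
            out.append(0x80 + INDEX[word])
        else:
            # Literal word (assumed 7-bit ASCII) followed by an explicit space.
            out += word.encode("ascii") + b" "
    return bytes(out)


def decode(data: bytes) -> str:
    parts = []
    for b in data:
        parts.append(WORDS[b - 0x80] + " " if b >= 0x80 else chr(b))
    return "".join(parts).rstrip()


if __name__ == "__main__":
    msg = "See you in THE park"
    packed = encode(msg)
    print(len(msg), len(packed), decode(packed))
```

    In this toy layout each dictionary hit costs one byte for both the word and the space after it, which is why even very short messages can shrink, whereas an untrained Huffman coder has no such shortcut.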

    Succinct and Self-Indexed Data Structures for the Exploitation and Representation of Moving Objects

    Official Doctoral Program in Computing. 5009V01
    [Abstract] This thesis deals with the efficient representation and exploitation of trajectories of objects that move in space without any type of restriction (airplanes, birds, boats, etc.). Currently, this is a very relevant problem due to the proliferation of GPS devices, which makes it possible to collect a large number of trajectories. However, so far there has been no efficient way to properly store and exploit them. In this thesis, we propose eight structures that meet two fundamental objectives. First, they are capable of storing the space-time data that describe the trajectories in reduced space, so that their exploitation takes advantage of the memory hierarchy. Second, the structures allow the information to be exploited through two kinds of queries: object queries, which, given an object, retrieve its position or trajectory over a time interval; and space-time range queries, which, given a region of space and a time interval, return the objects that were within the region during that time. It should be noted that state-of-the-art solutions are only capable of efficiently answering one of the two types of queries. All of these data structures share a common design built on two elements: snapshots and logs. Each snapshot works as a spatial index that periodically indexes the absolute position of each object or the Minimum Bounding Rectangle (MBR) of its trajectory. Snapshots serve to speed up spatio-temporal range queries. We have implemented two types of snapshots, based on k²-trees or on R-trees. The log represents the trajectory (sequence of movements) of each object. It is the main element of the structures, and facilitates the resolution of object and spatio-temporal range queries. Four strategies have been implemented to represent the log in compressed form: ScdcCT, GraCT, ContaCT and RCT. By combining these two elements we build eight different structures for the representation of trajectories. All of them have been implemented and evaluated experimentally, showing that they reduce the space required by traditional methods by up to two orders of magnitude. Furthermore, they are all competitive in solving object queries as well as spatio-temporal range queries.
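
    The snapshot-plus-log design in this abstract can be illustrated with a deliberately uncompressed sketch for a single object. It assumes one position sample per time unit and uses plain Python lists in place of the thesis's k²-tree or R-tree snapshots and its compressed logs (ScdcCT, GraCT, ContaCT, RCT), none of which are reproduced here. The names `TrajectoryStore` and `position_at` are hypothetical.

```python
class TrajectoryStore:
    """Single-object trajectory: periodic absolute snapshots plus a log of moves."""

    def __init__(self, positions, period):
        if period < 1 or not positions:
            raise ValueError("need at least one position and period >= 1")
        self.period = period
        self.length = len(positions)
        # Snapshot: absolute position every `period` time units.
        self.snapshots = positions[::period]
        # Log: relative movement from time i to time i + 1.
        self.log = [
            (x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])
        ]

    def position_at(self, t):
        """Object query: start at the nearest earlier snapshot, replay the log."""
        if not 0 <= t < self.length:
            raise IndexError("time outside recorded trajectory")
        s = t // self.period
        x, y = self.snapshots[s]
        for dx, dy in self.log[s * self.period:t]:
            x += dx
            y += dy
        return (x, y)


if __name__ == "__main__":
    path = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 2), (5, 2), (5, 3)]
    store = TrajectoryStore(path, period=3)
    print(store.position_at(4))
```

    The snapshot period trades space for query time: a shorter period stores more absolute positions but replays fewer log entries per query. Small relative movements are also what makes the log a good target for compression.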

    Perceptually-Driven Video Coding with the Daala Video Codec

    The Daala project is a royalty-free video codec that attempts to compete with the best patent-encumbered codecs. Part of our strategy is to replace core tools of traditional video codecs with alternative approaches, many of them designed to take perceptual aspects into account, rather than optimizing for simple metrics like PSNR. This paper documents some of our experiences with these tools, which ones worked and which did not. We evaluate which tools are easy to integrate into a more traditional codec design, and show results in the context of the codec being developed by the Alliance for Open Media.
    Comment: 19 pages, Proceedings of SPIE Workshop on Applications of Digital Image Processing (ADIP), 201

    Optimal Message-Passing with Noisy Beeps

    Beeping models are models for networks of weak devices, such as sensor networks or biological networks. In these networks, nodes are allowed to communicate only by emitting beeps: unary pulses of energy. Listening nodes have only the capability of carrier sensing: they can distinguish between the presence and absence of a beep, but receive no other information. The noisy beeping model further assumes that listening nodes may be disrupted by random noise. Despite this extremely restrictive communication model, it transpires that complex distributed tasks can still be performed by such networks. In this paper we provide an optimal procedure for simulating general message passing in the beeping and noisy beeping models. We show that a round of Broadcast CONGEST can be simulated in O(Δ log n) rounds of the noisy (or noiseless) beeping model, and a round of CONGEST can be simulated in O(Δ² log n) rounds (where Δ is the maximum degree of the network). We also prove lower bounds demonstrating that no simulation can use asymptotically fewer rounds. This allows a host of graph algorithms to be efficiently implemented in beeping models. As an example, we present an O(log n)-round Broadcast CONGEST algorithm for maximal matching, which, when simulated using our method, immediately implies a near-optimal O(Δ log² n)-round maximal matching algorithm in the noisy beeping model.

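    Space-Efficient Representation of Semi-structured Document Formats Using Succinct Data Structures (간결한 자료구조를 활용한 반구조화된 문서 형식들의 공간 효율적 표현법)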

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Srinivasa Rao Satti.
    Large volumes of data are generated from a plethora of sources. Most of the data stored as files do not follow a fixed schema, so the files are well suited to semi-structured document formats. A number of such formats, including XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora of data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for future processing. Since semi-structured document formats prioritize readability and verbosity, extra space is required to organize and maintain the documents. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinders future handling of the corpora, owing to the loss of internal structure. Succinct data structures, which answer queries while the encoded data occupy space close to the information-theoretic lower bound, are widely investigated in theory. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply the idea of succinct data structures to represent semi-structured documents in a space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core strengths of this representation are its compactness and its query-ability, derived from the rich functionality of succinct data structures. The compact representation is engineered by combining (a) bit-indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice, and show by experiments that it decreases disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, so that larger corpora of big data can be processed even in a constrained environment. In parallel with establishing this compact semi-structured document representation, we improve some existing compression schemes. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon existing universal code systems with the assistance of a succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array. We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which gives better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that the CPU parallel optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing in the experiments, compared to the previously existing bitmap index compression schemes.
    Contents: Chapter 1 Introduction (1.1 Contribution; 1.2 Organization). Chapter 2 Background (2.1 Model of Computation; 2.2 Succinct Data Structures). Chapter 3 Space-efficient Representation of Integer Arrays (3.1 Introduction; 3.2 Preliminaries: 3.2.1 Universal Code System, 3.2.2 Bit Vector; 3.3 Algorithm Description: 3.3.1 Main Principle, 3.3.2 Optimization in the Implementation; 3.4 Experimental Results). Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing (4.1 Introduction; 4.2 Related Work: 4.2.1 Byte-aligned Bitmap Code (BBC), 4.2.2 Word-Aligned Hybrid (WAH), 4.2.3 WAH-derived Algorithms, 4.2.4 GPU-based WAH Algorithms, 4.2.5 Super Byte-aligned Hybrid (SBH); 4.3 Parallelizing SBH: 4.3.1 CPU Parallelism, 4.3.2 GPU Parallelism; 4.4 Experimental Results: 4.4.1 Plain Version, 4.4.2 Parallelized Version, 4.4.3 Summary). Chapter 5 Space-efficient Representation of Semi-structured Document Formats (5.1 Preliminaries: 5.1.1 Semi-structured Document Formats, 5.1.2 Resource Description Framework, 5.1.3 Succinct Ordinal Tree Representations, 5.1.4 String Compression Schemes; 5.2 Representation: 5.2.1 Bit String Indexed Array, 5.2.2 Main Structure, 5.2.3 Single Document as a Collection of Chunks, 5.2.4 Supporting Queries; 5.3 Experimental Results: 5.3.1 Datasets, 5.3.2 Construction Time, 5.3.3 RAM Usage during Construction, 5.3.4 Disk Usage and Serialization Time, 5.3.5 Chunk Division, 5.3.6 String Compression, 5.3.7 Query Time). Chapter 6 Conclusion. Bibliography. Abstract in Korean. Acknowledgements.
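
    As background for the "universal code systems" this dissertation builds on, the sketch below shows Elias gamma coding, one well-known universal code for positive integers. The abstract does not say which universal codes the dissertation actually uses, so this choice is an assumption, as are the function names. Bits are kept as Python strings for readability rather than packed into words.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits: str) -> list:
    """Decode a concatenation of gamma codewords (assumes well-formed input)."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    stream = "".join(gamma_encode(v) for v in [1, 5, 9, 2])
    print(stream, gamma_decode_all(stream))
```

    Decoding here is strictly sequential: finding the k-th value requires scanning all earlier codewords. That limitation is what makes random access on the encoded array, which the abstract reports achieving with a succinct bit vector, a meaningful addition.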