
    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to another over data communication links with limited bandwidth, and we must have means to speedily find the information we need within this huge mass of data. Sometimes a single site may contain large collections of data, such as a library database, requiring an efficient search mechanism even within local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as DTDs or hyperlinks to enable more efficient and effective searching. This increases the file size considerably, underscoring the importance of applying text compression. For efficiency in both space and time, there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternate representations that take less space. Not only does it conserve storage space for archival and online data, it also improves system performance by requiring fewer secondary storage (disk or CD-ROM) accesses and improves network bandwidth utilization by reducing transmission time. Unlike static images or video, there is no international standard for text compression, although compressed formats such as .zip, .gz and .Z files are increasingly being used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are also useful for special classes of images, such as medical images, fingerprint data and astronomical images, and for databases containing mostly vital numerical data, tables and text. Many lossy algorithms use lossless methods at the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. To effectively utilize the full potential of compression techniques in future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or with only partial decompression, independent of whether the search is done on the text itself or on an inversion table corresponding to a set of keywords for the text. In this dissertation, we make the following contributions:

    (1) Star family compression algorithms: we propose an approach to develop a reversible transformation, applied to the source text, that improves existing algorithms' ability to compress it. We use a static dictionary to convert English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text, so we achieve some compression already at the preprocessing stage. We have developed a series of transforms that improve performance; the star transform requires a static dictionary of a certain size. To avoid the considerable complexity of conversion, we employ a ternary tree data structure that converts the words in the text to the words in the star dictionary in linear time.

    (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: we propose a method to extract useful context information in linear time from BWT-transformed text. The auxiliary arrays obtained from the BWT inverse transform yield logarithmic search time. Approximate pattern matching can then be performed on the results of exact pattern matching to extract candidates, and a fast verification algorithm can be applied to those candidates, which may be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT-compressed text. A typical BWT-based compression system has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach that replaces the Move-to-Front stage in order to extend compressed-domain search capability all the way to the entropy coding stage; this modification makes it possible to randomly access any part of the compressed text without referring to the part before the access point.

    (3) A modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: although many compression algorithms provide good compression ratios and/or time complexity, LZW was the first studied for compressed pattern matching because of its simplicity and efficiency. Our modifications to LZW provide the extra advantages of fast random access and partial decoding, which are especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that search can be performed at the expected level of granularity; for example, the user can choose to retrieve a single line, a paragraph, or a file that contains the keywords. More importantly, we show that parallel encoding and decoding are trivial with the modified LZW: both can be performed easily with multiple processors, and the encoding and decoding processes are independent of the number of processors.
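
    As a concrete illustration of contribution (1), the sketch below implements a toy reversible word-to-symbol transform in Python. The tiny STAR_DICT and the escaping rule are hypothetical stand-ins: the dissertation uses a large static English dictionary and a ternary search tree for linear-time lookup, neither of which is reproduced here.

        # Toy sketch of a reversible, dictionary-based "star" style transform.
        # STAR_DICT is a hypothetical miniature dictionary; a real scheme would
        # use a large static dictionary with a ternary-tree lookup structure.
        STAR_DICT = {"the": "*", "and": "**", "text": "*a", "compression": "*b"}
        INVERSE_DICT = {v: k for k, v in STAR_DICT.items()}

        def star_encode(text: str) -> str:
            # Escape literal '*' so decoding stays unambiguous, then replace
            # each dictionary word with its predefined symbol sequence.
            words = (w.replace("*", "\\*") for w in text.split())
            return " ".join(STAR_DICT.get(w, w) for w in words)

        def star_decode(encoded: str) -> str:
            # Invert the mapping, then restore any escaped literal '*'.
            words = (INVERSE_DICT.get(w, w) for w in encoded.split())
            return " ".join(w.replace("\\*", "*") for w in words)

        assert star_decode(star_encode("the text and the compression")) == \
            "the text and the compression"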

    DATA COMPRESSION USING EFFICIENT DICTIONARY SELECTION METHOD

With the increase in silicon densities, it is becoming feasible for compression systems to be implemented on-chip. A system with a distributed memory architecture is based on having data compression and decompression engines working independently on different data at the same time, with the data stored in memory distributed to each processor. The objective of the project is to design a lossless data compression system which operates at high speed to achieve a high compression rate. By using parallel compressors, the data compression rate is significantly improved, and the architecture is inherently scalable. The main parts of the system are the data compressors and the control blocks that provide control signals for the data compressors, allowing appropriate control of the routing of data into and out of the system. Each data compressor can process four bytes of data into and from a block of data in every clock cycle, so data entering the system needs to be clocked in at a rate of four bytes per clock cycle to ensure that adequate data is present for all compressors to process rather than standing idle.
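
    As a rough software analogue of the data routing described above, the sketch below splits the input into four-byte blocks and feeds them round-robin to independent compressor engines. The toy_compress routine is a hypothetical placeholder for the per-engine dictionary compressor, which in the actual project is a hardware block.

        from concurrent.futures import ThreadPoolExecutor

        BLOCK = 4  # each engine consumes four bytes per clock cycle

        def toy_compress(block: bytes) -> bytes:
            # Hypothetical stand-in for the hardware dictionary compressor;
            # it simply passes data through unchanged.
            return block

        def compress_parallel(data: bytes, engines: int = 4) -> list[bytes]:
            # Route four-byte blocks to the engines, mimicking the control
            # blocks that keep every compressor busy on each cycle.
            blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
            with ThreadPoolExecutor(max_workers=engines) as pool:
                return list(pool.map(toy_compress, blocks))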

    Reputation-aware Trajectory-based Data Mining in the Internet of Things (IoT)

The Internet of Things (IoT) is a critically important technology for the acquisition of spatiotemporally dense data in diverse applications, ranging from environmental monitoring to surveillance systems. Such data helps us improve our transportation systems, monitor our air quality and the spread of diseases, respond to natural disasters, and support a bevy of other applications. However, IoT sensor data is error-prone for a number of reasons: sensors may be deployed in hazardous environments, may deplete their energy resources, may have mechanical faults, or may become the targets of malicious attacks by adversaries. While previous research has attempted to improve the quality of IoT data, it is limited in terms of realizing the sensing context and resisting malicious attackers in real time. For instance, data fusion techniques that process the data in batches cannot be applied to time-critical applications, as they take a long time to respond. Furthermore, context-awareness allows us to examine the sensing environment and react to environmental changes. While previous research has considered geographical context, no related contemporary work has studied how a variety of sensing contexts (e.g., terrain elevation, wind speed, and user movement during sensing) can be used along with spatiotemporal relationships for online data prediction. This dissertation aims at developing online methods for data prediction by fusing spatiotemporal and contextual relationships among the participating resource-constrained mobile IoT devices (e.g., smartphones, smart watches, and fitness tracking devices). To achieve this goal, we first introduce a data prediction mechanism that considers the spatiotemporal and contextual relationships among the sensors. Second, we develop a real-time outlier detection approach stemming from a window-based sub-trajectory clustering method for finding behavioral movement similarity in terms of space, time, direction, and location semantics. Finally, we relax the prior assumption of cooperative sensors and develop a reputation-aware, context-based data fusion mechanism that exploits inter-sensor-category correlations. On one hand, this method can defend against false data injection by differentiating malicious and honest participants based on their reported data in real time; on the other hand, it yields a lower data prediction error rate.
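
    The window-based sub-trajectory comparison behind the outlier detector can be sketched as follows. The (t, x, y) point format, the plain Euclidean distance, and the fixed threshold are assumptions made for illustration; the dissertation's similarity measure also accounts for time, direction, and location semantics.

        import math

        Point = tuple[float, float, float]  # assumed (t, x, y) sample format

        def windows(traj: list[Point], size: int):
            # Slide a fixed-size window over a trajectory to yield sub-trajectories.
            for i in range(len(traj) - size + 1):
                yield traj[i:i + size]

        def window_distance(a: list[Point], b: list[Point]) -> float:
            # Mean point-wise spatial distance between two aligned windows.
            return sum(math.hypot(p[1] - q[1], p[2] - q[2])
                       for p, q in zip(a, b)) / len(a)

        def is_outlier(traj: list[Point], peers: list[list[Point]],
                       size: int = 5, threshold: float = 10.0) -> bool:
            # A device is flagged when none of its sub-trajectories is close
            # to any sub-trajectory reported by its peers.
            for w in windows(traj, size):
                if any(window_distance(w, pw) <= threshold
                       for peer in peers for pw in windows(peer, size)):
                    return False
            return True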

    When Machine Learning Meets Information Theory: Some Practical Applications to Data Storage

Machine learning and information theory are closely inter-related areas. In this dissertation, we explore topics in their intersection, with some practical applications to data storage. First, we explore how machine learning techniques can be used to improve data reliability in non-volatile memories (NVMs). NVMs, such as flash memories, store large volumes of data; however, as devices scale down towards small feature sizes, they suffer from various kinds of noise and disturbances, significantly reducing their reliability. This dissertation explores machine learning techniques to design decoders that make use of natural redundancy (NR) in data for error correction. By NR, we mean redundancy inherent in the data, not added artificially for error correction. This work studies two different schemes for NR-based error-correcting decoders. In the first scheme, the NR-based decoding algorithm is aware of the data representation scheme (e.g., compression, mapping of symbols to bits, metadata, etc.) and uses that information for error correction. In the second scheme, the NR-based decoder is oblivious of the representation scheme and uses deep neural networks (DNNs) to recognize the file type as well as perform soft decoding on it based on NR. In both cases, these NR-based decoders can be combined with traditional error-correcting codes (ECCs) to substantially improve their performance. Second, we use concepts from ECCs to design robust DNNs in hardware. Non-volatile memory devices such as memristors and phase-change memories are used to store the weights of hardware-implemented DNNs, and errors and faults in these devices (e.g., random noise, stuck-at faults, cell-level drift, etc.) can degrade the performance of such DNNs. We use concepts from analog error-correcting codes to protect the weights of noisy neural networks and to design robust neural networks in hardware. To summarize, this dissertation explores two important directions in the intersection of information theory and machine learning: we explore how machine learning techniques can be useful in improving the performance of ECCs, and conversely, we show how information-theoretic concepts can be used to design robust neural networks in hardware.
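
    As the simplest possible instance of the analog error-correction idea, the sketch below protects a weight vector with a repetition code and decodes by coordinate-wise median voting, which a single stuck-at or outlier copy cannot move. The dissertation's actual codes are more sophisticated; this is only an assumed illustration.

        import numpy as np

        def protect(weights: np.ndarray, copies: int = 3) -> np.ndarray:
            # Store each weight several times (a trivial analog repetition code).
            return np.tile(weights, (copies, 1))

        def recover(stored: np.ndarray) -> np.ndarray:
            # Coordinate-wise median voting tolerates one corrupted copy,
            # unlike plain averaging.
            return np.median(stored, axis=0)

        w = np.array([0.5, -1.2, 0.8])
        stored = protect(w)
        stored[1, 2] = 7.0            # simulate a stuck-at fault in one copy
        print(recover(stored))        # recovers ~[0.5, -1.2, 0.8]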

    THE GUT MICROBIOTA OF A WILD AMERICAN BLACK BEAR (Ursus americanus) POPULATION

The gut microbiome (GMB), the mutualistic microbial communities located in the gastrointestinal tract (GIT), has co-evolved in vertebrates to perform micro-ecosystem services that facilitate physiological functions. Despite the key role of the GMB in host health, wildlife managers have been slow to consider the ramifications of anthropogenic pressures for wildlife-GMB diversity. For example, although diet is one of the most influential determinants of a host’s GMB, many wildlife agencies allow baiting with human-provisioned foods to facilitate the harvest of wildlife such as the American black bear (Ursus americanus). Additionally, much of our knowledge of wildlife-GMB relationships is based on studies of colon GMB communities derived from the feces of captive specimens. To better understand wildlife-GMB relationships, I first aimed to characterize wild black bear GMB communities in the colon and jejunum, two functionally distinct regions of the GIT. Second, I estimated the proportional contribution of human-provisioned foods to the diets of black bears and evaluated the effect of human-provisioned foods on the GMB at each GIT site. I engaged hunters as citizen scientists to collect biological samples from legally harvested black bears, used 16S rRNA gene amplicon sequencing to identify microbial taxa, and used stable isotope analysis of black bear hair to estimate diet. My results suggest that the jejunum and colon of black bears do not harbor significantly different GMB communities, but that an increased proportion of human-provisioned foods in the black bear diet, specifically corn, significantly reduces GMB diversity.
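
    For reference, GMB diversity comparisons such as the one above are commonly quantified with an alpha-diversity index computed from per-taxon read counts. The abstract does not name the specific metric used, so the Shannon index below is only an assumed illustration.

        import math

        def shannon_index(counts: list[int]) -> float:
            # H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts.
            total = sum(counts)
            return -sum((c / total) * math.log(c / total)
                        for c in counts if c > 0)

        print(shannon_index([90, 5, 5]))    # ~0.39: dominated, low diversity
        print(shannon_index([34, 33, 33]))  # ~1.10: even, high diversity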