
    Compressing Genome Resequencing Data

    Recent improvements in high-throughput next generation sequencing (NGS) technologies have led to an exponential increase in the number, size and diversity of available complete genome sequences. This poses major problems for the storage, transmission and analysis of such genomic sequence data. Thus, substantial effort has been made to develop effective data compression techniques that reduce storage requirements, improve transmission speed, and allow compressed sequences to be analyzed for information about genomic structure or relationships between genomes from multiple organisms.

    In this thesis, we study the problem of lossless compression of genome resequencing data using a reference-based approach. The thesis is divided into two major parts. In the first part, we perform a detailed empirical analysis of a recently proposed compression scheme called MLCX (Maximal Longest Common Substring/Subsequence). This led to a novel decomposition technique that resulted in enhanced compression using MLCX. In the second part, we propose SMLCX, a new reference-based lossless compression scheme that builds on MLCX. This scheme performs compression by encoding common substrings in a sorted order, which significantly improves compression performance over the original MLCX method. Using SMLCX, we compressed the Homo sapiens genome from its original size of 3,080,436,051 bytes to 6,332,488 bytes, for an overall compression ratio of 486. This compares favorably with current state-of-the-art methods, which achieve compression ratios of 157 (Wang et al., Nucleic Acids Research, 2011), 171 (Pinho et al., Nucleic Acids Research, 2011) and 360 (Beal et al., BMC Genomics, 2016).
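    The reference-based idea can be illustrated with a toy encoder. The sketch below is a minimal reconstruction of the general technique, not the MLCX or SMLCX algorithms themselves: it greedily replaces substrings shared with the reference by (position, length) pointers and falls back to literal characters elsewhere, which is enough to make the round trip lossless.

        # Toy reference-based encoder: copy pointers into the reference
        # plus literals. Real schemes find maximal common substrings with
        # suffix structures or hashing; this naive search is O(n*m).

        def encode(reference: str, target: str, min_match: int = 4):
            ops, i = [], 0
            while i < len(target):
                best_pos, best_len = -1, 0
                for j in range(len(reference)):
                    k = 0
                    while (j + k < len(reference) and i + k < len(target)
                           and reference[j + k] == target[i + k]):
                        k += 1
                    if k > best_len:
                        best_pos, best_len = j, k
                if best_len >= min_match:
                    ops.append(("copy", best_pos, best_len))
                    i += best_len
                else:
                    ops.append(("lit", target[i]))
                    i += 1
            return ops

        def decode(reference: str, ops) -> str:
            out = []
            for op in ops:
                if op[0] == "copy":
                    _, pos, length = op
                    out.append(reference[pos:pos + length])
                else:
                    out.append(op[1])
            return "".join(out)

        ref = "ACGTACGTTTGACCA"
        tgt = "ACGTACGATTGACCA"   # one substitution relative to ref
        assert decode(ref, encode(ref, tgt)) == tgt   # lossless round trip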

    Efficient Update of Indexes for Dynamically Changing Web Documents

    Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web, and a complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates has been restricted to adding and removing documents; updating the inverted index for previously indexed documents that have changed has not been addressed. In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the diff algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index.
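    A minimal sketch of the landmark idea follows (an illustrative reconstruction under assumed data structures, not the paper's exact ones): postings record positions relative to landmarks, so when a diff reports an insertion, only the landmark table and the postings of the edited block need updating, not every posting in the document.

        # Postings keyed to landmarks: term -> [(landmark_id, offset)].
        class LandmarkDoc:
            def __init__(self, terms, block=4):
                self.landmarks = []        # landmark_id -> absolute position
                self.postings = {}
                for pos, term in enumerate(terms):
                    if pos % block == 0:
                        self.landmarks.append(pos)
                    lid = pos // block
                    self.postings.setdefault(term, []).append(
                        (lid, pos - self.landmarks[lid]))

            def positions(self, term):
                return [self.landmarks[lid] + off
                        for lid, off in self.postings.get(term, [])]

            def shift_landmarks(self, at, n):
                # A diff reports n terms inserted at absolute position `at`:
                # shift every landmark at or beyond it. (The full method also
                # re-indexes the postings of the one block containing the
                # edit; omitted here.)
                self.landmarks = [p + n if p >= at else p
                                  for p in self.landmarks]

        doc = LandmarkDoc("the quick fox jumps over the lazy dog".split())
        doc.shift_landmarks(at=4, n=3)   # insertion at the start of block 1
        print(doc.positions("dog"))      # -> [10]; "dog" postings untouched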

    Fast and Efficient Malware Detection with Joint Static and Dynamic Features Through Transfer Learning

    In malware detection, dynamic analysis extracts the runtime behavior of malware samples in a controlled environment, while static analysis extracts features using reverse engineering tools. The former faces the challenges of anti-virtualization and evasive behavior of malware samples, while the latter faces the challenge of code obfuscation. To tackle these drawbacks, prior works proposed to develop detection models that aggregate dynamic and static features, thus leveraging the advantages of both approaches. However, simply concatenating dynamic and static features raises an issue of imbalanced contribution to the performance of malware detection models, due to the heterogeneous dimensions of the feature vectors. Moreover, dynamic analysis is a time-consuming task that requires a secure environment, leading to detection delays and high costs for maintaining the analysis infrastructure. In this paper, we first introduce a method of constructing aggregated features by concatenating latent features learned through deep learning, with equally-contributed dimensions. We then develop a knowledge distillation technique to transfer knowledge learned from aggregated features by a teacher model to a student model trained only on static features, and use the trained student model to detect new malware samples. We carry out extensive experiments with a dataset of 86,709 samples including both benign and malware samples. The experimental results show that the teacher model trained on aggregated features constructed by our method outperforms the state-of-the-art models with an improvement of up to 2.38% in detection accuracy. The distilled student model not only achieves accuracy (97.81%) comparable to that of the teacher model but also significantly reduces detection time (from 70,046.6 ms to 194.9 ms) without requiring dynamic analysis.

    Comment: Accepted for presentation and publication at the 21st International Conference on Applied Cryptography and Network Security (ACNS 2023).
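    A minimal PyTorch sketch of the distillation step described above (layer sizes, feature dimensions, temperature T and mixing weight alpha are illustrative assumptions, not the paper's values): the teacher scores aggregated latent features with equally-contributed dimensions, while the student sees only static features and is trained to match both the labels and the teacher's softened predictions.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        STATIC_DIM, DYNAMIC_DIM, LATENT = 256, 256, 128

        # Encoders map each feature type to latent vectors of equal size.
        static_enc = nn.Sequential(nn.Linear(STATIC_DIM, LATENT), nn.ReLU())
        dynamic_enc = nn.Sequential(nn.Linear(DYNAMIC_DIM, LATENT), nn.ReLU())

        teacher = nn.Sequential(nn.Linear(2 * LATENT, 64), nn.ReLU(),
                                nn.Linear(64, 2))
        student = nn.Sequential(nn.Linear(STATIC_DIM, 64), nn.ReLU(),
                                nn.Linear(64, 2))

        def distill_loss(static_x, dynamic_x, labels, T=4.0, alpha=0.5):
            with torch.no_grad():
                # Teacher input: equal-dimension latents, concatenated.
                agg = torch.cat([static_enc(static_x),
                                 dynamic_enc(dynamic_x)], dim=1)
                soft_targets = F.softmax(teacher(agg) / T, dim=1)
            logits = student(static_x)       # student: static features only
            hard = F.cross_entropy(logits, labels)
            soft = F.kl_div(F.log_softmax(logits / T, dim=1), soft_targets,
                            reduction="batchmean") * T * T
            return alpha * hard + (1 - alpha) * soft

        x_s, x_d = torch.randn(8, STATIC_DIM), torch.randn(8, DYNAMIC_DIM)
        y = torch.randint(0, 2, (8,))
        print(distill_loss(x_s, x_d, y))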

    Evaluating SNO+ backgrounds through ²²²Rn assays and the simulation of ¹³C(α,n)¹⁶O reactions during the water phase

    SNO+ is a large multipurpose neutrino detector searching for rare interactions. Some backgrounds come from naturally occurring ²²²Rn and its daughters within the ²³⁸U chain. Under development are cryogenic trapping assay systems which will monitor the ²²²Rn levels within scintillator, water, N₂ cover gas, and small detector materials. These systems can measure concentrations up to 8×10⁻⁵ Rn atoms/L (1.6×10⁻¹⁷ g ²³⁸U/g LAB) within scintillator, and 4.75×10⁻¹⁵ g ²³⁸U/g H₂O in water. The status of the assay systems is discussed within. ¹³C(α,n)¹⁶O reactions occur from ²²²Rn's progeny ²¹⁰Po, but Monte Carlo simulations predict < 0.4 events within the SNO+ fiducial volume of 5.5 m and the 5-9 MeV region of interest over a 9-month running period within water.
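    The quoted equivalence between a ²²²Rn concentration and a ²³⁸U mass fraction follows from secular equilibrium, in which the two isotopes have equal activity. A quick back-of-the-envelope check (not from the thesis; it assumes an LAB density of roughly 0.86 g/mL) reproduces the number in the abstract:

        import math

        N_A = 6.022e23
        HALF_RN = 3.8235 * 86400       # Rn-222 half-life [s]
        HALF_U = 4.468e9 * 3.156e7     # U-238 half-life [s]
        RHO_LAB = 860.0                # approx. LAB density [g/L] (assumed)

        n_rn = 8e-5                                  # Rn atoms per litre
        activity = n_rn * math.log(2) / HALF_RN      # A = lambda * N [Bq/L]
        n_u = activity * HALF_U / math.log(2)        # equal activity => N_U
        grams_u_per_g = n_u * 238.0 / N_A / RHO_LAB

        print(f"{grams_u_per_g:.2g} g U-238 per g LAB")  # ~1.6e-17, as quoted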

    Detection and management of redundancy for information retrieval

    The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text.

    We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections.

    We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done.

    In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies.

    Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage.
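    Document fingerprinting in its generic form can be sketched in a few lines (an illustration of the technique, not the thesis' lossless algorithm or the grainy hash vector): hash overlapping word n-grams and keep a deterministic subset, so that documents sharing passages of text also share fingerprints.

        import hashlib

        def fingerprints(text, n=8, mod=4):
            words = text.lower().split()
            chunks = (" ".join(words[i:i + n])
                      for i in range(len(words) - n + 1))
            hashes = {int(hashlib.md5(c.encode()).hexdigest(), 16)
                      for c in chunks}
            # "0 mod p" selection keeps ~1/mod of chunks, chosen identically
            # for every document, so shared text yields shared fingerprints.
            return {h for h in hashes if h % mod == 0}

        def resemblance(a, b):
            fa, fb = fingerprints(a), fingerprints(b)
            return len(fa & fb) / max(1, len(fa | fb))

        a = "the quick brown fox jumps over the lazy dog " * 3
        b = "some unrelated preamble text here " + a
        print(f"{resemblance(a, b):.2f}")   # high score: b derives from a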

    Automatic text summarization using lexical chains : algorithms and experiments

    Summarization is a complex task that requires understanding of the document content to determine the importance of the text. Lexical cohesion is a method to identify connected portions of a text based on the relations between its words. Lexical cohesive relations can be represented using lexical chains: sequences of semantically related words spread over the entire text. Lexical chains are used in a variety of Natural Language Processing (NLP) and Information Retrieval (IR) applications. In this thesis, we propose a lexical chaining method that includes glossary relations in the chaining process. These relations enable us to identify topically related concepts, for instance dormitory and student, and thereby enhance the identification of cohesive ties in the text.

    We then present methods that use the lexical chains to generate summaries by extracting sentences from the document(s). Headlines are generated by filtering out the portions of the extracted sentences that do not contribute to their meaning; such headlines can be used in real-world applications to skim through document collections in a digital library. Multi-document summarization is gaining demand with the explosive growth of online news sources, and requires identification of the several themes present in a collection to attain good compression and avoid redundancy. In this thesis, we propose methods to group portions of the texts of a document collection into meaningful clusters. Clustering enables us to extract the various themes of the collection; sentences from clusters can then be extracted to generate a summary for the multi-document collection. Clusters can also be used to generate summaries with respect to a given query.

    We designed a system to compute lexical chains for a given text and use them to extract the salient portions of the document. The specific tasks considered are headline generation, multi-document summarization, and query-based summarization. Our experimental evaluation shows that effective summaries can be extracted for these tasks.
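    A naive lexical-chaining pass over WordNet illustrates the core idea (the thesis' method additionally exploits glossary relations, which this sketch does not model; it assumes nltk with the WordNet corpus downloaded):

        from nltk.corpus import wordnet as wn

        def related(w1, w2):
            """True if the words share a synset or are linked by a direct
            hypernym relation, a crude stand-in for lexical cohesion."""
            s1 = set(wn.synsets(w1, pos=wn.NOUN))
            s2 = set(wn.synsets(w2, pos=wn.NOUN))
            if s1 & s2:
                return True
            hypers = {h for s in s1 for h in s.hypernyms()} | \
                     {h for s in s2 for h in s.hypernyms()}
            return bool((s1 | s2) & hypers)

        def build_chains(nouns):
            chains = []
            for w in nouns:
                for chain in chains:
                    if any(related(w, member) for member in chain):
                        chain.append(w)
                        break
                else:
                    chains.append([w])   # no cohesive tie: start a new chain
            return chains

        # "car" and "automobile" share a synset and land in one chain.
        print(build_chains(["car", "automobile", "wheel", "student"]))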

    IST Austria Thesis

    A proof system is a protocol between a prover and a verifier over a common input, in which an honest prover convinces the verifier of the validity of true statements. Motivated by the success of decentralized cryptocurrencies, exemplified by Bitcoin, the focus of this thesis is on proof systems that have found applications in sustainable alternatives to Bitcoin, such as the Spacemint and Chia cryptocurrencies. In particular, we focus on proofs of space and proofs of sequential work.

    Proofs of space (PoSpace) were suggested as a more ecological, economical, and egalitarian alternative to the energy-wasteful proof-of-work mining of Bitcoin. However, the state-of-the-art constructions of PoSpace are based on sophisticated graph pebbling lower bounds, and are therefore complex. Moreover, when these PoSpace are used in cryptocurrencies like Spacemint, miners can only start mining after ensuring that a commitment to their space has been added in a special transaction to the blockchain. Proofs of sequential work (PoSW) are proof systems in which a prover, upon receiving a statement x and a time parameter T, computes a proof which convinces the verifier that T time units have passed since x was received. Whereas Spacemint assumes synchrony to retain some interesting Bitcoin dynamics, Chia requires PoSW with unique proofs, i.e., PoSW in which it is hard to come up with more than one accepting proof for any true statement.

    In this thesis we construct simple and practically-efficient PoSpace and PoSW. When our PoSpace are used in cryptocurrencies, miners can start mining on the fly, as in Bitcoin. And unlike current constructions of PoSW, which achieve either efficient verification of sequential work or faster-than-recomputing verification of proof correctness, but not both at the same time, ours achieve the best of both worlds.
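    A toy hash-chain PoSW conveys the flavour of the primitive (illustrative only: here the verifier still recomputes sampled segments, whereas the constructions in the thesis achieve efficient verification via graph-based structures): the prover iterates a hash T times from the statement and publishes periodic checkpoints, and the verifier spot-checks randomly chosen segments.

        import hashlib, random

        def H(b: bytes) -> bytes:
            return hashlib.sha256(b).digest()

        def prove(x: bytes, T: int, every: int = 1000):
            # Sequential by construction: each step needs the previous one.
            state, checkpoints = H(x), []
            for i in range(1, T + 1):
                state = H(state)
                if i % every == 0:
                    checkpoints.append(state)
            return checkpoints

        def spot_check(x: bytes, checkpoints, every=1000, samples=3):
            # Recompute a few random segments between checkpoints.
            prev = [H(x)] + checkpoints[:-1]
            for idx in random.sample(range(len(checkpoints)), samples):
                state = prev[idx]
                for _ in range(every):
                    state = H(state)
                if state != checkpoints[idx]:
                    return False
            return True

        proof = prove(b"statement", T=10_000)
        assert spot_check(b"statement", proof)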