Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work.
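The abstract's premise, that standard Lempel-Ziv (gzip) compression rarely achieves good ratios on DNA, can be illustrated with a minimal sketch. The synthetic sequence below stands in for real GenBank data, and the numbers are illustrative only:

```python
import random
import zlib

# Synthetic DNA: 4 symbols, so the entropy floor is 2 bits per base
# (0.25 bytes per base). A DNA-aware coder can approach or beat this;
# a byte-oriented LZ coder such as zlib/gzip typically stays above it.
random.seed(0)
n = 100_000
seq = "".join(random.choice("ACGT") for _ in range(n)).encode("ascii")

compressed = zlib.compress(seq, level=9)
bits_per_base = 8 * len(compressed) / n
print(f"zlib: {bits_per_base:.2f} bits/base (2.00 is the random-DNA floor)")
```

The gap between the achieved bits per base and the 2-bit floor is the kind of headroom that specialised database compressors such as coil target.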
Large scale evaluations of multimedia information retrieval: the TRECVid experience
Information Retrieval is a supporting technique which underpins a broad range of content-based applications including retrieval, filtering, summarisation, browsing, classification, clustering, automatic linking, and others. Multimedia information retrieval (MMIR) represents those applications when applied to multimedia information such as image, video, music, etc. In this presentation and extended abstract we are primarily concerned with MMIR as applied to information in digital video format. We begin with a brief overview of large-scale evaluations of IR tasks in areas such as text, image and music, to illustrate that this phenomenon is not restricted to MMIR on video. The main contribution, however, is a set of pointers to, and a summary of, the work done as part of TRECVid, the annual benchmarking exercise for video retrieval tasks.
Content-based access to digital video: the Físchlár system and the TREC video track
This short paper presents an overview of the Físchlár system, an operational digital library of several hundred hours of video content at Dublin City University which is used by over 1,000 users daily for a variety of applications. The paper describes how Físchlár operates and the services that it provides for users. The second part of the paper then outlines the TREC Video Retrieval track, a benchmarking exercise for information retrieval from video content currently in operation, and summarises how that exercise operates.
Encoding Sequential Information in Vector Space Models of Semantics: Comparing Holographic Reduced Representation and Random Permutation
Encoding information about the order in which words typically appear has been shown to improve the performance of high-dimensional semantic space models. This requires an encoding operation capable of binding together vectors in an order-sensitive way, and efficient enough to scale to large text corpora. Although both circular convolution and random permutations have been enlisted for this purpose in semantic models, these operations have never been systematically compared. In Experiment 1 we compare their storage capacity and probability of correct retrieval; in Experiments 2 and 3 we compare their performance on semantic tasks when integrated into existing models. We conclude that random permutations are a scalable alternative to circular convolution with several desirable properties.
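The order-sensitive binding that the abstract compares can be sketched in a few lines. This toy demo (the dimension, vectors, and permutation roles are illustrative choices, not the paper's exact setup) shows random permutations making word order recoverable from a single memory vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random index vectors for three words.
dog, bites, man = (rng.standard_normal(d) for _ in range(3))

# Two fixed random permutations mark "one position to the left" and
# "one position to the right" of a focus word; permuting before summing
# is what makes the encoding order-sensitive.
left, right = rng.permutation(d), rng.permutation(d)

# Memory vector for "bites" accumulated from the context "dog bites man".
memory = dog[left] + man[right]

def sim(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "dog" occurred to the left of "bites"; "man" did not, so probing the
# memory with each word's left-permuted vector separates the two cases.
print(sim(dog[left], memory))   # high
print(sim(man[left], memory))   # near zero
```

Because a permutation is just an index shuffle, encoding is O(d) per binding, which is the scalability advantage the abstract attributes to random permutations over FFT-based circular convolution.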
On Constructing Persistent Identifiers with Persistent Resolution Targets
Persistent Identifiers (PIDs) are the foundation for referencing digital assets in scientific publications, books, and digital repositories. In their realization, PIDs contain metadata and resolution targets in the form of URLs that point to data sets located on the network. Unlike the PIDs themselves, these target URLs typically change over time; thus, PIDs need continuous maintenance -- an effort that is increasing tremendously with the advancement of e-Science and the advent of the Internet-of-Things (IoT). Nowadays, billions of sensors and data sets are subject to PID assignment. This paper presents a new approach of embedding location-independent targets into PIDs, allowing the creation of maintenance-free PIDs using content-centric network technology and overlay networks. To prove the validity of the presented approach, the Handle PID System is used in conjunction with Magnet Link access information encoding, state-of-the-art decentralized data distribution with BitTorrent, and Named Data Networking (NDN) as a location-independent data access technology for networks. In contrast to existing approaches, no green-field implementation of PIDs and no major modifications of the Handle System are required to enable location-independent data dissemination with maintenance-free PIDs.
Comment: Published IEEE paper of the FedCSIS 2016 (SoFAST-WS'16) conference, 11-14 September 2016, Gdansk, Poland. Also available online:
http://ieeexplore.ieee.org/document/7733372
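The location-independent targets described above rest on content addressing: the identifier is derived from the bytes themselves, so it never needs updating when the hosting URL changes. A minimal sketch of the idea (the dataset bytes and the `dn` display name are invented for illustration; a real BitTorrent infohash is the SHA-1 of the torrent's bencoded info dictionary, not of the raw file):

```python
import hashlib

# A dataset's identity is its content hash, not its location. Anyone
# holding the same bytes derives the same identifier, so a PID that
# embeds it needs no maintenance when the data moves between hosts.
data = b"ACGT" * 1000  # stand-in for a dataset's bytes

digest = hashlib.sha1(data).hexdigest()

# BitTorrent-style magnet link: the urn:btih field names the content
# itself, and any peer serving those bytes can satisfy the reference.
magnet = f"magnet:?xt=urn:btih:{digest}&dn=example-dataset"
print(magnet)
```

This determinism is what the paper exploits: a Handle PID whose target is such a content-derived link resolves correctly regardless of where the data currently lives.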
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning
Visual question answering requires high-order reasoning about an image, which
is a fundamental capability needed by machine systems to follow complex
directives. Recently, modular networks have been shown to be an effective
framework for performing visual reasoning tasks. While modular networks were
initially designed with a degree of model transparency, their performance on
complex visual reasoning benchmarks was lacking. Current state-of-the-art
approaches do not provide an effective mechanism for understanding the
reasoning process. In this paper, we close the performance gap between
interpretable models and state-of-the-art visual reasoning methods. We propose
a set of visual-reasoning primitives which, when composed, manifest as a model
capable of performing complex reasoning tasks in an explicitly-interpretable
manner. The fidelity and interpretability of the primitives' outputs enable an
unparalleled ability to diagnose the strengths and weaknesses of the resulting
model. Critically, we show that these primitives are highly performant,
achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show
that our model is able to effectively learn generalized representations when
provided a small amount of data containing novel object attributes. Using the
CoGenT generalization task, we show more than a 20 percentage point improvement
over the current state of the art.
Comment: CVPR 2018 pre-print
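The appeal of composable, inspectable primitives can be sketched outside the neural setting. This toy symbolic version (the scene and primitive names are invented here, not the paper's neural modules) mirrors the structure of a CLEVR-style query such as "how many red cubes are there?", where every intermediate result can be examined directly:

```python
# Each primitive maps a collection of objects to another collection (or a
# number), so every intermediate output along the reasoning chain can be
# inspected -- the transparency property the abstract argues for.
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def filter_shape(objs, shape):
    return [o for o in objs if o["shape"] == shape]

def count(objs):
    return len(objs)

# "How many red cubes?" as an explicit composition of primitives.
answer = count(filter_shape(filter_color(scene, "red"), "cube"))
print(answer)  # → 1
```

In the paper's setting the primitives operate on attention maps rather than symbolic object lists, but the compositional structure, and the ability to diagnose the model by reading intermediate outputs, is the same.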