
    Scoop: An Adaptive Indexing Scheme for Stored Data in Sensor Networks

    In this paper, we present the design of Scoop, a system for indexing and querying stored data in sensor networks. Scoop works by collecting statistics about the rate of queries and the distribution of sensor readings over a sensor network, and uses those statistics to build an index that tells nodes where in the network to store their readings. Using this index, a user's queries over that stored data can be answered efficiently, without flooding those queries throughout the network. This approach offers a substantial advantage over other solutions that either store all data externally on a basestation (requiring every reading to be collected from all nodes), or that store all data locally on the node that produced it (requiring queries to be flooded throughout the network). Our results, in fact, show that Scoop offers a factor of four improvement over existing techniques in a real implementation on a 64-node mote-based sensor network. These results also show that Scoop is able to efficiently adapt to changes in the distribution and rates of data and queries.
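
    The abstract does not describe Scoop's index structure in detail; purely as a hypothetical illustration of an index that "tells nodes where in the network to store their readings", the Python sketch below maps reading values to storage nodes through a small bucket table, so a range query only visits the owners of the overlapping buckets instead of flooding the network. The bucket boundaries and node IDs are invented.

        # Hypothetical sketch of a value-partitioned storage index, loosely inspired
        # by the Scoop abstract; bucket boundaries and node IDs are made up.
        from bisect import bisect_right

        # Index disseminated to all nodes: value-range boundaries and, for each
        # bucket, the node responsible for storing readings in that bucket.
        BUCKET_BOUNDS = [10.0, 20.0, 30.0]   # splits the value domain into 4 buckets
        BUCKET_OWNERS = [3, 7, 12, 19]       # storage node per bucket (invented IDs)

        def storage_node_for(value: float) -> int:
            """Return the node that should store a reading with this value."""
            return BUCKET_OWNERS[bisect_right(BUCKET_BOUNDS, value)]

        def nodes_for_range(lo: float, hi: float) -> list[int]:
            """Return the nodes a range query [lo, hi] must visit (no flooding)."""
            first = bisect_right(BUCKET_BOUNDS, lo)
            last = bisect_right(BUCKET_BOUNDS, hi)
            return BUCKET_OWNERS[first:last + 1]

        # A reading of 24.5 is shipped to node 12; a query for values in [15, 25]
        # contacts only nodes 7 and 12 instead of flooding every mote.
        print(storage_node_for(24.5))    # -> 12
        print(nodes_for_range(15, 25))   # -> [7, 12]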

    Adaptive indexing scheme for stored data in sensor networks

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 53-56). We present the design of Scoop, a system designed to efficiently store and query relational data collected by nodes in a bandwidth-constrained sensor network. Sensor networks allow remote environments to be monitored at very fine levels of granularity; often such monitoring deployments generate large amounts of data which may be impractical to collect due to bandwidth limitations, but which can easily be stored in-network for some period of time. Existing approaches to querying stored data in sensor networks have typically assumed that all data either is stored locally, at the node that produced it, or is hashed to some location in the network using a predefined uniform hash function. These two approaches are at the extremes of a trade-off between storage and query costs. In the former case, the costs of storing data are low, since no transmissions are required, but queries must flood the entire network. In the latter case, some queries can be executed efficiently by using the hash function to find the nodes of interest, but storage is expensive as readings must be transmitted to some (likely far away) location in the network. In contrast, Scoop monitors changes in the distribution of sensor readings, queried values, and network connectivity to determine the best location to store data. We formulate this as an optimization problem and present a practical algorithm that solves this problem in Scoop. We have built a complete implementation of Scoop for TinyOS mote [1] sensor network hardware and evaluated its performance on a 60-node testbed and in the TinyOS simulator, TOSSIM. Our results show that Scoop not only provides substantial performance benefits over alternative approaches on a range of data sets, but is also able to efficiently adapt to changes in the distribution and rates of data and queries. By Thomer M. Gil. S.M.
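
    The thesis formulates storage placement as an optimization over storage and query costs, but the abstract does not give the cost model; the sketch below is only a guess at its general shape, comparing the two extremes described above (store locally vs. store at one hashed location) under an invented hop-count matrix, data rates, and query rate.

        # Hedged sketch of the storage-vs-query trade-off; this is not Scoop's
        # actual cost model or algorithm, and all numbers below are invented.
        def expected_cost(placement, data_rate, query_rate, hops):
            """Expected transmissions per unit time for a candidate placement.

            placement  -- mapping: producing node -> storage node
            data_rate  -- mapping: producing node -> readings per unit time
            query_rate -- queries per unit time (assume each query visits every
                          storage node in use, starting from the basestation)
            hops       -- hops[a][b]: hop count between nodes a and b
            """
            storage_cost = sum(rate * hops[src][placement[src]]
                               for src, rate in data_rate.items())
            query_cost = query_rate * sum(hops["base"][n]
                                          for n in set(placement.values()))
            return storage_cost + query_cost

        # Toy 3-node network plus a basestation.
        hops = {
            "base": {"base": 0, 1: 1, 2: 2, 3: 3},
            1: {1: 0, 2: 1, 3: 2, "base": 1},
            2: {1: 1, 2: 0, 3: 1, "base": 2},
            3: {1: 2, 2: 1, 3: 0, "base": 3},
        }
        data_rate = {1: 5.0, 2: 1.0, 3: 0.5}
        local = {1: 1, 2: 2, 3: 3}    # store locally: cheap storage, queries fan out
        central = {1: 2, 2: 2, 3: 2}  # hash to one node: cheap queries, costly storage
        for name, placement in [("local", local), ("central", central)]:
            print(name, expected_cost(placement, data_rate, query_rate=2.0, hops=hops))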

    Complications in Climate Data Classification: The Political and Cultural Production of Variable Names

    Model intercomparison projects are a unique and highly specialized form of data-intensive collaboration in the earth sciences. Typically, a set of pre-determined boundary conditions (scenarios) is agreed upon by a community of model developers, who then test and simulate each of those scenarios with individual ‘runs’ of a climate model. Because both the human expertise and the computational power needed to produce an intercomparison project are exceptionally expensive, the data they produce are often archived for the broader climate science community to use in future research. Outside of high energy physics and astronomy sky surveys, climate modeling intercomparisons are one of the largest and most rapid methods of producing data in the natural sciences (Overpeck et al., 2010). But, like any collaborative eScience project, the discovery and broad accessibility of these data depend on classifications and categorizations in the form of structured metadata, namely the Climate and Forecast (CF) metadata standard, which provides a controlled vocabulary to normalize the naming of a dataset’s variables. Intriguingly, the CF standard’s original publication notes, “…conventions have been developed only for things we know we need. Instead of trying to foresee the future, we have added features as required and will continue to do this” (Gregory, 2003). Yet, qualitatively we’ve observed that this is not the case: although the time period of intercomparison projects remains stable (2-3 years), the scale and complexity of models and their output continue to grow, and thus data creation and variable names consistently outpace the ratification of CF.
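
    As a minimal illustration of what the CF controlled vocabulary provides (and where it falls behind), the sketch below checks a few variable names against a tiny hand-picked subset of CF standard names; the real standard-name table contains thousands of entries and is distributed as a versioned XML file, and "my_new_aerosol_diag" is a made-up example of model output with no ratified name yet.

        # Tiny stand-in for the CF standard-name table; the three entries are real
        # CF standard names, but the full table is far larger and versioned.
        CF_STANDARD_NAMES = {
            "air_temperature",
            "precipitation_flux",
            "sea_surface_temperature",
        }

        # Variable short names as they might appear in model output, mapped to the
        # CF standard_name attribute the producer assigned (or could not assign).
        dataset_variables = {
            "tas": "air_temperature",
            "pr": "precipitation_flux",
            "my_new_aerosol_diag": None,   # hypothetical output with no ratified name
        }

        for var, std_name in dataset_variables.items():
            if std_name in CF_STANDARD_NAMES:
                print(f"{var}: OK ({std_name})")
            else:
                print(f"{var}: no ratified CF standard_name; "
                      "variable naming has outpaced the controlled vocabulary")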

    Supporting the long‐term curation and migration of natural history museum collections databases

    Migration of data collections from one platform to another is an important component of data curation – yet there is surprisingly little guidance for information professionals faced with this task. Data migration may be particularly challenging when these data collections are housed in relational databases, due to the complex ways that data, data schemas, and relational database management software become intertwined over time. Here we present results from a study of the maintenance, evolution and migration of research databases housed in natural history museums. We find that database migration is an on-going – rather than occasional – process for many collection managers, and that they creatively appropriate and innovate on many existing technologies in their migration work. This paper contributes descriptions of a preliminary set of common adaptations and “migration patterns” in the practices of database curators. It also outlines the strategies they use when facing collection-level data migration and describes the limitations of existing tools in supporting LAM and “small science” research database migration. We conclude by outlining future research directions for the maintenance and migration of collections and complex digital objects.

    The Phylogeny of a Dataset

    The field of evolutionary biology offers many approaches to study the changes that occur between and within generations of species; these methods have recently been adopted by cultural anthropologists, linguists, and archaeologists to study the evolution of physical artifacts. In this paper, we further extend these approaches by using phylogenetic methods to model and visualize the evolution of a long-standing, widely used digital dataset in climate science. Our case study shows that clustering algorithms developed specifically for phylogenetic studies in evolutionary biology can be successfully adapted to the study of digital objects and their known offspring. Although we note a number of limitations with our initial effort, we argue that a quantitative approach to studying how digital objects evolve, are reused, and spawn new digital objects represents an important direction for the future of Information Science.
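
    The abstract does not give the paper's data or distance measure; as a sketch of the general approach, the code below applies average-linkage hierarchical clustering (via SciPy) to an invented pairwise dissimilarity matrix over dataset "versions" and renders the result in Newick, the standard phylogeny format.

        # Hedged sketch: average-linkage clustering stands in for whatever
        # phylogenetic method the paper used, and the distances are invented.
        import numpy as np
        from scipy.cluster.hierarchy import linkage, to_tree
        from scipy.spatial.distance import squareform

        versions = ["v1.0", "v1.1", "v2.0", "fork_A"]

        # Hypothetical pairwise dissimilarities between dataset versions (e.g., the
        # fraction of records or variables that differ); symmetric, zero diagonal.
        D = np.array([
            [0.0, 0.1, 0.6, 0.7],
            [0.1, 0.0, 0.5, 0.6],
            [0.6, 0.5, 0.0, 0.3],
            [0.7, 0.6, 0.3, 0.0],
        ])

        Z = linkage(squareform(D), method="average")

        def newick(node, labels):
            """Render the clustering as a Newick string (the phylogeny format)."""
            if node.is_leaf():
                return labels[node.id]
            return f"({newick(node.left, labels)},{newick(node.right, labels)})"

        print(newick(to_tree(Z), versions) + ";")   # e.g. ((v1.0,v1.1),(v2.0,fork_A));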

    The Product and System Specificities of Measuring Curation Impact

    Using three datasets archived at the National Center for Atmospheric Research (NCAR), we describe the creation of a ‘data usage index’ for curation-specific impact assessments. Our work is focused on quantitatively evaluating climate and weather data used in earth and space science research, but we also discuss the application of this approach to other research data contexts. We conclude with some proposed future directions for metric-based work in data curation.
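
    The abstract does not define the index's formula; purely as a sketch of what a curation-oriented usage index might look like, the code below combines log-scaled download, user, and citation counts with arbitrary weights. None of these inputs or weights come from the NCAR study.

        # Illustrative "data usage index"; the signals, weights, and log scaling
        # are assumptions, not the metric described in the paper.
        from math import log10

        def usage_index(downloads: int, unique_users: int, citations: int,
                        w_downloads: float = 0.3, w_users: float = 0.3,
                        w_citations: float = 0.4) -> float:
            """Combine log-scaled usage signals into one comparable score."""
            def scale(n: int) -> float:
                return log10(1 + n)
            return (w_downloads * scale(downloads)
                    + w_users * scale(unique_users)
                    + w_citations * scale(citations))

        # Two hypothetical archived datasets: heavily downloaded vs. heavily cited.
        print(round(usage_index(downloads=12000, unique_users=900, citations=15), 3))
        print(round(usage_index(downloads=800, unique_users=120, citations=210), 3))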

    Taxonomy and the Production of Semantic Phenotypes

    Preprint of a chapter appearing in "Studies on the Semantic Web: Volume 33: Application of Semantic Technology in Biodiversity Science". Taxonomists produce a myriad of phenotypic descriptions. Traditionally these are provided in terse (telegraphic) natural language. As seen in parallel within other fields of biology, researchers are exploring ways to formalize parts of the taxonomic process so that aspects of it are more computational in nature. The currently used data formalizations, mechanisms for persisting data, applications, and computing approaches related to the production of semantic descriptions (phenotypes) are reviewed; they, and their adopters, are limited in number. In order to move forward, we step back and characterize taxonomists with respect to their typical workflow and tendencies. We then use these characteristics as a basis for exploring how we might create software that taxonomists will find intuitive within their current workflows, providing interface examples as thought experiments. NSF DBI-1356381; NSF 0956049.
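
    One widely used formalization in this area is the entity-quality (EQ) pattern, in which a phenotype statement links an anatomical entity to a quality term drawn from an ontology; the rdflib sketch below shows that pattern with placeholder URIs and unresolved term IDs, as a simplification rather than the chapter's own model.

        # EQ-style semantic phenotype as RDF triples; the example.org namespace and
        # the specific UBERON/PATO IDs are placeholders, not curated term choices.
        from rdflib import Graph, Namespace, Literal, RDF, RDFS

        EX = Namespace("http://example.org/phenotype/")      # hypothetical namespace
        OBO = Namespace("http://purl.obolibrary.org/obo/")   # OBO PURL base

        g = Graph()
        g.bind("ex", EX)
        g.bind("obo", OBO)

        stmt = EX["statement1"]
        g.add((stmt, RDF.type, EX.PhenotypeStatement))
        g.add((stmt, EX.entity, OBO["UBERON_0000000"]))       # placeholder anatomy term
        g.add((stmt, EX.quality, OBO["PATO_0000000"]))        # placeholder quality term
        g.add((stmt, RDFS.comment,
               Literal("Telegraphic source text: 'hind femur decidedly bowed'")))

        print(g.serialize(format="turtle"))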

    Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

    Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive, step with the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach that included manual editing by eye, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared the resulting tree topologies based on these different alignment methods using multiple metrics. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could best be explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are compatible with older methods even in extremely difficult-to-align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need revision.
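
    The abstract does not name the topology-comparison metrics used; the Robinson-Foulds (RF) distance is one standard choice, and the self-contained sketch below computes it for two toy four-taxon trees from precomputed bipartitions (in practice one would parse the Newick trees and extract bipartitions with a phylogenetics library).

        # Robinson-Foulds distance from precomputed bipartitions; the two toy trees
        # are illustrations, not the HBV trees from the study.
        def rf_distance(bipartitions_a, bipartitions_b):
            """Count non-trivial splits found in exactly one of the two trees."""
            return len(bipartitions_a ^ bipartitions_b)

        # Each internal edge of an unrooted tree splits the taxa in two; storing one
        # canonical side of each split (here, the side containing taxon "A") as a
        # frozenset identifies the bipartition.
        # Tree 1: ((A,B),(C,D));   Tree 2: ((A,C),(B,D));
        tree1_splits = {frozenset({"A", "B"})}
        tree2_splits = {frozenset({"A", "C"})}

        print(rf_distance(tree1_splits, tree2_splits))   # -> 2 (maximally different)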

    Preferential Re-Replication of Drosophila Heterochromatin in the Absence of Geminin

    To ensure genomic integrity, the genome must be duplicated exactly once per cell cycle. Disruption of replication licensing mechanisms may lead to re-replication and genomic instability. Cdt1, also known as Double-parked (Dup) in Drosophila, is a key regulator of the assembly of the pre-replicative complex (pre-RC), and its activity is strictly limited to G1 by multiple mechanisms, including Cul4-Ddb1-mediated proteolysis and inhibition by geminin. We assayed the genomic consequences of deregulating the replication licensing mechanisms by RNAi depletion of geminin. We found that not all origins of replication were sensitive to geminin depletion and that heterochromatic sequences were preferentially re-replicated in the absence of licensing mechanisms. The preferential re-activation of heterochromatic origins of replication was unexpected because these are typically the last sequences to be duplicated in a normal cell cycle. We found that the re-replication of heterochromatin was regulated not at the level of pre-RC activation, but rather by the formation of the pre-RC. Unlike the global assembly of the pre-RC that occurs throughout the genome in G1, in the absence of geminin, limited pre-RC assembly was restricted to the heterochromatin by elevated cyclin A-CDK activity. These results suggest that there are chromatin- and cell cycle-specific controls that regulate the re-assembly of the pre-RC outside of G1.