20,428 research outputs found
Integrating and Ranking Uncertain Scientific Data
Mediator-based data integration systems resolve exploratory queries by joining data elements across sources. In the presence of uncertainties, such multiple expansions can quickly lead to spurious connections and incorrect results. The BioRank project investigates formalisms for modeling uncertainty during scientific data integration and for ranking uncertain query results. Our motivating application is protein function prediction. In this paper we show that: (i) explicit modeling of uncertainties as probabilities increases our ability to predict less-known or previously unknown functions (though it does not improve predicting the well-known). This suggests that probabilistic uncertainty models offer utility for scientific knowledge discovery; (ii) small perturbations in the input probabilities tend to produce only minor changes in the quality of our result rankings. This suggests that our methods are robust against slight variations in the way uncertainties are transformed into probabilities; and (iii) several techniques allow us to evaluate our probabilistic rankings efficiently. This suggests that probabilistic query evaluation is not as hard for real-world problems as theory indicates
Designing marine fishery reserves using passive acoustic telemetry
Marine Fishery Reserves (MFRs) are being adopted, in part, as a strategy to replenish depleted fish stocks and serve as a source for recruits to adjacent fisheries. By necessity, their design must consider the biological parameters of the species under consideration to ensure that the spawning stock is conserved while simultaneously providing propagules for dispersal. We describe how acoustic telemetry can be employed to design effective MFRs by elucidating important life-history parameters of the species under consideration, including home range, and ecological preferences, including habitat utilization. We then designed a reserve based on these parameters using
data from two acoustic telemetry studies that examined two closely-linked subpopulations of queen conch (Strombus
gigas) at Conch Reef in the Florida Keys. The union of the home ranges of the individual conch (aggregation home range:
AgHR) within each subpopulation was used to construct a shape delineating the area within which a conch would be located with a high probability. Together with habitat utilization information acquired during both the spawning and non-spawning seasons, as well as landscape features
(i.e., corridors), we designed a 66.5 ha MFR to conserve the conch population. Consideration was also given for further expansion of the population into suitable habitats
Structure-Aware Sampling: Flexible and Accurate Summarization
In processing large quantities of data, a fundamental problem is to obtain a
summary which supports approximate query answering. Random sampling yields
flexible summaries which naturally support subset-sum queries with unbiased
estimators and well-understood confidence bounds.
Classic sample-based summaries, however, are designed for arbitrary subset
queries and are oblivious to the structure in the set of keys. The particular
structure, such as hierarchy, order, or product space (multi-dimensional),
makes range queries much more relevant for most analysis of the data.
Dedicated summarization algorithms for range-sum queries have also been
extensively studied. They can outperform existing sampling schemes in terms of
accuracy on range queries per summary size. Their accuracy, however, rapidly
degrades when, as is often the case, the query spans multiple ranges. They are
also less flexible - being targeted for range sum queries alone - and are often
quite costly to build and use.
In this paper we propose and evaluate variance optimal sampling schemes that
are structure-aware. These summaries improve over the accuracy of existing
structure-oblivious sampling schemes on range queries while retaining the
benefits of sample-based summaries: flexible summaries, with high accuracy on
both range queries and arbitrary subset queries
Record-Linkage from a Technical Point of View
TRecord linkage is used for preparing sampling frames, deduplication of lists and combining information on the same object from two different databases. If the identifiers of the same objects in two different databases have error free unique common identifiers like personal identification numbers (PID), record linkage is a simple file merge operation. If the identifiers contains errors, record linkage is a challenging task. In many applications, the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues and future research topics are discussed.Record-Linkage, Data-mining, Privacy preserving protocols
A Survey on Wireless Sensor Network Security
Wireless sensor networks (WSNs) have recently attracted a lot of interest in
the research community due their wide range of applications. Due to distributed
nature of these networks and their deployment in remote areas, these networks
are vulnerable to numerous security threats that can adversely affect their
proper functioning. This problem is more critical if the network is deployed
for some mission-critical applications such as in a tactical battlefield.
Random failure of nodes is also very likely in real-life deployment scenarios.
Due to resource constraints in the sensor nodes, traditional security
mechanisms with large overhead of computation and communication are infeasible
in WSNs. Security in sensor networks is, therefore, a particularly challenging
task. This paper discusses the current state of the art in security mechanisms
for WSNs. Various types of attacks are discussed and their countermeasures
presented. A brief discussion on the future direction of research in WSN
security is also included.Comment: 24 pages, 4 figures, 2 table
A revival of integrity constraints for data cleaning
Integrity constraints,
a.k.a
. data dependencies, are being widely used for improving
the quality of schema
. Recently constraints have enjoyed a revival for
improving the quality of data
. The tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
</jats:p
- …