Duplicate Detection in Probabilistic Data
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain (i.e., non-probabilistic) source data, relational or XML. There is no work so far on the integration of uncertain (especially probabilistic) source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
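The search space reduction the abstract refers to is, in the general duplicate-detection literature, typically some form of blocking: rather than comparing all record pairs (quadratic cost), only records sharing a blocking key are compared. The sketch below illustrates that generic idea only; the paper's probabilistic-data-specific methods are not reproduced here, and the `name`-prefix key is a made-up example.

```python
from itertools import combinations

def blocking_key(record):
    # Hypothetical blocking key: first three letters of the name attribute.
    return record["name"][:3].lower()

def candidate_pairs(records):
    """Group records by a blocking key so that only records sharing
    a key are compared, shrinking the quadratic search space."""
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

records = [
    {"id": 1, "name": "Smith, John"},
    {"id": 2, "name": "Smith, Jon"},
    {"id": 3, "name": "Doe, Jane"},
]
pairs = list(candidate_pairs(records))
# Only the two "Smi..." records form a candidate pair;
# "Doe, Jane" is never compared against them.
```

A detailed similarity function would then be applied only to the surviving candidate pairs.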
Indeterministic Handling of Uncertain Decisions in Duplicate Detection
In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared to be duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity. In deterministic approaches this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
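The core idea, keeping all possible worlds instead of forcing a hard duplicate/non-duplicate decision, can be sketched as follows. This is an illustrative simplification under assumed names (`indeterministic_merge`, a naive dictionary merge); the paper's probabilistic target schema is considerably richer.

```python
def indeterministic_merge(t1, t2, match_prob):
    """Return the possible worlds induced by one uncertain duplicate
    decision: one world where the tuples are merged, one where they
    stay distinct, each weighted by its probability."""
    merged = {**t1, **t2}  # naive merge for illustration only
    return [
        (match_prob, [merged]),        # world: t1 and t2 are duplicates
        (1.0 - match_prob, [t1, t2]),  # world: t1 and t2 are distinct
    ]

worlds = indeterministic_merge({"name": "Smith, John"},
                               {"name": "Smith, Jon"},
                               match_prob=0.8)
# Two weighted worlds whose probabilities sum to 1.
```

A deterministic pipeline would keep only one of these worlds; the indeterministic approach defers the decision by carrying both forward in the probabilistic result.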
Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters, and Clustering Coefficient estimation to implement a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated in the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures, to appear in ESWC 2015 proceedings
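Reservoir sampling, one of the probabilistic techniques the abstract names, maintains a uniform random sample of fixed size k over a stream of unknown length in O(k) memory, which is what makes approximate metric computation feasible on large datasets. A minimal sketch of the classic Algorithm R (not the Luzzu implementation):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, using O(k) memory and one pass."""
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive on both ends
            if j < k:
                reservoir[j] = item         # replace with prob. k/(i+1)
    return reservoir

sample = reservoir_sample(range(10_000), 100)
# sample now holds exactly 100 items drawn uniformly from the stream.
```

A quality metric is then estimated on the sample instead of the full dataset, trading a small, quantifiable error for a large reduction in cost.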
Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art
Information extraction, data integration, and uncertain data management are distinct research areas that have received considerable attention over the last two decades. Most research has tackled these areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is an important issue for enhancing the quality of the data in such integrated systems. This article presents the state of the art of the mentioned research areas, shows their common ground, and discusses how to integrate information extraction and data integration under an uncertainty management umbrella.
Boosting Linear-Optical Bell Measurement Success Probability with Pre-Detection Squeezing and Imperfect Photon-Number-Resolving Detectors
Linear optical realizations of Bell state measurement (BSM) on two single-photon qubits succeed with probability $P_s$ no higher than $1/2$. However, pre-detection quadrature squeezing, i.e., quantum-noise-limited phase-sensitive amplification, in the usual linear-optical BSM circuit can yield $P_s \approx 0.643$. The ability to achieve $P_s > 1/2$ has been found to be critical in resource-efficient realizations of linear optical quantum computing and all-photonic quantum repeaters. Yet, the aforesaid value of $P_s$ is not known to be the maximum achievable using squeezing, thereby leaving it open whether close-to-unit-efficiency BSM might be achievable using squeezing as a resource. In this paper, we report new insights on why squeezing-enhanced BSM achieves $P_s > 1/2$. Using this, we show that the previously reported $P_s \approx 0.643$ at a particular single-mode squeezing strength $r$, for unambiguous state discrimination (USD) of all four Bell states, is an experimentally unachievable point result, which drops to a strictly smaller value with the slightest change in $r$. We however show that squeezing-induced boosting of $P_s$ with USD operation is still possible over a continuous range of $r$, with an experimentally achievable maximum occurring at a nonzero squeezing strength. Finally, deviating from USD operation, we explore a trade-space between $P_s$, the probability with which the BSM circuit declares a "success", versus the probability of error $P_e$, the probability of an input Bell state being erroneously identified given that the circuit declares a success. Since quantum error correction could correct for some nonzero $P_e$, this tradeoff may enable better quantum repeater designs by potentially increasing the entanglement generation rates with $P_s$ exceeding what is possible with traditionally studied USD operation of BSMs.
Comment: 13 pages, 10 figures
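For context, the well-known $1/2$ ceiling for static linear optics (no ancillae, no squeezing) follows from a textbook argument, not from this paper: interfering the two photonic qubits on a balanced beamsplitter produces click patterns that identify the two antisymmetric-sector Bell states but leave the other two indistinguishable.

```latex
% The four Bell states of two qubits:
\[
  |\psi^{\pm}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|01\rangle \pm |10\rangle\bigr),
  \qquad
  |\phi^{\pm}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|00\rangle \pm |11\rangle\bigr).
\]
% A 50:50 beamsplitter BSM yields distinct detector signatures for
% |\psi^+\rangle and |\psi^-\rangle, while |\phi^+\rangle and
% |\phi^-\rangle produce identical ones. With equiprobable inputs,
\[
  P_s \;=\; \Pr\bigl[\text{input is } |\psi^{+}\rangle \text{ or } |\psi^{-}\rangle\bigr]
  \;=\; \tfrac{1}{4} + \tfrac{1}{4} \;=\; \tfrac{1}{2}.
\]
```

Squeezing-enhanced schemes beat this bound precisely because the pre-detection amplification breaks the symmetry that makes the $|\phi^{\pm}\rangle$ signatures coincide.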
UNH Monitoring Activities that Support the National Coastal Assessment in 2007
The National Coastal Assessment is an Environmental Protection Agency program to monitor the health of the nation’s estuaries using nationally standardized methods and a probabilistic sampling design. Dedicated EPA funding for the National Coastal Assessment ceased after 2006. Therefore, the NH Department of Environmental Services and the New Hampshire Estuaries Project contributed funds to continue a portion of the National Coastal Assessment in 2007. Water quality measurements were successfully made during 2007 at 25 randomly located stations throughout the Great Bay Estuary and Hampton-Seabrook Harbor. These data will be combined with samples collected in 2006 for probabilistic assessments of estuarine water quality during the 2006-2007 period in the NHEP Water Quality Indicators Report in 2009.
Strategies for estimating human exposure to mycotoxins via food
In this review, five strategies to estimate mycotoxin exposure of a (sub-)population via food, including data collection, are discussed with the aim of identifying the added value and limitations of each strategy for risk assessment of these chemicals. The well-established point estimate, observed individual mean, probabilistic, and duplicate diet strategies are addressed, as well as the emerging human biomonitoring strategy. All five exposure assessment strategies allow the estimation of chronic (long-term) exposure to mycotoxins and, with the exception of the observed individual mean strategy, also acute (short-term) exposure. Methods for data collection, i.e., food consumption surveys, food monitoring studies, and total diet studies, are discussed. In food monitoring studies, the driving force is often enforcement of legal limits, and, consequently, data are often generated with relatively high limits of quantification and targeted at products suspected to contain mycotoxin levels above these legal limits. Total diet studies provide a solid base for chronic exposure assessments since they provide mycotoxin levels in food based on well-defined samples and include the effect of food preparation. Duplicate diet studies and human biomonitoring studies reveal the actual exposure but often involve a restricted group of human volunteers and a limited time period. Human biomonitoring studies may also include exposure to mycotoxins from sources other than food, and exposure to modified mycotoxins that may not be detected with current analytical methods. Low limits of quantification are required for analytical methods applied for data collection to avoid large uncertainties in the exposure estimate due to high numbers of left-censored data, i.e., levels below the limit of quantification.
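The probabilistic strategy the review names generally combines distributions of food consumption and contaminant concentration by Monte Carlo simulation to obtain a distribution of exposure per kg body weight. The sketch below shows only that generic mechanic; all distributions and parameters are made-up placeholders, not survey or monitoring data.

```python
import random

def simulate_exposure(n, rng=None):
    """Monte Carlo sketch of probabilistic dietary exposure:
    daily intake per kg body weight =
        consumption (kg food/day) * concentration (ug toxin/kg food)
        / body weight (kg).
    All distribution parameters below are illustrative placeholders."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    exposures = []
    for _ in range(n):
        consumption = rng.lognormvariate(-3.0, 0.5)   # kg food per day
        concentration = rng.lognormvariate(1.0, 1.0)  # ug toxin per kg food
        body_weight = rng.gauss(70.0, 10.0)           # kg
        exposures.append(consumption * concentration / body_weight)
    return exposures

exp_dist = simulate_exposure(10_000)
p95 = sorted(exp_dist)[int(0.95 * len(exp_dist))]  # high-percentile exposure
```

In practice the resulting high percentiles (e.g., P95) are compared against health-based guidance values; handling of left-censored concentration data (values below the limit of quantification) strongly affects the outcome, as the review notes.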
Platform Dependent Verification: On Engineering Verification Tools for 21st Century
The paper overviews recent developments in platform-dependent explicit-state LTL model checking.
Comment: In Proceedings PDMC 2011, arXiv:1111.006