
    Duplicate Detection in Probabilistic Data

    Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML); there is no prior work on the integration of uncertain, especially probabilistic, source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
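
    The abstract does not spell out the concrete techniques; as a rough Python sketch of the general idea only, a blocking key can reduce the search space before pairwise comparison of probabilistic tuples, where an attribute holds several alternative values with probabilities. The `blocking_key`, `expected_similarity`, and threshold below are illustrative assumptions, not the authors' methods.

```python
from itertools import combinations
from collections import defaultdict

# A probabilistic tuple: one attribute with several possible values and probabilities,
# e.g. {"name": [("John Smith", 0.7), ("Jon Smith", 0.3)]}

def blocking_key(t):
    """Illustrative blocking key: first three letters of the most probable name."""
    name, _prob = max(t["name"], key=lambda vp: vp[1])
    return name[:3].lower()

def expected_similarity(t1, t2):
    """Expected equality of the 'name' attribute over both tuples' alternatives."""
    return sum(p1 * p2
               for v1, p1 in t1["name"]
               for v2, p2 in t2["name"]
               if v1.lower() == v2.lower())

def detect_duplicates(tuples, threshold=0.5):
    # Search space reduction: only compare tuples that share a blocking key.
    blocks = defaultdict(list)
    for t in tuples:
        blocks[blocking_key(t)].append(t)
    candidates = []
    for block in blocks.values():
        for t1, t2 in combinations(block, 2):
            if expected_similarity(t1, t2) >= threshold:
                candidates.append((t1, t2))
    return candidates
```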

    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    In current research, duplicate detection is usually treated as a deterministic process in which tuples are either declared duplicates or not. However, it is often not completely clear whether two tuples represent the same real-world entity. Deterministic approaches ignore this uncertainty, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impact of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic, and human effort can be reduced to a large extent. Unfortunately, a fully indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministically handled decisions in a meaningful way.
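
    As a toy illustration of the indeterministic idea (not the paper's probabilistic target schema), an uncertain match decision can be kept in the result as two weighted possible worlds, and a semi-indeterministic heuristic can restrict this treatment to decisions whose match probability falls inside an uncertain band; the class, the thresholds, and the merge rule below are assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class IndeterministicDecision:
    """Keeps both possible worlds of an uncertain duplicate decision.

    Instead of forcing a yes/no choice, the match probability weights the
    'merged' world against the 'separate' world (illustrative model only).
    """
    tuple_a: dict
    tuple_b: dict
    match_probability: float  # e.g. a normalized similarity score

    def possible_worlds(self):
        merged = {**self.tuple_a, **self.tuple_b}
        yield merged, self.match_probability                             # world: duplicates
        yield (self.tuple_a, self.tuple_b), 1 - self.match_probability   # world: distinct

def semi_indeterministic(decisions, low=0.2, high=0.8):
    """Heuristic reduction: only decisions in the uncertain band stay indeterministic;
    the rest would be resolved deterministically (merge if above, keep apart if below)."""
    return [d for d in decisions if low <= d.match_probability <= high]
```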

    Quality Assessment of Linked Datasets using Probabilistic Approximation

    With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation for implementing a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated in the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance. (Comment: 15 pages, 2 figures; to appear in ESWC 2015 proceedings.)
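
    Reservoir Sampling is a standard technique; a minimal sketch of how such a sampler could back an approximate, per-triple quality metric is shown below. The `approximate_metric` helper, the placeholder metric, and the fixed seed are illustrative assumptions and do not reflect Luzzu's actual implementation.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown length
    (Algorithm R): each item ends up in the reservoir with probability k/n."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

def approximate_metric(triples, metric, sample_size=1000):
    """Estimate a per-triple quality metric on a sample instead of the full dataset."""
    sample = reservoir_sample(triples, sample_size)
    return sum(metric(t) for t in sample) / len(sample)
```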

    Information Extraction, Data Integration, and Uncertain Data Management: The State of The Art

    Information extraction, data integration, and uncertain data management are distinct areas of research that have received considerable attention over the last two decades. Much of this research has tackled the areas individually. However, information extraction systems should be integrated with data integration methods to make use of the extracted information. Handling uncertainty in the extraction and integration process is an important issue for enhancing the quality of the data in such integrated systems. This article presents the state of the art of these areas of research, shows their common ground, and discusses how to integrate information extraction and data integration under an uncertainty management umbrella.

    Boosting Linear-Optical Bell Measurement Success Probability with Pre-Detection Squeezing and Imperfect Photon-Number-Resolving Detectors

    Linear optical realizations of Bell state measurement (BSM) on two single-photon qubits succeed with probability $p_s$ no higher than $0.5$. However, pre-detection quadrature squeezing, i.e., quantum noise limited phase sensitive amplification, in the usual linear-optical BSM circuit can yield $p_s \approx 0.643$. The ability to achieve $p_s > 0.5$ has been found to be critical in resource-efficient realizations of linear optical quantum computing and all-photonic quantum repeaters. Yet, the aforesaid value of $p_s > 0.5$ is not known to be the maximum achievable using squeezing, thereby leaving it open whether close-to-$100\%$ efficient BSM might be achievable using squeezing as a resource. In this paper, we report new insights on why squeezing-enhanced BSM achieves $p_s > 0.5$. Using this, we show that the previously reported $p_s \approx 0.643$ at single-mode squeezing strength $r = 0.6585$, for unambiguous state discrimination (USD) of all four Bell states, is an experimentally unachievable point result, which drops to $p_s \approx 0.59$ with the slightest change in $r$. We however show that squeezing-induced boosting of $p_s$ with USD operation is still possible over a continuous range of $r$, with an experimentally achievable maximum occurring at $r = 0.5774$, achieving $p_s \approx 0.596$. Finally, deviating from USD operation, we explore a trade space between $p_s$, the probability with which the BSM circuit declares a "success", and the probability of error $p_e$, the probability of an input Bell state being erroneously identified given that the circuit declares a success. Since quantum error correction could correct for some $p_e > 0$, this tradeoff may enable better quantum repeater designs by potentially increasing entanglement generation rates with $p_s$ exceeding what is possible with the traditionally studied USD operation of BSMs. (Comment: 13 pages, 10 figures)

    UNH Monitoring Activities that Support the National Coastal Assessment in 2007

    The National Coastal Assessment is an Environmental Protection Agency program to monitor the health of the nation’s estuaries using nationally standardized methods and a probabilistic sampling design. Dedicated EPA funding for the National Coastal Assessment ceased after 2006. Therefore, the NH Department of Environmental Services and the New Hampshire Estuaries Project contributed funds to continue a portion of the National Coastal Assessment in 2007. Water quality measurements were successfully made during 2007 at 25 randomly located stations throughout the Great Bay Estuary and Hampton-Seabrook Harbor. These data will be combined with samples collected in 2006 for probabilistic assessments of estuarine water quality during the 2006-2007 period in the NHEP Water Quality Indicators Report in 2009.

    Strategies for estimating human exposure to mycotoxins via food

    In this review, five strategies to estimate mycotoxin exposure of a (sub-)population via food, including data collection, are discussed with the aim of identifying the added values and limitations of each strategy for risk assessment of these chemicals. The well-established point estimate, observed individual mean, probabilistic and duplicate diet strategies are addressed, as well as the emerging human biomonitoring strategy. All five exposure assessment strategies allow the estimation of chronic (long-term) exposure to mycotoxins and, with the exception of the observed individual mean strategy, also acute (short-term) exposure. Methods for data collection, i.e., food consumption surveys, food monitoring studies and total diet studies, are discussed. In food monitoring studies, the driving force is often enforcement of legal limits, and, consequently, data are often generated with relatively high limits of quantification and targeted at products suspected to contain mycotoxin levels above these legal limits. Total diet studies provide a solid base for chronic exposure assessments since they provide mycotoxin levels in food based on well-defined samples and include the effect of food preparation. Duplicate diet studies and human biomonitoring studies reveal the actual exposure but often involve a restricted group of human volunteers and a limited time period. Human biomonitoring studies may also capture exposure to mycotoxins from sources other than food, and exposure to modified mycotoxins that may not be detected with current analytical methods. Low limits of quantification are required for the analytical methods applied for data collection to avoid large uncertainties in the exposure estimates due to high numbers of left-censored data, i.e., levels below the limit of quantification.
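
    As a rough sketch of the probabilistic strategy (not the specific models reviewed), a Monte Carlo simulation can draw consumption, contamination and body weight values and combine them into a daily exposure. The distributions, the limit of quantification, and the LOQ/2 substitution for left-censored results below are illustrative assumptions only.

```python
import random

def simulate_daily_exposure(n_iterations=100_000,
                            loq=0.5,                 # limit of quantification, µg/kg (assumed)
                            censored_fraction=0.6,   # share of results below LOQ (assumed)
                            seed=0):
    """Monte Carlo sketch of chronic dietary mycotoxin exposure (µg/kg bw/day).

    exposure = consumption (kg/day) * concentration (µg/kg) / body weight (kg)
    Left-censored results are substituted by LOQ/2, one common convention; a high
    censored fraction is exactly what inflates the uncertainty the review warns about.
    """
    rng = random.Random(seed)
    exposures = []
    for _ in range(n_iterations):
        consumption = max(rng.gauss(0.15, 0.05), 0.0)        # kg of the food per day
        if rng.random() < censored_fraction:
            concentration = loq / 2                          # substituted censored value
        else:
            concentration = rng.lognormvariate(0.0, 1.0)     # µg/kg in the food
        body_weight = max(rng.gauss(70, 12), 30)             # kg
        exposures.append(consumption * concentration / body_weight)
    exposures.sort()
    mean = sum(exposures) / n_iterations
    p95 = exposures[int(0.95 * n_iterations)]
    return mean, p95
```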