A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
In practical data integration systems, it is common for the data sources
being integrated to provide conflicting information about the same entity.
Consequently, a major challenge for data integration is to derive the most
complete and accurate integrated records from diverse and sometimes conflicting
sources. We term this challenge the truth finding problem. We observe that some
sources are generally more reliable than others, and therefore a good model of
source quality is the key to solving the truth finding problem. In this work,
we propose a probabilistic graphical model that can automatically infer true
records and source quality without any supervision. In contrast to previous
methods, our principled approach leverages a generative process of two types of
errors (false positive and false negative) by modeling two different aspects of
source quality. In so doing, ours is also the first approach designed to merge
multi-valued attribute types. Our method is scalable, due to an efficient
sampling-based inference algorithm that needs very few iterations in practice
and enjoys linear time complexity, with an even faster incremental variant.
Experiments on two real-world datasets show that our new method outperforms
existing state-of-the-art approaches to the truth finding problem.
Comment: VLDB201
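The abstract's two-sided notion of source quality, a sensitivity (avoiding false negatives) and a specificity (avoiding false positives) estimated per source, can be sketched as a toy fixed-point iteration. This is an illustrative simplification with made-up claims and smoothing constants; it is not the paper's graphical model or its sampling-based inference:

```python
def truth_finding(claims, n_iters=20):
    """claims: dict mapping (entity, value) -> {source: bool vote}."""
    sources = {s for votes in claims.values() for s in votes}
    sens = {s: 0.8 for s in sources}  # sensitivity = 1 - false-negative rate
    spec = {s: 0.8 for s in sources}  # specificity = 1 - false-positive rate
    truth = {}
    for _ in range(n_iters):
        # Score each candidate fact by how strongly reliable sources back it.
        for fact, votes in claims.items():
            score = sum(sens[s] if v else -spec[s] for s, v in votes.items())
            truth[fact] = score > 0
        # Re-estimate both quality sides of every source (Laplace-smoothed).
        for s in sources:
            tp = fn = tn = fp = 0
            for fact, votes in claims.items():
                if s not in votes:
                    continue
                if truth[fact]:
                    tp, fn = tp + votes[s], fn + (not votes[s])
                else:
                    tn, fp = tn + (not votes[s]), fp + votes[s]
            sens[s] = (tp + 1) / (tp + fn + 2)
            spec[s] = (tn + 1) / (tn + fp + 2)
    return truth, sens, spec

claims = {
    ("Einstein", "physicist"): {"s1": True, "s2": True, "s3": True},
    ("Einstein", "painter"):   {"s1": False, "s2": False, "s3": True},
}
truth, sens, spec = truth_finding(claims)
# s3 asserts a fact the majority rejects, so its specificity estimate drops.
```

Modeling the two error types separately is what lets a source be trusted for what it does report even when it omits many true values, or vice versa.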
Reification and Truthmaking Patterns
Reification is a standard technique in conceptual modeling, which consists of including in the domain of discourse entities that may otherwise be hidden or implicit. However, deciding what should be reified is not always easy. Recent work on formal ontology offers us a simple answer: put in the domain of discourse those entities that are responsible for the (alleged) truth of our propositions. These are called truthmakers. Revisiting previous work, we propose in this paper a systematic analysis of truthmaking patterns for properties and relations based on the ontological nature of their truthmakers. Truthmaking patterns will be presented as generalizations of reification patterns, accounting for the fact that, in some cases, we do not reify a property or a relationship directly, but rather reify its truthmakers.
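The core move, reifying the truthmaker of a relationship rather than the relationship itself, can be illustrated with a minimal sketch; all class and attribute names below are hypothetical, not taken from the paper:

```python
from dataclasses import dataclass

# Non-reified: a bare relation, with nowhere to attach further facts
# such as when or where the relationship came to hold.
married_to = {("Alice", "Bob")}

# Reified: introduce the truthmaker of the proposition (the marriage
# itself) as a first-class entity in the domain of discourse.
@dataclass
class Marriage:
    spouse1: str
    spouse2: str
    date: str
    place: str

m = Marriage("Alice", "Bob", "2001-06-16", "Pisa")
# "Alice is married to Bob" is now made true by the entity m.
```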
The supervised IBP: neighbourhood preserving infinite latent feature models
We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables, and their values. The latent variables preserve neighbourhood structure of the data in the sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian Buffet Process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with the continuously growing structure of the neighbourhood preserving infinite latent feature space.
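Once binary latent codes are inferred, retrieval reduces to a nearest-neighbour search under Hamming distance. A minimal sketch with hypothetical 8-bit codes (the IBP-based inference itself is not shown):

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded codes."""
    return bin(a ^ b).count("1")

# Hypothetical 8-bit latent codes inferred for three objects.
database = {"cat": 0b10110010, "dog": 0b10110011, "car": 0b01001100}

def retrieve(query_code, db):
    """Return the object whose code is nearest in Hamming space."""
    return min(db, key=lambda k: hamming(query_code, db[k]))

best = retrieve(0b10110110, database)  # closest code is "cat" (distance 1)

# Codes can grow with the feature space: appending the same bit to every
# code leaves all pairwise Hamming distances unchanged.
extended = {k: (c << 1) | 0 for k, c in database.items()}
```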
Community Detection in Networks with Node Attributes
Community detection algorithms are fundamental tools that allow us to uncover
organizational principles in networks. When detecting communities, there are
two possible sources of information one can use: the network structure, and the
features and attributes of nodes. Even though communities form around nodes
that have common edges and common attributes, typically, algorithms have only
focused on one of these two data modalities: community detection algorithms
traditionally focus only on the network structure, while clustering algorithms
mostly consider only node attributes. In this paper, we develop Communities
from Edge Structure and Node Attributes (CESNA), an accurate and scalable
algorithm for detecting overlapping communities in networks with node
attributes. CESNA statistically models the interaction between the network
structure and the node attributes, which leads to more accurate community
detection as well as improved robustness in the presence of noise in the
network structure. CESNA has a linear runtime in the network size and is able
to process networks an order of magnitude larger than comparable approaches.
Last, CESNA also helps with the interpretation of detected communities by
finding relevant node attributes for each community.
Comment: Published in the proceedings of IEEE ICDM '1
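The interaction CESNA models can be sketched with the two generative components commonly used in this line of work: an edge probability driven by shared community memberships, and a logistic model tying node attributes to those same memberships. The membership and weight values below are made up for illustration:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def edge_prob(fu, fv):
    """P(edge u~v) = 1 - exp(-F_u . F_v): shared memberships raise it."""
    return 1.0 - math.exp(-dot(fu, fv))

def attr_prob(w, fu):
    """P(attribute = 1) = sigmoid(W . F_u): attributes reflect communities."""
    return 1.0 / (1.0 + math.exp(-dot(w, fu)))

F = {"u": [1.2, 0.0], "v": [0.9, 0.1]}  # nonnegative community memberships
W_k = [2.0, -1.0]                       # logistic weights for one attribute

p_edge = edge_prob(F["u"], F["v"])      # about 0.66
p_attr = attr_prob(W_k, F["u"])         # about 0.92
```

Fitting both components jointly is what lets attribute information compensate for noisy edges, and vice versa.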
Geocoded data structures and their applications to Earth science investigations
A geocoded data structure is a means for digitally representing a geographically referenced map or image. The characteristics of representative cellular, linked, and hybrid geocoded data structures are reviewed. The data processing requirements of Earth science projects at the Goddard Space Flight Center and the basic tools of geographic data processing are described. Specific ways that new geocoded data structures can be used to adapt these tools to scientists' needs are presented. These include: expanding analysis and modeling capabilities; simplifying the merging of data sets from diverse sources; and saving computer storage space.
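A cellular (grid) geocoded structure makes the "merging of data sets from diverse sources" concrete: once two layers are registered to the same grid, merging is a cell-wise overlay. The layer codes below are hypothetical:

```python
# Two hypothetical raster layers on the same grid (codes are made up):
land_use  = [[1, 1, 2],
             [1, 2, 2]]  # 1 = forest, 2 = urban
elevation = [[0, 0, 1],
             [1, 1, 1]]  # 0 = lowland, 1 = highland

def overlay(a, b):
    """Cell-wise merge of two co-registered cellular layers."""
    return [[(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

merged = overlay(land_use, elevation)
# merged[0][2] == (2, 1): an urban highland cell.
```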
Measurement in marketing
We distinguish three senses of the concept of measurement (measurement as the selection of observable indicators of theoretical concepts, measurement as the collection of data from respondents, and measurement as the formulation of measurement models linking observable indicators to latent factors representing the theoretical concepts), and we review important issues related to measurement in each of these senses. With regard to measurement in the first sense, we distinguish the steps of construct definition and item generation, and we review scale development efforts reported in three major marketing journals since 2000 to illustrate these steps and derive practical guidelines. With regard to measurement in the second sense, we look at the survey process from the respondent's perspective and discuss the goals that may guide participants' behavior during a survey, the cognitive resources that respondents devote to answering survey questions, and the problems that may occur at the various steps of the survey process. Finally, with regard to measurement in the third sense, we cover both reflective and formative measurement models, and we explain how researchers can assess the quality of measurement in both types of measurement models and how they can ascertain the comparability of measurements across different populations of respondents or conditions of measurement. We also provide a detailed empirical example of measurement analysis for reflective measurement models.
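For reflective measurement models, one standard way to assess measurement quality is internal-consistency reliability. A minimal Cronbach's alpha on made-up survey scores (this is a generic reliability statistic, not necessarily the analysis the paper performs):

```python
def cronbach_alpha(items):
    """items: list of per-item score lists, one inner list per item."""
    k = len(items)
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(resp) for resp in zip(*items)]  # each respondent's total
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Made-up scores: 3 items (rows) answered by 4 respondents (columns).
scores = [[4, 5, 3, 4],
          [4, 4, 3, 5],
          [5, 5, 2, 4]]
alpha = cronbach_alpha(scores)  # about 0.82 for this toy data
```

High alpha supports treating the items as indicators of a single latent factor; for formative models, where indicators cause rather than reflect the construct, this statistic is not appropriate.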