Probabilistic Sparse Subspace Clustering Using Delayed Association
Discovering and clustering subspaces in high-dimensional data is a
fundamental problem of machine learning with a wide range of applications in
data mining, computer vision, and pattern recognition. Earlier methods divided
the problem into two separate stages of finding the similarity matrix and
finding clusters. Similar to some recent works, we integrate these two steps
using a joint optimization approach. We make the following contributions: (i)
we estimate the reliability of the cluster assignment for each point before
assigning a point to a subspace. We group the data points into two groups of
"certain" and "uncertain", with the assignment of the latter group delayed until
their subspace association certainty improves. (ii) We demonstrate that delayed
association is better suited for clustering subspaces that have ambiguities,
i.e. when subspaces intersect or data are contaminated with outliers/noise.
(iii) We demonstrate experimentally that such delayed probabilistic association
leads to a more accurate self-representation and final clusters. The proposed
method has higher accuracy both for points that exclusively lie in one
subspace, and those that are on the intersection of subspaces. (iv) We show
that delayed association leads to a huge reduction in computational cost, since
it allows for incremental spectral clustering.
Labor Market Entry and Earnings Dynamics: Bayesian Inference Using Mixtures-of-Experts Markov Chain Clustering
This paper analyzes patterns in the earnings development of young labor market entrants over their life cycle. We identify four distinctly different types of transition patterns between discrete earnings states in a large administrative data set. Further, we investigate the effects of labor market conditions at the time of entry on the probability of belonging to each transition type. To estimate our statistical model we use a model-based clustering approach. The statistical challenge in our application comes from the difficulty in extending distance-based clustering approaches to the problem of identifying groups of similar time series in a panel of discrete-valued time series. We use Markov chain clustering, proposed by Pamminger and Frühwirth-Schnatter (2010), which is an approach for clustering discrete-valued time series obtained by observing a categorical variable with several states. This method is based on finite mixtures of first-order time-homogeneous Markov chain models. In order to analyze group membership we present an extension to this approach by formulating a probabilistic model for the latent group indicators within the Bayesian classification rule using a multinomial logit model.
Keywords: Labor Market Entry Conditions, Transition Data, Markov Chain Monte Carlo, Multinomial Logit, Panel Data, Auxiliary Mixture Sampler, Bayesian Statistics
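The classification step in a finite mixture of first-order Markov chains can be illustrated with a minimal sketch. It is not the authors' estimation procedure: the group transition matrices and group weights are assumed to be known and strictly positive, whereas the paper estimates them by MCMC and models the weights with a multinomial logit. All function names are hypothetical.

```python
import math

def transition_counts(series, n_states):
    """Count transitions j -> k in one discrete-valued time series."""
    counts = [[0] * n_states for _ in range(n_states)]
    for prev, curr in zip(series, series[1:]):
        counts[prev][curr] += 1
    return counts

def log_likelihood(counts, trans_matrix):
    """Log-likelihood of the counted transitions under one Markov chain.

    Assumes every transition that occurs has positive probability.
    """
    return sum(
        n * math.log(trans_matrix[j][k])
        for j, row in enumerate(counts)
        for k, n in enumerate(row)
        if n > 0
    )

def classify(series, trans_matrices, weights):
    """Posterior group probabilities for one series via Bayes' rule."""
    counts = transition_counts(series, len(trans_matrices[0]))
    log_post = [
        math.log(w) + log_likelihood(counts, xi)
        for w, xi in zip(weights, trans_matrices)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Illustrative (made-up) groups: a "sticky" chain and a "mobile" chain.
sticky = [[0.9, 0.1], [0.1, 0.9]]
mobile = [[0.5, 0.5], [0.5, 0.5]]
probs = classify([0, 0, 0, 0, 1, 1, 1, 1], [sticky, mobile], [0.5, 0.5])
```

Here the series changes state only once, so the posterior favors the sticky group. The time-homogeneity assumption shows up in the fact that only the transition counts, not their timing, enter the likelihood.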
Methods to Determine Node Centrality and Clustering in Graphs with Uncertain Structure
Much of the past work in network analysis has focused on analyzing discrete
graphs, where binary edges represent the "presence" or "absence" of a
relationship. Since traditional network measures (e.g., betweenness centrality)
utilize a discrete link structure, complex systems must be transformed to this
representation in order to investigate network properties. However, in many
domains there may be uncertainty about the relationship structure and any
uncertainty information would be lost in translation to a discrete
representation. Uncertainty may arise in domains where there is moderating link
information that cannot be easily observed, i.e., links become inactive over
time but may not be dropped, or observed links may not always correspond to a
valid relationship. In order to represent and reason with these types of
uncertainty, we move beyond the discrete graph framework and develop social
network measures based on a probabilistic graph representation. More
specifically, we develop measures of path length, betweenness centrality, and
clustering coefficient---one set based on sampling and one based on
probabilistic paths. We evaluate our methods on three real-world networks from
Enron, Facebook, and DBLP, showing that our proposed methods more accurately
capture salient effects without being susceptible to local noise, and that the
resulting analysis produces a better understanding of the graph structure and
the uncertainty resulting from its change over time.
Comment: Longer version of paper appearing in Fifth International AAAI
Conference on Weblogs and Social Media. 9 pages, 4 figures
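The sampling-based idea can be sketched as follows. This is an illustration of the general approach, not the authors' estimator: it assumes edges exist independently with the given probabilities, draws discrete graphs accordingly, and averages the ordinary local clustering coefficient over the draws. The function names and the `edge_probs` dictionary format are hypothetical.

```python
import random
from itertools import combinations

def sample_graph(edge_probs, rng):
    """Draw one discrete graph, keeping each edge independently with its probability."""
    adj = {}
    for (u, v), p in edge_probs.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def local_clustering(adj, node):
    """Fraction of neighbor pairs of `node` that are themselves linked."""
    nbrs = adj[node]
    if len(nbrs) < 2:
        return 0.0
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

def expected_clustering(edge_probs, node, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected clustering coefficient of `node`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += local_clustering(sample_graph(edge_probs, rng), node)
    return total / n_samples

# A triangle whose edges are each present with probability 0.5 (made-up values).
triangle = {("a", "b"): 0.5, ("b", "c"): 0.5, ("a", "c"): 0.5}
estimate = expected_clustering(triangle, "a")
```

Thresholding the probabilities into a single discrete graph would return either 0 or 1 for node `a`; averaging over samples keeps the uncertainty information in the result.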
Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of
datasets by computing quality metrics becomes an issue of crucial importance.
For large and evolving datasets, an exact, deterministic computation of the
quality metrics is too time consuming or expensive. We employ probabilistic
techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient
estimation for implementing a broad set of data quality metrics in an
approximate but sufficiently accurate way. Our implementation is integrated in
the comprehensive data quality assessment framework Luzzu. We evaluated its
performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures. To appear in ESWC 2015 proceedings
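Of the techniques named in the abstract, reservoir sampling is the simplest to sketch. The code below is the textbook Algorithm R, not the Luzzu implementation; `estimate_fraction` is a hypothetical helper showing how a fixed-size uniform sample could approximate a ratio-style quality metric over a stream too large to hold in memory.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir

def estimate_fraction(stream, predicate, k, seed=None):
    """Approximate the share of stream items satisfying `predicate`."""
    sample = reservoir_sample(stream, k, seed)
    if not sample:
        return 0.0
    return sum(1 for item in sample if predicate(item)) / len(sample)

# Hypothetical metric: share of items that are "well formed" (here: even numbers).
approx = estimate_fraction(range(100_000), lambda x: x % 2 == 0, k=500, seed=1)
```

The stream is read once and memory stays bounded by `k`, which is what makes this style of approximation attractive for large and evolving datasets.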