Probabilistic Sparse Subspace Clustering Using Delayed Association

Author: Foroosh Hassan
Jaberi Maryam
Pensky Marianna
Publication date: 28/08/2018

Discovering and clustering subspaces in high-dimensional data is a fundamental problem of machine learning with a wide range of applications in data mining, computer vision, and pattern recognition. Earlier methods divided the problem into two separate stages: finding the similarity matrix and finding the clusters. Similar to some recent works, we integrate these two steps using a joint optimization approach. We make the following contributions: (i) We estimate the reliability of the cluster assignment for each point before assigning the point to a subspace. We divide the data points into two groups, "certain" and "uncertain", and delay the assignment of the latter group until its subspace association certainty improves. (ii) We demonstrate that delayed association is better suited for clustering subspaces that have ambiguities, i.e., when subspaces intersect or data are contaminated with outliers or noise. (iii) We demonstrate experimentally that such delayed probabilistic association leads to a more accurate self-representation and more accurate final clusters. The proposed method has higher accuracy both for points that lie exclusively in one subspace and for those that lie on the intersection of subspaces. (iv) We show that delayed association leads to a large reduction in computational cost, since it allows for incremental spectral clustering.
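The "certain"/"uncertain" split described in the abstract above can be illustrated with a minimal sketch. The abstract does not say how reliability is estimated, so everything here is an assumption: the function name `split_by_certainty`, the use of per-point membership probabilities as input, and the fixed `threshold` are hypothetical and are not the authors' method.

```python
def split_by_certainty(probs, threshold=0.8):
    """Assign a point to its most probable subspace only when that
    probability reaches `threshold`; otherwise delay the decision.

    probs: list of rows, one per point, each row holding hypothetical
           membership probabilities over the candidate subspaces.
    Returns one label per point; -1 marks an "uncertain" (delayed) point.
    """
    labels = []
    for row in probs:
        # Index of the most probable subspace for this point.
        best = max(range(len(row)), key=lambda j: row[j])
        # Commit only if the assignment looks reliable enough.
        labels.append(best if row[best] >= threshold else -1)
    return labels


if __name__ == "__main__":
    memberships = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
    print(split_by_certainty(memberships))
```

In an iterative scheme, points labelled -1 would be revisited in a later round once the membership estimates have been updated.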

arXiv.org e-Print Archive

Crossref

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Labor Market Entry and Earnings Dynamics: Bayesian Inference Using Mixtures-of-Experts Markov Chain Clustering

Author: Andrea Weber
Christoph Pamminger
Rudolf Winter-Ebmer
Sylvia Frühwirth-Schnatter

This paper analyzes patterns in the earnings development of young labor market entrants over their life cycle. We identify four distinctly different types of transition patterns between discrete earnings states in a large administrative data set. Further, we investigate the effects of labor market conditions at the time of entry on the probability of belonging to each transition type. To estimate our statistical model we use a model-based clustering approach. The statistical challenge in our application comes from the difficulty of extending distance-based clustering approaches to the problem of identifying groups of similar time series in a panel of discrete-valued time series. We use Markov chain clustering, proposed by Pamminger and Frühwirth-Schnatter (2010), which is an approach for clustering discrete-valued time series obtained by observing a categorical variable with several states. This method is based on finite mixtures of first-order time-homogeneous Markov chain models. In order to analyze group membership we present an extension to this approach by formulating a probabilistic model for the latent group indicators within the Bayesian classification rule using a multinomial logit model.

Keywords: Labor Market Entry Conditions, Transition Data, Markov Chain Monte Carlo, Multinomial Logit, Panel Data, Auxiliary Mixture Sampler, Bayesian Statistics
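The building block named in the abstract above, a first-order time-homogeneous Markov chain over discrete states, can be sketched as follows. This is only an illustration of that building block, not the mixture model or the Bayesian sampler from the paper; the function names and the pseudo-count `prior` are assumptions introduced here.

```python
import math


def transition_matrix(sequence, n_states, prior=1.0):
    """Estimate a first-order transition matrix from one sequence of
    integer states in range(n_states).

    `prior` is a pseudo-count added to every cell (an assumption of this
    sketch) so that unseen transitions keep non-zero probability.
    """
    counts = [[prior] * n_states for _ in range(n_states)]
    # Count each observed transition from state a to state b.
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    # Normalise every row so that it sums to one.
    return [[c / sum(row) for c in row] for row in counts]


def log_likelihood(sequence, matrix):
    """Log-probability of the observed transitions under `matrix`
    (the initial state is treated as given)."""
    return sum(math.log(matrix[a][b]) for a, b in zip(sequence, sequence[1:]))


if __name__ == "__main__":
    earnings_states = [0, 1, 0, 1]  # hypothetical discrete earnings states
    P = transition_matrix(earnings_states, n_states=2)
    print(P)
    print(log_likelihood(earnings_states, P))
```

In a mixture of such chains, each cluster would carry its own transition matrix, and a series would be compared across clusters by its log-likelihood under each matrix.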

Research Papers in Economics

Methods to Determine Node Centrality and Clustering in Graphs with Uncertain Structure

Author: Neville Jennifer
Pfeiffer III Joseph J.
Publication date: 01/01/2011

Much of the past work in network analysis has focused on analyzing discrete graphs, where binary edges represent the "presence" or "absence" of a relationship. Since traditional network measures (e.g., betweenness centrality) utilize a discrete link structure, complex systems must be transformed to this representation in order to investigate network properties. However, in many domains there may be uncertainty about the relationship structure, and any uncertainty information would be lost in translation to a discrete representation. Uncertainty may arise in domains where there is moderating link information that cannot be easily observed, i.e., links may become inactive over time without being dropped, or observed links may not always correspond to a valid relationship. In order to represent and reason with these types of uncertainty, we move beyond the discrete graph framework and develop social network measures based on a probabilistic graph representation. More specifically, we develop measures of path length, betweenness centrality, and clustering coefficient: one set based on sampling and one based on probabilistic paths. We evaluate our methods on three real-world networks from Enron, Facebook, and DBLP, showing that our proposed methods more accurately capture salient effects without being susceptible to local noise, and that the resulting analysis produces a better understanding of the graph structure and the uncertainty resulting from its change over time.

Comment: Longer version of paper appearing in Fifth International AAAI Conference on Weblogs and Social Media. 9 pages, 4 figures.
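The sampling-based idea mentioned in the abstract above can be sketched for path length. The abstract does not give the estimator, so this is a generic Monte Carlo illustration under stated assumptions: edges are undirected and independent, each with its own existence probability, and the names `sample_graph`, `bfs_distance`, and `expected_path_length` are hypothetical.

```python
import random
from collections import deque


def sample_graph(edge_probs, rng):
    """Draw one discrete graph: keep each edge with its own probability."""
    adj = {}
    for (u, v), p in edge_probs.items():
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def bfs_distance(adj, source, target):
    """Shortest hop count from source to target, or None if unreachable."""
    if source not in adj or target not in adj:
        return None
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return None


def expected_path_length(edge_probs, source, target, n_samples=1000, seed=0):
    """Average shortest-path length over sampled graphs in which the two
    nodes are connected, plus the fraction of samples where they are."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_samples):
        d = bfs_distance(sample_graph(edge_probs, rng), source, target)
        if d is not None:
            lengths.append(d)
    mean = sum(lengths) / len(lengths) if lengths else None
    return mean, len(lengths) / n_samples


if __name__ == "__main__":
    edges = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
    print(expected_path_length(edges, "a", "c"))
```

Reporting the connected fraction alongside the mean is a design choice of this sketch: disconnected samples have no finite path length, so they are counted separately rather than averaged in.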

arXiv.org e-Print Archive

CiteSeerX

Association for the Advancement of Artificial Intelligence: AAAI Publications

Quality Assessment of Linked Datasets using Probabilistic Approximation

Author: A Hogan
AZ Broder
BH Bloom
C Guéret
JS Vitter
P Hitzler
Publication date: 17/03/2015

With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation for implementing a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated in the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance.

Comment: 15 pages, 2 figures, to appear in ESWC 2015 proceedings.
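Of the techniques named in the abstract above, reservoir sampling is the simplest to show in a few lines. The sketch below is the textbook single-pass variant (often called Algorithm R) and is not taken from Luzzu; the function name and the `seed` parameter are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using one pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical stream of triple identifiers from a large dataset.
    triples = (f"triple-{n}" for n in range(10_000))
    print(reservoir_sample(triples, k=5, seed=42))
```

A quality metric could then be computed on the sample instead of the full dataset, trading exactness for a bounded memory footprint.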

arXiv.org e-Print Archive

Crossref

Fraunhofer-ePrints