
    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Dissortative From the Outside, Assortative From the Inside: Social Structure and Behavior in the Industrial Trade Network

    It is generally accepted that neighboring nodes in financial networks are negatively assorted with respect to their degrees. This feature would play an important 'damping' role in the market during downturns (periods of distress), since this connectivity pattern between firms lowers the chances of distress propagating in a self-amplifying way. In this paper we explore a trade network of industrial firms in which the nodes are suppliers or buyers, and the links are the invoices that suppliers send to their buyers and then present to their bank for discounting. The network was collected by a large Italian bank in 2007 through its intermediation of the sales on credit made by its clients. The network shows the dissortative behavior seen in other studies of financial networks. However, when looking at the credit rating of the firms, an important attribute internal to each node, we find that firms that trade with one another are overwhelmingly similar. We know that much data is missing from our data set. However, we can quantify the amount of missing data using information exposure, a variable that connects social structure and behavior: the ratio of the sales invoices that a supplier presents to their bank to their total sales. The results reveal a non-trivial and robust relationship between a firm's information exposure and its credit rating, indicating the influence of neighbors on a firm's rating. This methodology provides new insight into how to reconstruct a network suffering from incomplete information. (Comment: 10 pages, 10 figures, to appear in conference proceedings of the IEEE: HICSS-4)
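
    The two quantities the abstract leans on are degree assortativity and the information-exposure ratio. The sketch below computes both on a toy supplier-buyer network, assuming networkx for the assortativity coefficient; the field names (invoiced_sales, total_sales) and values are invented for illustration, not the bank's data.

```python
# Hedged sketch: degree assortativity on a toy supplier->buyer invoice
# network, plus a per-supplier "information exposure" ratio.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("s1", "b1"), ("s1", "b2"), ("s2", "b1"),
                  ("s2", "b2"), ("s3", "b3")])  # supplier -> buyer invoices

# Negative values indicate dissortative (high-degree to low-degree) mixing.
print(nx.degree_assortativity_coefficient(G))

invoiced_sales = {"s1": 120.0, "s2": 80.0, "s3": 15.0}   # presented to the bank
total_sales    = {"s1": 200.0, "s2": 90.0, "s3": 150.0}  # all sales on credit
exposure = {s: invoiced_sales[s] / total_sales[s] for s in invoiced_sales}
print(exposure)   # ratio of bank-visible sales to total sales, per supplier
```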

    CLaSPS: a new methodology for knowledge extraction from complex astronomical datasets

    In this paper we present the Clustering-Labels-Score Patterns Spotter (CLaSPS), a new methodology for determining correlations among astronomical observables in complex datasets, based on the application of distinct unsupervised clustering techniques. The novelty in CLaSPS is the criterion used for selecting the optimal clusterings, based on a quantitative measure of the degree of correlation between the cluster memberships and the distribution of a set of observables, the labels, not employed for the clustering. In this paper we discuss the application of CLaSPS to two simple astronomical datasets, both composed of extragalactic sources with photometric observations at different wavelengths from large-area surveys. The first dataset, CSC+, is composed of optical quasars spectroscopically selected in the SDSS data, observed in the X-rays by Chandra and with multi-wavelength observations in the near-infrared, optical and ultraviolet spectral intervals. One result of applying CLaSPS to CSC+ is the re-identification of a well-known correlation between the alpha_OX parameter and the near-ultraviolet color, in a subset of CSC+ sources with relatively small values of the near-ultraviolet colors. The other dataset consists of a sample of blazars for which photometric observations in the optical, mid- and near-infrared are available, complemented, for a subset of the sources, by Fermi gamma-ray data. The main results of applying CLaSPS to these datasets are the discovery of a strong correlation between the multi-wavelength color distribution of blazars and their optical spectral classification into BL Lacs and Flat Spectrum Radio Quasars, and a peculiar pattern followed by blazars in the WISE mid-infrared color space. This pattern and its physical interpretation are discussed in detail in other papers by one of the authors. (Comment: 18 pages, 9 figures, accepted for publication in Ap)
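
    The core idea, scoring candidate clusterings by how well memberships correlate with labels that were withheld from the clustering, can be sketched as below. This is a loose stand-in, not the paper's scoring criterion: normalized mutual information replaces CLaSPS's own correlation measure, and KMeans stands in for the paper's set of clustering techniques.

```python
# Hedged sketch of clustering selection driven by held-out labels:
# pick the clustering whose memberships best align with labels that
# were NOT used during clustering (NMI as an assumed stand-in score).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

X, labels = make_blobs(n_samples=300, centers=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 8):
    memberships = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = normalized_mutual_info_score(labels, memberships)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))   # clustering best correlated with labels
```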

    A genetic graph-based approach for partitional clustering

    Clustering is one of the most versatile tools for data analysis. In recent years, clustering that seeks the continuity of data (as opposed to classical centroid-based approaches) has attracted increasing research interest. It is a challenging problem of remarkable practical interest. The most popular continuity-based clustering method is the spectral clustering (SC) algorithm, which is based on graph cuts: it first generates a similarity graph using a distance measure and then studies its graph spectrum to find the best cut. This approach is sensitive to the parameters of the metric, and a correct parameter choice is critical to the quality of the clustering. This work proposes a new algorithm, inspired by SC, that reduces the parameter dependency while maintaining the quality of the solution. The new algorithm, named genetic graph-based clustering (GGC), takes an evolutionary approach, introducing a genetic algorithm (GA) to cluster the similarity graph. Experimental validation shows that GGC increases the robustness of SC and performs competitively against classical clustering methods, at least on the synthetic and real datasets used in the experiments.
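
    The parameter sensitivity that GGC targets is easy to reproduce with the SC baseline itself. The sketch below, using scikit-learn's SpectralClustering on the two-moons data, shows how the quality of the cut swings with the RBF similarity parameter gamma; GGC's GA-based grouping of the similarity graph is not reproduced here.

```python
# Sketch of the spectral-clustering baseline and its metric-parameter
# sensitivity: accuracy of the cut varies strongly with the RBF gamma.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
for gamma in (0.1, 1.0, 10.0, 100.0):
    pred = SpectralClustering(n_clusters=2, affinity="rbf",
                              gamma=gamma, random_state=0).fit_predict(X)
    # ARI = 1.0 means a perfect recovery of the two moons.
    print(gamma, round(adjusted_rand_score(y, pred), 3))
```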

    Intelligent Information Access to Linked Data - Weaving the Cultural Heritage Web

    The subject of this dissertation is an information alignment experiment involving two cultural heritage information systems (ALAP): the Perseus Digital Library and Arachne. In modern societies, information integration is gaining importance for many tasks, such as business decision making or even catastrophe management. It is beyond doubt that information available in digital form can offer users new ways of interaction. In the humanities and cultural heritage communities, too, more and more information is being published online. But in many situations the way information has been made publicly available is disruptive to the research process due to its heterogeneity and distribution. Integrated information will therefore be a key factor in pursuing successful research, and the need for information alignment is widely recognized. ALAP is an attempt to integrate information from Perseus and Arachne, not only on a schema level, but also by performing entity resolution. To that end, technical peculiarities and philosophical implications of the concepts of identity and co-reference are discussed. Multiple approaches to information integration and entity resolution are discussed and evaluated. The methodology used to implement ALAP is rooted mainly in the fields of information retrieval and knowledge discovery. First, an exploratory analysis was performed on both information systems to get a first impression of the data. Then, (semi-)structured information from both systems was extracted and normalized. Next, a clustering algorithm was used to reduce the number of entity comparisons needed. Finally, a thorough matching was performed within the different clusters. ALAP helped identify challenges and highlighted opportunities that arise in attempts to align cultural heritage information systems.
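
    The normalize, block, then match pipeline described above is a standard entity-resolution pattern; the sketch below illustrates it under assumptions of my own (record fields, blocking key, string-similarity threshold are all invented, not ALAP's actual choices).

```python
# Hedged sketch of blocking-then-matching entity resolution: cheap keys
# limit the pairwise comparisons, thorough matching runs inside blocks.
from difflib import SequenceMatcher
from collections import defaultdict

records = [
    {"id": "perseus:1", "name": "Athena Parthenos", "type": "statue"},
    {"id": "arachne:9", "name": "athena  parthenos", "type": "statue"},
    {"id": "arachne:7", "name": "Temple of Zeus", "type": "building"},
]

def normalize(name):
    return " ".join(name.lower().split())

# Blocking: only records sharing the first name token get compared.
blocks = defaultdict(list)
for r in records:
    blocks[normalize(r["name"]).split()[0]].append(r)

# Matching: thorough pairwise comparison inside each block.
for block in blocks.values():
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            a, b = block[i], block[j]
            sim = SequenceMatcher(None, normalize(a["name"]),
                                  normalize(b["name"])).ratio()
            if sim > 0.9:
                print("candidate match:", a["id"], b["id"], round(sim, 2))
```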

    Operationalizing Individual Fairness with Pairwise Fair Representations

    We revisit the notion of individual fairness proposed by Dwork et al. A central challenge in operationalizing their approach is the difficulty of eliciting a human specification of a similarity metric. In this paper, we propose an operationalization of individual fairness that does not rely on a human specification of a distance metric. Instead, we propose novel approaches to elicit and leverage side-information on equally deserving individuals to counter subordination between social groups. We model this knowledge as a fairness graph and learn a unified Pairwise Fair Representation (PFR) of the data that captures both the data-driven similarity between individuals and the pairwise side-information in the fairness graph. We elicit fairness judgments from a variety of sources, including human judgments, for two real-world datasets on recidivism prediction (COMPAS) and violent neighborhood prediction (Crime & Communities). Our experiments show that the PFR model for operationalizing individual fairness is practically viable. (Comment: To be published in the proceedings of the VLDB Endowment, Vol. 13, Issue.)
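
    To make the fairness-graph idea concrete, here is a deliberately crude sketch, not the paper's PFR algorithm: a linear map is trained to pull together pairs linked in the fairness graph, with an orthogonality penalty as a rough proxy for preserving data-driven geometry. The pairs, loss weights and learning rate are all assumptions.

```python
# Illustrative only: a linear embedding that draws fairness-graph
# neighbours together, regularized so the map does not collapse.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
fair_pairs = [(0, 1), (2, 3), (4, 5)]        # "equally deserving" pairs

W = rng.normal(scale=0.1, size=(5, 3))       # map to a 3-d representation
lr, lam = 0.01, 0.1
for _ in range(200):
    Z = X @ W
    grad = np.zeros_like(W)
    # Pull fairness-graph neighbours together in representation space:
    # gradient of sum ||(x_i - x_j) W||^2 wrt W.
    for i, j in fair_pairs:
        grad += 2 * np.outer(X[i] - X[j], Z[i] - Z[j])
    # Orthogonality penalty ||W^T W - I||_F^2 keeps W from collapsing.
    grad += lam * 4 * W @ (W.T @ W - np.eye(W.shape[1]))
    W -= lr * grad

print(np.linalg.norm((X @ W)[0] - (X @ W)[1]))  # linked pair ends up close
```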

    Extended Stochastic Block Models with Application to Criminal Networks

    Reliably learning group structure among nodes in network data is challenging in modern applications. We are motivated by covert networks encoding relationships among criminals. These data are subject to measurement errors and exhibit a complex combination of an unknown number of core-periphery, assortative and disassortative structures that may unveil the internal architecture of the criminal organization. The coexistence of such noisy block structures limits the reliability of the community detection algorithms routinely applied to criminal networks, and requires extensions of model-based solutions that realistically characterize the node partition process, incorporate information from node attributes, and provide improved strategies for estimation, uncertainty quantification, model selection and prediction. To address these goals, we develop a novel class of extended stochastic block models (ESBM) that infer groups of nodes with common connectivity patterns via Gibbs-type priors on the partition process. This choice encompasses several realistic priors for criminal networks, covering solutions with a fixed, random or infinite number of possible groups, and facilitates the inclusion of node attributes in a principled manner. Among the new alternatives in our class, we focus on the Gnedin process as a realistic prior that allows the number of groups to be finite, random and subject to a reinforcement process coherent with the modular structures of organized crime. A collapsed Gibbs sampler is proposed for the whole ESBM class, and refined strategies for estimation, prediction, uncertainty quantification and model selection are outlined. ESBM performance is illustrated in realistic simulations and in an application to an Italian Mafia network, where we learn key block patterns revealing a complex hierarchical structure of the organization, mostly hidden from state-of-the-art alternative solutions.
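
    For intuition on what a collapsed Gibbs update for a block model looks like, here is a toy single-node reassignment step, heavily simplified relative to the ESBM: a plain CRP prior stands in for the Gnedin process, there are no node attributes, and Beta(1,1) priors on block-pair edge probabilities are integrated out analytically.

```python
# Toy sketch (not the ESBM sampler): collapsed Gibbs weights for moving
# one node of a binary undirected SBM into each block, or a new one.
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(1)
n = 10
A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
A = A + A.T                                   # symmetric binary adjacency
z = rng.integers(0, 2, size=n)                # current partition
alpha = 1.0                                   # CRP concentration (assumed)
v = 0                                         # node being reassigned
others = [u for u in range(n) if u != v]
blocks = sorted(set(int(z[u]) for u in others))

def edge_stats(k, l):
    """Edges / possible pairs between blocks k and l, excluding node v."""
    ku = [u for u in others if z[u] == k]
    lu = [u for u in others if z[u] == l]
    if k == l:
        pairs = [(a, b) for i, a in enumerate(ku) for b in ku[i + 1:]]
    else:
        pairs = [(a, b) for a in ku for b in lu]
    return sum(A[a, b] for a, b in pairs), len(pairs)

logw = []
for k in blocks + [max(blocks) + 1]:          # existing blocks + a fresh one
    n_k = sum(1 for u in others if z[u] == k)
    lp = np.log(n_k if n_k > 0 else alpha)    # CRP prior weight
    for l in blocks:
        e_vl = sum(A[v, u] for u in others if z[u] == l)
        n_l = sum(1 for u in others if z[u] == l)
        e, m = edge_stats(k, l)
        # Beta-Binomial ratio: likelihood of v's edges to block l, given
        # the edges already observed between blocks k and l.
        lp += (betaln(1 + e + e_vl, 1 + (m - e) + (n_l - e_vl))
               - betaln(1 + e, 1 + m - e))
    logw.append(lp)

w = np.exp(np.array(logw) - max(logw))
print((w / w.sum()).round(3))                 # reassignment probabilities
```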