
    Discovering a taste for the unusual: exceptional models for preference mining

    Exceptional preferences mining (EPM) is a crossover between two subfields of data mining: local pattern mining and preference learning. EPM can be seen as a local pattern mining task that finds subsets of observations where some preference relations between labels significantly deviate from the norm. It is a variant of subgroup discovery, with rankings of labels as the target concept. We employ several quality measures that highlight subgroups featuring exceptional preferences, where the focus of what constitutes 'exceptional' varies with the quality measure: two measures look for exceptional overall ranking behavior, one measure indicates whether a particular label stands out from the rest, and a fourth measure highlights subgroups with unusual pairwise label ranking behavior. We explore a few datasets and compare with existing techniques. The results confirm that the new task EPM can deliver interesting knowledge. This research has received funding from the ECSEL Joint Undertaking, the framework programme for research and innovation Horizon 2020 (2014-2020), under Grant Agreement Number 662189-MANTIS-2014-1.

    Key Findings: California Young Adult Workforce Survey

    Presents survey results on the views of the state's youth on the economy, job security, employment prospects, and influences on job choice, as well as their attitudes toward work in the healthcare sector.

    Use of pre-transformation to cope with outlying values in important candidate genes

    Outlying values in predictors often strongly affect the results of statistical analyses in high-dimensional settings. Although they frequently occur with most high-throughput techniques, the problem is often ignored in the literature. We suggest using a very simple transformation, proposed before in a different context by Royston and Sauerbrei, as an intermediary step between array normalization and high-level statistical analysis. This straightforward univariate transformation identifies extreme values and considerably reduces the influence of outlying values in all further steps of statistical analysis without eliminating the incriminated observation or feature. The use of the transformation and its effects are demonstrated for diverse univariate and multivariate statistical analyses using nine publicly available microarray data sets.

    Subjectively Interesting Subgroup Discovery on Real-valued Targets

    Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to potentially consider, and there are infinitely many if we consider weighted combinations, even for linear combinations. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. For example, the user may want to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal information-theoretic sense) with respect to a single real-valued target attribute or a set of them. The subgroup descriptions are in terms of a succinct set of arbitrarily-typed other attributes. The approach is based on the Subjective Interestingness framework FORSIED to enable the use of prior knowledge when finding the most informative non-redundant patterns, and hence the method also supports iterative data mining. Comment: 12 pages, 10 figures, 2 tables, conference submission

    Policy options for poverty alleviation

    (Available in English only) This paper builds on earlier research to develop a methodology that simplifies the identification of the best policy options for alleviating poverty in a given country. When a population can be divided into subgroups according to some easily identifiable characteristic, the problem of alleviating poverty through a targeted mechanism can be understood as a choice among three options: (i) bringing about a marginal change in the mean income of certain subgroups; (ii) altering the income distribution within marginal subgroups; and (iii) generating a marginal change in the differences between subgroups. The methodology is applied to recent data from Mexico.

    The measurement of low- and high-impact in citation distributions: technical results

    This paper introduces a novel methodology for comparing the citation distributions of research units of a certain size working in the same homogeneous field. Given a critical citation level (CCL), we suggest using two real-valued indicators to describe the shape of any distribution: a high-impact and a low-impact measure defined over the set of articles with citations above or below the CCL. The key to this methodology is the identification of a citation distribution with an income distribution. Once this step is taken, it is easy to realize that the measurement of low-impact coincides with the measurement of economic poverty. In turn, it is equally natural to identify the measurement of high-impact with the measurement of a certain notion of economic affluence. On the other hand, it is seen that the ranking of citation distributions according to a family of low-impact measures is essentially characterized by a number of desirable axioms. Appropriately redefined, these same axioms lead to the selection of an equally convenient class of decomposable high-impact measures. These two families are shown to satisfy other interesting properties that make them potentially useful in empirical applications, including the comparison of research units working in different fields. Keywords: Research performance; Citation distribution; Poverty measurement; Impact indicators.
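
    The analogy between low-impact measurement and poverty measurement can be made concrete with a small sketch. The abstract does not state the formulas, so the code below assumes the Foster-Greer-Thorbecke (FGT) form familiar from poverty measurement, with the CCL playing the role of the poverty line. The function names, the `beta` exponent, and the sample data are hypothetical choices for illustration, not the paper's definitions.

```python
def low_impact(citations, ccl, beta=1.0):
    """FGT-style low-impact index (assumed form): mean normalized
    shortfall below the CCL, raised to the power beta."""
    n = len(citations)
    return sum(((ccl - c) / ccl) ** beta for c in citations if c < ccl) / n


def high_impact(citations, ccl, beta=1.0):
    """Mirror-image index (assumed form): mean normalized surplus
    above the CCL, raised to the power beta."""
    n = len(citations)
    return sum(((c - ccl) / ccl) ** beta for c in citations if c > ccl) / n


# Hypothetical citation counts for one research unit, with CCL = 4.
unit = [0, 2, 4, 10, 20]
low = low_impact(unit, ccl=4)
high = high_impact(unit, ccl=4)

# Additive decomposability: the index of the whole unit equals the
# size-weighted average of the indices of any partition of it.
part_a, part_b = [0, 2, 4], [10, 20]
low_from_parts = (
    len(part_a) * low_impact(part_a, ccl=4)
    + len(part_b) * low_impact(part_b, ccl=4)
) / len(unit)
```

    Under this assumed form, the decomposability mentioned in the abstract corresponds to the index of a unit being the size-weighted mean of the indices of its parts, which is what `low_from_parts` checks.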

    Distinguishing humans from computers in the game of go: a complex network approach

    We compare complex networks built from the game of go and obtained from databases of human-played games with those obtained from computer-played games. Our investigations show that statistical features of the human-based networks and the computer-based networks differ, and that these differences can be statistically significant on a relatively small number of games using specific estimators. We show that the deterministic or stochastic nature of the computer algorithm playing the game can also be distinguished from these quantities. This can be seen as a tool to implement a Turing-like test for go simulators. Comment: 7 pages, 6 figures

    Demographic trends and living standards: the case of Spain during the 1980s

    In this paper we study the evolution of the standard of living in Spain during the 1980s for a population partitioned by the following individual characteristics: the age group, the relation to economic activity, and the result of the decision on whether to live in a household headed by someone else, or to live on one's own with or without dependents. Our results help to understand the decline of inequality in Spain, which has been formerly investigated only in terms of the household head's characteristics. On the other hand, within the limits of our cross-section data, we provide some evidence on the economic rationale behind the individual decisions about early retirement, household formation, and female participation in the labor market. Keywords: Living arrangements; Individual characteristics; Inequality; Welfare.

    Epitope profiling via mixture modeling of ranked data

    We propose the use of probability models for ranked data as a useful alternative to a quantitative data analysis to investigate the outcome of bioassay experiments, when the preliminary choice of an appropriate normalization method for the raw numerical responses is difficult or subject to criticism. We review standard distance-based and multistage ranking models and, in the latter context, we propose an original generalization of the Plackett-Luce model to account for the order of the ranking elicitation process. The usefulness of the novel model is illustrated with its maximum likelihood estimation for a real data set. Specifically, we address the heterogeneous nature of experimental units via model-based clustering and detail the necessary steps for a successful likelihood maximization through a hybrid version of the Expectation-Maximization algorithm. The performance of the mixture model using the new distribution as mixture components is compared with that of alternative mixture models for random rankings. A discussion on the interpretation of the identified clusters and a comparison with more standard quantitative approaches are finally provided. Comment: (revised to properly include references)
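
    For readers unfamiliar with the model being generalized, here is a minimal sketch of the standard Plackett-Luce probability of a full ranking. It does not implement the authors' extension for the elicitation order or the mixture/EM fitting. The function name, the `support` dictionary, and its values are hypothetical.

```python
from itertools import permutations


def plackett_luce_prob(ranking, support):
    """Probability of a full ranking (most preferred first) under the
    standard Plackett-Luce model with positive support parameters.

    At each stage the next item is chosen with probability proportional
    to its support among the items not yet ranked.
    """
    prob = 1.0
    for pos, item in enumerate(ranking):
        remaining_mass = sum(support[j] for j in ranking[pos:])
        prob *= support[item] / remaining_mass
    return prob


# Hypothetical support parameters for three items.
support = {"a": 3.0, "b": 2.0, "c": 1.0}

# P(a > b > c) = 3/6 * 2/3 * 1/1
p_abc = plackett_luce_prob(("a", "b", "c"), support)

# The probabilities of all 3! rankings sum to one.
total = sum(plackett_luce_prob(r, support) for r in permutations(support))
```

    The stagewise construction is what makes this a "multistage" ranking model: each factor is a choice among the items still available.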