181 research outputs found

    An experiment with association rules and classification: post-bagging and conviction

    In this paper we study a new technique we call post-bagging, which consists of resampling parts of a classification model rather than the data. We do this with a particular kind of model: large sets of classification association rules, in combination with ordinary best-rule and weighted-voting approaches. We empirically evaluate the effects of the technique in terms of classification accuracy. We also discuss the predictive power of different metrics used for association rule mining, such as confidence, lift, conviction and χ². We conclude that, for the described experimental conditions, post-bagging improves classification results and that the best metric is conviction. Programa de Financiamento Plurianual de Unidades de I & D. Comunidade Europeia (CE). Fundo Europeu de Desenvolvimento Regional (FEDER). Fundação para a Ciência e a Tecnologia (FCT) - POSI/SRI/39630/2001/Class Project
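    The four rule metrics named in this abstract can be computed from simple co-occurrence counts. The sketch below is illustrative only and is not taken from the paper: the `RuleCounts` container, the function names and the example counts are all assumptions, and the definitions used are the standard textbook ones for a rule A -> C.

```python
from dataclasses import dataclass


@dataclass
class RuleCounts:
    """Hypothetical container of counts for a rule A -> C (not from the paper)."""
    n: int     # total number of transactions
    n_a: int   # transactions containing the antecedent A
    n_c: int   # transactions containing the consequent C
    n_ac: int  # transactions containing both A and C


def confidence(r: RuleCounts) -> float:
    # Estimated P(C | A).
    return r.n_ac / r.n_a


def lift(r: RuleCounts) -> float:
    # P(C | A) / P(C); values above 1 indicate positive association.
    return confidence(r) / (r.n_c / r.n)


def conviction(r: RuleCounts) -> float:
    # (1 - P(C)) / (1 - P(C | A)); infinite for rules that never fail.
    conf = confidence(r)
    if conf == 1.0:
        return float("inf")
    return (1 - r.n_c / r.n) / (1 - conf)


def chi_square(r: RuleCounts) -> float:
    # Pearson chi-square statistic of the 2x2 table (A / not A) x (C / not C).
    # Assumes no row or column of the table is empty.
    a = r.n_ac                      # A and C
    b = r.n_a - r.n_ac              # A and not C
    c = r.n_c - r.n_ac              # not A and C
    d = r.n - r.n_a - r.n_c + r.n_ac  # neither
    return r.n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Made-up counts, for illustration only.
rule = RuleCounts(n=100, n_a=40, n_c=50, n_ac=30)
print(confidence(rule), lift(rule), conviction(rule), chi_square(rule))
```

    Unlike lift, conviction is asymmetric in A and C, because it measures how often the rule would fail if antecedent and consequent were independent.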

    Big data: Finders keepers, losers weepers?

    This article argues that big data’s entrepreneurial potential is based not only on new technological developments that allow for the extraction of non-trivial, new insights out of existing data, but also on an ethical judgment that often remains implicit: namely, that the companies that generate these new insights can legitimately appropriate (the fruits of) these insights. As a result, the business model of big data companies is essentially founded on a libertarian-inspired ‘finders, keepers’ ethic. The article argues, next, that this presupposed ‘finders, keepers’ ethic is far from unproblematic and itself relies on multiple unconvincing assumptions. This leads to the conclusion that the conduct of companies working with big data might lack ethical justification.

    Integrating clinicians, knowledge and data: expert-based cooperative analysis in healthcare decision support

    Background: Decision support in health systems is a highly difficult task, due to the inherent complexity of the process and structures involved. Method: This paper introduces a new hybrid methodology, Expert-based Cooperative Analysis (EbCA), which incorporates explicit prior expert knowledge in data analysis methods, and elicits implicit or tacit expert knowledge (IK) to improve decision support in healthcare systems. EbCA has been applied to two different case studies, showing its usability and versatility: 1) benchmarking of small mental health areas based on technical efficiency estimated by EbCA-Data Envelopment Analysis (EbCA-DEA), and 2) case-mix of schizophrenia based on functional dependency using Clustering Based on Rules (ClBR). In both cases, comparisons with classical procedures using qualitative explicit prior knowledge were made. Bayesian predictive validity measures were used for comparison with expert panels' results. Overall agreement was tested by the Intraclass Correlation Coefficient in case 1 and kappa in both cases. Results: EbCA is a new methodology composed of 6 steps: 1) data collection and data preparation; 2) acquisition of "Prior Expert Knowledge" (PEK) and design of the "Prior Knowledge Base" (PKB); 3) PKB-guided analysis; 4) support-interpretation tools to evaluate results and detect inconsistencies (here Implicit Knowledge, IK, might be elicited); 5) incorporation of elicited IK in the PKB, repeating until a satisfactory solution is reached; 6) post-processing of results for decision support. EbCA has been useful for incorporating PEK in two different analysis methods (DEA and Clustering), applied respectively to assess technical efficiency of small mental health areas and for case-mix of schizophrenia based on functional dependency. Differences from the results obtained with classical approaches were mainly related to the IK which could be elicited by using EbCA and had major implications for the decision making in both cases. Discussion: This paper presents EbCA and shows the convenience of completing classical data analysis with PEK as a means to extract relevant knowledge in complex health domains. One of the major benefits of EbCA is iterative elicitation of IK. Both explicit and tacit or implicit expert knowledge are critical to guide the scientific analysis of very complex decisional problems such as those found in health system research.

    On plexus representation of dissimilarities

    Correspondence analysis has found widespread application in analysing vegetation gradients. However, it is not clear how robust it is to situations where structures other than a simple gradient exist. The introduction of instrumental variables in canonical correspondence analysis does not avoid these difficulties. In this paper I propose to examine some simple methods based on the notion of the plexus (sensu McIntosh), where graphs or networks are used to display some of the structure of the data so that an informed choice of models is possible. I show that two different classes of plexus model are available. These classes are distinguished by the use in one case of a global Euclidean model to obtain a well-separated pair decomposition (WSPD) of a set of points, which implicitly involves all dissimilarities, while in the other a Riemannian view is taken and emphasis is placed locally, i.e., on small dissimilarities. I show an example of each of these classes applied to vegetation data.

    What is behind a summary-evaluation decision?

    Research in psychology has reported that, among the variety of possibilities for assessment methodologies, summary evaluation offers a particularly adequate context for inferring text comprehension and topic understanding. However, grades obtained in this methodology are hard to quantify objectively. Therefore, we carried out an empirical study to analyze the decisions underlying human summary-grading behavior. The task consisted of expert evaluation of summaries produced in critically relevant contexts of summarization development, and the resulting data were modeled by means of Bayesian networks using an application called Elvira, which allows for graphically observing the predictive power (if any) of the resultant variables. Thus, in this article, we analyzed summary-evaluation decision making in a computational framework.

    Individualized markers optimize class prediction of microarray data

    BACKGROUND: Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient-sample. RESULTS: Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene-features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples, and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets, thus highlighting the effect of individuality on gene expression. CONCLUSION: In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasize the need for more flexible medical interventions.

    Genetic Epidemiology of Attention Deficit Hyperactivity Disorder (ADHD Index) in Adults

    Context: In contrast to the large number of studies in children, there is little information on the contribution of genetic factors to Attention Deficit Hyperactivity Disorder (ADHD) in adults. Objective: To estimate the heritability of ADHD in adults as assessed by the ADHD index scored from the CAARS (Conners’ Adult ADHD Rating Scales). Design: Phenotype data from over 12,000 adults (twins, siblings and parents) registered with the Netherlands Twin Register were analyzed using genetic structural equation modeling. Main outcome measures: Heritability estimates for ADHD from the twin-family study. Results: Heritability of ADHD in adults is estimated at around 30% in men and women. There is some evidence for assortative mating. All familial transmission is explained by genetic inheritance; there is no support for the hypothesis that cultural transmission from parents to offspring is important. Conclusion: Heritability for ADHD features in adults is present, but is substantially lower than it is in children.

    Accurate molecular classification of cancer using simple rules

    Background: One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensional gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible. Methods: We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets. Results: We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods. Conclusion: In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.
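    As a rough illustration of how leave-one-out cross-validation (LOOCV) scores a one-gene classifier, the sketch below holds out each sample in turn, fits a rule on the rest, and checks the held-out prediction. It does not reproduce the paper's rough-set rule induction: a simple midpoint-threshold rule is swapped in, and the function names and expression values are hypothetical toy choices.

```python
from statistics import mean


def fit_threshold_rule(values, labels):
    """Fit a one-gene rule: predict the higher-expressing class when value > threshold.

    Stand-in for the paper's rough-set decision rules (an assumption of this
    sketch). Requires exactly two classes in the training data.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("training data must contain exactly two classes")
    class_means = {
        c: mean(v for v, l in zip(values, labels) if l == c) for c in classes
    }
    low, high = sorted(classes, key=class_means.get)
    threshold = (class_means[low] + class_means[high]) / 2
    return threshold, low, high


def predict(rule, value):
    threshold, low, high = rule
    return high if value > threshold else low


def loocv_accuracy(values, labels):
    """Hold out each sample once, refit on the remainder, score the held-out sample."""
    correct = 0
    for i in range(len(values)):
        train_values = values[:i] + values[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        rule = fit_threshold_rule(train_values, train_labels)
        correct += predict(rule, values[i]) == labels[i]
    return correct / len(values)


# Toy expression levels for a single hypothetical gene (made-up numbers).
expression = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]
classes = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
print(loocv_accuracy(expression, classes))
```

    Because the rule is refit inside every fold, the held-out sample never influences its own threshold, which is what keeps the LOOCV estimate honest. Each class needs at least two samples so that every fold still sees both classes.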