89 research outputs found

    Dealing with Missing Data and Uncertainty in the Context of Data Mining

    Missing data is an issue in many real-world datasets, yet robust methods for dealing with missing data appropriately still need development. In this paper we investigate how some methods for handling missing data perform as the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository, we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier on the basis of complete case analysis and simple imputation, and then we study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible when missing data increases, particularly for high dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested, but the algorithms, whether using preprocessing in the form of simple imputation or handling the missing data themselves, do not show a significant difference in performance.
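
    The effect of MCAR missingness on complete case analysis can be illustrated with a minimal sketch. It assumes missingness is applied independently to each cell, which is only one possible reading of the attribute-level set-up described above; the function names `inject_mcar`, `complete_cases` and `expected_complete_fraction` are hypothetical and do not come from the paper.

```python
import random


def inject_mcar(rows, rate, seed=0):
    """Return a copy of `rows` where each cell is replaced by None
    with probability `rate`, independently of every other cell (MCAR).
    Assumption: cell-level missingness, not the paper's exact protocol."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else value for value in row]
            for row in rows]


def complete_cases(rows):
    """Complete case analysis: keep only records with no missing value."""
    return [row for row in rows if all(value is not None for value in row)]


def expected_complete_fraction(rate, n_attributes):
    """Under independent cell-level MCAR, a record stays complete only if
    all of its attributes are observed: (1 - rate) ** n_attributes."""
    return (1.0 - rate) ** n_attributes
```

    Under this assumption, a dataset with 20 attributes and 10% of cells missing keeps only about (1 - 0.1)^20 ≈ 12% of its records, which is consistent with the abstract's remark that complete case analysis becomes infeasible for high dimensional data.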

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single and multiple imputation methods to recover the missing values and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation by using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach, combining multiple imputation with ensemble techniques, outperforms the others, particularly as missing data increases.
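
    The general idea of combining multiple imputation with an ensemble can be sketched as follows. This is not the paper's implementation: the random-draw imputation, the nearest-centroid classifier and the majority vote are simplified stand-ins chosen to keep the example self-contained, and all function names are hypothetical. The sketch also assumes numeric attributes and at least one observed value per attribute.

```python
import random
from collections import Counter


def impute_once(rows, rng):
    """One stochastic imputation: fill each None with a random draw from
    the observed values of the same column (assumes none is fully empty)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(r)] for r in rows]


def fit_centroids(rows, labels):
    """Stand-in classifier: compute one mean vector per class."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest in squared distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def mie_predict(rows, labels, x, m=5, seed=0):
    """Train one model per imputed copy of the data and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(m):
        model = fit_centroids(impute_once(rows, rng), labels)
        votes[predict_centroid(model, x)] += 1
    return votes.most_common(1)[0][0]
```

    Each of the `m` imputed copies fills the gaps differently, so the vote averages over imputation uncertainty instead of committing to a single filled-in dataset.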

    The pressure to communicate efficiently continues to shape language use later in life

    Language use is shaped by a pressure to communicate efficiently, yet the tendency towards redundancy is said to increase in older age. The longstanding assumption is that saying more than is necessary is inefficient and may be driven by age-related decline in inhibition (i.e., the ability to filter out irrelevant information). However, recent work proposes an alternative account of efficiency: in certain contexts, redundancy facilitates communication (e.g., when the colour or size of an object is perceptually salient and its mention aids the listener’s search). A critical question follows: are older adults indiscriminately redundant, or do they modulate their use of redundant information to facilitate communication? We tested efficiency and cognitive capacities in 200 adults aged 19–82. Irrespective of age, adults with better attention switching skills were redundant in efficient ways, demonstrating that the pressure to communicate efficiently continues to shape language use later in life.

    Are all ‘research fields’ equal? Rethinking practice for the use of data from crowd-sourcing market places

    New technologies like large-scale social media sites (e.g., Facebook and Twitter) and crowdsourcing services (e.g., Amazon Mechanical Turk, Crowdflower, Clickworker) impact social science research and provide many new and interesting avenues for research. The use of these new technologies for research has not been without challenges, and a recently published psychological study on Facebook led to a widespread discussion on the ethics of conducting large-scale experiments online. Surprisingly little has been said about the ethics of conducting research using commercial crowdsourcing market places. In this paper, I focus on the ethical questions raised by data collection with crowdsourcing tools. I briefly draw on implications of internet research more generally and then focus on the specific challenges that research with crowdsourcing tools faces. I identify fair pay and related issues of respect for autonomy, as well as problems with power dynamics between researcher and participant, which have implications for ‘withdrawal-without-prejudice’, as the major ethical challenges with crowdsourced data. Further, I draw attention to how we can develop a ‘best practice’ for researchers using crowdsourcing tools.

    Analyses of genome architecture and gene expression reveal novel candidate virulence factors in the secretome of Phytophthora infestans

    Background: Phytophthora infestans is the most devastating pathogen of potato and a model organism for the oomycetes. It exhibits high evolutionary potential and rapidly adapts to host plants. The P. infestans genome experienced a repeat-driven expansion relative to the genomes of Phytophthora sojae and Phytophthora ramorum and shows a discontinuous distribution of gene density. Effector genes, such as members of the RXLR and Crinkler (CRN) families, localize to expanded, repeat-rich and gene-sparse regions of the genome. This distinct genomic environment is thought to contribute to genome plasticity and host adaptation. Results: We used in silico approaches to predict and describe the repertoire of P. infestans secreted proteins (the secretome). We defined the "plastic secretome" as a subset of the genome that (i) encodes predicted secreted proteins, (ii) is excluded from genome segments orthologous to the P. sojae and P. ramorum genomes and (iii) is encoded by genes residing in gene sparse regions of P. infestans genome. Although including only ~3% of P. infestans genes, the plastic secretome contains ~62% of known effector genes and shows >2 fold enrichment in genes induced in planta. We highlight 19 plastic secretome genes induced in planta but distinct from previously described effectors. This list includes a trypsin-like serine protease, secreted oxidoreductases, small cysteine-rich proteins and repeat containing proteins that we propose to be novel candidate virulence factors. Conclusions: This work revealed a remarkably diverse plastic secretome. It illustrates the value of combining genome architecture with comparative genomics to identify novel candidate virulence factors from pathogen genomes.

    Modulating RNA structure and catalysis: lessons from small cleaving ribozymes

    RNA is a key molecule in life, and comprehending its structure/function relationships is a crucial step towards a more complete understanding of molecular biology. Even though most of the information required for the correct folding of RNA species is contained in their primary sequences, we are as yet unable to accurately predict both their folding pathways and their active tertiary structures. Ribozymes are interesting molecules to study when addressing these questions because any modifications in their structures are often reflected in their catalytic properties. Recent progress in the study of the structures, the folding pathways and the modulation of the small ribozymes derived from natural, self-cleaving RNA motifs has significantly contributed to today’s knowledge in the field.

    Present state and future perspectives of using pluripotent stem cells in toxicology research

    The use of novel drugs and chemicals requires reliable data on their potential toxic effects on humans. Current test systems are mainly based on animals or in vitro–cultured animal-derived cells and do not, or do not sufficiently, mirror the situation in humans. Therefore, in vitro models based on human pluripotent stem cells (hPSCs) have become an attractive alternative. The article summarizes the characteristics of pluripotent stem cells, including embryonic carcinoma and embryonic germ cells, and discusses the potential of pluripotent stem cells for safety pharmacology and toxicology. Special attention is directed to the potential application of embryonic stem cells (ESCs) and induced pluripotent stem cells (iPSCs) for the assessment of developmental toxicology as well as cardio- and hepatotoxicology. With respect to embryotoxicology, recent achievements of the embryonic stem cell test (EST) are described, and current limitations as well as prospects of embryotoxicity studies using pluripotent stem cells are discussed. Furthermore, recent efforts to establish hPSC-based cell models for testing cardio- and hepatotoxicity are presented. In this context, methods for differentiation and selection of cardiac and hepatic cells from hPSCs are summarized, requirements and implications with respect to the use of these cells in safety pharmacology and toxicology are presented, and future challenges and perspectives of using hPSCs are discussed.

    Activation of the Arabidopsis thaliana Immune System by Combinations of Common ACD6 Alleles

    A fundamental question in biology is how multicellular organisms distinguish self and non-self. The ability to make this distinction allows animals and plants to detect and respond to pathogens without triggering immune reactions directed against their own cells. In plants, inappropriate self-recognition results in the autonomous activation of the immune system, causing affected individuals to grow less well. These plants also suffer from spontaneous cell death, but are at the same time more resistant to pathogens. Known causes for such autonomous activation of the immune system are hyperactive alleles of immune regulators, or epistatic interactions between immune regulators and unlinked genes. We have discovered a third class, in which the Arabidopsis thaliana immune system is activated by interactions between natural alleles at a single locus, ACCELERATED CELL DEATH 6 (ACD6). There are two main types of these interacting alleles, one of which has evolved recently by partial resurrection of a pseudogene, and each type includes multiple functional variants. Most previously studied hybrid necrosis cases involve rare alleles found in geographically unrelated populations. These two types of ACD6 alleles instead occur at low frequency throughout the range of the species, and have risen to high frequency in the Northeast of Spain, suggesting a role in local adaptation. In addition, such hybrids occur in these populations in the wild. The extensive functional variation among ACD6 alleles points to a central role of this locus in fine-tuning pathogen defenses in natural populations.