
    Evidence for interplay between genes and parenting on infant temperament in the first year of life: monoamine oxidase A polymorphism moderates effects of maternal sensitivity on infant anger proneness

    Background: The low-expression polymorphism of the MAOA gene, in interaction with adverse environments (G × E), is associated with antisocial behaviour disorders. These have their origins in early life, but it is not known whether MAOA G × E operates in infancy. We therefore examined whether MAOA G × E predicts infant anger proneness, a temperamental dimension associated with later antisocial behaviour disorders. In contrast to previous studies, we examined MAOA G × E prospectively, using an observational measure of a key aspect of the infant environment, maternal sensitivity, at a specified developmental time point. Methods: In a stratified epidemiological cohort recruited during pregnancy, we ascertained MAOA status (low- vs. high-expression alleles) from the saliva of 193 infants, and examined whether maternal sensitivity assessed at 29 weeks of age predicted maternal report of infant temperament at 14 months. Results: Analyses, weighted to provide general population estimates, indicated a robust interaction between MAOA status and maternal sensitivity in the prediction of infant anger proneness (p = .003), which became stronger once possible confounders of maternal sensitivity were included in the model (p = .0001). The interaction terms were similar in males (p = .010) and females (p = .016), but the effects differed as a consequence of an additional sex-of-infant by maternal-sensitivity interaction. Conclusions: This prospective study provides the first evidence of moderation by the MAOA gene of effects of parenting on infant anger proneness, an important early risk factor for the development of disruptive and aggressive behaviour disorders.
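
    The weighted interaction analysis described above can be illustrated with a minimal sketch. The data frame, column names, and coding below are hypothetical stand-ins, and weighted least squares with a product term is a plausible generic form of such a G × E test, not necessarily the published model.

    # Minimal sketch of a weighted regression with a gene-by-environment
    # (G x E) interaction term. All variable names and data are hypothetical.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical cohort data: one row per infant.
    df = pd.DataFrame({
        "anger":       np.random.rand(193),          # anger proneness at 14 months
        "maoa_low":    np.random.randint(0, 2, 193), # 1 = low-expression MAOA allele
        "sensitivity": np.random.rand(193),          # observed maternal sensitivity
        "weight":      np.random.rand(193) + 0.5,    # weights for population estimates
    })

    # Weighted least squares; the product term carries the G x E test.
    model = smf.wls("anger ~ maoa_low * sensitivity",
                    data=df, weights=df["weight"]).fit()
    print(model.summary())  # inspect the 'maoa_low:sensitivity' coefficient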

    Machine learning techniques for identification using mobile and social media data

    Networked access and mobile devices enable near-constant data generation and collection. Users, environments, and applications each generate different types of data; from data voluntarily posted on social networks to data collected by sensors on mobile devices, it is becoming trivial to access big data caches. Processing sufficiently large amounts of data yields inferences that can be characterized as privacy invasive. In order to address privacy risks, we must understand the limits of the data by exploring relationships between variables and how the user is reflected in them. In this dissertation we look at data collected from social networks and sensors to identify some aspect of the user or their surroundings. In particular, we find that from social media metadata we can identify individual user accounts, and from magnetic field readings we can identify both the (unique) cellphone device owned by the user and their coarse-grained location. In each project we collect real-world datasets and apply supervised learning techniques, particularly multi-class classification algorithms, to test our hypotheses. We use both leave-one-out and k-fold cross-validation to reduce bias in the results. Throughout the dissertation we find that unprotected data reveals sensitive information about users. Each chapter also contains a discussion of possible obfuscation techniques or countermeasures and their effectiveness with regard to the conclusions we present. Overall, our results show that deriving information about users is attainable and, with each of these results, users would have little if any indication that any kind of analysis was taking place.
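
    The evaluation strategy named above (multi-class classification scored with both leave-one-out and k-fold cross-validation) can be sketched generically. The features, labels, and choice of random forest below are placeholder assumptions, not the dissertation's actual datasets or models.

    # Minimal sketch: multi-class classification evaluated with k-fold and
    # leave-one-out cross-validation. Data here is synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 16))    # e.g. magnetometer features per reading
    y = rng.integers(0, 5, size=100)  # e.g. one class per device or user

    clf = RandomForestClassifier(n_estimators=50, random_state=0)

    # k-fold CV (k = 10) reduces bias from any single train/test split.
    kfold = cross_val_score(clf, X, y, cv=10)
    print("10-fold accuracy: %.3f +/- %.3f" % (kfold.mean(), kfold.std()))

    # Leave-one-out CV: one held-out sample per fold.
    loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print("LOO accuracy: %.3f" % loo.mean())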

    Exploring the importance of cell-type-specific gene expression regulation and splicing in Parkinson’s disease

    Parkinson’s disease (PD) is defined primarily as a movement disorder, but its symptoms extend beyond the diagnosis-defining motor symptoms. Among non-motor symptoms, dementia is one of the most common and debilitating, yet it remains relatively understudied in comparison to motor symptoms, in part due to the considerable clinical, genetic and pathologic overlap between Parkinson’s disease with dementia (PDD) and dementia with Lewy bodies (DLB). Common to all three diseases is a lack of disease-modifying therapies, the development of which requires knowledge of the genes, cell types and biological pathways affected in disease. In this thesis, publicly available brain-relevant functional genomic annotations were used to identify PD-relevant pathways and cell types in silico. PD heritability was not found to be enriched in a specific cell type or state; however, it was significantly enriched in a lysosomal and a loss-of-function-intolerant gene set, with the former highly expressed in astrocytic, microglial and oligodendrocyte subtypes and the latter highly expressed in almost all tested cellular subtypes. In addition, new annotations were generated by applying bulk-tissue and single-nucleus RNA-sequencing to anterior cingulate cortex samples derived from individuals with PD, PDD and DLB. This pairing permitted cellular deconvolution of bulk-tissue gene expression; estimation of bulk-tissue cell-type abundances; and in-depth splicing analyses. These analyses found that PD, PDD and DLB were associated not just with one but with several cell types, including neuronal, glial and vascular cell types, suggesting that these are disorders of global pathways working across various cell types. Furthermore, these analyses illustrated the commonalities and differences between the three diseases in terms of associated pathways, cell types, and upstream regulators of splicing, observations that can be used to begin building a biological basis on which to distinguish these disorders.
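
    The cellular deconvolution step mentioned above can be illustrated generically: bulk expression is modelled as a mixture of cell-type signature profiles, and proportions are recovered by constrained regression. Non-negative least squares below is one common generic approach, not necessarily the specific method used in the thesis, and all data is simulated.

    # Minimal sketch of bulk-tissue deconvolution via non-negative least squares.
    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(1)
    n_genes, n_celltypes = 500, 4

    # Signature matrix: mean expression per cell type (in practice derived
    # from single-nucleus RNA-seq of the same tissue).
    signatures = rng.gamma(2.0, 1.0, size=(n_genes, n_celltypes))

    true_props = np.array([0.5, 0.3, 0.15, 0.05])
    bulk = signatures @ true_props + rng.normal(0, 0.1, n_genes)  # simulated bulk

    coef, _ = nnls(signatures, bulk)
    proportions = coef / coef.sum()  # normalise to relative cell-type abundances
    print(proportions.round(3))      # approx. [0.5, 0.3, 0.15, 0.05]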

    Robust Algorithms for Clustering with Applications to Data Integration

    A growing number of data-based applications are used for decision-making with far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications, and clustering is the fundamental concept underlying all of them. Therefore, the importance of accurate, robust and scalable clustering methods cannot be overstated. We tackle the various facets of clustering with the multi-pronged approach described below.
    1. While identifying clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage supervision through an oracle, i.e. an abstraction of crowdsourcing (see the sketch after this list). Additionally, we focus on scalability to handle web-scale datasets.
    2. In community detection applications, a common setback in evaluating the quality of clustering techniques is the lack of ground-truth data. We propose a generative model that considers dependent edge formation and devise techniques for efficient cluster recovery.
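
    The oracle abstraction in item 1 can be sketched as pairwise "same entity?" queries answered by a crowd: each new record is compared against one representative per existing cluster, and a new cluster is opened if every answer is "no". This illustrates only the query model, not the thesis algorithms, and the toy oracle below is a hypothetical string-matching stand-in for crowd answers.

    # Minimal sketch of clustering with a pairwise oracle.
    from typing import Callable, List

    def oracle_cluster(records: List[str],
                       same: Callable[[str, str], bool]) -> List[List[str]]:
        clusters: List[List[str]] = []
        for r in records:
            for cluster in clusters:
                if same(cluster[0], r):  # one oracle query per candidate cluster
                    cluster.append(r)
                    break
            else:
                clusters.append([r])     # no match anywhere: start a new cluster
        return clusters

    # Toy usage: the "oracle" here is exact match on a normalised name.
    records = ["IBM", "I.B.M.", "Apple Inc.", "Apple"]
    same = lambda a, b: (a.replace(".", "").split()[0].lower()
                         == b.replace(".", "").split()[0].lower())
    print(oracle_cluster(records, same))  # [['IBM', 'I.B.M.'], ['Apple Inc.', 'Apple']]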

    Improving and understanding data quality in large-scale data systems

    Systems and applications rely heavily on data, which makes data quality a critical factor in their function. In turn, low-quality data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. Improving data quality involves far more than purging datasets of errors; it is more important to improve the processes that produce the data, to collect good data sources for generating the data, and to truly understand the quality of the data. The objective of this thesis is therefore to improve and understand data quality in these respects. First, we develop two efficient and effective tools, DataXRay and QFix, that diagnose systematic errors in general data extraction systems and relational data systems, respectively. Second, we design a recommendation system, Midas, that focuses on identifying high-quality data sources for augmenting knowledge bases. Third, we implement an explanation system, Explain3D, which explains the disagreements between disjoint datasets.
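
    The notion of a *systematic* error, as opposed to a random one, can be illustrated with a generic sketch: group erroneous records by shared provenance features and flag feature values whose error rate is far above the base rate. This is only an illustration of the idea, not the actual DataXRay or QFix algorithm; the records, features, and threshold below are hypothetical.

    # Minimal, generic sketch of flagging systematic errors by provenance feature.
    from collections import Counter

    records = [
        {"source": "extractor_A", "page": "infobox", "is_error": True},
        {"source": "extractor_A", "page": "text",    "is_error": True},
        {"source": "extractor_B", "page": "infobox", "is_error": False},
        {"source": "extractor_B", "page": "text",    "is_error": False},
        {"source": "extractor_B", "page": "infobox", "is_error": False},
    ]

    base_rate = sum(r["is_error"] for r in records) / len(records)

    for feature in ("source", "page"):
        totals, errors = Counter(), Counter()
        for r in records:
            totals[r[feature]] += 1
            errors[r[feature]] += r["is_error"]
        for value, n in totals.items():
            rate = errors[value] / n
            if rate >= 1.5 * base_rate:  # crude threshold for "systematic"
                print(f"{feature}={value}: error rate {rate:.0%} vs base {base_rate:.0%}")
    # Flags source=extractor_A, the common cause of both errors.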

    Epistemological Databases for Probabilistic Knowledge Base Construction

    Knowledge bases (KBs) facilitate real-world decision making by providing access to structured relational information that enables pattern discovery and semantic queries. Although there is a large amount of data available for populating a KB, the data must first be gathered and assembled. Traditionally, this integration is performed automatically by storing the output of an information extraction pipeline directly in a database, as if this prediction were the "truth." However, the resulting KB is often not reliable because (a) errors accumulate in the integration pipeline, and (b) they persist in the KB even after new information arrives that could rectify them. We envision a paradigm shift in KB construction that addresses these concerns, which we term an "epistemological" database. In an epistemological database, the existence and properties of entities are not directly input into the DB; they are instead determined by inference on raw evidence input into the DB. This shift in thinking is important because it allows inference to revisit previous conclusions and retroactively correct errors as new evidence arrives. Evidence is abundant and in steady supply from web spiders, semantic web ontologies, external databases, and even groups of enthusiastic human editors. As this evidence continues to accumulate and inference continues to run in the background, the quality of the knowledge base continues to improve. In this dissertation we develop the machine learning components necessary to achieve epistemological knowledge base construction at scale, with key contributions in modeling, inference and learning.
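
    The core idea (store raw evidence, infer entity existence, and let new evidence revise old conclusions) can be sketched in a few lines. The class, its support-count "inference", and the threshold below are hypothetical simplifications of the probabilistic inference the dissertation develops.

    # Minimal sketch: entity existence is inferred from stored evidence,
    # so conclusions are retroactively corrected as evidence arrives.
    from collections import defaultdict

    class EpistemologicalKB:
        def __init__(self, min_support: int = 2):
            self.evidence = defaultdict(list)  # raw evidence is what the DB stores
            self.min_support = min_support

        def add_evidence(self, entity: str, source: str, claim: str) -> None:
            self.evidence[entity].append((source, claim))

        def entities(self):
            # An entity "exists" only if inference over raw evidence supports it.
            return {e for e, ev in self.evidence.items()
                    if len(ev) >= self.min_support}

    kb = EpistemologicalKB()
    kb.add_evidence("Jane_Doe", "web_spider", "works_at UMass")
    print(kb.entities())  # set(): one mention is not enough support
    kb.add_evidence("Jane_Doe", "ontology", "affiliation UMass")
    print(kb.entities())  # {'Jane_Doe'}: new evidence revises the conclusion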

    Data Preparation for Social Network Mining and Analysis


    Computational methods for single cell RNA and genome assembly resolution using genetic variation

    Genetic variation and natural selection have driven evolutionary history on this planet and are responsible for creating us and all other life as we know it. Over the past several decades, the genomic revolution has allowed us to assess population variation across humans and other species, and to use that variation to link genotypes with phenotypes and infer evolutionary histories. In this thesis, I explore computational methods for using genetic variation to demultiplex and disambiguate complex data. In single-cell RNA-seq, batch effects, doublets, and ambient RNA are each sources of noise that impede our ability to infer the functional states of cells and compare them between experiments. One popular new experimental design that promises to solve each of these problems while also reducing experimental costs is mixing multiple individuals' cells into a single experiment. In chapter 2, I present a method for clustering cells by genotype, calling doublets, and using the cross-genotype signal in singletons to estimate and remove ambient RNA. I compare this method to other existing methods, including one that requires a priori information about the genotypes and two that do not, and find that my method outperforms each of them across a wide range of data parameters and sample types. In genome assembly, the recent higher throughput and lower cost of long-read sequencing has revolutionized our ability to create reference-quality genomes and has revitalized the assembly community. Massive efforts are now taking place in the Darwin Tree of Life project and the Earth BioGenome Project to create reference genomes for all multicellular eukaryotic life. This will create a scientific resource for the next generation of biological science, will serve to conserve data that could otherwise be lost in this time of mass extinction, and will allow for a much broader understanding of evolution and the evolutionary history of life on Earth. While much progress has been made in data quality and assembly algorithms, some problems remain. Until recently, the DNA input requirements of long-read sequencing technologies made it impossible to sequence single individuals of many species with long reads. In addition, high heterozygosity makes assembly more difficult because of the inherent ambiguity between heterozygous and paralogous sequence when confronted with inexact homology. One solution to the DNA input requirements would be to pool individuals, but this only increases the heterozygosity of the sample and reduces assembly quality. In chapter 3, we present the first high-quality assembly of a single mosquito, using new library preparation methods with reduced DNA requirements; this reduces the number of haplotypes to two, improving assembly quality. In chapter 4, we further address the problems that heterozygosity brings to assembly. I present a suite of tools that uses the phasing consistency of multiple heterozygous sequences as a signal for physical linkage, turning genetic variation to our advantage rather than treating it as a challenge to overcome. The suite creates phased, linked assemblies and performs phasing-aware scaffolding, and I also provide a tool for phasing-aware scaffolding of existing assemblies. It includes a novel haplotype phasing algorithm with some uniquely beneficial properties: it is robust to non-heterozygous variants as input, can detect and correct those genotypes, and extends naturally to polyploid genomes. Wellcome Trust
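
    The genotype-based demultiplexing idea in chapter 2 can be sketched generically: cells from different donors separate cleanly when clustered on their alternate-allele fractions at common SNPs. The simulation and plain k-means below only illustrate this separation; the thesis method additionally calls doublets and estimates ambient RNA, which this toy clustering does not.

    # Minimal sketch: cluster cells by genotype using alt-allele fractions at SNPs.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    n_individuals, n_cells, n_snps = 3, 300, 50

    # Hypothetical donor genotypes: alt-allele dosage 0, 1, or 2 per SNP.
    genotypes = rng.integers(0, 3, size=(n_individuals, n_snps))
    donor_of_cell = rng.integers(0, n_individuals, size=n_cells)

    # Observed per-cell alt-allele fraction = dosage/2 plus measurement noise.
    alt_frac = genotypes[donor_of_cell] / 2 + rng.normal(0, 0.05, (n_cells, n_snps))

    labels = KMeans(n_clusters=n_individuals, n_init=10,
                    random_state=0).fit_predict(alt_frac)

    # Sanity check: each cluster should map to a single donor.
    for k in range(n_individuals):
        donors, counts = np.unique(donor_of_cell[labels == k], return_counts=True)
        print(f"cluster {k}: dominant donor {donors[counts.argmax()]}")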