
    Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective

    Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost-efficient healthcare. However, they also prompt a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged and produced a large number of publications and initiatives. In this paper, we rely on a structured methodology to contextualize and provide a critical analysis of the current knowledge on privacy-enhancing technologies used for testing, storing, and sharing genomic data, using a representative sample of the work published in the past decade. We identify and discuss limitations, technical challenges, and issues faced by the community, focusing in particular on those that are inherently tied to the nature of the problem and are harder for the community alone to address. Finally, we report on the importance and difficulty of the identified challenges based on an online survey of genome data privacy experts.
    Comment: To appear in the Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2019.

    Paraiso : An Automated Tuning Framework for Explicit Solvers of Partial Differential Equations

    We propose Paraiso, a domain-specific language embedded in the functional programming language Haskell, for automated tuning of explicit solvers of partial differential equations (PDEs) on GPUs as well as multicore CPUs. In Paraiso, one can describe PDE-solving algorithms succinctly using tensor equation notation. Hydrodynamic properties, interpolation methods, and other building blocks are described in abstract, modular, reusable, and combinable forms, which lets us generate versatile solvers from a small set of Paraiso source code. We demonstrate Paraiso by implementing a compressible hydrodynamics solver. A single source code of less than 500 lines can be used to generate solvers of arbitrary dimensions, for both multicore CPUs and GPUs. We demonstrate both manual annotation-based tuning and evolutionary-computing-based automated tuning of the program.
    Comment: 52 pages, 14 figures, accepted for publication in Computational Science and Discovery.
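The explicit solvers targeted here ultimately boil down to stencil updates. As a minimal illustration (plain NumPy, not Paraiso's Haskell DSL or its generated code), here is a forward-Euler step for the 1D heat equation:

```python
import numpy as np

# Illustrative sketch, not Paraiso-generated code: the kind of explicit
# finite-difference stencil update that solvers in this class reduce to.
# Forward-Euler solution of the 1D heat equation u_t = alpha * u_xx with
# fixed (Dirichlet) boundaries.

def heat_step(u, alpha, dx, dt):
    """Advance one explicit Euler step; boundary values stay fixed."""
    lap = (u[:-2] - 2.0 * u[1:-1] + u[2:]) / dx**2   # discrete Laplacian
    un = u.copy()
    un[1:-1] += dt * alpha * lap
    return un

def solve(n=64, steps=100, alpha=1.0):
    dx = 1.0 / (n - 1)
    dt = 0.4 * dx**2 / alpha       # obey the explicit-scheme stability limit
    u = np.zeros(n)
    u[n // 2] = 1.0                # initial heat spike in the middle
    for _ in range(steps):
        u = heat_step(u, alpha, dx, dt)
    return u
```

Paraiso's contribution is to generate and auto-tune many such kernels (for CPUs and GPUs, in arbitrary dimensions) from a single tensor-equation description rather than hand-writing each one.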

    Facilitated Variation: How Evolution Learns from Past Environments To Generalize to New Environments

    One of the striking features of evolution is the appearance of novel structures in organisms. Recently, Kirschner and Gerhart integrated discoveries in evolution, genetics, and developmental biology into a theory of facilitated variation (FV). The key observation is that organisms are designed such that random genetic changes are channeled in phenotypic directions that are potentially useful. An open question is how FV spontaneously emerges during evolution. Here, we address this question by means of computer simulations of two well-studied model systems, logic circuits and RNA secondary structure. We find that the evolution of FV is enhanced in environments that change from time to time in a systematic way: the varying environments are made of the same set of subgoals but in different combinations. We find that organisms that evolve under such varying goals not only remember their history but also generalize to future environments, exhibiting high adaptability to novel goals. Rapid adaptation is seen to goals composed of the same subgoals in novel combinations, and to goals where one of the subgoals was never seen in the history of the organism. The mechanisms for such enhanced generation of novelty (generalization) are analyzed, as is the way that organisms store information in their genomes about their past environments. Elements of facilitated variation theory, such as weak regulatory linkage, modularity, and reduced pleiotropy of mutations, evolve spontaneously under these conditions. Thus, environments that change in a systematic, modular fashion seem to promote facilitated variation and allow evolution to generalize to novel conditions.
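A toy version of "modularly varying goals" can make the setup concrete. The sketch below is hypothetical and far simpler than the paper's logic-circuit simulations: two fixed subgoals (XOR modules) are combined in different ways to form alternating goals, and a bit-vector "genome" (a truth table) hill-climbs toward whichever goal is current:

```python
import random

# Minimal sketch (not the paper's simulator) of modularly varying goals:
# each goal combines the same fixed subgoals (XORs) differently, and a
# 16-entry truth table over 4 inputs is hill-climbed toward the current goal.

INPUTS = [(x, y, z, w) for x in (0, 1) for y in (0, 1)
          for z in (0, 1) for w in (0, 1)]

def goal_A(x, y, z, w):          # (x XOR y) AND (z XOR w)
    return (x ^ y) & (z ^ w)

def goal_B(x, y, z, w):          # (x XOR y) OR (z XOR w): same subgoals,
    return (x ^ y) | (z ^ w)     # different combining function

def fitness(genome, goal):
    """Fraction of truth-table rows where the genome matches the goal."""
    return sum(g == goal(*row) for g, row in zip(genome, INPUTS)) / len(INPUTS)

def evolve(epochs=40, steps=200, seed=0):
    rng = random.Random(seed)
    genome = [rng.randint(0, 1) for _ in INPUTS]
    for epoch in range(epochs):
        goal = goal_A if epoch % 2 == 0 else goal_B   # goal switches periodically
        for _ in range(steps):
            i = rng.randrange(len(genome))
            mutant = genome[:]
            mutant[i] ^= 1                            # single bit-flip mutation
            if fitness(mutant, goal) >= fitness(genome, goal):
                genome = mutant
    return genome
```

In this flat representation each truth-table bit is independent, so there is no scope for the modularity the paper studies; the interesting effects arise when the genome encodes a circuit whose mutations have correlated phenotypic consequences.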

    Cancer Genetics Research in the Era of New Sequencing Methods

    The research in cancer genetics aims to detect genetic causes for the excessive growth of cells, which may subsequently form a tumor and further develop into cancer. The Human Genome Project succeeded in mapping the majority of the human DNA sequence, which enabled modern sequencing technologies to emerge, namely next-generation sequencing (NGS). The new era of disease genetics research shifted DNA analyses from the laboratory to computer screens. Since then, the massive growth of sequencing data has been facilitating the detection of novel disease-causing mutations and thus improving the screening and medical treatment of cancer. However, the exponential growth of sequencing data brought new challenges for computing. The sheer size of the data is not only expensive to store and maintain, but also highly demanding to process and analyze. Moreover, not only has the amount of sequencing data increased, but new kinds of functional genomics data, which are instrumental in figuring out the consequences of detected mutations, have also emerged. To this end, continuous software development has become essential to enable the utilization of all produced research data, new and old. This thesis describes software for the analysis and visualization of NGS data (publication I) that allows the integration of genomic data from various sources. The software, BasePlayer, was designed to meet the need for efficient and user-friendly methods to analyze and visualize massive variant datasets and various other types of genomic data. To this end, we developed a multi-purpose tool for the analysis of genomic data, such as DNA, RNA, ChIP-seq, and DNase-seq data. The capabilities of BasePlayer in the detection of putatively causative variants and in data visualization have already been used in over twenty scientific publications. The applicability of the software is demonstrated in this thesis with two distinct analysis cases, publications II and III.
    The second study considered somatic mutations in colorectal cancer (CRC) genomes. By analyzing whole-genome sequencing (WGS) data with BasePlayer, we identified distinct mutation patterns at CTCF/cohesin binding sites (CBSs). The sites were frequently mutated in CRC, especially in samples with a specific mutational signature, although the source of the mutation accumulation remained unclear. In contrast, a subset of samples with an ultra-mutator phenotype, caused by a defective polymerase epsilon (POLE) gene, exhibited an inverse pattern at CBSs. We detected the same signal in other, predominantly gastrointestinal, cancers as well. However, we were not able to measure changes in gene expression at mutated sites, so the role of the CBS mutations in tumorigenesis remains to be elucidated. The third study considered esophageal squamous cell carcinoma (ESCC), and the objective was to detect predisposing mutations using the Finnish Cancer Registry (FCR) data. We performed a clustering analysis of the FCR data, with additional information obtained from the Population Information System of Finland. We detected an enrichment of ESCC in the Karelia region and were able to collect and sequence 30 formalin-fixed paraffin-embedded (FFPE) samples from the region. We reported several candidate genes, of which EP300 and DNAH9 were considered the most interesting. The study not only reported putative genes predisposing to ESCC but also served as a proof of concept for the feasibility of combining clustering of the FCR data with FFPE exome sequencing in such studies.

    Efficient, Dependable Storage of Human Genome Sequencing Data

    The understanding of the human genome impacts several areas of human life. Data from human genomes are massive because there are millions of samples to be sequenced, and each sequenced human genome may occupy hundreds of gigabytes. Human genomes are critical because they are extremely valuable to research and may provide hints about individuals' health status, identify their donors, or reveal information about donors' relatives. Their size and criticality, plus the amount of data being produced by medical and life-sciences institutions, require systems to scale while being secure, dependable, auditable, and affordable. Current storage infrastructures are too expensive for cost efficiency in storing human genomes to be ignored, and they lack the proper knowledge and mechanisms to protect the privacy of sample donors. This thesis proposes an efficient storage system for human genomes that medical and life-sciences institutions may trust and afford. It enhances traditional storage ecosystems with privacy-aware, data-reduction, and auditability techniques to enable the efficient, dependable use of multi-tenant infrastructures to store human genomes. Contributions from this thesis include (1) a study on the privacy-sensitivity of human genomes; (2) a method to systematically detect genomes' privacy-sensitive portions; (3) specialised data-reduction algorithms for sequencing data; (4) an independent auditability scheme for secure dispersed storage; and (5) a complete storage pipeline that obtains reasonable privacy protection, security, and dependability guarantees at modest costs (e.g., less than 1/Genome/Year) by integrating the proposed mechanisms with appropriate storage configurations.
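The dispersed-storage-with-auditability idea can be sketched in miniature. This is a toy, not the thesis's design: it uses XOR-based secret sharing (all shares required, no single provider learns anything) plus per-share digests an independent auditor could verify; a real system would add erasure coding for fault tolerance:

```python
import hashlib
import os

# Toy sketch, not the thesis's actual pipeline: disperse a sensitive blob
# across n storage providers via XOR secret sharing -- all n shares are
# needed to reconstruct, so any n-1 providers see only random bytes --
# and publish per-share SHA-256 digests as a simple audit trail.

def disperse(data: bytes, n: int = 3):
    shares = [os.urandom(len(data)) for _ in range(n - 1)]   # random pads
    last = bytearray(data)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b                 # final share = data XOR all pads
    shares.append(bytes(last))
    digests = [hashlib.sha256(s).hexdigest() for s in shares]  # audit trail
    return shares, digests

def reconstruct(shares):
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            out[i] ^= b                  # XOR of all shares recovers the data
    return bytes(out)
```

Note the trade-off this sketch ignores: plain XOR sharing has zero redundancy (losing any one share loses the data), which is why practical dispersed storage combines secret sharing with erasure codes or threshold schemes.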

    Error threshold in optimal coding, numerical criteria and classes of universalities for complexity

    The free energy of the Random Energy Model at the transition point between the ferromagnetic and spin-glass phases is calculated. At this point, equivalent to the decoding error threshold in optimal codes, the free energy has finite-size corrections proportional to the square root of the number of degrees of freedom. The response of the magnetization to the ferromagnetic couplings is maximal at the value of magnetization equal to one half. We give several criteria of complexity and define different universality classes. According to our classification, in the lowest class of complexity are random graphs, Markov models, and hidden Markov models. At the next level is the Sherrington-Kirkpatrick spin glass, connected with neural-network models. At a higher level are critical theories, the spin-glass phase of the Random Energy Model, percolation, and self-organized criticality (SOC). The top-level class involves HOT design, the error threshold in optimal coding, language, and perhaps financial markets. Living systems are also related to this last class. A concept of anti-resonance is suggested for complex systems.
    Comment: 17 pages.
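For context, the textbook free energy per spin of the Random Energy Model (with $2^N$ energy levels drawn independently from a Gaussian of variance $NJ^2/2$) is stated below; the behavior at the freezing point $\beta_c$ is what the finite-size corrections above refine. This is standard background, not the paper's result:

```latex
% Textbook REM free energy per spin; continuous at the freezing
% transition beta_c, where the entropy vanishes.
\[
  \beta_c = \frac{2\sqrt{\ln 2}}{J}, \qquad
  f(\beta) =
  \begin{cases}
    -\dfrac{\ln 2}{\beta} - \dfrac{\beta J^2}{4}, & \beta \le \beta_c
      \quad \text{(paramagnetic phase)}\\[1.5ex]
    -J\sqrt{\ln 2}, & \beta > \beta_c
      \quad \text{(frozen spin-glass phase)}
  \end{cases}
\]
```

One can check that the two branches agree at $\beta_c$: substituting $\beta_c = 2\sqrt{\ln 2}/J$ into the high-temperature branch gives $-\tfrac{J}{2}\sqrt{\ln 2} - \tfrac{J}{2}\sqrt{\ln 2} = -J\sqrt{\ln 2}$.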