31 research outputs found

    Fast approximate max-correlation queries

    The detection of the most correlated items in large high-dimensional datasets is a very important problem for a variety of real-world applications, and it is becoming ever more relevant as the volume of data keeps growing.
    To our knowledge, it is currently solved by computing all pairwise correlations in the dataset, which takes an impractically large amount of time. In this thesis we propose a faster solution. We demonstrate that the most correlated pair can be found by first standardizing all vectors in the dataset and then finding the pair with the smallest Euclidean distance. Next, we examine how to realize this idea with nearest neighbor indexing. In particular, we implemented three state-of-the-art methods: coordinate-wise search (exact), and the KD tree and RP tree data structures (approximate). All these algorithms start by building an index over the points in a given dataset, which later makes it possible to efficiently find the nearest neighbors of a query point. In our work we focused mostly on the two approximate methods. We ran two different types of tests on simulated data in order to measure the running time and quality of the proposed solution. To evaluate running time, we compared the performance of all three methods with that of the baseline approach. Both hierarchical data structures showed linear time complexity in all tests. Although coordinate-wise search has quadratic time complexity, it still substantially outperforms the brute-force method. In terms of quality, the tests show that accuracy degrades with the size of the input set for both approximate methods, but stays sufficiently high to be useful for most real-world problems. To demonstrate this, we tested our solution on a dataset containing methylation values of different genes in different individuals. The results show that our approximate methods are capable of detecting pairs of genes with highly correlated expression that belong to distant regions, which was not possible using existing bioinformatics tools.
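    The reduction described in the abstract can be checked on a small example. The sketch below is not the thesis implementation: it assumes "standardizing" means centering each vector to zero mean and scaling it to unit Euclidean norm, and the helper names (`standardize`, `pearson`, `sq_dist`) are made up for illustration. Under that assumption the squared Euclidean distance between two standardized vectors equals 2 - 2r, where r is their Pearson correlation, so the closest pair is also the most correlated pair. The search here is still brute force; it only illustrates the equivalence, not the KD tree or RP tree indexing.

```python
import math
import random
from itertools import combinations


def standardize(v):
    """Center a vector to zero mean and scale it to unit Euclidean norm."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered]


def pearson(a, b):
    """Pearson correlation computed directly from its definition."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Simulated data (an assumption for the demo): 30 random vectors of length 20.
rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(30)]
z = [standardize(v) for v in data]

pairs = list(combinations(range(len(data)), 2))

# Most correlated pair, found by scanning all pairwise correlations.
best_by_corr = max(pairs, key=lambda p: pearson(data[p[0]], data[p[1]]))

# Closest pair after standardization, found by scanning all distances.
best_by_dist = min(pairs, key=lambda p: sq_dist(z[p[0]], z[p[1]]))

print(best_by_corr, best_by_dist)
```

    Because the two criteria select the same pair, any nearest neighbor index built over the standardized vectors can be used in place of the exhaustive scan.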

    Response to comment on 'AIRE-deficient patients harbor unique high-affinity disease-ameliorating autoantibodies'

    In 2016, we reported four substantial observations of APECED/APS1 patients, who are deficient in AIRE, a major regulator of central T cell tolerance (Meyer et al., 2016). Two of those observations have been challenged. First, 'private' autoantibody reactivities shared by only a few patients but collectively targeting >1000 autoantigens have been attributed to false positives (Landegren, 2019). While we acknowledge this risk, our study design included follow-up validation, permitting us to adopt statistical approaches that also limit false negatives. Importantly, many such private specificities have now been validated by multiple, independent means, including the autoantibodies' molecular cloning and expression. Second, a significant correlation of antibody-mediated IFN-α neutralization with an absence of disease in patients highly disposed to type I diabetes has been challenged because of a claimed failure to replicate our findings (Landegren, 2019). However, flaws in design and implementation invalidate this challenge. Thus, our results present robust, insightful, independently validated depictions of APECED/APS1 that have spawned productive follow-up studies.
    Non peer reviewed

    DOME: recommendations for supervised machine learning validation in biology

    Supervised machine learning is widely used in biology and deserves more scrutiny. We present a set of community-wide recommendations (DOME) aiming to help establish standards of supervised machine learning validation in biology. Formulated as questions, the DOME recommendations improve the assessment and reproducibility of papers when included as supplementary material.
    The work of the Machine Learning Focus Group was funded by ELIXIR, the research infrastructure for life-science data. I.W. was funded by the A*STAR Career Development Award (project no. C210112057) from the Agency for Science, Technology and Research (A*STAR), Singapore. D.F. was supported by Estonian Research Council grants (PRG1095, PSG59 and ERA-NET TRANSCAN-2 (BioEndoCar)); Project No 2014-2020.4.01.16-0271, ELIXIR and the European Regional Development Fund through EXCITE Center of Excellence. S.C.E.T. has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant agreements No. 778247 and No. 823886, and Italian Ministry of University and Research PRIN 2017 grant 2017483NH8.
    Peer Reviewed
    "Article signed by 8 authors plus 28 authors from the ELIXIR Machine Learning Focus Group: Emidio Capriotti, Rita Casadio, Salvador Capella-Gutierrez, Davide Cirillo, Alessio Del Conte, Alexandros C. Dimopoulos, Victoria Dominguez Del Angel, Joaquin Dopazo, Piero Fariselli, José Maria Fernández, Florian Huber, Anna Kreshuk, Tom Lenaerts, Pier Luigi Martelli, Arcadi Navarro, Pilib Ó Broin, Janet Piñero, Damiano Piovesan, Martin Reczko, Francesco Ronzano, Venkata Satagopam, Castrense Savojardo, Vojtech Spiwok, Marco Antonio Tangaro, Giacomo Tartari, David Salgado, Alfonso Valencia & Federico Zambelli"
    Postprint (author's final draft)

    Autoantibody Repertoire in APECED Patients Targets Two Distinct Subgroups of Proteins

    High-titer autoantibodies produced by B lymphocytes are clinically important features of many common autoimmune diseases. APECED patients with a deficient autoimmune regulator (AIRE) gene collectively display a broad repertoire of high-titer autoantibodies, including some that are pathognomonic for major autoimmune diseases. AIRE deficiency severely reduces thymic expression of gene products ordinarily restricted to discrete peripheral tissues, and developing T cells reactive to those gene products are not inactivated during their development. However, the extent of the autoantibody repertoire in APECED and its relation to thymic expression of self-antigens are unclear. Here we used a broad protein array approach to assess the autoantibody repertoire in APECED patients. Our results show that, in addition to shared autoantigen reactivities, APECED patients display high inter-individual variation in their autoantigen profiles, which collectively are enriched in evolutionarily conserved, cytosolic and nuclear phosphoproteins. The APECED autoantigens have two major origins: proteins expressed in thymic medullary epithelial cells and proteins expressed in lymphoid cells. These findings support the hypothesis that specific protein properties strongly contribute to the etiology of B cell autoimmunity.
    Peer reviewed

    Endometrial cancer diagnostic and prognostic algorithms based on proteomics, metabolomics, and clinical data: a systematic review

    Endometrial cancer is the most common gynaecological malignancy in developed countries. Over 382,000 new cases were diagnosed worldwide in 2018, and its incidence and mortality are constantly rising due to longer life expectancy and lifestyle factors including obesity. Two major improvements are needed in the management of patients with endometrial cancer, namely the development of non-invasive or minimally invasive tools for diagnostics and for prognostics, both of which are currently missing. Diagnostic tools are needed to manage the increasing number of women at risk of developing the disease. Prognostic tools are necessary to stratify patients preoperatively according to their risk of recurrence, to advise on and plan the most appropriate treatment, and to avoid over- or under-treatment. Biomarkers derived from proteomics and metabolomics, especially when obtained from body fluids collected non-invasively or minimally invasively, can serve to develop such prognostic and diagnostic tools, and the purpose of the present review is to explore the current research on this topic. We first provide a brief description of the technologies and the computational pipelines for data analysis, and then a systematic review of all published studies using proteomics and/or metabolomics for diagnostic and prognostic biomarker discovery in endometrial cancer. Finally, conclusions and recommendations for future studies are given.

    AIRE-Deficient Patients Harbor Unique High-Affinity Disease-Ameliorating Autoantibodies

    APS1/APECED patients are defined by defects in the autoimmune regulator (AIRE) that mediates central T cell tolerance to many self-antigens. AIRE deficiency also affects B cell tolerance, but this is incompletely understood. Here we show that most APS1/APECED patients displayed B cell autoreactivity toward unique sets of approximately 100 self-proteins. Thereby, autoantibodies from 81 patients collectively detected many thousands of human proteins. The loss of B cell tolerance seemingly occurred during antibody affinity maturation, an obligatorily T cell-dependent step. Consistent with this, many APS1/APECED patients harbored extremely high-affinity, neutralizing autoantibodies, particularly against specific cytokines. Such antibodies were biologically active in vitro and in vivo, and those neutralizing type I interferons (IFNs) showed a striking inverse correlation with type I diabetes, not shown by other anti-cytokine antibodies. Thus, naturally occurring human autoantibodies may actively limit disease and be of therapeutic utility.
    Peer reviewed

    Andmeanalüüsi töövoo loomine valkude automaatseks kirjeldamiseks immunoloogias (Creating a data analysis workflow for the automatic description of proteins in immunology)

    The electronic version of the dissertation does not include the publications.
    Proteins are some of the most fundamental building blocks of life. These tiny molecules are responsible for almost all activities carried out in the organism. Different proteins are involved in different tasks, ranging from initiating massive immune responses against infection to providing daily cell maintenance. Such complex functions cannot be performed by a single protein; they require many protein molecules working together in a precise and coordinated way. But not all proteins are equally useful: while the presence of some proteins is essential for an individual's well-being, an abundance of others can be life-threatening. Hence, accurate information about the number and type of proteins active in the organism at any given moment is instrumental for understanding human biology and disease mechanisms. Protein microarray is a technology that enables us to obtain accurate estimates of the concentration levels of thousands of proteins in human blood in parallel. However, analysing data from protein microarrays can be challenging due to a lack of simple-to-use, automated tools. In a series of studies involving protein microarrays, we have explored and implemented various data science methods for the all-around analysis of protein concentration data. Such methods have helped us to identify and characterise proteins targeted by the autoimmune reaction in patients with the APS1 condition. The keystone of this work is the web tool PAWER. PAWER implements relevant computational methods and provides a semi-automatic way to analyse protein microarray data online in a drag-and-drop and click-and-play style.
    The work that laid the foundation of this thesis has been instrumental for a number of subsequent studies of human disease and has also inspired a contribution to refining standards for the validation of machine learning methods in biology.
    https://www.ester.ee/record=b543379

    Pugs vs Loafs

    No full text
    A dataset of images of pugs and loafs.

    Surveys

    No full text
    A curated dataset used by the Data Carpentry organisation to teach R to scientists.