151 research outputs found

    Method versatility in analysing human attitudes towards technology

    Various research domains are facing new challenges brought about by growing volumes of data. To make optimal use of them, and to increase the reproducibility of research findings, method versatility is required. Method versatility is the ability to flexibly apply widely varying data analytic methods depending on the study goal and the dataset characteristics. Method versatility is an essential characteristic of data science, but in other areas of research, such as educational science or psychology, its importance is yet to be fully accepted. Versatile methods can enrich the repertoire of specialists who validate psychometric instruments, conduct data analysis of large-scale educational surveys, and communicate their findings to the academic community, which corresponds to three stages of the research cycle: measurement, research per se, and communication. In this thesis, studies related to these stages have a common theme of human attitudes towards technology, as this topic becomes vitally important in our age of ever-increasing digitization. The thesis is based on four studies, in which method versatility is introduced in four different ways: the consecutive use of methods, the toolbox choice, the simultaneous use, and the range extension. In the first study, different methods of psychometric analysis are used consecutively to reassess psychometric properties of a recently developed scale measuring affinity for technology interaction. In the second, the random forest algorithm and hierarchical linear modeling, as tools from machine learning and statistical toolboxes, are applied to data analysis of a large-scale educational survey related to students’ attitudes to information and communication technology. 
In the third, the challenge of selecting the number of clusters in model-based clustering is addressed by the simultaneous use of model fit, cluster separation, and partition stability criteria, so that generalizable, well-separated clusters can be selected in the data related to teachers’ attitudes towards technology. The fourth reports the development and evaluation of a scholarly knowledge graph-powered dashboard aimed at extending the range of scholarly communication means. The findings of the thesis can be helpful for increasing method versatility in various research areas. They can also facilitate methodological advancement of academic training in data analysis and aid further development of scholarly communication in accordance with open science principles.
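
The partition-stability criterion mentioned for the third study can be illustrated with a small sketch. It assumes that stability is measured as the mean adjusted Rand index (ARI) between partitions obtained from repeated runs or resamples. The abstract does not name the measure, so the ARI and the function names below are illustrative choices, not the thesis's method.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Contingency counts: items in cluster i of A and cluster j of B.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (sum_cells - expected) / (max_index - expected)


def partition_stability(partitions):
    """Mean pairwise ARI over a list of partitions (hypothetical helper)."""
    pairs = list(combinations(partitions, 2))
    return sum(adjusted_rand_index(a, b) for a, b in pairs) / len(pairs)
```

Under this assumption, a candidate number of clusters whose partitions stay close to an ARI of 1 across resamples would count as stable. Model fit and cluster separation would still have to be checked separately.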

    Computational approaches for single-cell omics and multi-omics data

    Single-cell omics and multi-omics technologies have enabled the study of cellular heterogeneity with unprecedented resolution and the discovery of new cell types. Identifying heterogeneous cell types, both existing and novel ones, relies on efficient computational approaches, especially cluster analysis. Additionally, gene regulatory network analysis and various integrative approaches are needed to combine data across studies and different multi-omics layers. This thesis comprehensively compared Bayesian clustering models for single-cell RNA-sequencing (scRNA-seq) data, and selected integrative approaches were used to study the cell-type-specific gene regulation of the uterus. Additionally, single-cell multi-omics data integration approaches for cell heterogeneity analysis were investigated. Article I investigated analytical approaches for cluster analysis in scRNA-seq data, in particular latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) models. The comparison of LDA and HDP with the existing state-of-the-art methods revealed that topic modeling-based models can be useful in scRNA-seq cluster analysis. Evaluation of the cluster qualities for LDA and HDP with intrinsic and extrinsic cluster quality metrics indicated that the clustering performance of these methods is dataset dependent. Article II and Article III focused on cell-type-specific integrative analysis of uterine or decidual stromal (dS) and natural killer (dNK) cells, which are important for successful pregnancy. Article II integrated the existing preeclampsia RNA-seq studies of the decidua with recent scRNA-seq datasets in order to investigate cell-type-specific contributions of early onset preeclampsia (EOP) and late onset preeclampsia (LOP). It was discovered that the dS marker genes were enriched for LOP downregulated genes and the dNK marker genes were enriched for upregulated EOP genes. 
Article III presented a gene regulatory network analysis for the subpopulations of dS and dNK cells. This study identified novel subpopulation-specific transcription factors that promote decidualization of stromal cells and dNK-mediated maternal immunotolerance. In Article IV, different strategies and methodological frameworks for data integration in single-cell multi-omics data analysis were reviewed in detail. Data integration methods were grouped into early, late and intermediate data integration strategies. The specific stage and order of data integration can have a substantial effect on the results of the integrative analysis. The central details of the approaches were presented, and potential future directions were discussed.

    DivClust: Controlling Diversity in Deep Clustering

    Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings is necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework.
    Comment: Accepted for publication in CVPR 202
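
To make the role of the base clusterings concrete, the following sketch shows one simple consensus scheme: a co-association matrix followed by thresholded linking. It is an illustrative stand-in and not necessarily the off-the-shelf algorithm used in the paper. The function names and the 0.5 threshold are assumptions.

```python
def co_association(clusterings):
    """Fraction of base clusterings that put each pair of items together."""
    n = len(clusterings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(clusterings)
    return [[c / m for c in row] for row in counts]


def consensus_labels(clusterings, threshold=0.5):
    """Link items co-clustered in more than `threshold` of the base
    clusterings and return the connected components as consensus labels."""
    matrix = co_association(clusterings)
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over sufficiently strong links
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical base clusterings of six items that partly disagree.
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
consensus = consensus_labels(base)
```

Diversity matters in such a scheme because base clusterings that make different mistakes tend to be outvoted on the pairs where they disagree, while identical base clusterings add no new information.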

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Onset of an outline map to get a hold on the wildwood of clustering methods

    The domain of cluster analysis is a meeting point for a very rich multidisciplinary encounter, with cluster-analytic methods being studied and developed in discrete mathematics, numerical analysis, statistics, data analysis and data science, and computer science (including machine learning, data mining, and knowledge discovery), to name but a few. The other side of the coin, however, is that the domain suffers from a major accessibility problem and is fragmented into many largely isolated islands. As a way out, the present paper offers an outline map for the clustering domain as a whole, which takes the form of an overarching conceptual framework and a common language. With this framework we wish to contribute to structuring the domain, to characterizing methods that have often been developed and studied in quite different contexts, to identifying links between them, and to introducing a frame of reference for optimally setting up cluster analyses in data-analytic practice.
    Comment: 33 pages, 4 figures

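    Determining genotype-phenotype associations in hypertrophic cardiomyopathy using machine learning (Utvrđivanje povezanosti genotipa i fenotipa hipertrofične kardiomiopatije primenom mašinskog učenja)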

    Hypertrophic cardiomyopathy (HCM) is the most prevalent heritable cardiomyopathy. HCM is diagnosed by the presence of left ventricular hypertrophy in the absence of abnormal loading conditions that could cause it. HCM is a heterogeneous disease with regard to genetic mutations. Clinical manifestations and prognosis vary widely as well. Some patients are completely asymptomatic, while in others severe heart failure and sudden cardiac death may arise. Definitive genotype-phenotype associations are still unknown. Machine learning (ML) is a subdiscipline of artificial intelligence in which computer algorithms are used for learning complex patterns from data. The aim of this research was to decipher genotype-phenotype associations in HCM using ML. The study was multi-centric and retroprospective, and involved 143 adult HCM patients. Medical and family history, anthropometric measurements, genetic testing, blood markers, transthoracic echocardiography with Doppler, cardiopulmonary exercise testing (CPET), ECG and ECG-holter-monitoring data were collected and further analysed. HCM subphenotypes were identified using clustering. Associations of genotype and phenotype were evaluated using the Python modules Scikit-learn and SHapley Additive exPlanation (SHAP). Genotype-specific echocardiogram findings were identified using the Python deep learning (DL) and computer vision library Fast AI, by generating DL models for classification of ultrasonic images and then analysing the most decisive image regions. 
Four HCM subtypes were identified based on the overall phenotypic appearance: cluster 0 (“AHOLD”), distinguishable by aortic root diameter (AO) and lactate dehydrogenase (LDH), with values mostly AO > 30 mm and LDH > 300 U/L; cluster 1 (“RVSP ASCAOVS”), distinguishable by right ventricle systolic pressure (RVSP), diameter of the ascending aorta (AscAO), and aortic leaflet separation diameter (AOvs), with values of RVSP […], AscAO […], and AOvs > 27 m/s; cluster 2 (“weight”), recognizable by weight, with values mostly > 95 kg; and cluster 3 (“AV LVOT PG”), distinguishable by aortic valve mean pressure gradient (AV meanPG), aortic valve peak pressure gradient (AV maxPG), and left ventricular outflow tract peak gradient (LVOT maxPG), with AV maxPG > 15 mmHg, AV meanPG > 6 mmHg, and LVOT maxPG > 15 mmHg. ML algorithms confirmed that the determination of genotype-phenotype associations in HCM is a cumbersome task. Two phenotypic outcomes that can be predicted from mutated genes are the absence or presence of sinus rhythm and the absence or presence of myocardial injury. Models predicting the absence or presence of sinus rhythm performed similarly whether they were built using only causative genes or using all analyzed genes, indicating the potential importance of causative genes and the irrelevance of non-causative genes for that outcome. On the other hand, models predicting myocardial injury (infarction) performed better when they were built using all analyzed genes (and not just causative ones), indicating a potentially significant role of non-causative genes in that outcome. 
The ML algorithms were able to predict the following phenotypic outcomes from genotypic and phenotypic data: fatigue, dyspnea, chest pain, palpitations, syncope, heart murmur, pretibial edema, systolic anterior motion, papillary muscle abnormalities, hypokinesia, atrial fibrillation (AF), first-degree atrioventricular (AV) block, left bundle branch block (LBBB), right bundle branch block (RBBB), left anterior hemiblock, ST segment abnormalities, and negative T wave. The combination of a mutation in TNNT2 and peak respiratory exchange ratio (RER) contributed the most to predicting fatigue. The combination of a mutation in MYBPC3 and peak VO2 contributed the most to predicting dyspnea. The combination of a mutation in TNNI3 and high-density lipoprotein (HDL) level contributed the most to predicting chest pain. The combination of a mutation in MYH7 and pacemaker/defibrillator implants in family history, as well as the combination of a mutation in TNNT2 and left atrial volume (LAV), contributed the most to predicting heart murmur. Lastly, the combination of a mutation in MYBPC3 and transmitral maximal pressure gradient (MV maxPG) contributed the most to predicting negative T wave. Genotype-specific echocardiogram findings were identified: for mutations in the MYH7 gene (vs. mutation not detected), the most discriminative structures are the left ventricular outflow tract, septum, anterior wall, apex, right ventricle, and mitral apparatus; for mutations in the TNNT2 gene (vs. mutation not detected), the most discriminative structures are the septum and right ventricle; while for mutations in the MYBPC3 gene (vs. mutation not detected), these are the septum, left ventricle, and left ventricle chamber. ML has thus been demonstrated to be useful in deciphering genotype-phenotype associations in HCM.
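
The statements about which feature combinations "contributed the most" rest on Shapley-value attributions. The sketch below computes exact Shapley values for a toy model by enumerating feature coalitions, with absent features set to a baseline value. It does not use the SHAP package, which relies on faster approximations. The model, its coefficients, and the feature meanings are hypothetical and are not taken from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_risk_score(x):
    """Hypothetical score: x[0] = mutation present (0/1), x[1] = a
    family-history flag, x[2] = a standardized clinical measurement."""
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[2]


contributions = shapley_values(toy_risk_score, [1, 1, 1], [0, 0, 0])
```

In this toy model the interaction term is split equally between the mutation indicator and the measurement, which is how a genotype feature and a phenotype feature can jointly appear as the top contributors to a prediction.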

    Signal classification at discrete frequencies using machine learning

    Incidents such as the 2018 shutdown of Gatwick Airport due to a small Unmanned Aerial System (UAS) airfield incursion have shown that we don’t have routine and consistent detection and classification methods in place to recognise unwanted signals in an airspace. Today, incidents of this nature are taking place around the world regularly. The first stage in mitigating a threat is to know whether a threat is present. This thesis focuses on the detection and classification of Global Navigation Satellite Systems (GNSS) jamming radio frequency (RF) signal types and small commercially available UAS RF signals using machine learning for early warning systems. RF signals can be computationally heavy and sometimes sensitive to collect. Because neural networks require a lot of information to train from scratch, the thesis explores the use of transfer learning from the object detection field to lessen this burden by using graphical representations of the signal in the frequency and time domain. The thesis shows that combining transfer learning with both supervised and unsupervised learning and graphical signal representations can provide high-accuracy detection and classification, down to the fidelity of whether a small UAS is flying or stationary. By treating the classification of RF signals as an image classification problem, this thesis has shown that transfer learning through CNN feature extraction reduces the need for large datasets while still providing high-accuracy results. CNN feature extraction and transfer learning were also shown to improve accuracy as a precursor to unsupervised learning, but at a cost of time, while raw images provided a good overall solution for timely clustering. Lastly, the thesis has shown that the implementation of machine learning models using a Raspberry Pi and software defined radio (SDR) provides a viable option for low-cost early warning systems.
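
A spectrogram is one common graphical representation of a signal in time and frequency. The sketch below builds one with a plain short-time Fourier transform, so that each row could serve as a row of pixels for an image classifier. The abstract does not specify the thesis's actual representations or parameters, so the frame length, hop size, and Hann window here are assumptions. A practical pipeline would use an FFT library rather than this direct DFT.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop):
    """Magnitude spectrogram: one row of frequency-bin magnitudes per frame."""
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            x * 0.5 * (1.0 - math.cos(2.0 * math.pi * k / (frame_len - 1)))
            for k, x in enumerate(frame)
        ]
        row = []
        for f in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(
                windowed[k] * cmath.exp(-2j * math.pi * f * k / frame_len)
                for k in range(frame_len)
            )
            row.append(abs(acc))
        rows.append(row)
    return rows


# Synthetic test tone sitting exactly on frequency bin 4 of a 32-sample frame.
tone = [math.sin(2.0 * math.pi * 4 * k / 32) for k in range(128)]
image = spectrogram(tone, frame_len=32, hop=16)
```

Each row of `image` corresponds to one time frame and each column to one frequency bin. A steady tone therefore appears as a single bright column, while a signal whose frequency changes over time would trace a different pattern across the rows.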