1,437 research outputs found

    Using Large-Scale Data to Find Relationships Between Genes (Suuremahuliste andmete kasutamine geenidevaheliste seoste leidmiseks)

    The electronic version of the thesis does not include the publications. Genes determine which RNA and protein molecules a living organism consists of. Identifying genes alone is not enough to understand how an organism functions, or when and how the various gene products are expressed and what they do. To understand the basic principles of how organisms function, and to be able to influence biological processes, we need to understand the relationships between genes and proteins. Modern high-throughput technology makes it possible to study different aspects of biological processes rapidly. This has led to steadily accelerating growth in the amount of data available, and to a need for more sophisticated methods for analysing raw data, combining different data sources, and visualising the results. Additionally, computational modelling is required to test whether our understanding of biological processes is supported by the available data. This thesis uses a variety of bioinformatics methods to demonstrate how different types of high-throughput data can be combined to identify relationships between genes. It also shows that integrating data of various types from different sources, and making them comparable, adds value to already published data. In the thesis, data published in the supplementary files of publications on embryonic stem cell regulation were collected and made publicly available through the Embryonic Stem Cell Database (ESCDb). The complementary data in the database allow researchers to find relationships between genes that could not be found by analysing one dataset at a time. One of the main findings of this study illustrates how applying computational modelling to data from the ESCDb made it possible to find a novel pluripotency regulator in embryonic stem cells, IL11. Additionally, integration of different data types led to the identification of alternative target-gene modules of the core pluripotency regulator OCT4. Similarly, combining DNA conservation data with regulatory motif analysis led to the identification of three new regulators of adipocyte differentiation. The thesis also covers VisHiC, a methodology for the automatic clustering and visualisation of functionally related gene sets, which helps to highlight relevant gene sets in large high-throughput datasets for further characterisation with other methods. Overall, the thesis demonstrates that integrating different high-throughput datasets makes it possible to establish gene-gene relationships that could not be found by looking at a single data type in isolation.

    Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    Bioinformatics has been an emerging area of research for the last three decades. Its ultimate aims are to store and manage biological data, and to develop and analyze computational tools that enhance our understanding of those data. The volume of data accumulated by sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To narrow the gap between newly sequenced proteins and proteins with known functions, many computational techniques based on classification and clustering algorithms have been proposed. Classifying protein sequences into existing superfamilies helps to predict the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because the feature encoding methods in use produce very large feature sets. In this work, a statistical metric-based feature selection technique is proposed to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
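The abstract does not name the statistical metric or the feature encoding it uses. As a rough illustration of the general idea only, the sketch below assumes amino-acid composition as the encoding and the Fisher score (between-class over within-class variance) as the ranking metric; both choices, the function names, and the toy sequences are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Encode a protein sequence as its 20 amino-acid frequencies."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def fisher_scores(X, y):
    """Score each feature by between-class variance over within-class variance."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        overall = sum(col) / len(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            mean_c = sum(vals) / len(vals)
            between += len(vals) * (mean_c - overall) ** 2
            within += sum((v - mean_c) ** 2 for v in vals)
        # Small constant avoids division by zero for constant features.
        scores.append(between / (within + 1e-12))
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: family "A" is lysine-rich, family "B" is tryptophan-rich.
sequences = ["MKKLKAKKGA", "MKKAKLLGKA", "MWWLWAWWGA", "MWAWWLLGWA"]
labels = ["A", "A", "B", "B"]

X = [composition(s) for s in sequences]
top = select_top_k(fisher_scores(X, labels), k=2)
selected = {AMINO_ACIDS[j] for j in top}
print(selected)
```

On this toy data the two retained features are the lysine and tryptophan frequencies, so a downstream classifier would see a 2-dimensional vector instead of a 20-dimensional one.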

    Classification of Frequency and Phase Encoded Steady State Visual Evoked Potentials for Brain Computer Interface Speller Applications using Convolutional Neural Networks

    Over the past decade there have been substantial improvements in vision-based Brain-Computer Interface (BCI) spellers for quadriplegic patient populations. This thesis contains a review of the numerous bio-signals available to BCI researchers, as well as a brief chronology of the foremost decoding methodologies used to date. Recent advances in classification accuracy and information transfer rate can be attributed primarily to time-consuming, patient-specific parameter optimization procedures. The aim of the current study was to develop analysis software with potential ‘plug-in-and-play’ functionality. To this end, convolutional neural networks, presently established as state-of-the-art analytical techniques for image processing, were utilized. The thesis defines a deep convolutional neural network architecture for the offline classification of phase- and frequency-encoded SSVEP bio-signals. Networks were trained using an extensive 35-participant open-source electroencephalographic (EEG) benchmark dataset (Department of Bio-medical Engineering, Tsinghua University, Beijing). An average classification accuracy of 82.24% and an information transfer rate of 22.22 bpm were achieved on a BCI-naïve participant dataset for a 40-target alphanumeric display, in the absence of any patient-specific parameter optimization.
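The abstract does not say how the information transfer rate was computed or how long each selection took. A minimal sketch, assuming the widely used Wolpaw definition of ITR (bits per selection scaled to bits per minute), is shown below; the function names and the `seconds_per_selection` parameter are illustrative assumptions.

```python
import math


def bits_per_selection(n_targets, accuracy):
    """Wolpaw information per selection for N targets at accuracy P (assumed formula)."""
    if n_targets < 2:
        raise ValueError("n_targets must be at least 2")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    p = accuracy
    bits = math.log2(n_targets)
    if p > 0.0:
        bits += p * math.log2(p)
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    return bits


def itr_bits_per_minute(n_targets, accuracy, seconds_per_selection):
    """Scale bits per selection to bits per minute for a given selection time."""
    return bits_per_selection(n_targets, accuracy) * 60.0 / seconds_per_selection


# 40 targets at the reported accuracy; the selection time is NOT given in the
# abstract, so the rate per minute depends on a value the caller must supply.
print(bits_per_selection(40, 0.8224))
```

Under this definition a perfect classifier on a 40-target display conveys log2(40) bits per selection, and a classifier at chance level conveys none.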

    Peptides, DNA and MIPs in gas sensing. From the realization of the sensors to sample analysis

    Detection and monitoring of volatiles is a challenging and fascinating issue in environmental analysis, agriculture and food quality, industrial process control, and ‘point of care’ diagnostics. Gas chromatographic approaches remain the reference method for the analysis of volatile organic compounds (VOCs); however, gas sensors (GSs), with their advantages of low cost and little or no sample preparation, have become a reality. Gas sensors can be used singly or in array format (e.g., e-noses); coupling the data output with multivariate statistical treatment allows untargeted analysis of the sample headspace. Within this frame, the use of new binding elements as recognition/interaction elements in gas sensing is a challenging hot topic that has enabled unexpected advances. This review reports the latest developments in gas sensors and gas sensor arrays realized using peptides, molecularly imprinted polymers (MIPs) and DNA. It focuses on the strategies used for GS development, the function of the sensing elements, the set-up of the sensor arrays, and their application in real cases.

    Dynamic Data Mining: Methodology and Algorithms

    Supervised data stream mining has become an important and challenging data mining task in modern organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions. To address these three challenges, this thesis proposes the novel dynamic data mining (DDM) methodology, which applies supervised ensemble models to data stream mining. DDM can be loosely defined as the categorization, organization and selection of supervised ensemble models. It is inspired by the idea that although the underlying concepts in a data stream are time-varying, their distinctions can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in order to classify incoming examples of similar concepts. First, following the general paradigm of DDM, we examine the different concept-drifting stream mining scenarios and propose corresponding effective and efficient data mining algorithms.
    • To address concept drift caused merely by changes of variable distributions, which we term pseudo concept drift, base models built on categorized streaming data are organized and selected in line with their corresponding variable distribution characteristics.
    • To address concept drift caused by changes of variable and class joint distributions, which we term true concept drift, an effective data categorization scheme is introduced. A group of working models is dynamically organized and selected for reacting to the drifting concept.
    Secondly, we introduce an integration stream mining framework that makes the paradigm advocated by DDM widely applicable to other stream mining problems. With it, we are able to readily introduce six effective algorithms for mining data streams with skewed class distributions. In addition, we introduce a new ensemble model approach for batch learning, following the same methodology. Both theoretical and empirical studies demonstrate its effectiveness. Future work will target improving the effectiveness and efficiency of the proposed algorithms. Meanwhile, we will explore the possibility of using the integration framework to solve other open stream mining research problems.
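The abstract describes selecting base models according to the variable distribution of the data they were trained on, but gives no algorithmic detail. The sketch below is a hypothetical illustration of that selection idea only, not the thesis's algorithms: each base model is a nearest-centroid classifier trained on one chunk, the chunk's mean feature vector stands in for its "variable distribution characteristics", and the model whose profile is closest to the incoming batch is chosen. All names and data are assumptions.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


class ChunkModel:
    """Nearest-centroid classifier trained on one categorized chunk of the stream."""

    def __init__(self, X, y):
        # Summary of the chunk's variable distribution (assumed: the feature mean).
        self.profile = mean_vector(X)
        self.centroids = {
            label: mean_vector([x for x, l in zip(X, y) if l == label])
            for label in set(y)
        }

    def predict(self, x):
        return min(self.centroids, key=lambda lab: math.dist(x, self.centroids[lab]))


def select_model(models, batch):
    """Pick the base model whose training profile is closest to the incoming batch."""
    batch_profile = mean_vector(batch)
    return min(models, key=lambda m: math.dist(m.profile, batch_profile))


# Two toy chunks drawn from different regions of the feature space.
models = [
    ChunkModel([[0, 0], [0, 1], [1, 0], [1, 1]], ["neg", "neg", "pos", "pos"]),
    ChunkModel([[10, 10], [10, 11], [11, 10], [11, 11]], ["neg", "pos", "neg", "pos"]),
]

incoming = [[10.2, 10.9], [10.8, 10.1]]
chosen = select_model(models, incoming)
predictions = [chosen.predict(x) for x in incoming]
print(predictions)
```

A mean vector is a deliberately crude distribution summary; a fuller treatment would compare richer statistics and would also need a strategy for the true concept drift case, where class-conditional behaviour changes as well.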