1,438 research outputs found
Using large-scale data to find relationships between genes (Suuremahuliste andmete kasutamine geenidevaheliste seoste leidmiseks)
The electronic version of the thesis does not include the publications. Genes determine which RNA and protein molecules a living organism consists of. Identifying genes alone is not enough to understand how an organism functions, when and how the various gene products are expressed, and what they do. To understand the nature of a living organism and to influence biological processes, it is necessary to understand the relationships between genes and proteins. High-throughput technologies make it easy to measure different facets of biological processes. This in turn has led to ever faster growth in data volumes and to a need for new methods that help analyse raw data, combine datasets, and visualise results. The need to test computationally whether existing data models describe the biological object of study accurately enough has also grown.
This thesis presents various bioinformatics methods showing how combining large-scale experimental data of different types can be applied to find relationships between genes. Integrating large-scale data and making them mutually comparable can add value to them. In the course of the work, information published in the supplementary files of publications on the regulation of embryonic stem cells was collected and made publicly accessible in the form of the ESCDb database. Using these data, researchers can find relationships between genes that cannot be identified by analysing separate datasets. Combining the information collected in the database with computational modelling led to the discovery, within this work, of a new regulator in embryonic stem cells: IL11.
In addition, combining different data types made it possible to find alternative target gene modules of OCT4, the central regulator of embryonic stem cells. Using DNA conservation information together with regulatory motif analysis, three new regulatory proteins of adipose stem cell differentiation were found. The thesis also covers VisHiC, an automatic clustering and visualisation methodology that helps highlight interesting gene groups for further study with other methods.
The thesis shows various ways of integrating large-scale datasets that make it possible to find relationships between genes that could not be found by analysing one dataset at a time.
In order to understand the basic principles of how organisms function, and to be able to influence biological processes, we need to understand the relationships between genes and proteins. Modern high-throughput technology makes it possible to study different aspects of biological processes rapidly. This, however, has led to steady growth in the amount of data available. A need has emerged for more sophisticated methods for analysing raw data, combining different data sources, and visualising the results. Additionally, computational modelling is required to test whether our understanding of biological processes is supported by the available data.
A variety of bioinformatics methods are used to demonstrate how different types of high-throughput data can be combined to identify relationships between genes. Furthermore, it is shown that combining various data types from different sources adds value to already published data. In the thesis, data from publications about embryonic stem cell regulation were collected and made available through the Embryonic Stem Cell Database (ESCDb). Complementary data in the database allow researchers to find relationships between genes that could not be found by analysing only one dataset at a time. One of the main findings of this study illustrates how computational modelling on data from the ESCDb made it possible to find a novel pluripotency regulator, IL11.
Additionally, integration of different data types led to the identification of alternative gene regulatory modules of the core pluripotency regulator OCT4. Similarly, a combination of conservation data and regulatory motif analysis led to the identification of three new regulators of adipocyte differentiation. This thesis also covers VisHiC, a methodology for automatic identification and visualisation of functionally related gene sets. This methodology makes it possible to find relevant gene sets for further characterisation in large high-throughput datasets.
This doctoral thesis demonstrates that integration of different high-throughput datasets enables establishing gene-gene relationships that could not be found when looking at a single data type in isolation.
Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics
Bioinformatics has been an emerging area of research for the last three decades. Its ultimate aims are to store and manage biological data, and to develop and analyze computational tools that enhance the understanding of those data. The volume of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms have been proposed. Classifying protein sequences into existing superfamilies helps in predicting the structure and function of the large number of newly discovered proteins. Existing classification results are unsatisfactory because the feature encoding methods in use produce very large feature sets. In this work, a statistical metric-based feature selection technique is proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measures: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
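The abstract does not name the statistical metric it uses, so the following is only a minimal sketch of the general idea of metric-based feature selection, assuming a two-class problem and a Fisher-style score (squared difference of class means divided by the sum of class variances). The function names `fisher_scores` and `select_top_k` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Score each feature column by a Fisher-style ratio (assumed metric).

    X is a list of equal-length feature rows; y holds labels 0 or 1.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        between = (mean(pos) - mean(neg)) ** 2
        within = pvariance(pos) + pvariance(neg)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class spread: perfectly separating or uninformative.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Keep the k highest-scoring feature columns, in original column order."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced
```

For example, `select_top_k(X, y, k=2)` returns the indices of the two retained columns together with the reduced feature matrix, which could then be passed to any classifier.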
Classification of Frequency and Phase Encoded Steady State Visual Evoked Potentials for Brain Computer Interface Speller Applications using Convolutional Neural Networks
Over the past decade there have been substantial improvements in vision-based Brain-Computer Interface (BCI) spellers for quadriplegic patient populations. This thesis contains a review of the numerous bio-signals available to BCI researchers, as well as a brief chronology of the foremost decoding methodologies used to date. Recent advances in classification accuracy and information transfer rate can be primarily attributed to time-consuming, patient-specific parameter optimization procedures. The aim of the current study was to develop analysis software with potential ‘plug-in-and-play’ functionality. To this end, convolutional neural networks, presently established as state-of-the-art analytical techniques for image processing, were utilized. The thesis defines a deep convolutional neural network architecture for the offline classification of phase and frequency encoded SSVEP bio-signals. Networks were trained using an extensive 35-participant open source electroencephalographic (EEG) benchmark dataset (Department of Bio-medical Engineering, Tsinghua University, Beijing). Average classification accuracies of 82.24% and information transfer rates of 22.22 bpm were achieved on a BCI-naïve participant dataset for a 40-target alphanumeric display, in the absence of any patient-specific parameter optimization.
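The abstract reports an information transfer rate without stating how it was computed. One widely used definition in the BCI literature is the Wolpaw formula; the sketch below assumes that definition and interprets "bpm" as bits per minute. The function name and the `seconds_per_selection` parameter are illustrative, and the selection time used in the thesis is not given in the abstract.

```python
import math


def itr_bits_per_minute(n_targets, accuracy, seconds_per_selection):
    """Wolpaw information transfer rate in bits per minute (assumed definition).

    bits = log2(N) + P*log2(P) + (1 - P)*log2((1 - P) / (N - 1))
    """
    if n_targets < 2:
        raise ValueError("need at least two targets")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if seconds_per_selection <= 0:
        raise ValueError("selection time must be positive")

    p = accuracy
    bits = math.log2(n_targets)
    if p > 0.0:
        bits += p * math.log2(p)  # this term tends to 0 as p -> 0
    if p < 1.0:
        bits += (1.0 - p) * math.log2((1.0 - p) / (n_targets - 1))
    return bits * 60.0 / seconds_per_selection
```

A call such as `itr_bits_per_minute(40, 0.8224, t)` would give the rate for the 40-target display at the reported accuracy once a selection time `t` in seconds is known.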
Peptides, DNA and MIPs in gas sensing. From the realization of the sensors to sample analysis
Detection and monitoring of volatiles is a challenging and fascinating issue in environmental analysis, agriculture and food quality, process control in industry, as well as in ‘point of care’ diagnostics. Gas chromatographic approaches remain the reference method for the analysis of volatile organic compounds (VOCs); however, gas sensors (GSs), with their advantages of low cost and little or no sample preparation, have become a reality. Gas sensors can be used singly or in array format (e.g., e-noses); coupling data output with multivariate statistical treatment allows untargeted analysis of sample headspace. Within this frame, the use of new binding elements as recognition/interaction elements in gas sensing is a challenging hot topic that has allowed unexpected advances. In this review, the latest developments in gas sensors and gas sensor arrays realized using peptides, molecularly imprinted polymers and DNA are reported. This work focuses on the description of the strategies used for GS development, the function of the sensing elements, the sensor array set-up, and the application in real cases.
Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
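The selection step described in the first bullet can be illustrated with a minimal sketch. It assumes, purely for illustration, that each base model is stored with the feature centroid of the data it was trained on and that the model whose centroid is nearest to the incoming batch is selected. The class `ModelPool` and its methods are hypothetical and are not the thesis's algorithms.

```python
import math
from statistics import mean


def centroid(batch):
    """Per-feature mean of a batch given as a list of feature rows."""
    return [mean(column) for column in zip(*batch)]


class ModelPool:
    """Base models indexed by the centroid of their training data (assumed)."""

    def __init__(self):
        self._entries = []  # list of (centroid, model) pairs

    def add(self, training_batch, model):
        """Register a model together with its training-data centroid."""
        self._entries.append((centroid(training_batch), model))

    def select(self, incoming_batch):
        """Return the model whose training centroid is nearest to the batch."""
        if not self._entries:
            raise LookupError("the pool holds no models")
        target = centroid(incoming_batch)
        _, model = min(self._entries, key=lambda e: math.dist(e[0], target))
        return model

    def classify(self, incoming_batch):
        """Label every example in the batch with the selected model."""
        model = self.select(incoming_batch)
        return [model(x) for x in incoming_batch]
```

Here a model is any callable that maps one feature row to a label. Selecting by distribution summary rather than by recency is what lets an older model be reused when a previously seen distribution recurs.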
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable to other stream mining problems. Using this framework, we
introduce six effective algorithms for mining data streams with skewed class distributions.
In addition, we also introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work will target improving the effectiveness and efficiency of the proposed
algorithms. Meanwhile, we will explore the possibilities of using the integration framework to solve
other open stream mining research problems.
Understanding transcriptional regulation through computational analysis of single-cell transcriptomics
Gene expression is tightly regulated by complex transcriptional regulatory mechanisms to achieve specific expression patterns, which are essential to facilitate important biological processes such as embryonic development. Dysregulation of gene expression can lead to diseases such as cancers. A better understanding of the transcriptional regulation will therefore not only advance the understanding of fundamental biological processes, but also provide mechanistic insights into diseases.
The earlier versions of high-throughput expression profiling techniques were limited to measuring average gene expression across large pools of cells. In contrast, recent technological improvements have made it possible to perform expression profiling in single cells. Single-cell expression profiling is able to capture heterogeneity among single cells, which is not possible in conventional bulk expression profiling.
In my PhD, I focus on developing new algorithms, as well as benchmarking and utilising existing algorithms to study the transcriptomes of various biological systems using single-cell expression data. I have developed two different single-cell specific network inference algorithms, BTR and SPVAR, which are based on two different formalisms, Boolean and autoregression frameworks respectively. BTR was shown to be useful for improving existing Boolean models with single-cell expression data, while SPVAR was shown to be a conservative predictor of gene interactions using pseudotime-ordered single-cell expression data.
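To make the Boolean formalism concrete, the sketch below simulates a tiny synchronous Boolean network. The genes `A`, `B` and `C` and their rules are invented for illustration; this is not the BTR algorithm, which the abstract describes only as improving existing Boolean models with single-cell expression data.

```python
def step(state, rules):
    """Apply every rule to the current state at once (synchronous update)."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state, rules, max_steps=50):
    """Iterate until a previously visited state recurs or max_steps is hit."""
    trajectory = [dict(state)]
    seen = {tuple(sorted(state.items()))}
    for _ in range(max_steps):
        state = step(state, rules)
        trajectory.append(dict(state))
        key = tuple(sorted(state.items()))
        if key in seen:
            break  # entered a fixed point or a cycle
        seen.add(key)
    return trajectory


# Hypothetical three-gene network: A sustains itself, A activates B unless
# C is on, and A represses C.
rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["A"],
}

trajectory = simulate({"A": True, "B": False, "C": True}, rules)
```

In this toy network the trajectory settles into a fixed point, which is the kind of steady state that Boolean models typically associate with a stable expression pattern.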
In addition, I have obtained novel biological insights by analysing single-cell RNAseq data from the epiblast stem cell reprogramming and leukaemia systems. Three different driver genes, namely Esrrb, Klf2 and GY118F, were shown to drive reprogramming of epiblast stem cells via different reprogramming routes. As for the leukaemia system, FLT3-ITD and IDH1-R132H mutations were shown to interact with each other and potentially predispose some cells to developing acute myeloid leukaemia.
Wellcome Trust and Cambridge Trus
Intelligent Devices for IoT Applications
Internet of Things (IoT) devices form a vast network of physical devices that are connected to the internet and can communicate with each other through sensors and software. These devices range from simple household appliances, like smart thermostats and security cameras, to more complex industrial equipment, such as sensors used in manufacturing and logistics. In particular, IoT-enabled wireless gas sensing systems that can withstand harsh environments without compromising performance are becoming increasingly popular, which calls for further development in this field. As the essential components of a wireless gas sensing system, both the sensor and the communication elements should be agile and resilient in unfavorable scenarios. Moreover, gas sensors are prone to drift, which can lead to inaccurate readings and decreased reliability over time. In addition, recent advancements in antenna design, such as fractal antennas and metamaterial structures, have shown promise in improving the bandwidth and gain of antennas built on high-temperature-tolerant substrates. This research targets three fundamental areas: demonstration of recent advances in data-driven techniques for gas sensing system optimization, design of antennas for different applications, and device design and fabrication. The Dimatix DMP-2831 inkjet printer has been optimized to operate with six different inks and two different substrates, including PET and a 3 mol yttria-stabilized zirconia (3YSZ) based ceramic substrate. Next, a feature-oriented analysis of gas sensor data is presented to investigate correlations among stability, selectivity and long-term drift; it shows significant relations among those parameters that can be considered when designing intelligent data-driven models to compensate for drift.
Moreover, a subspace transfer based approach is proposed to classify drifted gas sensor responses so that a particular gas can be detected with higher accuracy. The model achieved an average accuracy greater than 87% while using only 40% of the total dataset for training. In the field of antenna technology, a co-planar waveguide (CPW) fed super wideband antenna is proposed which, according to the simulated performance, can cover the C, X, Ku, K, Ka, Q, V, and W bands with high gain and radiation efficiency. In addition, a high-temperature-tolerant antenna based on a 3YSZ substrate is proposed, which achieved good agreement between the simulated and fabricated device performance.