
    Comparison of metaheuristic strategies for peakbin selection in proteomic mass spectrometry data

    Mass spectrometry (MS) data provide a promising basis for biomarker discovery, and the detection of relevant peakbins in MS data is accordingly under intense research. Data from mass spectrometry are challenging to analyze because of their high dimensionality and the generally low number of available samples. To tackle this problem, the scientific community is becoming increasingly interested in applying feature subset selection techniques based on specialized machine learning algorithms. In this paper, we present a performance comparison of four metaheuristics: best first (BF), genetic algorithm (GA), scatter search (SS) and variable neighborhood search (VNS). With the exception of GA, these algorithms are applied here for the first time to detect relevant peakbins in MS data. All of the metaheuristic searches are embedded in two different schemes, filter and wrapper, coupled with Naive Bayes and SVM classifiers.
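
    As an illustrative aside, the wrapper scheme can be sketched in a few lines of Python: a subset of peakbins is scored by the cross-validated accuracy of a classifier trained on it, and a simple one-flip local search explores neighbouring subsets. The classifier choice, the 5-fold fitness and the single-flip neighbourhood below are assumptions of the sketch, not the paper's exact configuration.

        # Hedged sketch of wrapper-based peakbin selection; X is a samples-by-peakbins
        # intensity matrix and y the class labels (both assumed given).
        import numpy as np
        from sklearn.naive_bayes import GaussianNB
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)

        def wrapper_score(X, y, mask):
            # fitness of a peakbin subset = cross-validated accuracy of the classifier
            if not mask.any():
                return 0.0
            return cross_val_score(GaussianNB(), X[:, mask], y, cv=5).mean()

        def local_search(X, y, n_iter=200):
            n_bins = X.shape[1]
            mask = rng.random(n_bins) < 0.05         # start from a small random subset
            best = wrapper_score(X, y, mask)
            for _ in range(n_iter):
                cand = mask.copy()
                cand[rng.integers(n_bins)] ^= True   # flip one peakbin in or out
                score = wrapper_score(X, y, cand)
                if score >= best:                    # accept non-worsening moves
                    mask, best = cand, score
            return mask, best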

    A survey on computational intelligence approaches for predictive modeling in prostate cancer

    Predictive modeling in medicine involves the development of computational models which are capable of analysing large amounts of data in order to predict healthcare outcomes for individual patients. Computational intelligence approaches are suitable when the data to be modelled are too complex for conventional statistical techniques to process quickly and efficiently. These advanced approaches are based on mathematical models that have been especially developed for dealing with the uncertainty and imprecision typically found in clinical and biological datasets. This paper provides a survey of recent work on computational intelligence approaches that have been applied to prostate cancer predictive modeling, and considers the challenges which need to be addressed. In particular, the paper adopts a broad definition of computational intelligence which includes evolutionary algorithms (also known as metaheuristic or nature-inspired optimisation algorithms), artificial neural networks, deep learning, fuzzy-based approaches, and hybrids of these, as well as Bayesian approaches and Markov models. Metaheuristic optimisation approaches, such as ant colony optimisation, particle swarm optimisation, and artificial immune networks, have been utilised for optimising the performance of prostate cancer predictive models, and the suitability of these approaches is discussed.
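
    Of the metaheuristics surveyed, particle swarm optimisation is the easiest to illustrate. The hedged Python sketch below uses a small swarm to tune two SVM hyperparameters against cross-validated accuracy; the inertia and acceleration constants, the log-scale search box and the SVM target are illustrative assumptions, not a method taken from the survey.

        # Hedged PSO sketch: particles move through (log10 C, log10 gamma) space,
        # scored by cross-validated accuracy of an SVM (all settings illustrative).
        import numpy as np
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        def fitness(pos, X, y):
            C, gamma = 10.0 ** pos
            return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

        def pso(X, y, n_particles=10, n_iter=20, lo=-3.0, hi=3.0):
            rng = np.random.default_rng(1)
            pos = rng.uniform(lo, hi, (n_particles, 2))
            vel = np.zeros_like(pos)
            pbest = pos.copy()
            pbest_val = np.array([fitness(p, X, y) for p in pos])
            for _ in range(n_iter):
                gbest = pbest[pbest_val.argmax()].copy()   # best position seen so far
                r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
                vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
                pos = np.clip(pos + vel, lo, hi)
                vals = np.array([fitness(p, X, y) for p in pos])
                better = vals > pbest_val                  # update personal bests
                pbest[better], pbest_val[better] = pos[better], vals[better]
            return pbest[pbest_val.argmax()], pbest_val.max()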

    Multi-Objective Optimization in Metabolomics/Computational Intelligence

    The development of reliable computational models for detecting non-linear patterns encased in high-throughput datasets and characterizing them into phenotypic classes has been of particular interest, and comprises dynamic studies in metabolomics and the other omics sciences. Clinical conditions that have been associated with these studies include metabotypes in cancer, inflammatory bowel disease (IBD), asthma, diabetes, traumatic brain injury (TBI), metabolic syndrome, and Parkinson's disease, to mention just a few. The traction in this domain is attributable to advancements in the procedures involved in the acquisition of 1H NMR-linked datasets, which have fuelled the generation of a wide abundance of datasets. High-throughput datasets generated by modern 1H NMR spectrometers are often characterized by features that are uninformative, redundant and inherently correlated, which makes it difficult for conventional multivariate analysis techniques to efficiently capture important signals and patterns. The work covered in this research thesis therefore provides novel alternative techniques to address the limitations of current analytical pipelines. This work delineates 13 variants of population-based nature-inspired metaheuristic optimization algorithms, which were further developed in this thesis as wrapper-based feature selection optimizers. The optimizers were then evaluated and benchmarked against each other through numerical experiments, using large-scale 1H NMR-linked datasets emerging from three disease studies. The first is a study of patients diagnosed with Malan syndrome, an autosomal dominant inherited disorder marked by a distinctive facial appearance, learning disabilities, and gigantism culminating in tall stature and macrocephaly, also referred to as cerebral gigantism. Another study involved Niemann-Pick Type C1 (NP-C1), a rare progressive neurodegenerative condition marked by intracellular accrual of cholesterol and complex lipids, including sphingolipids and phospholipids, in the endosomal/lysosomal system. The third study involved the investigation of sore throat in humans (also known as 'pharyngitis'), an acute infection of the upper respiratory tract that affects the respiratory mucosa of the throat. In all three cases, samples from pathologically-confirmed cohorts with corresponding controls were acquired, and metabolomics investigations were performed using the 1H NMR technique. Thereafter, computational optimizations were conducted on the three high-dimensional datasets generated from these disease studies, so that key biomarkers and the most efficient optimizers were identified in each study. The clinical and biochemical significance of the results arising from this work are discussed and highlighted.
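
    To make the wrapper-based, multi-objective formulation concrete, the short Python sketch below scores a candidate subset of NMR buckets on two objectives (cross-validated accuracy, to be maximised, and number of retained buckets, to be minimised) and tests Pareto dominance between candidates. The classifier and fold count are assumptions of the sketch, not the thesis's 13 optimizers.

        # Hedged sketch of a bi-objective wrapper fitness for 1H NMR bucket selection;
        # X (samples x buckets), y (phenotype labels) and mask (boolean subset) assumed.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        def objectives(X, y, mask):
            # objective 1: cross-validated accuracy (maximise)
            # objective 2: number of retained buckets (minimise)
            acc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, mask], y, cv=5).mean()
            return acc, int(mask.sum())

        def dominates(a, b):
            # candidate a dominates b if it is no worse on both objectives
            # and strictly better on at least one
            return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])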

    Combinatorial optimization for affinity proteomics

    Biochemical test development can significantly benefit from combinatorial optimization. Multiplex assays require complex planning decisions during implementation and subsequent validation, and given the increasing complexity of setups and the limited resources, the need to work efficiently is a key element for the success of biochemical research and test development. The first problem approached was to systematically pool samples in order to create a multi-positive control sample. We could show that pooled samples exhibit a predictable serological profile, and that this prediction can be used to design a pooled sample with the desired properties. For serological assay validation it must be shown that low, medium, and high levels can be reliably measured; it is shown how to optimally choose a few samples to achieve these requirements. Finally, the two methods were merged to validate multiplexed assays using a set of pooled samples, introducing a novel algorithm that combines fast enumeration with a set cover formulation. The major part of the thesis deals with optimization and data analysis for Triple X Proteomics (TXP) - immunoaffinity assays using antibodies that bind short linear, terminal epitopes of peptides. It is shown that the problem of choosing a minimal set of epitopes for TXP setups, which combine mass spectrometry with immunoaffinity enrichment, is equivalent to the well-known set cover problem. TXP sandwich immunoassays capture and detect peptides by combining C-terminal and N-terminal binders. A greedy heuristic and a metaheuristic using local search are presented, which prove to be more efficient than pure ILP formulations. All models were implemented in the novel Java framework SCPSolver, which is applicable to many problems that can be formulated as integer programs. While the main design goal of the software was usability, it also provides a basic modelling language, easy deployment and platform independence. One question arising when analyzing TXP data was: how likely is it to observe multiple peptides sharing the same terminus? The algorithms TXP-TEA and MATERICS were able to identify binding characteristics of TXP antibodies from data obtained in immunoaffinity MS experiments, reducing the cost of such analyses. A multinomial statistical model explains the distributions of short sequences observed in protein databases, which allows deducing the average optimal length of the targeted epitope. Further, a closed-form scoring function for epitope enrichment in sequence lists is derived.
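
    Since the epitope selection task is stated to be equivalent to set cover, its classic greedy heuristic fits in a few lines. The Python sketch below (the thesis itself used Java/SCPSolver and ILP formulations) repeatedly picks the epitope covering the most still-uncovered peptides; the toy peptide and epitope data are hypothetical.

        # Greedy set-cover sketch for choosing terminal epitopes: each candidate
        # epitope "covers" the peptides whose terminus it binds.
        def greedy_epitope_cover(peptides, epitope_to_peptides):
            uncovered = set(peptides)
            chosen = []
            while uncovered:
                # pick the epitope covering the most still-uncovered peptides
                best = max(epitope_to_peptides,
                           key=lambda e: len(epitope_to_peptides[e] & uncovered))
                gain = epitope_to_peptides[best] & uncovered
                if not gain:
                    break          # remaining peptides cannot be covered at all
                chosen.append(best)
                uncovered -= gain
            return chosen, uncovered

        peptides = {"AEKLR", "GGKLR", "TTYNA", "PWYNA"}
        epitopes = {"KLR": {"AEKLR", "GGKLR"}, "YNA": {"TTYNA", "PWYNA"}, "AEK": {"AEKLR"}}
        print(greedy_epitope_cover(peptides, epitopes))  # -> (['KLR', 'YNA'], set())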

    Genetic Programming for Biomarker Detection in Classification of Mass Spectrometry Data

    Mass spectrometry (MS) is currently the most commonly used technology in biochemical research for proteomic analysis. The primary goal of proteomic profiling using mass spectrometry is the classification of samples from different experimental states. To classify the MS samples, the identification of proteins or peptides (biomarker detection) that are expressed differently between the classes is required. However, due to the high dimensionality of the data and the small number of samples, classification of MS data is extremely challenging. Another important aspect of biomarker detection is the verification of the detected biomarkers, which acts as an intermediate step before passing these biomarkers to the experimental validation stage. Biomarker detection aims at altering the input space of the learning algorithm to improve the classification of proteomic or metabolomic data. This task is performed through feature manipulation, which consists of three aspects: feature ranking, feature selection, and feature construction. Genetic programming (GP) is an evolutionary computation algorithm that has the intrinsic capability for all three aspects of feature manipulation, yet its ability for feature manipulation in proteomic biomarker discovery has not been fully investigated. This thesis, therefore, proposes an embedded methodology for these three aspects of feature manipulation in high-dimensional MS data using GP, and also presents a GP-based method for biomarker verification. The thesis investigates the use of GP for both single-objective and multi-objective feature selection and construction. In feature ranking, the thesis proposes a GP-based method for ranking subsets of features by using GP as an ensemble approach. The proposed algorithm uses the capability of GP to combine the advantages of different feature ranking metrics and to evolve a new ranking scheme for the subset of features selected from the top-ranked features. The capability of GP as a classifier is also investigated by this method. The results show that GP can select a smaller number of features and provide a better ranking of the selected features, which can improve the classification performance of five classifiers. In feature construction, the thesis proposes a novel multiple feature construction method, which uses a single GP tree to generate a new set of high-level features from the original set of selected features. The results show that the proposed algorithm outperforms two feature selection algorithms. In feature selection, the thesis introduces the first GP multi-objective method for biomarker detection, which simultaneously increases the classification accuracy and reduces the number of detected features. The proposed multi-objective method obtains better subsets of features than the single-objective algorithm and two traditional multi-objective approaches for feature selection. The thesis also develops the first multi-objective multiple feature construction algorithm for MS data, which aims at both maximising the classification performance and minimising the cardinality of the constructed high-level features. The results show that GP can discover the complex relationships between the features and can significantly improve classification performance while reducing the cardinality. For biomarker verification, the thesis proposes the first GP biomarker verification method, based on measuring peptide detectability. The method solves the imbalance problem in the data and shows improvement over the benchmark algorithms; it also outperforms a well-known peptide detection method. Finally, the thesis introduces a new GP method for alignment of MS data as a preprocessing stage, which further helps to improve the biomarker detection process.
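
    As a rough illustration of GP-style feature construction, the Python toy below samples random arithmetic trees over the original features and keeps the tree whose output best separates two classes. Real GP, including the thesis's single-tree multiple-feature construction, evolves trees with crossover and mutation rather than pure random sampling, so this is only the skeleton of the idea; the depth limit and separation score are illustrative assumptions.

        # Toy sketch of GP-style feature construction over a feature matrix X
        # (samples x features) and binary labels y in {0, 1} (assumed given).
        import numpy as np

        OPS = {"+": np.add, "-": np.subtract, "*": np.multiply}

        def random_tree(n_feats, rng, depth=3):
            if depth == 0 or rng.random() < 0.3:
                return ("x", int(rng.integers(n_feats)))   # leaf: one input feature
            op = rng.choice(list(OPS))
            return (op, random_tree(n_feats, rng, depth - 1),
                        random_tree(n_feats, rng, depth - 1))

        def evaluate(tree, X):
            if tree[0] == "x":
                return X[:, tree[1]]
            return OPS[tree[0]](evaluate(tree[1], X), evaluate(tree[2], X))

        def fitness(tree, X, y):
            # separation of class means, scaled by pooled spread (crude filter score)
            f = evaluate(tree, X)
            a, b = f[y == 0], f[y == 1]
            return abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-9)

        def construct_feature(X, y, n_samples=500, seed=0):
            rng = np.random.default_rng(seed)
            trees = [random_tree(X.shape[1], rng) for _ in range(n_samples)]
            return max(trees, key=lambda t: fitness(t, X, y))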

    Development of quantitative structure property relationships to support non-target LC-HRMS screening

    Over the last decade, a high number of emerging contaminants have been detected and identified in surface and waste waters; these could threaten the aquatic environment due to their pseudo-persistence. As described in chapters 1 and 2, liquid chromatography coupled to high resolution mass spectrometry (LC-HRMS) can be used as an efficient tool for their screening. Simultaneous screening of these samples by hydrophilic interaction liquid chromatography (HILIC) and reversed phase (RP) chromatography helps with the full identification of suspect and unknown compounds. However, to confirm the identity of the most relevant suspect or unknown compounds, their chemical properties, such as retention time behavior, MSn fragmentation and ionization modes, should be investigated. Chapter 3 of this thesis discusses the development of a comprehensive workflow to study the retention time behavior of large groups of compounds belonging to the emerging contaminants. A dataset consisting of more than 2,500 compounds was used for RP/HILIC-LC-HRMS, and their retention times were derived in both electrospray ionization modes (+/-ESI). These in silico approaches were then applied to the identification of 10 new transformation products of tramadol, furosemide and niflumic acid (under ozonation treatment). Chapter 4 discusses the development of a first retention time index (RTI) system for LC-HRMS, able to predict the retention time of any possible contaminant regardless of the chromatographic method used and thereby to compare results across different LC-HRMS methods; practical applications of this RTI system in suspect and non-target screening in collaborative trials are presented as well. Chapter 5 describes the development of in silico toxicity models to estimate the acute toxicity of emerging pollutants in the aquatic environment. These help link suspect/non-target screening results to a tentative environmental risk by predicting the toxicity of newly, tentatively identified compounds for which experimental toxicity data are not yet available. Chapter 6 introduces an automatic and systematic way to perform suspect and non-target screening of LC-HRMS data, which saves time, reduces the data analysis load, and enables the routine application of non-target screening for regulatory or monitoring purposes.
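
    The retention-time prediction step can be pictured as a standard QSPR regression, as in the hedged Python sketch below, which fits a random forest to a descriptor matrix and reports the hold-out error. The descriptors, data and model family are placeholders: the thesis built its models on measured retention times of more than 2,500 compounds.

        # Hedged QSPR sketch: regress retention time on molecular descriptors.
        # All data below are random placeholders, not the thesis's dataset.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_absolute_error

        rng = np.random.default_rng(0)
        X = rng.random((300, 10))              # placeholder descriptors (e.g. logP, MW, TPSA)
        rt = X @ rng.random(10) * 10.0 + 2.0   # placeholder retention times in minutes

        X_tr, X_te, rt_tr, rt_te = train_test_split(X, rt, test_size=0.2, random_state=0)
        model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, rt_tr)
        print("hold-out MAE (min):", mean_absolute_error(rt_te, model.predict(X_te)))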

    Digital CMOS ISFET architectures and algorithmic methods for point-of-care diagnostics

    Over the past decade, the surge of infectious disease outbreaks across the globe has been redefining how healthcare is provided and delivered to patients, with a clear trend towards distributed diagnosis at the Point-of-Care (PoC). In this context, Ion-Sensitive Field Effect Transistors (ISFETs) fabricated in standard CMOS technology have emerged as a promising solution for a precise, deliverable and inexpensive platform that could be deployed worldwide to provide rapid diagnosis of infectious diseases. This thesis presents advancements towards future ISFET-based PoC diagnostic platforms, proposing and implementing a set of hardware and software methodologies to overcome their main challenges and enhance their sensing capabilities. The first part of this thesis focuses on novel hardware architectures that enable direct integration with computational capabilities while providing the pixel programmability and adaptability required to overcome pressing challenges on ISFET-based PoC platforms. This section explores oscillator-based ISFET architectures, a set of sensing front-ends that encode the chemical information in the duty cycle of a PWM signal. Two initial architectures are proposed and fabricated in AMS 0.35um, confirming multiple degrees of programmability and potential for multi-sensing. One of these architectures is optimised to create a dual-sensing pixel capable of sensing both temperature and chemical information at the same spatial point while modulating both quantities simultaneously onto a single waveform. This dual-sensing capability, verified in silico using the TSMC 0.18um process, is vital for DNA-based diagnosis, where protocols such as LAMP or PCR require precise thermal control. The COVID-19 pandemic highlighted the need for a deliverable diagnostic platform that performs nucleic acid amplification tests at the PoC, requiring minimal footprint by integrating sensing and computational capabilities. In response to this challenge, a paradigm shift is proposed, advocating for integrating all elements of the portable diagnostic platform on a single piece of silicon, realising a "Diagnosis-on-a-Chip". This approach is enabled by a novel Digital ISFET Pixel that integrates both ADC and memory with the sensing elements in each pixel, enhancing parallelism. Furthermore, this architecture removes the need for external instrumentation or memories and facilitates integration with on-chip computational capabilities, such as the proposed ARM Cortex M3 system. These computational capabilities need to be complemented with software methods that enable sensing enhancement and new applications using ISFET arrays; the second part of this thesis is devoted to these methods. Leveraging the programmability available in the oscillator-based architectures, various digital signal processing algorithms are implemented to overcome the most pressing ISFET non-idealities, such as trapped charge, drift and chemical noise. These methods enable fast trapped-charge cancellation and enhanced dynamic range through real-time drift compensation, achieving over 36 hours of continuous monitoring without pixel saturation. Furthermore, the recent development of data-driven models and software methods opens a wide range of opportunities for ISFET sensing and beyond. In the last section of this thesis, two examples of these opportunities are explored: the optimisation of image compression algorithms for chemical images generated by an ultra-high frame-rate ISFET array, and a proposed paradigm shift for surface electromyography (sEMG) signals, moving from data harvesting to information-focused sensing. These examples represent an initial step on a journey towards a new generation of miniaturised, precise and efficient sensors for PoC diagnostics.
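
    One of the simplest forms such drift compensation can take is a slow baseline tracker subtracted from the raw pixel signal, as in the hedged Python sketch below. The time constant and the synthetic signal are illustrative only; the thesis's real-time algorithms are considerably more elaborate.

        # Hedged drift-compensation sketch: a very slow exponential moving average
        # tracks the drift baseline; subtracting it preserves fast chemical events.
        import numpy as np

        def compensate_drift(raw, alpha=1e-4):
            baseline = float(raw[0])
            out = np.empty(len(raw))
            for i, v in enumerate(raw):
                baseline += alpha * (v - baseline)   # slow low-pass = drift estimate
                out[i] = v - baseline                # drift-corrected residual
            return out

        t = np.arange(200_000) / 1000.0              # 200 s at 1 kHz (illustrative)
        raw = 0.02 * t + 0.5 * (t > 100.0)           # linear drift + one chemical step
        corrected = compensate_drift(raw)            # step survives, ramp is removed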

    Semantic Biclustering

    This thesis focuses on the problem of finding interpretable and predictive patterns, which are expressed in the form of biclusters, with an orientation towards biological data. The presented methods are collectively called semantic biclustering, a subfield of data mining. The term semantic biclustering is used here because it reflects both the process of finding coherent subsets of rows and columns in a 2-dimensional binary matrix and, simultaneously, the mutual semantic meaning of the elements in such biclusters. In spite of focusing on applications in biological data, the developed algorithms are generally applicable to any other research field; there are only limitations on the format of the input data. The thesis introduces two novel, and in that context basic, approaches for finding semantic biclusters: Bicluster enrichment analysis and Rule and tree learning. Since these methods do not exploit the native hierarchical order of the terms of the input ontologies, the run-time of the algorithms is in general relatively long, or an induced hypothesis might contain redundant terms. For this reason, a new refinement operator has been invented. The refinement operator was incorporated into the well-known CN2 algorithm and uses two reduction procedures: Redundant Generalization and Redundant Non-potential, both of which help to dramatically prune the rule space and consequently speed up the entire process of rule induction in comparison with the traditional refinement operator as presented in CN2. The entire algorithm, together with the reduction procedures, was published as an R package that we called sem1R. To show a possible practical usage of semantic biclustering in real biological problems, the thesis also describes and specifically adapts the algorithm for two real biological problems. Firstly, we studied a practical application of the sem1R algorithm in an analysis of E3 ubiquitin ligases in the gastrointestinal tract with respect to tissue regeneration potential. Secondly, besides discovering biclusters in gene expression data, we adapted the sem1R algorithm for a different task, concretely for finding potentially pathogenic genetic variants in a cohort of patients.
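
    The Redundant Generalization idea can be illustrated with ontology ancestor sets: a candidate term that generalizes a term already present in the rule adds no constraint and can be pruned before evaluation. The Python fragment below is a hedged toy with a hypothetical two-edge ontology, not the sem1R implementation.

        # Toy redundancy pruning: drop candidate terms that are ancestors
        # (generalizations) of terms already in the rule.
        def ancestors(term, parents):
            seen, stack = set(), [term]
            while stack:
                for p in parents.get(stack.pop(), ()):
                    if p not in seen:
                        seen.add(p)
                        stack.append(p)
            return seen

        def non_redundant_candidates(rule_terms, candidates, parents):
            covered = set().union(*(ancestors(t, parents) | {t} for t in rule_terms))
            return [c for c in candidates if c not in covered]

        parents = {"GO:apoptosis": ["GO:cell_death"], "GO:cell_death": ["GO:process"]}
        print(non_redundant_candidates({"GO:apoptosis"},
                                       ["GO:cell_death", "GO:mitosis"], parents))
        # -> ['GO:mitosis']  (GO:cell_death is a redundant generalization)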

    Fuzzy systems and unsupervised computing: exploration of applications in biology

    In this thesis we explore the use of fuzzy systems theory for applications in bioinformatics. The theory of fuzzy systems is concerned with formulating decision problems in data sets that are ill-defined. It supports the transfer from a subjective human classification to a numerical scale, and in this manner affords the testing of hypotheses and the separation of classes in the data. We first formulate problems in terms of a fuzzy system, and then develop and test algorithms in terms of their performance on data from the domain of the life sciences. From the results and the performance we learn about the usefulness of fuzzy systems for the field, as well as their applicability to these kinds of problems and their practicality for the computation itself.
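
    The basic transfer from a crisp measurement to graded class membership can be sketched with triangular membership functions, as in the short Python example below; the breakpoints and the 'expression level' framing are illustrative assumptions, not taken from the thesis.

        # Triangular membership functions turn a crisp value into graded degrees
        # of membership in overlapping classes, replacing a hard threshold.
        def triangular(x, a, b, c):
            # membership rises linearly on [a, b] and falls on [b, c]
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

        def fuzzify(expression_level):
            return {
                "low":    triangular(expression_level, -1.0, 0.0, 1.0),
                "medium": triangular(expression_level,  0.0, 1.0, 2.0),
                "high":   triangular(expression_level,  1.0, 2.0, 3.0),
            }

        print(fuzzify(0.8))   # -> low ~ 0.2, medium = 0.8, high = 0.0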

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: A rapid technological development in the biosciences and in computer science over the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge.
    Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models is optimized in addition to the predictive accuracy and robustness. The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results:
    (1) the identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining);
    (2) the prediction of novel candidate disease genes for Alzheimer's disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre);
    (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs);
    (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre).
    In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software package and has provided new biological insights in a wide variety of practical settings.
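
    The ensemble/consensus principle behind modules such as these can be miniaturised as follows: several heterogeneous rankers score the same features, and their rank positions are averaged Borda-style into one consensus ordering. The three rankers in this hedged Python sketch are stand-ins for the framework's actual module combinations.

        # Hedged consensus feature-ranking sketch over a feature matrix X and
        # labels y (assumed given); higher scores mean more relevant features.
        import numpy as np
        from scipy.stats import rankdata
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import f_classif, mutual_info_classif

        def consensus_ranking(X, y):
            scores = [
                f_classif(X, y)[0],                           # ANOVA F statistic
                mutual_info_classif(X, y, random_state=0),    # mutual information
                RandomForestClassifier(random_state=0).fit(X, y).feature_importances_,
            ]
            # average the per-ranker rank positions (Borda count); rank 1 = best
            ranks = np.mean([rankdata(-s) for s in scores], axis=0)
            return np.argsort(ranks)   # feature indices, best consensus first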