19 research outputs found

    SOM-Based Class Discovery Exploring the ICA-Reduced Features of Microarray Expression Profiles

    Gene expression datasets are large and complex, with many variables and unknown internal structure. We apply independent component analysis (ICA) to derive a less redundant representation of the expression data. The decomposition produces components with minimal statistical dependence and reveals biologically relevant information. We then apply cluster analysis to the transformed data; cluster analysis is an important and popular tool for obtaining an initial understanding of the data and is usually employed for class discovery. The proposed self-organizing map (SOM)-based clustering algorithm automatically determines the number of 'natural' subgroups of the data, aided in this task by the available prior knowledge of the functional categories of genes. An entropy criterion allows each gene to be assigned to multiple classes, which is closer to the biological representation. These features, however, are not achieved at the cost of the simplicity of the algorithm, since the map grows on a simple grid structure and the learning algorithm remains the same as Kohonen's.
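The entropy criterion mentioned above can be illustrated with a minimal sketch. It assumes one plausible reading: the known functional labels of the genes mapped to a SOM node are summarized by their Shannon entropy, and a node with mixed labels yields several class assignments. The function names and the parameters `entropy_threshold` and `min_share` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the class labels mapped to one SOM node."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def assign_classes(labels, entropy_threshold=0.5, min_share=0.25):
    """Return the class or classes assigned to a node.

    entropy_threshold and min_share are hypothetical tuning parameters,
    not values taken from the paper.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if label_entropy(labels) <= entropy_threshold:
        # Nearly pure node: keep only the majority class.
        return [counts.most_common(1)[0][0]]
    # Mixed node: keep every class that holds a sizeable share of the node.
    return sorted(cls for cls, c in counts.items() if c / total >= min_share)
```

A node whose genes almost all share one functional category gets a single label, while a node split between two categories gets both.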

    Development of advanced computational intelligence models for tackling complex problems in bioinformatics and biosignal processing

    The aim of this thesis was to develop computationally efficient computational intelligence solutions for real-world problems in bioinformatics and biosignal processing. Real-world problems usually involve large data sets and are characterized by complex class separation boundaries. Recently designed supervised models, such as the SVM, can tackle complex pattern classification problems, but they do so in a computationally inefficient way, which often makes their numerical evaluation prohibitive for large data sets. Unsupervised models, on the other hand, demand significantly fewer computational resources but have inherently poor discriminating capabilities near class boundaries. To combine the advantages of both kinds of model, the present work introduced a novel approach that relies on a simple fact: the state space of many complex pattern classification problems consists of regions that lie near class separation boundaries and require the construction of complex discriminants, while in the remaining regions the classification task is significantly simpler. Accordingly, in the first part of this thesis the Supervised Network Self-Organizing Map (sNet-SOM) model was designed. The sNet-SOM uses unsupervised learning to classify in the simple regions and supervised learning in the difficult ones, in a two-stage learning process. The unsupervised learning is based on an adapted version of Kohonen's Self-Organizing Map (SOM), while the supervised learning is based on Generalized Radial Basis Function (GRBF) networks and on Support Vector Machines (SVMs). The performance of the sNet-SOM has been evaluated on synthetic data, on simulated data, and on an ischemia detection application with data extracted from the European ST-T database.
In all cases, using the sNet-SOM with supervised learning based on both Radial Basis Functions and Support Vector Machines significantly improved the results relative to those obtained with the unsupervised SOM and enhanced the scalability of the supervised learning schemes. The second part of this thesis aimed at applying the sNet-SOM model to the analysis of microarray gene expression data. As the Human Genome Project approaches completion of the first finished human sequence (now scheduled for 2003), microarray technology offers the potential to open wide new windows into the study of genome complexity. By facilitating the measurement of RNA levels for the complete set of transcripts of an organism, microarray analysis greatly assists in defining the functions of genes and elucidating important biological pathways. Analyzing the unprecedented quantities of data points that result from these experiments, however, requires sophisticated computational tools. The sNet-SOM model in general met these requirements, but it had to be redesigned to fit the peculiarities of the data. Additionally, because revealing the structure of the data, and not solely classifying genes, still remains a main objective of the analysis, the clustering potential of the sNet-SOM model was advanced by a supplementary expansion of its unsupervised phase. Applying the sNet-SOM to gene expression data yielded classification performance similar to that of other recent high-accuracy classification tools, with the advantages of low computational requirements and the ability to handle the multi-labeled nature of genes, which other recent approaches either cannot handle or neglect.
Furthermore, compared with purely supervised approaches used for classification or purely unsupervised approaches used for uncovering the structure of data, the sNet-SOM model combines both tasks, providing alongside the classification an extensive exploratory analysis tool by means of its unsupervised analysis framework. The thesis proceeds as follows. In the first part, after the introduction in Chapter 1, Chapter 2 provides background by describing the basic algorithm of Kohonen's Self-Organizing Map (SOM). Chapter 3 introduces the sNet-SOM model. The modified unsupervised SOM algorithm is covered in detail and the appropriate supervised models, GRBFs and SVMs, are discussed. To justify the selection of the supervised experts and to explain the advantage of using the sNet-SOM over applying the supervised models directly, the next two chapters are devoted to the GRBFs and the SVMs respectively. Specifically, Chapter 4 concentrates on the first supervised network, giving a concise description of Tikhonov's regularization theory, which forms the formal framework of Radial Basis Function (RBF) networks, and describing those networks along the way. Chapter 5 examines the Support Vector Machine (SVM) algorithm and, for that purpose, briefly explores some mathematical concepts of the Statistical Learning Theory developed by Vapnik. Chapter 6 presents three applications of the sNet-SOM that demonstrate its enhanced performance, and the last chapter of the first part, Chapter 7, presents summarizing remarks and some directions for further improvement. In the second part, after the introduction in Chapter 8, Chapter 9 provides basic background on gene expression and gene expression microarrays. Previous methods for gene expression analysis are reviewed and their drawbacks outlined.
Chapter 10 adapts the sNet-SOM algorithm to the specific requirements of gene expression analysis. The unsupervised phase, the supervised phase, and a combined unsupervised-supervised extension phase are each explained in separate sections. Chapter 11 presents the application of the sNet-SOM to microarray gene expression data from budding yeast. First the experiments from which the data are drawn are explained, followed by the functional classes used. Afterwards, the results are presented together with the corresponding concluding remarks. The second part ends with Chapter 12, which summarizes the conclusions of the analysis of gene expression data with the sNet-SOM. Finally, the whole thesis is briefly reviewed and the overall conclusions are stated.
The aim of the thesis was to develop new computational intelligence models with low computational requirements for tackling real-world problems in bioinformatics and biosignal processing. Real-world problems usually generate large data sets and are characterized by complex class separation boundaries. Recently designed models such as Support Vector Machines can tackle complex pattern classification problems, but the way they do so is computationally so inefficient that, for large data sets, the numerical evaluation of these models becomes almost prohibitive. Unsupervised learning models, on the other hand, although they have significantly lower computational requirements, have an inherently reduced ability to separate classes near class boundaries.
This thesis, aiming to combine the two kinds of model (unsupervised and supervised), introduced a new approach based on a simple fact: the state space of many complex pattern classification problems consists of regions that lie near class separation boundaries and require the construction of complex discriminants, while for the remaining regions the classification problem is considerably simpler. Accordingly, in the first part of the thesis the Supervised Network Self-Organizing Map (sNet-SOM) model was designed, which exploits this non-uniform state space: the sNet-SOM model uses unsupervised learning for classification in the simple regions and supervised learning for the complex regions (that is, those near class separation boundaries). The learning algorithm of the sNet-SOM is therefore formulated in two stages: the unsupervised learning extends and adapts Kohonen's SOM algorithm, while the supervised learning is based on Generalized Radial Basis Function Networks (GRBFN) and on Support Vector Machines (SVMs). The performance of the sNet-SOM was evaluated by applying it to synthetic data, to simulated data, and to a problem of detecting ischemic episodes with data from the European ST-T database (electrocardiogram recordings). In all cases the use of the sNet-SOM, with supervised learning based on GRBFN and on SVMs, significantly improved the results obtained with the unsupervised SOM alone and drastically reduced the computational requirements of the supervised models, because they operate in the significantly reduced space of the ambiguous regions. The second part of the thesis aimed at applying the sNet-SOM model to the analysis of gene expression data from microarrays.
Now that the sequencing of the human genome is heading towards full completion (scheduled for 2003), microarray technology opens new horizons for the study of genome complexity. By allowing the expression of thousands of genes to be measured simultaneously, it helps in discovering the function of genes and clarifies important biological pathways. The analysis of the huge quantities of data that result from these experiments, however, requires advanced computational tools. The sNet-SOM model broadly met the requirements, but it had to be redesigned to suit the particularities of gene expression data. In addition, because for now revealing the structure of the data, and not simply classifying the genes, remains a basic goal of the analysis, the clustering capability of the sNet-SOM was strengthened by adding a further expansion capability to its unsupervised phase. Applying the sNet-SOM to the analysis of gene expression data gave classification performance similar to that of other advanced classification models used recently. In addition, it had the important advantage of low computational requirements and the ability to deal with the fact that genes belong to more than one class (multi-labeling), which is either neglected or cannot be handled by most other methods. Moreover, compared with models that either use supervised learning and classify the data or use unsupervised learning and reveal the structure of the data, the sNet-SOM model manages to combine both processes, because alongside classification it provides a tool for extensive exploration of the data within the framework of unsupervised analysis.
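The two-stage idea described in this abstract (unsupervised learning for simple regions, a supervised expert for regions near class boundaries) can be sketched roughly as follows. This is only an illustration under stated assumptions: `prototypes` stands in for an already trained SOM codebook, ambiguity is measured by the label entropy of the training samples that hit each prototype, and `expert` is a hypothetical callable standing in for a GRBF network or an SVM. None of these names, nor the `max_entropy` threshold, come from the thesis.

```python
import math
from collections import Counter, defaultdict


def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(prototypes)),
               key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)))


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fit_router(prototypes, samples, labels, max_entropy=0.3):
    """Mark each prototype as simple (majority label) or ambiguous (None).

    max_entropy is a hypothetical threshold chosen for illustration.
    """
    hits = defaultdict(list)
    for x, y in zip(samples, labels):
        hits[nearest(prototypes, x)].append(y)
    node_label = {}
    for i in range(len(prototypes)):
        ys = hits.get(i, [])
        if ys and entropy(ys) <= max_entropy:
            node_label[i] = Counter(ys).most_common(1)[0][0]
        else:
            node_label[i] = None  # ambiguous or empty node: defer to the expert
    return node_label


def classify(prototypes, node_label, expert, x):
    """Use the stored node label in simple regions, the expert elsewhere."""
    label = node_label[nearest(prototypes, x)]
    return label if label is not None else expert(x)
```

Because the expert is consulted only for samples that fall on ambiguous prototypes, it would need to be trained on, and evaluated over, a much smaller part of the data, which is the source of the computational savings the abstract describes.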

    The Software Architecture for Performing Scientific Computation with the JLAPACK Libraries in ScalaLab

    Although LAPACK is a powerful library, it is difficult to use. JLAPACK, a Java translation obtained automatically from the Fortran LAPACK sources, retains exactly the same difficult-to-use interface as the LAPACK routines. The MTJ library implements an object-oriented Java interface to JLAPACK that hides many complicated details. ScalaLab exploits the flexibility of the Scala language to present an even friendlier and more convenient interface to the powerful but complicated JLAPACK library. The article describes how the low-level JLAPACK routines are interfaced within the ScalaLab environment, which is done rather easily by exploiting well-suited features of the Scala language. The paper also demonstrates the convenience of using JLAPACK routines for linear algebra operations from within ScalaLab.

    On the Computational Prediction of miRNA Promoters

    Part 10: Mining Humanistic Data Workshop (MHDW). The regulation of microRNA transcription is an open topic in molecular biology, and identifying the promoters of microRNAs would give relevant insights into cellular regulatory mechanisms. In the present study, we introduce a new computational methodology for the prediction of microRNA promoters, based on the hybrid combination of an adaptive genetic algorithm with a nu-Support Vector Regression (nu-SVR) classifier. This methodology uses genetic algorithms to locate the optimal feature set and to optimize the parameters of the nu-SVR classifier. The main advantage of the proposed solution is that it systematically studies and calculates a vast number of features that can be used for promoter prediction, including frequency-based properties, regulatory elements and epigenetic features. The proposed method also efficiently handles the issues of over-fitting, feature selection, convergence and class imbalance. Experimental results give an accuracy of over 87% in miRNA promoter prediction.
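The hybrid scheme described above, in which a genetic algorithm searches over feature subsets and model parameters at the same time, can be sketched as follows. This is a plain generational GA with elitism, not the adaptive variant the paper uses, and every name in it is hypothetical. Each chromosome is assumed to be a pair `(mask, nu)`, where `mask` selects features and `nu` is the nu-SVR parameter. The `fitness` argument is a placeholder: in a real pipeline it would be something like the cross-validated score of a nu-SVR model trained on the selected features, while here a toy function stands in for it.

```python
import random


def evolve(fitness, n_features, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Tiny GA over (feature mask, nu) chromosomes; needs n_features >= 2."""
    rng = random.Random(seed)

    def random_individual():
        mask = [rng.random() < 0.5 for _ in range(n_features)]
        nu = rng.uniform(0.05, 0.95)  # nu must lie in (0, 1]
        return mask, nu

    def mutate(ind):
        mask, nu = ind
        # Flip each mask bit with a small probability.
        mask = [(not b) if rng.random() < mutation_rate else b for b in mask]
        if rng.random() < mutation_rate:
            # Perturb nu and clamp it back into the allowed range.
            nu = min(0.95, max(0.05, nu + rng.gauss(0.0, 0.1)))
        return mask, nu

    def crossover(a, b):
        cut = rng.randrange(1, n_features)  # one-point crossover on the mask
        return a[0][:cut] + b[0][cut:], rng.choice((a[1], b[1]))

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # the better half survives unchanged
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=fitness)


def toy_fitness(ind):
    """Stand-in fitness: rewards a fixed target mask and nu close to 0.5."""
    mask, nu = ind
    target = [True, False, True, False, True, False]
    matches = sum(m == t for m, t in zip(mask, target))
    return matches - abs(nu - 0.5)


best_mask, best_nu = evolve(toy_fitness, n_features=6)
```

Encoding the feature mask and the model parameter in one chromosome lets a single search optimize both jointly, which matches the abstract's statement that the genetic algorithm both locates the feature set and tunes the nu-SVR parameters.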

    ScalaLab and GroovyLab: Comparing Scala and Groovy for Scientific Computing

    ScalaLab and GroovyLab are both MATLAB-like environments for the Java Virtual Machine. ScalaLab is based on the Scala programming language and GroovyLab on the Groovy programming language. They present similar user interfaces and functionality to the user. They also share the same set of Java scientific libraries and native code libraries. From the programmer's point of view, though, they differ significantly. This paper compares some aspects of the two environments and highlights some of the strengths and weaknesses of Scala versus Groovy for scientific computing. The discussion also examines some aspects of the dilemma of dynamic versus static typing for scientific programming. The performance of the Java platform is improving continuously and rapidly. Today Java can effectively support demanding high-performance computing and scales well on multicore platforms. Thus, both systems can challenge the performance of traditional C/C++/Fortran scientific code while offering an easier-to-use and more productive programming environment.

    Non-coding RNA Sequences Identification and Classification Using a Multi-class and Multi-label Ensemble Technique

    Part 3: MHDW. High-throughput RNA-sequencing technologies and modern in silico techniques have expanded our knowledge of short non-coding RNAs. These sequences were initially split into various categories based on their cellular functionality and their sequential, thermodynamic and structural properties, in the belief that their sequence could be used as an identifier to distinguish them. However, recent evidence has indicated that the same sequences can act and function as more than one type of non-coding RNA; a striking example is mature microRNA sequences that can also be transfer RNA fragments. Most existing computational methods for the prediction of non-coding RNA sequences have focused on predicting only one type of non-coding RNA, and even those designed for multiclass classification do not support multiple labeling and thus cannot assign a sequence to more than one non-coding RNA type. In the present paper, we introduce a new multilabel-multiclass method based on the combination of multiobjective evolutionary algorithms and multi-label implementations of Random Forests to optimize the feature selection process and assign short RNA sequences to one or more non-coding RNA types. The overall methodology clearly outperformed other machine learning techniques that were used for the same purpose, and it is applicable to data coming from RNA-sequencing experiments.

    Adaptive Filtering Techniques Combined with Natural Selection-Based Heuristic Algorithms in the Prediction of Protein-Protein Interactions

    Part 19: Computational Intelligence Applications in Bioinformatics (CIAB) Workshop. The analysis of protein-protein interactions (PPIs) is crucial to the understanding of cellular organizations, processes and functions. Interaction data from current experimental approaches are prone to error. Thus, a variety of computational methods have been developed to supplement the interactions that have been detected experimentally. The main objective of the present paper is to present a novel classification framework for predicting PPIs that combines the advantages of two categories of algorithmic methods (heuristic methods and adaptive filtering techniques) in order to produce high-performance classifiers while maintaining their interpretability. Our goal is to find a simple mathematical equation that governs the best classifier, enabling the extraction of biological knowledge. State-of-the-art adaptive filtering techniques were combined with the most contemporary heuristic methods based on the natural selection process. To the best of our knowledge, this is the first time that the proposed classification framework has been applied and analyzed extensively for the problem of predicting PPIs. The proposed methodology was tested on a commonly used data set using all possible combinations of the selected adaptive filtering and heuristic techniques, and comparisons were made. The best algorithmic combinations derived from these procedures were Genetic Algorithms with Extended Kalman Filters and Particle Swarm Optimization with Extended Kalman Filters. Using these algorithmic combinations, high-accuracy interpretable classifiers were produced.

    HINT-KB: The Human Interactome Knowledge Base

    Part 8: First Workshop on Algorithms for Data and Text Mining in Bioinformatics (WADTMB 2012). Proteins and their interactions are considered to play a significant role in many cellular processes. The identification of protein-protein interactions (PPIs) in humans is an open research area. Many databases have been developed that contain information about experimentally and computationally detected human PPIs, as well as their corresponding annotation data. However, these databases contain many false positive interactions, are incomplete, and only a few of them incorporate data from various sources. To overcome these limitations, we have developed HINT-KB (http://150.140.142.24:84/Default.aspx), a knowledge base that integrates data from various sources, provides a user-friendly interface for their retrieval, estimates a set of features of interest and computes a confidence score for every candidate protein interaction using a modern computational hybrid methodology.