
    Combined optimization algorithms applied to pattern classification

    Accurate classification by minimizing the error on test samples is the main goal in pattern classification. Combinatorial optimization is a well-known method for solving minimization problems; however, only a few classifiers described in the literature use combinatorial optimization for pattern classification. Recently, there has been growing interest in combining classifiers and improving the consensus of results for greater accuracy. In the light of the "No Free Lunch Theorems", we analyse the combination of simulated annealing, a powerful combinatorial optimization method that produces high-quality results, with the classical perceptron algorithm. This combination is called the LSA machine. Our analysis aims at finding paradigms for problem-dependent parameter settings that ensure high classification results. Our computational experiments on a large number of benchmark problems lead to results that either outperform or are at least competitive with results published in the literature. Apart from parameter settings, our analysis focuses on a difficult problem in computation theory, namely the network complexity problem. The depth vs. size problem of neural networks is one of the hardest problems in theoretical computing, with very little progress over the past decades. In order to investigate this problem, we introduce a new recursive learning method for training hidden layers in constant-depth circuits. Our findings contribute to a) the field of Machine Learning, as the proposed method is applicable to training feedforward neural networks, and b) the field of circuit complexity, by proposing an upper bound on the number of hidden units sufficient to achieve a high classification rate. One of the major findings of our research is that the size of the network can be bounded by the input size of the problem, with an approximate upper bound of 8 + √(2^n/n) threshold gates being sufficient for a small error rate, where n := log|S_L| and S_L is the training set.
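    As a hedged illustration of the LSA-machine idea described above, the sketch below couples a perceptron-style local move with a Metropolis acceptance rule from simulated annealing. The cooling schedule, step size, and data layout (numpy arrays X with labels y in {-1, +1}) are assumptions for illustration, not the thesis's actual parameter settings.

```python
# Illustrative sketch only: simulated annealing over perceptron weights,
# minimizing the number of misclassified training samples. Names and the
# geometric cooling schedule are assumptions, not the published method.
import numpy as np

def misclassifications(w, X, y):
    """Count training samples on the wrong side of the hyperplane."""
    return int(np.sum(np.sign(X @ w) != y))

def lsa_train(X, y, n_steps=5000, t0=1.0, cooling=0.999, rng=None):
    rng = rng or np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])          # random initial hyperplane
    best_w, best_err = w.copy(), misclassifications(w, X, y)
    t = t0
    for _ in range(n_steps):
        # Perceptron-style proposal: nudge w toward a randomly chosen
        # misclassified sample (stop early if none remain).
        wrong = np.sign(X @ w) != y
        if not wrong.any():
            break
        i = rng.choice(np.flatnonzero(wrong))
        w_new = w + 0.1 * y[i] * X[i]
        delta = misclassifications(w_new, X, y) - misclassifications(w, X, y)
        # Metropolis rule: always accept improvements, occasionally accept
        # uphill moves, with probability decaying as the temperature cools.
        if delta <= 0 or rng.random() < np.exp(-delta / t):
            w = w_new
        err = misclassifications(w, X, y)
        if err < best_err:
            best_w, best_err = w.copy(), err
        t *= cooling
    return best_w, best_err
```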

    NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

    Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimension and acquisition rate, challenging conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms the state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. To uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach to reconstruct relationships between transcription factors and their target genes, in the second work. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images.
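    A hedged sketch of what a deep splice junction classifier of this kind can look like: a small 1D convolutional network over one-hot-encoded DNA flanking a candidate junction. The window length and layer sizes are illustrative assumptions; the published DeepSplice architecture is not reproduced here.

```python
# Illustrative sketch, not the published DeepSplice model: a generic 1D CNN
# that scores an 80-nt window around a candidate splice junction.
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (4, len) float tensor."""
    x = torch.zeros(4, len(seq))
    for i, b in enumerate(seq.upper()):
        if b in BASES:              # leave N/ambiguous bases all-zero
            x[BASES[b], i] = 1.0
    return x

class JunctionCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.classify = nn.Linear(64, n_classes)

    def forward(self, x):           # x: (batch, 4, seq_len)
        return self.classify(self.features(x).squeeze(-1))

# Usage: score a toy flanking sequence (non-junction vs junction logits).
model = JunctionCNN()
x = one_hot("ACGT" * 20).unsqueeze(0)   # (1, 4, 80)
logits = model(x)                        # (1, 2)
```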

    A knowledge engineering approach to the recognition of genomic coding regions

    Funded by a research grant from Suranaree University of Technology, fiscal year B.E. 2556–255

    Robust Feature Learning Using Deep Neural Networks

    Doctoral dissertation, Department of Electrical and Computer Engineering, Seoul National University Graduate School, August 2016. Advisor: Sungroh Yoon. Recent advances in machine learning continue to bring us closer to artificial intelligence. In particular, deep learning plays a key role in cutting-edge frameworks such as autonomous driving and game playing. Deep learning refers to a class of multi-layered neural networks, which is rapidly evolving as the amount of data increases, prior knowledge builds up, efficient training schemes are developed, and high-end hardware is built. Currently, deep learning is the state-of-the-art technique for most recognition tasks. As deep neural networks learn many parameters, there have been a variety of attempts to obtain reasonable solutions over a wide search space. In this dissertation, three issues in deep learning are discussed, and regularization techniques to solve them are proposed. First, deep neural networks are exposed to intrinsic blind spots called adversarial perturbations. We construct neural networks that resist the directions of adversarial perturbations by introducing an explicit loss term that minimizes the differences between original and adversarial samples. Second, training restricted Boltzmann machines shows limited performance when handling minority samples in class-imbalanced datasets. Our approach addresses this limitation by combining a boosting concept, which assigns higher learning weights to minority classes, with a new regularization scheme suited to datasets with categorical features. Lastly, data augmentation needs to be more sophisticated when deep networks learn numerous parameters from insufficient data. The higher the dimension of the samples, the more necessary it becomes to augment datasets using prior knowledge of the underlying data generation to estimate a high-dimensional distribution. Furthermore, this dissertation presents the first application of deep belief networks to identifying junction splicing signals. Junction prediction, where positive samples are very scarce and patterns are therefore hard to model, is one of the major problems in the field of bioinformatics, and is a starting point to understanding the entire gene expression process.
In summary, this dissertation proposes a set of deep learning regularization schemes that can learn meaningful representations underlying large-scale genomic datasets and image datasets. The effectiveness of these methods was confirmed in a number of experimental studies.
Table of Contents:
Chapter 1 Introduction
  1.1 Deep neural networks
  1.2 Issue 1: adversarial examples handling
  1.3 Issue 2: class-imbalance handling
  1.4 Issue 3: insufficient data handling
  1.5 Organization
Chapter 2 Background
  2.1 Basic operations for deep networks
  2.2 History of deep networks
  2.3 Modern deep networks
    2.3.1 Contrastive divergence
    2.3.2 Deep manifold learning
Chapter 3 Adversarial examples handling
  3.1 Introduction
  3.2 Methods
    3.2.1 Manifold regularized networks
    3.2.2 Generation of adversarial examples
  3.3 Results and discussion
    3.3.1 Improved classification performance
    3.3.2 Disentanglement and generalization
  3.4 Summary
Chapter 4 Class-imbalance handling
  4.1 Introduction
    4.1.1 Numerical interpretation of DNA sequences
    4.1.2 Review of junction prediction problem
  4.2 Methods
    4.2.1 Boosted contrastive divergence with categorical gradients
    4.2.2 Stacking and fine-tuning
    4.2.3 Initialization and parameter setting
  4.3 Results and discussion
    4.3.1 Experiment preparation
    4.3.2 Improved prediction performance and runtime
    4.3.3 More robust prediction by proposed approach
    4.3.4 Effects of regularization on performance
    4.3.5 Efficient RBM training by boosted CD
    4.3.6 Identification of non-canonical splice sites
  4.4 Summary
Chapter 5 Insufficient data handling
  5.1 Introduction
  5.2 Backgrounds
    5.2.1 Understanding comets
    5.2.2 Assessing DNA damage from tail shape
    5.2.3 Related image processing techniques
  5.3 Methods
    5.3.1 Preprocessing
    5.3.2 Binarization
    5.3.3 Filtering and overlap correction
    5.3.4 Characterization and classification
  5.4 Results and discussion
    5.4.1 Test data preparation
    5.4.2 Binarization
    5.4.3 Robust identification of comets
    5.4.4 Classification
    5.4.5 More accurate characterization by DeepComet
  5.5 Summary
Chapter 6 Conclusion
  6.1 Dissertation summary
  6.2 Future work
Bibliography
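    The first issue above, adversarial perturbations, is addressed by a loss term tying each sample to its perturbed counterpart. A minimal sketch of that idea follows, using the fast gradient sign method as a stand-in generator; the dissertation's exact perturbation scheme and weighting are assumptions here.

```python
# Minimal sketch of the manifold regularization idea: penalize the distance
# between the network's outputs on a sample and on its adversarially
# perturbed copy. FGSM is a stand-in perturbation generator; eps and lam
# are illustrative assumptions.
import torch
import torch.nn.functional as F

def adversarial_manifold_loss(model, x, y, eps=0.1, lam=1.0):
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = F.cross_entropy(logits, y)

    # Fast gradient sign method: move each input in the direction that
    # most increases the task loss.
    grad, = torch.autograd.grad(task_loss, x, retain_graph=True)
    x_adv = (x + eps * grad.sign()).detach()

    # Manifold term: original and adversarial inputs should map to
    # nearby points in the output space.
    manifold = F.mse_loss(model(x_adv), logits)
    return task_loss + lam * manifold
```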

    Context based bioinformatics

    The goal of bioinformatics is to develop innovative and practical methods and algorithms for biological questions. In many cases, these questions are driven by new biotechnological techniques, especially by genome- and cell-wide high-throughput experimental studies. In principle there are two approaches: 1. Reduction and abstraction of the question to a clearly defined optimization problem, which can be solved with appropriate and efficient algorithms. 2. Development of context-based methods, incorporating as much contextual knowledge as possible in the algorithms, and derivation of practical solutions for relevant biological questions on the high-throughput data. These methods can often be supported by appropriate software tools and visualizations, allowing for interactive evaluation of the results by experts. Context-based methods are often much more complex and require more involved algorithmic techniques to obtain practically relevant and efficient solutions for real-world problems, since in many cases even the simplified abstractions of these problems are NP-hard. To solve these complex problems, one often needs to employ efficient data structures and heuristic search methods that solve clearly defined sub-problems using efficient (polynomial) optimization techniques (such as dynamic programming, greedy, path or tree algorithms). In this thesis, we present new methods and analyses addressing open questions of bioinformatics from different contexts by incorporating the corresponding contextual knowledge. The two main contexts in this thesis are the protein structure similarity context (Part I) and the network-based interpretation of high-throughput data (Part II). For the protein structure similarity context (Part I), we analyze the consistency of gold-standard structure classification systems and derive a consistent benchmark set usable for different applications. We introduce two methods (Vorolign, PPM) for the protein structure similarity recognition problem, based on different features of the structures. Derived from the idea and results of Vorolign, we introduce the concept of a contact neighborhood potential, aiming to improve the results of protein fold recognition and threading. For the re-scoring problem of predicted structure models we introduce the method Vorescore, which clearly improves fold-recognition performance and enables the evaluation of the contact neighborhood potential for structure prediction methods in general. We introduce a contact-consistent Vorolign variant, ccVorolign, further improving structure-based fold recognition performance and enabling direct optimization of the neighborhood potential in the future. Due to the enforcement of contact consistency, the ccVorolign method has much higher computational complexity than the polynomial Vorolign method - the cost of computing interpretable and consistent alignments. Finally, we introduce a novel structural alignment method (PPM) enabling the explicit modeling and handling of phenotypic plasticity in protein structures. We employ PPM to analyze the effects of alternative splicing on protein structures. With the help of PPM we test the hypothesis whether splice isoforms of the same protein can lead to protein structures with different folds (fold transitions). In Part II of the thesis we present methods generating and using context information for the interpretation of high-throughput experiments.
For the generation of context information on molecular regulations, we introduce novel text-mining approaches that extract relations automatically from scientific publications. In addition to the fast NER (named entity recognition) method (syngrep), we also present a novel, fully ontology-based, context-sensitive method (SynTree) allowing for the context-specific disambiguation of ambiguous synonyms, resulting in much better identification performance. This context information is important for the interpretation of high-throughput data, but is often missing in current databases. Despite all improvements, the results of automated text-mining methods are error-prone. The RelAnn application presented in this thesis helps to curate the automatically extracted regulations, enabling manual and ontology-based curation and annotation. For the usage of high-throughput data one needs additional methods for data processing, for example methods to map the hundreds of millions of short DNA/RNA fragments (so-called reads) onto a reference genome or transcriptome. Such data (RNA-seq reads) are the output of next-generation sequencing methods, measured by sequencing machines that are becoming ever more efficient and affordable. Unlike current state-of-the-art methods, our novel read-mapping method ContextMap resolves the occurring ambiguities at the final step of the mapping process, thereby employing knowledge of the complete set of possible ambiguous mappings. This approach allows for higher precision, even if more nucleotide errors are tolerated in the read mappings in the first step. The consistency between context information on molecular regulations, stored in databases and extracted by text mining, and the measured data can be used to identify and score consistent regulations (GGEA). This method substantially extends commonly used gene-set based methods such as over-representation analysis (ORA) and gene set enrichment analysis (GSEA). Finally, we introduce the novel method RelExplain, which uses the extracted contextual knowledge to generate network-based and testable hypotheses for the interpretation of high-throughput data.
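    A hedged sketch of the deferred ambiguity resolution that ContextMap is described as performing: every candidate alignment per read is kept in the first pass, and each ambiguous read is resolved only at the end, using coverage support from all other reads as context. The scoring and data structures are illustrative assumptions, not ContextMap's implementation.

```python
# Illustrative sketch only: resolve each ambiguous read at the end, using
# coverage from all candidate locations as context. The combined score
# (coverage minus a mismatch penalty) is an assumption for illustration.
from collections import defaultdict

def resolve_ambiguous(candidates):
    """candidates: {read_id: [(position, mismatches), ...]}"""
    # First pass: provisional coverage from every candidate location.
    coverage = defaultdict(int)
    for locs in candidates.values():
        for pos, _ in locs:
            coverage[pos] += 1

    # Final pass: per read, pick the candidate with few mismatches and
    # strong contextual coverage support.
    resolved = {}
    for read, locs in candidates.items():
        resolved[read] = max(
            locs, key=lambda pm: coverage[pm[0]] - 2 * pm[1]
        )[0]
    return resolved

# Toy usage: read r2 is ambiguous; context from r1/r3 favours position 100.
reads = {"r1": [(100, 0)], "r2": [(100, 1), (5000, 1)], "r3": [(100, 0)]}
print(resolve_ambiguous(reads))   # {'r1': 100, 'r2': 100, 'r3': 100}
```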

    Conceptual Modeling Applied to Genomics: Challenges Faced in Data Loading

    Today's genomic domain revolves around uncertainty: too many imprecise concepts, too much information to be properly managed. Considering that conceptualization is the most exclusive human characteristic, it makes full sense to try to conceptualize the principles that guide the essence of why humans are as we are. This question can of course be generalized to any species, but in this work we are especially interested in showing how conceptual modeling is strictly required to understand the "execution model" that human beings "implement". The main issue is to defend the idea that only by having an in-depth knowledge of the Conceptual Model that is associated with the Human Genome can the Human Genome be properly understood. This kind of Model-Driven perspective of the Human Genome opens challenging possibilities, by looking at individuals as implementations of that Conceptual Model, where different values associated with different modeling primitives explain the diversity among individuals and the potential, unexpected variations together with their unwanted effects in terms of illnesses. This work focuses on the challenges faced in loading data from conventional resources into Information Systems created according to the above-mentioned conceptual modeling approach. The work reports on various loading efforts, the problems encountered, and the solutions to these problems. A strong argument is also made about why conventional methods to solve the so-called "data chaos" problems associated with the genomics domain so often fail to meet the demands. Van Der Kroon, M. (2011). Conceptual Modeling Applied to Genomics: Challenges Faced in Data Loading. http://hdl.handle.net/10251/16993
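    As a hedged illustration of the loading challenge discussed above, the sketch below maps one raw record from a conventional resource onto entities of a hypothetical conceptual model, rejecting incomplete entries. The entity names and fields are invented for illustration; they are not the schema of the cited work.

```python
# Illustrative sketch only: validate a raw record and map it onto invented
# conceptual-model entities (Gene, Variation). Not the cited work's schema.
from dataclasses import dataclass

@dataclass
class Gene:
    symbol: str
    chromosome: str

@dataclass
class Variation:
    gene: Gene
    position: int
    reference: str
    observed: str

def load_variation(raw: dict) -> Variation:
    """Map one raw record onto the conceptual model, rejecting the
    imprecise or incomplete entries that make this loading hard."""
    for field in ("symbol", "chrom", "pos", "ref", "obs"):
        if not raw.get(field):
            raise ValueError(f"incomplete record, missing {field!r}")
    return Variation(
        gene=Gene(raw["symbol"], raw["chrom"]),
        position=int(raw["pos"]),
        reference=raw["ref"].upper(),
        observed=raw["obs"].upper(),
    )

# Toy usage with made-up values:
print(load_variation(
    {"symbol": "BRCA2", "chrom": "13", "pos": "32315474", "ref": "G", "obs": "T"}
))
```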

    Analysis of microarray and next generation sequencing data for classification and biomarker discovery in relation to complex diseases

    PhD thesis. This thesis presents an investigation into gene expression profiling, using microarray and next-generation sequencing (NGS) datasets, in relation to multi-category diseases such as cancer. It has been established that if the sequence of a gene is mutated, it can result in the unscheduled production of protein, leading to cancer. However, identifying the molecular signature of different cancers amongst thousands of genes is complex. This thesis investigates tools that can aid the study of gene expression to infer useful information towards personalised medicine. For microarray data analysis, this study proposes two new techniques to increase the accuracy of cancer classification. In the first method, a novel optimisation algorithm, COA-GA, was developed by synchronising the Cuckoo Optimisation Algorithm and the Genetic Algorithm for data clustering in a shuffle setup, to choose the most informative genes for classification purposes. Support Vector Machine (SVM) and Multilayer Perceptron (MLP) artificial neural networks are utilised for the classification step. Results suggest this method can significantly increase classification accuracy compared to other methods. An additional method involving a two-stage gene selection process was also developed. In this method, a subset of the most informative genes is first selected by the Minimum Redundancy Maximum Relevance (MRMR) method. In the second stage, optimisation algorithms are used in a wrapper setup with SVM to minimise the number of selected genes whilst maximising classification accuracy. A comparative performance assessment suggests that the proposed algorithm significantly outperforms other methods at selecting fewer genes that are highly relevant to the cancer type, while maintaining a high classification accuracy. In the case of NGS, a state-of-the-art pipeline for the analysis of RNA-Seq data is investigated to discover differentially expressed genes and differential exon usage between normal and AIP-positive Drosophila datasets, produced in house at Queen Mary, University of London. The functional genomics of the differentially expressed genes was examined and found to be relevant to the case study under investigation. Finally, after normalising the RNA-Seq data, machine learning approaches similar to those used for microarray data were successfully implemented for these datasets.
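    A hedged sketch of the two-stage selection described above: an MRMR-style filter first picks a candidate gene subset, then a wrapper loop shrinks it while monitoring SVM cross-validated accuracy. Redundancy is approximated by absolute feature correlation, and a simple greedy drop stands in for the optimisation algorithms used in the thesis; both are assumptions for illustration.

```python
# Illustrative sketch only: MRMR-style filter (mutual information for
# relevance, absolute correlation as a redundancy proxy) followed by a
# greedy SVM wrapper. Not the thesis's COA-GA optimiser.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def mrmr_filter(X, y, k=20):
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        redundancy = corr[:, selected].mean(axis=1)
        score = relevance - redundancy       # maximise relevance,
        score[selected] = -np.inf            # minimise redundancy
        selected.append(int(np.argmax(score)))
    return selected

def wrapper_shrink(X, y, genes):
    """Greedily drop genes as long as SVM cross-val accuracy holds."""
    best = cross_val_score(SVC(), X[:, genes], y, cv=5).mean()
    for g in list(genes):
        trial = [i for i in genes if i != g]
        if not trial:
            break
        acc = cross_val_score(SVC(), X[:, trial], y, cv=5).mean()
        if acc >= best:
            genes, best = trial, acc
    return genes, best
```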