94 research outputs found

    Analysis of Genomic and Proteomic Signals Using Signal Processing and Soft Computing Techniques

    Bioinformatics is a data-rich field that provides unique opportunities to use computational techniques to understand and organize information associated with biomolecules such as DNA, RNA, and proteins. It involves in-depth study in the areas of genomics and proteomics and requires techniques from computer science, statistics, and engineering to identify, model, and extract features and to process data for analysis and interpretation of results in a biologically meaningful manner. Among engineering methods, signal processing techniques such as transformation, filtering, and pattern analysis, and soft computing techniques such as the multilayer perceptron (MLP) and the radial basis function neural network (RBFNN), play a vital role in resolving many challenging issues in genomics and proteomics. This dissertation investigates several challenging problems of bioinformatics using efficient signal processing and soft computing methods. The specific problems addressed are protein coding region identification in DNA sequences, hot spot identification in proteins, prediction of protein structural class, and classification of microarray gene expression data. The dissertation presents novel methods to measure and extract features from genomic sequences using time-frequency analysis and machine intelligence techniques. The problems investigated and the contributions made in the thesis are summarized here. The S-transform, a time-frequency representation technique, has an advantage over the wavelet transform and the short-time Fourier transform in that the exponential function is fixed with respect to the time axis while the localizing scalable Gaussian window dilates and translates. The S-transform uses an analysis window whose width decreases with frequency, providing frequency-dependent resolution.
The invertibility of the S-transform makes it suitable for time-band filtering applications. Gene prediction and protein coding region identification have always been challenging tasks in computational biology, especially in eukaryotic genomes because of their complex structure. This problem is addressed using an S-transform based time-band filtering approach that localizes the period-3 property present in DNA sequences, which forms the basis for the identification. Similarly, hot spot identification in proteins is a pressing issue in protein science because of the importance of hot spots in binding and interaction between proteins. A novel S-transform based time-frequency filtering approach is proposed for efficient identification of the hot spots. Prediction of the structural class of a protein has been a challenging problem in bioinformatics. A novel feature representation scheme is proposed to represent the protein efficiently, thereby improving prediction accuracy. The high dimension and low sample size of microarray data lead to the curse of dimensionality, which degrades classification performance. In this dissertation, an efficient hybrid feature extraction method is proposed to overcome the dimensionality issue, and an RBFNN is introduced to classify the microarray samples efficiently.
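The period-3 property mentioned in this abstract can be illustrated with a much simpler tool than the S-transform. The sketch below is not the dissertation's method: it swaps in a plain sliding-window DFT evaluated at frequency 1/3 on binary indicator sequences (one per base). The function name `period3_power`, the window length, and the step size are assumptions chosen for illustration only.

```python
import cmath

def period3_power(seq, window=351, step=3):
    """Sliding-window DFT power at frequency 1/3 (illustrative sketch).

    Each base gets a binary indicator sequence; the squared magnitudes of
    their DFT coefficients at period 3 are summed and divided by the window
    length. Returns a list of (window_start, power) tuples.
    """
    seq = seq.upper()
    # exp(-2*pi*i*n/3) only takes three distinct values, indexed by n mod 3.
    twiddle = [cmath.exp(-2j * cmath.pi * n / 3) for n in range(3)]
    powers = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of this base's indicator sequence at period 3.
            coeff = sum(twiddle[n % 3] for n, b in enumerate(chunk) if b == base)
            total += abs(coeff) ** 2
        powers.append((start, total / window))
    return powers
```

Windows whose power stands out from the background are candidate coding regions under this simplified view. A fixed window trades off resolution against noise, which is the limitation that frequency-dependent windows such as the S-transform's are meant to ease.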

    Computational models of gene expression regulation

    Over the last several decades, much effort has gone into elucidating the genetic or epigenetic defects that result in various diseases. Gene regulation, i.e., how genes are turned on and off in the right place and at the right time, is a central question for researchers. Thanks to discoveries in this field, our understanding of interactions between proteins and DNA and among proteins, as well as of the dynamics of chromatin structure under different conditions, has advanced substantially. Despite these achievements, many aspects of gene regulation remain unknown. For instance, proteins called transcription factors (TFs) recognize and bind to specific regions of DNA and recruit the transcriptional machinery, which is essential for gene regulation. As more than 2000 TFs have been identified in the human genome, it is important to study where they bind and which genes they target. Computational approaches are particularly important because biological experiments are often very expensive and cannot be performed for all TFs. In 2016, a competition named the DREAM Challenge was held to encourage researchers to develop novel computational tools for predicting the binding sites of several TFs. The first chapter of this thesis describes our machine learning approach to this challenge within the scope of the competition. Using ensembles of random forest classifiers, we formulated our framework so that it benefits from the tissue specificity inherent in the data, leading to better generalization. Our models were also tailored for spotting cofactors involved in the binding of the TFs of interest. Comparing the important TFs suggested by our computational models with protein-protein association networks revealed that the models preferentially select motifs of TFs that are potential interaction partners in those networks.
Another important aspect beyond predicting TF binding is linking epigenomic data, such as histone modification (HM) data, with gene expression. We concentrated in particular on predicting expression in a subset of genes called bidirectional genes, which are pairs of genes located close to each other on opposite strands of DNA. As sequencing technologies advance, more such bidirectional configurations are being detected. This indicates that, in order to understand gene regulatory mechanisms, it is beneficial to account for such promoter architectures. In the second and third chapters, we focused on genes with bidirectional promoter architectures, using high-resolution epigenomic signatures and single-cell RNA-seq data to dissect the complex epigenetic architecture at these promoters. Using single-cell RNA-seq data as the estimate of gene expression, we generated a hypothetical model for gene regulation at bidirectional promoters. We showed that bidirectional promoters can be categorized into three architecture types with distinct characteristics. Each of these categories corresponds to a unique gene expression profile at the single-cell level. The single-cell RNA-seq data proved to be a powerful means of studying gene regulation. Therefore, in the last chapter, we proposed a novel approach for predicting gene expression at the single-cell level using cis-regulatory motifs as well as epigenetic features. To achieve this, we designed a tree-guided multi-task learning framework that treats each cell as a task. Through this framework we were able to explain single-cell gene expression values using either TF binding affinities or TF ChIP-seq data measured at specific genomic regions. This allowed us to identify distinct TFs that show cell-type-specific regulation in induced pluripotent stem cells.
Our approach is not limited to TFs; it can take any type of data that can potentially explain gene expression at the single-cell level. We believe that our findings can be used in the discovery and development of drugs that regulate the presence of TFs or other regulatory factors that drive cell fate into abnormal states, in order to prevent or cure diseases.
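The definition of bidirectional genes in this abstract (gene pairs on opposite strands, close to each other) can be sketched as a simple filter over gene annotations. Everything specific here is an assumption for illustration and is not taken from the thesis: the `Gene` record, the head-to-head orientation requirement (minus-strand gene upstream of the plus-strand gene), and the 1000 bp distance threshold.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate

def bidirectional_pairs(genes, max_distance=1000):
    """Return (minus_gene, plus_gene) name pairs in head-to-head orientation.

    A pair qualifies when both genes are on the same chromosome, on opposite
    strands, and the plus-strand TSS lies at most max_distance bp downstream
    of the minus-strand TSS. The threshold is a hypothetical default.
    """
    minus = [g for g in genes if g.strand == "-"]
    plus = [g for g in genes if g.strand == "+"]
    pairs = []
    for m in minus:
        for p in plus:
            if m.chrom == p.chrom and 0 <= p.tss - m.tss <= max_distance:
                pairs.append((m.name, p.name))
    return pairs
```

The quadratic scan is fine for small examples; for a genome-wide annotation, sorting by chromosome and TSS and comparing neighbours would be the more practical choice.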

    Genomic and epigenomic studies of acute myeloid leukemia with CEBPA abnormalities


    Data mining techniques for protein sequence analysis

    This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post-translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles, the addition of structural information is examined, as is the normalisation of the Ψ and Φ dihedral angles. An application of the dihedral angles is also investigated: the relationship between dihedral angles and three-bond J couplings determined from NMR experiments is described by the Karplus equation, and we investigate determining the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post-translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results than state-of-the-art prediction methods. The black-box nature of random forest is addressed by using the Trepan algorithm to generate a decision tree with comprehensible rules that represents the decision-making process of the random forest. The predictions of our program GPP do not distinguish between glycans at a given glycosylation site. We therefore use farthest-first clustering, with the idea of classifying each glycosylation site by the sugar linking the glycan to the protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves on the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
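The Karplus equation has the form J = A cos²θ + B cos θ + C, which is quadratic in cos θ, so a single measured coupling can correspond to up to four values of Φ. The sketch below shows one plausible way a predicted Φ could be used to pick among them: choose the candidate closest to the prediction on the circle. It is not the thesis's procedure. The coefficients and the θ = Φ − 60° offset are illustrative defaults of the kind used for the HN-Hα coupling, and the function names are hypothetical.

```python
import math

# Illustrative Karplus coefficients in Hz (an assumption; substitute the
# parametrisation appropriate to the coupling being analysed).
A, B, C = 6.51, -1.76, 1.60
OFFSET = -60.0  # theta = phi + OFFSET, in degrees

def karplus(phi_deg):
    """Three-bond J coupling predicted for a given phi (degrees)."""
    theta = math.radians(phi_deg + OFFSET)
    return A * math.cos(theta) ** 2 + B * math.cos(theta) + C

def phi_candidates(j):
    """All phi values in [-180, 180) that reproduce coupling j."""
    disc = B * B - 4 * A * (C - j)
    if disc < 0:
        return []  # j lies outside the range the curve can reach
    roots = {(-B + sign * math.sqrt(disc)) / (2 * A) for sign in (1, -1)}
    out = set()
    for x in roots:
        if -1.0 <= x <= 1.0:
            theta = math.degrees(math.acos(x))
            for t in (theta, -theta):
                # Undo the offset and wrap into [-180, 180).
                out.add((t - OFFSET + 180.0) % 360.0 - 180.0)
    return sorted(out)

def pick_phi(j, predicted_phi):
    """Candidate closest (on the circle) to a predicted phi, or None."""
    def circ_dist(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    cands = phi_candidates(j)
    if not cands:
        return None
    return min(cands, key=lambda p: circ_dist(p, predicted_phi))
```

For example, `pick_phi(karplus(-65.0), -70.0)` selects the candidate near -65° rather than the other angles that give the same coupling. The quality of the choice depends entirely on how close the predicted Φ is to the true angle.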