
    Efficient decoding algorithms for generalized hidden Markov model gene finders

    BACKGROUND: The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby significantly increasing the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity. RESULTS: As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN. CONCLUSIONS: In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be better able to explore promising extensions to the GHMM framework, thereby improving the state of the art in gene prediction techniques.
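    A minimal sketch of the core decoding recurrence may help make the memory and speed concerns above concrete. The snippet below implements plain log-space Viterbi decoding for an ordinary HMM in Python with NumPy; a GHMM adds explicit state-duration distributions and content sensors on top of this recurrence, which is exactly what inflates the arrays the paper's optimized architecture avoids. All names and the array-based layout are illustrative, not taken from the paper's code.

```python
# Minimal sketch of Viterbi decoding for a plain HMM in log space.
# A GHMM generalizes this by scoring variable-length state durations,
# which is what drives up memory and runtime in gene finders.
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """log_init: (S,), log_trans: (S, S), log_emit: (S, A), obs: sequence of symbol indices."""
    S = log_init.shape[0]
    T = len(obs)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        # best predecessor for every state, then add the emission term
        cand = score[t - 1][:, None] + log_trans          # (S, S): prev -> cur
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # trace back the most probable state path
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```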

    Genomics and proteomics: a signal processor's tour

    The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform-domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform-domain methods in the study of protein binding spots. The relatively new topic of noncoding genes and the associated problem of identifying ncRNA buried in DNA sequences are also described. This includes a discussion of hidden Markov models and context-free grammars. Several new directions in genomic signal processing are briefly outlined at the end.
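    As an illustration of the "gene finding using digital filtering" idea mentioned above, the sketch below computes the classic period-3 spectral signature of protein-coding DNA: each base is mapped to a binary indicator sequence and the energy at frequency 1/3 is measured over a sliding window. The window length and step are arbitrary choices for the example, not values from the paper.

```python
# Illustrative period-3 detector: peaks in the 1/3-frequency energy of the
# base indicator sequences suggest protein-coding regions.
import numpy as np

def period3_energy(seq, window=351, step=3):
    bases = "ACGT"
    indicators = {b: np.array([1.0 if c == b else 0.0 for c in seq]) for b in bases}
    k = np.exp(-2j * np.pi * np.arange(window) / 3.0)   # DFT basis vector at f = 1/3
    energies = []
    for start in range(0, len(seq) - window + 1, step):
        e = sum(abs(np.dot(indicators[b][start:start + window], k)) ** 2 for b in bases)
        energies.append(e)
    return np.array(energies)
```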

    Modern Computing Techniques for Solving Genomic Problems

    With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles in multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. Similar to speech/voice recognition, similarity is calculated between two signal series and subsequently signals are stitched/matched into a temporal sequence. Because the operations are binary in nature, all calculations can be performed efficiently and accurately. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms, so finding the best encoding scheme for a particular application of deep learning is significant. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned a certain energy and a Gaussian filter is applied to the detection of CpG islands. Using the CpG box and a Markov model, we investigate the properties of CGIs and redefine the CGIs using emerging epigenetic data. In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
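    For problem #2, one common numeric representation is one-hot encoding of the four bases; the sketch below shows this scheme as an example of the kind of input a deep splice-site detector would consume. It is only one of many possible encodings and is not claimed to be the scheme the dissertation ultimately favours.

```python
# One-hot encoding of a DNA window: each position becomes a length-4 vector,
# which can be fed directly to a 1-D convolutional or dense network.
import numpy as np

def one_hot_encode(seq):
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:                 # ambiguous bases (N, ...) stay all-zero
            out[i, mapping[base]] = 1.0
    return out                              # shape (sequence_length, 4)

# Example: encode a candidate splice-site window
window = one_hot_encode("AACGTGGTAAGT")
```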

    Characterizing and Accelerating Bioinformatics Workloads on Modern Microarchitectures

    Bioinformatics, the use of computer techniques to analyze biological data, has been a particularly active research field in the last two decades. Advances in this field have contributed to the collection of enormous amounts of data, and the sheer amount of available data has started to overtake the processing capability possible with current computer systems. Clearly, computer architects need a better understanding of how bioinformatics applications work and what kind of architectural techniques could be used to accelerate these important scientific workloads on future processors. In this dissertation, we develop a bioinformatics benchmark suite and provide a detailed characterization of these applications in common use today from a computer architect's point of view. We analyze a wide range of detailed execution characteristics, including instruction mix, IPC measurements, and L1 and L2 cache misses on a real architecture, and proceed to analyze the workloads' memory access characteristics. We then concentrate on accelerating a particularly computationally intensive bioinformatics workload on the novel Cell Broadband Engine multiprocessor architecture. The HMMER workload is used for protein profile searching using hidden Markov models, and most of its execution time is spent running the Viterbi algorithm. We parallelize and partition the HMMER application to implement it on the Cell Broadband Engine. In order to run the Viterbi algorithm on the 256 KB local stores of the Cell BE synergistic processing units (SPEs), we present a method to develop a fast SIMD implementation of the Viterbi algorithm that reduces the storage requirements significantly. Our HMMER implementation for the Cell BE architecture, Cell-HMMER, exploits the multiple levels of parallelism inherent in this application, and can run protein profile searches up to 27.98 times faster than a modern dual-core x86 microprocessor.
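    The storage-reduction idea behind fitting the Viterbi algorithm into the SPEs' 256 KB local stores can be sketched independently of the Cell-specific SIMD code: when only the final score is needed (as in a first-pass database filter), the dynamic programming matrix collapses to a few rolling rows. The scalar Python sketch below illustrates that memory argument for a simplified profile HMM; it is not HMMER's actual kernel, and the transition/emission naming and the crude local-start handling are our own assumptions.

```python
# Rolling-row Viterbi score for a simplified profile HMM: memory is O(model
# length), not O(model length x sequence length), because only the previous
# row of match/insert/delete scores is kept.
import numpy as np

def profile_viterbi_score(seq, match_emit, ins_emit, trans):
    """seq: residue indices; match_emit, ins_emit: (L, A) log-emissions;
    trans: dict of (L,) log-transitions keyed 'MM','IM','DM','MI','II','MD','DD'."""
    L = match_emit.shape[0]
    neg = -np.inf
    M = np.full(L + 1, neg); I = np.full(L + 1, neg); D = np.full(L + 1, neg)
    best = neg
    for x in seq:                                  # one rolling update per residue
        M_new = np.full(L + 1, neg); I_new = np.full(L + 1, neg); D_new = np.full(L + 1, neg)
        for k in range(1, L + 1):
            M_new[k] = match_emit[k - 1, x] + max(
                M[k - 1] + trans["MM"][k - 1],
                I[k - 1] + trans["IM"][k - 1],
                D[k - 1] + trans["DM"][k - 1],
                0.0 if k == 1 else neg,            # crude local entry at model start
            )
            I_new[k] = ins_emit[k - 1, x] + max(M[k] + trans["MI"][k - 1],
                                                I[k] + trans["II"][k - 1])
            D_new[k] = max(M_new[k - 1] + trans["MD"][k - 1],
                           D_new[k - 1] + trans["DD"][k - 1])
        best = max(best, M_new[L])
        M, I, D = M_new, I_new, D_new
    return best
```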

    FPGAs in Bioinformatics: Implementation and Evaluation of Common Bioinformatics Algorithms in Reconfigurable Logic

    Life. Much effort is devoted to granting humanity a little insight into this fascinating and complex but fundamental topic. In order to understand the relations and to derive consequences, humans have begun to sequence their genomes, i.e. to determine their DNA sequences to infer information, e.g. related to genetic diseases. The process of DNA sequencing as well as the subsequent analysis presents a computational challenge for current computing systems due to the large amounts of data alone. Runtimes of more than one day for the analysis of simple datasets are common, even if the process is already run on a CPU cluster. This thesis shows how this general problem in the area of bioinformatics can be tackled with reconfigurable hardware, especially FPGAs. Three compute-intensive problems are highlighted: sequence alignment, SNP interaction analysis and genotype imputation. In the area of sequence alignment, the software BLASTp for protein database searches is presented as an example, implemented and evaluated. SNP interaction analysis is presented with three applications performing an exhaustive search for interactions, including the corresponding statistical tests: BOOST, iLOCi and the mutual information measurement. All applications are implemented in FPGA hardware and evaluated, resulting in an impressive speedup of more than three orders of magnitude compared to standard computers. The last topic, genotype imputation, is a two-step process composed of the phasing step and the actual imputation step. The focus lies on the phasing step, which is targeted by the SHAPEIT2 application. SHAPEIT2 is discussed with its underlying mathematical methods in detail, and finally implemented and evaluated. A remarkable speedup of 46 is reached here as well.
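    To make the SNP interaction workload concrete, the sketch below computes one of the pairwise statistics named above: the mutual information between the joint genotype of two SNPs and a binary case/control label. An exhaustive FPGA search evaluates such a statistic for every SNP pair; this scalar Python version only shows the per-pair arithmetic and makes no claim about the exact estimator used in the hardware designs.

```python
# Mutual information between a 3x3 joint genotype (each SNP coded 0/1/2) and
# a binary phenotype, estimated from a contingency table.
import numpy as np

def pairwise_mutual_information(snp_a, snp_b, label):
    joint = np.zeros((9, 2))                       # 9 genotype combinations x 2 classes
    for a, b, y in zip(snp_a, snp_b, label):
        joint[3 * a + b, y] += 1.0
    joint /= joint.sum()
    pg = joint.sum(axis=1, keepdims=True)          # genotype-combination marginal
    py = joint.sum(axis=0, keepdims=True)          # phenotype marginal
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pg @ py)[nz])))
```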

    In silico and biological survey of transcription-associated proteins implicated in the transcriptional machinery during the erythrocytic development of Plasmodium falciparum

    BACKGROUND: Malaria is the most important parasitic disease in the world, with approximately two million people dying every year, mostly due to Plasmodium falciparum infection. During its complex life cycle in the Anopheles vector and human host, the parasite requires the coordinated and modulated expression of diverse sets of genes involved in epigenetic, transcriptional and post-transcriptional regulation. However, despite the availability of the complete sequence of the Plasmodium falciparum genome, we are still quite ignorant about Plasmodium mechanisms of transcriptional gene regulation. This is due to the poor prediction of nuclear proteins, cognate DNA motifs and structures involved in transcription. RESULTS: A comprehensive directory of proteins reported to be potentially involved in the Plasmodium transcriptional machinery was built from all in silico reports and databanks. The transcription-associated proteins were clustered in three main sets of factors: general transcription factors, chromatin-related proteins (structuring, remodelling and histone-modifying enzymes), and specific transcription factors. Only a few of these factors have been molecularly analysed. Furthermore, from transcriptome and proteome data we modelled expression patterns of transcripts and corresponding proteins during the intra-erythrocytic cycle. Finally, an interactome of these proteins based either on in silico or on yeast two-hybrid experimental approaches is discussed. CONCLUSION: This is the first attempt to build a comprehensive directory of potential transcription-associated proteins in Plasmodium. In addition, all complete transcriptome, proteome and interactome raw data were re-analysed, compared and discussed for a better comprehension of the complex biological processes of Plasmodium falciparum transcriptional regulation during erythrocytic development.

    Analysis of Genomic and Proteomic Signals Using Signal Processing and Soft Computing Techniques

    Bioinformatics is a data-rich field which provides unique opportunities to use computational techniques to understand and organize information associated with biomolecules such as DNA, RNA, and proteins. It involves in-depth study in the areas of genomics and proteomics and requires techniques from computer science, statistics and engineering to identify, model, extract features and process data for analysis and interpretation of results in a biologically meaningful manner. Among engineering methods, signal processing techniques such as transformation, filtering and pattern analysis, and soft-computing techniques such as the multilayer perceptron (MLP) and the radial basis function neural network (RBFNN), play a vital role in effectively resolving many challenging issues associated with genomics and proteomics. In this dissertation, a sincere attempt has been made to investigate some challenging problems of bioinformatics by employing efficient signal processing and soft computing methods. Some of the specific issues attempted are protein coding region identification in DNA sequences, hot spot identification in proteins, prediction of protein structural class and classification of microarray gene expression data. The dissertation presents some novel methods to measure and extract features from genomic sequences using time-frequency analysis and machine intelligence techniques. The problems investigated and the contributions made in the thesis are presented here in a concise manner. The S-transform, a powerful time-frequency representation technique, possesses a superior property over the wavelet transform and the short-time Fourier transform, as the exponential function is fixed with respect to the time axis while the localizing scalable Gaussian window dilates and translates. The S-transform uses an analysis window whose width decreases with frequency, providing a frequency-dependent resolution. The invertible property of the S-transform makes it suitable for time-band filtering applications. Gene prediction and protein coding region identification have always been a challenging task in computational biology, especially in eukaryote genomes due to their complex structure. This issue is resolved using an S-transform based time-band filtering approach by localizing the period-3 property present in the DNA sequence, which forms the basis for the identification. Similarly, hot spot identification in proteins is a burning issue in protein science due to its importance in binding and interaction between proteins. A novel S-transform based time-frequency filtering approach is proposed for efficient identification of the hot spots. Prediction of the structural class of a protein has been a challenging problem in bioinformatics. A novel feature representation scheme is proposed to efficiently represent the protein, thereby improving the prediction accuracy. The high dimension and low sample size of microarray data lead to the curse-of-dimensionality problem, which affects the classification performance. In this dissertation an efficient hybrid feature extraction method is proposed to overcome the dimensionality issue, and an RBFNN is introduced to efficiently classify the microarray samples.
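    Since the S-transform is central to several of the contributions above, a compact frequency-domain implementation is sketched below: for each voice frequency, the signal spectrum is shifted and weighted by a frequency-dependent Gaussian window before an inverse FFT. This follows the standard published formulation of the discrete S-transform; the variable names and the one-sided frequency range are choices made for the example, not the dissertation's code.

```python
# Frequency-domain discrete S-transform: S[f] = IFFT{ X(alpha + f) * exp(-2*pi^2*alpha^2/f^2) }.
import numpy as np

def stockwell_transform(x):
    N = len(x)
    X = np.fft.fft(x)
    m = np.fft.fftfreq(N) * N                      # signed FFT bin indices
    S = np.zeros((N // 2, N), dtype=complex)
    S[0] = np.mean(x)                              # zero-frequency row is just the mean
    for f in range(1, N // 2):
        gaussian = np.exp(-2.0 * (np.pi ** 2) * (m ** 2) / (f ** 2))
        S[f] = np.fft.ifft(np.roll(X, -f) * gaussian)
    return S                                       # rows: frequency, columns: time
```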

    An optimization framework for fixed-point digital signal processing.

    Lam Yuet Ming. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 80-86). Abstracts in English and Chinese. Contents:
    Chapter 1, Introduction: motivation (difficulties of fixed-point design; why still fixed-point?; difficulties of converting floating-point to fixed-point; why wordlength optimization?); objectives; contributions; thesis organization.
    Chapter 2, Review: simulation and analytical approaches to addressing the quantization issue; implementation of speech systems; discussion.
    Chapter 3, Fixed-point arithmetic background: fixed-point representation; addition/subtraction; multiplication; division.
    Chapter 4, Fixed-point class implementation: fixed-point simulation using overloading; object declaration; operator overloading; arithmetic operations; automatic monitoring of dynamic range; automatic calculation of quantization error; array support; cosine calculation.
    Chapter 5, Speech recognition background: isolated word recognition system overview; the linear predictive coding (LPC) model and processor; vector quantization; hidden Markov models.
    Chapter 6, Optimization: the simplex method (initialization, reflection, expansion, contraction, stopping); the one-dimensional optimization approach; search space reduction; speeding up convergence.
    Chapter 7, Word recognition system design methodology: framework design (fixed-point class, fixed-point application, optimizer); speech system implementation (model training, simulating the isolated word recognition system, hardware cost model, cost function, fraction size optimization, one-dimensional optimization).
    Chapter 8, Results: model training; simplex method optimization (simulation platform, system-level optimization, LPC processor optimization, one-dimensional optimization); speeding up the optimization convergence; optimization criteria.
    Chapter 9, Conclusion: search space reduction; speeding up the search; optimization criteria; flexibility of the framework design; further development.
    Bibliography.
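    The fixed-point class of Chapter 4 is built around simulating quantization through operator overloading so that wordlengths can be optimized against a cost function. A minimal Python analogue of that mechanism is sketched below; the thesis implements it as a C++ class with dynamic range monitoring, and the names and rounding policy here are illustrative assumptions.

```python
# Minimal fixed-point simulation sketch: values are quantized to a chosen
# fraction width on every operation, and the quantization error can be
# inspected to drive wordlength optimization.
class Fixed:
    def __init__(self, value, frac_bits=8):
        self.frac_bits = frac_bits
        self.scale = 1 << frac_bits
        self.raw = round(value * self.scale)        # stored as a scaled integer
        self.quant_error = value - self.raw / self.scale

    def to_float(self):
        return self.raw / self.scale

    def __add__(self, other):
        return Fixed(self.to_float() + other.to_float(), self.frac_bits)

    def __mul__(self, other):
        return Fixed(self.to_float() * other.to_float(), self.frac_bits)

# Example: the same product at two fraction widths exposes the error/cost trade-off
a, b = 0.7071, 0.1234
coarse = Fixed(a, 4) * Fixed(b, 4)
fine = Fixed(a, 12) * Fixed(b, 12)
```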

    High performance reconfigurable architectures for biological sequence alignment

    Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignment. Sequence alignment is a fundamental tool in molecular biology for searching for homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on quality-of-life aspects such as facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and drug engineering. With the vast growth of genomic data, searching for sequence homology over huge databases (often measured in gigabytes) cannot produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the Human Genome Project (HGP), supercomputers and other parallel architectures such as special-purpose Very Large Scale Integration (VLSI) chips, Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there are always trade-offs between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low-level programming model compared with off-the-shelf platforms such as standard microprocessors and GPUs. Due to the aforementioned limitations, the need has arisen for optimized FPGA core implementations, which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely used sequence alignment algorithms: the Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm. The three novel aspects of this research are, firstly, that the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state of the art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted into the hardware architectures. Here, when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bounded PE configuration time and the parallel PE configuration approach, irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted onboard memory resources. Finally, a new performance metric is devised which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) factors out the area and lithography-technology advantages of any FPGA, resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA.
    The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2, respectively, for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In terms of speed-up improvements, comparisons were made of the performance of the designed cores against their corresponding software and the reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x compared to the SSEARCH 35 software. For the profile HMM-based sequence alignment, the designed core achieved speed-ups of 103x and 8.3x against HMMER 2.0 and the latest version of HMMER (version 3.0), respectively. The implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. In terms of comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved an 11x speed-up after factoring out the advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performance; it was noted that the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high-performance computation, with the advantages of a smaller area footprint as well as representing an economical 'green' solution compared to other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort.
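    A short sketch of the two figures of merit quoted above may be useful: raw GCUPS (cell updates per second) and the proposed normalized performance indicator, which divides speed-up by the area consumed and by a lithography scaling term so that cores on different FPGA families can be compared more fairly. The exact normalization constants and units used in the thesis are not reproduced here; the functions below only show the shape of the calculation and use hypothetical parameter names.

```python
# Figures of merit for alignment accelerators: raw throughput and a
# normalized speed-up that discounts area and process-technology advantages.
def gcups(query_len, db_len, runtime_s):
    return (query_len * db_len) / runtime_s / 1e9

def normalized_speedup(speedup, slices_used, process_nm, ref_process_nm=65):
    # scale away the advantage of a smaller process node and a larger device
    return speedup / slices_used / (ref_process_nm / process_nm)
```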

    INVESTIGATING THERAPEUTIC OPTIONS FOR LAFORA DISEASE USING STRUCTURAL BIOLOGY AND TRANSLATIONAL METHODS

    Lafora disease (LD) is a rare yet invariably fatal form of epilepsy characterized by progressive degeneration of the central nervous and motor systems and accumulation of insoluble glucans within cells. LD results from mutation of either the phosphatase laforin, an enzyme that dephosphorylates cellular glycogen, or the E3 ubiquitin ligase malin, the binding partner of laforin. Currently, there are no therapeutic options for LD, nor reported methods by which the specific activity of glucan phosphatases such as laforin can be easily measured. To facilitate our translational studies, we developed an assay with which the glucan phosphatase activity of laforin, as well as that of emerging members of the glucan phosphatase family, can be characterized. We then adapted this assay for the detection of endogenous laforin activity from human and mouse tissue. This laforin bioassay will prove useful in the detection of functional laforin in LD patient tissue following the application of therapies to LD patients. We subsequently developed an in vitro readthrough reporter system in order to assess the efficacy of aminoglycosides in the readthrough of laforin and malin nonsense mutations. We found that although several laforin and malin nonsense mutations exhibited significant drug-induced readthrough, the location of the epitope tag used to detect readthrough products dramatically affected our readthrough results. Cell lines established from LD patients with nonsense mutations are thus required to accurately assess the efficacy of aminoglycosides as a therapeutic option for LD. Using hydrogen-deuterium exchange mass spectrometry (DXMS), we then gained insight into the molecular etiology of several point mutations in laforin that cause LD. We identified a novel motif in the phosphatase domain of laforin that shares homology with glycosyl hydrolases (GH) and appears to play a role in the interaction of laforin with glucans. We studied the impact of the Y294N and P301L LD mutations within this GH motif on glucan binding. Surprisingly, these mutations did not reduce glucan binding as expected but instead enhanced the binding of laforin to glucans. These findings elucidate the mechanism by which laforin interacts with and acts upon glucan substrates, providing a target for the development of therapeutic compounds.