
    BIOMOLECULE INSPIRED DATA SCIENCE


    Machine learning applications for the topology prediction of transmembrane beta-barrel proteins

    The research topic for this PhD thesis focuses on the topology prediction of beta-barrel transmembrane proteins. Transmembrane proteins adopt various conformations that relate to the functions they perform. The two most predominant classes are alpha-helix bundles and beta-barrel transmembrane proteins. Alpha-helix proteins are present in far larger numbers than beta-barrel transmembrane proteins in structure databases, so there is a need for computational tools that can predict and detect the structure of beta-barrel transmembrane proteins. Transmembrane proteins are used for active transport across the membrane or for signal transduction, and given the importance of these roles, it becomes essential to understand the structures of these proteins. Transmembrane proteins are also a significant focus for new drug discovery. Transmembrane beta-barrel proteins play critical roles in the translocation machinery, pore formation, membrane anchoring, and ion exchange. In bioinformatics, many years of research have been spent on the topology prediction of transmembrane alpha-helices. Efforts toward TMB (transmembrane beta-barrel) protein topology prediction have been overshadowed by comparison, and the prediction accuracy could be improved with further research. Various methodologies have been developed in the past to predict TMB protein topology. Methods available in the literature include turn identification, hydrophobicity profiles, rule-based prediction, HMM (Hidden Markov Models), ANN (Artificial Neural Networks), radial basis function networks, and combinations of these methods. The use of a cascading classifier has never been fully explored. This research presents and evaluates approaches such as ANN (Artificial Neural Networks), KNN (K-Nearest Neighbors), SVM (Support Vector Machines), and a novel approach to TMB topology prediction using a cascading classifier. Computer simulations were implemented in MATLAB and the results evaluated. Data were collected from various datasets and pre-processed for each machine learning technique. A deep neural network was built with an input layer, hidden layers, and an output layer. The cascading classifier was optimised mainly by optimising each machine learning algorithm used and by starting from the parameters that gave the best results for each algorithm. The cascading classifier results show that the proposed methodology predicts transmembrane beta-barrel protein topologies with high accuracy for randomly selected proteins. Using the cascading classifier approach, the best overall accuracy is 76.3%, with a precision of 0.831 and a recall (probability of detection) of 0.799 for TMB topology prediction. The accuracy of 76.3% is achieved using a two-layer cascading classifier. By constructing and using various machine learning frameworks, systems were developed to analyse TMB topologies with significant robustness. Several experimental findings that may be useful for future research are presented. The cascading classifier constitutes a novel approach to the topology prediction of TMB proteins.
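
    The cascading idea can be illustrated with a brief sketch: first-layer classifiers produce probability estimates that are appended to the features seen by a second-layer network. This is only a minimal illustration in Python/scikit-learn (the thesis used MATLAB); the feature matrix, labels, layer choices, and hyperparameters below are assumptions, not the thesis's actual pipeline.

        # Minimal sketch of a two-layer cascading classifier for per-residue TMB topology
        # labels (e.g. membrane strand vs. loop). Illustrative only: the features and
        # labels are synthetic stand-ins, not data from the thesis.
        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import SVC
        from sklearn.neural_network import MLPClassifier
        from sklearn.metrics import accuracy_score, precision_score, recall_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 40))          # stand-in for sliding-window residue features
        y = rng.integers(0, 2, size=2000)        # stand-in topology labels (1 = TM strand)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

        # Layer 1: base learners produce class-probability estimates.
        layer1 = [KNeighborsClassifier(n_neighbors=7),
                  SVC(kernel="rbf", probability=True, random_state=0)]
        for clf in layer1:
            clf.fit(X_tr, y_tr)

        def cascade_features(X_in):
            """Append layer-1 probability outputs to the original features."""
            probs = [clf.predict_proba(X_in)[:, 1:] for clf in layer1]
            return np.hstack([X_in] + probs)

        # Layer 2: a neural network trained on the augmented representation.
        layer2 = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
        layer2.fit(cascade_features(X_tr), y_tr)

        y_hat = layer2.predict(cascade_features(X_te))
        print("accuracy ", accuracy_score(y_te, y_hat))
        print("precision", precision_score(y_te, y_hat))
        print("recall   ", recall_score(y_te, y_hat))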

    HIV Drug Resistant Prediction and Featured Mutants Selection using Machine Learning Approaches

    HIV/AIDS is widely spread and ranks as the sixth biggest killer worldwide. Moreover, due to the rapid replication rate and the lack of a proofreading mechanism in HIV, drug resistance is commonly found and is one of the reasons for treatment failure. Even though drug resistance tests are offered to patients and help in choosing more effective drugs, such experiments may take up to two weeks to complete and are expensive. Thanks to the rapid development of computing, drug resistance prediction using machine learning is feasible. In order to accurately predict HIV drug resistance, two main tasks need to be solved: how to encode the protein structure, extracting the most useful information and feeding it into the machine learning tools; and which kinds of machine learning tools to choose. In our research, we first proposed a new protein encoding algorithm, which can convert proteins of various sizes into a fixed-size vector. This algorithm enables feeding the protein structure information to most state-of-the-art machine learning algorithms. In the next step, we also proposed a new classification algorithm based on sparse representation. Following that, mean shift and quantile regression were included to help extract the feature information from the data. Our results show that encoding protein structure using our newly proposed method is very efficient and gives consistently higher accuracy regardless of the type of machine learning tool. Furthermore, our new classification algorithm based on sparse representation is the first application of sparse representation to biological data, and its results are comparable to other state-of-the-art classification algorithms, for example ANN, SVM, and multiple regression. Finally, mean shift and quantile regression provided the potentially most important drug-resistant mutants, and such results might help biologists and chemists determine which mutants are the most representative candidates for further research.
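
    A minimal sketch of sparse-representation classification of the kind described above: a test sample is coded as a sparse linear combination of training samples, and the class whose samples give the smallest reconstruction residual wins. The fixed-size "protein" feature vectors, labels, and the Lasso-based coding step are illustrative assumptions, not the thesis's actual encoding or solver.

        # Sparse-representation classification (SRC) sketch with synthetic stand-in data.
        import numpy as np
        from sklearn.linear_model import Lasso

        def src_predict(X_train, y_train, x_test, alpha=0.01):
            # Columns of the dictionary are the (normalised) training samples.
            D = X_train.T / np.linalg.norm(X_train, axis=1)
            coder = Lasso(alpha=alpha, max_iter=10000)
            coder.fit(D, x_test)                          # sparse code for the test sample
            residuals = {}
            for label in np.unique(y_train):
                mask = (y_train == label)
                coef_c = np.where(mask, coder.coef_, 0.0)  # keep only this class's atoms
                residuals[label] = np.linalg.norm(x_test - D @ coef_c)
            return min(residuals, key=residuals.get)       # smallest reconstruction error

        rng = np.random.default_rng(1)
        X_res = rng.normal(size=(60, 30)) + 1.0            # "resistant" encodings (synthetic)
        X_sus = rng.normal(size=(60, 30)) - 1.0            # "susceptible" encodings (synthetic)
        X_train = np.vstack([X_res, X_sus])
        y_train = np.array([1] * 60 + [0] * 60)

        x_test = rng.normal(size=30) + 1.0                 # unseen sample, truly "resistant"
        print("predicted class:", src_predict(X_train, y_train, x_test))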

    One-operator two-machine flow shop scheduling with setup times for machines and total completion time objective

    In a manufacturing environment, when a worker or a machine switches from one type of operation to another, a setup time may be required. I propose a scheduling model with one operator and two machines. In this problem, a single operator completes a set of jobs requiring operations in a two-machine flow shop. The operator can perform only one operation at a time, so when one machine is in use, the other is idle. Whenever the operator changes machine, a setup time is required. The objective considered is total completion time. I formulate the problem as a linear integer program with O(n^3) 0-1 variables and O(n^2) constraints, and I also introduce some classes of valid inequalities. To obtain exact solutions, Branch-and-Bound, Cut-and-Branch, and Branch-and-Cut algorithms are used. For larger problems, some heuristic procedures are proposed and the computational results are compared.
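
    To make the setting concrete, the sketch below evaluates the total completion time of a single-operator schedule for this two-machine flow shop, charging a setup time whenever the operator switches machines, and brute-forces a tiny instance. It only illustrates the objective, not the thesis's integer programming formulation; the job data and the sequence-independent setup time are assumptions.

        # Evaluate total completion time for one-operator two-machine flow shop schedules.
        from itertools import permutations

        def total_completion_time(order, p1, p2, s):
            """order: sequence of (job, machine) operations, machine in {1, 2}.
            Returns total completion time, or None if flow-shop precedence is violated."""
            t, last_machine = 0.0, None
            done_on_m1, completion = set(), {}
            for job, machine in order:
                if machine == 2 and job not in done_on_m1:
                    return None                      # M2 operation scheduled before M1
                if last_machine is not None and machine != last_machine:
                    t += s                           # operator setup when changing machine
                t += p1[job] if machine == 1 else p2[job]
                last_machine = machine
                if machine == 1:
                    done_on_m1.add(job)
                else:
                    completion[job] = t              # a job completes after its M2 operation
            return sum(completion.values())

        # Tiny brute-force example with 3 jobs (feasible operation orders only).
        p1, p2, s = {0: 3, 1: 2, 2: 4}, {0: 2, 1: 5, 2: 1}, 1.0
        ops = [(j, m) for j in p1 for m in (1, 2)]
        best = min((v, o) for o in permutations(ops)
                   if (v := total_completion_time(o, p1, p2, s)) is not None)
        print("best total completion time:", best[0])
        print("best operation order:", best[1])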

    Microstructural effects on the mechanical properties of carburized low-alloy steels

    This study examined the effects of composition and initial microstructure on the physical, metallurgical, and mechanical properties of carburized SAE 8620 and PS-18 steels. Testing was performed on 8620 and PS-18 steels in the as-received and normalized conditions. Hardenability testing was conducted prior to additional heat treatments. Size and shape distortion, residual stress, retained austenite, and effective case depth measurements were obtained for specimens subjected to a carburizing heat treatment. Specimens subjected to a core thermal cycle heat treatment were tested to determine the tensile and Charpy impact properties of the core material of carburized components. Despite differences between the as-received and normalized materials prior to carburizing, testing revealed that normalizing did not have a significant effect on the properties of the carburized or core thermal cycle heat treated materials. PS-18 had a higher hardenability, effective case depth, and ultimate tensile strength, and a lower Charpy impact toughness than 8620.

    Using sensor ontologies to create reasoning-ready sensor data for real-time hazard monitoring in a spatial decision support system

    In order to protect at-risk communities and critical infrastructure, hazard managers use sensor networks to monitor the landscapes and phenomena associated with potential hazards. This strategy can produce large amounts of data, but when investigating an often unstructured problem such as hazard detection, it can be beneficial to apply automated analysis routines and artificial intelligence techniques such as reasoning. Current sensor web infrastructure, however, is not designed to support this information-centric monitoring perspective. A generalized methodology to transform typical sensor data representations into a form that enables these analysis techniques has been created and is demonstrated through an implementation that bridges geospatial standards for sensor data and descriptions with an ontology-based monitoring environment. An ontology that describes sensors and measurements so they may be understood by a spatial decision support system (SDSS) has also been developed. These tools have been integrated into a monitoring environment, allowing the hazard manager to thoroughly investigate potential hazards.
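
    The kind of transformation involved can be sketched briefly: a raw reading is lifted into ontology-backed triples that a reasoner inside an SDSS could consume. The sketch below uses the W3C SOSA vocabulary and rdflib purely as stand-ins; the thesis develops its own sensor and measurement ontology, and the station identifier, property name, and flood-watch threshold are hypothetical.

        # Turn a raw sensor reading into reasoning-ready RDF triples (SOSA as a stand-in).
        from rdflib import Graph, Literal, Namespace, RDF
        from rdflib.namespace import XSD

        SOSA = Namespace("http://www.w3.org/ns/sosa/")
        EX = Namespace("http://example.org/hazard#")

        def observation_to_rdf(station_id, property_name, value, unit):
            g = Graph()
            g.bind("sosa", SOSA)
            obs = EX[f"obs-{station_id}-{property_name}"]
            g.add((obs, RDF.type, SOSA.Observation))
            g.add((obs, SOSA.madeBySensor, EX[f"station-{station_id}"]))
            g.add((obs, SOSA.observedProperty, EX[property_name]))
            g.add((obs, SOSA.hasSimpleResult, Literal(value, datatype=XSD.double)))
            g.add((obs, EX.unit, Literal(unit)))
            return g

        g = observation_to_rdf("gauge42", "riverStage", 4.7, "m")
        print(g.serialize(format="turtle"))

        # A downstream monitoring rule could then flag observations above a threshold.
        FLOOD_WATCH_LEVEL = 4.5
        for _, _, result in g.triples((None, SOSA.hasSimpleResult, None)):
            if float(result) > FLOOD_WATCH_LEVEL:
                print("hazard rule triggered: river stage above flood-watch level")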

    A Tale of Two Approaches: Comparing Top-Down and Bottom-Up Strategies for Analyzing and Visualizing High-Dimensional Data

    The proliferation of high-throughput and sensory technologies in various fields has led to a considerable increase in data volume, complexity, and diversity. Traditional data storage, analysis, and visualization methods are struggling to keep pace with the growth of modern data sets, necessitating innovative approaches to overcome the challenges of managing, analyzing, and visualizing data across various disciplines. One such approach is utilizing novel storage media, such as deoxyribonucleic acid (DNA), which presents an efficient, stable, compact, and energy-saving storage option. Researchers are exploring the potential use of DNA as a storage medium for long-term storage of significant cultural and scientific materials. In addition to novel storage media, scientists are also focusing on developing new techniques that can integrate multiple data modalities and leverage machine learning algorithms to identify complex relationships and patterns in vast data sets. These newly developed data management and analysis approaches have the potential to unlock previously unknown insights into various phenomena and to facilitate more effective translation of basic research findings to practical and clinical applications. Addressing these challenges necessitates different problem-solving approaches, and researchers are developing novel tools and techniques that require different viewpoints. Top-down and bottom-up approaches are essential techniques that offer valuable perspectives for managing, analyzing, and visualizing complex high-dimensional multi-modal data sets. This cumulative dissertation explores the challenges associated with handling such data and highlights top-down, bottom-up, and integrated approaches that are being developed to manage, analyze, and visualize this data. The work is conceptualized in two parts, each reflecting one of the two problem-solving approaches and their uses in published studies. The proposed work showcases the importance of understanding both approaches, the steps of reasoning about the problem within them, and their concretization and application in various domains.

    Structural Descriptors of gp120 V3 Loop for the Prediction of HIV-1 Coreceptor Usage

    HIV-1 cell entry commonly uses, in addition to CD4, one of the chemokine receptors CCR5 or CXCR4 as coreceptor. Knowledge of coreceptor usage is critical for monitoring disease progression as well as for supporting therapy with the novel drug class of coreceptor antagonists. Predictive methods for inferring coreceptor usage based on the third hypervariable (V3) loop region of the viral gene coding for the envelope protein gp120 can provide us with these monitoring facilities while avoiding expensive phenotypic tests. All simple heuristics (such as the 11/25 rule) as well as statistical learning methods proposed to date predict coreceptor usage based on sequence features of the V3 loop exclusively. Here, we show, based on a recently resolved structure of gp120 with an untruncated V3 loop, that using structural information on the V3 loop in combination with sequence features of V3 variants improves prediction of coreceptor usage. In particular, we propose a distance-based descriptor of the spatial arrangement of physicochemical properties that increases discriminative performance. For a fixed specificity of 0.95, a sensitivity of 0.77 was achieved, improving further to 0.80 when combined with a sequence-based representation using amino acid indicators. This compares favorably with the sensitivities of 0.62 for the traditional 11/25 rule and 0.73 for a prediction based on sequence information as input to a support vector machine, and constitutes a statistically significant improvement. A detailed analysis and interpretation of structural features important for classification shows the relevance of several specific hydrogen-bond donor sites and aliphatic side chains to coreceptor specificity towards CCR5 or CXCR4. Furthermore, an analysis of side chain orientation of the specificity-determining residues suggests a major role of one side of the V3 loop in the selection of the coreceptor. The proposed method constitutes the first approach to an improved prediction of coreceptor usage based on an original integration of structural bioinformatics methods with statistical learning.
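
    A rough sketch of what a distance-based descriptor of spatial physicochemical arrangement can look like is given below: pairwise distances between residues carrying selected properties are binned into a histogram and fed to a support vector machine. The property groupings, bin edges, synthetic coordinates, and random labels are assumptions for illustration only, not the descriptor actually proposed in the paper.

        # Histogram of pairwise distances between hydrophobic and charged residues as an SVM input.
        import numpy as np
        from sklearn.svm import SVC

        HYDROPHOBIC = set("AVLIMFWC")
        CHARGED = set("DEKR")

        def distance_descriptor(sequence, ca_coords, bins=np.arange(0, 40, 4)):
            """Normalised histogram of hydrophobic-to-charged residue distances."""
            hyd = [i for i, aa in enumerate(sequence) if aa in HYDROPHOBIC]
            chg = [i for i, aa in enumerate(sequence) if aa in CHARGED]
            dists = [np.linalg.norm(ca_coords[i] - ca_coords[j]) for i in hyd for j in chg]
            hist, _ = np.histogram(dists, bins=bins)
            return hist / max(len(dists), 1)

        # Synthetic "V3 loop variants": random 35-residue sequences with random coordinates,
        # labelled 1 for CXCR4-using and 0 for CCR5-using (labels are random stand-ins).
        rng = np.random.default_rng(2)
        alphabet = np.array(list("ACDEFGHIKLMNPQRSTVWY"))
        X, y = [], []
        for _ in range(200):
            seq = "".join(rng.choice(alphabet, size=35))
            coords = rng.normal(scale=8.0, size=(35, 3))
            X.append(distance_descriptor(seq, coords))
            y.append(rng.integers(0, 2))
        X, y = np.array(X), np.array(y)

        clf = SVC(kernel="rbf", class_weight="balanced").fit(X[:150], y[:150])
        print("held-out accuracy:", clf.score(X[150:], y[150:]))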

    Re-using public RNA-Seq data

    "Järgmise põlvkonna sekveneerimismeetodid"(NGS) on geeniandmete analüüsil kiiresti populaarsust kogumas. RNA-Seq on NGS tehnika, mis võimaldab geeniekspressiooni tasemete hindamist. Eksperimentidest kogutuid andmeid arhiveeritakse jõudsalt avalikesse andmebaasidesse, kuna toorandmete neisse edastamine on üheks eeltingimuseks akadeemilistes ajakirjades avaldamiseks. RNA-Seq toorandmed on mahult üsna suured ja üksikute eksperimentide analüüs üsnagi aeganõudev. Sekveneerimise toorandmeid taaskasutatakse praegu veel üsna vähe. Andmebaasidesse leiduvate andmete taaskasutamisele avaldavad pärssivat mõju ebatäpsed katseplaneerimise kirjeldused ja kindlate standardite puudumine analüüsimeetodites. Tööriistade vahelised algoritmilised eripärad tähendavad erinevatel meetoditel teostatud analüüside vähest võrreldavust. Lihtne kollektsioonide agregeerimine ei tööta, kuna analüüsitud andmed pole võrreldavad. Seega tuleb analüüs kõikide eksperimentide jaoks teostada alates toorandmetest. Iga eksperimendi analüüs on aga üsna aeganõudev ning nõuab kuldsete standardite puudumisel konkreetseid valikuid. Suuremahuliste analüüsiandmete kollektsiooni nõuab seega efektiivset töövoo implementatsiooni. Toimimise tingimusteks on minimaalne inimsekkumine, fikseeritud tööriistade valik ja robustne eksperimentide käsitsemismetoodika. Väga erinevates tingimustes teostatud eksperimentide ekspressiooniandmete agregeerimine loob võimaluse andmekaeve meetodite rakendamiseks. Lokaalselt ilmnevad mustrid võivad taustsüsteemis osutuda signaaliks. Üheks analüüsivallaks, mis selliseid mustreid uurib on koekspressioonianalüüs. Selles magistritöös arendasime ja implementeerisime raamistiku suuremahuliseks avalike RNA-Seq andmete analüüsiks. Analüüs ei vaja eksperimentide analüüsimisele eelnevalt konfiguratsioonifaili vaid toetub ühekordselt konstrueeritud andmebaasile. Kasutajapoolne sekkumine on minimaalne, kõik parameetrid määratakse andmetest lähtuvalt. See võimaldab järjestikulist analüüsi üle arvukate eksperimentide. Loodavat RNA-Seq ekspressiooniandmete kollektsiooni kasutatakse sisendina BIIT töörühma poolt arenda- tud koekspressiooni uurimise tööriistas - MEM. Algselt oli see ehitatud üksnes mikrokiip andmetelt sondide koekspressiooni hindamiseks, kuid RNA-Seq ekspressiooniandmed laiendavad selle rakendusampluaad.Next Generation Sequencing (NGS) methods are rapidly becoming the most popular paradigm for exploring genomic data. RNA-Seq is a NGS method that enables gene expression analyses. Raw sequencing data generated by researchers is actively submitted to public databases as part of the requirements for publishing in academic journals. Raw sequencing data is quite large in size and analysis of each experiment is time consuming. Therefore published raw files are currently not re-used much. Repetitive analysis of uploaded data is also complicated by negligent experiment set-up write-ups and lack of clear standards for the analysis process. Publicly available analysis results have been obtained using a varying set of tools and parameters. There are biases introduced by algorithmic differences of tools which greatly decreases the comparability of results between experiments. This is due because of lack of golden analysis standards. Comprehensive collections of expression data have to account for computational expenses and time limits. 
Therefore collection set-up needs an effective pipeline implementation with automatic parameter estimation, a defined subset of tools and a robust handling mechanism to ensure minimal required user input. Aggregating expression data from individual experiments with varying experimental conditions creates many new opportunities for data aggregation and mining. Pattern discovery over larger collections generalises local tendencies. One such analysis sub-field is assessing gene co-expression over a broader set of experiments. In this thesis, we have designed and implemented a framework for performing large scale analysis of publicly available RNA-Seq experiments. No separate configuration file for analysis is required, instead a pre-built database is employed. User intervention is minimal and the process is self-guiding. All parameters within the analysis process are determined automatically. This enables unsupervised sequential analysis of numerous experiments. Analysed datasets can be used as an input for co-expression analysis tool MEM which was developed by BIIT research group and was originally designed for public microarray data. RNA-Seq data adds a new application field for the tool. Other than co-expression analysis with MEM, the data can also be used in other downstream analysis applications
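
    The overall shape of such an unattended, database-driven batch analysis can be sketched as follows. The step functions, table layout, and accession identifiers are hypothetical placeholders; the actual framework fixes a specific tool chain and infers parameters such as read length and strandedness from each experiment's own data.

        # Batch re-analysis loop over public RNA-Seq experiments, writing into one collection.
        import sqlite3

        def fetch_raw_reads(accession):      # placeholder for downloading from a public archive
            return f"/data/{accession}.fastq.gz"

        def infer_parameters(fastq_path):    # placeholder for data-driven parameter estimation
            return {"read_length": 75, "stranded": False}

        def quantify_expression(fastq_path, params):   # placeholder for the fixed tool chain
            return {"GENE1": 12.3, "GENE2": 0.0}

        db = sqlite3.connect("expression_collection.db")
        db.execute("CREATE TABLE IF NOT EXISTS expression "
                   "(accession TEXT, gene TEXT, value REAL)")

        for accession in ["EXP000001", "EXP000002"]:        # would iterate over many experiments
            fastq = fetch_raw_reads(accession)
            params = infer_parameters(fastq)
            for gene, value in quantify_expression(fastq, params).items():
                db.execute("INSERT INTO expression VALUES (?, ?, ?)", (accession, gene, value))
            db.commit()                                     # each experiment committed independently

        print(db.execute("SELECT COUNT(*) FROM expression").fetchone()[0], "expression values stored")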

    Bioanalytical applications of microfluidic devices

    The first part of the thesis describes a new patterning technique, microfluidic contact printing, that combines several of the desirable aspects of microcontact printing and microfluidic patterning and addresses some of their important limitations through the integration of a track-etched polycarbonate (PCTE) membrane. Using this technique, biomolecules (e.g., peptides, polysaccharides, and proteins) were printed with high fidelity on a receptor-modified polyacrylamide hydrogel substrate. The patterns obtained can be controlled through modifications of channel design and secondary programming via selective membrane wetting. The protocols support the printing of multiple reagents without registration steps and allow fast recycle times. The second part describes a non-enzymatic, isothermal method to discriminate single nucleotide polymorphisms (SNPs). SNP discrimination using alkaline dehybridization has long been neglected because the pH range in which thermodynamic discrimination can be done is quite narrow. We found, however, that SNPs can be discriminated by the kinetic differences exhibited in the dehybridization of perfectly matched (PM) and mismatched (MM) DNA duplexes in an alkaline solution, observed using fluorescence microscopy. We combined this method with a multifunctional encoded hydrogel particle array (fabricated by stop-flow lithography) to achieve fast kinetics and high versatility. This approach may serve as an effective alternative to temperature-based methods for analyzing unamplified genomic DNA in point-of-care diagnostics.
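
    The kinetic-discrimination idea lends itself to a short numerical sketch: fluorescence traces recorded during alkaline dehybridization are fit to a first-order decay, and the mismatched duplex is recognised by its larger rate constant. The traces, rate constants, and decision threshold below are synthetic stand-ins, not data or parameters from the thesis.

        # Fit first-order dehybridization kinetics and compare PM vs. MM rate constants.
        import numpy as np
        from scipy.optimize import curve_fit

        def first_order_decay(t, f0, k):
            return f0 * np.exp(-k * t)

        def fitted_rate(t, fluorescence):
            (f0, k), _ = curve_fit(first_order_decay, t, fluorescence, p0=(fluorescence[0], 0.1))
            return k

        t = np.linspace(0, 120, 60)                       # seconds
        rng = np.random.default_rng(3)
        pm_trace = first_order_decay(t, 1.0, 0.01) + rng.normal(0, 0.01, t.size)   # slow: PM duplex
        mm_trace = first_order_decay(t, 1.0, 0.05) + rng.normal(0, 0.01, t.size)   # fast: MM duplex

        k_pm, k_mm = fitted_rate(t, pm_trace), fitted_rate(t, mm_trace)
        print(f"k_PM = {k_pm:.3f} 1/s, k_MM = {k_mm:.3f} 1/s")
        print("SNP call:", "mismatch" if k_mm / k_pm > 2 else "ambiguous")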