866 research outputs found

    Predicting reaction based on customer's transaction using machine learning approaches

    Banking advertisements are important because they help banks target specific customers with subscriptions to their packages or other deals, such as offering existing customers fixed-term deposits. This is done through promotional advertisements on the Internet or in the media, a task that falls to the marketing department. Many banks and telecommunications firms store their customers' data in order to build a relationship with them, offer each client the most suitable deals, and guarantee the recovery of these deposits. The Portuguese bank studied here increases its sales by establishing relationships with its customers. This study proposes a prediction model, built with machine learning algorithms, that uses a customer's past record to anticipate how the customer will react to fixed-term deposit offers. The classification is binary, i.e., predicting whether or not a customer will accept these offers. Four classifiers were used: the k-nearest neighbor (k-NN) algorithm, decision tree, naive Bayes, and support vector machines (SVM). The best result was obtained by the decision tree classifier with an accuracy of 91%, followed by SVM with an accuracy of 89%.
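The binary classification setup can be sketched with the simplest of the four classifiers, k-NN. This is a minimal illustration only: the feature values, scaling, and customer records below are invented placeholders, not the bank's dataset.

```python
# Minimal k-nearest-neighbors sketch for the binary "will the customer
# subscribe?" task. All customer records here are invented for illustration.
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); query: feature_vector."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    top = [label for _, label in dists[:k]]
    return Counter(top).most_common(1)[0][0]

# Hypothetical scaled features: (age / 100, balance / 10_000, prior_contacts / 10)
train = [
    ((0.30, 0.15, 0.1), "no"),
    ((0.45, 0.80, 0.3), "yes"),
    ((0.50, 0.75, 0.2), "yes"),
    ((0.25, 0.10, 0.0), "no"),
    ((0.60, 0.90, 0.4), "yes"),
    ((0.35, 0.20, 0.1), "no"),
]

print(knn_predict(train, (0.48, 0.70, 0.25)))  # nearest neighbors are all "yes"
```

The other three classifiers (decision tree, naive Bayes, SVM) would be trained on the same feature matrix, with accuracy compared on a held-out test split.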

    AB INITIO PROTEIN STRUCTURE PREDICTION ALGORITHMS

    Genes that encode novel proteins are constantly being discovered and added to databases, but the speed at which their structures are determined is not keeping up with this rate of discovery. Currently, homology and threading methods perform best for protein structure prediction, but they are not appropriate for all proteins. Still, the most reliable way to determine a protein's structure is through biological experimentation. This research investigates methods and relations that pertain to ab initio protein structure prediction. The study uses positional and transitional probabilities of amino acids, obtained from a non-redundant set of proteins created by Jpred, to train computational methods. The methods this study focuses on are Hidden Markov Models and the incorporation of neighboring amino acids in the primary structure together with the above-mentioned probabilities. The methods predict the secondary structure of amino acids without relying on the existence of a homolog. The main goal of this research is to obtain information from an amino acid sequence that could be used for all future predictions of protein structures. Further, an analysis of the methods' performance is presented to explain how they could be incorporated in current and future work.
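The HMM half of the approach can be sketched with a toy Viterbi decoder over the three standard secondary-structure states. All transition and emission probabilities below are invented placeholders, not the Jpred-derived values used in the study, and the residue alphabet is truncated to three letters for brevity.

```python
# Toy Viterbi decoder for HMM-based secondary-structure prediction.
# States: H (helix), E (strand), C (coil). Probabilities are invented.
import math

states = ["H", "E", "C"]
start = {"H": 0.3, "E": 0.2, "C": 0.5}
trans = {
    "H": {"H": 0.8, "E": 0.05, "C": 0.15},
    "E": {"H": 0.05, "E": 0.8, "C": 0.15},
    "C": {"H": 0.2, "E": 0.2, "C": 0.6},
}
emit = {  # emission probabilities over a tiny residue alphabet
    "H": {"A": 0.5, "V": 0.2, "G": 0.3},
    "E": {"A": 0.2, "V": 0.6, "G": 0.2},
    "C": {"A": 0.3, "V": 0.2, "G": 0.5},
}

def viterbi(seq):
    """Return the most likely state path for a residue sequence."""
    V = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    back = []
    for res in seq[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            col[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][res])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("AAVVVGG"))
```

In the study's setting, the emission model would additionally condition on neighboring residues in the primary structure, which this sketch omits.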

    Deep Evolutionary Generative Molecular Modeling for RNA Aptamer Drug Design

    This work presents the Deep Aptamer Evolutionary Model (DAPTEV). Typical drug development processes are costly, time-consuming, and often manual with regard to research. Aptamers are short, single-stranded oligonucleotides (RNA/DNA) that, similar to antibodies, bind to and inhibit target proteins and other types of molecules. Compared with small-molecule drugs, aptamers can bind to their targets with high affinity (binding strength) and specificity (designed to interact uniquely with the target only). The typical development process for aptamers is a manual procedure known as Systematic Evolution of Ligands by Exponential Enrichment (SELEX), which is costly, slow, and often produces mild results. The focus of this research is to create a deep learning approach for generating and evolving aptamer sequences to support aptamer-based drug development. These sequences must be unique, contain at least some level of structural complexity, and have high affinity and specificity for the intended target. Moreover, after training, the deep learning system, a Variational Autoencoder (VAE), must be able to be queried for new sequences without further training. Currently, this research is applied to the receptor-binding domain (RBD) of the SARS-CoV-2 (Covid-19) spike protein, but careful consideration has been given to intentionally designing a general solution for future viral applications. Each individual run took five and a half days to complete; over the course of two months, three runs were performed for three different models. Sequence, score, and statistical comparisons showed that the deep learning model was able to produce structurally complex aptamers with strong binding affinities and specificities to the target Covid-19 RBD. Furthermore, due to the nature of VAEs, the model can indeed be queried for new aptamers of similar quality based on previous training.
    Results suggest that VAE-based deep learning methods are capable of optimizing aptamer-target binding affinities and specificities (multi-objective learning) and are a strong tool to aid in aptamer-based drug development.
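The "evolve sequences toward a score" half of such a pipeline can be sketched with a plain generational loop. Everything here is a stand-in: the real model scores candidates by predicted binding to the SARS-CoV-2 RBD and generates them from a trained VAE, whereas this toy uses random mutation and an invented fitness function.

```python
# Sketch of evolving RNA sequences toward a score. The fitness function is
# an invented proxy (GC content plus a made-up motif bonus), NOT a binding
# predictor; it only illustrates the selection/mutation loop.
import random

random.seed(0)
BASES = "ACGU"

def fitness(seq):
    # Invented proxy score; a real run would score target binding instead.
    return seq.count("G") + seq.count("C") + 5 * seq.count("GGAA")

def mutate(seq, rate=0.1):
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in seq)

def evolve(pop_size=30, length=20, generations=40):
    pop = ["".join(random.choice(BASES) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                # truncation selection
        pop = parents + [mutate(random.choice(parents)) for _ in parents]
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```

In DAPTEV-style approaches, the mutation step is replaced by sampling the VAE's latent space, which is what allows new high-quality sequences to be queried after training without re-running the evolution.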

    New approaches to protein docking

    In the first part of this work, we propose new methods for protein docking. First, we present two approaches to protein docking with flexible side chains: a fast greedy heuristic and a branch-and-cut algorithm that yields optimal solutions. For a test set of protease-inhibitor complexes, both approaches correctly predict the true complex structure. Another problem in protein docking is the prediction of the binding free energy, the final step of many protein docking algorithms. We therefore propose a new approach that avoids the expensive and difficult calculation of the binding free energy and instead employs a scoring function based on the similarity of the proton nuclear magnetic resonance spectra of the tentative complexes with the experimental spectrum. Using this method, we could even predict the structure of a very difficult protein-peptide complex that could not be solved using any energy-based scoring function. The second part of this work presents BALL (Biochemical ALgorithms Library), a framework for Rapid Application Development in the field of Molecular Modeling. BALL provides an extensive set of data structures as well as classes for Molecular Mechanics, advanced solvation methods, comparison and analysis of protein structures, file import/export, NMR shift prediction, and visualization. BALL has been carefully designed to be robust, easy to use, and open to extensions. Especially its extensibility, which results from an object-oriented and generic programming approach, distinguishes it from other software packages.
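The spectrum-similarity idea can be sketched as ranking candidate poses by how well their predicted, binned 1H NMR spectra match the experimental one. The binning, similarity measure (cosine), and all intensity vectors below are illustrative assumptions, not the scoring function actually used in the work.

```python
# Sketch: score each tentative docking pose by comparing its predicted
# (binned) 1H NMR spectrum with the experimental spectrum, instead of
# computing a binding free energy. All spectra are invented toy vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

experimental = [0.1, 0.8, 0.3, 0.0, 0.5, 0.9]   # intensities per shift bin

candidates = {                                   # hypothetical poses
    "pose_A": [0.1, 0.7, 0.4, 0.1, 0.5, 0.8],    # close to experiment
    "pose_B": [0.9, 0.1, 0.1, 0.8, 0.1, 0.1],    # poor match
    "pose_C": [0.0, 0.5, 0.9, 0.2, 0.1, 0.3],
}

best = max(candidates, key=lambda name: cosine(candidates[name], experimental))
print(best)  # the pose whose predicted spectrum best matches experiment
```

The appeal of this design is that the experimental spectrum acts as a direct, measurement-based arbiter among poses, sidestepping the error-prone free-energy calculation.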

    The Convergence of Human and Artificial Intelligence on Clinical Care - Part I

    This edited book contains twelve studies, both large-scale and pilot, in five main categories: (i) adaptive imputation to increase the density of clinical data for improving downstream modeling; (ii) machine-learning-empowered diagnosis models; (iii) machine learning models for outcome prediction; (iv) innovative use of AI to improve our understanding of the public view; and (v) understanding of providers' attitudes toward trusting insights from AI for complex cases. This collection is an excellent example of how technology can add value in healthcare settings and hints at some of the pressing challenges in the field. Artificial intelligence is gradually becoming a go-to technology in clinical care; it is therefore important to work collaboratively and to shift from performance-driven outcomes to risk-sensitive model optimization, improved transparency, and better patient representation, to ensure more equitable healthcare for all.

    USING MACHINE LEARNING TO OPTIMIZE PREDICTIVE MODELS USED FOR BIG DATA ANALYTICS IN VARIOUS SPORTS EVENTS

    In today's world, data is growing in volume and variety day by day. Historical data can therefore be leveraged to predict the likelihood of future events. This process of using statistical or other data to predict future outcomes is commonly termed predictive modelling. Predictive modelling is becoming more and more important for several reasons, but mainly because it enables businesses and individual users to gain accurate insights and to decide on suitable actions for a profitable outcome. Machine learning techniques are generally used to build these predictive models. Examples range from time-series regression models, which can be used for predicting the volume of airline traffic, to linear regression models, which can be used for predicting fuel efficiency. Many domains can gain a competitive advantage by using predictive modelling with machine learning, including, but not limited to, banking and financial services, retail, insurance, fraud detection, stock market analysis, and sentiment analysis. In this research project, predictive analysis is applied to the sports domain, an emerging area where machine learning can help make better predictions. Numerous sports events happen around the globe every day, and the data gathered from them can be used both for predicting and for improving future events. In this project, machine learning and statistics are used to perform quantitative and predictive analysis of a dataset related to soccer. A comparison of the models, to assess how effective each is, is also presented, and several big data tools and techniques are used to optimize these predictive models and increase their accuracy to over 90%.
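The core train-evaluate loop behind such a predictive model can be sketched with a small from-scratch logistic regression. The match features, labels, and learning-rate settings below are invented for illustration; the project's real soccer dataset and big-data tooling are not reproduced here.

```python
# Minimal logistic-regression sketch for predicting a home win from two
# invented match features, trained by stochastic gradient descent.
import math

# (avg goal difference over recent games, home-team flag) -> won?
data = [
    ((1.2, 1.0), 1), ((0.8, 1.0), 1), ((1.5, 0.0), 1), ((0.9, 1.0), 1),
    ((-0.5, 0.0), 0), ((-1.1, 1.0), 0), ((-0.2, 0.0), 0), ((-0.9, 0.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, lr=0.5, epochs=500):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y                      # gradient of the log-loss
            w = [w[0] - lr * err * x[0], w[1] - lr * err * x[1]]
            b -= lr * err
    return w, b

w, b = train(data)
predict = lambda x: int(sigmoid(w[0] * x[0] + w[1] * x[1] + b) >= 0.5)
accuracy = sum(predict(x) == y for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.0%}")
```

Comparing models then amounts to repeating this loop for each candidate (e.g., a time-series model versus a linear one) and tabulating held-out accuracy, which is where the project's big-data tooling comes in.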

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray, and mass spectrometry data has proven extremely helpful for the identification of those genes and proteins, called biomarkers, that are useful for answering both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have shown that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease a still open issue. The reasons for these differences are attributable both to data dimensions (few subjects with respect to the number of features) and to the heterogeneity of complex diseases, characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to analyze come from different subjects and different phenotypes (e.g., normal and pathological). The most widely used methodologies for the identification of disease-related genes from microarray data compute differential gene expression between phenotypes using univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, might be extremely important in characterizing a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and thus have the potential to select a more complete set of experimentally relevant features.
    In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests, and support vector machines, have been applied, especially in cancer studies. Although high accuracy is often achieved by classification approaches, the reproducibility of biomarker lists remains an open issue: many possible sets of biological features (i.e., genes or proteins) can be considered equally relevant in terms of prediction, so it is in principle possible to lack stability even while achieving the best accuracy. This thesis is a study of several computational aspects related to biomarker discovery in genomic studies, from the classification and feature selection strategies to the type and reliability of the biological information used, and proposes new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker list stability by using prior information on biological interplay and functional correlation among the analyzed features. Both approaches were able to improve biomarker selection. The first, using prior information to divide the application of the method into different subproblems, improves the interpretability of results and offers an alternative way to assess list reproducibility. The second, integrating prior information into the kernel function of the learning algorithm, improves list stability.
    Finally, the interpretability of results is strongly affected by the quality of the biological information available, and the analysis of heterogeneities performed on the Gene Ontology database has revealed the importance of providing new methods able to verify the reliability of the biological properties assigned to a specific feature, discriminating missing or less specific information from possible inconsistencies among the annotations. These aspects will be explored in greater depth in the future, as new sequencing technologies will monitor an increasing number of features and the number of functional annotations in genomic databases will grow considerably in the coming years.
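List stability of the kind discussed above is commonly quantified by re-running feature selection on resampled data and comparing the resulting top-k lists, e.g., with the Jaccard index. The sketch below assumes synthetic two-class "expression" data and a simple mean-difference ranking; neither is from the thesis, and real pipelines would use its kernel- or subproblem-based selectors instead.

```python
# Sketch: measure biomarker-list stability as the average pairwise Jaccard
# index of top-k feature lists selected on bootstrap resamples.
# The synthetic data stand in for real genomic profiles.
import random
import statistics

random.seed(1)
n_features, n_samples, k = 50, 30, 10

# Features 0-4 truly separate the two classes; the rest are noise.
def sample(label):
    return [random.gauss(2.0 * label if f < 5 else 0.0, 1.0)
            for f in range(n_features)]

X = [sample(0) for _ in range(n_samples // 2)] + [sample(1) for _ in range(n_samples // 2)]
y = [0] * (n_samples // 2) + [1] * (n_samples // 2)

def top_k(indices):
    """Rank features by |mean difference| on a resample; keep the top k."""
    scores = []
    for f in range(n_features):
        g0 = [X[i][f] for i in indices if y[i] == 0]
        g1 = [X[i][f] for i in indices if y[i] == 1]
        scores.append(abs(statistics.mean(g1) - statistics.mean(g0)))
    return set(sorted(range(n_features), key=lambda f: -scores[f])[:k])

lists = []
for _ in range(10):
    idx = [random.randrange(n_samples) for _ in range(n_samples)]  # bootstrap
    while len({y[i] for i in idx}) < 2:        # ensure both classes present
        idx = [random.randrange(n_samples) for _ in range(n_samples)]
    lists.append(top_k(idx))

jaccards = [len(a & b) / len(a | b) for i, a in enumerate(lists)
            for b in lists[i + 1:]]
print(f"mean Jaccard stability: {statistics.mean(jaccards):.2f}")
```

A stability score near 1 means the same biomarkers are selected regardless of the resample; the thesis's two approaches aim to push this number up without sacrificing classification accuracy.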

    Building and simulating protein machines

    Glycolysis is a central, energy-producing metabolic pathway present in almost all organisms. The pathway has been extensively investigated by biochemists, and there is a significant body of structural and biochemical information about it. The complete pathway is a ten-step process: at each step, a specific chemical reaction is catalyzed by a specific enzyme. Fructose bisphosphate aldolase (FBA) and triosephosphate isomerase (TIM) catalyze the fourth and fifth steps of the pathway. This thesis investigates the possible substrate transfer mechanism between FBA and TIM. FBA cleaves its substrate, the six-carbon fructose-1,6-bisphosphate (FBP), into two three-carbon products: glyceraldehyde 3-phosphate (GAP) and dihydroxyacetone phosphate (DHAP). One of these products, DHAP, is the substrate for TIM, while the other, GAP, goes directly to GAPDH, the subsequent enzyme on the pathway. TIM converts DHAP to GAP and delivers the product to GAPDH. I employ Elastic Network Models (ENMs) to investigate the mechanistic and dynamic aspects of the functionality of the FBA and TIM enzymes: (1) the effects of the oligomerization of these two enzymes on their functional dynamics and on the coordination of each protein's structural components along the functional region; and (2) the mechanistic synchrony of these two protein machines that may enable them to operate in a coordinated fashion as a conjugate machine, transferring the product from FBA as substrate to TIM. A macromolecular machine comprising FBA and TIM would facilitate the substrate catalysis mechanism and the product flow between FBA and TIM. Such a machine could be used as a functional unit in building a larger machine for structural modeling of the whole glycolysis pathway. Building such machines for the glycolysis pathway may reveal the interplay of the enzymes as a complete machine.
    The methods and insights developed in building such large machines could also be applied to build macromolecular structures for other biologically important clusters of interacting enzymes centered around individual metabolic pathways.
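The elastic-network idea can be sketched with a Gaussian Network Model, the simplest ENM variant: residues become nodes, contacts within a cutoff become springs, and functional motions are read from the low-frequency modes of the resulting Kirchhoff matrix. The coordinates below are a made-up six-residue toy chain, not FBA or TIM, and the 5 Å-style cutoff is an illustrative choice.

```python
# Gaussian Network Model sketch: build the Kirchhoff (connectivity) matrix
# from C-alpha contacts within a cutoff, then read per-residue mobility
# from the slowest internal mode. Toy coordinates, not a real protein.
import numpy as np

coords = np.array([[0.0, 0.0, 0.0], [3.5, 0.0, 0.0], [7.0, 0.0, 0.0],
                   [7.0, 3.5, 0.0], [3.5, 3.5, 0.0], [0.0, 3.5, 0.0]])
cutoff = 5.0

n = len(coords)
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
contact = (dist < cutoff) & ~np.eye(n, dtype=bool)

# Kirchhoff matrix: degree on the diagonal, -1 for each contact.
kirchhoff = np.diag(contact.sum(axis=1)) - contact.astype(float)

vals, vecs = np.linalg.eigh(kirchhoff)   # vals[0] ~ 0 is the rigid-body mode
slowest = vecs[:, 1]                     # first internal (functional) mode
mobility = slowest ** 2                  # per-residue weight in that mode
print(np.round(mobility, 3))
```

For a conjugate FBA-TIM machine, the same construction on the combined complex would expose whether the two enzymes' slow modes are coordinated across the interface.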

    SEQUENTIAL DECISIONS AND PREDICTIONS IN NATURAL LANGUAGE PROCESSING

    Natural language processing has achieved great success in a wide range of applications, producing both commercial language services and open-source language tools. However, most methods take a static or batch approach, assuming that the model has all the information it needs and makes a one-time prediction. In this dissertation, we study dynamic problems where the input comes in a sequence instead of all at once, and the output must be produced while the input is arriving. In these problems, predictions are often made based only on partial information. This dynamic setting arises in many real-time, interactive applications. These problems usually involve a trade-off between the amount of input received (cost) and the quality of the output prediction (accuracy); the evaluation therefore considers both objectives (e.g., plotting a Pareto curve). Our goal is to develop a formal understanding of sequential prediction and decision-making problems in natural language processing and to propose efficient solutions. Toward this end, we present meta-algorithms that take an existing batch model and produce a dynamic model that handles sequential inputs and outputs. We build our framework upon the theory of Markov Decision Processes (MDPs), which allows learning to trade off competing objectives in a principled way. The main machine learning techniques we use come from imitation learning and reinforcement learning, and we advance current techniques to tackle problems arising in our settings. We evaluate our algorithms on a variety of applications, including dependency parsing, machine translation, and question answering, and show that our approach achieves a better cost-accuracy trade-off than batch and heuristic-based decision-making approaches. We first propose a general framework for cost-sensitive prediction, where different parts of the input come at different costs.
    We formulate a decision-making process that selects pieces of the input sequentially, with the selection adaptive to each instance. Our approach is evaluated on both standard classification tasks and a structured prediction task (dependency parsing). We show that it achieves prediction quality similar to methods that use all of the input, while incurring a much smaller cost. Next, we extend the framework to problems where the input is revealed incrementally in a fixed order. We study two applications: simultaneous machine translation and quiz bowl (incremental text classification). We discuss challenges in this setting and show that adding domain knowledge eases the decision-making problem. A central theme throughout the chapters is an MDP formulation of a challenging problem with sequential input/output and trade-off decisions, accompanied by a learning algorithm that solves the MDP.
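The cost-accuracy trade-off for incremental input can be sketched with a toy simulation: tokens arrive one at a time, and a threshold policy commits to an answer as soon as a confidence estimate is high enough. The confidence curves and outcome model below are invented stand-ins, not the dissertation's learned MDP policies; the point is only that lower thresholds buy lower cost at the price of accuracy.

```python
# Toy cost-vs-accuracy simulation for incremental prediction with a
# threshold stopping policy. The "classifier" is a made-up confidence
# curve that grows as more of the input is seen.
import random

random.seed(0)

def make_instance(length=10):
    """(per-step confidences, whether a full read would be answered correctly)."""
    conf = sorted(random.random() for _ in range(length))  # grows with input
    return conf, random.random() < conf[-1]

instances = [make_instance() for _ in range(200)]

def run_policy(threshold):
    total_cost, correct = 0, 0
    for conf, full_read_correct in instances:
        # Stop at the first step whose confidence clears the threshold,
        # or read the whole input if none does.
        steps = next((i + 1 for i, c in enumerate(conf) if c >= threshold),
                     len(conf))
        total_cost += steps
        # Stub outcome model: answering early is right with prob ~ confidence.
        p_right = conf[steps - 1] if full_read_correct else conf[steps - 1] * 0.5
        correct += random.random() < p_right
    return total_cost / len(instances), correct / len(instances)

for threshold in (0.3, 0.6, 0.9):
    cost, acc = run_policy(threshold)
    print(f"threshold={threshold}: avg cost={cost:.1f}, accuracy={acc:.2f}")
```

Sweeping the threshold traces out exactly the kind of Pareto curve used for evaluation above; an MDP-learned policy replaces the fixed threshold with a state-dependent wait-or-predict decision.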