302 research outputs found

    Evaluation of machine-learning methods for ligand-based virtual screening

    Machine-learning methods can be used for virtual screening by analysing the structural characteristics of molecules of known (in)activity, and we here discuss the use of kernel discrimination and naive Bayesian classifier (NBC) methods for this purpose. We report a kernel method that allows the processing of molecules represented by binary, integer and real-valued descriptors, and show that it is little different in screening performance from a previously described kernel that had been developed specifically for the analysis of binary fingerprint representations of molecular structure. We then evaluate the performance of an NBC when the training-set contains only a very few active molecules. In such cases, a simpler approach based on group fusion would appear to provide superior screening performance, especially when structurally heterogeneous datasets are to be processed
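The group-fusion idea mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes fingerprints are given as sets of on-bit indices, uses the Tanimoto coefficient as the similarity measure, and fuses scores with the MAX rule. The molecule names and bit sets are hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def group_fusion_rank(actives, database):
    """Rank database molecules by their best similarity to any known active (MAX rule)."""
    scored = [
        (max(tanimoto(fp, ref) for ref in actives), name)
        for name, fp in database.items()
    ]
    # Highest fused score first; ties broken by name for a stable order.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical toy data: two known actives and three database molecules.
actives = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
database = {
    "mol_a": frozenset({1, 2, 3, 5}),
    "mol_b": frozenset({10, 11, 12}),
    "mol_c": frozenset({20, 21}),
}
ranking = group_fusion_rank(actives, database)
```

Because each database molecule only needs to resemble one of the actives, this kind of fusion may cope better with structurally heterogeneous active sets than a single-reference search.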

    Application of Graph Neural Networks and graph descriptors for graph classification

    Graph classification is an important area in both modern research and industry. Multiple applications, especially in chemistry and novel drug discovery, encourage rapid development of machine learning models in this area. To keep up with the pace of new research, proper experimental design, fair evaluation, and independent benchmarks are essential. Design of strong baselines is an indispensable element of such works. In this thesis, we explore multiple approaches to graph classification. We focus on Graph Neural Networks (GNNs), which emerged as a de facto standard deep learning technique for graph representation learning. Classical approaches, such as graph descriptors and molecular fingerprints, are also addressed. We design a fair experimental evaluation protocol and choose a proper collection of datasets. This allows us to perform numerous experiments and rigorously analyze modern approaches. We arrive at many conclusions, which shed new light on the performance and quality of novel algorithms. We investigate the application of the Jumping Knowledge GNN architecture to graph classification, which proves to be an efficient tool for improving base graph neural network architectures. Multiple improvements to baseline models are also proposed and experimentally verified, which constitutes an important contribution to the field of fair model comparison. Comment: Master's thesis submitted at AGH University of Science and Technology.

    Data Fingerprinting -- Identifying Files and Tables with Hashing Schemes

    Master's thesis in Computer science. INTRODUCTION: Although hash functions are nothing new, they are not limited to cryptographic purposes. One important field is data fingerprinting. Here, the purpose is to generate a digest which serves as a fingerprint (or a license plate) that uniquely identifies a file. More recently, fuzzy fingerprinting schemes, which scrap the avalanche effect in favour of detecting local changes, have hit the spotlight. The main purpose of this project is to find ways to classify text tables, and to discover where potential changes or inconsistencies have happened. METHODS: Large parts of this report can be considered applied discrete mathematics, and finite fields and combinatorics have played an important part. Rabin’s fingerprinting scheme was tested extensively and compared against existing cryptographic algorithms, CRC and FNV. Moreover, a self-designed fuzzy hashing algorithm with the preliminary name No-Frills Hash (NFHash) has been created and tested against Nilsimsa and Spamsum. NFHash is based on Mersenne primes, and uses a sliding window to create a fuzzy hash. Furthermore, the usefulness of lookup tables (with partial seeds) was also explored. The fuzzy hashing algorithm has also been combined with a k-NN classifier to get an overview of its ability to classify files. In addition to NFHash, Bloom filters combined with Merkle Trees have been the most important part of this report. This combination allows a user to see where a change was made, despite the fact that hash functions are one-way. Large parts of this project have dealt with the study of other open-source libraries and applications, such as Cassandra and SSDeep, as well as how bitcoins work. Optimizations have played a crucial role as well; different approaches to a problem might lead to the same solution, but resource consumption can be very different. RESULTS: The results have shown that the Merkle Tree-based approach can track changes to a table very quickly and efficiently, because it is conservative with CPU resources. Moreover, the self-designed algorithm NFHash also does well in terms of file classification when it is coupled with a k-NN classifier. CONCLUSION: Hash functions refer to a very diverse set of algorithms, and not just algorithms that serve a limited purpose. Fuzzy fingerprinting schemes can still be considered to be at their infant stage, but a lot has happened in the last ten years. This project has introduced two new ways to create and compare hashes that can be applied to similar, yet not necessarily identical files, or to detect if (and to what extent) a file was changed. Note that the algorithms presented here should be considered prototypes, and still might need some large-scale testing to sort out potential flaws.
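The way a Merkle tree lets a user see where a table changed, even though the hashes are one-way, can be sketched as below. This is an illustrative sketch only and not the thesis's implementation: it assumes one leaf per table row, SHA-256 as the hash, two tables with the same non-zero number of rows, and it leaves out the Bloom filter component. The function names `build_tree` and `changed_rows` are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(rows):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(row.encode("utf-8")) for row in rows]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # pad odd levels by repeating the last node
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def changed_rows(old_rows, new_rows):
    """Return indices of rows that differ, descending only into differing subtrees."""
    if not old_rows or len(old_rows) != len(new_rows):
        raise ValueError("this sketch assumes equal, non-zero row counts")
    old, new = build_tree(old_rows), build_tree(new_rows)
    changed = []
    stack = [(len(old) - 1, 0)]  # start at the root
    while stack:
        depth, idx = stack.pop()
        if old[depth][idx] == new[depth][idx]:
            continue  # identical subtree: nothing below can have changed
        if depth == 0:
            changed.append(idx)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(old[depth - 1]):
                stack.append((depth - 1, child))
    return sorted(changed)


old_table = ["a,1", "b,2", "c,3", "d,4", "e,5"]
new_table = ["a,1", "b,2", "c,3", "d,40", "e,5"]
diff = changed_rows(old_table, new_table)
```

When only a few rows change, the comparison touches a number of nodes roughly proportional to the tree depth per changed row, rather than rehashing and comparing every row.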

    Software Tools and Approaches for Compound Identification of LC-MS/MS Data in Metabolomics.

    The annotation of small molecules remains a major challenge in untargeted mass spectrometry-based metabolomics. We here critically discuss structured elucidation approaches and software that are designed to help during the annotation of unknown compounds. Only by elucidating unknown metabolites first is it possible to biologically interpret complex systems, to map compounds to pathways and to create reliable predictive metabolic models for translational and clinical research. These strategies include the construction and quality of tandem mass spectral databases such as the coalition of MassBank repositories and investigations of MS/MS matching confidence. We present in silico fragmentation tools such as MS-FINDER, CFM-ID, MetFrag, ChemDistiller and CSI:FingerID that can annotate compounds from existing structure databases and that have been used in the CASMI (critical assessment of small molecule identification) contests. Furthermore, the use of retention time models from liquid chromatography and the utility of collision cross-section modelling from ion mobility experiments are covered. Workflows and published examples of successfully annotated unknown compounds are included

    Reduced collision fingerprints and pairwise molecular comparisons for explainable property prediction using Deep Learning

    The relationships between the structure of chemical compounds and their properties are complex and high dimensional. In the drug development process, multiple properties of a compound often need to be optimized simultaneously, further complicating the task. This work explores two representations of chemical compounds for property prediction tasks. The goal of these suggested representations is improved explainability to better understand the compound property optimization process. First, we decompose the Extended Connectivity Fingerprint (ECFP) algorithm and make it more straightforward for human understanding. We replace a collision-prone hash function with a one-to-one substructure-to-bit relationship. We find that this change does not translate to higher predictive performance of a multilayer perceptron compared to ECFP. However, if the capacity of the predictor is lowered to that of a linear predictor, it does perform better than ECFP. Second, we apply machine learning to Matched Molecular Pair Analysis (MMPA), a drug development design paradigm. MMPA compares pairs of highly similar compounds, differing in structure by modification at one site. We train prediction models on pairs of compounds to predict differences in activity. We use pairwise similarity constraints like MMPA, but also use randomly sampled pairs to train the models. We find that models perform better on randomly chosen pairs than on pairs with strict similarity constraints. However, the best pairwise models are not able to beat the prediction performance of the simpler baseline single model. Both of these investigations, RCFP and pairwise comparisons, aim to approach property prediction in a more explainable way. By using the intuition and experience of medicinal chemists within predictive modelling, we hope to encourage explainability as a necessary component of predictive cheminformatic models.
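The contrast between a collision-prone hashed fingerprint and a one-to-one substructure-to-bit mapping can be illustrated with a small sketch. It is not the RCFP algorithm from the thesis: it assumes substructures are already available as string identifiers (the SMILES-like strings below are hypothetical), uses CRC32 modulo the vector length as a stand-in hash, and builds the one-to-one mapping from a vocabulary collected over a set of molecules.

```python
import zlib


def folded_bits(substructures, n_bits):
    """Hash each substructure into a fixed-length bit vector; distinct ones may collide."""
    bits = [0] * n_bits
    for sub in substructures:
        bits[zlib.crc32(sub.encode("utf-8")) % n_bits] = 1
    return bits


def build_vocabulary(molecules):
    """Assign every distinct substructure its own bit position, in first-seen order."""
    vocab = {}
    for substructures in molecules:
        for sub in substructures:
            vocab.setdefault(sub, len(vocab))
    return vocab


def one_to_one_bits(substructures, vocab):
    """Set exactly one dedicated bit per known substructure, so no collisions occur."""
    bits = [0] * len(vocab)
    for sub in substructures:
        if sub in vocab:
            bits[vocab[sub]] = 1
    return bits


# Hypothetical substructure identifiers for a single molecule.
mol = ["c1ccccc1", "C(=O)O", "CN"]
vocab = build_vocabulary([mol])
hashed = folded_bits(mol, n_bits=2)   # 3 substructures into 2 bits must collide
mapped = one_to_one_bits(mol, vocab)  # 3 substructures, 3 dedicated bits
```

With the one-to-one mapping, each set bit can be traced back to exactly one substructure, which is what makes the weights of a linear predictor directly interpretable.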

    Table Augmentation in Data Lakes

    Data lakes are centralized repositories that store large quantities of raw, unstructured, and structured data, allowing for ad-hoc data analysis, exploratory data analysis, and machine learning. However, the lack of metadata and schema in data lakes makes it challenging to work with tabular data and find related information stored in different tables. How to retrieve such tables efficiently at large scale under data lake settings remains an open problem. The thesis introduces a novel approach to table augmentation that enables efficient data integration from multiple sources in a data lake. Table augmentation involves adding new data to an existing table in a horizontal fashion (by retrieving tables that can be horizontally concatenated to a query table). The proposed approach, named TASH, consists of several components, including data lake hashing, join search, similarity, and augmentation. TASH is a framework based on a spatial index in which tables are mapped and queried. Its goal is to identify the most useful columns for subsequent machine learning tasks. The table retrieval process employs a combination of set containment search and similarity search. Candidate tables are initially identified using set containment search and then ranked based on their similarity to the query. Experimental results demonstrate that TASH can effectively identify joinable tables and select the most relevant features, thereby enabling efficient table augmentation in data lakes. This research contributes to the field of big data by providing a practical solution to the challenges of data integration and analysis in data lake environments.
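The two-stage retrieval described in this abstract (set containment to find candidates, then similarity to rank them) can be sketched with exact set operations. This is not TASH itself, which relies on hashing and a spatial index: the sketch assumes each table is represented by the value set of its join column, uses Jaccard similarity for ranking, and the threshold `min_containment` and table names are hypothetical.

```python
def containment(query, candidate):
    """Fraction of the query's values that also appear in the candidate column."""
    return len(query & candidate) / len(query) if query else 0.0


def jaccard(a, b):
    """Jaccard similarity between two value sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_col, lake, min_containment=0.5):
    """Stage 1: keep columns that contain enough of the query. Stage 2: rank by similarity."""
    candidates = {
        name: col
        for name, col in lake.items()
        if containment(query_col, col) >= min_containment
    }
    return sorted(candidates, key=lambda name: (-jaccard(query_col, candidates[name]), name))


# Hypothetical join-column value sets for a query table and three lake tables.
query_col = {"a", "b", "c", "d"}
lake = {
    "t1": {"a", "b", "c", "d", "e"},
    "t2": {"a", "b", "x", "y", "z", "w"},
    "t3": {"q"},
}
ranked = retrieve(query_col, lake)
```

Containment is asymmetric, so a large table that covers the whole query column still qualifies as joinable even when its Jaccard similarity is low; the similarity ranking then orders the surviving candidates.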

    Robust optimization of SVM hyperparameters in the classification of bioactive compounds

    Background: Support Vector Machine has become one of the most popular machine learning tools used in virtual screening campaigns aimed at finding new drug candidates. Although it can be extremely effective in finding new potentially active compounds, its application requires the optimization of the hyperparameters with which the assessment is being run, particularly the C and γ values. The optimization requirement, in turn, establishes the need to develop fast and effective approaches to the optimization procedure, providing the best predictive power of the constructed model. Results: In this study, we investigated the Bayesian and random search optimization of Support Vector Machine hyperparameters for classifying bioactive compounds. The effectiveness of these strategies was compared with the most popular optimization procedures, grid search and heuristic choice. We demonstrated that Bayesian optimization not only provides better, more efficient classification but is also much faster: the number of iterations it required for reaching optimal predictive performance was the lowest of all the tested optimization methods. Moreover, for the Bayesian approach, the choice of parameters in subsequent iterations is directed and justified; therefore, the results obtained by using it are constantly improved and the range of hyperparameters tested provides the best overall performance of Support Vector Machine. Additionally, we showed that a random search optimization of hyperparameters leads to significantly better performance than grid search and heuristic-based approaches. Conclusions: The Bayesian approach to the optimization of Support Vector Machine parameters was demonstrated to outperform other optimization methods for tasks concerned with the bioactivity assessment of chemical compounds. This strategy not only provides a higher accuracy of classification, but is also much faster and more directed than other approaches for optimization. It appears that, despite its simplicity, the random search optimization strategy should be used as a second choice if application of the Bayesian approach is not feasible.
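The difference between grid search and random search over C and γ can be sketched as follows. This is an illustration under loud assumptions: `toy_score` is a hypothetical stand-in for a cross-validated SVM score and is not taken from the paper, the search ranges are arbitrary, and Bayesian optimization is not shown.

```python
import math
import random


def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validated accuracy; peaks at C=10, gamma=0.01."""
    return -((math.log10(c) - 1.0) ** 2 + (math.log10(gamma) + 2.0) ** 2)


def grid_search(score, c_values, gamma_values):
    """Evaluate every (C, gamma) pair on a fixed grid and return the best triple."""
    return max((score(c, g), c, g) for c in c_values for g in gamma_values)


def random_search(score, n_iter, rng, log_c=(-2.0, 4.0), log_gamma=(-5.0, 1.0)):
    """Sample C and gamma log-uniformly and return the best (score, C, gamma) seen."""
    best = None
    for _ in range(n_iter):
        c = 10 ** rng.uniform(*log_c)
        g = 10 ** rng.uniform(*log_gamma)
        candidate = (score(c, g), c, g)
        if best is None or candidate > best:
            best = candidate
    return best


# Same budget of 16 evaluations for both strategies.
grid_best = grid_search(toy_score, [1e-2, 1e0, 1e2, 1e4], [1e-5, 1e-3, 1e-1, 1e1])
random_best = random_search(toy_score, n_iter=16, rng=random.Random(0))
```

A grid can only ever test the few distinct values chosen for each hyperparameter, whereas random sampling tries a fresh value of both C and γ at every iteration, which is one commonly cited reason it can do better at an equal budget.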

    A Plagiarism Detection Algorithm based on Extended Winnowing
