
    Do Large Scale Molecular Language Representations Capture Important Structural Information?

    Predicting the chemical properties of a molecule is of great importance in many applications, including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at a much lower computational cost than, for example, Density Functional Theory (DFT) calculations. Various representation learning methods in a supervised setting, including features extracted using graph neural nets, have emerged for such tasks. However, the vast chemical space and the limited availability of labels make supervised learning challenging, calling for a general-purpose molecular representation. Recently, transformer-based language models pre-trained on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MoLFormer. The model employs a linear attention mechanism, coupled with highly parallelized training, on SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation outperforms supervised and unsupervised graph neural net baselines on several regression and classification tasks from 10 benchmark datasets, while performing competitively on others. Further analyses, specifically through the lens of attention, demonstrate that MoLFormer indeed learns a molecule's local and global structural aspects. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to predict diverse molecular properties, including quantum-chemical properties.
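    The abstract names linear attention as the key efficiency ingredient but gives no implementation detail. Below is a minimal sketch of kernelized linear attention (in the style of Katharopoulos et al., 2020), the general mechanism this class of models builds on; the feature map, shapes, and names are illustrative assumptions, not MoLFormer's actual code.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = ELU(x) + 1: a positive feature map commonly used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Cost is O(seq_len * d_k * d_v) instead of softmax attention's O(seq_len^2)."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)
    KV = Kf.T @ V                 # (d_k, d_v): one summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)       # (seq_len,): per-query normalization
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
out = linear_attention(Q, K, V)   # (128, 64), one vector per SMILES token
```

    Replacing the softmax with a positive feature map lets the key-value summary be computed once, so cost grows linearly rather than quadratically with SMILES length.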

    Explainable Predictive Monitoring of Temporal Measures of Business Processes

    Modern enterprise systems collect detailed data about the execution of the business processes they support. The widespread availability of such data in companies, coupled with advances in machine learning, has led to the emergence of data-driven and predictive approaches to monitoring the performance of business processes. By using such predictive process monitoring approaches, potential performance issues can be anticipated and proactively mitigated. Various approaches have been proposed to address typical predictive process monitoring questions, such as what is the most likely continuation of an ongoing process instance, or when it will finish. However, most existing approaches prioritize accuracy over explainability. Yet in practice, explainability is a critical property of predictive methods. It is not enough to accurately predict that a running process instance will end up in an undesired outcome. It is also important for users to understand why this prediction is made and what can be done to prevent this undesired outcome. This thesis proposes two methods to build predictive models that monitor business processes in an explainable manner. This is achieved by decomposing a prediction into its elementary components. For example, to explain that the remaining execution time of a process execution is predicted to be 20 hours, we decompose this prediction into the predicted execution time of each activity that has not yet been executed. We evaluate the proposed methods against each other and various state-of-the-art baselines using a range of business processes from multiple domains. The evaluation reaffirms a fundamental trade-off between the explainability and the accuracy of predictions. The research contributions of the thesis have been consolidated into an open-source tool for predictive business process monitoring, namely Nirdizati. It can be used to train predictive models using the methods described in this thesis, as well as third-party methods. These models are then used to make predictions for ongoing process instances; thus, the tool can also support users at runtime.
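    As a concrete illustration of the decomposition idea in the 20-hour example above, here is a toy sketch; the activity names and per-activity duration estimates are hypothetical stand-ins for the learned models the thesis describes.

```python
predicted_duration = {          # hours; stand-ins for learned per-activity regressors
    "Assess claim": 6.0,
    "Request documents": 10.0,
    "Approve payment": 4.0,
}

def explainable_remaining_time(executed, process_activities):
    """Sum predicted durations of activities not yet executed; the summands
    themselves are the explanation of the total."""
    components = {a: predicted_duration[a]
                  for a in process_activities if a not in executed}
    return sum(components.values()), components

total, parts = explainable_remaining_time(
    executed={"Register claim"},
    process_activities=["Register claim", "Assess claim",
                        "Request documents", "Approve payment"],
)
print(f"{total} h remaining: {parts}")   # 20.0 h = 6.0 + 10.0 + 4.0
```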

    Scaling Lattice Sieves across Multiple Machines

    Lattice sieves are algorithms for finding short vectors in lattices. We present an implementation of two such sieves – known as “BGJ1” and “BDGL” in the literature – that scales across multiple servers (with varying success). This class of algorithms requires exponential memory, which had called into question their ability to scale across sieving nodes. We discuss our architecture and optimisations, and report experimental evidence of the efficiency of our approach.
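    For readers unfamiliar with sieving, the following toy sketch shows the pairwise reduction step that BGJ1- and BDGL-style sieves are built around; it is a naive single-machine illustration under simplifying assumptions, and says nothing about the cross-machine architecture the paper contributes.

```python
import numpy as np

def sieve_pass(db):
    """db: (n, d) integer lattice vectors. One naive O(n^2) reduction pass:
    replace a vector with its difference from another whenever that is shorter."""
    norms = np.einsum("ij,ij->i", db, db)        # squared lengths
    for i in range(len(db)):
        for j in range(len(db)):
            if i == j:
                continue
            diff = db[i] - db[j]
            d2 = diff @ diff
            if 0 < d2 < norms[i]:                # shorter, nonzero replacement
                db[i], norms[i] = diff, d2
    return db

rng = np.random.default_rng(3)
db = sieve_pass(rng.integers(-5, 6, size=(100, 10)))   # toy vectors in the lattice Z^10
```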

    Mobile Wound Assessment and 3D Modeling from a Single Image

    The prevalence of camera-enabled mobile phones has made mobile wound assessment a viable treatment option for millions of previously difficult-to-reach patients. We have designed a complete mobile wound assessment platform to ameliorate the many challenges related to chronic wound care. Chronic wounds and infections are the most severe, costly and fatal types of wounds, placing them at the center of mobile wound assessment. Wound physicians assess thousands of single-view wound images from all over the world, and it may be difficult to determine the location of a wound on the body, for example, when the image is taken at close range. In our solution, end-users capture an image of the wound with their mobile camera. The wound image is segmented and classified using modern convolutional neural networks, and is stored securely in the cloud for remote tracking. We use an interactive, semi-automated approach to allow users to specify the location of the wound on the body. To accomplish this we have created, to the best of our knowledge, the first 3D human surface anatomy labeling system, based on the current NYU and Anatomy Mapper labeling systems. To interactively view wounds in 3D, we present an efficient projective texture mapping algorithm for texturing wounds onto a 3D human anatomy model. In so doing, we demonstrate an approach to 3D wound reconstruction that works even for a single wound image.
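    To make the projective texture mapping step concrete, here is a hedged sketch of its core operation: projecting the 3D anatomy model's vertices through the wound camera to obtain texture coordinates. The camera intrinsics and pose below are illustrative placeholders, not values from the platform.

```python
import numpy as np

K = np.array([[800.0, 0.0, 320.0],    # intrinsics: focal lengths, principal point
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 2.0])   # extrinsics: camera pose (toy values)

def project_to_uv(vertices, image_size=(640, 480)):
    """vertices: (N, 3) points on the 3D anatomy model, in world coordinates.
    Returns (N, 2) texture coordinates in [0, 1] x [0, 1]."""
    cam = vertices @ R.T + t               # world -> camera frame
    pix = cam @ K.T                        # camera -> homogeneous pixel coordinates
    pix = pix[:, :2] / pix[:, 2:3]         # perspective divide
    return pix / np.array(image_size)      # normalize to texture coordinates

verts = np.array([[0.0, 0.0, 0.0], [0.1, 0.05, 0.0]])
print(project_to_uv(verts))                # first vertex maps to (0.5, 0.5)
```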

    Survey of Vector Database Management Systems

    There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. Yet embedding-based retrieval has been studied for over ten years, and similarity search for more than half a century. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist to address these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the vagueness of semantic similarity, the large size of vectors, the high cost of similarity comparison, the lack of natural partitioning that can be used for indexing, and the difficulty of efficiently answering hybrid queries that involve both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning based on randomization, learned partitioning, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware-accelerated execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point the direction for future work.
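    As a point of reference for the techniques surveyed, the sketch below shows the baseline operation every VDBMS must accelerate: an exact top-k scan under cosine similarity. Production systems replace this brute-force scan with quantization and the partition-based indexes mentioned above; all names here are illustrative.

```python
import numpy as np

def topk_cosine(query, vectors, k=5):
    """query: (d,); vectors: (n, d). Returns indices of the k most similar rows."""
    qn = query / np.linalg.norm(query)
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = vn @ qn                         # cosine similarity per stored vector
    return np.argsort(-scores)[:k]           # indices, best match first

rng = np.random.default_rng(1)
index = rng.standard_normal((10_000, 128))   # toy "collection" of embeddings
hits = topk_cosine(rng.standard_normal(128), index, k=3)
```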

    Bayesian Methods for Metabolomics

    Metabolomics, the large-scale study of small molecules, enables the underlying biochemical activity and state of cells or tissues to be directly captured. Nuclear Magnetic Resonance (NMR) spectroscopy is one of the major data capturing techniques for metabolomics, as it provides highly reproducible, quantitative information on a wide variety of metabolites. This work presents possible solutions to three problems, with the aim of supporting the development of better algorithms for NMR data analysis. After reviewing relevant concepts and literature, we first utilise observed NMR chemical shift titration data for a range of urinary metabolites and develop a theoretical model of chemical shift, using a Bayesian statistical framework and model selection procedures to estimate the number of protonation sites, a key parameter in modelling the relationship between chemical shift variation and pH that is usually unknown in uncatalogued metabolites. Secondly, with the aim of obtaining explicit concentration estimates for metabolites from NMR spectra, we discuss a Monte Carlo Co-ordinate Ascent Variational Inference (MC-CAVI) algorithm that combines Markov chain Monte Carlo (MCMC) methods with Co-ordinate Ascent VI (CAVI), demonstrate MC-CAVI’s suitability for models with hard constraints, and compare MC-CAVI’s performance with that of MCMC in an important complex model used in NMR spectroscopy data analysis. The third contribution seeks to improve metabolite identification, one of the biggest bottlenecks in metabolomics, which is severely hindered by resonance overlap in one-dimensional NMR spectroscopy. In particular, we present a novel Bayesian method for widely used two-dimensional (2D) 1H J-resolved (JRES) NMR spectroscopy, which has considerable potential to accurately identify and quantify metabolites within complex biological samples, by combining B-spline tight wavelet frames with theoretical templates. We then demonstrate the effectiveness of our approach via analyses of JRES datasets from serum and urine.

    NILM techniques for intelligent home energy management and ambient assisted living: a review

    The ongoing deployment of smart meters and various commercial devices has made electricity disaggregation feasible in buildings and households, based on a single measurement of current and, sometimes, voltage. Energy disaggregation aims to separate the total power consumption into specific appliance loads, which can be achieved by applying Non-Intrusive Load Monitoring (NILM) techniques with minimal invasion of privacy. NILM techniques have become increasingly widespread in recent years, as a consequence of the interest companies and consumers have in efficient energy consumption and management. This work presents a detailed review of NILM methods, focusing particularly on recent proposals and their applications, especially in the areas of Home Energy Management Systems (HEMS) and Ambient Assisted Living (AAL), where the ability to determine the on/off status of certain devices can provide key information for making further decisions. As well as complementing previous reviews of the NILM field and providing a discussion of the applications of NILM in HEMS and AAL, this paper provides guidelines for future research in these topics.
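    As a minimal illustration of the on/off detection that HEMS and AAL applications rely on, here is a toy event-based NILM sketch: detect step changes in the aggregate signal and match them to known appliance signatures. The signatures, threshold, and data are assumptions for illustration; the surveyed methods are far more sophisticated.

```python
import numpy as np

signatures = {"kettle": 2000.0, "fridge": 120.0}   # watts, hypothetical appliance draws

def detect_events(aggregate_power, min_step=100.0):
    """aggregate_power: (T,) watts sampled from a smart meter.
    Returns (t, delta_watts, appliance) for each detected on/off edge."""
    deltas = np.diff(aggregate_power)
    events = []
    for t in np.flatnonzero(np.abs(deltas) >= min_step):
        step = deltas[t]
        # match the edge magnitude to the closest known appliance signature
        name = min(signatures, key=lambda a: abs(abs(step) - signatures[a]))
        events.append((t + 1, step, name))
    return events

power = np.array([300, 300, 2300, 2300, 2300, 300, 300], dtype=float)
print(detect_events(power))   # kettle switches on at t=2, off at t=5
```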

    Data Models for Dataset Drift Controls in Machine Learning With Images

    Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This makes it difficult to create physically faithful drift test cases or to provide specifications of data models that should be avoided when deploying a machine learning model. In this study, we demonstrate how these shortcomings can be overcome by pairing machine learning robustness validation with physical optics. We examine the role raw sensor data and differentiable data models can play in controlling performance risks related to image dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases. The experiments presented here show that the average decrease in model performance is ten to four times less severe than under post-hoc augmentation testing. Second, the gradient connection between task and data models allows for drift forensics that can be used to specify performance-sensitive data models which should be avoided during deployment of a machine learning model. Third, drift adjustment opens up the possibility of processing adjustments in the face of drift. This can speed up and stabilize classifier training, with gains of up to 20% in validation accuracy. A guide to accessing the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
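    To illustrate what drift synthesis from raw sensor data can look like, here is a conceptual sketch (not the authors' raw2logit code): the same raw capture is re-developed under different, physically motivated processing parameters to produce controlled drift test cases. The pipeline stages and parameter values are simplified assumptions.

```python
import numpy as np

def develop(raw, gain=1.0, wb=(1.0, 1.0, 1.0), gamma=2.2):
    """raw: (H, W, 3) linear sensor values in [0, 1]. A toy camera pipeline:
    gain -> white balance -> gamma. Parameters are illustrative only."""
    img = np.clip(raw * gain * np.asarray(wb), 0.0, 1.0)
    return img ** (1.0 / gamma)

raw = np.random.default_rng(2).uniform(size=(64, 64, 3))
reference = develop(raw)                   # the training-time processing condition
drift_cases = [
    develop(raw, gain=0.5),                # under-exposure drift
    develop(raw, wb=(1.3, 1.0, 0.8)),      # white-balance drift
    develop(raw, gamma=1.8),               # tone-curve drift
]
# Each drift case can now be fed to the task model to probe its robustness.
```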