270 research outputs found

    Computational Approaches to Drug Profiling and Drug-Protein Interactions

    Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour. High attrition rates led to a long period of stagnation in drug approvals. Due to the extreme costs associated with introducing a drug to the market, locating and understanding the reasons for clinical failure is key to future productivity. As part of this PhD, three main contributions were made in this respect. First, the web platform LigNFam enables users to interactively explore similarity relationships between ‘drug-like’ molecules and the proteins they bind. Secondly, two deep-learning-based binding site comparison tools were developed, competing with the state of the art on benchmark datasets. The models can predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold relationships and has already been used in multiple projects, including integration into a virtual screening pipeline to increase the tractability of ultra-large screening experiments. Together with existing tools, these contributions will aid the understanding of drug-protein relationships, particularly in the fields of off-target prediction and drug repurposing, helping to design better drugs faster.

    Tabular and latent space synthetic data generation: a literature review

    Fonseca, J., & Bacao, F. (2023). Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10, 1-37. [115]. https://doi.org/10.1186/s40537-023-00792-7 --- This research was supported by two research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC). The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks, and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, group the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data, and provide recommendations for future research. We expect this study to assist researchers and practitioners in identifying relevant gaps in the literature and in designing better and more informed practices with synthetic data.

    The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management. In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for various applications, promoting environmental sustainability and good resource management. However, their production remains a challenging task. Various factors contribute to the difficulty of generating accurate, timely updated LULC maps, whether via automatic or photo-interpreted LULC mapping. Data preprocessing, a crucial step for any Machine Learning (ML) task, is particularly important in the remote sensing domain due to the overwhelming amount of raw, unlabeled data continuously gathered from multiple remote sensing missions. However, a significant part of the state of the art focuses on scenarios with full access to labeled training data with relatively balanced class distributions. This thesis focuses on the challenges found in automatic LULC classification tasks, specifically in data preprocessing tasks. We focus on the development of novel Active Learning (AL) and imbalanced learning techniques to improve ML performance in situations with limited training data and/or the existence of rare classes. We also show that many of the contributions presented are successful not only in remote sensing problems, but also in various other multidisciplinary classification problems. The work presented in this thesis used open access datasets to test the contributions made in imbalanced learning and AL. All the data pulling, preprocessing and experiments are made available at https://github.com/joaopfonseca/publications. The algorithmic implementations are made available in the Python package ml-research at https://github.com/joaopfonseca/ml-research

    Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)

    This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), held in Tampere, Finland, on 21–22 September 2023.

    An uncertainty prediction approach for active learning - application to earth observation

    Mapping land cover and land use dynamics is crucial in remote sensing, since farmers are encouraged to either intensify or extend crop use due to the ongoing rise in the world’s population. A major issue in this area is interpreting and classifying a scene captured in high-resolution satellite imagery. Several methods have been put forth, including neural networks, which generate data-dependent models (i.e. the model is biased toward the data), and static rule-based approaches with thresholds, which are limited in terms of diversity (i.e. the model lacks diversity in terms of rules). However, the problem of building a machine learning model that, given a large amount of training data, can classify multiple classes over Sentinel-2 imagery from different geographic areas and outperform existing approaches remains open. On the other hand, supervised machine learning has evolved into an essential part of many areas due to the increasing number of labeled datasets. Examples include creating classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as a virtual personal assistant and detect online fraud, among many more. Since these classifiers are highly dependent on the training datasets, their performance on unseen observations is uncertain without human interaction or accurate labels. Thus, researchers have attempted to evaluate a number of independent models using a statistical distance. However, given a train-test split and classifiers fitted on the training set, the problem of identifying prediction error from the relation between the training and test sets remains open. Moreover, while some training data is essential for supervised machine learning, what happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets is a time-consuming process that may need significant expert human involvement.
When there are not enough expert manual labels available for the vast amount of openly available data, active learning becomes crucial. However, given a large amount of training and unlabeled datasets, having an active learning model that can reduce the training cost of the classifier and at the same time assist in labeling new data points remains an open problem. From the experimental approaches and findings, the main research contributions, which concentrate on the issue of optical satellite image scene classification, include: building labeled Sentinel-2 datasets with surface reflectance values; proposal of machine learning models for pixel-based image scene classification; proposal of a statistical distance-based Evidence Function Model (EFM) to detect ML model misclassification; and proposal of a generalised sampling approach for active learning that, together with the EFM, enables a way of determining the most informative examples. Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared to Sen2Cor, the reference package from the European Space Agency. The ML model attained a micro-F1 value of 84%, which is a significant improvement over the corresponding Sen2Cor performance of 59%. Secondly, to quantify the misclassification of the ML models, the Mahalanobis distance-based EFM was devised. This model achieved, for the labeled Sentinel-2 dataset, a micro-F1 of 67.89% for misclassification detection. Lastly, EFM was engineered as a sampling strategy for active learning, leading to an approach that attains the same level of accuracy with only 0.02% of the total training samples when compared to a classifier trained with the full training set.
With the help of the above-mentioned research contributions, we were able to provide an open-source Sentinel-2 image scene classification package which consists of ready-to-use Python scripts and an ML model that classifies Sentinel-2 L1C images, generating a 20 m resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other) and giving academics a straightforward method for rapidly and effectively classifying Sentinel-2 scene images. Additionally, an active learning approach that uses, as its sampling strategy, the observed prediction uncertainty given by EFM will allow labeling only the most informative points to be used as input to build classifiers. Summary: An Uncertainty Prediction Approach for Active Learning - Application to Earth Observation. Mapping land cover and land use dynamics is crucial in remote sensing, since farmers are encouraged to intensify or extend crops due to the continuous growth of the world population. An important issue in this area is interpreting and classifying scenes captured in high-resolution satellite images. Several approaches have been proposed, including neural networks, which produce data-dependent models (that is, the model is biased toward the data), and rule-based approaches, which are restricted in diversity (that is, the model lacks diversity in terms of rules). However, creating a machine learning model that, given a large amount of training data, can classify Sentinel-2 images from different geographic areas with superior performance remains an open problem. On the other hand, supervised learning techniques have been used to solve problems in a wide range of areas owing to the proliferation of labeled datasets.
Examples include classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as virtual personal assistants and detect online fraud, among many others. Since these classifiers depend heavily on the training dataset, their performance on new data is uncertain without human interaction or accurate labels. To this end, there are proposals to evaluate independent models using a statistical distance. However, given a train-test split and a classifier, the problem of identifying the prediction error using the relation between those sets remains open. Moreover, although some training data is essential for supervised learning, what happens when the amount of labeled data is insufficient? After all, assigning labels is a time-consuming process that requires expertise, which translates into significant human involvement. When the amount of data manually labeled by experts is insufficient, active learning becomes crucial. However, given a large amount of unlabeled training data, having an active learning model that reduces the training cost of the classifier and, at the same time, assists in labeling new observations remains an open problem.
From the experimental approaches and studies, the main contributions of this work, which concentrates on the classification of optical satellite image scenes, include: creation of labeled Sentinel-2 datasets with surface reflectance values; proposal of pixel-based machine learning models for satellite image scene classification; proposal of an Evidence Function Model (EFM) based on a statistical distance to detect classification errors of learning models; and proposal of a generalised sampling approach for active learning that, together with the EFM, provides a way of determining the most informative examples. First, using a manually labeled Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared with that of Sen2Cor, the reference product of the European Space Agency; the classifier achieved a micro-F1 of 84%, a significant improvement over the corresponding Sen2Cor performance of 59%. Second, to quantify the classification error of the ML models, the Evidence Function Model based on the Mahalanobis distance was devised. For the labeled Sentinel-2 dataset, this model achieved a micro-F1 of 67.89% in misclassification detection. Finally, the EFM was used as a sampling strategy for active learning, an approach that reached the same level of performance with only 0.02% of the total training examples when compared with a classifier trained on the full training set.
With the help of the contributions mentioned above, it was possible to develop an open-source package for Sentinel-2 image scene classification that, using a set of Python scripts, a classification model, and a Sentinel-2 L1C image, generates the corresponding RGB image (at 20 m resolution) with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other), giving academia a straightforward method for classifying Sentinel-2 image scenes quickly and effectively. In addition, the active learning approach that uses, as its sampling strategy, the misclassification detection given by the EFM allows labeling only the most informative points to be used as input in building classifiers.
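
The abstract above mentions a Mahalanobis distance-based model used as a sampling strategy for active learning. As a rough illustration of that general idea only, the sketch below scores unlabeled points by their smallest Mahalanobis distance to any class fitted on the labeled data and queries the points that lie farthest from every class. This is not the thesis's EFM: the function names (`fit_class_stats`, `min_mahalanobis`, `select_queries`), the `ridge` regularizer, and the "farthest first" rule are all assumptions made for the example.

```python
import numpy as np


def fit_class_stats(X, y, ridge=1e-6):
    """Return {label: (mean, inverse covariance)} for each class in y.

    Assumes at least two labeled samples per class; `ridge` is a
    hypothetical regularizer that keeps the covariance invertible.
    """
    stats = {}
    for label in np.unique(y):
        Xc = X[y == label]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + ridge * np.eye(X.shape[1])
        stats[label] = (mean, np.linalg.inv(cov))
    return stats


def min_mahalanobis(X, stats):
    """Smallest Mahalanobis distance from each row of X to any class."""
    dists = []
    for mean, inv_cov in stats.values():
        diff = X - mean
        # Quadratic form diff^T * inv_cov * diff, evaluated row by row.
        d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        dists.append(np.sqrt(np.maximum(d2, 0.0)))
    return np.min(np.stack(dists, axis=0), axis=0)


def select_queries(X_pool, stats, n_queries):
    """Indices of the pool points farthest from every known class."""
    scores = min_mahalanobis(X_pool, stats)
    return np.argsort(scores)[::-1][:n_queries]
```

In a typical loop, the selected points would be labeled by an expert, appended to the labeled set, and the class statistics refitted before the next round of queries.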

    A survey, review, and future trends of skin lesion segmentation and classification

    The Computer-aided Diagnosis or Detection (CAD) approach for skin lesion analysis is an emerging field of research that has the potential to alleviate the burden and cost of skin cancer screening. Researchers have recently shown increasing interest in developing such CAD systems, with the intention of providing a user-friendly tool to dermatologists to reduce the challenges associated with manual inspection. This article aims to provide a comprehensive literature survey and review of a total of 594 publications (356 for skin lesion segmentation and 238 for skin lesion classification) published between 2011 and 2022. These articles are analyzed and summarized along several dimensions to contribute vital information regarding the methods for the development of CAD systems. These dimensions include: relevant and essential definitions and theories, input data (dataset utilization, preprocessing, augmentations, and fixing imbalance problems), method configuration (techniques, architectures, module frameworks, and losses), training tactics (hyperparameter settings), and evaluation criteria. We also investigate a variety of performance-enhancing approaches, including ensemble and post-processing, and discuss these dimensions to reveal their current trends based on utilization frequencies. In addition, we highlight the primary difficulties associated with evaluating skin lesion segmentation and classification systems using minimal datasets, as well as the potential solutions to these difficulties. Findings, recommendations, and trends are disclosed to inform future research on developing an automated and robust CAD system for skin lesion analysis.

    Deep learning in food category recognition

    Integrating artificial intelligence with food category recognition has been a field of interest for research for the past few decades. It is potentially one of the next steps in revolutionizing human interaction with food. The modern advent of big data and the development of data-oriented fields like deep learning have provided advancements in food category recognition. With increasing computational power and ever-larger food datasets, the approach’s potential has yet to be realized. This survey provides an overview of methods that can be applied to various food category recognition tasks, including detecting type, ingredients, quality, and quantity. We survey the core components for constructing a machine learning system for food category recognition, including datasets, data augmentation, hand-crafted feature extraction, and machine learning algorithms. We place a particular focus on the field of deep learning, including the utilization of convolutional neural networks, transfer learning, and semi-supervised learning. We provide an overview of relevant studies to promote further developments in food category recognition for research and industrial applications. Funding: MRC (MC_PC_17171); Royal Society (RP202G0230); BHF (AA/18/3/34220); Hope Foundation for Cancer Research (RM60G0680); GCRF (P202PF11); Sino-UK Industrial Fund (RP202G0289); LIAS (P202ED10); Data Science Enhancement Fund (P202RE237); Fight for Sight (24NN201); Sino-UK Education Fund (OP202006); BBSRC (RM32G0178B8

    One Shot Learning with class partitioning and cross validation voting (CP-CVV)

    One Shot Learning includes all those techniques that make it possible to classify images using a single image per category. One of its possible applications is the identification of food products. For a grocery store, it is useful to record a single image of each product and be able to recognise it again from other images, such as photos taken by customers. Within deep learning, Siamese neural networks are able to verify whether two images belong to the same category or not. In this paper, a new Siamese network training technique, called CP-CVV, is presented. It combines different models trained with different classes. The validation classes are separated in such a way that each of the combined models is different, in order to avoid overfitting to the validation set. Unlike normal training, the test images belong to classes that have not previously been used in training, allowing the model to work on new categories of which only one image exists. Different backbones have been evaluated in the Siamese composition, as has the integration of multiple models with different backbones. The results show that the model improves on previous works and allows the classification problem to be solved, an additional step towards the use of Siamese networks. To the best of our knowledge, no existing work has proposed integrating Siamese neural networks using a class-based validation set separation technique so as to generalise better to unknown classes. Additionally, we have applied Cross-Validation-Voting with ConvNeXt to improve the existing classification results on a well-known Grocery Store Dataset. Funding: The Centre for the Development of Industrial Technology (CDTI) and the Instituto para la Competitividad Empresarial de Castilla y León - FEDER (Project CCTT3/20/VA/0003

    Towards Deep Learning with Competing Generalisation Objectives

    The unreasonable effectiveness of Deep Learning continues to deliver unprecedented Artificial Intelligence capabilities to billions of people. Growing datasets and technological advances keep extending the reach of expressive model architectures trained through efficient optimisations. Thus, deep learning approaches continue to provide increasingly proficient subroutines for, among others, computer vision and natural interaction through speech and text. Due to their scalable learning and inference priors, higher performance is often gained cost-effectively through largely automatic training. As a result, new and improved capabilities empower more people while the costs of access drop. The arising opportunities and challenges have profoundly influenced research. Quality attributes of scalable software became central desiderata of deep learning paradigms, including reusability, efficiency, robustness and safety. Ongoing research into continual, meta- and robust learning aims to maximise such scalability metrics in addition to multiple generalisation criteria, despite possible conflicts. A significant challenge is to satisfy competing criteria automatically and cost-effectively. In this thesis, we introduce a unifying perspective on learning with competing generalisation objectives and make three additional contributions. When autonomous learning through multi-criteria optimisation is impractical, it is reasonable to ask whether knowledge of appropriate trade-offs could make it simultaneously effective and efficient. Informed by explicit trade-offs of interest to particular applications, we developed and evaluated bespoke model architecture priors. We introduced a novel architecture for sim-to-real transfer of robotic control policies by learning progressively to generalise anew. Competing desiderata of continual learning were balanced through disjoint capacity and hierarchical reuse of previously learnt representations. 
A new state-of-the-art meta-learning approach is then proposed. We showed that meta-trained hypernetworks efficiently store and flexibly reuse knowledge for new generalisation criteria through few-shot gradient-based optimisation. Finally, we characterised empirical trade-offs between the many desiderata of adversarial robustness and demonstrated a novel defensive capability of implicit neural networks to hinder many attacks simultaneously.