
    Development of a Robust Read-Across Model for the Prediction of Biological Potency of Novel Peroxisome Proliferator-Activated Receptor Delta Agonists

    A robust predictive model was developed using 136 novel peroxisome proliferator-activated receptor delta (PPARδ) agonists, a distinct subtype of lipid-activated transcription factors of the nuclear receptor superfamily that regulate target genes by binding to characteristic DNA sequences. The model employs various structural descriptors and docking calculations and predicts the biological activity of PPARδ agonists, following the criteria of the Organisation for Economic Co-operation and Development (OECD) for the development and validation of quantitative structure–activity relationship (QSAR) models. Focused specifically on small molecules, the model facilitates the identification of highly potent and selective PPARδ agonists and supports a read-across concept by providing the chemical neighbours of the compound under study. Model development was conducted in Isalos Analytics Software (v. 0.1.17), which provides an intuitive environment for machine-learning applications. The final model was released as a user-friendly web tool and can be accessed through the Enalos Cloud platform’s graphical user interface (GUI).
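The core read-across idea, predicting a query compound's potency from its nearest chemical neighbours in descriptor space, can be sketched in plain Python. The descriptor names, values, and the mean-over-neighbours rule below are illustrative assumptions, not the published model:

```python
import math

def euclidean(a, b):
    """Distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def read_across(query, train_descriptors, train_potencies, k=3):
    """Predict potency as the mean over the k nearest chemical neighbours.

    Returns the prediction and the neighbour indices, so the user can
    inspect which training compounds drove the estimate.
    """
    ranked = sorted(range(len(train_descriptors)),
                    key=lambda i: euclidean(query, train_descriptors[i]))
    neighbours = ranked[:k]
    prediction = sum(train_potencies[i] for i in neighbours) / k
    return prediction, neighbours

# Toy descriptor space (e.g. logP, MW/100, docking score) -- hypothetical values
descriptors = [[2.1, 3.5, -8.2], [2.3, 3.6, -8.0], [5.0, 1.2, -4.1], [2.0, 3.4, -8.3]]
potencies = [7.9, 8.1, 5.2, 8.0]  # e.g. pEC50 values, invented for illustration
pred, nbrs = read_across([2.2, 3.5, -8.1], descriptors, potencies, k=3)
```

Returning the neighbour indices alongside the prediction mirrors the read-across concept mentioned in the abstract: the chemical neighbours themselves are part of the output.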

    Automatic annotation for weakly supervised learning of detectors

    Object detection in images and action detection in videos are among the most widely studied computer vision problems, with applications in consumer photography, surveillance, and automatic media tagging. Typically, these standard detectors are fully supervised; that is, they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated. With the emergence of digital media and the rise of high-speed internet, raw images and video are available at little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive. As a result, there has been great interest in training detectors with weak supervision, where only the presence or absence of the object/action in an image/video is needed, not its location. This thesis presents approaches for weakly supervised learning of object/action detectors, with a focus on automatically annotating object and action locations in images/videos using only binary weak labels indicating the presence or absence of the object/action. First, a framework for weakly supervised learning of object detectors in images is presented. In the proposed approach, a variation of the multiple instance learning (MIL) technique automatically annotates object locations in weakly labelled data and, unlike existing approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial annotation is then used to start an iterative process in which standard object detectors refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift from the object of interest, a scheme for detecting model drift is also presented. Furthermore, unlike most other methods, our weakly supervised approach is evaluated on data without manual pose (object orientation) annotation.
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues, is carried out. From this analysis, a new method based on negative mining (NegMine) is presented for the initial annotation of both object and action data. The NegMine-based approach is a much simpler formulation that uses only an inter-class measure and requires no complex combinatorial optimisation, yet it can match or outperform existing approaches, including the previously presented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing approaches to boost their performance. Finally, the thesis takes a step back and looks at the use of generic object detectors as prior knowledge in weakly supervised learning of object detectors. These generic object detectors are typically based on sampling saliency maps that indicate whether a pixel belongs to the background or the foreground. A new approach to generating saliency maps is presented that, unlike existing approaches, looks beyond the current image of interest and into images similar to it. We show that our generic object proposal method can be used by itself to annotate the weakly labelled object data with surprisingly high accuracy.
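The inter-class-only intuition behind NegMine can be illustrated with a minimal sketch: among candidate windows in a weakly labelled positive image, keep the one least similar to features drawn from object-absent (negative) images. The feature values and distance measure here are hypothetical, and the thesis's actual formulation is richer:

```python
def negmine_select(candidates, negative_feats):
    """Pick the candidate window least explained by the negative class.

    candidates: one feature vector per candidate window in a weakly
    labelled positive image.
    negative_feats: feature vectors from images whose binary weak label
    says the object is absent.
    Each candidate is scored by its distance to the closest negative
    feature; the most 'un-negative' window wins.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    scores = [min(dist(c, n) for n in negative_feats) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

# Hypothetical 2-D features: two background-like windows, one object-like
windows = [[0.1, 0.1], [0.2, 0.0], [0.9, 0.8]]
negatives = [[0.0, 0.0], [0.15, 0.05]]
best = negmine_select(windows, negatives)  # index of the likely object window
```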

    Continual machine learning for non-stationary data analysis

    Although deep learning models have achieved significant successes in various fields, most have limited capacity to learn multiple tasks sequentially. The issue of forgetting previously learned tasks in continual learning is known as catastrophic forgetting or interference. When the input data or the goal of learning changes, a conventional machine learning model will learn and adapt to the new status; however, it will not remember or recognise any revisits to previous states. This causes performance reduction and repeated re-training when dealing with periodic or irregularly reoccurring changes in the data or goals. Without continual learning ability, one cannot deploy an adaptive machine learning model in a changing environment. This thesis investigates continual learning and the mitigation of the catastrophic forgetting problem in neural networks. We assume non-stationary data containing multiple different tasks that arrive in sequence and are not stored. We first propose a regularisation method that identifies and penalises changes to the important parameters of previous tasks while learning a new one. However, when the number of tasks is sufficiently large, this method either cannot preserve all the previously learned knowledge or impedes the integration of new knowledge. This is also known as the stability-plasticity dilemma. To address this problem, we propose a replay method based on Generative Adversarial Networks (GANs). Unlike other replay methods, the proposed model is not bounded by the fitting capacity of the generator. However, the number of parameters increases rapidly as the number of learned tasks grows. Therefore, we propose a continual learning model based on Bayesian neural networks and a Mixture of Experts (MoE) framework, which integrates the different experts responsible for different tasks into one large model.
Previous knowledge is preserved, and new tasks can be efficiently learned by assigning new experts. However, with Monte-Carlo sampling-based inference, the performance is not satisfactory. To address this issue, we propose a Probabilistic Neural Network (PNN) and integrate it with a conventional neural network. The PNN can produce the likelihood of a given input and can be used in a variety of fields. To apply continual learning methods to real-world applications, we then propose a semi-supervised learning model to analyse healthcare datasets. The proposed framework extracts general features from unlabelled data. We integrate the PNN into the framework to classify data that include a smaller set of labelled samples, and to continually learn new cases. The proposed model has been tested on benchmark datasets as well as a real-world clinical dataset. The results show that our proposed model outperforms the state-of-the-art models in overall continual learning accuracy without requiring prior knowledge of the tasks. The experiments on the real-world clinical data were designed to identify the risk of Urinary Tract Infections (UTIs) using in-home monitoring data. The UTI risk analysis model has been deployed in a digital platform and is currently part of the ongoing Minder clinical study at the UK Dementia Research Institute (UK DRI). An earlier version of the model was deployed as part of a Class-I CE-marked medical device. The UK DRI Minder platform and the deployed machine learning models, including the UTI risk analysis model developed in this research, are in the process of being accredited as a Class-IIa medical device. Overall, this PhD research tackles theoretical and applied challenges of continual learning models in dealing with real-world data. We evaluate the proposed continual learning methods on a variety of benchmarks with comprehensive analysis and show their effectiveness. Furthermore, we have applied the proposed methods in real-world applications and demonstrated their applicability to real-world settings and clinical problems.
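The first proposed method, penalising changes to parameters that were important for previous tasks, follows the general shape of an EWC-style quadratic regulariser. A minimal sketch, with hypothetical importance values standing in for whatever importance measure the thesis actually uses:

```python
def importance_penalty(params, old_params, importance, lam=1.0):
    """Quadratic penalty discouraging changes to parameters that were
    important for previous tasks (an EWC-style regulariser).

    loss_total = loss_new_task + (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2
    where F_i is a per-parameter importance weight (e.g. Fisher
    information) and theta*_i the parameter value after the previous task.
    """
    return 0.5 * lam * sum(f * (p - q) ** 2
                           for p, q, f in zip(params, old_params, importance))

# Hypothetical values: parameter 0 is important, parameter 1 is not,
# so moving parameter 0 dominates the penalty
old = [1.0, -2.0]
fisher = [10.0, 0.1]
penalty = importance_penalty([1.5, -1.0], old, fisher, lam=2.0)
```

The unimportant parameter can move freely (tiny penalty), while the important one is effectively anchored, which is exactly the stability-plasticity trade-off the abstract describes.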

    Social Data Mining for Crime Intelligence

    With the advancement of the Internet and related technologies, many traditional crimes have made the leap to digital environments. The successes of data mining in a wide variety of disciplines have given birth to crime analysis. Traditional crime analysis is mainly focused on understanding crime patterns; however, it is unsuitable for identifying and monitoring emerging crimes. The true nature of crime remains buried in unstructured content that represents the hidden story behind the data. User feedback leaves valuable traces that can be utilised to measure the quality of various aspects of products or services and can also be used to detect, infer, or predict crimes. As in any application of data mining, the data must be of a high quality standard in order to avoid erroneous conclusions. This thesis presents a methodology and practical experiments towards discovering whether (i) user feedback can be harnessed and processed for crime intelligence, (ii) criminal associations, structures, and roles can be inferred among entities involved in a crime, and (iii) methods and standards can be developed for measuring, predicting, and comparing the quality level of social data instances and samples. It contributes to the theory, design, and development of a novel framework for crime intelligence and an algorithm for the estimation of social data quality, innovatively adapting methods for monitoring water contaminants. Several experiments were conducted, and the results obtained revealed the significance of this study in mining social data for crime intelligence and in developing social data quality filters and decision support systems.
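The water-contaminant analogy for social data quality can be illustrated with a toy scoring filter: each quality metric acts as a 'contaminant' with a permitted limit, and exceeding a limit lowers the sample's quality score. The metric names, limits, and weights below are invented for illustration, not taken from the thesis:

```python
def quality_score(instance, thresholds, weights):
    """Score a social-data instance against 'contaminant' limits,
    analogous to a water-quality index: each metric (e.g. spam ratio,
    duplication rate, missing-field rate) that exceeds its threshold
    reduces the overall quality.

    instance, thresholds, weights: dicts keyed by metric name.
    Returns a value in [0, 1]; 1.0 means no contaminant exceeded its limit.
    """
    total = sum(weights.values())
    penalty = sum(weights[m] for m, v in instance.items()
                  if v > thresholds[m])
    return 1.0 - penalty / total

# Hypothetical contaminant metrics for one batch of user feedback
metrics = {"spam_ratio": 0.05, "duplication": 0.30, "missing_fields": 0.02}
limits = {"spam_ratio": 0.10, "duplication": 0.20, "missing_fields": 0.05}
weights = {"spam_ratio": 0.5, "duplication": 0.3, "missing_fields": 0.2}
score = quality_score(metrics, limits, weights)
```

Such a score could act as a filter in front of downstream crime-intelligence analytics, discarding batches whose quality falls below a chosen cut-off.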

    Classifying Tor traffic using character analysis

    Tor is a privacy-preserving network that enables users to browse the Internet anonymously. Although the prospect of such anonymity is welcomed in many quarters, Tor can also be used for malicious purposes, prompting the need to monitor Tor network connections. Most traffic classification methods depend on flow-based features because of traffic encryption. However, these features can be less reliable due to issues like asymmetric routing, and processing multiple packets can be time-intensive. In light of Tor’s sophisticated multilayered payload encryption compared with nonTor encryption, our research explored patterns in the encrypted data of both networks, challenging conventional encryption theory, which assumes that ciphertexts should not be distinguishable from random strings of equal length. Our novel approach leverages machine learning to differentiate Tor from nonTor traffic using only the encrypted payload. We focused on extracting statistical hex-character-based features from the encrypted data. For consistent findings, we drew from two datasets: a public one, which was divided into eight application types for more granular insight, and a private one. Both datasets covered Tor and nonTor traffic. We developed a custom Python script called Charcount to extract relevant data and features accurately. To verify the robustness of our results, we utilized both Weka and scikit-learn for classification. In our first line of research, we conducted hex character analysis on the encrypted payloads of both Tor and nonTor traffic using statistical testing. Our investigation revealed a significant differentiation rate between Tor and nonTor traffic of 95.42% for the public dataset and 100% for the private dataset. The second phase of our study aimed to distinguish between Tor and nonTor traffic using machine learning, focusing on encrypted payload features that are independent of length.
In our evaluations, the public dataset yielded an average accuracy of 93.56% when classified with the Decision Tree (DT) algorithm in scikit-learn, and 95.65% with the J48 algorithm in Weka. For the private dataset, the accuracies were 95.23% and 97.12%, respectively. Additionally, we found that the combination of WrapperSubsetEval+BestFirst with the J48 classifier both enhanced accuracy and optimized processing efficiency. In conclusion, our study contributes both to demonstrating the distinction between Tor and nonTor traffic and to achieving efficient classification of both types of traffic using features derived exclusively from a single encrypted payload packet. This work holds significant implications for cybersecurity and points towards further advancements in the field.
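The length-independent hex-character features attributed to Charcount can be approximated in a few lines: hex-encode the payload and compute the normalised frequency of each of the sixteen hex characters. The real script's exact feature set is not given in the abstract, so this is a sketch of the idea:

```python
from collections import Counter

def hex_char_features(payload: bytes):
    """Frequency of each hex character '0'-'f' in the payload's hex
    encoding, normalised to sum to 1 -- a feature vector that does not
    depend on payload length, in the spirit of Charcount.
    """
    hexstr = payload.hex()
    counts = Counter(hexstr)
    n = len(hexstr)
    return [counts.get(c, 0) / n for c in "0123456789abcdef"]

# Three example bytes; a real feature would come from a full packet payload
feats = hex_char_features(bytes([0x00, 0xff, 0xa5]))
```

A classifier (e.g. a decision tree, as in the study) would then be trained on one such 16-dimensional vector per packet.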

    Computational Approaches to Drug Profiling and Drug-Protein Interactions

    Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour. High attrition rates led to a long period of stagnation in drug approvals. Given the extreme costs associated with introducing a drug to the market, locating and understanding the reasons for clinical failure is key to future productivity. As part of this PhD, three main contributions were made in this respect. First, the web platform LigNFam enables users to interactively explore similarity relationships between ‘drug-like’ molecules and the proteins they bind. Secondly, two deep-learning-based binding-site comparison tools were developed, competing with the state-of-the-art over benchmark datasets. The models have the ability to predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold relationships and has already been used in multiple projects, including integration into a virtual screening pipeline to increase the tractability of ultra-large screening experiments. Together, and with existing tools, the contributions made will aid in the understanding of drug-protein relationships, particularly in the fields of off-target prediction and drug repurposing, helping to design better drugs faster.

    Automated anomaly recognition in real-time data streams for the oil and gas industry

    There is a growing demand for computer-assisted real-time anomaly detection - from the identification of suspicious activities in cyber security, to the monitoring of engineering data for various applications across the oil and gas, automotive, and other engineering industries. To reduce the reliance on field experts' knowledge for the identification of these anomalies, this thesis proposes a deep-learning anomaly-detection framework that can help to create an effective real-time condition-monitoring framework. The aim of this research is to develop a real-time and re-trainable generic anomaly-detection framework, which is capable of predicting and identifying anomalies with a high level of accuracy - even when a specific anomalous event has no precedent. Machine-based condition monitoring is preferable in many practical situations where fast data analysis is required, and where there are harsh climates or otherwise life-threatening environments. For example, automated condition-monitoring systems are ideal in deep-sea exploration studies, offshore installations, and space exploration. This thesis firstly reviews studies about anomaly detection using machine learning. It then adopts the best practices from those studies in order to propose a multi-tiered framework for anomaly detection with heterogeneous input sources, which can deal with unseen anomalies in a real-time dynamic problem environment. The thesis then applies the developed generic multi-tiered framework to two fields: engineering data analysis and malicious cyber-attack detection. Finally, the framework is further refined based on the outcomes of those case studies and is used to develop a secure cross-platform API, capable of re-training and data classification on a real-time data feed.
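As a stand-in for the deep-learning framework (whose internals are not given in the abstract), a rolling z-score detector illustrates the real-time, streaming shape of the problem: maintain statistics over recent points and flag values that deviate strongly, even for anomaly types never seen before. This is a simple statistical sketch, not the thesis's model:

```python
from collections import deque
import math

class RollingAnomalyDetector:
    """Flag points far from the recent mean of a real-time stream."""

    def __init__(self, window=50, z_threshold=3.0):
        self.buf = deque(maxlen=window)  # sliding window of recent values
        self.z = z_threshold

    def update(self, x):
        """Return True if x is anomalous relative to the current window."""
        anomalous = False
        if len(self.buf) >= 10:  # require a short warm-up before flagging
            mean = sum(self.buf) / len(self.buf)
            var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
            std = math.sqrt(var) or 1e-12  # guard against zero variance
            anomalous = abs(x - mean) / std > self.z
        self.buf.append(x)
        return anomalous

# Simulated sensor feed: stable readings, then one extreme excursion
det = RollingAnomalyDetector(window=20, z_threshold=3.0)
stream = [10.1, 9.9, 10.2, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9, 10.0, 10.2, 9.8, 50.0]
flags = [det.update(v) for v in stream]
```

Because the window keeps sliding, the detector "re-trains" continuously on the live feed, loosely mirroring the re-trainable API described above.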

    Classification of socially generated medical data

    The growth of online health communities, particularly those involving socially generated content, can provide considerable value for society. Participants can gain knowledge of medical information or interact with peers on medical forum platforms. However, the sheer volume of information so generated – and the consequent ‘noise’ associated with large data volumes – can create difficulties for information consumers. We propose a solution to this problem by applying high-level analytics to the data – primarily sentiment analysis, but also content and topic analysis – for accurate classification. We believe that such analysis can be of significant value to data users, such as identifying a particular aspect of an information space, determining themes that predominate among a large dataset, and allowing people to summarise topics within a big dataset. In this thesis, we apply machine learning strategies to identify sentiments expressed in online medical forums that discuss Lyme Disease. As part of this process, we distinguish a complete and relevant set of categories that can be used to characterise Lyme Disease discourse. We present a feature-based model that employs supervised learning algorithms and assess the feasibility and accuracy of this sentiment classification model. We further evaluate our model by assessing its ability to adapt to an online medical forum discussing a disease with similar characteristics, Lupus. The experimental results demonstrate the effectiveness of our approach. In many sentiment analysis applications, labelled training datasets are expensive to obtain, whereas unlabelled datasets are readily available. Therefore, we present an adaptation of a well-known semi-supervised learning technique, in which co-training is implemented by combining labelled and unlabelled data. Our results suggest that the model can learn effectively even with limited labelled data.
In addition, we investigate complementary analytic techniques – content and topic analysis – to make the best use of the data for various consumer groups. Within the work described in this thesis, some particular research issues are addressed, specifically when applied to socially generated medical/health datasets:
    • When applying binary sentiment analysis to short-form text data (e.g. Twitter), could meta-level features improve classification performance?
    • When applying more complex multi-class sentiment analysis to the classification of long-form, content-rich text data, would meta-level features be a useful addition to more conventional features?
    • Can this multi-class analysis approach be generalised to other medical/health domains?
    • How would alternative classification strategies benefit different groups of information consumers?
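The co-training adaptation described above, combining labelled and unlabelled data across two feature views, can be sketched with toy nearest-centroid classifiers: one view's classifier pseudo-labels an unlabelled example, and the label is shared with the other view. The views, feature values, and single-step schedule are illustrative assumptions, not the thesis's implementation:

```python
def centroid_fit(X, y):
    """Per-class mean vector for one feature view (a tiny stand-in classifier)."""
    cents = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def centroid_predict(cents, x):
    """Nearest-centroid prediction by squared Euclidean distance."""
    return min(cents, key=lambda lab: sum((p - q) ** 2
                                          for p, q in zip(cents[lab], x)))

def co_training_step(Xa, Xb, y, Ua, Ub):
    """One round of co-training: the view-A classifier pseudo-labels the
    next unlabelled example, and the label is added to BOTH views'
    training sets so the view-B classifier benefits from it too.
    """
    ca = centroid_fit(Xa, y)
    xa, xb = Ua.pop(0), Ub.pop(0)
    pseudo = centroid_predict(ca, xa)
    Xa.append(xa)
    Xb.append(xb)
    y.append(pseudo)
    return pseudo

# Hypothetical two-view data: view A is 2-D, view B is 1-D
Xa = [[1.0, 1.0], [-1.0, -1.0]]
Xb = [[0.9], [-0.9]]
y = ["pos", "neg"]
Ua, Ub = [[0.8, 0.9]], [[0.7]]
label = co_training_step(Xa, Xb, y, Ua, Ub)
```

Alternating which view does the labelling across rounds, and only accepting confident pseudo-labels, gives the full co-training loop; this sketch shows a single step.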