1,245 research outputs found

    Comparing Algorithms for Predictive Data Analytics

    This master's thesis consists of a theoretical part and a practical part. The theoretical part describes the basics of predictive data analytics and machine learning algorithms for classification, such as Logistic Regression, Decision Tree, Random Forest, SVM, and KNN. We also describe evaluation metrics such as Recall, Precision, Accuracy, F1 Score, Cohen's Kappa, Hamming Loss, and Jaccard Index, which are used to measure the performance of these algorithms. Additionally, we record the time taken for training and prediction to provide insight into algorithm scalability. The key part of the thesis is the practical part, which compares these algorithms using a self-implemented tool that reports the evaluation metrics on seven datasets. First, we describe the implementation of the testing application in which we measure the evaluation metric scores. We tested the algorithms on all seven datasets using Python libraries such as scikit-learn. Finally, we analyze the results and present our conclusions.
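As a rough illustration of the metrics named in this abstract, the sketch below computes several of them by hand for a binary task. It is not the thesis's tool, which relies on libraries such as scikit-learn; the function names `binary_metrics` and `cohen_kappa` and the toy label vectors are assumptions made for this example only.

```python
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1, Jaccard and Hamming loss for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
        # For single-label classification, Hamming loss is the error rate.
        "hamming_loss": 1.0 - accuracy,
    }


def cohen_kappa(y_true, y_pred):
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

Timing of training and prediction, which the abstract also mentions, could be measured by wrapping the corresponding calls with `time.perf_counter()`.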

    Probabilistic Graphical Models for ERP-Based Brain Computer Interfaces

    An event-related potential (ERP) is an electrical potential recorded from the nervous system of humans or other animals. An ERP is observed after the presentation of a stimulus. Examples of ERPs include P300 and N400. Although ERPs are used very often in neuroscience, their generation is not yet well understood, and different theories have been proposed to explain the phenomenon. ERPs could be generated by changes in the alpha rhythm, by an internal neural control that resets the ongoing oscillations in the brain, or by separate and distinct additive neuronal phenomena. When different repetitions of the same stimulus are averaged, a coherent addition of the oscillations is obtained, which explains the increase in signal amplitude. Two ERPs are studied most: N400 and P300. N400 signals arise when a subject tries to perform semantic operations that support neural circuits for explicit memory. N400 potentials have been observed mostly in the rhinal cortex. P300 signals are related to attention and memory operations. When a new stimulus appears, a P300 ERP (named P3a) is generated in the frontal lobe. In contrast, when a subject perceives an expected stimulus, a P300 ERP (named P3b) is generated in the temporal–parietal areas. This implies that P3a and P3b are related, suggesting a circuit pathway between the frontal and temporal–parietal regions whose existence has not been verified.
A brain-computer interface (BCI) is a communication system in which the messages or commands that a subject sends to the outside world come from brain signals rather than from peripheral nerves and muscles. A BCI uses sensorimotor rhythms or ERP signals, so a classifier is needed to distinguish between correct and incorrect stimuli. In this work, we propose using probabilistic graphical models to model the temporal and spatial dynamics of brain signals, with applications to BCIs. Graphical models were chosen for their flexibility and their ability to incorporate prior information. This flexibility has previously been used to model temporal dynamics only. We expect that, by including spatial and temporal information, the model will reflect some aspects of brain function related to ERPs.
Doctorate: Doctor en Ingeniería Eléctrica y Electrónic

    LightGBM: An Effective and Scalable Algorithm for Prediction of Chemical Toxicity – Application to the Tox21 and Mutagenicity Datasets

    Machine learning algorithms have attained widespread use in assessing the potential toxicities of pharmaceuticals and industrial chemicals because they are faster and cheaper than experimental bioassays. Gradient boosting is an effective algorithm that often achieves high predictivity, but historically its relatively long computational time limited its application to predicting large compound libraries or to developing in silico predictive models that require frequent retraining. LightGBM, a recent improvement of the gradient boosting algorithm, inherited its high predictivity but resolved its scalability and long computational time by adopting a leaf-wise tree growth strategy and introducing novel techniques. In this study, we compared the predictive performance and the computational time of LightGBM to those of deep neural networks, random forests, support vector machines, and XGBoost. All algorithms were rigorously evaluated on publicly available Tox21 and mutagenicity datasets using a nested 10-fold cross-validation scheme with integrated Bayesian optimization, which performs hyperparameter optimization while examining model generalizability and transferability to new data. The evaluation results demonstrated that LightGBM is an effective and highly scalable algorithm, offering the best predictive performance while requiring significantly less computational time than the other investigated algorithms across all Tox21 and mutagenicity datasets. We recommend LightGBM for applications in in silico safety assessment and in other areas of cheminformatics to meet the ever-growing demand for accurate and rapid prediction of various toxicity- or activity-related endpoints of large compound libraries in the pharmaceutical and chemical industries.

    Computational analysis of transcriptional responses to the Activin signal

    Transforming growth factor-β (TGF-β) signaling pathways play a crucial role in cell proliferation, migration, and apoptosis through the activation of Smad proteins. Research has shown that the biological effects of the TGF-β signaling pathway depend strongly on cellular context. In this thesis, I aimed to understand how TGF-β signaling can regulate target genes differently, how different gene expression dynamics are induced by the TGF-β signal, and how Smad proteins contribute to the differing expression patterns of TGF-β target genes. The study focuses on the transcriptional responses to the Nodal/Activin ligand, which is a member of the TGF-β superfamily and a key regulator of early embryonic development. Kinetic models were developed and calibrated with time course data of RNA polymerase II (Pol II) and Smad2 chromatin binding profiles for the target genes. Using the Akaike information criterion (AIC) to evaluate different kinetic models, we found that Nodal/Activin signaling regulates target genes via different mechanisms. In the Nodal/Activin-Smad2 signaling pathway, Smad2 plays different regulatory roles for different target genes. We show how Smad2 participates in regulating the transcription or degradation rate of each target gene separately. Moreover, a set of features that can predict the transcription dynamics of target genes is selected by logistic regression. The approach presented here provides quantitative relationships between transcription factor dynamics and transcriptional responses. This work also provides a general computational framework for studying the transcriptional regulation of other signaling pathways.
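The model comparison step mentioned in this abstract can be sketched with the standard definition AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L the maximized likelihood. This is only an illustration: the model names, parameter counts, and log-likelihood values below are hypothetical and do not come from the thesis.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical candidate kinetic models with invented fit results:
# name -> (maximized log-likelihood, number of fitted parameters)
candidates = {
    "smad2_regulates_transcription": (-120.0, 3),
    "smad2_regulates_transcription_and_degradation": (-118.5, 5),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best_model = min(scores, key=scores.get)
```

In this made-up case the richer model fits slightly better but is penalized for its two extra parameters, so the simpler model has the lower AIC.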

    Machine learning algorithms development for sleep cycles detection and general physical activity based on biosignals

    In this work, machine learning algorithms for automatic sleep cycle detection were developed. The features were selected based on the AASM manual, which is considered the gold standard for human technicians. They include the saturation of peripheral oxygen and features related to heart rate variation. Because sleep stages naturally differ in frequency, we balanced the classes within the dataset by either oversampling the least common sleep stages or undersampling the most common ones. This reduced the bias in favour of the most represented stages while improving classification of the worst-performing stages. To train the models we used MESA, a database containing 2056 full overnight unattended polysomnographies from a group of 2237 participants. With the goal of developing an algorithm that requires only a PPG device to accurately predict sleep stages and quality, the main channels used from this dataset were SpO2 and PPG. Using several popular Python libraries for machine learning and deep learning, we exhaustively explored the optimisation of the many parameters and hyperparameters that govern both the training and the architecture of these models so that they would better fit our purposes. As a result, we developed a neural network model (multilayer perceptron) with 80.50% accuracy, 0.7586 Cohen’s kappa, and 77.38% F1-score for five sleep stages. The performance of our algorithm does not seem to be correlated with sleep quality or with the number of transitional epochs in each recording, suggesting uniform performance regardless of the presence of sleep disorders. To test its performance in a different real-world scenario, we compared the classifications produced by a popular Android sleep stage classification app, which collected information using a smartwatch, with those of our algorithm, which used signals obtained from a device developed by PLUX. The two algorithms displayed a strong level of agreement (90.96% agreement, 0.8663 Cohen’s kappa).
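The class-balancing step described in this abstract can be illustrated with plain random resampling. This is a minimal sketch under stated assumptions, not the authors' pipeline: the helper name `rebalance`, the stage labels, and the toy epochs are invented for the example, and in practice resampling would normally be applied to the training split only.

```python
import random
from collections import Counter, defaultdict


def rebalance(samples, labels, mode="over", seed=0):
    """Randomly over- or undersample so that every class has the same count."""
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    sizes = [len(items) for items in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if mode == "over":
            # Keep all originals, then draw duplicates until the target is reached.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra
        else:
            # Keep a random subset of the target size, without replacement.
            chosen = rng.sample(items, target)
        out_samples.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_samples, out_labels


# Toy epochs with hypothetical stage labels; N2 dominates, as is typical.
labels = ["N2"] * 6 + ["REM"] * 3 + ["N1"] * 1
samples = list(range(len(labels)))

over_samples, over_labels = rebalance(samples, labels, mode="over")
under_samples, under_labels = rebalance(samples, labels, mode="under")
over_counts = Counter(over_labels)
under_counts = Counter(under_labels)
```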

    Detecting Feature Requests of Third-Party Developers through Machine Learning: A Case Study of the SAP Community

    The elicitation of requirements is central for the development of successful software products. While traditional requirement elicitation techniques such as user interviews are highly labor-intensive, data-driven elicitation techniques promise enhanced scalability through the exploitation of new data sources like app store reviews or social media posts. For enterprise software vendors, requirements elicitation remains challenging because app store reviews are scarce and vendors have no direct access to users. Against this background, we investigate whether enterprise software vendors can elicit requirements from their sponsored developer communities through data-driven techniques. Following the design science methodology, we collected data from the SAP Community and developed a supervised machine learning classifier, which automatically detects feature requests of third-party developers. Based on a manually labeled data set of 1,500 questions, our classifier reached a high accuracy of 0.819. Our findings reveal that supervised machine learning models are an effective means for the identification of feature requests

    AI-assisted patent prior art searching - feasibility study

    This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF). The RPF fund was set up to help address barriers to innovation in the UK economy

    Data Mining Methods Applied to a Digital Forensics Task for Supervised Machine Learning

    Digital forensics research includes several stages. Once the data have been collected, the final goal is to obtain a model that predicts the output for unseen data. We focus on supervised machine learning techniques. This chapter presents an experimental study on a forensics data task for multi-class classification, covering several types of methods: decision trees, Bayes classifiers, rule-based classifiers, artificial neural networks, and nearest-neighbor methods. The classifiers were evaluated with two performance measures: accuracy and Cohen’s kappa. The experimental design was a 4-fold cross-validation with thirty repetitions for non-deterministic algorithms in order to obtain reliable results, averaging the results from 120 runs. A statistical analysis was conducted to compare each pair of algorithms by means of t-tests on both the accuracy and Cohen’s kappa metrics.
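The experimental design in this abstract (4 folds, thirty repetitions, 120 runs) can be sketched as an index generator. The function name `repeated_kfold`, the seed, and the sample count of 20 are assumptions for illustration; the chapter does not say how its splits were produced.

```python
import random


def repeated_kfold(n_samples, n_folds=4, n_repeats=30, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for each repetition
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield train_idx, test_idx


# 4 folds x 30 repetitions gives the 120 train/test runs to average over.
splits = list(repeated_kfold(n_samples=20))
n_runs = len(splits)
```

Each run would train and score one classifier, and the 120 scores per algorithm would then feed the pairwise t-tests.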