7 research outputs found

    An Approach to Developing Benchmark Datasets for Protein Secondary Structure Segmentation from Cryo-EM Density Maps

    More and more deep learning approaches have been proposed to segment secondary structures from cryo-electron density maps in the medium resolution range (5–10 Å). Although these approaches show great potential, only a few small experimental data sets have been used to test them. There is limited understanding of the factors in the data that affect segmentation performance. We propose an approach to generate data sets with desired specifications for three potential factors: protein sequence identity, structural content, and data quality. The approach was implemented and used to generate a test set and various training sets to study the effect of secondary structure content and data quality on the performance of DeepSSETracer, a deep learning method that segments regions of protein secondary structure from cryo-EM map components. Results show that both the secondary structure content level and the data quality influence the segmentation performance of DeepSSETracer.

    A particle swarm based hybrid system for imbalanced medical data sampling

    Background: Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset. Results: One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods on three different metrics. The sampling results also generalize well across different types of classification algorithms, indicating the advantage of the information fusion applied in the hybrid system. Conclusion: The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly across datasets, the proposed hybrid system generalizes better, which alleviates the method-data dependency problem. From the biological perspective, the system points to highly ranked samples for further investigation, which may result in the discovery of new conditions or disease subtypes.

    Understanding risk factors in cardiac rehabilitation patients with random forests and decision trees

    Cardiac rehabilitation is a well-recognised non-pharmacological intervention recommended for the prevention of cardiovascular disease. Numerous studies have produced large amounts of data to examine the above aspects in patient groups. In this paper, datasets collected over a 10-year period by one Australian hospital are analysed using decision trees to derive prediction rules for the outcome of phase II cardiac rehabilitation. The analysis predicts the outcome of the cardiac rehabilitation program in terms of three groups of cardiovascular risk factors: physiological, psychosocial and performance risk factors. Random forests are used for feature selection to make the models compact and interpretable. Balanced sampling is used to deal with the heavily imbalanced class distribution. Experimental results show that the outcome of phase II cardiac rehabilitation in terms of physiological, psychosocial and performance risk factors can be predicted from initial readings of cholesterol level and hypertension, the level achieved in the six-minute walk test, and the Hospital Anxiety and Depression Score (HADS) anxiety and depression scores, respectively. This will allow high-risk patient groups to be identified and personalised cardiac rehabilitation programs to be developed for those patients, increasing their chances of success and minimizing their risk of failure. © 2011, Australian Computer Society, Inc.
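The abstract above does not say which form of balanced sampling was used. As a minimal sketch only, assuming plain random under-sampling of the larger class, the idea might look like the following. The function name `balanced_undersample`, the class labels, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_undersample(samples, labels, seed=0):
    """Randomly keep the same number of samples from every class.

    Each class is cut down to the size of the smallest class.
    This is an illustrative stand-in, not the paper's procedure.
    """
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    # The smallest class sets the per-class quota.
    n_min = min(len(xs) for xs in by_class.values())

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))

    rng.shuffle(balanced)
    return balanced


# Toy data (made up): 90 "success" outcomes versus 10 "failure" outcomes.
samples = list(range(100))
labels = ["success"] * 90 + ["failure"] * 10

balanced = balanced_undersample(samples, labels)
counts = {y: sum(1 for _, lab in balanced if lab == y) for y in set(labels)}
print(counts)
```

Under-sampling discards majority-class records, so on small clinical datasets over-sampling or class weighting may be preferable. Which trade-off the authors made is not stated here.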

    A new framework in improving prediction of class imbalance for student performance in Oman educational dataset using clustering based sampling techniques

    According to the Oman Education Portal (OEP), class imbalance is common in student performance data: most students perform well, while only a small number underperform. Classification techniques applied to an imbalanced dataset can yield deceptively high prediction accuracy, because the majority class drives the overall accuracy at the expense of abysmal performance on the minority class. The main objective of this study was to predict student performance under an imbalanced class distribution by exploiting different sampling techniques and several data mining classifier models. Three main sampling techniques (synthetic minority over-sampling technique (SMOTE), random under-sampling (RUS), and clustering-based sampling) were compared for their ability to improve predictive accuracy on the minority class while maintaining satisfactory overall classification performance. Five data mining classifiers (J48, Random Forest, K-Nearest Neighbour, Naïve Bayes, and Logistic Regression) were used to predict student performance. 10-fold cross-validation was used to minimize sampling bias. Classifier performance was evaluated using four metrics: accuracy, False Positive (FP), Matthews correlation coefficient (MCC), and Receiver Operating Characteristic (ROC). The OEP datasets for 2018 and 2019 were extracted to assess the efficacy of both the sampling techniques and the classification methods. The results indicated that K-Nearest Neighbour combined with the clustering-based sampling technique produced the best classification performance, with an MCC value of 98.4% under 10-fold cross-validation. The clustering-based sampling techniques improved the overall prediction performance for the minority class. In addition, the variables most important for accurately predicting student performance were identified using the Random Forest model.
OEP contains a large amount of data, and analyses based on this large and complex data can help OEP stakeholders improve student performance and identify students who require additional attention.
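To illustrate why accuracy can be deceptively high on imbalanced data while MCC is not, here is a small sketch using the standard confusion-matrix definitions of both metrics. The confusion-matrix counts are invented for illustration and are not results from the study. The minority class is treated as the positive class.

```python
import math


def accuracy_score(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc_score(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention
    for degenerate cases such as a classifier that never predicts
    the positive class.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical data: 95 majority (negative) and 5 minority (positive) students.
# Classifier A always predicts the majority class.
always_majority = dict(tp=0, fp=0, tn=95, fn=5)
# Classifier B finds 4 of the 5 minority cases at the cost of 5 false alarms.
finds_minority = dict(tp=4, fp=5, tn=90, fn=1)

for name, cm in [("always majority", always_majority),
                 ("finds minority", finds_minority)]:
    print(name, round(accuracy_score(**cm), 3), round(mcc_score(**cm), 3))
```

In this toy case classifier A scores higher on accuracy even though it never identifies a minority-class student, whereas MCC ranks classifier B higher. That is the behaviour the abstract describes.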

    Inferencia de interacciones causales génicas usando técnicas basadas en Manto de Markov [Inference of causal gene interactions using Markov blanket-based techniques]

    Understanding how genes interact within cells is an important goal in biology and medicine. This knowledge would enable precise cellular therapies that correct dysfunctions in the molecular mechanisms behind pathological conditions such as cancer [1,2]. These interactions have traditionally been studied through experiments that perturb cellular systems, which demand considerable time and labor. The usual premise for running these costly intervention experiments is that they detect causal relationships between genes unambiguously, whereas purely observational studies of cellular systems cannot reliably distinguish causal relationships from statistical correlations generated indirectly by unobserved mechanisms. In gene networks, it is necessary to distinguish the cause of an effect from the effect of a cause, since this reveals how gene regulation works in cells. However, Maathuis et al. [6] showed that causal relationships in molecular networks can be inferred from observational data on the system's components (genes) together with a data analysis methodology. This work generated interest in the topic and motivated further studies based on inferential statistics and causality. Nevertheless, the proposed methodologies build strong assumptions into their models, such as acyclicity of the interactions and Gaussianity of gene expression levels, which are biologically questionable, and they carry a high computational cost. In this context, the present project proposes an approach based on machine learning (ML).
This field studies how to build models that learn to assign objects or instances to known categories or classes, based on a set of already classified instances (training data). The idea behind using machine learning to detect causal interactions between genes is to learn the minimal differences within temporal observations of gene expression that may characterize causal behavior between genes. However, applying machine learning to high-dimensional problems such as this one commonly incurs a high computational cost, which creates the need for dimensionality reduction methods. This project proposes to investigate an approach based on the concept of the Markov blanket (MB), whose estimators have been proven theoretically optimal for detecting the set of variables causally relevant to a variable of interest.

    An Energy-Efficient Spiking CNN Implementation for Cross-Patient Epileptic Seizure Detection

    This research aims to develop a data-driven, computationally efficient strategy for automatic cross-patient seizure detection using spatiotemporal features learned from multichannel electroencephalogram (EEG) time-series data. In this approach, we use an algorithm that seeks to capture spectral, temporal, and spatial information in order to achieve high generalization. The algorithm's first step is to convert EEG signals into a series of temporal and multi-spectral images. The resulting images are then fed into a convolutional neural network (CNN) as inputs. The CNN learns a general, spatially irreducible representation of a seizure, yielding sensitivity, specificity, and accuracy comparable to state-of-the-art results. To avoid the inherently high computational cost of CNNs while benefiting from their superior classification performance, a neuromorphic computing strategy for seizure prediction, called spiking CNN, is developed from the traditional CNN method; it is motivated by the energy-efficient spiking neural networks (SNNs) of the human brain.

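    Evaluasi Metode Hierarchical Clustering Berbasis Linkage pada MWMOTE : Studi Kasus Data Akademik Universitas XYZ dan Data UCI [Evaluation of linkage-based hierarchical clustering methods in MWMOTE: a case study on Universitas XYZ academic data and UCI data]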

    Data imbalance occurs in many kinds of data, including the academic data of Universitas XYZ and UCI data. It leads to misclassification because the majority class dominates the minority class, which lowers accuracy. The MWMOTE method is one option for handling imbalance, through weighting and clustering. This study aims to address the imbalanced academic dataset of the 2014 and 2015 cohorts at Universitas XYZ, together with UCI data, by evaluating hierarchical clustering. To that end, three hierarchical clustering methods are evaluated as a sub-process of MWMOTE to generate more representative synthetic data. The results show that the three AHC methods do not differ significantly in improving the accuracy of MWMOTE on the academic data and the 7 UCI datasets, as tested with one-way ANOVA with a sig/alpha value > 0.0