20 research outputs found

    Large-Scale Automatic Feature Selection for Biomarker Discovery in High-Dimensional OMICs Data

    Biomarker signatures are identified in omics molecular profiles to predict outcomes in precision medicine, such as a patient's disease susceptibility, diagnosis, prognosis, and treatment response. To identify these signatures, we have developed a biomarker discovery tool called BioDiscML. From a collection of samples and their associated characteristics, i.e., the biomarkers (e.g., gene expression, protein levels, clinico-pathological data), BioDiscML applies various feature selection procedures to produce signatures, each associated with a machine learning model that efficiently predicts a specified outcome. To this end, BioDiscML uses a large variety of machine learning algorithms to select the best combination of biomarkers for predicting categorical or continuous outcomes from highly unbalanced datasets. The software automates all machine learning steps, including data pre-processing, feature selection, model selection, and performance evaluation. BioDiscML is delivered as a stand-alone program and is available for download at https://github.com/mickaelleclercq/BioDiscML

    Causal Discovery from Temporal Data: An Overview and New Perspectives

    Temporal data, which records chronological observations of complex systems, is a common data structure generated in many domains, such as industry, medicine, and finance. Analyzing this type of data is extremely valuable for various applications. Accordingly, different temporal data analysis tasks, e.g., classification, clustering, and prediction, have been proposed in the past decades. Among them, causal discovery, i.e., learning the causal relations from temporal data, is considered an interesting yet critical task and has attracted much research attention. Existing causal discovery works can be divided into two highly correlated categories according to whether the temporal data is calibrated, i.e., multivariate time series causal discovery and event sequence causal discovery. However, most previous surveys focus only on time series causal discovery and ignore the second category. In this paper, we specify the correlation between the two categories and provide a systematic overview of existing solutions. Furthermore, we provide public datasets, evaluation metrics, and new perspectives for temporal data causal discovery. Comment: 52 pages, 6 figures

    Hierarchical ensemble methods for protein function prediction

    Protein function prediction is a complex multiclass, multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high-dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of unambiguously determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term, and the resulting predictions are then assembled into a “consensus” ensemble decision that takes into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Finally, open problems in this exciting research area of computational biology are considered, outlining novel perspectives for future research.

    High-Performance Modelling and Simulation for Big Data Applications

    This open access book was prepared as the Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)” project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. As their level of abstraction rises to give better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication, and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications.

    Development of a Smart Chair Sensors System and Classification of Sitting Postures with Deep Learning Algorithms

    In modern societies, a sedentary lifestyle is almost inevitable for a majority of the population. Long hours of sitting, especially in poor postures, may result in health complications. A smart chair capable of identifying sitting postures can help reduce the health risks induced by a modern lifestyle. This paper presents the design, realization, and evaluation of a new smart chair sensor system capable of identifying sitting postures. The system consists of eight pressure sensors placed on the chair's sitting cushion and backrest. A signal acquisition board was designed from scratch to acquire the data generated by the pressure sensors and transmit them via a Wi-Fi network to a purpose-built graphical user interface, which monitors and stores the acquired sensor data on a computer. The system was tested in an extensive sitting experiment involving 40 subjects, and the acquired data were used to classify each sitting posture as one of eight possible postures. The performance of seven deep-learning algorithms was assessed on this task. The best accuracy, 91.68%, was achieved by an echo memory network model. The designed smart chair sensor system is simple, versatile, low cost, and accurate, and it can easily be deployed in several smart chair environments, in both public and private contexts.

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency with which each amino acid is replaced by another and on the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability favors secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Shape analysis of the human brain.

    Autism is a complex developmental disability that has dramatically increased in prevalence, with a decisive impact on the health and behavior of children. Methods used to detect autism and recommend therapies have been much debated in the medical community because of the subjective nature of the diagnosis. To provide an alternative method for understanding autism, the current work has developed a state-of-the-art 3-dimensional shape-based analysis of the human brain to aid in creating more accurate diagnostic assessments and guided risk analyses for individuals with neurological conditions, such as autism. Methods: The aim of this work was to assess whether the shape of the human brain can be used as a reliable source of information for determining whether an individual will be diagnosed with autism. The study was conducted using multi-center databases of magnetic resonance images of the human brain. The subjects in the databases were analyzed using a series of algorithms consisting of bias correction, skull stripping, multi-label brain segmentation, 3-dimensional mesh construction, spherical harmonic decomposition, registration, and classification. The software algorithms were developed as an original contribution of this dissertation in collaboration with the BioImaging Laboratory at the University of Louisville Speed School of Engineering. The classification of each subject was used to construct diagnoses and therapeutic risk assessments for each patient. Results: A reliable metric for making neurological diagnoses and constructing therapeutic risk assessments for individuals has been identified. The metric was explored in populations of individuals having autism spectrum disorders, dyslexia, Alzheimer's disease, and lung cancer. Conclusion: Currently, the clinical applicability and benefits of the proposed software approach are being discussed by the broader community of doctors, therapists, and parents for use in improving current methods by which autism spectrum disorders are diagnosed and understood.

    Analyse de séries temporelles d’images à moyenne résolution spatiale : reconstruction de profils de LAI, démélangeage : application pour le suivi de la végétation sur des images MODIS

    This PhD dissertation is concerned with time series analysis for medium spatial resolution (MSR) remote sensing images. The main advantage of MSR data is their high temporal rate, which makes it possible to monitor land use. However, two main problems arise with such data. First, because of cloud coverage and bad acquisition conditions, the resulting time series are often corrupted and not directly exploitable. Second, pixels in medium spatial resolution images are often “mixed” in the sense that the spectral response is a combination of the responses of “pure” elements. These two problems are addressed in this PhD. First, we propose a data assimilation technique able to recover consistent time series of Leaf Area Index from corrupted MODIS sequences. To this end, a plant growth model, namely GreenLab, is used as a dynamical constraint. Second, we propose a new and efficient unmixing technique for time series. It is based in particular on the use of “elastic” kernels able to properly compare time series shifted in time or of various lengths. Experimental results are shown on both synthetic and real data and demonstrate the efficiency of the proposed methodologies.
    (Translated from the French abstract:) This thesis addresses the analysis of time series of medium spatial resolution satellite images. The main value of such data is their high revisit rate, which allows land use to be analyzed. However, two main problems remain with such data. First, because of cloud cover, poor acquisition conditions, and so on, these data are often very noisy. Second, the pixels associated with medium spatial resolution are often “mixed” in that their spectral response is a combination of the responses of several “pure” elements. These two problems are addressed in this thesis. First, we propose a data assimilation technique capable of recovering consistent time series of LAI (Leaf Area Index) from noisy MODIS image sequences. For this purpose, the GreenLab plant growth model is used. Second, we propose an original unmixing technique, which relies in particular on “elastic” kernels able to handle the specific features of time series (series of different lengths, shifted in time, ...). The experimental results, on synthetic and real data, show that the proposed methodologies perform well.