14 research outputs found

    Aircraft Numerical "Twin": A Time Series Regression Competition

    This paper presents the design and analysis of a data science competition on a time series regression problem using aeronautics data. For predictive maintenance, aviation companies seek to create aircraft "numerical twins": programs capable of accurately predicting strains at strategic positions in various body parts of the aircraft. Given a number of input parameters (sensor data) recorded in sequence during the flight, the competition participants had to predict output values (gauges), which are also recorded sequentially during test flights but not during regular flights. The competition data included hundreds of complete flights. It was a code submission competition with complete blind testing of algorithms. The results indicate that such a problem can be effectively solved with gradient boosted trees, after preprocessing and feature engineering. Deep learning methods did not prove as efficient.
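
The abstract does not say which features the participants engineered. As a minimal sketch of one common option, assuming sliding-window statistics over a single sensor channel, the snippet below builds one feature row per time step; such rows could then be passed to a gradient boosted tree regressor. The function name `window_features` and the chosen statistics (mean and range) are illustrative assumptions, not the competition's actual pipeline.

```python
from typing import List, Sequence


def window_features(series: Sequence[float], width: int) -> List[List[float]]:
    """Build one feature row per time step t >= width - 1.

    Each row holds the `width` most recent values (including the current
    one), followed by their mean and their range (max - min).
    Hypothetical helper for illustration only.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    rows = []
    for t in range(width - 1, len(series)):
        window = list(series[t - width + 1:t + 1])
        mean = sum(window) / width
        spread = max(window) - min(window)
        rows.append(window + [mean, spread])
    return rows


# Toy sensor channel; real flights would have many channels and time steps.
sensor = [1.0, 2.0, 3.0, 4.0, 5.0]
features = window_features(sensor, width=2)
```

Each row aligns with the gauge value at the same time step, so the first `width - 1` targets would be dropped (or padded) before training.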

    Algorithmes ab initio pour l'identification et la classification des ARNs non-codants [Ab initio algorithms for the identification and classification of non-coding RNAs]

    Identifying non-coding RNAs (ncRNAs) improves our understanding of biology. The biological functions of a majority of ncRNA classes are known, but other classes remain to be discovered. Identifying and classifying ncRNAs with computational methods is not a trivial task: the relevant features for each class rely on multiple heterogeneous sources of data (sequence, secondary structure, interaction with other biological components, etc.) and require appropriate methods. During this thesis, we developed methods based on Self-Organizing Maps (SOMs). A SOM is used to analyze and represent ncRNAs as a map of clusters in which the topology of the data is preserved. We proposed a new SOM algorithm, called MSSOM, that can integrate multiple data sources composed of numerical data or complex data (represented by kernels). MSSOM computes a SOM for each data source, combines them with a final SOM, and learns the best combination of sources (the weight of each source) at the cluster level. We also developed a supervised variant of SOM with rejection, called SLSOM. SLSOM classifies the known classes using a multilayer perceptron and the output of a SOM. The rejection option associated with the output layer makes it possible to reject unreliable predictions and to identify potential new classes. These methods led to the development of two new bioinformatics tools. The first, called IRSOM, applies a variant of SLSOM to the discrimination of coding and non-coding RNAs. It was evaluated on a wide range of species from different kingdoms (plants, animals, bacteria and fungi). Using a simple set of sequence features, we showed that IRSOM separates coding and non-coding RNAs efficiently. With the SOM visualization and the rejection option, we also highlighted and analyzed some ambiguous RNAs in human. The second tool, called CRSOM, classifies ncRNAs into subclasses. CRSOM is a combination of MSSOM and SLSOM and integrates two data sources: the sequence k-mer frequencies and a Gaussian kernel on secondary structure that uses the edit distance. We showed that CRSOM gives results comparable to the reference tool (nRC) without rejection, and better results with the rejection option.
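
The two data sources named for CRSOM, sequence k-mer frequencies and a Gaussian kernel built on the edit distance, can be illustrated with a minimal sketch. It assumes an RNA alphabet of `ACGU`, secondary structures written as dot-bracket strings, and a kernel of the form exp(-d²/(2σ²)); the function names and the `sigma` parameter are hypothetical and are not taken from the thesis.

```python
from itertools import product
from math import exp
from typing import Dict


def kmer_frequencies(seq: str, k: int, alphabet: str = "ACGU") -> Dict[str, float]:
    """Relative frequency of every k-mer over `alphabet` in `seq`.

    k-mers containing characters outside the alphabet are skipped.
    """
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def gaussian_edit_kernel(a: str, b: str, sigma: float = 1.0) -> float:
    """Similarity in (0, 1] that decays with the squared edit distance (assumed form)."""
    d = edit_distance(a, b)
    return exp(-(d * d) / (2.0 * sigma * sigma))


freqs = kmer_frequencies("ACGUAC", k=2)
similarity = gaussian_edit_kernel("((..))", "((.))")
```

The frequency vector has a fixed length of 4^k regardless of sequence length, which is what allows sequences of different sizes to be compared on the same map.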

    Algorithmes pour la prédiction et la classification ab initio et à grande échelle des ARNs non-codants [Algorithms for the ab initio, large-scale prediction and classification of non-coding RNAs]

    The analysis of the very large volumes of data generated by next-generation sequencing (NGS) requires efficient bioinformatics tools. One aspect of this analysis is the identification of non-coding RNAs (ncRNAs), which play important roles in many biological processes. Identifying ncRNAs with computational tools raises two challenges: (i) the ab initio prediction and classification of different types of ncRNAs, and (ii) the large-scale processing of these data. Most existing tools for ncRNA prediction are specialized for one type of ncRNA, the largest number being dedicated to microRNAs (miRNAs). This is notably the case for the tools that we developed previously, which are available on our software platform EvryRNA (http://EvryRNA.ibisc.univ-evry.fr). Some tools in the literature can also determine other types of ncRNAs by comparison with sequences listed in various ncRNA databases (homology-based approach). In addition, there are tools that predict different types of ncRNAs, but either without classification or with homology-based classification. The very few ab initio methods, all published very recently, remain insufficient in terms of prediction and running time. The goal of this project is to develop an ab initio algorithm for predicting and classifying several classes of ncRNAs at a large scale from NGS data, using both combinatorial optimization and machine learning methods, and considering different types of ncRNA features: sequence, secondary structure, genomic position, neighborhood, etc. One of the principal characteristics of ncRNAs is their structure, notably the secondary structure when it exists and is known. It is therefore important to take structure into account in ncRNA prediction algorithms, and the challenge is to develop algorithms fast enough to handle huge volumes of NGS data. The developed algorithms will be applied to the identification of ncRNAs involved in sex determination in plants, particularly in cucurbits (melon, cucumber, …), for which large volumes of data are available at IPS2.

    Self-Organizing Maps with supervised layer

    In this paper we present a new approach to the supervised self-organizing map (SOM). We added a supervised perceptron layer to the classical SOM approach. This combination allows new patterns to be classified by taking into account all the map prototypes without changing the SOM organization. We also propose to associate two reject options with our supervised SOM. This improves the reliability of the results and makes it possible to discover new classes in applications where some classes are unknown. We obtain two variants of supervised SOM with rejection, which have been evaluated on different datasets. The results indicate that our approaches are competitive with the most popular supervised learning algorithms, such as support vector machines and random forests.

    IRSOM, a reliable identifier of ncRNAs based on supervised Self-Organizing Maps with rejection

    Motivation: Non-coding RNAs play important roles in many biological processes and are involved in many diseases. Their identification is an important task, and many tools exist in the literature for this purpose. However, almost all of them focus on the discrimination of coding and non-coding RNAs without giving more biological insight. In this paper, we propose a new reliable method called IRSOM, based on a supervised Self-Organizing Map (SOM) with a rejection option, that overcomes these limitations. The rejection option in IRSOM improves the accuracy of the method and also allows the identification of ambiguous transcripts. Furthermore, with the visualization of the SOM, we analyse the rejected predictions and highlight the ambiguity of the transcripts. Results: IRSOM was tested on datasets of several species from different kingdoms, and showed better results than state-of-the-art tools. The accuracy of IRSOM is always greater than 0.95 for all the species, with an average specificity of 0.98 and an average sensitivity of 0.99. Besides, IRSOM is fast (it takes around 254 seconds to analyse a dataset of 147 000 transcripts) and is able to handle very large datasets. Availability and Implementation: IRSOM is implemented in Python and C++. It is available on our software platform EvryRNA (http://EvryRNA.ibisc.univ-evry.fr).

    IRSOM, a reliable identifier of ncRNAs based on supervised self-organizing maps with rejection

    Motivation: Non-coding RNAs (ncRNAs) play important roles in many biological processes and are involved in many diseases. Their identification is an important task, and many tools exist in the literature for this purpose. However, almost all of them focus on the discrimination of coding RNAs and ncRNAs without giving more biological insight. In this paper, we propose a new reliable method called IRSOM, based on a supervised Self-Organizing Map (SOM) with a rejection option, that overcomes these limitations. The rejection option in IRSOM improves the accuracy of the method and also allows the identification of ambiguous transcripts. Furthermore, with the visualization of the SOM, we analyze the rejected predictions and highlight the ambiguity of the transcripts. Results: IRSOM was tested on datasets of several species from different kingdoms, and showed better results than state-of-the-art tools. The accuracy of IRSOM is always greater than 0.95 for all the species, with an average specificity of 0.98 and an average sensitivity of 0.99. Besides, IRSOM is fast (it takes around 254 s to analyze a dataset of 147 000 transcripts) and is able to handle very large datasets.

    IRSOM2: a web server for predicting bifunctional RNAs

    Recent advances have shown that some biologically active non-coding RNAs (ncRNAs) are actually translated into polypeptides that have a physiological function as well. This paradigm shift requires adapted computational methods to predict this new class of ‘bifunctional RNAs’. Previously, we developed IRSOM, an open-source algorithm to classify non-coding and coding RNAs. Here, we use the binary statistical model of IRSOM as a ternary classifier, called IRSOM2, to identify bifunctional RNAs as a rejection of the two other classes. We present its easy-to-use web interface, which allows users to perform predictions on large datasets of RNA sequences in a short time, to re-train the model with their own data, and to visualize and analyze the classification results thanks to the implementation of self-organizing maps (SOM). We also propose a new benchmark of experimentally validated RNAs that play both protein-coding and non-coding roles, in different organisms. Thus, IRSOM2 showed promising performance in detecting these bifunctional transcripts among ncRNAs of different types, such as circRNAs and lncRNAs (in particular those of shorter lengths). The web server is freely available on the EvryRNA platform: https://evryrna.ibisc.univ-evry.fr
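
The idea of turning a binary model into a ternary classifier through rejection can be sketched as follows. The abstract does not give the actual rejection rule used by IRSOM2, so this example assumes a simple confidence threshold on the predicted probability of the coding class; `ternary_label`, `p_coding` and `threshold` are hypothetical names.

```python
def ternary_label(p_coding: float, threshold: float = 0.8) -> str:
    """Map a binary coding probability to one of three labels.

    A transcript is labelled "coding" or "non-coding" only when the
    binary model is confident enough; otherwise the prediction is
    rejected and the transcript is reported as "bifunctional".
    The threshold (assumed > 0.5) is an illustrative value.
    """
    if not 0.0 <= p_coding <= 1.0:
        raise ValueError("p_coding must lie in [0, 1]")
    if p_coding >= threshold:
        return "coding"
    if 1.0 - p_coding >= threshold:
        return "non-coding"
    return "bifunctional"


# Three toy probabilities: confident coding, ambiguous, confident non-coding.
labels = [ternary_label(p) for p in (0.95, 0.55, 0.10)]
```

Raising the threshold widens the rejection band, so more transcripts are flagged as candidates for the bifunctional class at the cost of fewer confident coding or non-coding calls.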

    A computational approach for phenotypic comparisons of cell populations in high-dimensional cytometry data

    Background: Cytometry is an experimental technique used to measure molecules expressed by cells at single-cell resolution. Recently, several technological improvements have made it possible to greatly increase the number of cell markers that can be measured simultaneously. Many computational methods have been proposed to identify clusters of cells having similar phenotypes. Nevertheless, only a limited number of computational methods allow the phenotypes of cell clusters identified by different clustering approaches to be compared. These phenotypic comparisons are necessary to choose the appropriate clustering methods and settings. Because of this lack of tools, comparisons of cell cluster phenotypes are often performed manually, a highly biased and time-consuming process. Results: We designed CytoCompare, an R package that compares the phenotypes of cell clusters in order to identify similar and different ones, based on the distribution of marker expressions. For each phenotype comparison of two cell clusters, CytoCompare provides a distance measure as well as a p-value asserting the statistical significance of the difference. CytoCompare can import clustering results from various algorithms including SPADE, viSNE/ACCENSE, and Citrus, currently the most widely used algorithms. Additionally, CytoCompare can generate parallel coordinates, parallel heatmaps, multidimensional scaling or circular graph representations to easily visualize cell cluster phenotypes and the comparison results. Conclusions: CytoCompare is a flexible analysis pipeline for comparing the phenotypes of cell clusters identified by automatic gating algorithms in high-dimensional cytometry data. This R package is ideal for benchmarking different clustering algorithms and associated parameters.