99 research outputs found

    A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

    We design and develop a novel clustering algorithm to capture utility information in transactional data. Transactional data is a special type of categorical data in which transactions can be of varying length. A key objective of all categorical data analysis is pattern recognition. Transactional clustering algorithms therefore focus on capturing, within the clusters, information about high-frequency patterns in the data. Recently, utility information for the category types in the data has been added to the transactional data model for a more realistic representation of the data. As a result, the key information of interest has become high-utility patterns rather than high-frequency patterns. To the best of our knowledge, no existing clustering algorithm for transactional data captures utility information in the clusters it finds. Along with our new clustering rationale, we also develop corresponding metrics for evaluating the quality of the clusters found. Experiments on real datasets show that the clusters found by our algorithm successfully capture the high-utility patterns in the data. Comparative experiments with other clustering algorithms further illustrate the effectiveness of our algorithm.
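
The difference between a high-frequency and a high-utility pattern can be illustrated with a minimal sketch. The abstract does not define its utility measure, so the code below assumes the definition commonly used in high-utility itemset mining (purchased quantity times a per-unit utility, summed over the transactions that contain the whole itemset). The data, the item names, and the functions `support` and `utility` are hypothetical and are not the authors' algorithm.

```python
from typing import Dict, FrozenSet, List

# Hypothetical transactions: each maps an item to its purchased quantity.
transactions: List[Dict[str, int]] = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "caviar": 1},
    {"milk": 3},
    {"bread": 1, "milk": 2, "caviar": 1},
]

# Hypothetical per-unit utility (for example, profit) of each item.
unit_utility: Dict[str, float] = {"bread": 1.0, "milk": 2.0, "caviar": 50.0}


def support(itemset: FrozenSet[str], db: List[Dict[str, int]]) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset.issubset(t))


def utility(itemset: FrozenSet[str], db: List[Dict[str, int]],
            unit: Dict[str, float]) -> float:
    """Assumed definition: quantity * unit utility, summed over the itemset's
    items in every transaction that contains the whole itemset."""
    total = 0.0
    for t in db:
        if itemset.issubset(t):
            total += sum(t[item] * unit[item] for item in itemset)
    return total


staples = frozenset({"bread", "milk"})
luxury = frozenset({"bread", "caviar"})
for name, itemset in (("staples", staples), ("luxury", luxury)):
    print(name, support(itemset, transactions),
          utility(itemset, transactions, unit_utility))
```

In this toy data both itemsets occur in two transactions, so a frequency-based view cannot separate them, while their utilities differ by an order of magnitude.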

    Contributions to outlier detection and recommendation systems

    Data mining, also known as "knowledge discovery in databases", is a young interdisciplinary research field. It studies the processes of analyzing large datasets to extract knowledge, and of transforming that knowledge into structures that are easy for humans to understand and use. This thesis studies two important data mining tasks: outlier detection and product recommendation. Outlier detection is the identification of data that do not conform to normal observations. Product recommendation is the prediction of a customer's level of interest in products, based on past purchase data and socio-economic data. More precisely, this thesis addresses 1) outlier detection in large categorical datasets; and 2) recommendation techniques for skewed rating data. Outlier detection in large-scale categorical data is an important problem that is far from solved. Existing methods in this area suffer from low efficiency and effectiveness because of the high dimensionality of the data, the large size of the databases, the high complexity of the statistical tests, and inadequate proximity measures. This thesis proposes a formal definition of an outlier in categorical data, together with two effective and efficient algorithms for detecting outliers in large datasets. These algorithms require a single parameter: the number of outliers. To determine the value of this parameter, we developed a criterion based on a new concept, holo-entropy.
Much previous research on recommender systems has neglected a type of rating that is widespread in Web applications such as e-commerce (e.g. Amazon, Taobao) and content provider sites (e.g. YouTube). The rating data collected by these sites differ from movie and music rating data in their highly skewed distribution. This thesis proposes a better-suited framework for estimating ratings and higher-order quantitative preferences for skewed rating data. The framework makes it possible to build new recommendation models based on matrix factorization or on neighborhood estimation. Experimental results on skewed datasets indicate that models built with this framework outperform conventional models not only in rating prediction but also in Top-N product list prediction.

    Applications of Molecular Dynamics simulations for biomolecular systems and improvements to density-based clustering in the analysis

    Molecular Dynamics simulations provide a powerful tool to study biomolecular systems in atomistic detail. The key to better understanding the function and behaviour of these molecules can often be found in their structural variability. Simulations can help to expose this information, which is otherwise hard or impossible to obtain experimentally. This work covers two application examples in which sampling and characterising the conformational ensemble revealed the structural basis needed to answer a topical research question. For the fungal toxin phalloidin, a small bicyclic peptide, observed product ratios in different cyclisation reactions could be rationalised by assessing the conformational pre-organisation of precursor fragments. For the C-type lectin receptor langerin, conformational changes induced by different side-chain protonations could explain the pH dependency of the protein's calcium binding. The investigations were accompanied by the continued development of a density-based clustering protocol into a corresponding software package, which is broadly applicable to extracting conformational states from Molecular Dynamics data.

    An entropy-based uncertainty measure for developing granular models

    There are two main ways to construct Fuzzy Logic rule-based models: using expert knowledge and using data mining methods. One of the most important aspects of Granular Computing (GrC) is to discover and extract knowledge from raw data in the form of information granules. The knowledge gained from GrC, the information granules, can be used to construct the linguistic rule bases of a Fuzzy Logic based system. Algorithms for iterative data granulation in the literature have so far not accounted for data uncertainty during the granulation process. In this paper, the uncertainty during the data granulation process is captured using a fundamental concept of information theory, entropy. In the proposed GrC algorithm, data granules are defined as information objects, so the entropy measure used in this work captures the uncertainty in the data vectors that results from merging information granules. The entropy-based uncertainty measure is used to guide the iterative granulation process, promoting the formation of new granules with less uncertainty. The enhanced information granules are then translated into a Fuzzy Logic inference system. The effectiveness of the proposed approach is demonstrated using established datasets.

    Identification des régimes et regroupement des séquences pour la prévision des marchés financiers (Regime identification and sequence clustering for financial market forecasting)

    Regime switching analysis is extensively advocated for capturing the complex behaviors underlying financial time series for market prediction. Two main disadvantages of current regime identification approaches are raised in the literature: 1) the lack of a mechanism for identifying regimes dynamically, which restricts them to switching among a fixed set of regimes with a static transition probability matrix; 2) the failure to use cross-sectional regime dependencies among time series, since not all time series are synchronized to the same regime. Because numerical time series can be symbolized into categorical sequences, a third issue arises: 3) the lack of a meaningful and effective measure of similarity between chronologically dependent categorical values for identifying sequence clusters that could serve as regimes for market forecasting. In this thesis, to address the first issue, we propose a dynamic regime identification model that identifies regimes dynamically with a time-varying transition probability. For the second issue, we propose a cluster-based regime identification model that accounts for the cross-sectional regime dependencies underlying financial time series for market forecasting. For the last issue, we develop a dynamic order Markov model that uses the information underlying frequent consecutive patterns and sparse patterns to identify clusters that could serve as regimes on categorized financial time series. Experiments on synthetic and real-world datasets show that our two regime models perform well on both regime identification and forecasting, and that our dynamic order Markov clustering model also performs well at identifying clusters in categorical sequences.

    Clustering for 2D chemical structures

    The clustering of chemical structures is important and widely used in several areas of chemoinformatics. A little-discussed aspect of clustering is standardization, which ensures that all descriptors in a chemical representation make a comparable contribution to the measurement of similarity. The initial study compares the effectiveness of seven standardization procedures that have been suggested previously; the results were also compared with unstandardized datasets. No single standardization method consistently offered the best performance. Comparative studies of clustering effectiveness help to establish the suitability of different methods and guidelines for their use. To examine the suitability of different clustering methods for chemoinformatics, especially those that had not previously been applied in the field, the second study compares the effectiveness of nine clustering methods. The results revealed that a single clustering method is unlikely to provide consistently the best partition under all circumstances. Consensus clustering is a technique that combines multiple input partitions of the same set of objects into a single clustering, which is expected to provide a more robust and more generally effective representation of the partitions submitted. The third study reports the use of seven consensus clustering methods that had not previously been used on sets of chemical compounds represented by 2D fingerprints. Their effectiveness was compared with that of some of the traditional clustering methods discussed in the second study. No consensus clustering method was found to be consistently best.
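
The idea of combining several input partitions into one clustering can be sketched with a co-association (evidence accumulation) scheme. The abstract does not name the seven consensus methods it evaluates, so this is only one generic illustration and not the study's procedure. The function names, the example partitions, and the 0.5 threshold are all assumptions.

```python
from typing import Dict, List


def co_association(partitions: List[List[int]]) -> List[List[float]]:
    """Fraction of input partitions that place objects i and j in the same cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
            for i in range(n)]


def consensus(partitions: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Link objects whose co-association exceeds the threshold and return the
    connected components as consensus clusters (an assumed, simple rule)."""
    ca = co_association(partitions)
    n = len(ca)
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if ca[i][j] > threshold:
                parent[find(j)] = find(i)

    # Relabel the component roots as 0, 1, 2, ... in order of first appearance.
    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Three hypothetical partitions of the same five objects.
inputs = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
print(consensus(inputs))
```

Here the third object is grouped with the last two because two of the three input partitions agree on that placement, even though the partitions use different label values.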

    Current state-of-the-art of the research conducted in mapping protein cavities – binding sites of bioactive compounds, peptides or other proteins

    The aim of this thesis was to report on the current state of the art of research on mapping protein cavities with a potential functional role as binding sites of bioactive compounds, peptides or other proteins. A literature review was performed, with emphasis on the relevant tools developed during the last decade. In addition, the main research findings regarding drug design and druggable targets based on binding sites are presented. The processes used for protein cavity detection and analysis in previous research articles are compared with the approach described by Anaxagoras Fotopoulos and Athanasios Papathanasiou (2015). The results showed that a competitive advantage of their approach is the multidimensional k-means algorithm for clustering. The bibliographic review drew on the scientific literature, including articles in international journals, book chapters, and online articles on drug design and protein cavities. Search keywords such as protein cavity dynamics, catalytic sites of enzymes, protein pocket, etc. were used to identify bioinformatics tools by text mining. A catalogue of the most recently developed tools is presented, followed by a brief description of selected tools. The selection criteria for the catalogue and the detailed descriptions were the publication date of the tools and the algorithms and methods they use. The tools were then classified according to the search keywords. The findings of this research are discussed, and the algorithms and methods the tools use are compared, highlighting their advantages for protein cavity detection.

    Deep execution monitor for robot assistive tasks

    We consider a novel approach to high-level task execution for a robot assistive task. In this work we explore the problem of learning to predict the next subtask by introducing a deep model both for sequencing goals and for visually evaluating the state of a task. We show that deep learning for monitoring robot task execution supports the interconnection between task-level planning and robot operations well. These solutions can also cope with the natural non-determinism of the execution monitor. We show that a deep execution monitor improves robot performance, and we measure the improvement on a set of robot helping tasks performed at a warehouse.

    Hiding in plain sight, Ficus desertorum (Moraceae), a new species of rock fig for Central Australia

    A new species of lithophytic fig, Ficus desertorum B.C.Wilde & R.L.Barrett, endemic to arid Central Australia, is described and illustrated. It is distinguished from other species in Ficus section Malvanthera Corner by having stiff lanceolate, dark green, discolorous leaves; many parallel, often obscure lateral veins; petioles that are continuous with the midrib; with minute, usually white hairs and non- or slightly sunken intercostal regions on the lower surface. Previously included under broad concepts of either Ficus platypoda (Miq.) Miq. or Ficus brachypoda (Miq.) Miq., this species has a scattered distribution throughout Central Australia on rocky outcrops, jump-ups (mesas) and around waterholes. This culturally significant plant, colloquially referred to as the desert fig, grows on elevated landscapes in Central Australia, including Uluru (Ayers Rock), Kata Tjuta (The Olgas) and Karlu Karlu (Devils Marbles), three of Central Australia's best-known natural landmarks. Evidence is provided to show that these plants are geographically and morphologically distinct from Ficus brachypoda, justifying the recognition of F. desertorum as a new species. Taxonomic issues with F. brachypoda and F. atricha D.J.Dixon are also discussed. Lectotypes are selected for Urostigma platypodum forma glabrior Miq. and Ficus platypoda var. minor Benth.