16 research outputs found

    Constraint-based sequence mining using constraint programming

    Full text link
    The goal of constraint-based sequence mining is to find sequences of symbols that are included in a large number of input sequences and that satisfy constraints specified by the user. Many constraints have been proposed in the literature, but a general framework is still missing. We investigate the use of constraint programming as a general framework for this task. We first identify four categories of constraints that are applicable to sequence mining. We then propose two constraint programming formulations. The first formulation introduces a new global constraint called exists-embedding. This formulation is the most efficient but does not support one type of constraint. To support such constraints, we develop a second formulation that is more general but incurs more overhead. Both formulations can use the projected database technique used in specialised algorithms. Experiments demonstrate the flexibility of the approach in constraint-based settings and compare it to existing methods. Comment: In Integration of AI and OR Techniques in Constraint Programming (CPAIOR), 201
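
The exists-embedding constraint holds when a candidate pattern can be embedded as a subsequence of an input sequence. As an illustrative sketch (not the paper's actual constraint programming formulation), a backtracking membership test with an optional maximum-gap constraint, and the resulting support count over a small database, might look like:

```python
def exists_embedding(pattern, sequence, max_gap=None):
    """Return True if `pattern` occurs as a subsequence (an "embedding")
    of `sequence`, optionally bounding the gap between the positions of
    consecutive matched symbols (a common sequence-mining constraint)."""
    def search(pi, prev):
        if pi == len(pattern):                     # every pattern symbol matched
            return True
        lo = prev + 1
        hi = len(sequence)
        if max_gap is not None and prev >= 0:
            hi = min(hi, prev + 1 + max_gap)       # enforce the gap constraint
        for i in range(lo, hi):
            # backtracking search over all possible embeddings
            if sequence[i] == pattern[pi] and search(pi + 1, i):
                return True
        return False
    return search(0, -1)

# Support of a pattern = number of database sequences that embed it.
database = ["abcab", "acb", "bca"]
support = sum(exists_embedding("ab", s) for s in database)
```

A greedy leftmost match is not enough once a gap constraint is present, which is why the sketch backtracks over alternative occurrences.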

    Research on improving usability and applications in data mining

    Get PDF
    System: new; Ministry of Education report number: Kou 2574; Degree type: Doctor of Engineering; Date conferred: 2008/3/15; Waseda University degree number: Shin 473

    A Novel Approach to Knowledge Discovery and Representation in Biological Databases.

    Get PDF
    Extraction of motifs from biological sequences is among the frontier research issues in bioinformatics, with sequential pattern mining becoming one of the most important computational techniques in this area. A number of applications motivate the search for more structured patterns, and concurrent protein motif mining is considered here. This paper builds on the concept of structural relation patterns and applies the Concurrent Sequential Patterns (ConSP) mining approach to biological databases. Specifically, an original method is presented using support vectors as the data structure for the extraction of novel patterns in protein sequences. Data modelling is pursued to represent the more interesting concurrent patterns visually. Experiments with real-world protein datasets from the UniProt and NCBI databases highlight the applicability of the ConSP methodology in protein data mining and modelling. The results show the potential for knowledge discovery in the field of protein structure identification. A pilot experiment extends the methodology to DNA sequences to indicate a future direction.

    Can we Take Advantage of Time-Interval Pattern Mining to Model Students' Activity?

    Get PDF
    Analyzing students' activities in their learning process is an issue that has received significant attention in the educational data mining research field. Many approaches have been proposed, including the popular sequential pattern mining. However, the vast majority of these works do not focus on the time of occurrence of the events within the activities. This paper relies on the hypothesis that we can get a better understanding of students' activities, and design more accurate models, if time is taken into account. With this in mind, we propose to study time-interval patterns. To highlight the benefits of managing time, we analyze data collected from 113 first-year university students interacting with their LMS. Experiments reveal that frequent time-interval patterns are indeed identified, which means that some students' activities are regulated not only by the order of learning resources but also by time. In addition, the experiments emphasize that the set of intervals chosen highly influences the patterns mined, and that the set of intervals representing human natural time (minute, hour, day, etc.) seems to be the most appropriate one to represent the time gap between resources. Finally, we show that time-interval pattern mining brings additional information compared to sequential pattern mining. Indeed, not only is the view of students' possible future activities less uncertain (in terms of learning resources and their temporal gap), but also, as soon as two students differ in their time-intervals, this difference indicates that their following activities are likely to diverge.
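
To illustrate what a time-interval representation of activity might look like, the sketch below encodes the gaps between consecutive LMS events as symbols drawn from human natural time units, producing sequences to which ordinary pattern mining can then be applied. The interval boundaries and event data are invented for the example, not taken from the paper:

```python
# Hypothetical interval set based on "human natural time" units;
# the boundaries below are illustrative only.
INTERVALS = [(60, "<1min"), (3600, "<1hour"), (86400, "<1day")]

def gap_symbol(seconds):
    """Map a time gap in seconds to a symbolic interval label."""
    for bound, label in INTERVALS:
        if seconds < bound:
            return label
    return ">=1day"

def to_time_interval_sequence(events):
    """events: (timestamp_in_seconds, resource) pairs sorted by time.
    Returns resources interleaved with symbolic gaps, e.g.
    ['quiz', '<1hour', 'video']."""
    out = [events[0][1]]
    for (t0, _), (t1, resource) in zip(events, events[1:]):
        out.append(gap_symbol(t1 - t0))
        out.append(resource)
    return out
```

Mining frequent patterns over such enriched sequences captures both the order of resources and the typical delay between them, which plain sequential pattern mining discards.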

    Novel algorithms for protein sequence analysis

    Get PDF
    Each protein is characterized by its unique sequential order of amino acids, the so-called protein sequence. Biology's paradigm is that this order of amino acids determines the protein's architecture and function. In this thesis, we introduce novel algorithms to analyze protein sequences. Chapter 1 begins with an introduction to amino acids, proteins and protein families. Fundamental techniques from computer science related to the thesis are then briefly described. Making a multiple sequence alignment (MSA) and constructing a phylogenetic tree are traditional means of sequence analysis. Information entropy, feature selection and sequential pattern mining, all drawn from computer science, provide alternative ways to analyze protein sequences. In Chapter 2, information entropy was used to measure the conservation at a given position of the alignment. From an alignment which is grouped into subfamilies, two types of information entropy values are calculated for each position in the MSA. One is the average entropy for a given position among the subfamilies; the other is the entropy for the same position in the entire multiple sequence alignment. This so-called two-entropies analysis, or TEA for short, yields a scatter plot in which all positions are represented with their two entropy values as x- and y-coordinates. The different locations of the positions (or dots) in the scatter plot are indicative of various conservation patterns and may suggest different biological functions. The globally conserved positions show up in the lower left corner of the graph, which suggests that these positions may be essential for the folding or for the main functions of the protein superfamily. In contrast, the positions neither conserved between subfamilies nor conserved within each individual subfamily appear in the upper right corner. The positions conserved within each subfamily but divergent among subfamilies are in the upper left corner.
They may participate in biological functions that distinguish subfamilies, such as recognition of an endogenous ligand in G protein-coupled receptors. The TEA method requires a definition of protein subfamilies as input. However, such a definition is a challenging problem in itself, particularly because it is crucial for the subsequent prediction of specificity positions. In Chapter 3, we automated the TEA method described in Chapter 2 by tracing the evolutionary pressure from the root to the branches of the phylogenetic tree. At each level of the tree, a TEA plot is produced to capture the signal of the evolutionary pressure. A consensus TEA-O plot is composed from the whole series of plots to provide a condensed representation. Positions related to functions that evolved early (conserved) or later (specificity) are close to the lower left or upper left corner of the TEA-O plot, respectively. This novel approach allows an unbiased, user-independent analysis of residue relevance in a protein family. We tested the TEA-O method on a synthetic dataset as well as on 'real' data, i.e., LacI and GPCR datasets. The ROC plots for the real data showed that TEA-O works well on all datasets and much better than other considered methods such as evolutionary trace, SDPpred and TreeDet. While positions were treated independently from each other in Chapters 2 and 3 when predicting specificity positions, in Chapter 4 multi-RELIEF considers both sequence similarity and distance in 3D structure in the specificity scoring function. The multi-RELIEF method was developed based on RELIEF, a state-of-the-art machine-learning technique for feature weighting. It estimates the expected 'local' functional specificity of residues from an alignment divided into multiple classes. Optionally, 3D structure information is exploited by increasing the weight of residues that have high-weight neighbors.
Using ROC curves over a large body of experimental reference data, we showed that multi-RELIEF identifies specificity residues for the seven test sets used. In addition, incorporating structural information improved the prediction of specificity of interaction with small molecules. Comparison of multi-RELIEF with four other state-of-the-art algorithms indicates its robustness and best overall performance. In Chapters 2, 3 and 4, we relied heavily on multiple sequence alignment to identify conserved and specificity positions. As mentioned before, the construction of such an alignment is not self-evident. Following the principle of sequential pattern mining, in Chapter 5 we propose a new algorithm that directly identifies frequent biologically meaningful patterns from unaligned sequences. Six algorithms were designed and implemented to mine three different pattern types from either one or two datasets using a pattern-growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality, such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. From Chapters 2 to 5, we aimed to identify functional residues from either aligned or unaligned protein sequences. In Chapter 6, we introduce an alignment-independent procedure to cluster protein sequences, which may be used to predict protein function. Phylogeny reconstruction is traditionally based on multiple sequence alignment. The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of strongly deviating sequences.
In cheminformatics, constructing a similarity tree of ligands is usually alignment-free. Feature spaces are a routine means of converting compounds into binary fingerprints, from which distances among compounds can be obtained and similarity trees constructed via clustering techniques. We explored building feature spaces for phylogeny reconstruction, either using the so-called k-mer method or via sequential pattern mining with additional filtering and combining operations. Both approaches yielded satisfactory trees compared with alignment-based methods. We found that when k equals 3, the phylogenetic tree built from the k-mer fingerprints is as good as that of one of the alignment-based methods, in which PAM and neighbor joining are used for computing distances and constructing a tree, respectively (NJ-PAM). As for the sequential pattern mining approach, the quality of the phylogenetic tree is better than that of the alignment-based method NJ-PAM if we set the support value to 10% and use maximal patterns only as descriptors. Finally, in Chapter 7, general conclusions about the research described in this thesis are drawn, supplemented with an outlook on further research lines. We are convinced that the described algorithms can be useful in, e.g., genomic analyses, and provide further ideas for novel algorithms in this respect. Leiden University, NWO (Horizon Breakthrough project 050-71-041) and the Dutch Top Institute Pharma (D1-105).
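
The two-entropies analysis (TEA) described above can be sketched compactly: for each alignment column, compute the Shannon entropy over the whole MSA and the mean entropy within the subfamilies. The code below is a minimal illustration on toy data, not the thesis implementation:

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (bits) of the residue distribution in one MSA column."""
    counts = Counter(column)
    n = len(column)
    # written as log2(n/c) so a fully conserved column gives exactly 0.0
    return sum(c / n * math.log2(n / c) for c in counts.values())

def two_entropies(msa, subfamilies):
    """For each alignment position, return (global_entropy, mean_subfamily_entropy).
    msa is a list of equal-length sequences; subfamilies is a list of row-index
    lists partitioning the MSA rows."""
    length = len(msa[0])
    result = []
    for j in range(length):
        column = [seq[j] for seq in msa]
        global_h = entropy(column)
        sub_h = sum(entropy([msa[i][j] for i in fam])
                    for fam in subfamilies) / len(subfamilies)
        result.append((global_h, sub_h))
    return result
```

In a toy MSA such as `["AA", "AC", "GA", "GC"]` with subfamilies `[[0, 1], [2, 3]]`, position 0 is conserved within each subfamily but divergent between them, placing it in the upper left corner of the TEA plot.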

    Mining patterns in complex data

    Get PDF
    Ph.D. (Doctor of Philosophy)

    Supporting the design of sequences of cumulative activities impacting on multiple areas through a data mining approach : application to design of cognitive rehabilitation programs for traumatic brain injury patients

    Get PDF
    Traumatic brain injury (TBI) is a leading cause of disability worldwide. It is the most common cause of death and disability during the first three decades of life and accounts for more productive years of life lost than cancer, cardiovascular disease and HIV/AIDS combined. Cognitive Rehabilitation (CR), as part of Neurorehabilitation, aims to reduce the cognitive deficits caused by TBI. CR treatment consists of sequentially organized tasks that require repetitive use of impaired cognitive functions. While task repetition is not the only important feature, it is becoming clear that neuroplastic change and functional improvement only occur when a number of specific tasks are performed in a certain order and number of repetitions, and do not occur otherwise. Until now, there has been an important lack of well-established criteria and on-field experience by which to identify the right number and order of tasks to propose to each individual patient. This thesis proposes the CMIS methodology to support health professionals in composing CR programs by selecting the most promising tasks in the right order. Two contributions to this topic were developed for specific steps of CMIS through innovative data mining techniques: the SAIMAP and NRRMR methodologies. SAIMAP (Sequence of Activities Improving Multi-Area Performance) proposes an innovative combination of data mining techniques in a hybrid generic methodological framework to find sequential patterns over a predefined set of activities and to associate them with multi-criteria improvement indicators regarding a predefined set of areas targeted by the activities. It combines data and prior knowledge with preprocessing, clustering, motif discovery and class post-processing to understand the effects of a sequence of activities on targeted areas, provided that these activities have high interactions and cumulative effects.
Furthermore, this work introduces and defines the Neurorehabilitation Range (NRR) concept to determine the degree of performance expected for a CR task and the number of repetitions required to produce maximum rehabilitation effects on the individual. An operationalization of NRR is proposed by means of a visualization tool called SAP. SAP (Sectorized and Annotated Plane) is introduced to identify areas where there is a high probability of a target event occurring. Three approaches to SAP are defined, implemented, applied and validated on a real case: Vis-SAP, DT-SAP and FT-SAP. Finally, the NRRMR (Neurorehabilitation Range Maximal Regions) problem is introduced as a generalization of the Maximal Empty Rectangle (MER) problem to identify maximal NRR over an FT-SAP. Combined in the CMIS methodology, these contributions make it possible to identify a suitable pattern for a CR program (by means of a regular expression) and to instantiate it as a real sequence of tasks within NRR that maximizes the expected improvement of patients, thus providing support for the creation of CR plans. First, SAIMAP provides the general structure of successful CR sequences, giving the length of the sequence and the kind of task recommended at every position (attention, memory or executive function tasks). Next, NRRMR provides task-specific information to help decide which particular task is placed at each position in the sequence, the number of repetitions, and the expected range of results to maximize improvement after treatment.
From the Artificial Intelligence point of view, the proposed methodologies are general enough to be applied to similar problems where a sequence of interconnected activities with cumulative effects is used to impact a set of areas of interest, for example spinal cord injury patients following a physical rehabilitation program, elderly patients facing age-related cognitive decline through cognitive stimulation programs, or educational settings where the aim is to find the best way to combine mathematical drills in a program for a specific Mathematics course.
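
NRRMR is described as a generalization of the Maximal Empty Rectangle (MER) problem. As background, a brute-force sketch of the classical MER task (find the largest axis-aligned rectangle containing no points) is shown below; the candidate-boundary enumeration is illustrative and is not the thesis algorithm:

```python
from itertools import combinations

def largest_empty_rectangle(points, width, height):
    """Brute force: a maximal empty rectangle's sides each touch a point
    or the bounding box, so candidate boundaries come from the point
    coordinates plus the box edges. Keeps the largest rectangle with no
    point strictly inside."""
    xs = sorted({0, width, *(px for px, _ in points)})
    ys = sorted({0, height, *(py for _, py in points)})
    best_area, best_rect = 0, None
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            if any(x1 < px < x2 and y1 < py < y2 for px, py in points):
                continue  # a point lies strictly inside: not empty
            area = (x2 - x1) * (y2 - y1)
            if area > best_area:
                best_area, best_rect = area, (x1, y1, x2, y2)
    return best_area, best_rect
```

For a single point at (2, 2) in a 4×4 box, the largest empty rectangle has area 8 (e.g. one half of the box up to the point).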

    Extraction of n-ary relation arguments from texts guided by a domain OTR

    Get PDF
    Today, a huge amount of data is made available to the research community through several web-based libraries. Enhancing the data collected from scientific documents is a major challenge for analyzing and reusing domain knowledge efficiently. To be enhanced, data need to be extracted from documents and structured in a common representation using a controlled vocabulary, as in ontologies. Our research deals with knowledge engineering issues for experimental data extracted from scientific articles, in order to reuse them in decision support systems. Experimental data can be represented by n-ary relations which link a studied object (e.g. food packaging, transformation process) with its features (e.g. oxygen permeability in packaging, biomass grinding), and stored in an Ontological and Terminological Resource (OTR). An OTR associates an ontology with a terminological and/or linguistic part in order to establish a clear distinction between the term and the notion it denotes (the concept). Our work focuses on n-ary relation extraction from scientific documents in order to populate a domain OTR with new instances. Our contributions are based on Natural Language Processing (NLP) combined with data mining approaches guided by the domain OTR. More precisely, we first focus on the extraction of units of measure, which are known to be difficult to identify because of their typographic variations. We rely on automatic classification of texts, using supervised learning methods, to reduce the search space of unit variants, and then propose a new similarity measure that identifies them, taking their syntactic properties into account.
Secondly, we propose to adapt and combine data mining methods (sequential pattern and rule mining) and syntactic analysis in order to overcome the challenging process of identifying and extracting n-ary relation instances buried in unstructured texts.
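
As a toy illustration of matching typographic variants of units of measure (the similarity measure in this work additionally exploits syntactic properties), one could normalize unit strings with ad-hoc rewriting rules and score the result by edit distance. The normalization rules below are invented for the example:

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize_unit(s):
    """Crude illustrative normalization: lowercase, drop spaces, treat the
    dot separator as division, and drop an explicit '-1' exponent, so that
    'mg.L-1' and 'mg/L' collapse to the same form."""
    s = s.lower().replace(" ", "").replace(".", "/")
    return s.replace("-1", "")

def unit_similarity(u, v):
    """Similarity in [0, 1]: 1 means identical after normalization."""
    a, b = normalize_unit(u), normalize_unit(v)
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)
```

Such a score lets a pipeline cluster candidate strings around known unit concepts before populating the OTR.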

    A framework for dynamic heterogeneous information networks change discovery based on knowledge engineering and data mining methods

    Get PDF
    Information networks are collections of data structures that are used to model interactions in social and living phenomena. They can be either homogeneous or heterogeneous, and static or dynamic, depending upon the type and nature of the relations between the network entities. Static, homogeneous and heterogeneous networks have been widely studied in data mining, but recently there has been renewed interest in dynamic heterogeneous information network (DHIN) analysis, because rich temporal, structural and semantic information is hidden in this kind of network. The heterogeneity and dynamicity of real-time networks offer plenty of prospects as well as many challenges for data mining. There has been substantial research on the exploration of entities and the identification of their links in heterogeneous networks. However, work on the formal construction and change mining of heterogeneous information networks is still in its infancy due to their complex structure and rich semantics. Researchers have used cluster-based methods and frequent pattern mining techniques in the past for change discovery in dynamic heterogeneous networks. These methods only work on small datasets, only provide structural change discovery, and fail to support fast, parallel processing of big data. A further problem is that cluster-based approaches capture only structural changes, while pattern mining captures only the semantic characteristics of changes in a dynamic network.
Another interesting but challenging problem that has not been considered by past studies is how to extract knowledge from these semantically richer networks based on user-specific constraints. This study aims to develop a new change mining system, ChaMining, to investigate dynamic heterogeneous network data, using knowledge engineering with semantic web technologies and data mining to overcome the problems of previous techniques. This system and approach are important in academia as well as in real-life applications to support decision-making based on temporal network data patterns. This research has designed a novel framework, ChaMining, to (i) find relational patterns in dynamic networks locally and globally by employing domain ontologies, (ii) extract knowledge from these semantically richer networks based on user-specific (meta-path) constraints, (iii) cluster the relational data patterns based on the structural properties of nodes in the dynamic network, and (iv) develop a hybrid approach using knowledge engineering, temporal rule mining and clustering to detect changes in dynamic heterogeneous networks. The evidence presented in this research shows that the proposed framework and methods work very efficiently on benchmark big dynamic heterogeneous datasets. The empirical results can contribute to a better understanding of the rich semantics of DHINs and how to mine them using the proposed hybrid approach. The proposed framework has been evaluated against six previous dynamic change detection algorithms or frameworks and performs very well in detecting microscopic as well as macroscopic human-understandable changes. The number of change patterns extracted with this approach was higher than with previous approaches, which helps to reduce information loss.