66 research outputs found

    8th SC@RUG 2011 proceedings:Student Colloquium 2010-2011

    Distributed Spectral Graph Methods for Analyzing Large-Scale Unstructured Biomedical Data

    There is an ever-expanding body of biological data, growing in size and complexity and outstripping the capabilities of standard database tools and traditional analysis techniques. Examples include molecular dynamics simulations, drug-target interactions, gene regulatory networks, and high-throughput imaging. Large-scale acquisition and curation of biological data has already yielded results in the form of lower costs for genome sequencing and greater coverage in databases such as GenBank, and is viewed as the future of biocuration. The “big data” philosophy and its associated paradigms and frameworks have the potential to uncover solutions to problems that are otherwise intractable with more traditional investigative techniques. Here, we focus on two biological systems whose data form large, undirected graphs. First, we develop a quantitative model of ciliary motion phenotypes, using spectral graph methods for unsupervised latent pattern discovery. Second, we apply similar techniques to identify a mapping between physiochemical structure and odor percept in human olfaction. In both cases, we experienced computational bottlenecks in our statistical machinery, necessitating the creation of a new analysis framework. At the core of this framework is a distributed hierarchical eigensolver, which we compare directly to other popular solvers. We demonstrate its essential role in enabling the discovery of novel ciliary motion phenotypes and in identifying physiochemical-perceptual associations.
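    As an illustration only, the basic idea behind spectral graph methods can be sketched on a toy undirected graph. The sketch below is not the distributed hierarchical eigensolver described in the abstract: it is a single-machine power iteration, and the function names (`fiedler_vector`, `spectral_bipartition`), the dense adjacency-matrix input, and the iteration count are all assumptions made for this example. It approximates the eigenvector of the second-smallest eigenvalue of the graph Laplacian L = D - A and splits the nodes by the sign of its entries.

```python
import math


def fiedler_vector(adj, iters=200):
    """Approximate the Fiedler vector of a connected undirected graph.

    adj is a dense, symmetric 0/1 adjacency matrix (list of lists) with
    at least two nodes. Power iteration is run on (shift*I - L), where
    L = D - A, while projecting out the constant eigenvector of L.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    # 2 * max degree is an upper bound on the largest eigenvalue of L,
    # so (shift*I - L) has only non-negative eigenvalues.
    shift = 2.0 * max(deg)
    # Deterministic, non-constant start vector with zero mean.
    v = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iters):
        # w = (shift*I - L) v = shift*v - D v + A v
        w = [
            shift * v[i] - deg[i] * v[i]
            + sum(adj[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        # Remove the component along the constant vector (eigenvalue 0 of L).
        mean = sum(w) / n
        w = [x - mean for x in w]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def spectral_bipartition(adj):
    """Split nodes into two groups by the sign of the Fiedler vector."""
    return [0 if x < 0 else 1 for x in fiedler_vector(adj)]


if __name__ == "__main__":
    # Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
    adj = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(spectral_bipartition(adj))
```

    On this toy graph the two triangles end up in different groups. A dense power iteration of this kind costs O(n^2) per step, which hints at why graphs of the size mentioned in the abstract would call for a sparse or distributed solver.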

    A survey on the development status and application prospects of knowledge graph in smart grids

    With the advent of the electric power big data era, semantic interoperability and interconnection of power data have received extensive attention. Knowledge graph technology is a new method for describing the complex relationships between concepts and entities in the objective world, and it has attracted wide attention because of its robust knowledge inference ability. In particular, with the proliferation of measurement devices and the exponential growth of electric power data, the electric power knowledge graph provides new opportunities to resolve the tension between massive power resources and the continuously increasing demand for intelligent applications. To fulfil the potential of the knowledge graph, to deal with the various challenges faced, and to obtain insights for business applications of smart grids, this work first presents a holistic study of knowledge-driven intelligent application integration. Specifically, a detailed overview of electric power knowledge mining is provided. Then, an overview of the knowledge graph in smart grids is introduced. Moreover, the architecture of the big knowledge graph platform for smart grids and its critical technologies are described. Furthermore, this paper comprehensively elaborates on the application prospects of the knowledge graph for smart grids, power consumer service, decision-making in dispatching, and operation and maintenance of power equipment. Finally, issues and challenges are summarised. Comment: IET Generation, Transmission & Distributio

    Elicitation of relevant information from medical databases: application to the encoding of secondary diagnoses

    This thesis focuses on encoding inpatient episodes into standard codes, a highly sensitive medical task in French hospitals that requires minute detail and high accuracy, since the hospital's income depends directly on it. Encoding an inpatient episode includes encoding the primary diagnosis that motivates the hospital stay and the secondary diagnoses that occur during the stay. Unlike the primary diagnosis, encoding secondary diagnoses is prone to human error, owing to the difficulty of collecting relevant data from different medical sources or to the outright absence of data that would help encode the diagnosis. We propose a retrospective analysis, using machine learning methods, of the encoding task for selected secondary diagnoses. To that end, the PMSI database, a large medical database that documents all information on hospital stays in France, is analysed in order to extract, from previously encoded inpatient episodes, the decisive features for encoding a difficult secondary diagnosis that occurred with a frequent primary diagnosis. At the end of an encoding session, once all the features are available, we help the coders by proposing a list of relevant encodings together with the features used to predict them. Developing an efficient encoding help system raises two challenges: expert knowledge of the medical domain and an efficient methodology for exploiting the medical database with machine learning (ML) methods.

    For the medical domain knowledge challenge, we collaborate with expert coders in a local hospital, who provide expert insight into secondary diagnoses that are difficult to encode and evaluate the results of the proposed methodology. For the database exploitation challenge, we use ML methods, specifically Feature Selection (FS), and focus on three issues: the incompatible format of medical databases, their excessive number of features, and the instability of the features extracted from them. Medical databases are generally relational, so we propose a series of transformations that make the database and its features exploitable by any FS method. To limit the explosion in the number of features, usually driven by the number of diagnoses and medical procedures, we analyse the impact of grouping these features at different representation levels and choose the best level. Finally, the datasets linked to diagnoses are highly imbalanced because positive and negative examples are unequally represented, so most existing FS methods tend not to perform well on them, even when sampling strategies are used, and the features they extract are unstable. To resolve this, we propose a methodology that extracts stable features by sampling the dataset multiple times and extracting the relevant features from each sampled dataset, regardless of the sampling method and the FS method used.

    We evaluate the methodology by building a classification model that predicts the studied diagnoses from the extracted features. The performance of the classification model indicates the quality of the extracted features, since good-quality features produce a good classification model. Two scales of the PMSI database are used, local and regional: the classification model is built on the local scale and tested on both the local and regional scales. The evaluations showed that the extracted features are good features for encoding secondary diagnoses. We therefore propose applying our methodology to increase the integrity of the encoded diagnoses and to avoid missing important encodings that affect the hospital's budget, by modifying the encoding process and providing the coders with the potential encodings of the secondary diagnoses as well as the features that lead to them.
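    The repeated-sampling idea in the abstract above can be illustrated with a minimal sketch. Everything specific here is an assumption made for the example, not the thesis's actual procedure: the function names, the balanced resampling scheme (bootstrapped positives plus an equal number of undersampled negatives, with positives assumed to be the minority class), the feature score (absolute difference of class-wise means), and the 80% selection-frequency threshold.

```python
import random


def top_k_features(rows, labels, k):
    """Return the indices of the k features whose class-wise means differ most."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    n_feat = len(rows[0])
    scores = []
    for j in range(n_feat):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    ranked = sorted(range(n_feat), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def stable_features(rows, labels, k=2, rounds=50, threshold=0.8, seed=0):
    """Keep features selected in at least `threshold` of the resampling rounds.

    Assumes binary labels with 1 as the minority class and at least as
    many negatives as positives.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    counts = [0] * len(rows[0])
    for _ in range(rounds):
        # Balanced sample: bootstrap the positives, undersample the negatives.
        sample = [rng.choice(pos_idx) for _ in pos_idx]
        sample += rng.sample(neg_idx, len(pos_idx))
        chosen = top_k_features(
            [rows[i] for i in sample], [labels[i] for i in sample], k
        )
        for j in chosen:
            counts[j] += 1
    return [j for j, c in enumerate(counts) if c / rounds >= threshold]


if __name__ == "__main__":
    # Toy imbalanced data: feature 0 equals the label, feature 1 alternates
    # with the row index, feature 2 is constant.
    labels = [1] * 5 + [0] * 20
    rows = [[y, i % 2, 0] for i, y in enumerate(labels)]
    print(stable_features(rows, labels, k=1))
```

    Counting how often each feature survives across resampled datasets, rather than trusting a single run, is what makes the selected set less sensitive to any one sample of an imbalanced dataset.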

    Data Mining

    Data mining is a branch of computer science concerned with automatically extracting meaningful, useful knowledge and previously unknown, hidden, interesting patterns from large amounts of data to support decision-making. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining, and brings together many successful data mining studies in areas such as health, banking, education, software engineering, animal science, and the environment.