80 research outputs found

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. The early part of the research proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, models' training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. The middle part of the research advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, and a novel imbalanced-data resampling approach, minority pattern reconstruction (MPR), led by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. The experimental results show that building predictive models with the methods guided by the new framework (Octopus) produces models whose performance met with domain experts' approval. Performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to performance more in line with experts' success criteria than traditional imbalanced data resampling techniques. Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while also offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework for producing new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods (MMI, MPR and XDistance) were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and the toxicity side effects of advanced breast cancer radiotherapy. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

    Investigation of Heterogeneous Approach to Fact Invention of Web Users’ Web Access Behaviour

    The World Wide Web consists of a huge volume of data of many different types. Web mining is a field of data mining that deals with many different web services and a large number of web users, and web user mining is in turn a field of web mining. Information about users' web access is collected in several ways. The most common technique is the web log file. Other techniques collect web access information through the browser agent, user authentication, web reviews, web ratings, web rankings and tracking cookies. Web users find it difficult to retrieve the information they need in time because the huge volume of structured and unstructured information increases the complexity of the web. Web usage mining is important for purposes such as organizing websites, business and maintenance services, website personalization and reducing network bandwidth use. This paper provides an analysis of web usage mining techniques.

    Automatic production and integration of knowledge to the support of the decision and planning activities in medical-clinical diagnosis, treatment and prognosis.

    The concept of medical procedure refers to the set of activities carried out by health care professionals to solve or mitigate the health problems that affect a patient. Decision making within a medical procedure has long been one of the most interesting research areas in medical informatics, and it is the research context of this thesis. The motivation for this research work rests on three main aspects: there are no knowledge models for all the medical-clinical activities that can be induced from medical data, there are no inductive learning solutions for all the medical-clinical activities, and there is no integral model that formalizes the concept of medical procedure. Therefore, our main objective is to develop a computable, knowledge-based model that integrates all the decision and planning activities for medical-clinical diagnosis, treatment and prognosis. To achieve this main objective, we first explain the research problem. Second, we describe the background of the work from both the medical and the informatics contexts. Third, we explain the development of the research proposal, which is based on four main contributions: a novel knowledge representation model, based on data, for the planning activity in medical-clinical diagnosis and treatment; a novel inductive learning methodology for the planning activity in medical-clinical diagnosis and treatment; a novel inductive learning methodology for the decision activity in medical-clinical prognosis; and finally, a novel computable model, based on data and knowledge, which integrates the decision and planning activities of medical-clinical diagnosis, treatment and prognosis.

    Image analysis for gene expression based phenotype characterization in yeast cells

    Image analysis of objects at the microscope scale requires accuracy so that measurements can be used to differentiate between the groups of objects being studied. This thesis deals with measurements in yeast biology that are obtained from microscope images. We study the algorithms and workflow of image analysis of yeast cells in order to understand and improve the measurement accuracy. The Saccharomyces cerevisiae cell is widely used as a model organism in the life sciences. Studying gene and protein behaviour within these cells is essential, as it makes it possible to find treatments and solutions for genetic and hereditary diseases. This is possible because many processes that occur at the molecular level in this organism are similar to those in human cells. In the research group Imaging and Bioinformatics, we have developed a framework for the analysis of yeast cells. This framework is intended to support research in yeast biology. The framework is integrated in one application and presented via a GUI. The application integrates modules and algorithms including segmentation, measurement, analysis and visualization.  Erasmus-Mundus, Raymond-Sackler, LSBSLIACS - OU

    Decision tree learning for intelligent mobile robot navigation

    The replication of human intelligence, learning and reasoning by means of computer algorithms is termed Artificial Intelligence (AI), and the interaction of such algorithms with the physical world can be achieved using robotics. The work described in this thesis investigates the application of concept learning (an approach which takes its inspiration from biological motivations, and from survival instincts in particular) to robot control and path planning. The methodology of concept learning has been applied using decision trees (DTs), which induce domain knowledge from a finite set of training vectors. These vectors systematically describe a physical entity and are used to train a robot to learn new concepts and to adapt its behaviour. To achieve behaviour learning, this work introduces the novel approach of hierarchical learning and knowledge decomposition to the frame of the reactive robot architecture. Following the analogy with survival instincts, the robot is first taught how to survive in very simple and homogeneous environments, namely a world without any disturbances or any kind of "hostility". Once this simple behaviour, named a primitive, has been established, the robot is trained to adapt new knowledge to cope with increasingly complex environments by adding further worlds to its existing knowledge. The repertoire of robot behaviours, in the form of symbolic knowledge, is retained in a hierarchy of clustered decision trees accommodating a number of primitives. To classify robot perceptions, control rules are synthesised using symbolic knowledge derived from searching the hierarchy of DTs. A second novel concept is introduced, namely that of multi-dimensional fuzzy associative memories (MDFAMs). These are clustered fuzzy decision trees (FDTs) which are trained locally and accommodate specific perceptual knowledge. Fuzzy logic is incorporated to deal with inherent noise in sensory data and to merge conflicting behaviours of the DTs. The feasibility of the developed techniques is illustrated in robot applications, and their benefits and drawbacks are discussed.

    Supporting the design of sequences of cumulative activities impacting on multiple areas through a data mining approach : application to design of cognitive rehabilitation programs for traumatic brain injury patients

    Traumatic brain injury (TBI) is a leading cause of disability worldwide. It is the most common cause of death and disability during the first three decades of life and accounts for more productive years of life lost than cancer, cardiovascular disease and HIV/AIDS combined. Cognitive Rehabilitation (CR), as part of Neurorehabilitation, aims to reduce the cognitive deficits caused by TBI. CR treatment consists of sequentially organized tasks that require repetitive use of impaired cognitive functions. While task repetition is not the only important feature, it is becoming clear that neuroplastic change and functional improvement occur only when a number of specific tasks are performed in a certain order and with a certain number of repetitions, and do not occur otherwise. Until now, there has been an important lack of well-established criteria and on-field experience by which to identify the right number and order of tasks to propose to each individual patient. This thesis proposes the CMIS methodology to support health professionals in composing CR programs by selecting the most promising tasks in the right order. Two contributions were developed for specific steps of CMIS using innovative data mining techniques: the SAIMAP and NRRMR methodologies. SAIMAP (Sequence of Activities Improving Multi-Area Performance) proposes an innovative combination of data mining techniques in a hybrid generic methodological framework to find sequential patterns of a predefined set of activities and to associate them with multi-criteria improvement indicators regarding a predefined set of areas targeted by the activities. It combines data and prior knowledge with preprocessing, clustering, motif discovery and class post-processing to understand the effects of a sequence of activities on targeted areas, provided that these activities have high interactions and cumulative effects. Furthermore, this work introduces and defines the Neurorehabilitation Range (NRR) concept to determine the degree of performance expected for a CR task and the number of repetitions required to produce maximum rehabilitation effects on the individual. An operationalization of NRR is proposed by means of a visualization tool called SAP. SAP (Sectorized and Annotated Plane) is introduced to identify areas where there is a high probability of a target event occurring. Three approaches to SAP are defined, implemented, applied and validated on a real case: Vis-SAP, DT-SAP and FT-SAP. Finally, the NRRMR (Neurorehabilitation Range Maximal Regions) problem is introduced as a generalization of the Maximal Empty Rectangle (MER) problem to identify maximal NRRs over an FT-SAP. Combined in the CMIS methodology, these contributions make it possible to identify a convenient pattern for a CR program (by means of a regular expression) and to instantiate it as a real sequence of tasks within the NRR, maximizing the expected improvement of patients and thus supporting the creation of CR plans. First, SAIMAP provides the general structure of successful CR sequences, giving the length of the sequence and the kind of task recommended at every position (attention task, memory task or executive function task). Next, NRRMR provides task-specific information to help decide which particular task is placed at each position in the sequence, the number of repetitions, and the expected range of results to maximize improvement after treatment.
    From the Artificial Intelligence point of view, the proposed methodologies are general enough to be applied to similar problems where a sequence of interconnected activities with cumulative effects is used to impact a set of areas of interest. Examples include spinal cord injury patients following a physical rehabilitation program, elderly patients addressing age-related cognitive decline through cognitive stimulation programs, and educational settings where the aim is to find the best way to combine mathematical drills in a program for a specific mathematics course.

    Machine Learning for Modelling Tissue Distribution of Drugs and the Impact of Transporters

    The ability to predict human pharmacokinetics in the early stages of drug development is of paramount importance for preventing late-stage attrition and for managing toxicity. This thesis explores the machine learning modelling of one of the main pharmacokinetic parameters that determine the therapeutic success of a drug: volume of distribution. To do so, a variety of physiological phenomena with known mechanisms of impact on drug distribution were considered as input features during the modelling of volume of distribution, namely Solute Carriers-mediated uptake, ATP-binding Cassette-mediated efflux, drug-induced phospholipidosis and plasma protein binding. These were paired with molecular descriptors to provide both chemical and biological information to the building of the predictive models. Since the biological data used as input are limited, the various types of physiological descriptors were also modelled prior to modelling volume of distribution. Here, a focus was placed on harnessing the information contained in correlations within the two transporter families, which was done by using multi-label classification. The application of this approach to transporter data is very recent, and its use to model Solute Carriers data, for example, is reported here for the first time. For both transporter families, there was evidence that accounting for correlations between transporters offers useful information that is not captured by molecular descriptors. This effort also uncovered new potential links between members of the Solute Carriers family that are not obvious from a purely physiological standpoint. The models created for the different physiological parameters were then used to predict these parameters and fill in the gaps in the available experimental data, and the resulting combination of experimental and predicted data was used to model volume of distribution. This exercise improved the accuracy of the volume of distribution models, and the generated models incorporated a wide variety of the physiological descriptors supplied along with molecular features. The use of most of these physiological descriptors in the modelling of distribution is unprecedented, which is one of the main points of novelty of this thesis. Additionally, as parallel complementary work, a new method to characterize the predictive reliability of machine learning classification models was proposed, and an in-depth analysis of mispredictions, their trends and their causes was carried out, using one of the transporter models as an example. This is an important complement to the main body of work in this thesis, as predictive performance is necessarily tied to prediction reliability.

    Approximate Data Mining Techniques on Clinical Data

    The past two decades have witnessed an explosion in the number of medical and healthcare datasets available to researchers and healthcare professionals. This growth in data collection calls for appropriate data mining techniques and tools that can automatically extract relevant information from data and thereby provide insight into the clinical behaviors and processes the data capture. Since these tools should support the decision-making activities of medical experts, all the extracted information must be represented in a human-friendly way, that is, in a concise and easy-to-understand form. To this end, we propose a new framework that brings together several new mining techniques and tools. These techniques mainly focus on two aspects: the temporal one and the predictive one. All of these techniques were applied to clinical data, in particular ICU data from the MIMIC III database. These applications showed the flexibility of the framework, which can retrieve different outcomes from the same dataset. The first two techniques rely on the concept of Approximate Temporal Functional Dependencies (ATFDs). ATFDs have been proposed, with their suitable treatment of temporal information, as a methodological tool for mining clinical data. An example of the knowledge derivable through such dependencies is "within 15 days, patients with the same diagnosis and the same therapy usually receive the same daily amount of drug". However, current ATFD models do not analyze the temporal evolution of the data, as in "for most patients with the same diagnosis, the same drug is prescribed after the same symptom". To this end, we propose a new kind of ATFD called Approximate Pure Temporally Evolving Functional Dependencies (APEFDs). Another limitation of this kind of dependency is that it cannot handle quantitative data when some tolerance is allowed for numerical values. In particular, this limitation arises in clinical data warehouses, where analysis and mining have to consider one or more measures related to quantitative data (such as lab test results and vital signs), concerning multiple dimensional (alphanumeric) attributes (such as patient, hospital, physician, diagnosis) and some time dimensions (such as the day since hospitalization and the calendar date). For this scenario, we introduce a new kind of ATFD, named Multi-Approximate Temporal Functional Dependency (MATFD), which considers dependencies between dimensions and quantitative measures from temporal clinical data. These new dependencies may provide new knowledge such as "within 15 days, patients with the same diagnosis and the same therapy receive a daily amount of drug within a fixed range". The other techniques are based on pattern mining, which has also been proposed as a methodological tool for mining clinical data. However, many methods proposed so far focus on mining temporal rules which describe relationships between data sequences or instantaneous events, without considering the presence of more complex temporal patterns in the dataset. These patterns, such as trends of a particular vital sign, are often very relevant for clinicians. Moreover, it is of interest to discover whether some event, such as a drug administration, is capable of changing these trends, and how. To this end, we propose a new kind of temporal pattern, called Trend-Event Patterns (TEPs), that focuses on events and their influence on trends that can be retrieved from some measures, such as vital signs. With TEPs we can express concepts such as "the administration of paracetamol to a patient with an increasing temperature leads to a decreasing trend in temperature after the administration". We also analyze another pattern mining technique, one that includes prediction. This technique discovers a compact set of patterns that aim to describe the condition (or class) of interest. Our framework relies on a classification model that considers and combines various predictive pattern candidates and selects only those that are important for improving the overall class prediction performance. We show that our classification approach achieves a significant reduction in the number of extracted patterns, compared to state-of-the-art methods based on the minimum predictive pattern mining approach, while preserving the overall classification accuracy of the model. For each technique described above, we developed a tool to retrieve its kind of rule. All the results were obtained by pre-processing and mining clinical data, in particular ICU data from the MIMIC III database.
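    As a rough illustration of the approximate dependency idea in the abstract above, the sketch below scores a dependency such as "same diagnosis and therapy imply the same daily dose" by the smallest fraction of records that would have to be removed for it to hold exactly (the g3 error commonly used for approximate functional dependencies). This is not the thesis's ATFD algorithm: the function name, the attribute names, the sample records and the fixed 15-day bucketing used to mimic "within 15 days" are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Smallest fraction of rows to delete so that lhs -> rhs holds exactly."""
    if not rows:
        return 0.0
    # For every distinct left-hand-side value, count the right-hand-side values seen.
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key][tuple(row[a] for a in rhs)] += 1
    # Within each group, keep only the most frequent right-hand-side value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


# Hypothetical prescription records (not taken from MIMIC III).
raw = [
    {"diagnosis": "flu", "therapy": "A", "day": 1, "dose": 500},
    {"diagnosis": "flu", "therapy": "A", "day": 7, "dose": 500},
    {"diagnosis": "flu", "therapy": "A", "day": 12, "dose": 750},
    {"diagnosis": "flu", "therapy": "A", "day": 20, "dose": 750},
    {"diagnosis": "asthma", "therapy": "B", "day": 3, "dose": 200},
]

# Assumption: approximate "within 15 days" by fixed, non-overlapping 15-day buckets.
rows = [dict(r, window=r["day"] // 15) for r in raw]

print(g3_error(rows, ["diagnosis", "therapy"], ["dose"]))            # 0.4
print(g3_error(rows, ["diagnosis", "therapy", "window"], ["dose"]))  # 0.2
```

    In this toy data the dependency is violated less often once records are compared only within the same time bucket, and a user-chosen error threshold would decide whether it counts as holding approximately.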

    The impact of knowledge management processes on organizational resilience: data mining as an instrument of measurement.

    The aim of the research conducted for this thesis is to test the feasibility of using data mining (DM) to assess the relationship between and the impact of knowledge management (KM) on organizational resilience (OR). The emphasis currently placed on the value of intangible assets by private sector organizations and the recent increase in the use of data mining technologies are the key drivers in this evaluation of the use of data mining tools as an alternative to classical statistics when measuring intangibles. Data was collected using a questionnaire that was sent to the senior executives of a number of mid-sized companies located in the mid-west of the USA. Using Microsoft's SQL Server's Analytical Services (MSSAS) and the data provided by the respondents, five predictive models are built to test the suitability of the MSSAS' DM tool for assessing the relationships between and the impact of KM on OR. Of the five models constructed as part of this research, four classification models (two Naïve Bayes models, one neural network model, and one decision tree model) and one clustering model were found to be suitable tools for capturing the intricate relationships that exist between KM and OR. These models made it possible to evaluate the strengths of the relationships between KM and OR and to identify which KM processes contribute, and to what extent, to OR. In addition, the models enabled the collation of predicted OR scores, based on the responses given in the questionnaire. Finally, this research identifies some of the key challenges associated with using DM as a measurement instrument for assessing the relationship between and the impact of KM on OR. This research makes a number of significant contributions to the existing body of knowledge. It contributes to the understanding of the impact of KM on OR, to the understanding of the methods used to measure such impact and to the processes involved in measuring such impact using DM. 
From a practitioner perspective, this research contributes to the understanding of OR and provides a framework for achieving OR within an organizational context