
    Comparison of Imputation Methods for Univariate Time Series

    Handling missing values in time series data is crucial for accurate prediction and forecasting, because complete and accurate historical data are essential. There is a large body of research on multivariate time series imputation; however, imputation in univariate time series data is rarely considered, because associated covariates are unavailable. Missing values arise naturally, since almost all scientific disciplines that collect, store, and monitor data work with time series observations. Time series characteristics must therefore be taken into account to develop effective and acceptable methods for dealing with missing data. This work uses the statistical package R to assess and measure the effectiveness of imputation methods for univariate time series data. The imputation algorithms explored are evaluated using root mean square error, mean absolute error, and mean absolute percentage error, across four types of time series. Experimental findings show that seasonal decomposition performs best on series with a seasonal characteristic, followed by linear interpolation, and that Kalman smoothing produces values closest to the original time series with lower error rates than the other imputation techniques.
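    The evaluation protocol described above is easy to reproduce in outline: delete a fraction of values completely at random, impute, and score the imputed points against the held-out originals. The sketch below is a minimal Python analogue of that protocol (the study itself uses R, where packages such as imputeTS provide the seasonal decomposition and Kalman smoothing methods); the synthetic series, gap rate, and choice of linear interpolation are illustrative assumptions.

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(42)
        t = np.arange(365)
        # Synthetic seasonal series standing in for the benchmark data.
        series = pd.Series(10 + 5 * np.sin(2 * np.pi * t / 30) + rng.normal(0, 1, t.size))

        # Remove 10% of the points completely at random to create gaps.
        mask = rng.random(series.size) < 0.10
        corrupted = series.copy()
        corrupted[mask] = np.nan

        # One candidate method; the study also compares seasonal decomposition
        # and Kalman smoothing, among others.
        imputed = corrupted.interpolate(method="linear")

        truth, guess = series[mask], imputed[mask]
        rmse = np.sqrt(((guess - truth) ** 2).mean())
        mae = (guess - truth).abs().mean()
        mape = ((guess - truth).abs() / truth.abs()).mean() * 100
        print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  MAPE={mape:.2f}%")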

    Effects of Missing Data Imputation Methods on Univariate Time Series Forecasting with ARIMA and LSTM

    Missing data are common in real-life studies, and missing observations within a univariate time series cause analytical problems in the flow of the analysis. Imputation of missing values is an inevitable step in the analysis of every incomplete univariate time series. The reviewed literature shows that existing studies focus on comparing the distribution of imputed data; there is a gap of knowledge on how different imputation methods for univariate time series affect the fit and prediction performance of time series models. In this work, we evaluated the predictive performance of autoregressive integrated moving average (ARIMA) and long short-term memory (LSTM) models on time series imputed using Kalman smoothing on ARIMA, Kalman smoothing on a structural time series model, mean imputation, exponentially weighted moving average, simple moving average, and linear, cubic spline, Stineman, and KNN interpolation, under the missing completely at random (MCAR) mechanism. Missing values were generated at 10%, 15%, 25%, and 35% rates using complete data of 24-hour ambulatory diastolic blood pressure readings. Model performance was compared on imputed and original data using mean absolute percentage error (MAPE) and root mean square error (RMSE). Kalman smoothing on a structural time series model, exponentially weighted moving average, and Kalman smoothing on ARIMA were the best missing-data replacement techniques as the gap of missingness increased. The performance of mean imputation, cubic spline, KNN, and the other simple interpolation methods degraded significantly as the gap of missingness increased. The LSTM gave better predictions on the original training data, but the ARIMA predictions on imputed data gave consistent results across the four scenarios.
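    As a concrete illustration of one of the stronger methods above, the sketch below shows Kalman smoothing on a structural time series model used as an imputation step, here via statsmodels in Python; the paper's exact toolchain and model specification may differ, and the random-walk series is a stand-in for the blood pressure readings.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        y = pd.Series(np.cumsum(rng.normal(0, 1, 200)))  # stand-in series
        y_missing = y.copy()
        y_missing.iloc[50:70] = np.nan                   # a contiguous gap

        # State-space models tolerate NaNs: the Kalman filter skips the update
        # at missing time points, and the smoother fills them in afterwards.
        model = sm.tsa.UnobservedComponents(y_missing, level="local level")
        res = model.fit(disp=False)

        smoothed_level = pd.Series(res.smoothed_state[0], index=y.index)
        y_imputed = y_missing.fillna(smoothed_level)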

    Robust data cleaning procedure for large scale medium voltage distribution networks feeders

    Relatively little attention has been given to the short-term load forecasting problem of primary substations, mainly because load forecasts were not essential to secure the operation of passive distribution networks. With the increasing uptake of intermittent generation, distribution networks are becoming active, since power flows can change direction in a somewhat volatile fashion. The volatility of power flows introduces operational constraints on voltage control, system fault levels, thermal constraints, system losses, and high reverse power flows. Today, greater observability of the networks is essential to maintain a safe overall system and to maximise the utilisation of existing assets. Hence, to identify and anticipate any forthcoming critical operational conditions, network operators are compelled to broaden their visibility of the networks to time horizons that include not only real-time information but also hour-ahead and day-ahead forecasts. With this change in paradigm, large-scale short-term load forecasters are progressively being integrated as an essential component of distribution networks' control and planning tools. The acquisition of large-scale real-world data is prone to errors, and anomalies in data sets can lead to erroneous forecasting outcomes. Hence, data cleansing is an essential first step in data-driven learning techniques. Data cleansing is a labour-intensive and time-consuming task for the following reasons: 1) selecting a suitable cleansing method is not trivial; 2) generalising or automating a cleansing procedure is challenging; 3) there is a risk of introducing new errors into the data. This thesis attempts to maximise the performance of large-scale forecasting models by addressing the quality of the modelling data. Thus, the objectives of this research are to identify the causes of bad data quality, to design an automatic data cleansing procedure suitable for large-scale distribution network datasets, and to propose a rigorous framework for modelling MV distribution network feeder time series with deep learning architectures. The thesis discusses in detail the challenges in handling and modelling real-world distribution feeder time series. It also discusses a robust technique to detect outliers in the presence of level shifts, and suitable missing-value imputation techniques. All the concepts have been demonstrated on large real-world distribution network data.
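    One robust-outlier idea consistent with the requirement above is a Hampel-style filter: flag points that deviate from a rolling median by several rolling MADs, so that a genuine level shift in the feeder load moves the reference median with it instead of marking every subsequent point as anomalous. The Python sketch below is an illustrative baseline, not the thesis's detection technique; the window size and threshold are assumptions.

        import numpy as np
        import pandas as pd

        def hampel_outliers(x: pd.Series, window: int = 48, n_sigmas: float = 3.0) -> pd.Series:
            """Boolean mask of points deviating from the rolling median by > n_sigmas MADs."""
            med = x.rolling(window, center=True, min_periods=1).median()
            mad = (x - med).abs().rolling(window, center=True, min_periods=1).median()
            sigma = 1.4826 * mad  # MAD-to-standard-deviation scaling for Gaussian data
            return (x - med).abs() > n_sigmas * sigma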

    Data pre-processing to identify environmental risk factors associated with diabetes

    Genetics, diet, obesity, and lack of exercise play a major role in the development of type II diabetes. Additionally, environmental conditions are also linked to type II diabetes. The aim of this research is to identify the environmental conditions associated with diabetes. To achieve this, the study utilises hospital-admitted patient data in NSW integrated with weather, pollution, and demographic data. The environmental variables (air pollution and weather) change over time and space, necessitating spatiotemporal data analysis to identify associations. Moreover, the environmental variables are measured using sensors and often contain large gaps of missing values due to sensor failures. Therefore, enhanced methodologies for data cleaning and imputation are needed to facilitate research using these data. Hence, the objectives of this study are twofold: first, to develop a data cleaning and imputation framework with improved methodologies to clean and pre-process the environmental data, and second, to identify environmental conditions associated with diabetes. This study develops a novel data-cleaning framework that streamlines the practice of data analysis and visualisation, specifically for studying environmental factors such as climate change monitoring and the effects of weather and pollution. The framework is designed to efficiently handle data collected by remote sensors, enabling more accurate and comprehensive analyses of environmental phenomena than would otherwise be possible. The study initially focuses on the Sydney region, identifies missing data patterns, and utilises established imputation methods. It assesses the performance of existing techniques and finds that Kalman smoothing on structural time series models outperforms the other methods. However, when dealing with larger gaps in missing data, none of the existing methods yields satisfactory results. To address this, the study proposes enhanced methodologies for filling substantial gaps in environmental datasets. The first proposed algorithm employs regularized regression models to fill large gaps in air quality data using a univariate approach. It is then extended to incorporate seasonal patterns and to expand its applicability to weather data with similar patterns. Furthermore, the algorithm is enhanced by incorporating other correlated variables to accurately fill substantial gaps in environmental variables. The algorithm presented in this thesis consistently outperforms other methods in imputing large gaps. It is applicable for filling large gaps in air pollution and weather data, facilitating downstream analysis.
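    The gap-filling idea above can be sketched as follows: regress the target sensor on seasonal (calendar) features plus a correlated neighbouring sensor with a regularized linear model, then predict into the large gap. This Python sketch assumes an hourly DatetimeIndex; the feature set, Ridge penalty, and column names are illustrative assumptions rather than the thesis's exact specification.

        import numpy as np
        import pandas as pd
        from sklearn.linear_model import Ridge

        def fill_large_gap(df: pd.DataFrame, target: str, neighbour: str) -> pd.Series:
            hour = df.index.hour
            features = pd.DataFrame({
                "sin_day": np.sin(2 * np.pi * hour / 24),  # daily seasonal pattern
                "cos_day": np.cos(2 * np.pi * hour / 24),
                "neighbour": df[neighbour],                # correlated variable
            }, index=df.index)

            usable = features.notna().all(axis=1)
            known = df[target].notna() & usable
            model = Ridge(alpha=1.0).fit(features[known], df.loc[known, target])

            out = df[target].copy()
            gap = out.isna() & usable
            out[gap] = model.predict(features[gap])  # fill only where predictors exist
            return out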

    Imputation, modelling and optimal sampling design for digital camera data in recreational fisheries monitoring

    Digital camera monitoring has evolved as an active application-oriented scheme to help address questions in areas such as fisheries, ecology, computer vision, artificial intelligence, and criminology. In recreational fisheries research, digital camera monitoring has become a viable option for probability-based survey methods and is also used for corroboration and validation. In comparison to on-site surveys (e.g. boat ramp surveys), digital cameras provide a cost-effective method of monitoring boating activity and fishing effort, including night-time fishing activities. However, there are challenges in the use of digital camera monitoring that need to be resolved; notably, missing data problems and the cost of data interpretation are among the most pertinent. This study provides relevant statistical support to address these challenges of digital camera monitoring of boating effort, to improve its utility for recreational fisheries management in Western Australia and elsewhere, with the capacity to extend to other areas of application. Digital cameras can provide continuous recordings of boating and other recreational fishing activities; however, interruptions of camera operations can lead to significant gaps within the data. To fill these gaps, climatic and other temporal classification variables were considered as predictors of boating effort (defined as the number of powerboat launches and retrievals). A generalized linear mixed effects model built on a fully conditional specification multiple imputation framework was considered to fill the gaps in the camera dataset. Specifically, the zero-inflated Poisson model was found to satisfactorily impute plausible values for missing observations over varied durations of outages in the digital camera monitoring data of recreational boating effort. Additional modelling options were explored to guide both short- and long-term forecasting of boating activity and to support management decisions in monitoring recreational fisheries. Autoregressive conditional Poisson (ACP) and integer-valued autoregressive (INAR) models were identified as useful time series models for predicting the short-term behaviour of such data. In Western Australia, digital camera monitoring data that coincide with 12-month state-wide boat-based surveys (now conducted on a triennial basis) have been read, but the periods between the surveys have not. A Bayesian regression framework was applied to describe the temporal distribution of recreational boating effort using climatic and temporally classified variables, to help construct data for such missing periods. This can potentially provide a useful cost-saving alternative for obtaining continuous time series data on boating effort. Finally, data from digital camera monitoring are often manually interpreted, and the associated cost can be substantial, especially if multiple sites are involved. Empirical support for low-level monitoring schemes for digital cameras is provided. It was found that manual interpretation of camera footage for 40% of the days within a year is an adequate level of sampling effort to obtain unbiased, precise and accurate estimates that meet broad management objectives. A well-balanced low-level monitoring scheme will ultimately reduce the cost of manual interpretation and produce unbiased estimates of recreational fishing indices from digital camera surveys.
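    To make the imputation step concrete, the sketch below fits a zero-inflated Poisson to the observed daily launch counts with temporal covariates and draws plausible counts for camera-outage days. It is a crude single-imputation stand-in, in Python, for the fully conditional specification, mixed-effects multiple imputation described above; drawing from a plain Poisson at the predicted mean is a simplification, and the covariates are assumed placeholders.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.discrete.count_model import ZeroInflatedPoisson

        def impute_boat_counts(counts: pd.Series, covars: pd.DataFrame, seed: int = 1) -> pd.Series:
            rng = np.random.default_rng(seed)
            X = sm.add_constant(covars)
            obs = counts.notna()

            # Fit on observed days only; for simplicity the same covariates
            # drive the zero-inflation part of the model.
            res = ZeroInflatedPoisson(counts[obs], X[obs], exog_infl=X[obs]).fit(disp=False)

            out = counts.copy()
            gaps = ~obs
            mu = res.predict(X[gaps], exog_infl=X[gaps])  # expected count per gap day
            out[gaps] = rng.poisson(mu)                   # draw plausible integer counts
            return out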

    A Step towards Advancing Digital Phenotyping in Mental Healthcare

    Smartphones and wrist-worn wearable devices have become pervasive in recent years. According to published statistics, nearly 84% of the world's population owns a smartphone, and almost 10% own a wearable device today (2022). These devices continuously generate data from multiple sensors and apps, creating our digital phenotypes. This opens new research opportunities, particularly in mental health care, which has previously relied almost exclusively on self-reports of mental health symptoms. Unobtrusive monitoring using patients' devices may yield clinically valuable markers that can improve diagnostic processes, tailor treatment choices, provide continuous insights into a patient's condition for actionable outcomes, such as early signs of relapse, and support new intervention models. However, these data sources must be translated into meaningful, actionable features related to mental health to achieve their full potential. In the mental health field, there is a great need, and much to be gained, in defining a way to continuously assess the evolution of patients' mental states, ideally in their everyday environment, to support monitoring and treatment by health care providers. A smartphone-based approach may be valuable for gathering long-term objective data, alongside the routinely used self-ratings, to predict clinical state changes and investigate causal inferences about state changes in patients (e.g., those with affective disorders). Being objective does not imply that passive data collection is perfect. It has several challenges: some sensors generate vast volumes of data, others cause significant battery drain, the analysis of raw passive data is complicated, and collecting certain types of data may interfere with the phenotype of interest. Nonetheless, machine learning is well suited to address these matters and advance psychiatry's era of personalised medicine. This work aimed to advance research on mobile and wearable sensors for mental health monitoring. We applied supervised and unsupervised machine learning methods to model and understand mental disease evolution based on patients' digital phenotypes and on clinician assessments at follow-up visits, which provide the ground truth. We needed to cope with regularly and irregularly sampled, high-dimensional, heterogeneous time series data susceptible to distortion and missingness; the developed methods must therefore be robust to these limitations and handle missing data properly. Throughout the projects presented here, we used probabilistic latent variable models for data imputation and feature extraction, namely mixture models (MM) and hidden Markov models (HMM). These unsupervised models can learn even in the presence of missing data by marginalising the missing values as a function of the observed ones. Once the generative models are trained on a dataset with missing values, they can be used to generate samples for imputation: first, the most probable component/state is found for each sample; then, sampling from that component's distribution yields valid and robust parameter estimates and explicit imputed values that can be analysed as outcomes or predictors. The imputation process can be repeated several times, creating multiple datasets, thereby accounting for the uncertainty in the imputed values and implicitly augmenting the data.
Moreover, these models are robust to moderate deviations of the observed data from the assumed underlying distribution and provide accurate estimates even when missingness is high. Depending on the properties of the data at hand, we employed feature extraction methods combined with classical machine learning algorithms, or deep learning-based techniques for temporal modelling, to predict various mental health outcomes of psychiatric outpatients: emotional state, World Health Organisation Disability Assessment Schedule (WHODAS 2.0) functionality scores, and Generalised Anxiety Disorder-7 (GAD-7) scores. We mainly focused on one-size-fits-all models, as the labelled sample size per patient was limited; however, in the mood prediction case it was possible to apply personalised models. Integrating machines and algorithms into the clinical workflow requires interpretability to increase acceptance. Therefore, we also analysed feature importance by computing Shapley additive explanations (SHAP) values, which summarise the essential features in a machine learning model by quantifying how much each feature contributes, positively or negatively, to the prediction of the target variable. The provided solutions are proofs of concept that require further clinical validation before deployment in the clinical workflow. Still, the results are promising and lay foundations for future research and collaboration among clinicians, patients, and computer scientists, setting out paths to advance technology-based mental healthcare.
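    The imputation mechanics described above can be sketched for the mixture model case: score each component on the observed dimensions only (marginalisation), pick the most probable component, then sample the missing dimensions from that component's conditional Gaussian. In the Python sketch below, the weights, means, and covariances are assumed to come from an EM fit that already handled the missing values; that training step is not shown. Repeating the draw yields the multiple imputed datasets mentioned above.

        import numpy as np
        from scipy.stats import multivariate_normal

        def impute_row(x, weights, means, covs, rng):
            """x: 1-D array with np.nan marking missing entries."""
            m = np.isnan(x)  # missing mask
            o = ~m           # observed mask

            # Posterior over components using only the observed coordinates.
            log_post = np.array([
                np.log(w) + multivariate_normal.logpdf(x[o], mu[o], cov[np.ix_(o, o)])
                for w, mu, cov in zip(weights, means, covs)
            ])
            k = int(np.argmax(log_post))  # most probable component

            # Conditional Gaussian of missing given observed, for component k.
            mu, cov = means[k], covs[k]
            coo, cmo, cmm = cov[np.ix_(o, o)], cov[np.ix_(m, o)], cov[np.ix_(m, m)]
            cond_mu = mu[m] + cmo @ np.linalg.solve(coo, x[o] - mu[o])
            cond_cov = cmm - cmo @ np.linalg.solve(coo, cmo.T)

            out = x.copy()
            out[m] = rng.multivariate_normal(cond_mu, cond_cov)  # explicit imputed values
            return out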

    Stacked LSTM for wind turbine yaw fault forecasting based on SCADA data analysis

    As society makes its final shift from reliance on fossil resources alone towards greater use of renewable resources, improving the efficiency and profitability of renewables is of primary importance. The performance of a wind turbine depends on the wind conditions as well as on the optimal extraction of kinetic energy and its transformation into electricity. A wind turbine is a complex system, and the coordination of all its subsystems must be carefully orchestrated. The focus of this thesis is improving the prediction of yaw faults based on data collected via the SCADA system. The data used in the experiments were kindly provided by Statkraft. The ability to forecast the onset of a fault alarm some time in advance makes it possible to implement the measures required to eliminate the fault. Remote and automatic forecasting of such faults is of utmost importance for the offshore wind parks now emerging all around the world. The goal of this thesis is to improve the algorithms implemented by fellow student Tallaksrud (2021). The employed strategy focused on expanding the amount of information in the dataset, studying the influence of parameters such as rolling-window length, the method used for handling missing data, the number of features in the dataset, and the number of time sequences on the probability of a yaw-alarm onset. A longer forecasting horizon is preferred, to allow time to fix the fault without risking long downtime and failure of other subsystems. Results for 12 stacked models are presented together with a single-layer LSTM model, to test whether stacked models predict better than a simple one. The best result was produced by the single-layer model, with an MAE of 0.001. The models show varying forecasting behaviour as a result of randomness and instability. The conclusion is that the stacked LSTM models do not solve the problem addressed in this thesis.
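    For orientation, a stacked LSTM of the kind compared in the thesis can be expressed in a few lines of Keras: two recurrent layers over rolling windows of SCADA channels, ending in a sigmoid for the probability of a yaw-alarm onset. The layer sizes, window length, and feature count below are placeholder assumptions, not the thesis's configuration.

        from tensorflow import keras

        window, n_features = 36, 12  # rolling-window length x SCADA channels (assumed)

        model = keras.Sequential([
            keras.layers.Input(shape=(window, n_features)),
            keras.layers.LSTM(64, return_sequences=True),  # first stacked layer
            keras.layers.LSTM(32),                         # second layer consumes the sequence
            keras.layers.Dense(1, activation="sigmoid"),   # P(yaw alarm within horizon)
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["mae"])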

    Multivariate extreme storm surge flooding events on the UK’s east coast

    In the United Kingdom (UK), floods, and specifically coastal flooding, are a hazard commonly expected to increase due to the impacts of climate change and development in at-risk areas. East coast storm surges have been extremely devastating in the recent past, such as the events of 1953 or the winter of 2013/14. The challenge is to analyse the risk of widespread, concurrent and clustered coastal flooding at a regional scale. It is widely accepted that extreme value analysis (EVA) is an important tool for studying coastal flood risk, but it requires the estimation of a threshold to define extreme events and has to cope with missing values within the dataset. The main areas of research discussed in this thesis involve improving the way extreme thresholds are selected and providing an alternative approach to multivariate missing values. Applying an automated threshold selection method to the data yields more plausible and less subjective results than the traditional manual approach. The alternative multivariate analysis at regional scale considers the statistical dependence between locations and which combinations of events to take into account in order to handle missing values within the time series dataset. Both areas of research develop existing extreme value methodologies, thereby enhancing the modelling of predicted future storm surge coastal flooding. An application of this research is to analyse the potential impacts on proposed nuclear power stations, considering the increased likelihood of extreme storm surge events. This research undertakes EVA with the statistical programming language R. Although R provides a range of functions embedded in different packages, it was necessary to create new functions, scripts and commands to improve the analysis of extremes, undertake the threshold selection and cope with missing values. As a case study, this research selects fourteen tide gauges along the east coast of the UK, from Lerwick to Dover. The main measure is skew surge, because it can be treated as an independent and identically distributed variable and all phase differences are removed from the calculations. The multivariate model provides the likelihood of future significant storm surge flooding events along the east coast of the UK. Results show that estimated return levels for 50, 100 and 250 years imply a higher impact of ≈1 m at Felixstowe, Sheerness, Immingham, Cromer and Lowestoft, while the northern gauges show an increase of ≈0.5 m. Moreover, due to the overdispersion of the dataset, high predicted values are estimated at Lowestoft, Felixstowe and Dover, where nuclear power sites are currently generating energy and new sites will be built in the future. In summary, the main aim of this research is to develop a multivariate extreme model to analyse the potential impacts of future storm surge coastal flooding at a regional scale. By analysing extreme skew surge events at a regional level, a more complex storm surge coastal flooding model can be elaborated, and therefore better results can be obtained. The multivariate extreme model requires methods for selecting extreme events and for handling missing values within the dataset. Hence, the proposed Automated Graphic Threshold Selection (AGTS) method provides a mathematical and computational tool to select extreme thresholds, and the Multivariate Extreme Missing Value Approach (MEMVA) handles the missing values in the time series dataset. The multivariate extreme model has the potential to improve the regional risk assessment of widespread, concurrent and clustered coastal flooding events.
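    The peaks-over-threshold step underlying such an analysis can be sketched as follows: fit a generalized Pareto distribution (GPD) to skew-surge exceedances of a threshold and convert the fit to an N-year return level. In the Python sketch below, a simple quantile stands in for the thesis's AGTS threshold choice, and the return-level formula assumes a non-zero shape parameter.

        import numpy as np
        from scipy.stats import genpareto

        def return_level(skew_surge, years, obs_per_year, threshold_q=0.95):
            u = np.quantile(skew_surge, threshold_q)         # placeholder threshold
            exceed = skew_surge[skew_surge > u] - u
            shape, _, scale = genpareto.fit(exceed, floc=0)  # GPD fit to exceedances
            rate = exceed.size / skew_surge.size             # empirical P(X > u)
            m = years * obs_per_year                         # observations in N years
            # Standard GPD return level: x_m = u + (sigma/xi) * ((m * rate)^xi - 1)
            return u + (scale / shape) * ((m * rate) ** shape - 1.0)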

    Intelligent Systems Approach for Classification and Management of Patients with Headache

    Primary headache disorders are the most common complaints worldwide. The socioeconomic and personal impact of headache disorders is enormous, as they are the leading cause of workplace absence. Consultations for headache are increasing as the population grows, lives longer, and more people have multiple conditions; however, access to specialist services across the UK is currently inequitable, because the number of trained consultant neurologists in the UK is ten times lower than in other European countries. Additionally, more than two thirds of headache cases presenting to primary care are labelled as unspecified headache. Therefore, an alternative pathway to diagnose and manage patients with primary headache could be crucial to reducing the need for specialist assessment and to increasing capacity within the current service model. Several recent studies have targeted this issue through the development of clinical decision support systems, which can help non-specialist doctors and general practitioners diagnose patients with primary headache disorders in primary care clinics. However, the majority of these studies follow a rule-based approach, in which the rules are summarised and encoded by a computer engineer; this style carries many downsides, which are discussed later in this dissertation. This study adopts a completely different approach, using machine learning for the classification of primary headache disorders, based on a dataset of 832 records of patients with primary headaches originating from three medical centres in Turkey. The dataset covers three main types of primary headache: tension-type headache in both episodic and chronic forms; migraine with and without aura; and trigeminal autonomic cephalalgia, further subdivided into cluster headache, paroxysmal hemicrania, and short-lasting unilateral neuralgiform headache attacks with conjunctival injection and tearing. Six popular machine learning classifiers, including linear and non-linear ensemble learners as well as one regression-based procedure, were evaluated for the classification of primary headaches within a supervised learning setting, achieving highest aggregate performance outcomes of AUC 0.923, sensitivity 0.897, and overall classification accuracy of 0.843. This study also introduces the proposed HydroApp system, an M-health-based personalised application for the follow-up of patients with long-term conditions such as chronic headache and hydrocephalus. We developed this system under the supervision of headache specialists at Ashford hospital, London, and neurology experts at the Walton Centre and Alder Hey hospital, Liverpool. We investigated the acceptance of such an M-health-based system via an online questionnaire, in which 86% of paediatric patients and 60% of adult patients were interested in using the HydroApp system to manage their conditions. Features and functions offered by the HydroApp system, such as recording headache scores, recording general health and well-being, and alerting the treating team, were perceived as very or extremely important from the patients' point of view. The study concludes that advances in intelligent systems and M-health applications represent a promising avenue for identifying alternative solutions, which in turn can increase capacity in the current service model and improve diagnostic capability in the primary headache domain and beyond.
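    The supervised evaluation reported above follows a standard pattern, sketched below in Python: cross-validated class probabilities from a classifier over the tabular headache features, scored with one-vs-rest AUC and macro-averaged sensitivity (recall). The random forest is merely one stand-in for the six classifiers compared; the sketch assumes integer-coded class labels 0..K-1 and omits data loading.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_predict
        from sklearn.metrics import roc_auc_score, recall_score

        def evaluate(X, y):
            clf = RandomForestClassifier(n_estimators=300, random_state=0)
            proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
            auc = roc_auc_score(y, proba, multi_class="ovr")              # one-vs-rest AUC
            sens = recall_score(y, proba.argmax(axis=1), average="macro")  # macro sensitivity
            return auc, sens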

    Contributions to time series data mining towards the detection of outliers/anomalies

    Recent technological advances have brought great progress in data collection, making it possible to gather large amounts of data over time. These data are commonly presented as time series, in which observations are recorded chronologically and are correlated in time. These temporal dependencies often contain significant and useful information, and in recent years great interest has emerged in extracting it. The research area that focuses on this task is called time series data mining. The research community in this area has addressed tasks such as classification, forecasting, clustering, and the detection of outliers/anomalies. Outliers or anomalies are observations that do not follow the expected behaviour of a time series. They usually represent unwanted measurements or events of interest, so detecting them is relevant because they can degrade data quality or reflect phenomena of interest to the analyst. This thesis presents several contributions to the field of time series data mining, more specifically on the detection of outliers or anomalies. These contributions can be divided into two parts. On the one hand, the thesis contributes to the detection of outliers or anomalies in time series: it offers a review of the techniques in the literature and presents a new anomaly detection technique for univariate time series, based on self-supervised learning, for the detection of water leaks. On the other hand, the thesis also introduces contributions related to the treatment of time series with missing values and demonstrates their applicability in the field of anomaly detection.
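    A self-supervised scheme in the spirit of the leak-detection contribution can be sketched as follows: train a model on the series' own past to predict the next value (no anomaly labels are needed) and flag points whose prediction error is extreme. The Python sketch below uses a lagged linear regressor and a fixed residual threshold as illustrative assumptions; it is not the thesis's method.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        def flag_anomalies(y: np.ndarray, lags: int = 24, z: float = 4.0) -> np.ndarray:
            # Build lagged features: y[t-lags..t-1] predicts y[t].
            X = np.column_stack([y[i : len(y) - lags + i] for i in range(lags)])
            target = y[lags:]
            resid = target - LinearRegression().fit(X, target).predict(X)
            mask = np.zeros(len(y), dtype=bool)
            mask[lags:] = np.abs(resid) > z * resid.std()  # extreme self-prediction errors
            return mask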