108 research outputs found
IoT Data Imputation with Incremental Multiple Linear Regression
In this paper, we address the problem of missing data imputation in the IoT domain. More specifically, we propose an Incremental Space-Time-based Model (ISTM) for repairing missing values in real-time IoT data streams. ISTM is based on incremental multiple linear regression, which processes data as follows: upon data arrival, ISTM updates the model by re-reading an intermediary data matrix instead of accessing all historical information. If a missing value is detected, ISTM estimates it from recent historical data and the observations of sensors neighboring the faulty one. Experiments conducted with real traffic data show the performance of ISTM in comparison with well-known techniques.
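The incremental update at the heart of such a scheme can be sketched by maintaining the regression's sufficient statistics, so each new observation is absorbed without revisiting history. This is a minimal illustration, not the authors' ISTM implementation; the class name, the two-neighbor layout, and the toy coefficients are assumptions made for the example:

```python
import numpy as np

class IncrementalLinearRegression:
    """Multiple linear regression fitted incrementally.

    Keeps only the sufficient statistics X^T X and X^T y, so each new
    sample is absorbed in O(d^2) without rereading historical data.
    """

    def __init__(self, n_features, ridge=1e-6):
        self.XtX = ridge * np.eye(n_features + 1)  # +1 for the intercept
        self.Xty = np.zeros(n_features + 1)

    def partial_fit(self, x, y):
        z = np.append(x, 1.0)       # augment with intercept term
        self.XtX += np.outer(z, z)
        self.Xty += y * z

    def predict(self, x):
        w = np.linalg.solve(self.XtX, self.Xty)
        return np.append(x, 1.0) @ w

# Impute a sensor's missing reading from two neighboring sensors.
model = IncrementalLinearRegression(n_features=2)
rng = np.random.default_rng(0)
for _ in range(200):
    neighbors = rng.normal(size=2)
    target = 2.0 * neighbors[0] - 1.0 * neighbors[1] + 0.5
    model.partial_fit(neighbors, target)

estimate = model.predict(np.array([1.0, 1.0]))  # true value: 1.5
```

Because only X^T X and X^T y are stored, the per-sample update cost stays constant no matter how much history has already streamed past.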
Online Machine Learning for Inference from Multivariate Time-series
Inference and data analysis over networks have become significant areas of research due to the increasing prevalence of interconnected systems and the growing volume of data they produce. Many of these systems generate data in the form of multivariate time series, which are collections of time series data that are observed simultaneously across multiple variables. For example, EEG measurements of the brain produce multivariate time series data that record the electrical activity of different brain regions over time. Cyber-physical systems generate multivariate time series that capture the behaviour of physical systems in response to cybernetic inputs. Similarly, financial time series reflect the dynamics of multiple financial instruments or market indices over time. Through the analysis of these time series, one can uncover important details about the behavior of the system, detect patterns, and make predictions. Therefore, designing effective methods for data analysis and inference over networks of multivariate time series is a crucial area of research with numerous applications across various fields. In this Ph.D. Thesis, our focus is on identifying the directed relationships between time series and leveraging this information to design algorithms for data prediction as well as missing data imputation. This Ph.D. thesis is organized as a compendium of papers, which consists of seven chapters and appendices. The first chapter is dedicated to motivation and literature survey, whereas in the second chapter, we present the fundamental concepts that readers should understand to grasp the material presented in the dissertation with ease. In the third chapter, we present three online nonlinear topology identification algorithms, namely NL-TISO, RFNL-TISO, and RFNL-TIRSO. In this chapter, we assume the data is generated from a sparse nonlinear vector autoregressive model (VAR), and propose online data-driven solutions for identifying nonlinear VAR topology. 
We also provide convergence guarantees in terms of dynamic regret for the proposed algorithm RFNL-TIRSO. Chapters four and five of the dissertation delve into the issue of missing data and explore how the learned topology can be leveraged to address this challenge. Chapter five is distinct from other chapters in its exclusive focus on edge flow data and introduces an online imputation strategy based on a simplicial complex framework that leverages the known network structure in addition to the learned topology. Chapter six of the dissertation takes a different approach, assuming that the data is generated from nonlinear structural equation models. In this chapter, we propose an online topology identification algorithm using a time-structured approach, incorporating information from both the data and the model evolution. The algorithm is shown to have convergence guarantees achieved by bounding the dynamic regret. Finally, chapter seven of the dissertation provides concluding remarks and outlines potential future research directions.
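The flavor of online topology identification described above can be illustrated with a deliberately simplified linear analogue; the thesis's NL-TISO, RFNL-TISO, and RFNL-TIRSO algorithms are nonlinear and random-feature based, so the sketch below, its function names, and its step/regularization parameters are all assumptions. It runs one proximal-gradient step per sample on a sparse VAR(1) coefficient matrix, whose support is read off as the learned topology:

```python
import numpy as np

def soft_threshold(A, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 norm.
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def online_var_topology(stream, n_nodes, step=0.01, lam=0.05):
    """Online proximal-gradient estimate of a sparse VAR(1) matrix A.

    At each time t we take one gradient step on ||y_t - A y_{t-1}||^2,
    then an l1 shrinkage; the support of A is the inferred topology.
    """
    A = np.zeros((n_nodes, n_nodes))
    y_prev = None
    for y in stream:
        if y_prev is not None:
            err = y - A @ y_prev               # one-step prediction error
            A += step * np.outer(err, y_prev)  # gradient step
            A = soft_threshold(A, step * lam)  # sparsity-inducing shrinkage
        y_prev = y
    return A

# Toy chain topology: node 0 drives node 1, node 1 drives node 2.
rng = np.random.default_rng(1)
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.8, 0.2, 0.0],
                   [0.0, 0.8, 0.2]])
y = np.zeros(3)
data = []
for _ in range(5000):
    y = A_true @ y + rng.normal(size=3)
    data.append(y.copy())

A_hat = online_var_topology(data, n_nodes=3)
```

The shrinkage step is what lets the estimator declare absent edges exactly zero rather than merely small.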
Support matrix machine: A review
Support vector machine (SVM) is one of the most studied paradigms in the
realm of machine learning for classification and regression problems. It relies
on vectorized input data. However, a significant portion of the real-world data
exists in matrix format, which is given as input to SVM by reshaping the
matrices into vectors. The process of reshaping disrupts the spatial
correlations inherent in the matrix data. Also, converting matrices into
vectors results in input data with a high dimensionality, which introduces
significant computational complexity. To overcome these issues in classifying
matrix input data, support matrix machine (SMM) is proposed. It represents one
of the emerging methodologies tailored for handling matrix input data. The SMM
method preserves the structural information of the matrix data by using the
spectral elastic net property which is a combination of the nuclear norm and
Frobenius norm. This article provides the first in-depth analysis of the
development of the SMM model, which can be used as a thorough summary by both
novices and experts. We discuss numerous SMM variants, such as robust, sparse,
class imbalance, and multi-class classification models. We also analyze the
applications of the SMM model and conclude the article by outlining potential
future research avenues and possibilities that may motivate academics to
advance the SMM algorithm.
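The spectral elastic net mentioned above combines the nuclear norm with the Frobenius norm, and its proximal operator has a closed form built on singular value thresholding. The following is a minimal sketch of that standard construction; the function names are ours for illustration, not taken from any specific SMM implementation:

```python
import numpy as np

def svt(W, tau):
    """Singular value thresholding: the prox of tau * nuclear norm.

    Shrinks every singular value of W by tau, zeroing out the small
    ones, which is what preserves low-rank structure in a matrix of
    regression coefficients.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def spectral_elastic_net_prox(W, tau, rho):
    """Prox of tau*||W||_* + (rho/2)*||W||_F^2.

    The Frobenius term only rescales its argument, so the combined
    prox composes in closed form with singular value thresholding.
    """
    return svt(W, tau) / (1.0 + rho)

# Denoising a rank-1 matrix: thresholding recovers the low rank.
rng = np.random.default_rng(2)
L = np.outer(rng.normal(size=8), rng.normal(size=6))   # rank-1 signal
noisy = L + 0.01 * rng.normal(size=L.shape)
denoised = svt(noisy, 0.1)                             # exactly rank 1 again
```

This prox is the building block that SMM-style solvers iterate inside an otherwise SVM-like optimization.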
Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data
In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search, and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions substantially rely on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches to address name disambiguation tasks from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.
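The streaming setting described above can be illustrated with a deliberately simple greedy scheme over anonymized relational data. This is an assumption-laden toy, not one of the dissertation's algorithms; the Jaccard threshold and the record format (a set of neighbor ids per document) are invented for the example:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of (anonymized) neighbor ids."""
    return len(a & b) / len(a | b) if a | b else 0.0

def stream_disambiguate(records, threshold=0.3):
    """Greedy online name disambiguation over anonymized relational data.

    Each record is a set of neighbor ids (e.g. co-author vertices). A
    record joins the most similar existing entity cluster, or founds a
    new entity when no cluster is similar enough, which is how new
    ambiguous persons with no preexisting records surface in a stream.
    """
    clusters = []  # one accumulated neighbor set per inferred person
    labels = []
    for rec in records:
        sims = [jaccard(rec, c) for c in clusters]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))
            clusters[best] |= rec
            labels.append(best)
        else:
            clusters.append(set(rec))
            labels.append(len(clusters) - 1)
    return labels

# Two namesakes: one collaborates with {1,2,3,4}, the other with {8,9,10}.
records = [{1, 2, 3}, {2, 3, 4}, {8, 9}, {1, 3}, {8, 9, 10}]
labels = stream_disambiguate(records)  # → [0, 0, 1, 0, 1]
```

Record three opens a second entity because it shares no neighbors with the first cluster, mirroring the online detection of a previously unseen namesake.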
Personalized data analytics for internet-of-things-based health monitoring
The Internet-of-Things (IoT) has great potential to fundamentally alter the delivery of modern healthcare, enabling healthcare solutions outside the limits of conventional clinical settings. It can offer ubiquitous monitoring to at-risk population groups and allow diagnostic care, preventive care, and early intervention in everyday life. These services can have profound impacts on many aspects of health and well-being. However, this field is still at an infancy stage, and the use of IoT-based systems in real-world healthcare applications introduces new challenges. Healthcare applications necessitate satisfactory quality attributes such as reliability and accuracy due to their mission-critical nature, while at the same time, IoT-based systems mostly operate over constrained shared sensing, communication, and computing resources. There is a need to investigate this synergy between the IoT technologies and healthcare applications from a user-centered perspective. Such a study should examine the role and requirements of IoT-based systems in real-world health monitoring applications. Moreover, conventional computing architecture and data analytic approaches introduced for IoT systems are insufficient when used to target health and well-being purposes, as they are unable to overcome the limitations of IoT systems while fulfilling the needs of healthcare applications. This thesis aims to address these issues by proposing an intelligent use of data and computing resources in IoT-based systems, which can lead to a high-level performance and satisfy the stringent requirements. For this purpose, this thesis first delves into the state-of-the-art IoT-enabled healthcare systems proposed for in-home and in-hospital monitoring. The findings are analyzed and categorized into different domains from a user-centered perspective. 
The selection of home-based applications is focused on the monitoring of the elderly who require more remote care and support compared to other groups of people. In contrast, the hospital-based applications include the role of existing IoT in patient monitoring and hospital management systems. Then, the objectives and requirements of each domain are investigated and discussed. This thesis proposes personalized data analytic approaches to fulfill the requirements and meet the objectives of IoT-based healthcare systems. In this regard, a new computing architecture is introduced, using computing resources in different layers of IoT to provide a high level of availability and accuracy for healthcare services. This architecture allows the hierarchical partitioning of machine learning algorithms in these systems and enables an adaptive system behavior with respect to the user's condition. In addition, personalized data fusion and modeling techniques are presented, exploiting multivariate and longitudinal data in IoT systems to improve the quality attributes of healthcare applications. First, a real-time missing data resilient decision-making technique is proposed for health monitoring systems. The technique tailors various data resources in IoT systems to accurately estimate health decisions despite missing data in the monitoring. Second, a personalized model is presented, enabling variations and event detection in long-term monitoring systems. The model evaluates the sleep quality of users according to their own historical data. Finally, the performance of the computing architecture and the techniques are evaluated in this thesis using two case studies. The first case study consists of real-time arrhythmia detection in electrocardiography signals collected from patients suffering from cardiovascular diseases. The second case study is continuous maternal health monitoring during pregnancy and postpartum. 
It includes a real human-subject trial carried out with twenty pregnant women over seven months.
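The missing-data-resilient decision-making idea above can be sketched as weighted fusion over whichever sources are currently available. This is an illustrative stand-in, not the thesis's technique; the source names and weights are assumptions for the example:

```python
def resilient_decision(scores, weights):
    """Fuse per-source risk scores while tolerating missing sources.

    `scores` maps source -> value in [0, 1], or None when the reading
    is missing; weights are renormalized over the available sources so
    a decision is still produced from partial IoT data.
    """
    present = {k: v for k, v in scores.items() if v is not None}
    if not present:
        raise ValueError("no source available")
    total = sum(weights[k] for k in present)
    return sum(weights[k] * present[k] for k in present) / total

# Hypothetical sources and weights for an arrhythmia-style risk score.
weights = {"ecg": 0.5, "ppg": 0.3, "activity": 0.2}
full = resilient_decision({"ecg": 0.9, "ppg": 0.8, "activity": 0.2}, weights)
partial = resilient_decision({"ecg": None, "ppg": 0.8, "activity": 0.2}, weights)
```

Renormalizing rather than zero-filling keeps the decision scale comparable whether one sensor or all of them reported.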
Applications
Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine, for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling, for quality control during manufacturing processes; in traffic and logistics, for smart cities; and for mobile communications.
Using Computational Psychology to Profile Unhappy and Happy People
Social psychology has a long tradition of studying the personality traits associated with subjective well-being (SWB). However, research often depends on a priori but unempirical assumptions about how to (a) measure the constructs, and (b) mitigate confounded associations. These assumptions have caused profligate and often contradictory findings. To remedy this, I demonstrate how a computational psychology paradigm, predicated on large online data and iterative analyses, might help isolate more robust personality trait associations.
At the outset, I focussed on univariate measurement. In the first set of studies, I evaluated the extent researchers could measure psychological characteristics at scale from online behaviour. Specifically, I used a combination of simulated and real-world data to determine whether predicted constructs like big five personality were accurate for specific individuals. I found that it was usually more effective to simply assume everyone was average for the characteristic, and that imprecision was not remedied by collapsing predicted scores into buckets (e.g. low, medium, high). Overall, I concluded that predictions were unlikely to yield precise individual-level insights, but could still be used to examine normative group-based tendencies. In the second set of studies, I evaluated the construct validity of a novel SWB scale. Specifically, I repurposed the balanced measure of psychological needs (BMPN), which was originally designed to capture the substrates of intrinsic motivation. I found that the BMPN robustly captured (a) dissociable experiences of suffering and flourishing, (b) more transitive SWB than the existing criterion measure, and (c) unique variation in real-world outcomes. Thus, I used it as my primary outcome.
Then, I focussed on bivariate associations. The third set of studies extracted pairs of participants with similar patterns of covarying personality traits but differing target traits, to isolate less-confounded SWB correlations. I found my extraction method, an adapted version of propensity score matching, outperformed even advanced machine learning alternatives. The final set of studies isolated the subset of facets that had the most robust associations with SWB. It combined real-world surveys with a total of eight billion simulated participants to find the traits most prevalent in extreme suffering and flourishing. For validation purposes, I first found that depression and cheerfulness, the trait components of SWB, were highly implicated in both suffering and flourishing. Then, I found that self-discipline was the only other trait implicated in both forms of SWB. However, there were also domain-specific effects: anxiety, vulnerability and cooperation were implicated in just suffering; and assertiveness, altruism and self-efficacy were implicated in just flourishing. These seven traits were most likely to be the definitive, stable drivers of SWB because their effects were totally consistent across the full range of intrapersonal contexts.
Gates Cambridge Scholarship
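The extraction step described above can be illustrated, under heavy simplification, as nearest-neighbor matching in covariate space. This is a generic matcher, not the author's adapted propensity score matching; the distance cap and the toy data are assumed for the example:

```python
import numpy as np

def match_pairs(covariates, target, max_dist=0.5):
    """Greedy nearest-neighbor matching on covariates.

    For each target==1 participant, find the closest target==0
    participant in covariate space; close pairs differ mainly in the
    target trait, so within-pair comparisons are less confounded by
    the matched covariates.
    """
    hi = np.where(target == 1)[0]
    lo = list(np.where(target == 0)[0])
    pairs = []
    for i in hi:
        if not lo:
            break
        dists = [np.linalg.norm(covariates[i] - covariates[j]) for j in lo]
        best = int(np.argmin(dists))
        if dists[best] <= max_dist:
            pairs.append((int(i), int(lo.pop(best))))  # match without replacement
    return pairs

# Participants 0 and 2 share covariates; participant 1 has no close control.
covariates = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]])
target = np.array([1, 1, 0, 0])
pairs = match_pairs(covariates, target)  # → [(0, 2)]
```

Discarding unmatched participants (here, participant 1) is the price paid for comparing only genuinely similar people.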
Probabilistic models for human behavior learning
The problem of human behavior learning is a popular interdisciplinary research topic that has been explored from multiple perspectives, with a principal branch of study in the context of computer vision systems and activity recognition. However, the statistical methods used in these frameworks typically assume short time scales, usually of minutes or even seconds. The emergence of mobile electronic devices, such as smartphones and wearables, has changed this paradigm, since we are now able to massively collect digital records from users. This collection of smartphone-generated data, whose attributes are obtained unobtrusively from the devices via multiple sensors and apps, shapes the behavioral footprint that is unique to each of us. At an individual level, the data projection also differs from person to person, since neither the sensors, nor the installed apps, nor the devices used in real life are all the same. This reflects that learning human behavior from the digital signature of users is an arduous task that requires fusing irregular data: for instance, collections of samples that are corrupted, heterogeneous, contaminated by outliers, or exhibit short-term correlations. The statistical modelling of this sort of data is one of the principal contributions of this thesis, which we study from the perspective of Gaussian processes (GPs).
In the particular case of humans, as with many other species, we are inherently conditioned by the diurnal and nocturnal cycles that shape our behavior every day, and hence our data. Studying these cycles in our behavioral representation shows that there is a perpetual circadian rhythm in every one of us. This tempo is the 24-hour periodic component that shapes the baseline temporal structure of our behavior, not the particular patterns that change from person to person. Looking at the trajectories and variabilities that our behavior may take in the data, we can appreciate that there is not a single repetitive behavior. Instead, there are typically several patterns or routines, sampled from our own dictionary, that we choose for each particular situation. At the same time, these routines are arbitrary combinations of different timescales, correlations, levels of mobility, social interaction, sleep quality, or willingness to work during the same hours on weekdays. Together, the properties of human behavior already indicate how we should proceed to model its structure: not as unique functions, but as a dictionary of latent behavioral profiles. To discover them, we have considered latent variable models.
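The 24-hour periodic component discussed above is commonly encoded in a GP through an exp-sine-squared (periodic) kernel. The sketch below assumes that standard kernel and invented toy data; it is not the thesis's model, and the lengthscale and noise values are arbitrary:

```python
import numpy as np

def periodic_kernel(t1, t2, period=24.0, lengthscale=0.7, variance=1.0):
    """Exp-sine-squared kernel: correlates inputs 24 h apart, encoding
    the circadian baseline shared by all behavioral profiles."""
    d = np.subtract.outer(t1, t2)
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2
                             / lengthscale ** 2)

# GP posterior mean under a 24 h-periodic prior, on toy hourly data.
rng = np.random.default_rng(3)
t_train = np.arange(0.0, 72.0, 1.0)            # three days, hourly samples
y_train = (np.sin(2 * np.pi * t_train / 24.0)
           + 0.1 * rng.normal(size=t_train.size))

noise = 0.1 ** 2
K = periodic_kernel(t_train, t_train) + noise * np.eye(t_train.size)
alpha = np.linalg.solve(K, y_train)

t_test = np.array([6.0, 30.0, 54.0])           # the same hour on days 1-3
mean = periodic_kernel(t_test, t_train) @ alpha
```

Because the kernel depends only on sin²(πd/24), inputs exactly 24 hours apart receive identical posterior means, which is how the prior bakes in the circadian baseline while leaving person-specific structure to other components.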
The main application of the statistical methods developed for human behavior learning emerges when we look at medicine. With a personalized model accurately fitted to the behavioral patterns of a patient of interest, sudden changes in those patterns could be early indicators of future relapses. From a technical point of view, the traditional question is whether new observations conform to the expected behavior indicated by the already fitted model. The problem can be analyzed from two interrelated perspectives: one oriented to the characterization of a single object as an outlier, typically called anomaly detection, and another focused on refreshing the learned model when it no longer fits the new sequential data. This last problem, widely known as change-point detection (CPD), is another pillar of this thesis. These methods are oriented to mental health applications, and particularly to the passive detection of crisis events. The final goal is to provide an early detection methodology based on probabilistic modeling for early intervention, e.g. to prevent suicide attempts in psychiatric outpatients with severe affective disorders of higher
prevalence, such as depression or bipolar disorder.
Doctoral Program in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: President: Pablo Martínez Olmos; Secretary: Daniel Hernández Lobato; Member: Javier González Hernández.
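The change-point detection problem at the core of the thesis above can be illustrated with the classical one-sided CUSUM statistic. This is a textbook baseline, not the thesis's probabilistic CPD method; the drift and threshold values are assumptions chosen for the toy stream:

```python
import numpy as np

def cusum_detect(stream, mean=0.0, drift=0.5, threshold=8.0):
    """One-sided CUSUM: accumulate evidence of an upward level shift.

    The statistic grows when observations exceed `mean` by more than
    `drift` and resets at zero otherwise; an alarm fires once it
    crosses `threshold`, flagging a candidate change point for review.
    """
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mean - drift))
        if s > threshold:
            return t  # index at which the change is declared
    return None

# Behavior is stable for 50 steps, then its level shifts upward.
rng = np.random.default_rng(4)
stream = np.concatenate([rng.normal(0.0, 1.0, 50),
                         rng.normal(2.0, 1.0, 50)])
alarm = cusum_detect(stream)
```

The detection delay after the true change at index 50 is the usual trade-off against false alarms: a higher threshold tolerates more pre-change noise but reacts more slowly.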