Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction
Single-channel speech dereverberation is the challenging problem of deconvolving the reverberation produced by the room impulse response from the speech signal when only one observation of the reverberant signal (one microphone) is available. Although mild reverberation helps in perceiving the speech (or any audio) signal, strong reverberation can both degrade the performance of automatic recognition systems and make speech less intelligible to human listeners. Single-microphone speech dereverberation is more challenging than multi-microphone speech dereverberation, since it does not allow for spatial processing of different observations of the signal.
A review of recent single-channel dereverberation techniques reveals that those based on LP-residual enhancement are the most promising. Spectral subtraction has also been used effectively for dereverberation, particularly when long reflections are involved. Using LP residuals and spectral subtraction as two promising tools, a new dereverberation technique is proposed. The first stage of the proposed technique consists of pre-whitening followed by delayed long-term LP filtering, in which maximization of the kurtosis or skewness of the LP residuals controls the weight updates of the inverse filter. The second stage consists of nonlinear spectral subtraction. The proposed two-stage dereverberation scheme leads to two separate algorithms, depending on whether kurtosis or skewness maximization is used to establish a feedback function for the weight updates of the adaptive inverse filter.
It is shown that the proposed algorithms have several advantages over the existing major single-microphone methods, including a reduction in both early and late reverberation, speech enhancement even for very high reverberation times, robustness to additive background noise, and the introduction of only a few minor artifacts. Room impulse responses equalized by the proposed algorithms have shorter reverberation times, which means that inverse filtering by the proposed algorithms is more successful in dereverberating the speech signal. For short, medium and high reverberation times, the signal-to-reverberation ratio of the proposed technique is significantly higher than that of the existing major algorithms. The waveforms and spectrograms of the inverse-filtered and fully processed signals indicate the superiority of the proposed algorithms. Assessment of the overall quality of the processed speech signals by automatic speech recognition and perceptual evaluation of speech quality tests also confirms that in most cases the proposed technique yields higher scores, and in the cases where it does not, the difference is not as significant as in the other aspects of the performance evaluation. Finally, the robustness of the proposed algorithms against background noise is investigated and compared to that of the benchmark algorithms, which shows that the proposed algorithms maintain a rather stable performance for contaminated speech signals with SNR levels as low as 0 dB.
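As a rough illustration of the two quantities named in this abstract, the sketch below computes the kurtosis of a signal and applies a floored power spectral subtraction. It is not the authors' algorithm: the function names, the over-subtraction factor `alpha`, the spectral `floor`, the exponential "room response" in `smear`, and the toy impulse-train signal are all assumptions made for illustration.

```python
def kurtosis(x):
    """Normalized fourth central moment m4 / m2**2 (equals 3 for a Gaussian)."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / (m2 * m2)


def smear(x, decay=0.9, taps=40):
    """Convolve x with a decaying exponential, a crude stand-in for a room response."""
    out = [0.0] * len(x)
    for n in range(len(x)):
        for k in range(min(taps, n + 1)):
            out[n] += (decay ** k) * x[n - k]
    return out


def spectral_subtract(power, late_power, alpha=1.0, floor=0.01):
    """Subtract an estimate of late-reverberation power from each bin.

    A spectral floor (a fraction of the input power) keeps the result positive.
    """
    return [max(p - alpha * r, floor * p) for p, r in zip(power, late_power)]


# Toy stand-in for a clean LP residual: a sparse impulse train (hypothetical data).
clean = [1.0 if n % 50 == 0 else 0.0 for n in range(1000)]
reverberant = smear(clean)

print("kurtosis, sparse residual :", kurtosis(clean))
print("kurtosis, smeared residual:", kurtosis(reverberant))
print("subtracted power          :", spectral_subtract([4.0, 1.0], [1.0, 2.0]))
```

In this toy case the smeared signal has a much lower kurtosis than the sparse one, which is the intuition behind using residual kurtosis as the objective when adapting an inverse filter.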
OBJECTIVE AND SUBJECTIVE EVALUATION OF DEREVERBERATION ALGORITHMS
Reverberation significantly impacts the quality and intelligibility of speech. Several dereverberation algorithms have been proposed in the literature to combat this problem. A majority of these algorithms utilize a single channel and are developed for monaural applications, and as such do not preserve the cues necessary for sound localization. This thesis describes a blind two-channel dereverberation technique that improves the quality of speech corrupted by reverberation while preserving cues that affect localization. The method is based on combining a short-term (2 ms) and a long-term (20 ms) weighting function of the linear prediction (LP) residual of the input signal. The developed algorithm and other dereverberation algorithms are evaluated objectively and subjectively in terms of sound quality and localization accuracy. The binaural adaptation provides a significant increase in sound quality while removing the loss in localization ability found in the bilateral implementation.
Speech assessment and characterization for law enforcement applications
Speech signals acquired, transmitted or stored in non-ideal conditions are often degraded by
one or more effects including, for example, additive noise. These degradations alter the signal
properties in a manner that deteriorates the intelligibility or quality of the speech signal. In
the law enforcement context such degradations are commonplace due to the limitations in
the audio collection methodology, which is often required to be covert. In severe degradation
conditions, the acquired signal may become unintelligible, losing its value in an investigation;
in less severe conditions, a loss in signal quality may be encountered, which can lead to
higher transcription time and cost.
This thesis proposes a non-intrusive speech assessment framework from which algorithms for
speech quality and intelligibility assessment are derived, to guide the collection and transcription
of law enforcement audio. These methods are trained on a large database labelled using
intrusive techniques (whose performance is verified with subjective scores) and shown to perform
favorably when compared with existing non-intrusive techniques. Additionally, a non-intrusive
CODEC identification and verification algorithm is developed which can identify a CODEC with
an accuracy of 96.8 % and detect the presence of a CODEC with an accuracy higher than 97 %
in the presence of additive noise.
Finally, the speech description taxonomy framework is developed, with the aim of characterizing
various aspects of a degraded speech signal, including the mechanism that results in a signal
with particular characteristics, the vocabulary that can be used to describe those degradations
and the measurable signal properties that can characterize the degradations. The taxonomy is
implemented as a relational database that facilitates the modeling of the relationships between
various attributes of a signal and promises to be a useful tool for training and guiding audio
analysts.
Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions
In recent years, there has been great progress in automatic speech recognition. The challenge now is to recognize not only the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and the personality of the speaker. This research work aims to develop a methodology for automatic emotion recognition from speech signals in non-controlled noise conditions. For that purpose, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for this purpose.
Multi-Speaker Tracking with Audio-Visual Information for Robot Perception (Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots)
Robot perception plays a crucial role in human-robot interaction (HRI). The perception system provides the robot with information about its surroundings and enables the robot to give feedback. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where the people are, who is speaking, and what they are talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot's perception system to achieve this goal. Like seeing and hearing for a human being, audio and visual information are the critical cues for a robot in a conversational scenario. Advances in computer vision and audio processing over the last decade have revolutionized robot perception abilities. This thesis makes the following contributions. We first develop a variational Bayesian framework for tracking multiple objects. The variational Bayesian framework gives closed-form, tractable solutions, which makes the tracking process efficient. The framework is first applied to visual multiple-person tracking. A birth and death process is built jointly with the framework to deal with the varying number of people in the scene. Furthermore, we exploit the complementarity of vision and robot motor information. On the one hand, the robot's active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Audio and visual information are then combined in the variational framework to estimate the smooth trajectories of speaking people and to infer the acoustic status of a person: speaking or silent. In addition, we apply the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are applied first, and their output is passed to the tracking system.
Finally, a variant of the acoustic speaker tracking model based on the von Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on application-specific datasets.
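To make the remark about directional data concrete, here is a minimal sketch of the von Mises probability density, the circular analogue of a Gaussian. It only illustrates the distribution itself, not the tracking model of the thesis; the helper names, the series length `terms`, and the example angles are assumptions.

```python
import math


def bessel_i0(kappa, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total = 0.0
    term = 1.0  # series term for m = 0
    for m in range(terms):
        if m > 0:
            # term_m = term_{m-1} * (kappa / 2)^2 / m^2
            term *= (kappa / 2.0) ** 2 / (m * m)
        total += term
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of an angle theta (radians) with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


# Directions of +179 and -179 degrees are only 2 degrees apart on the circle.
# A density centred at 180 degrees treats them identically, which a Gaussian
# on the raw angle values would not.
mu = math.pi
p_plus = von_mises_pdf(math.radians(179.0), mu, kappa=4.0)
p_minus = von_mises_pdf(math.radians(-179.0), mu, kappa=4.0)
print(p_plus, p_minus)
```

The density depends on the angle only through `cos(theta - mu)`, so it is periodic by construction and needs no special handling at the wrap-around point.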
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment into
real-world applications. The technique is shown to be applicable in areas of speech coding,
identification and artificial bandwidth extension of telephone speech.
An investigation of the utility of monaural sound source separation via nonnegative matrix factorization applied to acoustic echo and reverberation mitigation for hands-free telephony
In this thesis we investigate the applicability and utility of Monaural Sound Source Separation (MSSS) via Nonnegative Matrix Factorization (NMF) for various problems related to audio for hands-free telephony. We first investigate MSSS via NMF as an alternative acoustic echo reduction approach to existing approaches such as Acoustic Echo Cancellation (AEC). To this end, we present the single-channel acoustic echo problem as an MSSS problem, in which the objective is to extract the user's signal from a mixture also containing acoustic echo and noise. To perform separation, NMF is used to decompose the near-end microphone signal onto the union of two nonnegative bases in the magnitude Short-Time Fourier Transform (STFT) domain. One of these bases is for the spectral energy of the acoustic echo signal, and is formed from the incoming far-end user's speech, while the other basis is for the spectral energy of the near-end speaker, and is trained with speech data a priori. In comparison to AEC, the speaker extraction approach obviates Double-Talk Detection (DTD), and is demonstrated to attain its maximal echo mitigation performance immediately upon initiation and to maintain that performance during and after room changes for similar computational requirements. Speaker extraction is also shown to introduce distortion of the near-end speech signal during double-talk, which is quantified by means of a speech distortion measure and compared to that of AEC. Subsequently, we address DTD for block-based AEC algorithms. We propose a novel block-based DTD algorithm that uses the available signals and the estimate of the echo signal that is produced by NMF-based speaker extraction to compute a suitably normalized correlation-based decision variable, which is compared to a fixed threshold to decide on double-talk.
Using a standard evaluation technique, the proposed algorithm is shown to have comparable detection performance to an existing conventional block-based DTD algorithm. It is also demonstrated to inherit the room change insensitivity of speaker extraction, with the proposed DTD algorithm generating minimal false double-talk indications upon initiation and in response to room changes in comparison to the existing conventional DTD. We also show that this property allows its paired AEC to converge at a rate close to the optimum. Another focus of this thesis is the problem of inverting a single measurement of a non-minimum phase Room Impulse Response (RIR). We describe the process by which perceptually detrimental all-pass phase distortion arises in reverberant speech filtered by the inverse of the minimum phase component of the RIR; in short, such distortion arises from inverting the magnitude response of the high-Q maximum phase zeros of the RIR. We then propose two novel partial inversion schemes that precisely mitigate this distortion. One of these schemes employs NMF-based MSSS to separate the all-pass phase distortion from the target speech in the magnitude STFT domain, while the other approach modifies the inverse minimum phase filter such that the magnitude response of the maximum phase zeros of the RIR is not fully compensated. Subjective listening tests reveal that the proposed schemes generally produce better quality output speech than a comparable inversion technique.
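The idea of decomposing a magnitude spectrum onto the union of two fixed nonnegative bases can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a Euclidean cost with the standard multiplicative update for the activations, holds both bases fixed, and uses tiny hypothetical one-atom bases (`W`) and a single synthetic frame (`V`).

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def fit_activations(v, w, iterations=200, eps=1e-12):
    """Estimate nonnegative activations H with the basis W held fixed.

    Uses the multiplicative update H <- H * (W^T V) / (W^T W H).
    """
    wt = transpose(w)
    wtv = matmul(wt, v)
    wtw = matmul(wt, w)
    h = [[1.0] * len(v[0]) for _ in range(len(w[0]))]
    for _ in range(iterations):
        wtwh = matmul(wtw, h)
        h = [[h[k][j] * wtv[k][j] / (wtwh[k][j] + eps)
              for j in range(len(h[0]))] for k in range(len(h))]
    return h


# Hypothetical bases over 3 frequency bins: column 0 = echo atom, column 1 = speech atom.
W = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
# One synthetic mixture frame: 2 x echo atom + 3 x speech atom.
V = [[2.0], [2.5], [3.0]]

H = fit_activations(V, W)
# Near-end speech estimate: the part of the reconstruction explained by the speech atom.
speech_estimate = [[W[i][1] * H[1][0]] for i in range(len(W))]
print("activations:", H)
print("speech estimate:", speech_estimate)
```

In practice each basis would contain many atoms and the estimate would usually be turned into a mask applied to the mixture STFT; those steps are left out here to keep the sketch short.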
Multi-sensor data fusion in mobile devices for the identification of Activities of Daily Living
Following the recent advances in technology and the growing use of mobile devices such as
smartphones, several solutions may be developed to improve the quality of life of users in the
context of Ambient Assisted Living (AAL). Mobile devices have different available sensors, e.g.,
accelerometer, gyroscope, magnetometer, microphone and Global Positioning System (GPS)
receiver, which allow the acquisition of physical and physiological parameters for the
recognition of different Activities of Daily Living (ADL) and the environments in which they are
performed. The definition of ADL includes a well-known set of tasks, which include basic self-care
tasks, based on the types of skills that people usually learn in early childhood, including
feeding, bathing, dressing, grooming, walking, running, jumping, climbing stairs, sleeping,
watching TV, working, listening to music, cooking, eating and others. In the context of AAL,
some individuals (henceforth called user or users) need particular assistance, either because
the user has some sort of impairment, or because the user is old, or simply because users
need/want to monitor their lifestyle. The research and development of systems that provide a
particular assistance to people is increasing in many areas of application. In particular, in the
future, the recognition of ADL will be an important element for the development of a personal
digital life coach, providing assistance to different types of users. To support the recognition
of ADL, the surrounding environments should be also recognized to increase the reliability of
these systems.
The main focus of this Thesis is the research on methods for the fusion and classification of the
data acquired by the sensors available in off-the-shelf mobile devices in order to recognize ADL
in almost real-time, taking into account the large diversity of the capabilities and
characteristics of the mobile devices available in the market. In order to achieve this objective,
this Thesis started with the review of the existing methods and technologies to define the
architecture and modules of the method for the identification of ADL. With this review and
based on the knowledge acquired about the sensors available in off-the-shelf mobile devices,
a set of tasks that may be reliably identified was defined as a basis for the remaining research
and development to be carried out in this Thesis. This review also identified the main stages
for the development of a new method for the identification of the ADL using the sensors
available in off-the-shelf mobile devices; these stages are data acquisition, data processing,
data cleaning, data imputation, feature extraction, data fusion and artificial intelligence. One
of the challenges is related to the different types of data acquired from the different sensors,
but other challenges were found, including the presence of environmental noise, the positioning
of the mobile device during the daily activities, the limited capabilities of the mobile devices
and others. Based on the acquired data, the processing was performed, implementing data
cleaning and feature extraction methods, in order to define a new framework for the
recognition of ADL. The data imputation methods were not applied, because at this stage of
the research their implementation has no influence on the results of the identification
of the ADL and environments, as the features are extracted from a set of data acquired during
a defined time interval and there are no missing values during this stage. The joint selection of
the set of usable sensors and the identifiable set of tasks will then allow the development of a
framework that, considering multi-sensor data fusion technologies and context awareness, in
coordination with other information available from the user context, such as his/her agenda
and the time of the day, will make it possible to establish a profile of the tasks that the user performs on
a regular day. The classification method and the algorithm for the fusion of the features
for the recognition of ADL and its environments need to be deployed on a machine with some
computational power, while the mobile device that will use the created framework can
perform the identification of the ADL using much less computational power. Based on the
results reported in the literature, the method chosen for the recognition of the ADL is composed
of three variants of Artificial Neural Networks (ANN), including simple Multilayer Perceptron
(MLP) networks, Feedforward Neural Networks (FNN) with Backpropagation, and Deep Neural
Networks (DNN).
Data acquisition can be performed with standard methods. After the acquisition, the data must
be processed at the data processing stage, which includes data cleaning and feature extraction
methods. The data cleaning method used for motion and magnetic sensors is the low pass filter,
in order to reduce the noise acquired; but for the acoustic data, the Fast Fourier Transform
(FFT) was applied to extract the different frequencies. When the data is clean, several features
are then extracted based on the types of sensors used, including the mean, standard deviation,
variance, maximum value, minimum value and median of raw data acquired from the motion
and magnetic sensors; the mean, standard deviation, variance and median of the maximum
peaks calculated with the raw data acquired from the motion and magnetic sensors; the five
greatest distances between the maximum peaks calculated with the raw data acquired from
the motion and magnetic sensors; the mean, standard deviation, variance, median and 26 Mel-
Frequency Cepstral Coefficients (MFCC) of the frequencies obtained with FFT based on the raw
data acquired from the microphone data; and the distance travelled calculated with the data
acquired from the GPS receiver. After the extraction of the features, these will be grouped in
different datasets for the application of the ANN methods and to discover the method and
dataset that reports better results. The classification stage was incrementally developed,
starting with the identification of the most common ADL (i.e., walking, running, going upstairs,
going downstairs and standing activities) with motion and magnetic sensors. Next, the
environments were identified with acoustic data, i.e., bedroom, bar, classroom, gym, kitchen,
living room, hall, street and library. After the environments are recognized, and based on the
different sets of sensors commonly available in the mobile devices, the data acquired from the
motion and magnetic sensors were combined with the recognized environment in order to
differentiate some activities without motion, i.e., sleeping and watching TV. The number of
recognized activities in this stage was increased with the use of the distance travelled,
extracted from the GPS receiver data, which also makes it possible to recognize the driving activity.
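The cleaning and feature extraction steps described above for the motion and magnetic sensors can be sketched as follows. This is a hedged illustration only: the abstract does not specify the filter design, so a first-order exponential low-pass filter with an assumed smoothing factor `alpha` is used, population statistics are assumed for variance and standard deviation, and the peak definition in `peak_indices` is a simple local-maximum rule chosen for the example.

```python
import statistics


def low_pass(samples, alpha=0.5):
    """First-order exponential low-pass filter (alpha is an assumed smoothing factor)."""
    if not samples:
        return []
    out = [samples[0]]
    for x in samples[1:]:
        out.append(out[-1] + alpha * (x - out[-1]))
    return out


def summary_features(samples):
    """Mean, standard deviation, variance, maximum, minimum and median of a window."""
    return {
        "mean": statistics.mean(samples),
        "std": statistics.pstdev(samples),
        "variance": statistics.pvariance(samples),
        "max": max(samples),
        "min": min(samples),
        "median": statistics.median(samples),
    }


def peak_indices(samples):
    """Indices of local maxima (strictly above the previous sample, not below the next)."""
    return [i for i in range(1, len(samples) - 1)
            if samples[i - 1] < samples[i] >= samples[i + 1]]


def largest_peak_distances(samples, count=5):
    """The `count` greatest distances (in samples) between consecutive peaks."""
    peaks = peak_indices(samples)
    gaps = [b - a for a, b in zip(peaks, peaks[1:])]
    return sorted(gaps, reverse=True)[:count]


# Hypothetical accelerometer window.
window = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 3.0, 0.0]
smoothed = low_pass(window)
print(summary_features(smoothed))
print(largest_peak_distances(window))
```

The resulting feature dictionary and peak distances would then be concatenated with the features of the other sensors before being passed to a classifier.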
After the implementation of the three classification methods with different numbers of
iterations, datasets and remaining configurations in a machine with high processing
capabilities, the reported results proved that the best method for the recognition of the most
common ADL and activities without motion is the DNN method, but the best method for the
recognition of environments is the FNN method with Backpropagation. Depending on the
number of sensors used, this implementation reports a mean accuracy between 85.89% and
89.51% for the recognition of the most common ADL, equal to 86.50% for the recognition of
environments, and equal to 100% for the recognition of activities without motion, reporting
an overall accuracy between 85.89% and 92.00%.
The last stage of this research work was the implementation of the structured framework for
the mobile devices, verifying that the FNN method requires a high processing power for the
recognition of environments and the results reported with the mobile application are lower
than the results reported with the machine with high processing capabilities used. Thus, the
DNN method was also implemented for the recognition of the environments with the mobile
devices. Finally, the results reported with the mobile devices show an accuracy between 86.39%
and 89.15% for the recognition of the most common ADL, equal to 45.68% for the recognition
of environments, and equal to 100% for the recognition of activities without motion, reporting
an overall accuracy between 58.02% and 89.15%.
Compared with the literature, the results returned by the implemented framework show only
a marginal improvement. However, the results reported in this research work cover the
identification of more ADL than the ones described in other studies. The improvement in the
recognition of ADL based on the mean of the accuracies is equal to 2.93%, but the maximum
number of ADL and environments previously recognized was 13, while the number of ADL and
environments recognized with the framework resulting from this research is 16. In conclusion,
the framework developed has a mean improvement of 2.93% in the accuracy of the recognition
for a larger number of ADL and environments than previously reported.
In the future, the achievements reported by this PhD research may be considered as a starting
point for the development of a personal digital life coach, but the number of ADL and
environments recognized by the framework should be increased and the experiments should be
performed with different types of devices (i.e., smartphones and smartwatches), and the data
imputation and other machine learning methods should be explored in order to attempt to
increase the reliability of the framework for the recognition of ADL and its environments.Após os recentes avanços tecnológicos e o crescente uso dos dispositivos móveis, como por
exemplo os smartphones, várias soluções podem ser desenvolvidas para melhorar a qualidade
de vida dos utilizadores no contexto de Ambientes de Vida Assistida (AVA) ou Ambient Assisted
Living (AAL). Os dispositivos móveis integram vários sensores, tais como acelerómetro,
giroscópio, magnetómetro, microfone e recetor de Sistema de Posicionamento Global (GPS),
que permitem a aquisição de vários parâmetros fÃsicos e fisiológicos para o reconhecimento de
diferentes Atividades da Vida Diária (AVD) e os seus ambientes. A definição de AVD inclui um
conjunto bem conhecido de tarefas que são tarefas básicas de autocuidado, baseadas nos tipos
de habilidades que as pessoas geralmente aprendem na infância. Essas tarefas incluem
alimentar-se, tomar banho, vestir-se, fazer os cuidados pessoais, caminhar, correr, pular, subir
escadas, dormir, ver televisão, trabalhar, ouvir música, cozinhar, comer, entre outras. No
contexto de AVA, alguns indivÃduos (comumente chamados de utilizadores) precisam de
assistência particular, seja porque o utilizador tem algum tipo de deficiência, seja porque é
idoso, ou simplesmente porque o utilizador precisa/quer monitorizar e treinar o seu estilo de
vida. A investigação e desenvolvimento de sistemas que fornecem algum tipo de assistência
particular está em crescente em muitas áreas de aplicação. Em particular, no futuro, o
reconhecimento das AVD é uma parte importante para o desenvolvimento de um assistente
pessoal digital, fornecendo uma assistência pessoal de baixo custo aos diferentes tipos de
pessoas. pessoas. Para ajudar no reconhecimento das AVD, os ambientes em que estas se
desenrolam devem ser reconhecidos para aumentar a fiabilidade destes sistemas.
The main focus of this Thesis is the development of methods for the fusion and classification
of the data acquired from the sensors available in mobile devices, for near real-time
recognition of ADL, taking into account the wide diversity of characteristics of the mobile
devices available on the market. To achieve this objective, this Thesis started with a review
of existing methods and technologies in order to define the architecture and modules of the
new method for the identification of ADL. With this literature review, and based on the
knowledge acquired about the sensors available in off-the-shelf mobile devices, a set of tasks
that can be identified was defined for the research and development in this Thesis. The review
also identifies the main concepts for the development of the new sensor-based method for the
identification of ADL, namely: data acquisition, data processing, data correction, data
imputation, feature extraction, data fusion and the extraction of results using artificial
intelligence methods. One of the challenges relates to the different types of data acquired by
the different sensors, but other challenges were found, the most relevant being environmental
noise, the positioning of the device during daily activities, and the limited capabilities of
mobile devices. People's differing characteristics can also influence the creation of the
methods, so people with different lifestyles and physical characteristics were chosen for the
acquisition and identification of the sensor data. Based on the acquired data, data processing
was carried out, implementing data correction and feature extraction methods, to start the
creation of the new method for ADL recognition. Data imputation methods were excluded from the
implementation, since they would not influence the results of the identification of ADL and
environments, given that the features used are extracted from a set of data acquired during a
defined time interval.
The selection of usable sensors, as well as of identifiable ADL, will allow the development of
a method that, by fusing the data acquired from multiple sensors with other information about
the user's context, such as the user's agenda, makes it possible to establish a profile of the
tasks that the user performs daily. Based on the results reported in the literature, the
methods chosen for ADL recognition are different variants of Artificial Neural Networks (ANN),
including Multilayer Perceptron (MLP), Feedforward Neural Networks (FNN) with Backpropagation
and Deep Neural Networks (DNN). Finally, after the creation of the methods for each stage of
the method for the recognition of ADL and environments, the sequential implementation of the
different methods was carried out on a mobile device for further testing.
After the definition of the structure of the method for the recognition of ADL and
environments using mobile devices, it was verified that data acquisition can be performed with
standard methods. After acquisition, the data must be processed in the data processing module,
which includes the data correction and feature extraction methods. The data correction method
used for motion and magnetic sensors is a low-pass filter, in order to reduce noise, whereas
for acoustic data the Fast Fourier Transform (FFT) was applied to extract the different
frequencies. After data correction, different features were extracted according to the types
of sensors used: the mean, standard deviation, variance, maximum value, minimum value and
median of the data acquired by the magnetic and motion sensors; the mean, standard deviation,
variance and median of the maximum peaks calculated from the data acquired by the magnetic and
motion sensors; the five greatest distances between the maximum peaks calculated from the data
acquired by the motion and magnetic sensors; the mean, standard deviation, variance and 26
Mel-Frequency Cepstral Coefficients (MFCC) of the frequencies obtained with the FFT from the
microphone data; and the distance calculated from the data acquired by the GPS receiver.
After feature extraction, the features are grouped into different datasets for the application
of the ANN methods, in order to find the method and feature set that report the best results.
The data classification module was developed incrementally, starting with the identification
of common ADL with magnetic and motion sensors, i.e., walking, running, going upstairs, going
downstairs and standing. Next, the environments are identified with acoustic sensor data,
i.e., bedroom, bar, classroom, gym, kitchen, living room, hall, street and library. Based on
the recognized environments and the remaining sensors available in mobile devices, the data
acquired from the magnetic and motion sensors were combined with the recognized environment to
differentiate some activities without motion (i.e., sleeping and watching television); the
number of activities recognized at this stage increases with the fusion of the distance
travelled, extracted from the GPS receiver data, which also allows the activity of driving to
be recognized.
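The motion and magnetic sensor features listed in the abstract (low-pass filtering, summary statistics, statistics of the maximum peaks, and the five greatest distances between peaks) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the exponential-smoothing filter, the `alpha` value, the strict local-maximum definition of a peak, the measurement of peak distances in samples, and all function names are assumptions made for illustration only.

```python
import statistics


def low_pass(samples, alpha=0.2):
    """Exponential smoothing as a simple low-pass filter.

    Assumption: the abstract does not say which low-pass filter was used;
    alpha is an arbitrary illustrative value.
    """
    if not samples:
        return []
    filtered = [samples[0]]
    for x in samples[1:]:
        filtered.append(alpha * x + (1 - alpha) * filtered[-1])
    return filtered


def find_peaks(samples):
    """Indices of strict local maxima (assumed definition of a 'maximum peak')."""
    return [
        i
        for i in range(1, len(samples) - 1)
        if samples[i - 1] < samples[i] > samples[i + 1]
    ]


def extract_features(samples, n_distances=5):
    """Summary statistics, peak statistics and the largest peak-to-peak gaps."""
    peaks = find_peaks(samples)
    peak_values = [samples[i] for i in peaks]
    # Distances between consecutive peaks, measured in samples (assumption).
    gaps = sorted((b - a for a, b in zip(peaks, peaks[1:])), reverse=True)

    features = {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "variance": statistics.pvariance(samples),
        "max": max(samples),
        "min": min(samples),
        "median": statistics.median(samples),
        "peak_distances": gaps[:n_distances],
    }
    if peak_values:
        features.update(
            {
                "peak_mean": statistics.fmean(peak_values),
                "peak_std": statistics.pstdev(peak_values),
                "peak_variance": statistics.pvariance(peak_values),
                "peak_median": statistics.median(peak_values),
            }
        )
    return features


if __name__ == "__main__":
    # Hypothetical accelerometer magnitudes for one acquisition window.
    window = [9.8, 10.4, 9.6, 11.2, 9.7, 9.9, 12.0, 9.5, 10.1, 9.8]
    print(extract_features(low_pass(window)))
```

The resulting dictionary would be one row of a dataset passed to a classifier; the acoustic (FFT and MFCC) and GPS distance features mentioned in the abstract are not covered by this sketch.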
After the implementation of the three classification methods with different numbers of
iterations, datasets and configurations on a machine with high processing capacity, the
reported results showed that the best method for the recognition of common ADL and activities
without motion is the DNN method, but the best method for the recognition of environments is
the FNN with Backpropagation method. Depending on the number of sensors used, this
implementation reports an average accuracy between 85.89% and 89.51% for the recognition of
common ADL, equal to 86.50% for the recognition of environments, and equal to 100% for the
recognition of activities without motion, reporting an overall accuracy between 85.89% and
92.00%.
The last stage of this Thesis was the implementation of the method on mobile devices, which
showed that the FNN method requires high processing power for the recognition of environments
and that the results reported with these devices are lower than those reported with the
high-processing-capacity machine used during the development of the method. Therefore, the DNN
method was also implemented for the recognition of environments on mobile devices. Finally,
the results obtained with mobile devices report an accuracy between 86.39% and 89.15% for the
recognition of common ADL, equal to 45.68% for the recognition of environments, and equal to
100% for the recognition of activities without motion, reporting an overall accuracy between
58.02% and 89.15%.
Compared with the results reported in the literature, the results of the developed method show
a marginal improvement, but this Thesis identifies more ADL than the other studies available
in the literature. The improvement in ADL recognition, based on the average of the accuracies,
is equal to 2.93%, but the maximum number of ADL and environments recognized by the studies
available in the literature is 13, whereas the number of ADL and environments recognized with
the implemented method is 16. Thus, the developed method achieves a 2.93% improvement in
recognition accuracy over a larger number of ADL and environments.
As future work, the results reported in this Thesis may be considered a starting point for the
development of a personal digital assistant, but the number of ADL and environments recognized
by the method should be increased, the experiments should be repeated with different types of
mobile devices (i.e., smartphones and smartwatches), and data imputation and other data
classification methods should be explored in order to attempt to increase the reliability of
the method for the recognition of ADL and environments.
- …