Search CORE

6 research outputs found

Innovative use of IoT and Machine Learning technologies for the monitoring and management of smart spaces

Author: Tsalera Eleni
Τσαλέρα Ελένη
Publication venue: Πανεπιστήμιο Δυτικής Αττικής
Publication date: 01/01/2024
Field of study

Recent developments in the field of signal processing, sensor technologies, communications as well as High Performance Computing enable the realization of “smart” spaces, in the sense of physical spaces equipped with technology suitable for the collection, transfer and processing of data with the aim of increasing operational efficiency and improving the quality of the processes performed in them. Applying sophisticated Machine Learning (ML) approaches is benefiting a growing number of applications and a part of the Internet of Things (IoT) vision is being realized. Now, systems and states can be monitored and controlled as required. The central idea lies in the recording of data beyond those characterizing the physical conditions of the space. Individuals produce signals, that related to their activities and are indicative of the process being carried out, and at the same time, they are indicators of its orderly or not progress. This Doctoral Thesis aims to study the two preeminent signals that are indicative of the process and the prevailing conditions: Sound and Image. The study case is the classrooms of theoretical teaching or laboratory experiments. The aim is to clarify the activities and the interest of the parties involved, resulting in the improvement and upgrading of the quality of the work carried out in a smart room. These signals are rich in information, characterized by various limitations, while processing and understanding them is a challenge. At the same time, a solid understanding of the conditions and consequences of human activity can support the evaluation and therefore the strengthening of processes. In the first part of the Thesis, the classification of sound signals that are characteristic in the evolution of the educational process. Initially, an extensive set of sound features (143 in total) is reported, algorithms for extracting these values are implemented and the sounds are classified with ML algorithms. Given the large number of audio features, the corresponding computational burden using all of them, but also the possible degradation of the classification accuracy due to overfitting, a method of prioritizing the features based on their descriptive ability is proposed. The method is based on Principal Component Analysis (PCA) and is compared to the well-known Relief-F method. Experimental tests are performed with five ML algorithms using an increasing number of features. The experimental results demonstrated the utility of the dimensionality reduction method by achieving a classification accuracy of more than 90% using 25 audio features. Then, Deep Learning (DL) mechanisms are employed, specifically Convolutional Neural Networks (CNNs). In order to improve the classification accuracy, well-established, well-known architectures and networks pre-trained on the large ImageNet image dataset are used. The use of these CNNs is done after the audio signal has been converted into a suitable virtual representation through transfer learning. At this point, the set of hyper-parameter values of retraining networks on new datasets is extensively explored. The result of this investigation is the tuning of the hyper-parameter values that lead to the maximization of the classification accuracy while minimizing the corresponding computational time, achieving a classification accuracy exceeding 90% in three different databases. The research field of sound classification has developed rapidly in recent decades resulting in the creation of sound datasets, which include different, arbitrarily selected sound classes for different case studies. In this regard, two types of systematic associations between sound classes were explored: a) semantic and b) comparative based on sound features. In terms of the first association, audio classes are semantically related taking into account the unifying AudioSet ontology, with audio classes being associated based on the origin (source) of the audio. Regarding the second correlation, it is based on the calculation of the distance of the values of the audio characteristics. At the same time, sounds originating from realistic environments include classes which are combined in a sequential and/or overlapping manner. In these cases it is necessary to separate (segment) the audio stream in order to achieve the classification of each audio segment. A set of parameters such as the minimum sound duration and sound intervals were defined to achieve the segmentation process of the audio streams. The second part of the Thesis concerns the study of Image. The development of increasingly advanced algorithms and models has made object detection and recognition trivial. Therefore, this Thesis focused on a more refined analysis of the image with the aim of facial expression recognition (Facial Emotion Recognition). On the occasion of this study, two image feature extraction mechanisms were investigated: manually, with handcrafted methods, and automatically through DL methods, based on CNNs. Both mechanisms were thoroughly studied in terms of their internal parameters and evaluated on FER databases in terms of their classification accuracy performance and the corresponding computational time. Handcrafted methods were examined concerning their internal parameter values, while the study of neural networks was two-fold. First, the features were extracted without retraining the networks on the new data from different depths of their layers, and then the feature extraction was done after retraining the networks on the new databases through transfer learning. The research showed that without retraining the networks, extracting the features from the deepest layer of CNNs leads to inferior classification accuracy results (74% on average) compared to handcrafted methods (86% on average). Extracting image features form 50% or 75% of the depth of the CNNs results in higher classification accuracy (90% on average) for each image quality case. Retraining CNNs and using the transfer learning method improves the classification accuracy when large databases are available. In addition, each method was evaluated for its robustness to two commonly encountered types of noise, Gaussian and Salt & Pepper. CNNs appear to be more robust as the classification accuracy decreases by 10% versus a 60% decrease using handcrafted methods. The result was the creation of a method selection framework according to the quality of available images, application requirements and specifications. By choosing the appropriate method, a classification accuracy exceeding 92% is achieved for each FER database used. The heterogeneity of available algorithms for signal classification is reflected in the computational resources required to train the models and extract the results. From this point of view, the computational resource requirements can be one of the criteria for choosing the classification method. So a set of training time values for different neural network architectures, with different training configurations and for different datasets was generated. Five neural network-based regression models were trained to estimate the training time. Models were evaluated based on correlation coefficient and root mean square error. The result was that the two-layer neural network yielded the highest correlation coefficient with the smallest error, indicating that this model can provide a good approximation of the computational time required for a case of adjacent data. The available algorithms differ from each other in terms of their architecture and in particular, the number of their layers, the complexity of the connection between them, the number and size of filters. The heterogeneity of the available algorithms is reflected in the computational resources required to train the models and derive the inferences. From this point of view, the computational resource requirements can be the one of the criteria for choosing the classification method. For this purpose, a set of training time values for different CNN architectures, with different training configurations, and for different datasets was constructed. Five neural network-based regression models were trained to estimate the training time. Models were evaluated based on correlation coefficient and root mean square error. The value of this PhD Thesis is confirmed by further application of the methods and algorithms in open-space research cases. In particular, sound classification methods with the prioritization of sound features were applied to research focused on environmental noise in an urban landscape, achieving a classification accuracy of eight types of urban noise that reaches 85%. By achieving high noise classification performance it is possible to understand its origin and take appropriate measures to reduce noise pollution. Image classification, combined with semantic segmentation of their content, was applied to research focused on forest fire detection. In this way, a risk analysis framework is created based on the infrastructure that exists in the area, but also the method of intervention depending on the access to it. The detection of objects of interest combined with the classification results that exceed 95% highlight the value of the proposed research in the early response to fires, but also the extensibility of the field of applications of the developed algorithms.Οι πρόσφατες εξελίξεις στον τομέα της επεξεργασίας σήματος, των τεχνολογιών των αισθητήρων, των επικοινωνιών καθώς και, της υπολογιστικής υψηλών επιδόσεων (High Performance Computing) καθιστούν δυνατή την υλοποίηση «έξυπνων» χώρων, με την έννοια των φυσικών χώρων οι οποίοι είναι εξοπλισμένοι με τεχνολογία συλλογής, μεταφοράς και επεξεργασίας δεδομένων με στόχο την αύξηση της λειτουργικής αποτελεσματικότητας και την βελτίωση της ποιότητας των διεργασιών που εκτελούνται σε αυτούς. Με την ανάπτυξη εξελιγμένων προσεγγίσεων Μηχανικής Μάθησης (ΜΜ) επωφελείται ένας αυξανόμενος αριθμός εφαρμογών και πραγματοποιείται ένα μέρος του οράματος Internet of Things (IoT). Πλέον, τα συστήματα και οι καταστάσεις μπορούν να παρακολουθούνται και να ελέγχονται αναλόγως των απαιτήσεων. Η κεντρική ιδέα έγκειται στην καταγραφή δεδομένων πέρα από αυτά που χαρακτηρίζουν τις φυσικές συνθήκες του χώρου. Τα άτομα παράγουν σήματα, τα οποία σχετίζονται με τις δραστηριότητες τους, και ταυτόχρονα αποτελούν δείκτες της ποιότητας της διαδικασίας που εκτελείται. Αυτή η Διδακτορική Διατριβή στοχεύει στην μελέτη των δύο κατεξοχήν σημάτων τα οποία είναι ενδεικτικά της διεργασίας και των συνθηκών που επικρατούν: του Ήχου και της Εικόνας. Η περίπτωση μελέτης είναι οι αίθουσες θεωρητικής διδασκαλίας ή εργαστηριακού πειράματος. Στόχος είναι η αποσαφήνιση των δραστηριοτήτων και του ενδιαφέροντος των εμπλεκόμενων μερών με αποτέλεσμα την βελτίωση και αναβάθμιση της ποιότητας του έργου που συντελείται σε μία έξυπνη αίθουσα. Τα σήματα αυτά είναι πλούσια σε πληροφορίες, χαρακτηρίζονται από διάφορους περιορισμούς, ενώ η επεξεργασία και η κατανόησή τους αποτελεί πρόκληση. Ταυτόχρονα, η συμπαγής κατανόηση των δραστηριοτήτων και της αλληλεπίδρασης των ατόμων μπορεί να υποστηρίξει την αξιολόγηση και συνεπώς την ενίσχυση των επιμέρους διεργασιών. Στο πρώτο μέρος της διατριβής πραγματοποιείται η ταξινόμηση ηχητικών σημάτων οι οποίοι είναι χαρακτηριστικοί στην εξέλιξη της εκπαιδευτικής διαδικασίας. Αρχικά, αναφέρεται ένα εκτενές σύνολο ηχητικών χαρακτηριστικών (συνολικά 143), υλοποιούνται αλγόριθμοι εξαγωγής των τιμών αυτών, και πραγματοποιείται η ταξινόμηση των ήχων με αλγόριθμους ΜΜ. Δεδομένου του μεγάλου πλήθους των χαρακτηριστικών, της αντίστοιχης υπολογιστικής επιβάρυνσης χρησιμοποιώντας το σύνολο αυτών, αλλά και της πιθανής υποβάθμισης της ακρίβειας ταξινόμησης λόγω υπερπροσαρμογής, προτείνεται μέθοδος ιεράρχησης των χαρακτηριστικών με βάση την περιγραφική τους ικανότητα. Η μέθοδος βασίζεται στην Ανάλυση Κύριων Συνιστωσών (Principal Component Analysis – PCA) και συγκρίνεται με την γνωστή μέθοδο Relief-F. Πραγματοποιούνται πειραματικοί έλεγχοι με πέντε αλγορίθμους ΜΜ χρησιμοποιώντας αυξανόμενο αριθμό χαρακτηριστικών. Τα πειραματικά αποτελέσματα ανέδειξαν την χρησιμότητα της μεθόδου μείωσης της διαστασιολόγησης, επιτυγχάνοντας ακρίβεια ταξινόμησης μεγαλύτερη από 90% χρησιμοποιώντας 25 ηχητικά χαρακτηριστικά.Στην συνέχεια εφαρμόζονται μηχανισμοί Βαθιάς Μάθησης (ΒΜ) και συγκεκριμένα τα Συνελικτικά Νευρωνικά Δίκτυα (ΣΝΔ). Προκειμένου να αναβαθμιστεί η ακρίβεια ταξινόμησης, χρησιμοποιούνται καθιερωμένες, ευρέως γνωστές αρχιτεκτονικές. Η χρήση αυτών των ΣΝΔ γίνεται, αφού το ηχητικό σήμα μετατραπεί σε κατάλληλη εικονική αναπαράσταση, μέσω μεταφοράς μάθησης. Σε αυτό το σημείο διερευνάται εκτενώς το σύνολο των τιμών των υπερπαραμέτρων επανεκπαίδευσης των δικτύων σε νέα σύνολα δεδομένων. Το αποτέλεσμα αυτής της διερεύνησης είναι η ρύθμιση των τιμών των υπερπαραμέτρων που οδηγούν σε μεγιστοποίηση της ακρίβειας ταξινόμησης με ταυτόχρονη ελαχιστοποίηση του αντίστοιχου υπολογιστικού χρόνου, επιτυγχάνοντας ακρίβεια ταξινόμησης που ξεπερνά το 90% σε τρεις διαφορετικές βάσεις δεδομένων.Το δεύτερο μέρος της διατριβής αφορά την μελέτη του σήματος της Εικόνας. Η ανάπτυξη ολοένα και πιο προηγμένων αλγορίθμων και μοντέλων έχει καταστήσει την ανίχνευση και την αναγνώριση αντικειμένων τετριμμένη. Επομένως, η διατριβή αυτή επικεντρώθηκε σε μία πιο εκλεπτυσμένη ανάλυση της εικόνας με στόχο την αναγνώριση της έκφρασης του προσώπου (Facial Emotion Recognition-FER). Με αφορμή αυτή την μελέτη διερευνήθηκαν δύο μηχανισμοί εξαγωγής χαρακτηριστικών της εικόνας: με χειροκίνητo τρόπο, και αυτόματα μέσω Βαθιάς Μάθησης, με βάση τα ΣΝΔ. Και οι δύο μηχανισμοί μελετήθηκαν διεξοδικά και αξιολογήθηκαν σε τρεις βάσεις δεδομένων FER ως προς την απόδοση της ακρίβειας ταξινόμησης και τον αντίστοιχο υπολογιστικό χρόνο. Οι χειροκίνητες μέθοδοι εξετάστηκαν ως προς τις τιμές των εσωτερικών τους παραμέτρων, ενώ η μελέτη των νευρωνικών δικτύων ήταν διττή. Αρχικά, τα χαρακτηριστικά εξήχθησαν χωρίς να επανεκπαιδευτούν τα δίκτυα στα νέα δεδομένα από επίπεδα διαφορετικών βαθών, και στην συνέχεια η εξαγωγή των χαρακτηριστικών έγινε μετά την επανεκπαίδευση των δικτύων στις νέες βάσεις δεδομένων μέσω μεταφοράς μάθησης. Από την έρευνα προέκυψε ότι χωρίς την επανεκπαίδευση των δικτύων η εξαγωγή των χαρακτηριστικών της εικόνας από το βαθύτερο επίπεδο των ΣΝΔ οδηγεί σε υποδεέστερα αποτελέσματα ακρίβειας ταξινόμησης (κατά μέσο όρο 74%) σε σχέση με τα αντίστοιχα των χειροκίνητων μεθόδων (κατά μέσο όρο 86%). Η εξαγωγή χαρακτηριστικών από το 50% ή το 75% του βάθους των ΣΝΔ οδηγεί σε υψηλότερη απόδοση ταξινόμησης (κατά μέσο όρο 90%) για κάθε περίπτωση ποιότητας εικόνων. Η επανεκπαίδευση των ΣΝΔ και η χρήση της μεθόδου της μεταφοράς μάθησης βελτιώνει την ακρίβεια ταξινόμησης στην περίπτωση που είναι διαθέσιμες μεγάλες βάσεις δεδομένων. Επιπρόσθετα, κάθε μέθοδος αξιολογήθηκε ως προς την ανθεκτικότητα της σε δύο συχνά απαντώμενους τύπους θορύβου, τον Gaussian και τον Salt & Pepper, με τα ΣΝΔ να εμφανίζονται πιο ανθεκτικά (η ακρίβειας ταξινόμησης μειώνεται κατά περίπου 10% έναντι της μείωσης της ακρίβειας ταξινόμησης κατά 60% με τις χειροκίνητες μεθόδους). Το αποτέλεσμα ήταν η δημιουργία ενός πλαισίου επιλογής μεθόδου ανάλογα με την ποιότητα των διαθέσιμων εικόνων, τις απαιτήσεις και τις προδιαγραφές της εφαρμογής. Επιλέγοντας την κατάλληλη μέθοδο επιτεύχθηκε ακρίβεια ταξινόμησης που ξεπερνά το 92% για κάθε βάση δεδομένων FER που χρησιμοποιήθηκε.Η αξία αυτής της διδακτορικής διατριβής επιβεβαιώνεται από την περαιτέρω εφαρμογή των μεθόδων και αλγορίθμων σε ερευνητικές περιπτώσεις ανοιχτών χώρων. Συγκεκριμένα, οι μέθοδοι ταξινόμησης του ήχου με την ιεράρχηση των ηχητικών χαρακτηριστικών εφαρμόστηκε σε έρευνα εστιασμένη στον περιβαλλοντικό θόρυβο σε αστικό τοπίο, επιτυγχάνοντας ακρίβειας ταξινόμησης οκτώ τύπων αστικού θορύβου που φτάνει το 85%. Επιτυγχάνοντας υψηλή απόδοση ταξινόμησης του θορύβου είναι δυνατή η κατανόηση της προέλευσης αυτού και της λήψης αντίστοιχων μέτρων προκειμένου να μειωθεί η ηχορρύπανση. Η ταξινόμηση των εικόνων, σε συνδυασμό με την σημασιολογική κατάτμηση του περιεχομένου τους, εφαρμόστηκε σε έρευνα εστιασμένη στην ανίχνευση πυρκαγιών σε δασικές περιοχές. Με αυτόν τον τρόπο δημιουργείται πλαίσιο ανάλυσης του κινδύνου με βάση τις υποδομές που υπάρχουν στον χώρο αλλά και του τρόπου επέμβασης αναλόγως της πρόσβασης σε αυτόν. Η ανίχνευση αντικειμένων ενδιαφέροντος σε συνδυασμό με τα αποτελέσματα ταξινόμησης που ξεπερνούν το 95% αναδεικνύουν την αξία της προτεινόμενης έρευνας στην έγκαιρη αντιμετώπιση των πυρκαγιών, αλλά και της επεκτασιμότητας του πεδίου εφαρμογών των ανεπτυγμένων αλγορίθμων

Hellenic National Archive of Doctoral Dissertations

Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning

Author: Andreas Papadakis
Eleni Tsalera
Maria Samarakou
Publication venue: 'MDPI AG'
Publication date: 10/12/2021
Field of study

The paper investigates retraining options and the performance of pre-trained Convolutional Neural Networks (CNNs) for sound classification. CNNs were initially designed for image classification and recognition, and, at a second phase, they extended towards sound classification. Transfer learning is a promising paradigm, retraining already trained networks upon different datasets. We selected three ‘Image’- and two ‘Sound’-trained CNNs, namely, GoogLeNet, SqueezeNet, ShuffleNet, VGGish, and YAMNet, and applied transfer learning. We explored the influence of key retraining parameters, including the optimizer, the mini-batch size, the learning rate, and the number of epochs, on the classification accuracy and the processing time needed in terms of sound preprocessing for the preparation of the scalograms and spectrograms as well as CNN training. The UrbanSound8K, ESC-10, and Air Compressor open sound datasets were employed. Using a two-fold criterion based on classification accuracy and time needed, we selected the ‘champion’ transfer-learning parameter combinations, discussed the consistency of the classification results, and explored possible benefits from fusing the classification estimations. The Sound CNNs achieved better classification accuracy, reaching an average of 96.4% for UrbanSound8K, 91.25% for ESC-10, and 100% for the Air Compressor dataset

Multidisciplinary Digital Publishing Institute

Survey on Sound and Video Analysis Methods for Monitoring Face-to-Face Module Delivery

Author: Papadakis Andreas
Samarakou Maria
Tsalera Eleni
Publication venue: 'International Association of Online Engineering (IAOE)'
Publication date: 30/04/2019
Field of study

The objective of this work is to identify unobtrusive methodologies that allow the monitoring and understanding of the educational environment, during face to face activities, through capturing and processing of sound and video signals. It is a survey on application and techniques that exploit these two signals (sound and video) retrieved in classrooms, offices and other spaces. We categorize such applications based upon the high level characteristics extracted from the analysis of the low level features of the sound and video signals. Through the overview of these technologies, we attempt to achieve a degree of understanding the human behavior in a smart classroom, on behalf of the students and the teacher. Additionally, we illustrate open-research points for further investigation

Online-Journals.org (International Association of Online Engineering)

Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning

Author: Andreas Papadakis
Eleni Tsalera
Maria Samarakou
Publication venue: MDPI AG
Publication date: 01/12/2021
Field of study

Directory of Open Access Journals

Survey on Sound and Video Analysis Methods for Monitoring Face-to-Face Module Delivery

Author: Andreas Papadakis
Eleni Tsalera
Maria Samarakou
Publication venue: 'International Association of Online Engineering (IAOE)'
Publication date
Field of study

Crossref

Feature Extraction with Handcrafted Methods and Convolutional Neural Networks for Facial Emotion Recognition

Author: Andreas Papadakis
Eleni Tsalera
Ioannis Voyiatzis
Maria Samarakou
Publication venue: 'MDPI AG'
Publication date: 24/08/2022
Field of study

This research compares the facial expression recognition accuracy achieved using image features extracted (a) manually through handcrafted methods and (b) automatically through convolutional neural networks (CNNs) from different depths, with and without retraining. The Karolinska Directed Emotional Faces, Japanese Female Facial Expression, and Radboud Faces Database databases have been used, which differ in image number and characteristics. Local binary patterns and histogram of oriented gradients have been selected as handcrafted methods and the features extracted are examined in terms of image and cell size. Five CNNs have been used, including three from the residual architecture of increasing depth, Inception_v3, and EfficientNet-B0. The CNN-based features are extracted from the pre-trained networks from the 25%, 50%, 75%, and 100% of their depths and, after their retraining on the new databases. Each method is also evaluated in terms of calculation time. CNN-based feature extraction has proved to be more efficient since the classification results are superior and the computational time is shorter. The best performance is achieved when the features are extracted from shallower layers of pre-trained CNNs (50% or 75% of their depth), achieving high accuracy results with shorter computational time. CNN retraining is, in principle, beneficial in terms of classification accuracy, mainly for the larger databases by an average of 8%, also increasing the computational time by an average of 70%. Its contribution in terms of classification accuracy is minimal when applied in smaller databases. Finally, the effect of two types of noise on the models is examined, with ResNet50 appearing to be the most robust to noise

Multidisciplinary Digital Publishing Institute