19 research outputs found

    Towards a Multimodal Silent Speech Interface for European Portuguese

    Get PDF
    Automatic Speech Recognition (ASR) in the presence of environmental noise is still a hard problem to tackle in speech science (Ng et al., 2000). Another problem well described in the literature is the one concerned with elderly speech production. Studies (Helfrich, 1979) have shown evidence of a slower speech rate, more breaks, more speech errors and a humbled volume of speech, when comparing elderly with teenagers or adults speech, on an acoustic level. This fact makes elderly speech hard to recognize, using currently available stochastic based ASR technology. To tackle these two problems in the context of ASR for HumanComputer Interaction, a novel Silent Speech Interface (SSI) in European Portuguese (EP) is envisioned.info:eu-repo/semantics/acceptedVersio

    Viseme-based Lip-Reading using Deep Learning

    Get PDF
    Research in Automated Lip Reading is an incredibly rich discipline with so many facets that have been the subject of investigation including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced and up-to-date lip-reading systems can predict entire sentences with thousands of different words and the majority of them use ASCII characters as the classification schema. The classification performance of such systems however has been insufficient and the need to cover an ever expanding range of vocabulary using as few classes as possible is challenge. The work in this thesis contributes to the area concerning classification schemas by proposing an automated lip reading model that predicts sentences using visemes as a classification schema. This is an alternative schema to using ASCII characters, which is the conventional class system used to predict sentences. This thesis provides a review of the current trends in deep learning- based automated lip reading and analyses a gap in the research endeavours of automated lip-reading by contributing towards work done in the region of classification schema. A whole new line of research is opened up whereby an alternative way to do lip-reading is explored and in doing so, lip-reading performance results for predicting s entences from a benchmark dataset are attained which improve upon the current state-of-the-art. In this thesis, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences as a two-stage procedure with visemes being recognised as the first stage and words being classified as the second stage. This is such that the second-stage has to both overcome the one-to-many mapping problem posed in lip-reading where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with. To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes; and the proposal of viseme-to-word conversion models that are both effective in their conversion performance of predicting words, and robust to the possibility of viseme confusion or misclassification. The initial system reported has been testified on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art works in lip reading sentences reported at the time, the system had achieved a significantly improved performance. The lip reading system is further improved upon by using a language model that has been demonstrated to be effective at discriminating between homopheme words and being robust to incorrectly classified visemes. An improved performance in predicting spoken sentences from the LRS2 dataset is yielded with an attained word accuracy rate of 79.6% which is still better than another lip-reading system trained and evaluated on the the same dataset that attained a word accuracy rate 77.4% and it is to the best of our knowledge the next best observed result attained on LRS2

    Ear Biometrics In Personal Identification

    Get PDF
    Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 2008Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2008Bu tez biyometrik tabanlı, kimlik tespitine dayalı güvenlik sistemi geliştirme fikriyle başlayan çalışmanın bir parçasıdır. Günümüzde, biyometrik tabanlı ve türleri içinde en yüksek doğruluk oranına sahip parmak izi ve iris tarama yöntemleri kriminal vakalarda ve yüksek güvenlik gerektiren tesislerde kullanılmaktatır. Yüz tanıma hala gelişmekte olan bir biyometrik yöntemidir, fakat yapılan literatür araştırmalarında ortam ışıklandırması, makyaj, verilen poz, duygusal ifadeler ve estetik operasyonlar gibi yüz görünümü üzerinde etkisi olan faktörlerin yüz tanıma probleminde doğrudan yöntemlerin başarımını azaltacak yönde etkili olduğu görülmüştür. Dolayısıyla, yüz gibi erişimi kolay fakat onun gibi gündelik hayatın makyaj, duygusal ifadeler, bıyık ve sakal bırakma gibi faktörlerinden etkilenmeyecek bir biyometrik gereksinimi ortaya çıkmıştır. Alternatif biyometriğin başarımının yüzle kıyaslanabilir mertebelerde olması gerektiği açıktır. Araştırmaların devamında, tek yumurta ikizlerinin birbirlerine ne kadar benzeseler de kulak yapılarının farklı olduğu, kulağın 3 boyutlu olsa da yüz kadar detay içermediği ve kulağın yapısı itibariyle duygusal açılımlar ifade edilirken biçimini değiştirmediği görülmüştür. Bunların ışığında, kulak yüze karşı güçlü bir alternatif biyometrik olarak ortaya çıkmaktadır. Bu çalışmada, literatürde önerilen yöntemler kulak resimleri üzerine uygulanmıştır. Bu yöntemler veri kümesi olarak 2 boyutlu resimleri kullanan ve veri kümesi üzerinde sınıflandırma yapan, lineer yöntemlerdir. Yapılan çalışma sonunda görülmüştür ki, PCA, FLD, FLD nın geliştirilmesiyle oluşturulan DCVA ve LPP yöntemlerinin kulak tanımadaki başarımları yüz tanımadaki başarılarından daha yüksektir. Bu yöntemlerin kulak tanımadaki doğru eşleştirme oranları, literatürde bulunan, yüz tanımadaki eşleştirme oranlarıyla karşılaştırıldıklarında daha yüksektir. Yapılan bu çalışmanın sonuçları biyometrik tabanlı kimlik tesbit yöntemleri için kulağın yüzden daha iyi bir alternatif olduğunu göstermiştir.This thesis is one of the parts of a biometric based identity verification security system development project. Today, the most successful biometric based identification technologies such as fingerprint and iris scan are used worldwide in both criminal investigations and high security facilities. Face recognition is one of the developing biometric methods; however illumination, makeup, posing, emotional expressions and face-lifting reduce the success of face recognition. A new biometric which is not effected by any of the factors above is needed. The new biometric should be as successful as face recognition. Twins are identical but their ears differ from each other, ear is also 3-dimensional but it is simpler than face and emotional expressions do not affect the ear. In the light of this, ear is a good alternative to face, as a biometric. In this study, the methods presented in the literature are tested on ear images. These methods are linear classification algorithms that work on 2D image databases. It is found out that, PCA, FLD, modified FLD which is also known as DCVA and LPP has better results at ear recognition than face recognition. Ear recognition has higher hit rates, when compared with face recognition researches that are presented in the literature previously. The results of this study proved that ear is the best alternative to face at personal identification tasks.Yüksek LisansM.Sc

    Multimodal analysis for object classification and event detection

    Get PDF

    Human Action Recognition Using Deep Probabilistic Graphical Models

    Get PDF
    Building intelligent systems that are capable of representing or extracting high-level representations from high-dimensional sensory data lies at the core of solving many A.I. related tasks. Human action recognition is an important topic in computer vision that lies in high-dimensional space. Its applications include robotics, video surveillance, human-computer interaction, user interface design, and multi-media video retrieval amongst others. A number of approaches have been proposed to extract representative features from high-dimensional temporal data, most commonly hard wired geometric or bio-inspired shape context features. This thesis first demonstrates some \emph{ad-hoc} hand-crafted rules for effectively encoding motion features, and later elicits a more generic approach for incorporating structured feature learning and reasoning, \ie deep probabilistic graphical models. The hierarchial dynamic framework first extracts high level features and then uses the learned representation for estimating emission probability to infer action sequences. We show that better action recognition can be achieved by replacing gaussian mixture models by Deep Neural Networks that contain many layers of features to predict probability distributions over states of Markov Models. The framework can be easily extended to include an ergodic state to segment and recognise actions simultaneously. The first part of the thesis focuses on analysis and applications of hand-crafted features for human action representation and classification. We show that the ``hard coded" concept of correlogram can incorporate correlations between time domain sequences and we further investigate multi-modal inputs, \eg depth sensor input and its unique traits for action recognition. The second part of this thesis focuses on marrying probabilistic graphical models with Deep Neural Networks (both Deep Belief Networks and Deep 3D Convolutional Neural Networks) for structured sequence prediction. The proposed Deep Dynamic Neural Network exhibits its general framework for structured 2D data representation and classification. This inspires us to further investigate for applying various graphical models for time-variant video sequences

    Machine Intelligence for Advanced Medical Data Analysis: Manifold Learning Approach

    Get PDF
    In the current work, linear and non-linear manifold learning techniques, specifically Principle Component Analysis (PCA) and Laplacian Eigenmaps, are studied in detail. Their applications in medical image and shape analysis are investigated. In the first contribution, a manifold learning-based multi-modal image registration technique is developed, which results in a unified intensity system through intensity transformation between the reference and sensed images. The transformation eliminates intensity variations in multi-modal medical scans and hence facilitates employing well-studied mono-modal registration techniques. The method can be used for registering multi-modal images with full and partial data. Next, a manifold learning-based scale invariant global shape descriptor is introduced. The proposed descriptor benefits from the capability of Laplacian Eigenmap in dealing with high dimensional data by introducing an exponential weighting scheme. It eliminates the limitations tied to the well-known cotangent weighting scheme, namely dependency on triangular mesh representation and high intra-class quality of 3D models. In the end, a novel descriptive model for diagnostic classification of pulmonary nodules is presented. The descriptive model benefits from structural differences between benign and malignant nodules for automatic and accurate prediction of a candidate nodule. It extracts concise and discriminative features automatically from the 3D surface structure of a nodule using spectral features studied in the previous work combined with a point cloud-based deep learning network. Extensive experiments have been conducted and have shown that the proposed algorithms based on manifold learning outperform several state-of-the-art methods. Advanced computational techniques with a combination of manifold learning and deep networks can play a vital role in effective healthcare delivery by providing a framework for several fundamental tasks in image and shape processing, namely, registration, classification, and detection of features of interest

    Interfaces de fala silenciosa multimodais para português europeu com base na articulação

    Get PDF
    Doutoramento conjunto MAPi em InformáticaThe concept of silent speech, when applied to Human-Computer Interaction (HCI), describes a system which allows for speech communication in the absence of an acoustic signal. By analyzing data gathered during different parts of the human speech production process, Silent Speech Interfaces (SSI) allow users with speech impairments to communicate with a system. SSI can also be used in the presence of environmental noise, and in situations in which privacy, confidentiality, or non-disturbance are important. Nonetheless, despite recent advances, performance and usability of Silent Speech systems still have much room for improvement. A better performance of such systems would enable their application in relevant areas, such as Ambient Assisted Living. Therefore, it is necessary to extend our understanding of the capabilities and limitations of silent speech modalities and to enhance their joint exploration. Thus, in this thesis, we have established several goals: (1) SSI language expansion to support European Portuguese; (2) overcome identified limitations of current SSI techniques to detect EP nasality (3) develop a Multimodal HCI approach for SSI based on non-invasive modalities; and (4) explore more direct measures in the Multimodal SSI for EP acquired from more invasive/obtrusive modalities, to be used as ground truth in articulation processes, enhancing our comprehension of other modalities. In order to achieve these goals and to support our research in this area, we have created a multimodal SSI framework that fosters leveraging modalities and combining information, supporting research in multimodal SSI. The proposed framework goes beyond the data acquisition process itself, including methods for online and offline synchronization, multimodal data processing, feature extraction, feature selection, analysis, classification and prototyping. Examples of applicability are provided for each stage of the framework. These include articulatory studies for HCI, the development of a multimodal SSI based on less invasive modalities and the use of ground truth information coming from more invasive/obtrusive modalities to overcome the limitations of other modalities. In the work here presented, we also apply existing methods in the area of SSI to EP for the first time, noting that nasal sounds may cause an inferior performance in some modalities. In this context, we propose a non-invasive solution for the detection of nasality based on a single Surface Electromyography sensor, conceivable of being included in a multimodal SSI.O conceito de fala silenciosa, quando aplicado a interação humano-computador, permite a comunicação na ausência de um sinal acústico. Através da análise de dados, recolhidos no processo de produção de fala humana, uma interface de fala silenciosa (referida como SSI, do inglês Silent Speech Interface) permite a utilizadores com deficiências ao nível da fala comunicar com um sistema. As SSI podem também ser usadas na presença de ruído ambiente, e em situações em que privacidade, confidencialidade, ou não perturbar, é importante. Contudo, apesar da evolução verificada recentemente, o desempenho e usabilidade de sistemas de fala silenciosa tem ainda uma grande margem de progressão. O aumento de desempenho destes sistemas possibilitaria assim a sua aplicação a áreas como Ambientes Assistidos. É desta forma fundamental alargar o nosso conhecimento sobre as capacidades e limitações das modalidades utilizadas para fala silenciosa e fomentar a sua exploração conjunta. Assim, foram estabelecidos vários objetivos para esta tese: (1) Expansão das linguagens suportadas por SSI com o Português Europeu; (2) Superar as limitações de técnicas de SSI atuais na deteção de nasalidade; (3) Desenvolver uma abordagem SSI multimodal para interação humano-computador, com base em modalidades não invasivas; (4) Explorar o uso de medidas diretas e complementares, adquiridas através de modalidades mais invasivas/intrusivas em configurações multimodais, que fornecem informação exata da articulação e permitem aumentar a nosso entendimento de outras modalidades. Para atingir os objetivos supramencionados e suportar a investigação nesta área procedeu-se à criação de uma plataforma SSI multimodal que potencia os meios para a exploração conjunta de modalidades. A plataforma proposta vai muito para além da simples aquisição de dados, incluindo também métodos para sincronização de modalidades, processamento de dados multimodais, extração e seleção de características, análise, classificação e prototipagem. Exemplos de aplicação para cada fase da plataforma incluem: estudos articulatórios para interação humano-computador, desenvolvimento de uma SSI multimodal com base em modalidades não invasivas, e o uso de informação exata com origem em modalidades invasivas/intrusivas para superar limitações de outras modalidades. No trabalho apresentado aplica-se ainda, pela primeira vez, métodos retirados do estado da arte ao Português Europeu, verificando-se que sons nasais podem causar um desempenho inferior de um sistema de fala silenciosa. Neste contexto, é proposta uma solução para a deteção de vogais nasais baseada num único sensor de eletromiografia, passível de ser integrada numa interface de fala silenciosa multimodal

    Micro-architectural Threats to Modern Computing Systems

    Get PDF
    With the abundance of cheap computing power and high-speed internet, cloud and mobile computing replaced traditional computers. As computing models evolved, newer CPUs were fitted with additional cores and larger caches to accommodate run multiple processes concurrently. In direct relation to these changes, shared hardware resources emerged and became a source of side-channel leakage. Although side-channel attacks have been known for a long time, these changes made them practical on shared hardware systems. In addition to side-channels, concurrent execution also opened the door to practical quality of service attacks (QoS). The goal of this dissertation is to identify side-channel leakages and architectural bottlenecks on modern computing systems and introduce exploits. To that end, we introduce side-channel attacks on cloud systems to recover sensitive information such as code execution, software identity as well as cryptographic secrets. Moreover, we introduce a hard to detect QoS attack that can cause over 90+\% slowdown. We demonstrate our attack by designing an Android app that causes degradation via memory bus locking. While practical and quite powerful, mounting side-channel attacks is akin to listening on a private conversation in a crowded train station. Significant manual labor is required to de-noise and synchronizes the leakage trace and extract features. With this motivation, we apply machine learning (ML) to automate and scale the data analysis. We show that classical machine learning methods, as well as more complicated convolutional neural networks (CNN), can be trained to extract useful information from side-channel leakage trace. Finally, we propose the DeepCloak framework as a countermeasure against side-channel attacks. We argue that by exploiting adversarial learning (AL), an inherent weakness of ML, as a defensive tool against side-channel attacks, we can cloak side-channel trace of a process. With DeepCloak, we show that it is possible to trick highly accurate (99+\% accuracy) CNN classifiers. Moreover, we investigate defenses against AL to determine if an attacker can protect itself from DeepCloak by applying adversarial re-training and defensive distillation. We show that even in the presence of an intelligent adversary that employs such techniques, DeepCloak still succeeds

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
    corecore