
    Learning from Very Few Samples: A Survey

    Few-sample learning (FSL) is a significant and challenging problem in machine learning. The capability to learn and generalize successfully from very few samples is a notable demarcation between artificial intelligence and human intelligence, since humans can readily establish cognition of novel concepts from just a single or a handful of examples, whereas machine learning algorithms typically require hundreds or thousands of supervised samples to guarantee generalization. Despite a long history dating back to the early 2000s and widespread attention in recent years with booming deep learning technologies, few surveys or reviews of FSL have been available until now. In this context, we extensively review 300+ FSL papers spanning from the 2000s to 2019 and provide a timely and comprehensive survey of FSL. We review the evolution history as well as the current progress of FSL, categorize FSL approaches in principle into generative model based and discriminative model based kinds, and place particular emphasis on meta-learning based FSL approaches. We also summarize several recently emerging extensional topics of FSL and review the latest advances on these topics. Furthermore, we highlight important FSL applications covering many research hotspots in computer vision, natural language processing, audio and speech, reinforcement learning and robotics, data analysis, etc. Finally, we conclude the survey with a discussion of promising trends, in the hope of providing guidance and insights for follow-up research.
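    At test time, the metric-based meta-learning approaches the survey emphasizes often reduce to comparing a query embedding against class prototypes built from the few support samples. A minimal NumPy sketch of this nearest-prototype idea (the embedding step is assumed to have happened already; function and variable names are illustrative, not taken from the survey):

```python
import numpy as np

def classify_by_prototypes(support, support_labels, query):
    """Nearest-prototype classification, the core of metric-based
    few-shot learners such as Prototypical Networks.

    support:        (n_support, d) embedded support samples
    support_labels: (n_support,) integer class labels
    query:          (d,) embedded query sample
    Returns the predicted class label.
    """
    classes = np.unique(support_labels)
    # One prototype per class: the mean of its support embeddings.
    prototypes = np.stack([support[support_labels == c].mean(axis=0)
                           for c in classes])
    # Assign the query to the class with the closest prototype.
    dists = np.linalg.norm(prototypes - query, axis=1)
    return classes[np.argmin(dists)]
```

    Because only class means are stored, a new class can be added from a handful of examples without retraining, which is what makes the scheme attractive for FSL.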

    Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence

    Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight into how to improve the efficiency of machine learning systems.

    Development of Kinematic Templates for Automatic Pronunciation Assessment Using Acoustic-to-Articulatory Inversion

    Computer-aided pronunciation training (CAPT) is a subcategory of computer-aided language learning (CALL) that deals with the correction of mispronunciation during language learning. For a CAPT system to be effective, it must provide useful and informative feedback that is comprehensive, qualitative, quantitative, and corrective. While the majority of modern systems address the first three aspects of feedback, most do not provide corrective feedback. As part of the National Science Foundation (NSF) funded study “RI: Small: Speaker Independent Acoustic-Articulator Inversion for Pronunciation Assessment”, the Marquette Speech and Swallowing Lab and Marquette Speech and Signal Processing Lab are conducting a pilot study on the feasibility of using acoustic-to-articulatory inversion for CAPT. In order to evaluate the results of a speaker’s acoustic-to-articulatory inversion and determine pronunciation accuracy, kinematic templates are required. The templates represent the vowels, consonant clusters, and stress characteristics of a typical American English (AE) speaker in the midsagittal plane. The Marquette University electromagnetic articulography Mandarin-accented English (EMA-MAE) database, which contains acoustic and kinematic speech data for 40 speakers (20 of whom are native AE speakers), provides the data used to form the kinematic templates. The objective of this work is the development and implementation of these templates. The data provided in the EMA-MAE database are analyzed in detail, and the information obtained from the analysis is used to develop the kinematic templates. The vowel templates are designed as sets of concentric confidence ellipses, which specify (in the midsagittal plane) the ranges of tongue and lip positions corresponding to correct pronunciation. These ranges were defined using the typical articulator positioning of all English speakers of the EMA-MAE database.
The data from these English speakers were also used to model the magnitude, speed history, movement pattern, and duration (MSTD) features of each consonant cluster in the EMA-MAE corpus. Cluster templates were designed as sets of average MSTD parameters across English speakers for each cluster. Finally, English stress characteristics were similarly modeled as a set of average magnitude, speed, and duration parameters across English speakers. The kinematic templates developed in this work, while still in early stages, form the groundwork for assessment of features returned by the acoustic-to-articulatory inversion system. This in turn allows for assessment of articulatory inversion as a pronunciation training tool.
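    The confidence-ellipse templates can be made concrete: fit a mean and covariance to reference articulator positions, then test whether a new speaker's position falls inside a chosen confidence region. A minimal NumPy illustration (not the thesis code; it relies on the fact that for 2 degrees of freedom the chi-square quantile has the closed form -2 ln(1 - p)):

```python
import numpy as np

def fit_ellipse_template(positions):
    """Fit a 2-D vowel template (mean + covariance) from midsagittal
    articulator positions of reference speakers.
    positions: (n, 2) array of e.g. tongue (x, y) coordinates."""
    mu = positions.mean(axis=0)
    cov = np.cov(positions, rowvar=False)
    return mu, cov

def inside_confidence_ellipse(point, mu, cov, confidence=0.95):
    """True if `point` lies within the `confidence` ellipse of the
    template: squared Mahalanobis distance below the 2-dof
    chi-square quantile -2*ln(1 - confidence)."""
    d = point - mu
    m2 = d @ np.linalg.inv(cov) @ d  # squared Mahalanobis distance
    return bool(m2 <= -2.0 * np.log(1.0 - confidence))
```

    Concentric ellipses as described in the abstract would correspond to evaluating the same template at several confidence levels.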

    CGAMES'2009


    Exploration and Optimization of Noise Reduction Algorithms for Speech Recognition in Embedded Devices

    Environmental noise present in real-life applications substantially degrades the performance of speech recognition systems. An example is an in-car scenario, where a speech recognition system has to support the man-machine interface. Several sources of noise, coming from the engine, wipers, wheels, etc., interact with speech. A special challenge is the open-window scenario, where traffic noise, parking noise, etc., must also be handled. The main goal of this thesis is to improve the performance of a speech recognition system based on a state-of-the-art hidden Markov model (HMM) using noise reduction methods. Performance is measured with respect to word error rate and with the method of mutual information. The noise reduction methods are based on weighting rules. Least-squares weighting rules in the frequency domain have been developed to enable continuous development based on the existing system and to guarantee a low complexity and footprint for applications in embedded devices. The weighting rule parameters are optimized with a multidimensional optimization procedure: a Monte Carlo method followed by a compass search. Root compression and cepstral smoothing methods have also been implemented to boost recognition performance. The additional complexity and memory requirements of the proposed system are minimal. The performance of the proposed system was compared to the European Telecommunications Standards Institute (ETSI) standardized system. The proposed system outperforms the ETSI system by up to 8.6 % relative increase in word accuracy and achieves up to 35.1 % relative increase in word accuracy compared to the existing baseline system on the ETSI Aurora 3 German task. A relative increase of up to 18 % in word accuracy over the existing baseline system is also obtained with the proposed weighting rules on large-vocabulary databases.
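    As a generic illustration of the frequency-domain weighting-rule idea (this is a Wiener-style gain, not the thesis's least-squares rule; parameter names are invented), a per-frame implementation might look like:

```python
import numpy as np

def spectral_weighting(noisy_frame, noise_psd, floor=0.1):
    """Apply a Wiener-style spectral weighting rule to one frame.

    noisy_frame: time-domain samples of one analysis frame
    noise_psd:   estimated noise power spectrum (rfft length)
    floor:       lower bound on the gain, limiting musical noise
    """
    spectrum = np.fft.rfft(noisy_frame)
    noisy_psd = np.abs(spectrum) ** 2
    # Gain per frequency bin, clipped at a spectral floor.
    gain = np.maximum(1.0 - noise_psd / np.maximum(noisy_psd, 1e-12),
                      floor)
    return np.fft.irfft(gain * spectrum, n=len(noisy_frame))
```

    The appeal for embedded devices is that the rule is a simple per-bin multiplication on a spectrum the front end already computes.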
An entropy-based feature vector analysis method has also been developed to assess the quality of feature vectors. The entropy estimation is based on the histogram approach. The method has the advantage of objectively assessing feature vector quality regardless of the acoustic modeling assumptions used in the speech recognition system.
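    The histogram-based entropy estimate can be sketched in a few lines (an illustrative sketch, not the thesis implementation):

```python
import numpy as np

def histogram_entropy(values, bins=32):
    """Estimate the entropy (in bits) of one feature-vector
    component from a histogram of its observed values."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins: 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

    A component whose values spread uniformly over the bins approaches the maximum log2(bins) bits, while a degenerate component concentrated in one bin scores near zero.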

    Data Optimization in Deep Learning: A Survey

    Large-scale, high-quality data are considered an essential factor for the successful application of many deep learning techniques. Meanwhile, numerous real-world deep learning tasks still have to contend with a lack of sufficient amounts of high-quality data. Additionally, issues such as model robustness, fairness, and trustworthiness are closely related to the training data. Consequently, a huge number of studies in the existing literature have focused on the data aspect of deep learning tasks. Typical data optimization techniques include data augmentation, logit perturbation, sample weighting, and data condensation. These techniques usually come from different deep learning subfields, and their theoretical inspirations or heuristic motivations may seem unrelated to each other. This study organizes a wide range of existing data optimization methodologies for deep learning from the previous literature and constructs a comprehensive taxonomy for them. The taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension. On the basis of the taxonomy, connections among the extensive data optimization methods for deep learning are built in terms of four aspects. We also outline several promising and interesting future directions. The constructed taxonomy and the revealed connections will support a better understanding of existing methods and the design of novel data optimization techniques. Furthermore, our aspiration for this survey is to promote data optimization as an independent subdivision of deep learning. A curated, up-to-date list of resources related to data optimization in deep learning is available at https://github.com/YaoRujing/Data-Optimization.
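    Of the techniques listed, sample weighting is the simplest to illustrate: each training example's loss is scaled by a weight, for instance to counter class imbalance. A hypothetical sketch using inverse class-frequency weights (one common choice, not a method proposed by the survey):

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    a common sample-weighting scheme for imbalanced data.
    Weights are normalized so their mean is 1."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes, counts))
    return np.array([len(labels) / (len(classes) * freq[y])
                     for y in labels])

def weighted_cross_entropy(probs, labels, weights):
    """Mean cross-entropy with each sample's loss scaled by its
    weight.  probs: (n, n_classes) predicted probabilities."""
    per_sample = -np.log(probs[np.arange(len(labels)), labels])
    return float((weights * per_sample).mean())
```

    Other sample-weighting schemes covered by such surveys (e.g. difficulty- or noise-aware weights) plug into the same loss-scaling slot.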

    Neural network based image capture for 3D reconstruction

    The aim of this thesis is to build a neural network that can choose, from a video, the frames that carry important information for building a 3D map of the depicted structure, without losing 3D map accuracy. Consecutive frames often contain redundant information that adds nothing significant to the 3D map, and some frames may be, for example, distorted and contribute nothing at all; how much new information a frame carries depends on how the camera is moved while the video is filmed. If all frames of the video are used in the reconstruction of the 3D map, reconstruction takes a long time and requires many resources, which is problematic especially on embedded devices. In this thesis it is assumed that the embedded device would choose the most informative frames for building the 3D map, while the 3D map itself would be built afterwards from the saved frames on a desktop computer. A database is built from video feeds for neural network training and testing. To build the training database, a visual simultaneous localization and mapping algorithm is used to extract features, connect points between frames, and estimate the camera movement from each frame of the video feed. To obtain more training samples and make training less time consuming, the video feeds have been divided into short sequences of frames. A structure-from-motion algorithm is used to construct a 3D point cloud from image subsets; a 3D point cloud is constructed after each added frame. To determine whether a frame carries important information for 3D point cloud construction, the chamfer distance is used to measure how close the point cloud after each added frame is to the point cloud constructed with all the video frames. A class label is then assigned to each frame based on the change in chamfer distance. For the neural network, a long short-term memory recurrent neural network structure was chosen, because it can learn from the entire sequence of data.
The database construction, neural network training, and validation were all done with Matlab. The result of this master’s thesis is a simple long short-term memory neural network that can choose the important frames from a short sequence of images, but the accuracy needs to be further improved before the presented method can be used in a real embedded device. The custom loss function developed in the thesis did not perform well enough to ensure that exactly one frame is chosen from each run of similar consecutive frames.
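    The chamfer-distance criterion used to label frames can be written directly (an illustrative NumPy version; the thesis pipeline used Matlab):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point clouds a (n, 3) and
    b (m, 3): the distance from each point to its nearest neighbour
    in the other cloud, averaged in both directions and summed."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

    In the labeling scheme above, a frame counts as informative when adding it shrinks the chamfer distance between the partial point cloud and the full-video point cloud by a meaningful amount.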