2 research outputs found

    Towards End-to-End spoken intent recognition in smart home

    Voice-based interaction in a smart home has become a feature of many industrial products. These systems react to voice commands, whether to answer a question, play music, or turn on the lights. To be effective, they must be able to extract the user's intent from the voice command. Intent recognition from voice is typically performed as a pipeline: automatic speech recognition (ASR) followed by intent classification on the transcriptions. However, errors accumulated at the ASR stage can severely impact the intent classifier. In this paper, we propose an End-to-End (E2E) model that performs intent classification directly from the raw speech input. The E2E approach is thus optimized for this specific task and avoids error propagation. Furthermore, the E2E model can exploit prosodic aspects of the speech signal for intent classification (e.g., interrogative vs. imperative voice). Experiments on a corpus of voice commands acquired in a real smart home reveal that the state-of-the-art pipeline baseline is still superior to the E2E approach. However, we show that artificial data generation techniques can significantly improve the E2E model, bringing it to competitive performance. This opens the way to further research on E2E Spoken Language Understanding.
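
    A minimal sketch of what such an E2E intent classifier could look like, written in PyTorch. The abstract does not specify the paper's architecture, so every name and layer choice below (E2EIntentClassifier, the convolutional front-end, the GRU, the dimensions) is an assumption for illustration only: a convolutional front-end learns acoustic features directly from the waveform, replacing the ASR transcription step, and a single classification head is trained on the intent labels.

        # Illustrative sketch only: the paper's actual architecture is not
        # given in the abstract; layers and sizes here are assumptions.
        import torch
        import torch.nn as nn

        class E2EIntentClassifier(nn.Module):
            def __init__(self, n_intents: int, hidden: int = 128):
                super().__init__()
                # Convolutional front-end over the raw waveform: no separate
                # ASR transcription step, so ASR errors cannot propagate.
                self.encoder = nn.Sequential(
                    nn.Conv1d(1, hidden, kernel_size=400, stride=160),  # ~25 ms windows, 10 ms hop at 16 kHz
                    nn.ReLU(),
                    nn.Conv1d(hidden, hidden, kernel_size=5, stride=2),
                    nn.ReLU(),
                )
                # Recurrent layer keeps sequence context, including prosodic
                # cues (e.g., question intonation) that a text-only pipeline discards.
                self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
                self.head = nn.Linear(2 * hidden, n_intents)

            def forward(self, waveform: torch.Tensor) -> torch.Tensor:
                # waveform: (batch, samples)
                x = self.encoder(waveform.unsqueeze(1))  # (batch, hidden, frames)
                x, _ = self.rnn(x.transpose(1, 2))       # (batch, frames, 2*hidden)
                return self.head(x.mean(dim=1))          # pool over time -> intent logits

        model = E2EIntentClassifier(n_intents=8)
        logits = model(torch.randn(2, 16000))            # two 1-second utterances at 16 kHz
        print(logits.shape)                              # torch.Size([2, 8])

    Because the whole stack is differentiable from waveform to intent, it can be trained on (audio, intent) pairs alone, which is also why the artificial data generation mentioned in the abstract matters: the model needs enough speech examples per intent.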

    Learning Natural Language Understanding Systems from Unaligned Labels for Voice Command in Smart Homes

    Voice-command smart home systems have become a target for industry to provide more natural human-computer interaction. To interpret a voice command, a system must be able to extract its meaning from natural language; this task is called Natural Language Understanding (NLU). Modern NLU is based on statistical models trained on data. However, a current limitation of most statistical NLU models is their dependence on large amounts of textual data aligned with target semantic labels, which is highly time-consuming to produce. Moreover, they require training several separate models to predict intents, slot labels, and slot values. In this paper, we propose a sequence-to-sequence neural architecture to train NLU models that do not need aligned data and can jointly learn the intent, slot-label, and slot-value prediction tasks. This approach was evaluated both on a voice-command dataset we acquired for the purpose of the study and on a publicly available dataset. The experiments show that a single model learned on unaligned data is competitive with state-of-the-art models that depend on aligned data.
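
    A minimal sketch of the unaligned sequence-to-sequence idea, again in PyTorch with assumed names and dimensions (Seq2SeqNLU, the vocabularies, the GRU sizes are all hypothetical): the decoder emits the intent and the slot-label/value pairs as one flat output sequence, so the training targets never need token-level alignment with the input words, and one model covers all three prediction tasks jointly.

        # Illustrative sketch only: architecture details are not specified in
        # the abstract; names and sizes here are assumptions.
        import torch
        import torch.nn as nn

        class Seq2SeqNLU(nn.Module):
            def __init__(self, src_vocab: int, tgt_vocab: int, hidden: int = 128):
                super().__init__()
                self.src_emb = nn.Embedding(src_vocab, hidden)
                self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
                self.encoder = nn.GRU(hidden, hidden, batch_first=True)
                self.decoder = nn.GRU(hidden, hidden, batch_first=True)
                self.out = nn.Linear(hidden, tgt_vocab)

            def forward(self, src: torch.Tensor, tgt_in: torch.Tensor) -> torch.Tensor:
                # src: word ids of the command, e.g. "turn on the kitchen light"
                # tgt_in: flat semantic sequence starting with a BOS symbol, e.g.
                #   "<bos> <intent:set_device> <slot:device> light <slot:room> kitchen"
                # No word in src is aligned to any symbol in tgt_in.
                _, state = self.encoder(self.src_emb(src))      # final state summarizes the utterance
                dec, _ = self.decoder(self.tgt_emb(tgt_in), state)
                return self.out(dec)                            # logits over the semantic vocabulary

        model = Seq2SeqNLU(src_vocab=1000, tgt_vocab=50)
        src = torch.randint(0, 1000, (2, 6))                    # batch of 2 commands, 6 words each
        tgt = torch.randint(0, 50, (2, 5))                      # flat semantic target sequences
        dec_in, dec_target = tgt[:, :-1], tgt[:, 1:]            # teacher forcing: predict the next symbol
        logits = model(src, dec_in)
        loss = nn.CrossEntropyLoss()(logits.reshape(-1, 50), dec_target.reshape(-1))
        print(logits.shape, float(loss))

    The key property is that the loss is computed on the semantic sequence alone, so annotators only have to write the labels for each utterance, not mark which words carry them.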
