30 research outputs found

    Overview of the NTCIR-12 SpokenQuery&Doc-2 task

    This paper presents an overview of the Spoken Query and Spoken Document retrieval (SpokenQuery&Doc-2) task at the NTCIR-12 Workshop. The task comprised two subtasks: spoken-query-driven spoken content retrieval (SQ-SCR) and spoken-query-driven spoken term detection (SQ-STD). The paper describes each subtask in detail, covering the data used, the construction of the speech recognition systems used to create the transcripts, the design of the retrieval test collections, the metrics used to evaluate the subtasks, and a summary of the results of the participants' submissions.
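
    For context, ranked-retrieval subtasks like SQ-SCR are commonly scored with metrics such as mean average precision (MAP). The sketch below is an illustrative MAP computation assuming binary relevance judgments; it is not the task's official scoring tool, whose exact metrics are defined in the paper.

        def average_precision(ranked_ids, relevant_ids):
            """Average precision for one query: mean of precision@k
            taken at each rank k where a relevant item appears."""
            hits, precision_sum = 0, 0.0
            for k, item in enumerate(ranked_ids, start=1):
                if item in relevant_ids:
                    hits += 1
                    precision_sum += hits / k
            return precision_sum / len(relevant_ids) if relevant_ids else 0.0

        def mean_average_precision(runs):
            """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
            return sum(average_precision(r, rel) for r, rel in runs) / len(runs)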

    Construction of a corpus of elderly Japanese speech for analysis and recognition

    Tokushima University; Aichi Prefectural University; University of Yamanashi. LREC 2018 Special Speech Sessions "Speech Resources Collection in Real-World Situations"; Phoenix Seagaia Conference Center, Miyazaki; 2018-05-09.
    We have constructed a new speech data corpus using the utterances of 100 elderly Japanese people, in order to improve the accuracy of automatic recognition of elderly speech. Humanoid robots are being developed for use in nursing care facilities for the elderly, because interaction with such robots is expected to help residents maintain their cognitive abilities, as well as provide them with companionship. In order for these robots to interact with the elderly through spoken dialogue, a high-performance speech recognition system for elderly speech is needed. To develop such a system, we recorded speech uttered by 100 elderly Japanese speakers with an average age of 77.2, most of them living in nursing homes. Another corpus of elderly Japanese speech, S-JNAS (Seniors-Japanese Newspaper Article Sentences), was developed previously, but the average age of its participants was 67.6. Since the target age for nursing home care is around 75, much higher than that of most of the S-JNAS speakers, we felt a more representative corpus was needed. In this study we compare the performance of our new corpus with both the Japanese read speech corpus JNAS (Japanese Newspaper Article Speech), which consists of adult speech, and S-JNAS, the senior version of JNAS, by conducting speech recognition experiments. Data from JNAS, S-JNAS, and the CSJ (Corpus of Spontaneous Japanese) were each used as training data for the acoustic models. We then used our new corpus to adapt the acoustic models to elderly speech, but were unable to achieve sufficient performance when recognizing elderly speech. Based on our experimental results, we believe that a corpus of spontaneous elderly speech and/or special acoustic adaptation methods will likely be necessary to improve the recognition performance of dialogue systems for the elderly.
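
    The recognition experiments above are compared via error rates; below is a minimal sketch of the standard edit-distance error rate (word error rate here; the same recurrence over characters gives CER). This is illustrative code, not the authors' evaluation tooling.

        def word_error_rate(reference, hypothesis):
            """WER = (substitutions + deletions + insertions) / reference length,
            computed with the standard Levenshtein dynamic program over words."""
            ref, hyp = reference.split(), hypothesis.split()
            # dp[i][j] = edit distance between ref[:i] and hyp[:j]
            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                dp[i][0] = i
            for j in range(len(hyp) + 1):
                dp[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                                   dp[i][j - 1] + 1,         # insertion
                                   dp[i - 1][j - 1] + cost)  # substitution/match
            return dp[len(ref)][len(hyp)] / max(len(ref), 1)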

    A new speech corpus of super-elderly Japanese for acoustic modeling

    The development of accessible speech recognition technology will allow the elderly to more easily access electronically stored information. However, the necessary level of recognition accuracy for elderly speech has not yet been achieved by conventional speech recognition systems, due to the unique features of the speech of elderly people. To address this problem, we have created a new speech corpus named EARS (Elderly Adults Read Speech), consisting of the recorded read speech of 123 super-elderly Japanese people (average age: 83.1), as a resource for training automatic speech recognition models for the elderly. In this study, we investigated the acoustic features of super-elderly Japanese speech using our new speech corpus. In comparison to the speech of less elderly Japanese speakers, we observed a slower speech rate and extended vowel duration for both genders, a slight increase in fundamental frequency for males, and a slight decrease in fundamental frequency for females. To demonstrate the efficacy of our corpus, we also conducted speech recognition experiments using two different acoustic models (DNN-HMM and Transformer-based), trained on a combination of data from our corpus and speech data from three conventional Japanese speech corpora. When using the DNN-HMM trained with EARS and speech data from the existing corpora, the character error rate (CER) was reduced by 7.8% (to just over 9%), compared to a CER of 16.9% when using only the baseline training corpora. We also investigated the effect of training the models with various amounts of EARS data, using a simple data expansion method, and of training the unmodified acoustic models for various numbers of epochs. When using the Transformer-based end-to-end speech recognizer, the character error rate was reduced by 3.0% (to 11.4%) by adding a doubled EARS corpus to the baseline training data, compared to a CER of 13.4% when only data from the baseline training corpora were used.
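
    Reading the first reported reduction as absolute percentage points is consistent with the quoted end values; a quick check (this reading is an interpretation, using only figures from the abstract):

        baseline_dnn_hmm_cer = 16.9               # CER with baseline corpora only
        ears_dnn_hmm_cer = baseline_dnn_hmm_cer - 7.8
        print(ears_dnn_hmm_cer)                   # 9.1, i.e. "just over 9%"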

    Clinicopathological Features and Outcomes of Endoscopic Submucosal Dissection for Superficial Cancer of the Pharynx

    The efficacy and safety of endoscopic submucosal dissection (ESD) for superficial cancer of the pharynx are still unclear. To identify the clinicopathological features of superficial pharyngeal cancer, and the efficacy and safety of ESD, we retrospectively assessed 70 pharyngeal cancers in 59 patients who underwent ESD. Of these patients, 61.0% and 50.8% had a history of esophageal cancer and head and neck cancer, respectively. The median tumor size was 15 mm, and 75.7% of the lesions were located at the piriform sinus. The en bloc resection rate was 94.9%. Treatment-related adverse events occurred in 8 cases, but there were no treatment-related deaths. The lateral margin was positive for neoplasm in 3 lesions (4.3%) and inconclusive in 27 lesions (38.6%), but no local recurrence was observed. Cervical lymph node metastasis was observed in 6 patients and was successfully treated by cervical lymph node dissection. The three-year overall survival rate was 91.5% (95% CI: 76.6-97.3%) and the cause-specific survival rate was 97.6% (95% CI: 84.9-99.7%). In conclusion, ESD for superficial pharyngeal cancer was safe and effective. "Resect and watch" is probably a feasible and rational strategy for the treatment of patients with superficial pharyngeal cancer.

    A hybrid approach of knowledge-driven and data-driven reasoning for activity recognition in smart homes

    Accurate activity recognition plays a major role in smart homes, providing assistance and support for users, especially elderly and cognitively impaired people. Knowledge-driven approaches are an emerging line of research for this task that has shown interesting advantages and features. However, these approaches have several limitations: the models they produce are usually too incomplete to capture all types of human activities, which limits their ability to accurately infer users' activities. This paper presents an alternative approach that combines knowledge-driven with data-driven reasoning to allow activity models to evolve and adapt automatically to users' particularities. First, knowledge-driven reasoning is used to infer an initial activity model. The model is then trained using data-driven techniques to produce a dynamic activity model that learns users' varying actions. This approach has been evaluated on a publicly available dataset, and the experimental results show that the learned activity model yields significantly higher recognition rates than the initial activity model.
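
    One way to picture the hybrid approach is sketched below: expert rules seed an initial activity model, and a classifier trained on the user's own data then refines it. This is a minimal illustration under assumed inputs (binary sensor-event vectors and a hypothetical rule table), not the paper's actual implementation.

        from sklearn.tree import DecisionTreeClassifier

        # Hypothetical encoding: each sample is a binary vector of triggered
        # sensors, e.g. (kettle, fridge, tap, stove).
        SEED_RULES = {
            (1, 0, 1, 0): "make_tea",    # knowledge-driven seed rules
            (0, 1, 1, 1): "cook_meal",   # (illustrative, not the paper's ontology)
        }

        def seed_label(sample):
            """Initial activity model: look up the expert rule, if any."""
            return SEED_RULES.get(tuple(sample))

        def adapt_model(samples, observed_labels):
            """Data-driven refinement: train on the user's data so the model
            can capture activity variants that the seed rules miss."""
            clf = DecisionTreeClassifier(random_state=0)
            clf.fit(samples, observed_labels)
            return clf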

    Combination of Time-domain, Frequency-domain, and Cepstral-domain Acoustic Features for Speech Commands Classification

    In speech-related classification tasks, frequency-domain acoustic features such as logarithmic Mel-filter bank coefficients (FBANK) and cepstral-domain acoustic features such as Mel-frequency cepstral coefficients (MFCC) are often used. However, time-domain features perform more effectively in some sound classification tasks that contain non-vocal or weakly speech-related sounds. We previously proposed a feature called bit sequence representation (BSR), a time-domain binary acoustic feature based on the raw waveform. Compared with MFCC, BSR performed better in environmental sound detection and showed comparable accuracy in limited-vocabulary speech recognition tasks. In this paper, we propose an improved BSR feature, called BSR-float16, that represents floating-point values more precisely. Using Google's Speech Commands dataset, we experimentally demonstrated the complementarity among time-domain, frequency-domain, and cepstral-domain features, and therefore used a simple back-end score fusion method to improve the final classification accuracy. The fusion results also showed better noise robustness.
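
    Back-end (late) score fusion of the kind mentioned above can be as simple as a weighted average of each classifier's class-score matrix. Below is a minimal sketch with assumed shapes (one score matrix per feature domain, rows = utterances, columns = classes); the weights and fusion rule used in the paper may differ.

        import numpy as np

        def fuse_scores(score_mats, weights=None):
            """Average the per-domain score matrices and pick the top class."""
            if weights is None:
                weights = [1.0 / len(score_mats)] * len(score_mats)
            fused = sum(w * np.asarray(s) for w, s in zip(weights, score_mats))
            return fused.argmax(axis=1)  # predicted class index per utterance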

    Construction and Analysis of Corpus of Japanese Classroom Lecture Speech Contents

    This paper describes the Corpus of Japanese classroom Lecture speech Contents (CJLC), which we are developing. The growing volume of e-learning content demands a sophisticated interactive browsing system, but existing tools do not satisfy this requirement. Realizing such a system requires research on many topics, including large-vocabulary continuous speech recognition and the extraction of important sentences from lecture content. CJLC is designed as a fundamental basis for this research, and consists of speech, transcriptions, and slides collected in real university classroom lectures.
    APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference; 4-7 October 2009; Sapporo, Japan. Oral session: Initiatives in Spoken Document Processing (6 October 2009).
