30 research outputs found

    Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

    Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future context, a streaming ASR model achieves higher accuracy but incurs larger latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn a feature representation that anticipates long-term context, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study therefore examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing. Comment: Accepted to EUSIPCO 202

    Conversation-oriented ASR with multi-look-ahead CBS architecture

    During conversations, humans are capable of inferring the intention of the speaker at any point in the speech and of preparing the following action promptly. This ability is also key for conversational systems to achieve rhythmic and natural conversation. To achieve this, the automatic speech recognition (ASR) system used for transcribing the speech in real time must achieve high accuracy without delay. In streaming ASR, high accuracy is obtained by attending to look-ahead frames, which increases delay. To tackle this trade-off, we propose a multiple-latency streaming ASR system that achieves high accuracy with zero look-ahead. The proposed system contains two encoders that operate in parallel: a primary encoder generates accurate outputs using look-ahead frames, and an auxiliary encoder recognizes the look-ahead portion of the primary encoder without look-ahead. The proposed system is built on the contextual block streaming (CBS) architecture, which leverages block processing and has a high affinity for the multiple-latency architecture. Various methods for architecting the system are also studied, including shifting the network to perform as different encoders and generating both encoders' outputs in one encoding pass. Comment: Submitted to ICASSP202
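
The accuracy/latency trade-off described in this abstract can be illustrated with a minimal sketch of block processing with look-ahead frames. This is not the CBS implementation: the function name `split_into_blocks` and the parameters `center` and `look_ahead` are hypothetical, and past-context frames are ignored for brevity. Each block can only be processed once its look-ahead frames have arrived, so the added delay grows with `look_ahead`.

```python
import numpy as np

def split_into_blocks(frames, center, look_ahead):
    """Split a frame sequence into blocks of `center` frames,
    each extended by up to `look_ahead` future frames (hypothetical helper)."""
    blocks = []
    for start in range(0, len(frames), center):
        # The block must wait for frames up to `end` before it can be encoded.
        end = min(start + center + look_ahead, len(frames))
        blocks.append((start, frames[start:end]))
    return blocks

# Toy input: 20 frame indices standing in for acoustic feature frames.
frames = np.arange(20)

# Analogous to a primary encoder: uses 2 look-ahead frames per block.
with_la = split_into_blocks(frames, center=4, look_ahead=2)

# Analogous to an auxiliary encoder: zero look-ahead, no added waiting time.
without_la = split_into_blocks(frames, center=4, look_ahead=0)
```

With `look_ahead=0` every block is available as soon as its own frames arrive, which is the zero-look-ahead setting the abstract targets.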

    Crop Classification and LAI Estimation Using Original and Resolution-Reduced Images from Two Consumer-Grade Cameras

    Consumer-grade cameras have been increasingly used for remote sensing applications in recent years. However, the performance of this type of camera has not been systematically tested and well documented in the literature. The objective of this research was to evaluate the performance of original and resolution-reduced images taken from two consumer-grade cameras, an RGB camera and a modified near-infrared (NIR) camera, for crop identification and leaf area index (LAI) estimation. Airborne RGB and NIR images taken over a 6.5-square-km cropping area were mosaicked and aligned to create a four-band mosaic with a spatial resolution of 0.4 m. The spatial resolution of the mosaic was then reduced to 1, 2, 4, 10, 15 and 30 m for comparison. Six supervised classifiers were applied to the RGB images and the four-band images for crop identification, and 10 vegetation indices (VIs) derived from the images were related to ground-measured LAI. Accuracy assessment showed that maximum likelihood applied to the 0.4-m images achieved an overall accuracy of 83.3% for the RGB image and 90.4% for the four-band image. Regression analysis showed that the 10 VIs explained 58.7% to 83.1% of the variability in LAI. Moreover, spatial resolutions of 0.4, 1, 2 and 4 m achieved better results for both crop identification and LAI prediction than the coarser spatial resolutions of 10, 15 and 30 m. The results from this study indicate that imagery from consumer-grade cameras can be a useful data source for crop identification and canopy cover estimation.
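
The resolution reduction and vegetation-index steps mentioned in this abstract can be sketched as follows. The abstract does not say how the mosaic was resampled or which 10 VIs were used, so block averaging and NDVI are assumptions chosen for illustration, and the function names and synthetic reflectance values are hypothetical.

```python
import numpy as np

def reduce_resolution(band, factor):
    """Coarsen a 2-D band by averaging non-overlapping factor x factor blocks
    (assumed resampling method; edge pixels that do not fill a block are dropped)."""
    rows, cols = band.shape
    rows -= rows % factor
    cols -= cols % factor
    trimmed = band[:rows, :cols].astype(float)
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

def ndvi(nir, red, eps=1e-9):
    """Normalized difference vegetation index, (NIR - red) / (NIR + red)."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + eps)

# Synthetic stand-ins for the red and NIR bands of a 0.4-m mosaic.
rng = np.random.default_rng(0)
red = rng.uniform(0.05, 0.2, size=(100, 100))
nir = rng.uniform(0.3, 0.6, size=(100, 100))

# A factor of 5 takes 0.4-m pixels to 2-m pixels.
red_2m = reduce_resolution(red, 5)
nir_2m = reduce_resolution(nir, 5)
ndvi_2m = ndvi(nir_2m, red_2m)
```

Computing the index after aggregation, as here, is one possible ordering; averaging a per-pixel index instead would generally give slightly different values.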

    Registration for Optical Multimodal Remote Sensing Images Based on FAST Detection, Window Selection, and Histogram Specification

    In recent years, digital frame cameras have been increasingly used for remote sensing applications. However, it is always a challenge to align or register images captured with different cameras or different imaging sensor units. In this research, a novel registration method was proposed. Coarse registration was first applied to approximately align the sensed and reference images. Window selection was then used to reduce the search space, and histogram specification was applied to optimize the grayscale similarity between the images. After comparisons with other commonly used detectors, the fast corner detector FAST (Features from Accelerated Segment Test) was selected to extract the feature points. The matching point pairs were then detected between the images, the outliers were eliminated, and geometric transformation was performed. An appropriate window size was searched for and set to one-tenth of the image width. Images acquired by a two-camera system, a camera with five imaging sensors, and a camera with replaceable filters, mounted on a manned aircraft, an unmanned aerial vehicle, and a ground-based platform, respectively, were used to evaluate the performance of the proposed method. The image analysis results showed that, through appropriate window selection and histogram specification, the number of correctly matched point pairs increased by 11.30 times and the correct matching rate increased by 36%, compared with the results based on FAST alone. The root mean square error (RMSE) in the x and y directions was generally within 0.5 pixels. In comparison with binary robust invariant scalable keypoints (BRISK), curvature scale space (CSS), Harris, speeded-up robust features (SURF), and the commercial software ERDAS and ENVI, this method resulted in larger numbers of correct matching pairs and smaller, more consistent RMSE. Furthermore, it was not necessary to choose any tie control points manually before registration. The results from this study indicate that the proposed method can be effective for registering optical multimodal remote sensing images that have been captured with different imaging sensors.
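
The histogram specification step named in this abstract can be sketched with NumPy as below. This is a generic CDF-matching formulation, not necessarily the authors' exact procedure; the function name `match_histogram` and the synthetic images are hypothetical.

```python
import numpy as np

def match_histogram(sensed, reference):
    """Remap the grey levels of `sensed` so that its cumulative distribution
    approximates that of `reference` (generic histogram specification)."""
    s_values, s_idx, s_counts = np.unique(
        sensed.ravel(), return_inverse=True, return_counts=True
    )
    r_values, r_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical CDFs of both images, evaluated at their distinct grey levels.
    s_cdf = np.cumsum(s_counts) / sensed.size
    r_cdf = np.cumsum(r_counts) / reference.size

    # For each sensed grey level, pick the reference level with the same CDF value.
    mapped = np.interp(s_cdf, r_cdf, r_values)
    return mapped[s_idx].reshape(sensed.shape)

# Synthetic stand-ins: a dark sensed image and a brighter reference image.
rng = np.random.default_rng(0)
sensed = rng.integers(0, 100, size=(64, 64))
reference = rng.integers(100, 256, size=(64, 64))
matched = match_histogram(sensed, reference)
```

After the remapping, the grey levels of `matched` fall within the range of `reference`, which is what makes the two images more comparable for subsequent feature matching.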

    Streaming Automatic Speech Recognition with Low Latency and High Accuracy

    Waseda University, Master of Engineering, master thesis
