Detection of major ASL sign types in continuous signing for ASL recognition
In American Sign Language (ASL) as well as other signed languages, different classes of signs (e.g., lexical signs, fingerspelled signs, and classifier constructions) have different internal structural properties. Continuous sign recognition accuracy can be improved through use of distinct recognition strategies, as well as different training datasets, for each class of signs. For these strategies to be applied, continuous signing video needs to be segmented into parts corresponding to particular classes of signs. In this paper we present a multiple instance learning-based segmentation system that accurately labels 91.27% of the video frames of 500 continuous utterances (including 7 different subjects) from the publicly accessible NCSLGR corpus (Neidle and Vogler, 2012). The system uses novel feature descriptors derived from both motion and shape statistics of the regions of high local motion. The system does not require a hand tracker
Scalable ASL sign recognition using model-based machine learning and linguistically annotated corpora
We report on the high success rates of our new, scalable, computational approach for sign recognition from monocular video, exploiting linguistically annotated ASL datasets with multiple signers. We recognize signs using a hybrid framework combining state-of-the-art learning methods with features based on what is known about the linguistic composition of lexical signs. We model and recognize the sub-components of sign production, with attention to hand shape, orientation, location, motion trajectories, plus non-manual features, and we combine these within a CRF framework. The effect is to make the sign recognition problem robust, scalable, and feasible with relatively smaller datasets than are required for purely data-driven methods. From a 350-sign vocabulary of isolated, citation-form lexical signs from the American Sign Language Lexicon Video Dataset (ASLLVD), including both 1- and 2-handed signs, we achieve a top-1 accuracy of 93.3% and a top-5 accuracy of 97.9%. The high probability with which we can produce 5 sign candidates that contain the correct result opens the door to potential applications, as it is reasonable to provide a sign lookup functionality that offers the user 5 possible signs, in decreasing order of likelihood, with the user then asked to select the desired sign
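The top-1 and top-5 figures above, and the proposed lookup that offers five candidates in decreasing order of likelihood, can be illustrated with a minimal sketch. The sign labels and scores below are hypothetical toy values, not taken from ASLLVD, and the function names are ours; the abstract does not describe how the authors compute their rankings.

```python
def top_k_candidates(scores, k=5):
    """Return the k labels with the highest scores, most likely first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def top_k_accuracy(all_scores, true_labels, k):
    """Fraction of queries whose true label is among the top-k candidates."""
    hits = sum(
        truth in top_k_candidates(scores, k)
        for scores, truth in zip(all_scores, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical recognizer scores for three query videos (toy vocabulary).
all_scores = [
    {"BOOK": 0.7, "OPEN": 0.2, "DOOR": 0.1},
    {"BOOK": 0.3, "OPEN": 0.5, "DOOR": 0.2},
    {"BOOK": 0.5, "OPEN": 0.4, "DOOR": 0.1},
]
true_labels = ["BOOK", "OPEN", "OPEN"]

print(top_k_candidates(all_scores[2], k=2))
print(top_k_accuracy(all_scores, true_labels, k=1))
print(top_k_accuracy(all_scores, true_labels, k=2))
```

In this toy data the third query is misranked at position one but recovered at position two, which is the effect a candidate list exploits: the user picks the intended sign from a short ranked list.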
NEW shared & interconnected ASL resources: SignStream® 3 Software; DAI 2 for web access to linguistically annotated video corpora; and a sign bank
2017 marked the release of a new version of SignStream® software, designed to facilitate linguistic analysis of ASL video. SignStream® provides an intuitive interface for labeling and time-aligning manual and non-manual components of the signing. Version 3 has many new features. For example, it enables representation of morpho-phonological information, including display of handshapes. An expanding ASL video corpus, annotated through use of SignStream®, is shared publicly on the Web. This corpus (video plus annotations) is Web-accessible—browsable, searchable, and downloadable—thanks to a new, improved version of our Data Access Interface: DAI 2. DAI 2 also offers Web access to a brand new Sign Bank, containing about 10,000 examples of about 3,000 distinct signs, as produced by up to 9 different ASL signers. This Sign Bank is also directly accessible from within SignStream®, thereby boosting the efficiency and consistency of annotation; new items can also be added to the Sign Bank. Soon to be integrated into SignStream® 3 and DAI 2 are visualizations of computer-generated analyses of the video: graphical display of eyebrow height, eye aperture, an
Multispectral Deep Neural Networks for Pedestrian Detection
Multispectral pedestrian detection is essential for around-the-clock
applications, e.g., surveillance and autonomous driving. We analyze Faster R-CNN
in depth for the multispectral pedestrian detection task and then cast it as
a convolutional network (ConvNet) fusion problem. Further, we discover that
ConvNet-based pedestrian detectors trained on color or thermal images
separately provide complementary information in discriminating human instances.
Thus there is a large potential to improve pedestrian detection by using color
and thermal images in DNNs simultaneously. We carefully design four ConvNet
fusion architectures that integrate two-branch ConvNets at different DNN
stages, all of which yield better performance compared with the baseline
detector. Our experimental results on the KAIST pedestrian benchmark show that the
Halfway Fusion model, which performs fusion on the middle-level convolutional
features, outperforms the baseline method by 11% and yields a miss rate 3.5%
lower than the other proposed architectures. Comment: 13 pages, 8 figures, BMVC 2016 ora
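As a rough illustration of fusing two branches at the middle-level convolutional features, the sketch below concatenates a color and a thermal feature map along the channel axis and mixes them with a 1x1 convolution. The concatenate-then-1x1 design, the shapes, and the name `halfway_fuse` are assumptions made for illustration; the abstract does not specify the fusion operator.

```python
import numpy as np


def halfway_fuse(color_feat, thermal_feat, weights, bias):
    """Fuse two mid-level feature maps of shape (C, H, W).

    The maps are stacked along the channel axis and mixed per pixel by a
    1x1 convolution (weights: (C_out, 2C), bias: (C_out,)), then ReLU.
    """
    stacked = np.concatenate([color_feat, thermal_feat], axis=0)  # (2C, H, W)
    # A 1x1 convolution is a matrix product over the channel axis.
    mixed = np.einsum("oc,chw->ohw", weights, stacked) + bias[:, None, None]
    return np.maximum(mixed, 0.0)


# Hypothetical feature maps standing in for the two branch outputs.
rng = np.random.default_rng(0)
C, H, W = 4, 6, 8
color_feat = rng.standard_normal((C, H, W))
thermal_feat = rng.standard_normal((C, H, W))
weights = rng.standard_normal((C, 2 * C)) * 0.1
bias = np.zeros(C)

fused = halfway_fuse(color_feat, thermal_feat, weights, bias)
print(fused.shape)  # same (C, H, W) layout as each input branch
```

Because the fused map has the same shape as a single-branch map, the layers after the fusion point can be shared without modification.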
Linguistically-driven framework for computationally efficient and scalable sign recognition
We introduce a new general framework for sign recognition from monocular video using limited quantities of annotated data. The novelty of the hybrid framework we describe here is that we exploit state-of-the-art learning methods while also incorporating features based on what we know about the linguistic composition of lexical signs. In particular, we analyze hand shape, orientation, location, and motion trajectories, and then use CRFs to combine this linguistically significant information for purposes of sign recognition. Our robust modeling and recognition of these sub-components of sign production allow an efficient parameterization of the sign recognition problem as compared with purely data-driven methods. This parameterization enables a scalable and extendable time-series learning approach that advances the state of the art in sign recognition, as shown by the results reported here for recognition of isolated, citation-form, lexical signs from American Sign Language (ASL)
3D face tracking and multi-scale, spatio-temporal analysis of linguistically significant facial expressions and head positions in ASL
Essential grammatical information is conveyed in signed languages by clusters of events involving facial expressions and movements of the head and upper body. This poses a significant challenge for computer-based sign language recognition. Here, we present new methods for the recognition of nonmanual grammatical markers in American Sign Language (ASL) based on: (1) new 3D tracking methods for the estimation of 3D head pose and facial expressions to determine the relevant low-level features; (2) methods for higher-level analysis of component events (raised/lowered eyebrows, periodic head nods and head shakes) used in grammatical markings, with differentiation of temporal phases (onset, core, offset, where appropriate), analysis of their characteristic properties, and extraction of corresponding features; (3) a 2-level learning framework to combine low- and high-level features of differing spatio-temporal scales. This new approach achieves significantly better tracking and recognition results than our previous methods
Adaptive low rank and sparse decomposition of video using compressive sensing
We address the problem of reconstructing and analyzing surveillance videos
using compressive sensing. We develop a new method that performs video
reconstruction by low rank and sparse decomposition adaptively. Background
subtraction becomes part of the reconstruction. In our method, a background
model is used in which the background is learned adaptively as the compressive
measurements are processed. The adaptive method has low latency, and is more
robust than previous methods. We will present experimental results to
demonstrate the advantages of the proposed method. Comment: Accepted ICIP 201
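The idea of splitting video into a low-rank background and a sparse foreground can be sketched on fully observed frames. This is a generic alternating scheme (truncated SVD plus soft thresholding), not the adaptive compressive-sensing method of the paper; the parameters `rank`, `tau`, and `iters` and the toy data are assumptions.

```python
import numpy as np


def soft_threshold(x, tau):
    """Shrink entries toward zero by tau; small entries become exactly zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def low_rank_sparse(frames, rank=1, tau=0.5, iters=20):
    """Split a (pixels, frames) matrix into low-rank and sparse parts."""
    sparse = np.zeros_like(frames)
    for _ in range(iters):
        # Low-rank step: best rank-r fit to what the sparse part leaves over.
        u, s, vt = np.linalg.svd(frames - sparse, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Sparse step: keep only residuals that are large in magnitude.
        sparse = soft_threshold(frames - low_rank, tau)
    return low_rank, sparse


# Toy video: 20 pixels, 10 frames, static background plus two bright blips.
background = np.outer(np.linspace(1.0, 2.0, 20), np.ones(10))
foreground = np.zeros_like(background)
foreground[3, 2] = 5.0
foreground[10, 7] = 5.0
frames = background + foreground

low_rank, sparse = low_rank_sparse(frames)
print(np.argwhere(np.abs(sparse) > 1.0))  # locations flagged as foreground
```

Here background subtraction falls out of the decomposition itself: the sparse component is the foreground estimate, which mirrors the abstract's remark that background subtraction becomes part of the reconstruction.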
A new framework for sign language recognition based on 3D handshape identification and linguistic modeling
Current approaches to sign recognition by computer generally have at least some of the following limitations: they rely on laboratory
conditions for sign production, are limited to a small vocabulary, rely on 2D modeling (and therefore cannot deal with occlusions
and off-plane rotations), and/or achieve limited success. Here we propose a new framework that (1) provides a new tracking method
less dependent than others on laboratory conditions and able to deal with variations in background and skin regions (such as the
face, forearms, or other hands); (2) allows for identification of 3D hand configurations that are linguistically important in American
Sign Language (ASL); and (3) incorporates statistical information reflecting linguistic constraints in sign production. For purposes of
large-scale computer-based sign language recognition from video, the ability to distinguish hand configurations accurately is critical.
Our current method estimates the 3D hand configuration to distinguish among 77 hand configurations linguistically relevant for
ASL. Constraining the problem in this way makes recognition of 3D hand configuration more tractable and provides the information
specifically needed for sign recognition. Further improvements are obtained by incorporation of statistical information about linguistic
dependencies among handshapes within a sign derived from an annotated corpus of almost 10,000 sign tokens
