NEW shared & interconnected ASL resources: SignStream® 3 Software; DAI 2 for web access to linguistically annotated video corpora; and a sign bank
2017 marked the release of a new version of SignStream® software, designed to facilitate linguistic analysis of ASL video. SignStream® provides an intuitive interface for labeling and time-aligning manual and non-manual components of the signing. Version 3 has many new features. For example, it enables representation of morpho-phonological information, including display of handshapes. An expanding ASL video corpus, annotated through use of SignStream®, is shared publicly on the Web. This corpus (video plus annotations) is Web-accessible—browsable, searchable, and downloadable—thanks to a new, improved version of our Data Access Interface: DAI 2. DAI 2 also offers Web access to a brand new Sign Bank, containing about 10,000 examples of about 3,000 distinct signs, as produced by up to 9 different ASL signers. This Sign Bank is also directly accessible from within SignStream®, thereby boosting the efficiency and consistency of annotation; new items can also be added to the Sign Bank. Soon to be integrated into SignStream® 3 and DAI 2 are visualizations of computer-generated analyses of the video: graphical display of eyebrow height, eye aperture, an
Evaluation of Motion Velocity as a Feature for Sign Language Detection
Popular video sharing websites contain a large collection of videos in various sign languages. These websites have the potential of being a significant source of knowledge sharing and communication for the members of the deaf and hard-of-hearing community. However, prior studies have shown that traditional keyword-based search does not do a good job of discovering these videos.
Dr. Frank Shipman and others have been working toward building a distributed digital library by indexing the sign language videos available online. This system employs an automatic detector, based on visual features extracted from the video, to filter out non-sign-language content. Features such as the amount and location of hand movement and the symmetry of motion have been explored for this purpose. Caio Monteiro and his team designed a classifier that uses face detection to identify the region of interest (ROI) in a frame, and foreground segmentation to estimate the amount of hand motion within that region. Karappa et al. later improved on this design by dividing the ROI using polar coordinates and estimating the motion in each division to form a composite feature set.
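The polar-coordinate partition can be pictured with a short sketch (our own toy reconstruction, not the authors' implementation; the motion-energy map and the sector count are assumptions):

```python
import numpy as np

def polar_motion_features(motion_map, center, n_sectors=8):
    """Split a 2D motion-energy map into angular sectors around `center`
    (e.g. the detected face) and return the summed motion per sector."""
    h, w = motion_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    angles = np.arctan2(ys - center[0], xs - center[1])  # range [-pi, pi]
    # Map each pixel's angle to a sector index 0..n_sectors-1.
    sectors = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int)
    sectors = np.clip(sectors, 0, n_sectors - 1)
    feats = np.zeros(n_sectors)
    for s in range(n_sectors):
        feats[s] = motion_map[sectors == s].sum()
    return feats

# Toy example: motion concentrated below the face (where hands usually are).
motion = np.zeros((60, 60))
motion[45:60, 20:40] = 1.0  # a blob of motion in the lower middle
feats = polar_motion_features(motion, center=(10, 30))
print(feats)  # most energy falls in the downward-facing sectors
```

Summing per sector (rather than over the whole ROI) keeps a coarse record of *where* motion happens relative to the face, which is what the composite feature set exploits.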
This thesis examines another visual feature associated with signing activity: the speed of hand movements. Speed-based features outperformed the foreground-based features on a complex dataset of SL and non-SL videos, with the F1 score rising from 0.73 to 0.78. However, on a second dataset consisting of videos with single signers and static backgrounds, the classification scores dipped. More consistent improvements were observed when the two feature sets were used in conjunction: an F1 score of 0.76 on the complex dataset, and a change from 0.85 to 0.86 on the second dataset.
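As a rough illustration of what a speed feature might look like (a hypothetical sketch of ours, not the thesis's actual pipeline; the centroid-displacement proxy for hand speed is an assumption):

```python
import numpy as np

def motion_centroid(prev, curr, thresh=0.1):
    """Centroid of pixels that changed between two grayscale frames,
    a crude stand-in for the moving hand's position."""
    moved = np.abs(curr - prev) > thresh
    if not moved.any():
        return None
    ys, xs = np.nonzero(moved)
    return np.array([ys.mean(), xs.mean()])

def speed_features(frames, thresh=0.1):
    """Per-frame speed = displacement of the motion centroid between
    consecutive frame pairs; the feature vector is simple statistics."""
    speeds, last = [], None
    for prev, curr in zip(frames, frames[1:]):
        c = motion_centroid(prev, curr, thresh)
        if c is not None and last is not None:
            speeds.append(np.linalg.norm(c - last))
        if c is not None:
            last = c
    speeds = np.array(speeds) if speeds else np.zeros(1)
    return np.array([speeds.mean(), speeds.max(), speeds.std()])

# Toy clip: a bright 5x5 "hand" sliding right by 3 pixels per frame.
frames = []
for t in range(5):
    f = np.zeros((40, 40))
    f[18:23, 5 + 3 * t: 10 + 3 * t] = 1.0
    frames.append(f)
feats = speed_features(frames)  # mean/max speed of 3 px/frame, zero variance
```

The point of a speed feature over a plain motion-amount feature is that signing has a characteristic tempo, which survives even when the amount of foreground motion varies with camera distance or background clutter.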
A related problem is identifying which sign language appears in a video. The impact of motion speed on classifying American Sign Language versus British Sign Language was found to be minimal. We concluded that the location of motion influences this problem more than either the speed or the amount of motion.
Non-speed-related analyses of sign language detection were also explored. Since the American Sign Language alphabet is one-handed, we expected that videos with left-handed signing might be falsely identified as British Sign Language, which has a two-handed alphabet. A brief study on our corpus of ASL and BSL videos showed that our classifier design does not suffer from this issue. We also explored speeding up classification by computing the symmetry of motion in the ROI on selected keyframes as a single feature. The resulting feature extraction was significantly faster, but precision and recall dropped to 59% and 62% respectively, for an F1 score of 0.61.
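The single-feature symmetry idea can be sketched as follows (our own minimal reconstruction; the normalized left/right difference is an assumed formulation, not necessarily the one used in the thesis):

```python
import numpy as np

def motion_symmetry(motion_map):
    """Single symmetry feature: how closely the left half of a motion-energy
    map mirrors the right half (1.0 = perfectly symmetric, 0.0 = one-sided)."""
    h, w = motion_map.shape
    left = motion_map[:, : w // 2]
    right = np.fliplr(motion_map[:, w - w // 2:])
    denom = left.sum() + right.sum()
    if denom == 0:
        return 1.0  # no motion at all: trivially symmetric
    return 1.0 - np.abs(left - right).sum() / denom

# Two-handed (mirrored) motion scores high...
sym = np.zeros((20, 20)); sym[8:12, 2:6] = 1; sym[8:12, 14:18] = 1
# ...one-handed motion scores low.
asym = np.zeros((20, 20)); asym[8:12, 2:6] = 1
print(motion_symmetry(sym), motion_symmetry(asym))
```

Evaluating this on a handful of keyframes rather than every frame is what buys the speed-up, at the precision/recall cost reported above.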
Data and methods for a visual understanding of sign languages
Signed languages are complete and natural languages used as the first or preferred mode of communication by millions of people worldwide. Unfortunately, however, they continue to be marginalized languages. Designing, building, and evaluating models that work on sign languages presents compelling research challenges and requires interdisciplinary and collaborative efforts. Recent advances in Machine Learning (ML) and Artificial Intelligence (AI) have the power to enable better accessibility for sign language users and to narrow the existing communication barrier between the Deaf community and non-sign-language users. However, recent AI-powered technologies still do not account for sign language in their pipelines. This is mainly because sign languages are visual languages that use manual and non-manual features to convey information and do not have a standard written form. The goal of this thesis is therefore to contribute to the development of new technologies that account for sign language, by creating large-scale multimodal resources suitable for training modern data-hungry machine learning models and by developing automatic systems for computer vision tasks aimed at a better visual understanding of sign languages.
Thus, in Part I we introduce the How2Sign dataset, a large-scale collection of multimodal and multiview sign language videos in American Sign Language. In Part II, we contribute to the development of technologies that account for sign languages: in Chapter 4 we present Spot-Align, a framework based on sign spotting methods for automatically annotating sign instances in continuous sign language; we demonstrate the benefits of this framework and establish a baseline for the sign language recognition task on the How2Sign dataset. In Chapter 5, we leverage the different annotations and modalities of How2Sign to explore sign language video retrieval by learning cross-modal embeddings. Finally, in Chapter 6, we explore sign language video generation by applying Generative Adversarial Networks to the sign language domain, and we assess if and how well sign language users can understand automatically generated sign language videos by proposing an evaluation protocol based on How2Sign topics and written English translations.
Developing a Sign Language Video Collection via Metadata and Video Classifiers
Video sharing sites have become a central tool for the storage and dissemination of sign language content. Sign language videos have many purposes, including sharing experiences or opinions, teaching and practicing a sign language, etc. However, due to limitations of term-based search, these videos can be hard to locate. This results in a diminished value of these sites for the deaf or hard-of-hearing community. As a result, members of the community frequently engage in a push-style delivery of content, sharing direct links to sign language videos with other members of the sign language community. To address this problem, we propose the Sign Language Digital Library (SLaDL).
SLaDL is composed of two main subsystems: a crawler that collects candidate videos for inclusion in the digital library corpus, and an automatic classification system that detects and identifies sign language in the crawled videos. These components filter out videos that do not include sign language and organize sign language videos by language. This dissertation explores individual and combined components of the classification system, which form a cascade of multimodal classifiers aimed at achieving high classification accuracy while minimizing computational effort.
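A cascade of this kind can be sketched in a few lines (a hypothetical illustration of the control flow only; the stage names, score functions, and thresholds are invented, not SLaDL's actual components):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Stage:
    """One stage of the cascade: a score function plus accept/reject cutoffs.
    Videos scoring between the cutoffs fall through to the next, costlier stage."""
    score: Callable[[dict], float]
    reject_below: float
    accept_above: float

def cascade_classify(video: dict, stages: list) -> Optional[bool]:
    for stage in stages:
        s = stage.score(video)
        if s < stage.reject_below:
            return False   # clearly not sign language: stop early, save compute
        if s > stage.accept_above:
            return True    # clearly sign language: stop early
    return None            # still undecided after all stages

# Hypothetical stages: a cheap metadata check first, video features only if needed.
stages = [
    Stage(lambda v: v["metadata_score"], reject_below=0.1, accept_above=0.9),
    Stage(lambda v: v["motion_score"], reject_below=0.4, accept_above=0.6),
]
print(cascade_classify({"metadata_score": 0.95, "motion_score": 0.0}, stages))  # True
```

The cost saving comes from most videos being decided by the cheap early stages, so the expensive video-feature extraction runs only on the ambiguous remainder.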
A web application coordinates the execution of these two subsystems and enables user interaction (browsing and searching) with the library corpus. Since the collection of the digital library is automatically curated by the cascading classifier, the number of irrelevant results is expected to be drastically lower when compared to general-purpose video sharing sites.
The evaluation involved a series of experiments focused on specific components of the system and on analyzing how best to configure SLaDL. In the first set of experiments, we investigated three different crawling approaches, comparing how well each finds a large quantity of sign language videos and expands the variety of videos in the collection. Second, we evaluated different approaches to multimodal classification in terms of precision, recall, F1 score, and computational cost. Lastly, we incorporated the best multimodal approach into cascading classifiers to reduce computation while preserving accuracy, experimenting with four cascading configurations and analyzing their performance for the detection and identification of signed content. Based on the findings of these experiments, we proposed a setup for an instantiation of SLaDL.
ATLAS: A flexible and extensible architecture for linguistic annotation
We describe a formal model for annotating linguistic artifacts, from which we
derive an application programming interface (API) to a suite of tools for
manipulating these annotations. The abstract logical model provides for a range
of storage formats and promotes the reuse of tools that interact through this
API. We focus first on ``Annotation Graphs,'' a graph model for annotations on
linear signals (such as text and speech) indexed by intervals, for which
efficient database storage and querying techniques are applicable. We note how
a wide range of existing annotated corpora can be mapped to this annotation
graph model. This model is then generalized to encompass a wider variety of
linguistic ``signals,'' including both naturally occurring phenomena (as
recorded in images, video, multi-modal interactions, etc.), as well as the
derived resources that are increasingly important to the engineering of natural
language processing systems (such as word lists, dictionaries, aligned
bilingual corpora, etc.). We conclude with a review of the current efforts
towards implementing key pieces of this architecture.
Comment: 8 pages, 9 figures
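The annotation-graph idea can be illustrated with a minimal sketch (our own simplification; the real ATLAS model and its API are considerably richer):

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationGraph:
    """Minimal annotation graph: nodes are (optionally) time-anchored points
    on a linear signal; arcs between nodes carry typed labels."""
    nodes: dict = field(default_factory=dict)  # node id -> time offset (or None)
    arcs: list = field(default_factory=list)   # (from_node, to_node, type, label)

    def add_node(self, node_id, time=None):
        self.nodes[node_id] = time

    def annotate(self, start, end, kind, label):
        self.arcs.append((start, end, kind, label))

    def query(self, kind):
        return [a for a in self.arcs if a[2] == kind]

# Annotate a short stretch of speech: word arcs and a phrase arc over the same span.
g = AnnotationGraph()
for i, t in enumerate([0.0, 0.4, 0.9, 1.3]):
    g.add_node(i, t)
g.annotate(0, 1, "word", "the")
g.annotate(1, 2, "word", "quick")
g.annotate(2, 3, "word", "fox")
g.annotate(0, 3, "phrase", "NP")
print(g.query("word"))
```

Because annotations are arcs over shared interval endpoints rather than markup embedded in the signal, overlapping layers (words, phrases, prosody) coexist naturally and map well onto database storage and querying.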
Linguistically Motivated Sign Language Segmentation
Sign language segmentation is a crucial task in sign language processing
systems. It enables downstream tasks such as sign recognition, transcription,
and machine translation. In this work, we consider two kinds of segmentation:
segmentation into individual signs and segmentation into phrases, larger units
comprising several signs. We propose a novel approach to jointly model these
two tasks.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for
continuous signing. Given that prosody plays a significant role in phrase
boundaries, we explore the use of optical flow features. We also provide an
extensive analysis of hand shapes and 3D hand normalization.
We find that introducing BIO tagging is necessary to model sign boundaries.
Explicitly encoding prosody by optical flow improves segmentation in shallow
models, but its contribution is negligible in deeper models. Careful tuning of
the decoding algorithm atop the models further improves the segmentation
quality.
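Why BIO tagging matters for continuous signing can be shown with a small decoding sketch (our own illustration, not the paper's code): under IO tagging, two back-to-back signs collapse into one span, while a B tag preserves the boundary.

```python
def spans(tags):
    """Decode per-frame tags into (start, end) sign spans.
    A 'B' both closes any open span and opens a new one."""
    out, start = [], None
    for i, t in enumerate(tags):
        if t == "B" or (t == "I" and start is None):
            if start is not None:
                out.append((start, i))
            start = i
        elif t == "O" and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(tags)))
    return out

# Continuous signing: two signs with no 'O' gap between them.
bio = ["B", "I", "I", "B", "I", "O"]
io  = ["I", "I", "I", "I", "I", "O"]  # the same frames under IO tagging
print(spans(bio))  # [(0, 3), (3, 5)] -- two signs recovered
print(spans(io))   # [(0, 5)]         -- the boundary at frame 3 is lost
```

This is the sense in which IO tagging cannot model sign boundaries in continuous signing, whereas BIO makes them explicit in the label sequence itself.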
We demonstrate that our final models generalize to out-of-domain video
content in a different signed language, even under a zero-shot setting. We
observe that including optical flow and 3D hand normalization enhances the
robustness of the model in this context.
Comment: Accepted at EMNLP 2023 (Findings)