12 research outputs found

    Hey, vitrivr! - A Multimodal UI for Video Retrieval

    In this paper, we present a multimodal web-based user interface for the vitrivr system. vitrivr is a modern, open-source video retrieval system for searching large video collections using a great variety of query modes, including query-by-sketch, query-by-example and query-by-motion. With the multimodal user interface, prospective users benefit from being able to interact naturally with the vitrivr system by using spoken commands and by issuing multimodal commands that combine spoken instructions with manual pointing. While the main strength of the UI is the seamless combination of speech-based and sketch-based interaction for multimedia similarity search, the speech modality has proven to be very effective for retrieval on its own. In particular, it helps overcome accessibility barriers and offers retrieval functionality to users with disabilities. Finally, for a holistic natural experience with the vitrivr system, we have integrated a speech synthesis engine that returns spoken answers to the user.
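
    As a purely illustrative sketch of such multimodal fusion (the class names, time threshold and query format below are assumptions, not part of the vitrivr API), a spoken utterance and a roughly co-occurring pointing gesture can be resolved into a single query-by-example request:

from dataclasses import dataclass
from typing import Optional


@dataclass
class SpokenCommand:
    text: str          # transcript produced by the speech recognizer
    timestamp: float   # seconds since session start


@dataclass
class PointingEvent:
    segment_id: str    # identifier of the result tile the user pointed at
    timestamp: float


def fuse_multimodal_command(speech: SpokenCommand, pointing: PointingEvent,
                            max_gap_s: float = 2.0) -> Optional[dict]:
    """Resolve a deictic utterance ("... like this") against a pointing gesture."""
    if abs(speech.timestamp - pointing.timestamp) > max_gap_s:
        return None  # the two modalities are too far apart in time to belong together
    if "like this" in speech.text.lower():
        return {"type": "query-by-example", "example": pointing.segment_id}
    return None


query = fuse_multimodal_command(
    SpokenCommand("show me more videos like this", timestamp=12.3),
    PointingEvent("v_00042_seg_7", timestamp=12.9),
)
print(query)  # {'type': 'query-by-example', 'example': 'v_00042_seg_7'}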

    Novice-Friendly Text-based Video Search with vitrivr

    Video retrieval still offers many challenges which can so far only be effectively mediated through interactive, human-in-the-loop retrieval approaches. The vitrivr multimedia retrieval stack offers a broad range of query mechanisms to enable users to perform such interactive retrieval. While these mechanisms offer many options to experienced users, they can be difficult for novices to use. In this paper, we present a minimal user interface geared towards novice users that exposes only a subset of vitrivr’s functionality while simplifying user interaction.

    Temporal multimodal video and lifelog retrieval

    The past decades have seen exponential growth of both consumption and production of data, with multimedia such as images and videos contributing significantly to said growth. The widespread proliferation of smartphones has provided everyday users with the ability to consume and produce such content easily. As the complexity and diversity of multimedia data has grown, so has the need for more complex retrieval models which address the information needs of users. Finding relevant multimedia content is central in many scenarios, from internet search engines and medical retrieval to querying one's personal multimedia archive, also called a lifelog. Traditional retrieval models have often focused on queries targeting small units of retrieval, yet users usually remember temporal context and expect results to reflect it. However, there is little research into supporting such information needs in interactive multimedia retrieval. In this thesis, we aim to close this research gap by making several contributions to multimedia retrieval with a focus on two scenarios, namely video and lifelog retrieval. We provide a retrieval model for complex information needs with temporal components, including a data model for multimedia retrieval, a query model for complex information needs, and a modular and adaptable query execution model which includes novel algorithms for result fusion. The concepts and models are implemented in vitrivr, an open-source multimodal multimedia retrieval system, which covers all aspects from extraction to query formulation and browsing. vitrivr has proven its usefulness in evaluation campaigns and is now used in two large-scale interdisciplinary research projects. We show the feasibility and effectiveness of our contributions in two ways: first, through results from user-centric evaluations which pit different user-system combinations against one another; second, through a system-centric evaluation based on a new dataset for temporal information needs in video and lifelog retrieval, with which we quantitatively evaluate our models. The results show significant benefits for systems that enable users to specify more complex information needs with temporal components. Participation in interactive retrieval evaluation campaigns over multiple years provides insight into possible future developments and challenges of such campaigns.
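
    To illustrate the general idea of fusing results for a temporal query (a simplified sketch under assumed score dictionaries, not the fusion algorithm developed in the thesis), consider chaining per-sub-query segment scores that occur in the right order within a maximum time gap:

from collections import defaultdict


def temporal_fusion(subquery_results, max_gap=30.0):
    """Fuse ordered sub-query results into per-video scores.

    subquery_results: one dict per ordered sub-query, mapping
    (video_id, start_time) -> score. A video scores well only if it contains
    matches for all sub-queries in the right order within max_gap seconds.
    """
    fused = defaultdict(float)
    first, *rest = subquery_results
    for (video, t_prev), score in first.items():
        total, satisfied = score, True
        for results in rest:
            # best-scoring later segment of the same video within the allowed gap
            candidates = [(s, t) for (v, t), s in results.items()
                          if v == video and t_prev < t <= t_prev + max_gap]
            if not candidates:
                satisfied = False
                break
            s, t_prev = max(candidates)
            total += s
        if satisfied:
            fused[video] = max(fused[video], total / len(subquery_results))
    return dict(fused)


# Example: "a red car" followed by "a person running" in the same video.
print(temporal_fusion([
    {("v1", 10.0): 0.9, ("v2", 5.0): 0.7},
    {("v1", 25.0): 0.8, ("v2", 200.0): 0.9},
]))  # v1 fuses to ~0.85; v2 is dropped because its second match is too far away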

    ShARc: Shape and Appearance Recognition for Person Identification In-the-wild

    Identifying individuals in unconstrained video settings is a valuable yet challenging task in biometric analysis due to variations in appearances, environments, degradations, and occlusions. In this paper, we present ShARc, a multimodal approach for video-based person identification in uncontrolled environments that emphasizes 3-D body shape, pose, and appearance. We introduce two encoders: a Pose and Shape Encoder (PSE) and an Aggregated Appearance Encoder (AAE). PSE encodes the body shape via binarized silhouettes, skeleton motions, and 3-D body shape, while AAE provides two levels of temporal appearance feature aggregation: attention-based feature aggregation and averaging aggregation. For attention-based feature aggregation, we employ spatial and temporal attention to focus on key areas for person distinction. For averaging aggregation, we introduce a novel flattening layer after averaging to extract more distinguishable information and reduce overfitting of attention. We utilize centroid feature averaging for gallery registration. We demonstrate significant improvements over existing state-of-the-art methods on public datasets, including CCVID, MEVID, and BRIAR. Comment: WACV 202
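
    As a minimal sketch of two of the components named above (shapes, names and the similarity measure are assumptions rather than the authors' implementation), averaging aggregation and centroid feature averaging for gallery registration can be illustrated as follows:

import numpy as np


def average_aggregate(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame features of shape (T, D) into one L2-normalised descriptor."""
    clip = frame_features.mean(axis=0)
    return clip / (np.linalg.norm(clip) + 1e-8)


def register_gallery(clips_by_identity: dict) -> dict:
    """Centroid feature averaging: one centroid descriptor per gallery identity."""
    return {pid: average_aggregate(np.stack([average_aggregate(c) for c in clips]))
            for pid, clips in clips_by_identity.items()}


def identify(probe_clip: np.ndarray, gallery: dict) -> str:
    """Match a probe clip against gallery centroids by cosine similarity."""
    probe = average_aggregate(probe_clip)
    return max(gallery, key=lambda pid: float(probe @ gallery[pid]))


rng = np.random.default_rng(0)
gallery = register_gallery({"id_a": [rng.normal(size=(8, 16))],
                            "id_b": [rng.normal(size=(8, 16))]})
print(identify(rng.normal(size=(8, 16)), gallery))  # prints the closest identity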

    The development of a video retrieval system using a clinician-led approach

    Patient video taken at home can provide valuable insights into recovery progress during a programme of physical therapy, but is very time-consuming for clinicians to review. Our work focussed on (i) enabling any patient to share information about progress at home, simply by sharing video, and (ii) building intelligent systems to support Physical Therapists (PTs) in reviewing this video data and extracting the necessary detail. This paper reports the development of the system, appropriate for future clinical use without reliance on a technical team, and the clinician involvement in that development. We contribute an interactive content-based video retrieval system that significantly reduces the time taken for clinicians to review videos, using human head movement as an example. The system supports query-by-movement (clinicians move their own body to define search queries) and retrieves the essential fine-grained movements needed for clinical interpretation. This is done by comparing sequences of image-based pose estimates (here head rotations) through a distance metric (here the Fréchet distance) and presenting a ranked list of similar movements to clinicians for review. In contrast to existing intelligent systems for retrospective review of human movement, the system supports a flexible analysis where clinicians can look for any movement that interests them. Evaluation by a group of PTs with expertise in training movement control showed that 96% of all relevant movements were identified, with time savings of as much as 99.1% compared to reviewing target videos in full. The novelty of this contribution includes retrospective progress monitoring that preserves context through video, and content-based video retrieval that supports both fine-grained human actions and query-by-movement. Future research, including large clinician-led studies, will refine the technical aspects and explore the benefits in terms of patient outcomes, PT time, and financial savings over the course of a programme of therapy. It is anticipated that this clinician-led approach will mitigate the reported slow clinical uptake of technology, with resulting patient benefit.
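
    The core retrieval step described above can be sketched as follows (a simplified illustration that assumes head rotation is represented as a single angle per frame; it is not the paper's implementation): compute the discrete Fréchet distance between the clinician's demonstrated movement and each candidate segment, then rank segments by that distance:

from functools import lru_cache


def discrete_frechet(p, q):
    """Discrete Fréchet distance between two sequences of head-rotation angles."""
    @lru_cache(maxsize=None)
    def c(i, j):
        d = abs(p[i] - q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        return max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
    return c(len(p) - 1, len(q) - 1)


def rank_segments(query, segments):
    """Rank candidate video segments by similarity to the demonstrated movement."""
    return sorted(((sid, discrete_frechet(query, seq)) for sid, seq in segments.items()),
                  key=lambda item: item[1])


query = [0, 10, 30, 45, 30, 10, 0]  # clinician's demonstrated head turn, in degrees per frame
segments = {"patient_clip_1": [0, 12, 28, 44, 29, 8, 1],
            "patient_clip_2": [0, 2, 1, 0, 1, 2, 0]}
print(rank_segments(query, segments))  # patient_clip_1 ranks first (smaller distance)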

    Facial expression recognition in Brazilian Sign Language using the Facial Action Coding System

    Advisors: Paula Dornhofer Paro Costa, Kate Mamhy Oliveira Kumada. Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Deaf people around the world use sign languages to communicate but, despite the wide dissemination of such languages, deaf or hard-of-hearing individuals still face difficulties in communicating with hearing individuals in the absence of an interpreter. Such difficulties negatively impact the access of deaf individuals to education, to the job market, and to public services in general. Assistive technologies, such as Automatic Sign Language Recognition (ASLR), aim at overcoming these communication obstacles. However, the development of reliable ASLR systems imposes numerous challenges due to the linguistic complexity of sign languages. Sign languages (SLs) are visuospatial linguistic systems that, like any other human language, present global and regional linguistic variations as well as a grammatical system. Moreover, sign languages do not rely only on manual gestures but also on non-manual markers, such as facial expressions. In SLs, facial expressions may differentiate lexical items, participate in syntactic construction, and contribute to intensification, among other grammatical and affective functions. Together with gesture recognition models, facial expression recognition (FER) is therefore an essential component of ASLR technology. In this work, we propose an automatic FER system for Brazilian Sign Language (Libras). Based on a literature survey, we present a language study and a new taxonomy for Libras facial expressions associated with the Facial Action Coding System (FACS). In addition, a dataset of facial expressions in Libras was created. Experiments guided the design of our framework, covering a preprocessing stage and a recognition model. The features used to classify facial actions result from combining a facial region of interest with geometric information about the face, a choice grounded in theory and in better performance than the other alternatives tested. Among the classifiers evaluated, SqueezeNet achieved the best accuracy rates, and the proposed model reached an average recognition accuracy of 77% for Libras facial expressions. This work contributes to the growing body of studies involving computer vision and the recognition of the structure of sign-language facial expressions, with a primary focus on automated facial action annotation. Doctorate in Electrical Engineering; funded by CAPES (funding code 001).
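
    The feature strategy described above (a facial region of interest combined with geometric facial measurements) could be sketched roughly as follows; the landmark indices, the chosen measurements and the normalisation are illustrative assumptions, not the thesis pipeline:

import numpy as np


def crop_roi(frame: np.ndarray, box) -> np.ndarray:
    """Crop a facial region of interest given as (x, y, width, height)."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]


def geometric_features(landmarks: np.ndarray) -> np.ndarray:
    """Distances between facial landmarks (N, 2), normalised by the inter-ocular distance."""
    left_eye, right_eye = landmarks[0], landmarks[1]
    brow_l, brow_r = landmarks[2], landmarks[3]
    mouth_top, mouth_bottom = landmarks[4], landmarks[5]
    iod = np.linalg.norm(left_eye - right_eye) + 1e-8
    return np.array([
        np.linalg.norm(brow_l - left_eye) / iod,         # left eyebrow raise
        np.linalg.norm(brow_r - right_eye) / iod,        # right eyebrow raise
        np.linalg.norm(mouth_top - mouth_bottom) / iod,  # mouth opening
    ])


frame = np.zeros((256, 256, 3), dtype=np.uint8)           # stand-in for a video frame
landmarks = np.array([[100, 120], [156, 120], [98, 100],
                      [158, 100], [128, 180], [128, 200]], dtype=float)
roi = crop_roi(frame, (80, 80, 100, 60))                  # ROI fed to the CNN classifier
print(roi.shape, geometric_features(landmarks))           # geometric features complement the ROI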

    Inferring Complex Activities for Context-aware Systems within Smart Environments

    The rising ageing population worldwide and the prevalence of age-related conditions such as physical fragility, mental impairments and chronic diseases have significantly impacted quality of life and caused a shortage of health and care services. Over-stretched healthcare providers are driving a paradigm shift in public healthcare provisioning. Thus, Ambient Assisted Living (AAL) using Smart Home (SH) technologies has been rigorously investigated to help address the aforementioned problems. Human Activity Recognition (HAR) is a critical component in AAL systems which enables applications such as just-in-time assistance, behaviour analysis, anomaly detection and emergency notifications. This thesis investigates the challenges faced in accurately recognising Activities of Daily Living (ADLs) performed by single or multiple inhabitants within smart environments. Specifically, it explores five complementary research challenges in HAR. The first study contributes to knowledge by developing a semantic-enabled data segmentation approach with user preferences. The second study takes the segmented set of sensor data to investigate and recognise human ADLs at a multi-granular action level: coarse- and fine-grained. At the coarse-grained action level, semantic relationships between the sensor, object and ADLs are deduced, whereas at the fine-grained action level, object usage above a satisfactory threshold, with evidence fused from multimodal sensor data, is leveraged to verify the intended actions. Moreover, due to the imprecise or vague interpretation of multimodal sensors and the challenges of data fusion, fuzzy set theory and the fuzzy Web Ontology Language (fuzzy-OWL) are leveraged. The third study focuses on incorporating the uncertainty introduced into HAR by factors such as technological failure, object malfunction and human error. Uncertainty theories and approaches from existing studies are analysed and, based on the findings, a probabilistic-ontology (PR-OWL) based HAR approach is proposed. The fourth study extends the first three to distinguish activities conducted by more than one inhabitant in a shared smart environment using discriminative sensor-based techniques and time-series pattern analysis. The final study investigates a suitable system architecture for a real-time smart environment tailored to AAL systems and proposes a microservices architecture with off-the-shelf and bespoke sensor-based sensing methods. The initial semantic-enabled data segmentation study achieved 100% and 97.8% accuracy in segmenting sensor events under single- and mixed-activity scenarios, respectively; however, the average classification time to segment each sensor event was 3971 ms and 62183 ms for the two scenarios. The second study, detecting fine-grained user actions, was evaluated with 30 and 153 fuzzy rules for two fine-grained movements on a dataset pre-collected from the real-time smart environment. Its results indicate good average accuracies of 83.33% and 100%, but with high average durations of 24648 ms and 105318 ms, posing further challenges for the scalability of fusion rule creation. The third study was evaluated by combining the PR-OWL ontology with ADL ontologies and the Semantic Sensor Network (SSN) ontology to define four types of uncertainty present in a kitchen-based activity. The fourth study illustrated a case study extending single-user AR to multi-user AR by combining discriminative sensors (RFID tags and fingerprint sensors) to identify and associate user actions with the aid of time-series analysis. The last study responds to the computational and performance requirements of the four studies by analysing and proposing a microservices-based system architecture for the AAL system. Future research towards adopting fog/edge computing paradigms alongside cloud computing is discussed for higher availability, reduced network traffic and energy, lower cost, and a decentralised system. As a result of the five studies, this thesis develops a knowledge-driven framework to estimate and recognise multi-user activities at the level of fine-grained user actions. This framework integrates three complementary ontologies to conceptualise factual, fuzzy and uncertain knowledge about the environment and ADLs, together with time-series analysis and a discriminative sensing environment. Moreover, a distributed software architecture, multimodal sensor-based hardware prototypes, and other supportive utility tools such as a simulator and a synthetic ADL data generator were developed to support the evaluation of the proposed approaches. The distributed system is platform-independent and is currently supported by an Android mobile application and web-browser-based client interfaces for retrieving information such as live sensor events and HAR results.
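
    As a rough illustration of the fine-grained verification idea above (the sensors, membership functions and thresholds are invented for the example and are not the thesis's rule base), fuzzy memberships over multimodal sensor evidence can be fused with a min (AND) operator:

def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership function over [a, d] with plateau [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def pouring_water_confidence(kettle_weight_drop_g, cup_proximity_cm, grip_pressure_kpa):
    """Fuse multimodal object-usage evidence with a min (AND) operator."""
    weight = trapezoid(kettle_weight_drop_g, 20, 80, 400, 600)  # a meaningful amount was poured
    near = trapezoid(cup_proximity_cm, -1, 0, 10, 25)           # kettle is close to the cup
    grip = trapezoid(grip_pressure_kpa, 2, 5, 60, 90)           # kettle is actually being held
    return min(weight, near, grip)


confidence = pouring_water_confidence(kettle_weight_drop_g=150,
                                      cup_proximity_cm=6,
                                      grip_pressure_kpa=20)
print(confidence)  # 1.0 here; values below a satisfactory threshold would reject the action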

    Visual processing in a primate temporal association cortex: insensitivity to self-induced motion

    An animal's own behaviour can give rise to sensory stimulation that is very similar to stimulation of completely external origin. Much of this self-induced stimulation has little informative value to the animal and may even interfere with the processing of externally-induced stimulation. A high-level association area in the temporal cortex of the macaque (the superior temporal polysensory area, STP), which has been shown to participate in the analysis of visual motion, was targeted in a series of experiments in order to investigate whether this brain area discriminates between externally- and self-induced stimulation in its visual motion processing. Earlier results on somatosensory processing within this same brain area provided grounds for this presumption. The cells studied here were sensitive to the presence of motion but showed no selectivity for the form of the stimulus. 25% of all visually responsive cells in area STP were classified as belonging to this class of cells. This group of cells was further categorized into unidirectional (39%), bidirectional (4%) and pandirectional (57%) cells. Tuning to direction varied in sharpness: for most cells, the angular change in direction required to reduce the response to half maximal was between 45 and 70 degrees. The optimal directions of cells appeared clustered around the Cartesian axes (up/down, left/right and towards/away). Response latencies varied between 35.0 and 126.4 ms (mean 90.9 ms). On average, cell responses showed a transient burst of activity followed by a tonic discharge maintained for the duration of stimulation. 83% of the motion-sensitive cells lacking form selectivity responded to any stimuli moved by the experimenter, but gave no response to the sight of the animal's own limb movements. The cells remained, however, responsive to external stimulation while the monkey's own hand was moving in view. Responses to self-induced movements were recovered if the monkey introduced a novel object held in its hand into view. That the response discrimination between externally- and self-induced stimulation was not caused by differences in the visual appearance of the stimuli was confirmed in a second experiment, in which the monkey was trained to rotate a handle connected to a patterned cylinder in order to generate visual motion stimulation over a fixation point. 61% of the tested cells discriminated between pattern motion generated by the monkey and by the experimenter. It was shown that the monkey's motor activity as such (turning a handle without visible cylinder rotation) did not affect the cells' spontaneous activity. There was some indication that the discriminative mechanism uses not only (motor) corollary discharges but also proprioceptive input. These results also provided evidence that the discriminative processing in STP is plastic and shaped by the animal's lifetime experience. Finally, the cells were studied for their responsiveness to image motion resulting from movements of external objects and from movements of the animal's body (self-motion). 84% of the cells responded only to visual object-motion and failed to respond to visual motion resulting from the animal's self-motion. The experiments also revealed that area STP processes visual motion mostly in observer-relative terms, i.e. in reference to the perceiver itself. The results provide one explanation for the functional significance of the convergence of several modalities of sensory (and motor) input in the STP. It is suggested that area STP works as a "neural filter" to separate expected sensory consequences resulting from one's own actions from those that originate from the actions of other animals or environmental events.
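
    The proposed "neural filter" interpretation can be sketched conceptually (the numbers, gain and rectification below are assumptions for illustration, not a model fitted to the recordings): a prediction of self-induced motion derived from a corollary discharge is subtracted before the response is generated:

import numpy as np


def stp_like_response(visual_motion: np.ndarray,
                      corollary_discharge: np.ndarray,
                      gain: float = 1.0) -> np.ndarray:
    """Respond only to the motion signal not explained by the animal's own action."""
    predicted_self_motion = gain * corollary_discharge  # expected sensory consequence
    residual = visual_motion - predicted_self_motion
    return np.maximum(residual, 0.0)                    # rectified firing-rate proxy


t = np.linspace(0.0, 1.0, 6)
self_induced = np.sin(np.pi * t)              # motion caused by the animal's own limb movement
external = np.array([0, 0, 0.8, 0.9, 0, 0])   # an object moved by the experimenter

print(stp_like_response(self_induced, self_induced))             # ~0: self-induced motion filtered out
print(stp_like_response(self_induced + external, self_induced))  # the external component remains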