Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms
of a broader task, where multiple keysteps are performed in sequence across a
long video to reach a final goal state -- such as the steps of a recipe or a
DIY fix-it task. Prior work largely treats keystep recognition in isolation of
this broader structure, or else rigidly confines keysteps to align with a
predefined sequential script. We propose discovering a task graph automatically
from how-to videos to represent probabilistically how people tend to execute
keysteps, and then leverage this graph to regularize keystep recognition in
novel videos. On multiple datasets of real-world instructional videos, we show
the impact: more reliable zero-shot keystep localization and improved video
representation learning, exceeding the state of the art.
Comment: Technical Report
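The idea can be sketched compactly: estimate keystep transition probabilities from mined keystep sequences, then use that graph as a prior when labeling clips in a new video. The greedy re-scoring and add-one smoothing below are illustrative simplifications, not the paper's exact graph construction or inference procedure.

```python
import numpy as np

def build_task_graph(keystep_sequences, num_keysteps, smoothing=1.0):
    """Estimate P(next keystep | current keystep) from mined keystep sequences."""
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in keystep_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def regularize_predictions(clip_probs, task_graph, weight=0.5):
    """Re-score per-clip keystep probabilities with the task-graph prior (greedy pass)."""
    scores = np.asarray(clip_probs, dtype=float)
    labels = [int(scores[0].argmax())]
    for t in range(1, len(scores)):
        prior = task_graph[labels[-1]]                 # transition prior from previous keystep
        combined = (1 - weight) * scores[t] + weight * prior
        labels.append(int(combined.argmax()))
    return labels

# toy usage: 3 keysteps, two mined sequences, 4 clips to label
graph = build_task_graph([[0, 1, 2], [0, 2]], num_keysteps=3)
print(regularize_predictions(np.random.rand(4, 3), graph))
```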
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Searching long egocentric videos with natural language queries (NLQ) has
compelling applications in augmented reality and robotics, where a fluid index
into everything that a person (agent) has seen before could augment human
memory and surface relevant information on demand. However, the structured
nature of the learning problem (free-form text query inputs, localized video
temporal window outputs) and its needle-in-a-haystack nature makes it both
technically challenging and expensive to supervise. We introduce
Narrations-as-Queries (NaQ), a data augmentation strategy that transforms
standard video-text narrations into training data for a video query
localization model. Validating our idea on the Ego4D benchmark, we find it has
tremendous impact in practice. NaQ improves multiple top models by substantial
margins (even doubling their accuracy), and yields the very best results to
date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in
the CVPR and ECCV 2022 competitions and topping the current public leaderboard.
Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique
properties of our approach such as the ability to perform zero-shot and
few-shot NLQ, and improved performance on queries about long-tail object
categories. Code and models:
http://vision.cs.utexas.edu/projects/naq
Comment: 13 pages, 7 figures, appearing in CVPR 2023
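The core augmentation is simple to sketch: each timestamped narration becomes a query paired with a short response window around its timestamp. The fixed half-width window and field names below are assumptions for illustration, not the exact NaQ conversion.

```python
def narrations_to_queries(narrations, window=3.0, video_duration=None):
    """Turn timestamped narrations into NLQ-style (query, temporal window) samples.

    narrations: list of (timestamp_sec, narration_text) pairs.
    window: assumed half-width in seconds of the response window around each timestamp.
    """
    samples = []
    for t, text in narrations:
        start = max(0.0, t - window)
        end = t + window if video_duration is None else min(video_duration, t + window)
        samples.append({"query": text, "start_sec": start, "end_sec": end})
    return samples

# usage: convert narrations into extra supervision for an NLQ localization model
narrations = [(12.4, "C opens the refrigerator"), (47.9, "C picks up the knife")]
print(narrations_to_queries(narrations, window=3.0, video_duration=120.0))
```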
SpotEM: Efficient Video Search for Episodic Memory
The goal in episodic memory (EM) is to search a long egocentric video to
answer a natural language query (e.g., "where did I leave my purse?"). Existing
EM methods exhaustively extract expensive fixed-length clip features to look
everywhere in the video for the answer, which is infeasible for long
wearable-camera videos that span hours or even days. We propose SpotEM, an
approach to achieve efficiency for a given EM method while maintaining good
accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that
learns to identify promising video regions to search conditioned on the
language query; 2) a set of low-cost semantic indexing features that capture
the context of rooms, objects, and interactions that suggest where to look; and
3) distillation losses that address the optimization issues arising from
end-to-end joint training of the clip selector and EM model. Our experiments on
200+ hours of video from the Ego4D EM Natural Language Queries benchmark and
three different EM models demonstrate the effectiveness of our approach:
computing only 10% - 25% of the clip features, we preserve 84% - 97% of the
original EM model's accuracy. Project page:
https://vision.cs.utexas.edu/projects/spotem
Comment: Published in ICML 2023
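A simplified sketch of the selective-search idea follows; the two-layer selector, feature dimensions, and compute budget are placeholders rather than SpotEM's actual components. The point is that a cheap, query-conditioned scorer picks a small subset of clips, and only those clips receive expensive EM processing.

```python
import torch
import torch.nn as nn

class ClipSelector(nn.Module):
    """Cheap scorer: predicts how promising each clip is for the given query."""
    def __init__(self, clip_dim, query_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cheap_clip_feats, query_feat):
        # cheap_clip_feats: (num_clips, clip_dim); query_feat: (query_dim,)
        q = query_feat.unsqueeze(0).expand(cheap_clip_feats.size(0), -1)
        return self.mlp(torch.cat([cheap_clip_feats, q], dim=-1)).squeeze(-1)

def select_clips(selector, cheap_feats, query_feat, budget=0.2):
    """Pick the top `budget` fraction of clips to pass to the expensive EM model."""
    with torch.no_grad():
        scores = selector(cheap_feats, query_feat)
    k = max(1, int(budget * cheap_feats.size(0)))
    return scores.topk(k).indices.sort().values  # clip indices to process, in temporal order

# usage with random features: 100 clips, keep ~20 for expensive EM processing
sel = ClipSelector(clip_dim=64, query_dim=32)
idx = select_clips(sel, torch.randn(100, 64), torch.randn(32), budget=0.2)
```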
EgoEnv: Human-centric environment representations from egocentric video
First-person video highlights a camera-wearer's activities in the context of
their persistent environment. However, current video understanding approaches
reason over visual features from short video clips that are detached from the
underlying physical space and capture only what is immediately visible. To
facilitate human-centric environment understanding, we present an approach that
links egocentric video and the environment by learning representations that are
predictive of the camera-wearer's (potentially unseen) local surroundings. We
train such models using videos from agents in simulated 3D environments where
the environment is fully observable, and test them on human-captured real-world
videos from unseen environments. On two human-centric video tasks, we show that
models equipped with our environment-aware features consistently outperform
their counterparts with traditional clip features. Moreover, despite being
trained exclusively on simulated videos, our approach successfully handles
real-world videos from HouseTours and Ego4D, and achieves state-of-the-art
results on the Ego4D NLQ challenge. Project page:
https://vision.cs.utexas.edu/projects/ego-env/
Comment: Published in NeurIPS 2023 (Oral)
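One way to picture the training setup is sketched below, under the assumption that simulated walkthroughs supply both a clip feature and ground-truth labels for zones of the local surroundings; the zone-classification target and linear head are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EnvPredictor(nn.Module):
    """Predicts labels for the camera-wearer's local surroundings from a clip feature."""
    def __init__(self, clip_dim=512, num_zones=8, num_classes=20):
        super().__init__()
        self.head = nn.Linear(clip_dim, num_zones * num_classes)
        self.num_zones, self.num_classes = num_zones, num_classes

    def forward(self, clip_feat):
        # one class prediction per surrounding zone (e.g., directions around the camera)
        return self.head(clip_feat).view(-1, self.num_zones, self.num_classes)

# simulated supervision: the simulator exposes the true class of each surrounding zone
model = EnvPredictor()
clip_feat = torch.randn(4, 512)                      # batch of clip features
zone_labels = torch.randint(0, 20, (4, 8))           # ground truth from the simulator
loss = nn.CrossEntropyLoss()(model(clip_feat).flatten(0, 1), zone_labels.flatten())
```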
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference setting apart HM3DSEM from other
datasets is the use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of HM3DSEM dataset for the Object
Goal Navigation task using different methods. Policies trained using HM3DSEM
outperform those trained on prior datasets. The introduction of HM3DSEM in
the Habitat ObjectNav Challenge led to an increase in participation from 400
submissions in 2021 to 1022 submissions in 2022.
Comment: 14 Pages, 10 Figures, 5 Tables
A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems
Despite the advancement of machine learning techniques in recent years,
state-of-the-art systems lack robustness to "real world" events, where the
input distributions and tasks encountered by the deployed systems will not be
limited to the original training context, and systems will instead need to
adapt to novel distributions and tasks while deployed. This critical gap may be
addressed through the development of "Lifelong Learning" systems that are
capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3)
Scalability. Unfortunately, efforts to improve these capabilities are typically
treated as distinct areas of research that are assessed independently, without
regard to the impact of each separate capability on other aspects of the
system. We instead propose a holistic approach, using a suite of metrics and an
evaluation framework to assess Lifelong Learning in a principled way that is
agnostic to specific domains or system techniques. Through five case studies,
we show that this suite of metrics can inform the development of varied and
complex Lifelong Learning systems. We highlight how the proposed suite of
metrics quantifies performance trade-offs present during Lifelong Learning
system development - both the widely discussed Stability-Plasticity dilemma and
the newly proposed relationship between Sample Efficient and Robust Learning.
Further, we make recommendations for the formulation and use of metrics to
guide the continuing development of Lifelong Learning systems and assess their
progress in the future.
Comment: To appear in Neural Networks
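To make the trade-off concrete, the sketch below computes two standard continual-learning style quantities from a task-accuracy matrix: a forgetting-style measure (stability) and a forward-transfer-style measure (plasticity). These are generic illustrations, not the specific metric suite proposed in the paper.

```python
import numpy as np

def performance_maintenance(acc):
    """acc[i, j]: accuracy on task j after training through task i (square matrix).

    Returns the average drop from the accuracy right after learning each task to the
    accuracy at the end of the sequence (a stability / forgetting style measure).
    """
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    drops = [acc[j, j] - acc[n - 1, j] for j in range(n - 1)]
    return float(np.mean(drops))

def forward_transfer(acc, baseline):
    """Average gain on each new task over a from-scratch baseline (plasticity style)."""
    acc = np.asarray(acc, dtype=float)
    return float(np.mean([acc[j, j] - baseline[j] for j in range(acc.shape[0])]))

# toy 3-task example: rows are training stages, columns are tasks
acc = [[0.90, 0.20, 0.10],
       [0.80, 0.85, 0.30],
       [0.75, 0.80, 0.90]]
print(performance_maintenance(acc), forward_transfer(acc, baseline=[0.7, 0.7, 0.7]))
```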
Predictive scene representations for embodied visual search
The goal in embodied perception is to understand egocentric images and videos captured by intelligent agents (humans and robots). Intelligent robots have to perceive the world using sensory inputs, build rich representations of their surrounding environment and take actions to perform their tasks. Augmented reality (AR) assistants must perceive activities performed by humans to provide assistance. My dissertation tackles embodied visual search, where the goal is to enable intelligent search for robots and AR assistants. My research aims to build predictive scene representations that can enable robot/AR agents to efficiently and accurately search for human-specified goals in complex scenes and videos.

Embodied visual search manifests as the visual navigation problem in robotics, where a mobile agent must efficiently navigate in the environment using visual sensors to search for one or more goals (e.g., where is the refrigerator?). Research on visual navigation aims to fuel a future generation of intelligent robots that can deploy in various environments to aid and enhance our daily lives. A key component of visual navigation is to build useful representations of the agent's surrounding environment. Unfortunately, existing navigation methods are limited to only encoding parts of the environment that the agent directly observes. For example, when a robot sees a dining table, it is unaware of the chair and floor space hidden behind the table. A robot that has navigated only to the kitchen and living room in a house is unaware of where to find a bed or a bathtub. Failing to encode the unseen parts of an environment hinders the agent's ability to make good decisions.

My dissertation builds predictive representations of real-world environments for visual navigation. Predictive representations enable an agent to perceive unseen parts of the environment conditioned on its limited history of sensory observations. By leveraging experience from previously seen environments, an agent can use semantic and geometric regularities shared across real-world environments to build predictive representations. First, I propose to learn agents that perform pixel-wise reconstructions of novel scenes and object models by anticipating unseen viewpoints. Next, I develop agents that efficiently build geometric maps of 3D environments by anticipating occupancy for unseen map regions, and efficiently search for objects in 3D environments by anticipating the presence of unseen objects. Furthermore, I propose a self-supervised strategy for learning general-purpose environment representations by anticipating unseen visual features and demonstrate their transferability to multiple downstream navigation tasks.

Embodied visual search manifests as the episodic memory problem in egocentric videos, where an AI assistant must efficiently scan a long visual history in search of a specific goal. Such an episodic memory system could index human experiences in AR spanning several weeks and respond to the human user's queries (did I leave the refrigerator open?) or organize a robot's experience during long-term operation and recollect critical details to make navigational decisions (what room should I go to find sheets of paper?) and respond to humans (was the lab locked when you last went there?). Research on episodic memory (EM) aims to build AI assistants that can reason about long visual histories and respond to natural language queries.
It is challenging to enable such a personal episodic memory due to the long duration of egocentric videos that can span several minutes to weeks, the open-ended nature of text queries, and the short nature of response windows that only span a few seconds. Standard EM methods suffer from two key shortcomings: the limited availability of annotated data results in poor generalization to new videos and queries, and the exorbitant compute requirements during inference limit their applicability to practical use cases. I propose to address the former limitation by developing a novel data augmentation algorithm that uses timestamped text descriptions to significantly expand the EM supervision. I propose to address the latter limitation by anticipating the relevance of video clips to the query. Specifically, I propose a novel clip-selection policy that previews the video cheaply to obtain the context of rooms, objects and interactions, and leverages semantic priors to identify query-relevant clips. It then searches efficiently by only expending computation on a relevant subset of clips.

Overall, my dissertation represents an important step toward developing intelligent search agents for embodied AI. The proposed methods have repeatedly established state-of-the-art results across major benchmarks in the field. Importantly, I develop robotic navigation policies that can be trained in simulation and successfully deployed on real robots, and video search methods that can effectively understand real-world human-captured videos of day-to-day activities to respond to human queries. Finally, I outline my future directions to learn foundational models of 3D scenes, build episodic memory systems for long-horizon videos, and explore robot learning from in-the-wild videos.
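As one concrete example of the predictive representations described above, the sketch below completes a partial top-down occupancy map around the agent; the tiny encoder-decoder and map format are illustrative assumptions, not the dissertation's actual occupancy-anticipation model.

```python
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    """Completes a partial top-down occupancy map around the agent (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                     # tiny encoder-decoder over 2-channel maps
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1),                      # logits: free vs. occupied per cell
        )

    def forward(self, partial_map):
        # partial_map: (B, 2, H, W) with channels [observed occupancy, observed mask]
        return self.net(partial_map)

# training signal: full local maps are available in simulation
model = OccupancyAnticipator()
partial = torch.rand(8, 2, 65, 65)
target = torch.randint(0, 2, (8, 65, 65))             # ground-truth occupancy per cell
loss = nn.CrossEntropyLoss()(model(partial), target)
```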
Hybrid EMD-RF Model for Predicting Annual Rainfall in Kerala, India
Rainfall forecasting is critical for the economy, but it has proven difficult due to the uncertainties, complexities, and interdependencies that exist in climatic systems. An efficient rainfall forecasting model will be beneficial in implementing suitable measures against natural disasters such as floods and landslides. In this paper, a novel hybrid model of empirical mode decomposition (EMD) and random forest (RF) was developed to enhance the accuracy of annual rainfall prediction. The EMD technique was utilized to decompose the rainfall signal into six intrinsic mode functions (IMFs) to extract underlying patterns, while the RF algorithm was employed to make predictions based on the IMFs. The hybrid RF–IMF model was trained and tested using a dataset of annual rainfall in Kerala from 1871 to 2020, and its performance was compared to traditional models such as RF regression and the autoregressive moving average (ARMA) model. Mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination or R-squared (R2) were used to compare the performances of these three models. Model evaluation metrics show that the RF–IMF model outperformed both the RF and ARMA models.
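The pipeline is straightforward to sketch, assuming a decomposition library such as PyEMD and scikit-learn are available; the lag-feature construction, train/test split, and file name below are illustrative choices rather than the paper's exact experimental setup.

```python
import numpy as np
from PyEMD import EMD
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rainfall = np.loadtxt("kerala_annual_rainfall.csv")   # hypothetical 1871-2020 annual series

# 1) decompose the rainfall series into intrinsic mode functions (IMFs)
imfs = EMD()(rainfall)                                 # shape: (n_imfs, n_years)

# 2) build lagged IMF features to predict next year's rainfall
lag = 3
X = np.stack([imfs[:, t - lag:t].ravel() for t in range(lag, len(rainfall))])
y = rainfall[lag:]

# 3) chronological train/test split and random forest regression
split = int(0.8 * len(y))
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])

print("MAE ", mean_absolute_error(y[split:], pred))
print("RMSE", mean_squared_error(y[split:], pred) ** 0.5)
print("R2  ", r2_score(y[split:], pred))
```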
Comparison of patient and graft survival in tacrolimus versus cyclosporine-based immunosuppressive regimes in renal transplant recipients – Single-center experience from South India
Studies have shown better graft function and reduced acute rejection rates among renal transplant recipients on tacrolimus (Tac)-based immunosuppression regimens as compared to cyclosporine (CsA)-based regimens in the first year. However, long-term follow-up data did not reveal better outcomes with the Tac-based regimens. In view of the short-term benefits, the trend of late has been to change to Tac-based regimens. Data from the Indian subcontinent are, however, sparse. We therefore looked at our data to ascertain whether a Tac-based regimen does have better outcomes in our population. We studied a total of 108 individuals who underwent renal transplantation between January 2007 and June 2013, with a mean follow-up of 38.22 months (comparable in both groups). Males constituted 77.8% of the cohort, and 16.7% of the 108 individuals were diabetic. New-onset diabetes after renal transplantation was more common in the Tac group (21 vs. 12), and the difference was statistically significant (P = 0.03). At the last follow-up, serum creatinine was higher in the CsA group (1.77 mg/dl vs. 1.35 mg/dl), and the difference was statistically significant (P = 0.03). The number of individuals requiring hemodialysis was also significantly higher in the CsA group (9 vs. 2; P = 0.05). Patient survival was similar in both groups (at 1-year and 5-year follow-up); however, graft survival was better in the Tac group than in the CsA group (0.94 vs. 0.88 at 1 year and 0.85 vs. 0.72 at 5 years).