Video-Mined Task Graphs for Keystep Recognition in Instructional Videos
Procedural activity understanding requires perceiving human actions in terms
of a broader task, where multiple keysteps are performed in sequence across a
long video to reach a final goal state -- such as the steps of a recipe or a
DIY fix-it task. Prior work largely treats keystep recognition in isolation of
this broader structure, or else rigidly confines keysteps to align with a
predefined sequential script. We propose discovering a task graph automatically
from how-to videos to represent probabilistically how people tend to execute
keysteps, and then leverage this graph to regularize keystep recognition in
novel videos. On multiple datasets of real-world instructional videos, we show
the impact: more reliable zero-shot keystep localization and improved video
representation learning, exceeding the state of the art.
Comment: Technical Report
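The idea can be sketched compactly: estimate keystep transition probabilities from mined keystep sequences, then use that graph as a prior when labeling clips in a new video. The greedy re-scoring and add-one smoothing below are illustrative simplifications, not the paper's exact graph construction or inference procedure.

```python
import numpy as np

def build_task_graph(keystep_sequences, num_keysteps, smoothing=1.0):
    """Estimate P(next keystep | current keystep) from mined keystep sequences."""
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in keystep_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def regularize_predictions(clip_probs, task_graph, weight=0.5):
    """Re-score per-clip keystep probabilities with the task-graph prior (greedy pass)."""
    scores = np.asarray(clip_probs, dtype=float)
    labels = [int(scores[0].argmax())]
    for t in range(1, len(scores)):
        prior = task_graph[labels[-1]]                 # transition prior from previous keystep
        combined = (1 - weight) * scores[t] + weight * prior
        labels.append(int(combined.argmax()))
    return labels

# toy usage: 3 keysteps, two mined sequences, 4 clips to label
graph = build_task_graph([[0, 1, 2], [0, 2]], num_keysteps=3)
print(regularize_predictions(np.random.rand(4, 3), graph))
```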
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Searching long egocentric videos with natural language queries (NLQ) has
compelling applications in augmented reality and robotics, where a fluid index
into everything that a person (agent) has seen before could augment human
memory and surface relevant information on demand. However, the structured
nature of the learning problem (free-form text query inputs, localized video
temporal window outputs) and its needle-in-a-haystack nature makes it both
technically challenging and expensive to supervise. We introduce
Narrations-as-Queries (NaQ), a data augmentation strategy that transforms
standard video-text narrations into training data for a video query
localization model. Validating our idea on the Ego4D benchmark, we find it has
tremendous impact in practice. NaQ improves multiple top models by substantial
margins (even doubling their accuracy), and yields the very best results to
date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in
the CVPR and ECCV 2022 competitions and topping the current public leaderboard.
Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique
properties of our approach such as the ability to perform zero-shot and
few-shot NLQ, and improved performance on queries about long-tail object
categories. Code and models:
http://vision.cs.utexas.edu/projects/naq
Comment: 13 pages, 7 figures, appearing in CVPR 2023
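The core augmentation is simple to sketch: each timestamped narration becomes a query paired with a short response window around its timestamp. The fixed half-width window and field names below are assumptions for illustration, not the exact NaQ conversion.

```python
def narrations_to_queries(narrations, window=3.0, video_duration=None):
    """Turn timestamped narrations into NLQ-style (query, temporal window) samples.

    narrations: list of (timestamp_sec, narration_text) pairs.
    window: assumed half-width in seconds of the response window around each timestamp.
    """
    samples = []
    for t, text in narrations:
        start = max(0.0, t - window)
        end = t + window if video_duration is None else min(video_duration, t + window)
        samples.append({"query": text, "start_sec": start, "end_sec": end})
    return samples

# usage: convert narrations into extra supervision for an NLQ localization model
narrations = [(12.4, "C opens the refrigerator"), (47.9, "C picks up the knife")]
print(narrations_to_queries(narrations, window=3.0, video_duration=120.0))
```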
SpotEM: Efficient Video Search for Episodic Memory
The goal in episodic memory (EM) is to search a long egocentric video to
answer a natural language query (e.g., "where did I leave my purse?"). Existing
EM methods exhaustively extract expensive fixed-length clip features to look
everywhere in the video for the answer, which is infeasible for long
wearable-camera videos that span hours or even days. We propose SpotEM, an
approach to achieve efficiency for a given EM method while maintaining good
accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that
learns to identify promising video regions to search conditioned on the
language query; 2) a set of low-cost semantic indexing features that capture
the context of rooms, objects, and interactions that suggest where to look; and
3) distillation losses that address the optimization issues arising from
end-to-end joint training of the clip selector and EM model. Our experiments on
200+ hours of video from the Ego4D EM Natural Language Queries benchmark and
three different EM models demonstrate the effectiveness of our approach:
computing only 10% - 25% of the clip features, we preserve 84% - 97% of the
original EM model's accuracy. Project page:
https://vision.cs.utexas.edu/projects/spotem
Comment: Published in ICML 2023
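A simplified sketch of the selective-search idea follows; the two-layer selector, feature dimensions, and compute budget are placeholders rather than SpotEM's actual components. The point is that a cheap, query-conditioned scorer picks a small subset of clips, and only those clips receive expensive EM processing.

```python
import torch
import torch.nn as nn

class ClipSelector(nn.Module):
    """Cheap scorer: predicts how promising each clip is for the given query."""
    def __init__(self, clip_dim, query_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim + query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, cheap_clip_feats, query_feat):
        # cheap_clip_feats: (num_clips, clip_dim); query_feat: (query_dim,)
        q = query_feat.unsqueeze(0).expand(cheap_clip_feats.size(0), -1)
        return self.mlp(torch.cat([cheap_clip_feats, q], dim=-1)).squeeze(-1)

def select_clips(selector, cheap_feats, query_feat, budget=0.2):
    """Pick the top `budget` fraction of clips to pass to the expensive EM model."""
    with torch.no_grad():
        scores = selector(cheap_feats, query_feat)
    k = max(1, int(budget * cheap_feats.size(0)))
    return scores.topk(k).indices.sort().values  # clip indices to process, in temporal order

# usage with random features: 100 clips, keep ~20 for expensive EM processing
sel = ClipSelector(clip_dim=64, query_dim=32)
idx = select_clips(sel, torch.randn(100, 64), torch.randn(32), budget=0.2)
```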
EgoEnv: Human-centric environment representations from egocentric video
First-person video highlights a camera-wearer's activities in the context of
their persistent environment. However, current video understanding approaches
reason over visual features from short video clips that are detached from the
underlying physical space and capture only what is immediately visible. To
facilitate human-centric environment understanding, we present an approach that
links egocentric video and the environment by learning representations that are
predictive of the camera-wearer's (potentially unseen) local surroundings. We
train such models using videos from agents in simulated 3D environments where
the environment is fully observable, and test them on human-captured real-world
videos from unseen environments. On two human-centric video tasks, we show that
models equipped with our environment-aware features consistently outperform
their counterparts with traditional clip features. Moreover, despite being
trained exclusively on simulated videos, our approach successfully handles
real-world videos from HouseTours and Ego4D, and achieves state-of-the-art
results on the Ego4D NLQ challenge. Project page:
https://vision.cs.utexas.edu/projects/ego-env/
Comment: Published in NeurIPS 2023 (Oral)
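One way to picture the training setup is sketched below, under the assumption that simulated walkthroughs supply both a clip feature and ground-truth labels for zones of the local surroundings; the zone-classification target and linear head are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EnvPredictor(nn.Module):
    """Predicts labels for the camera-wearer's local surroundings from a clip feature."""
    def __init__(self, clip_dim=512, num_zones=8, num_classes=20):
        super().__init__()
        self.head = nn.Linear(clip_dim, num_zones * num_classes)
        self.num_zones, self.num_classes = num_zones, num_classes

    def forward(self, clip_feat):
        # one class prediction per surrounding zone (e.g., directions around the camera)
        return self.head(clip_feat).view(-1, self.num_zones, self.num_classes)

# simulated supervision: the simulator exposes the true class of each surrounding zone
model = EnvPredictor()
clip_feat = torch.randn(4, 512)                      # batch of clip features
zone_labels = torch.randint(0, 20, (4, 8))           # ground truth from the simulator
loss = nn.CrossEntropyLoss()(model(clip_feat).flatten(0, 1), zone_labels.flatten())
```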
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference setting apart HM3DSEM from other
datasets is the use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of HM3DSEM dataset for the Object
Goal Navigation task using different methods. Policies trained using HM3DSEM
outperform those trained on prior datasets. The introduction of HM3DSEM in
the Habitat ObjectNav Challenge led to an increase in participation from 400
submissions in 2021 to 1022 submissions in 2022.
Comment: 14 Pages, 10 Figures, 5 Tables
A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems
Despite the advancement of machine learning techniques in recent years,
state-of-the-art systems lack robustness to "real world" events, where the
input distributions and tasks encountered by the deployed systems will not be
limited to the original training context, and systems will instead need to
adapt to novel distributions and tasks while deployed. This critical gap may be
addressed through the development of "Lifelong Learning" systems that are
capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3)
Scalability. Unfortunately, efforts to improve these capabilities are typically
treated as distinct areas of research that are assessed independently, without
regard to the impact of each separate capability on other aspects of the
system. We instead propose a holistic approach, using a suite of metrics and an
evaluation framework to assess Lifelong Learning in a principled way that is
agnostic to specific domains or system techniques. Through five case studies,
we show that this suite of metrics can inform the development of varied and
complex Lifelong Learning systems. We highlight how the proposed suite of
metrics quantifies performance trade-offs present during Lifelong Learning
system development - both the widely discussed Stability-Plasticity dilemma and
the newly proposed relationship between Sample Efficient and Robust Learning.
Further, we make recommendations for the formulation and use of metrics to
guide the continuing development of Lifelong Learning systems and assess their
progress in the future.
Comment: To appear in Neural Networks
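To make the trade-off concrete, the sketch below computes two standard continual-learning style quantities from a task-accuracy matrix: a forgetting-style measure (stability) and a forward-transfer-style measure (plasticity). These are generic illustrations, not the specific metric suite proposed in the paper.

```python
import numpy as np

def performance_maintenance(acc):
    """acc[i, j]: accuracy on task j after training through task i (square matrix).

    Returns the average drop from the accuracy right after learning each task to the
    accuracy at the end of the sequence (a stability / forgetting style measure).
    """
    acc = np.asarray(acc, dtype=float)
    n = acc.shape[0]
    drops = [acc[j, j] - acc[n - 1, j] for j in range(n - 1)]
    return float(np.mean(drops))

def forward_transfer(acc, baseline):
    """Average gain on each new task over a from-scratch baseline (plasticity style)."""
    acc = np.asarray(acc, dtype=float)
    return float(np.mean([acc[j, j] - baseline[j] for j in range(acc.shape[0])]))

# toy 3-task example: rows are training stages, columns are tasks
acc = [[0.90, 0.20, 0.10],
       [0.80, 0.85, 0.30],
       [0.75, 0.80, 0.90]]
print(performance_maintenance(acc), forward_transfer(acc, baseline=[0.7, 0.7, 0.7]))
```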
Predictive scene representations for embodied visual search
The goal in embodied perception is to understand egocentric images and videos captured by intelligent agents (humans and robots). Intelligent robots have to perceive the world using sensory inputs, build rich representations of their surrounding environment and take actions to perform their tasks. Augmented reality (AR) assistants must perceive activities performed by humans to provide assistance. My dissertation tackles embodied visual search, where the goal is to enable intelligent search for robots and AR assistants. My research aims to build predictive scene representations that can enable robot/AR agents to efficiently and accurately search for human-specified goals in complex scenes and videos.

Embodied visual search manifests as the visual navigation problem in robotics, where a mobile agent must efficiently navigate in the environment using visual sensors to search for one or more goals (e.g., where is the refrigerator?). Research on visual navigation aims to fuel a future generation of intelligent robots that can deploy in various environments to aid and enhance our daily lives. A key component of visual navigation is to build useful representations of the agent's surrounding environment. Unfortunately, existing navigation methods are limited to only encoding parts of the environment that the agent directly observes. For example, when a robot sees a dining table, it is unaware of the chair and floor space hidden behind the table. A robot that has navigated only to the kitchen and living room in a house is unaware of where to find a bed or a bathtub. Failing to encode the unseen parts of an environment hinders the agent's ability to make good decisions.

My dissertation builds predictive representations of real-world environments for visual navigation. Predictive representations enable an agent to perceive unseen parts of the environment conditioned on its limited history of sensory observations. By leveraging experience from previously seen environments, an agent can use semantic and geometric regularities shared across real-world environments to build predictive representations. First, I propose to learn agents that perform pixel-wise reconstructions of novel scenes and object models by anticipating unseen viewpoints. Next, I develop agents that efficiently build geometric maps of 3D environments by anticipating occupancy for unseen map regions, and efficiently search for objects in 3D environments by anticipating the presence of unseen objects. Furthermore, I propose a self-supervised strategy for learning general-purpose environment representations by anticipating unseen visual features and demonstrate their transferability to multiple downstream navigation tasks.

Embodied visual search manifests as the episodic memory problem in egocentric videos, where an AI assistant must efficiently scan a long visual history in search of a specific goal. Such an episodic memory system could index human experiences in AR spanning several weeks and respond to the human user's queries (did I leave the refrigerator open?) or organize a robot's experience during long-term operation and recollect critical details to make navigational decisions (what room should I go to find sheets of paper?) and respond to humans (was the lab locked when you last went there?). Research on episodic memory (EM) aims to build AI assistants that can reason about long visual histories and respond to natural language queries.
It is challenging to enable such a personal episodic memory due to the long duration of egocentric videos that can span several minutes to weeks, the open-ended nature of text queries, and the short nature of response windows that only span a few seconds. Standard EM methods suffer from two key shortcomings: the limited availability of annotated data results in poor generalization to new videos and queries, and the exorbitant compute requirements during inference limit their applicability to practical use cases. I propose to address the former limitation by developing a novel data augmentation algorithm that uses timestamped text descriptions to significantly expand the EM supervision. I propose to address the latter limitation by anticipating the relevance of video clips to the query. Specifically, I propose a novel clip-selection policy that previews the video cheaply to obtain the context of rooms, objects and interactions, and leverages semantic priors to identify query-relevant clips. It then searches efficiently by only expending computation on a relevant subset of clips.

Overall, my dissertation represents an important step toward developing intelligent search agents for embodied AI. The proposed methods have repeatedly established state-of-the-art results across major benchmarks in the field. Importantly, I develop robotic navigation policies that can be trained in simulation and successfully deployed on real robots, and video search methods that can effectively understand real-world human-captured videos of day-to-day activities to respond to human queries. Finally, I outline my future directions to learn foundational models of 3D scenes, build episodic memory systems for long-horizon videos, and explore robot learning from in-the-wild videos.
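As one concrete example of the predictive representations described above, the sketch below completes a partial top-down occupancy map around the agent; the tiny encoder-decoder and map format are illustrative assumptions, not the dissertation's actual occupancy-anticipation model.

```python
import torch
import torch.nn as nn

class OccupancyAnticipator(nn.Module):
    """Completes a partial top-down occupancy map around the agent (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                     # tiny encoder-decoder over 2-channel maps
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 1),                      # logits: free vs. occupied per cell
        )

    def forward(self, partial_map):
        # partial_map: (B, 2, H, W) with channels [observed occupancy, observed mask]
        return self.net(partial_map)

# training signal: full local maps are available in simulation
model = OccupancyAnticipator()
partial = torch.rand(8, 2, 65, 65)
target = torch.randint(0, 2, (8, 65, 65))             # ground-truth occupancy per cell
loss = nn.CrossEntropyLoss()(model(partial), target)
```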
Hybrid EMD-RF Model for Predicting Annual Rainfall in Kerala, India
Rainfall forecasting is critical for the economy, but it has proven difficult due to the uncertainties, complexities, and interdependencies that exist in climatic systems. An efficient rainfall forecasting model will be beneficial in implementing suitable measures against natural disasters such as floods and landslides. In this paper, a novel hybrid model of empirical mode decomposition (EMD) and random forest (RF) was developed to enhance the accuracy of annual rainfall prediction. The EMD technique was utilized to decompose the rainfall signal into six intrinsic mode functions (IMFs) to extract underlying patterns, while the RF algorithm was employed to make predictions based on the IMFs. The hybrid RF–IMF model was trained and tested using a dataset of annual rainfall in Kerala from 1871 to 2020, and its performance was compared to traditional models such as RF regression and the autoregressive moving average (ARMA) model. Mean absolute error (MAE), mean absolute percentage error (MAPE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination or R-squared (R2) were used to compare the performances of these three models. Model evaluation metrics show that the RF–IMF model outperformed both the RF and ARMA models.
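The pipeline is straightforward to sketch, assuming a decomposition library such as PyEMD and scikit-learn are available; the lag-feature construction, train/test split, and file name below are illustrative choices rather than the paper's exact experimental setup.

```python
import numpy as np
from PyEMD import EMD
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rainfall = np.loadtxt("kerala_annual_rainfall.csv")   # hypothetical 1871-2020 annual series

# 1) decompose the rainfall series into intrinsic mode functions (IMFs)
imfs = EMD()(rainfall)                                 # shape: (n_imfs, n_years)

# 2) build lagged IMF features to predict next year's rainfall
lag = 3
X = np.stack([imfs[:, t - lag:t].ravel() for t in range(lag, len(rainfall))])
y = rainfall[lag:]

# 3) chronological train/test split and random forest regression
split = int(0.8 * len(y))
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])

print("MAE ", mean_absolute_error(y[split:], pred))
print("RMSE", mean_squared_error(y[split:], pred) ** 0.5)
print("R2  ", r2_score(y[split:], pred))
```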
Comparison of patient and graft survival in tacrolimus versus cyclosporine-based immunosuppressive regimes in renal transplant recipients – Single-center experience from South India
Studies have shown better graft function and reduced acute rejection rates among renal transplant recipients on tacrolimus (Tac)-based immunosuppression regimens as compared to cyclosporine (CsA)-based regimens in the first year. However, long-term follow-up data did not reveal better outcomes with the Tac-based regimens. In view of the short-term benefits, the trend of late has been to change to Tac-based regimens. Data from the Indian subcontinent are, however, sparse. We therefore looked at our data to ascertain whether a Tac-based regimen does have better outcomes in our population. We studied a total of 108 individuals who underwent renal transplantation between January 2007 and June 2013, with a mean follow-up of 38.22 months (comparable in both groups). Males constituted 77.8% of the cohort, and 16.7% of the 108 individuals were diabetic. New-onset diabetes after renal transplantation was more common in the Tac group (21 vs. 12), and the difference was statistically significant (P = 0.03). At the last follow-up, serum creatinine was higher in the CsA group (1.77 mg/dl vs. 1.35 mg/dl), and the difference was statistically significant (P = 0.03). The number of individuals requiring hemodialysis was also significantly higher in the CsA group (9 vs. 2; P = 0.05). Patient survival was similar in both groups (at 1-year and 5-year follow-up); however, graft survival was better in the Tac group than in the CsA group (0.94 vs. 0.88 at 1 year and 0.85 vs. 0.72 at 5 years).