InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totaling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLM), thereby
showcasing its efficacy in learning video-language representation at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Learned on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system, advancing video-to-text and
text-to-video generation research. These proposed resources provide a tool for
researchers and practitioners interested in multimodal video understanding and
generation.

Comment: Data and Code:
https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVi
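ViCLIP is trained by contrastive learning over paired clip and caption embeddings. A minimal NumPy sketch of the symmetric InfoNCE objective commonly used for such video-text training (the temperature value and details here are illustrative, not the paper's exact implementation):

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings:
    matching (video, text) pairs sit on the diagonal of the
    similarity matrix and are pulled above all mismatched pairs."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # pairwise cosine similarities
    labels = np.arange(len(logits))           # positives on the diagonal

    def ce(l):  # cross-entropy with the diagonal as target
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))
```

With identical video and text embeddings the loss is near its minimum; shuffling the captions against the clips raises it, which is the signal the contrastive objective trains against.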
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Pre-trained large language models (LLMs) have recently achieved better
generalization and sample efficiency in autonomous web navigation. However, the
performance on real-world websites has still suffered from (1) open domainness,
(2) limited context length, and (3) lack of inductive bias on HTML. We
introduce WebAgent, an LLM-driven agent that can complete the tasks on real
websites following natural language instructions. WebAgent plans ahead by
decomposing instructions into canonical sub-instructions, summarizes long HTML
documents into task-relevant snippets, and acts on websites via Python
programs generated from those snippets. We design WebAgent with Flan-U-PaLM, for grounded
code generation, and HTML-T5, new pre-trained LLMs for long HTML documents
using local and global attention mechanisms and a mixture of long-span
denoising objectives, for planning and summarization. We empirically
demonstrate that our recipe improves the success on a real website by over 50%,
and that HTML-T5 is the best model for solving HTML-based tasks, achieving a
14.9% higher success rate than the prior SoTA on the MiniWoB web navigation
benchmark and better accuracy on offline task planning evaluation.
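The three-stage recipe (plan, summarize, act) can be sketched as a pipeline; the function names below are placeholders standing in for the HTML-T5 and Flan-U-PaLM calls, not the paper's API:

```python
def web_agent_step(instruction, html, plan, summarize, codegen, execute):
    """One WebAgent-style step: derive a canonical sub-instruction,
    condense the page to a task-relevant snippet, then run a generated
    program against the site (pipeline sketch with stand-in models)."""
    sub_instruction = plan(instruction, html)     # planning (HTML-T5 role)
    snippet = summarize(sub_instruction, html)    # summarization (HTML-T5 role)
    program = codegen(sub_instruction, snippet)   # grounded code (Flan-U-PaLM role)
    return execute(program)                       # e.g. a browser-driver runner
```

Wiring in stubs makes the data flow concrete: the planner narrows the instruction, the summarizer narrows the HTML, and only the generated program touches the live page.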
CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society
The rapid advancement of conversational and chat-based language models has
led to remarkable progress in complex task-solving. However, their success
heavily relies on human input to guide the conversation, which can be
challenging and time-consuming. This paper explores the potential of building
scalable techniques to facilitate autonomous cooperation among communicative
agents and provide insight into their "cognitive" processes. To address the
challenges of achieving autonomous cooperation, we propose a novel
communicative agent framework named role-playing. Our approach involves using
inception prompting to guide chat agents toward task completion while
maintaining consistency with human intentions. We showcase how role-playing can
be used to generate conversational data for studying the behaviors and
capabilities of chat agents, providing a valuable resource for investigating
conversational language models. Our contributions include introducing a novel
communicative agent framework, offering a scalable approach for studying the
cooperative behaviors and capabilities of multi-agent systems, and
open-sourcing our library to support research on communicative agents and
beyond. The GitHub repository of this project is made publicly available on:
https://github.com/lightaime/camel
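The role-playing setup can be sketched as two chat agents in a loop, each seeded with an inception prompt; the prompt wording, termination token, and loop shape below are illustrative, not CAMEL's exact prompts:

```python
def role_play(task, llm, max_turns=4):
    """Minimal role-playing loop: an AI 'user' agent issues instructions
    and an AI 'assistant' agent answers them, until the user agent signals
    completion (a sketch of the CAMEL-style framework)."""
    user_sys = f"You are the task planner. Task: {task}. Give one instruction per turn."
    asst_sys = f"You are the solver. Task: {task}. Answer each instruction."
    transcript = []
    asst_msg = f"Let's begin the task: {task}"
    for _ in range(max_turns):
        instruction = llm(user_sys, asst_msg)   # user agent: next instruction
        solution = llm(asst_sys, instruction)   # assistant agent: response
        transcript.append((instruction, solution))
        asst_msg = solution
        if "<TASK_DONE>" in instruction:        # inception prompt's stop signal
            break
    return transcript
```

The transcript of (instruction, solution) turns is exactly the kind of conversational data the paper proposes generating for studying agent behavior.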
Endogenous measures for contextualising large-scale social phenomena: a corpus-based method for mediated public discourse
This work presents an interdisciplinary methodology for developing endogenous measures of group membership through analysis of pervasive linguistic patterns in public discourse. Focusing on political discourse, this work critiques the conventional approach to the study of political participation, which is premised on decontextualised, exogenous measures to characterise groups. Considering the theoretical and empirical weaknesses of decontextualised approaches to large-scale social phenomena, this work suggests that contextualisation using endogenous measures might provide a complementary perspective to mitigate such weaknesses.
This work develops a sociomaterial perspective on political participation in mediated discourse as affiliatory action performed through language. While the affiliatory function of language is often performed consciously (such as statements of identity), this work is concerned with unconscious features (such as patterns in lexis and grammar). This work argues that pervasive patterns in such features that emerge through socialisation are resistant to change and manipulation, and thus might serve as endogenous measures of sociopolitical contexts, and thus of groups.
In terms of method, the work takes a corpus-based approach to the analysis of data from the Twitter messaging service whereby patterns in users' speech are examined statistically in order to trace potential community membership. The method is applied in the US state of Michigan during the second half of 2018, 6 November having been the date of midterm (i.e. non-Presidential) elections in the United States. The corpus is assembled from the original posts of 5,889 users, who are nominally geolocalised to 417 municipalities. These users are clustered according to pervasive language features. Comparing the linguistic clusters according to the municipalities they represent finds that there are regular sociodemographic differentials across clusters. This is understood as an indication of social structure, suggesting that endogenous measures derived from pervasive patterns in language may indeed offer a complementary, contextualised perspective on large-scale social phenomena.
One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era
OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is
demonstrated to be one small step for generative AI (GAI), but one giant leap
for artificial general intelligence (AGI). Since its official release in
November 2022, ChatGPT has quickly attracted numerous users with extensive
media coverage. Such unprecedented attention has also motivated numerous
researchers to investigate ChatGPT from various aspects. According to Google
Scholar, there are more than 500 articles with ChatGPT in their titles or
mentioning it in their abstracts. Considering this, a review is urgently
needed, and our work fills this gap. Overall, this work is the first to survey
ChatGPT with a comprehensive review of its underlying technology, applications,
and challenges. Moreover, we present an outlook on how ChatGPT might evolve to
realize general-purpose AIGC (a.k.a. AI-generated content), which will be a
significant milestone for the development of AGI.

Comment: A Survey on ChatGPT and GPT-4, 29 pages. Feedback is appreciated
([email protected]
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2D
images to 3D world understanding is a promising direction that researchers have
only recently started to explore. In this work, we design a novel 3D
pre-training Vision-Language method that helps a model learn semantically
meaningful and transferable 3D scene point cloud representations. We inject the
representational power of the popular CLIP model into our 3D encoder by
aligning the encoded 3D scene features with the corresponding 2D image and text
embeddings produced by CLIP. To assess our model's 3D world reasoning
capability, we evaluate it on the downstream task of 3D Visual Question
Answering. Experimental quantitative and qualitative results show that our
pre-training method outperforms state-of-the-art works in this task and leads
to an interpretable representation of 3D scene features.

Comment: CVPRW 2023. Code will be made publicly available:
https://github.com/AlexDelitzas/3D-VQ
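The alignment idea (pulling encoded 3D scene features toward the matching frozen CLIP embeddings) can be sketched as a cosine-distance objective; this is an illustrative stand-in, not the paper's exact loss:

```python
import numpy as np

def alignment_loss(scene_feats, clip_feats):
    """Mean (1 - cosine similarity) between encoded 3D scene features
    and the corresponding frozen 2D CLIP image/text embeddings
    (illustrative alignment objective)."""
    a = scene_feats / np.linalg.norm(scene_feats, axis=1, keepdims=True)
    b = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```

Minimizing this drives the 3D encoder's output toward CLIP's embedding space, which is how the 2D model's representational power is injected without retraining CLIP itself.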
Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning
Language model probing is often used to test specific capabilities of these
models. However, conclusions from such studies may be limited when the probing
benchmarks are small and lack statistical power. In this work, we introduce
new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500)
inspired by psycholinguistic studies. We dramatically extend existing NEG-136
and ROLE-88 benchmarks using GPT3, increasing their size from 18 and 44
sentence pairs to 750 each. We also create another version of the extended
negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it
consists of 770 sentence pairs. We evaluate 22 models on the extended datasets,
seeing model performance dip 20-57% compared to the original smaller
benchmarks. We observe high levels of negation sensitivity in models like BERT
and ALBERT, demonstrating that previous findings might have been skewed due to
smaller test sets. Finally, we observe that while GPT3 generated all the
examples in ROLE-1500, it is only able to solve 24.6% of them during probing.
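Probing with minimal pairs of this kind reduces to checking whether a model prefers the expected sentence of each pair; the toy harness below makes that explicit, with `score` standing in for an LM log-probability (a simplification — the actual NEG/ROLE evaluations use top-k cloze predictions):

```python
def probe_accuracy(pairs, score):
    """Fraction of minimal pairs for which the model scores the
    expected sentence above its contrasting counterpart
    (simplified sentence-pair probing harness)."""
    hits = sum(score(expected) > score(contrast) for expected, contrast in pairs)
    return hits / len(pairs)
```

The statistical-power point in the abstract follows directly: with only tens of pairs, a few lucky hits move accuracy by whole percentage points, whereas 750+ pairs give far tighter estimates.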
Entities and their genera: Slicing up the world the medieval way--and does it matter to formal ontology?
Genera, typically hand-in-hand with their branching species, are essential elements of vocabulary-based information constructs, in particular scientific taxonomies. Should they also feature in formal ontologies, the highest of such constructs? I argue in this article that the answer is "Yes" and that the question posed in its title also has a Yes-answer: The way medieval ontologists sliced up the world into genera does matter to formal ontology. More specifically, the way Dietrich of Freiberg, a Latin scholastic, conceived and applied strictly generic criteria to slice up the world into its entities can provide some guidelines to the field of formal ontology with respect to not only its contents, but also its scope. In particular, Dietrich's information criterion plays here a central role.
The Child in Games: From the Meek, to the Mighty, to the Monstrous
Drawing across game studies, childhood studies, and children's literature studies, this thesis catalogues and critiques the representation of children in contemporary video games.
It poses two questions:
1) How are children represented in contemporary video games?
2) In what ways do the representations of children in video games affirm or
challenge dominant Western beliefs about the figure of the child?
To answer these questions, I combine a large-scale content analysis of over 500 games published between 2009 and 2019 with a series of autoethnographic close readings. My content analysis is designed to provide a quantitative snapshot of the representation of children in games. I use statistical analysis to assemble data points as meaningful constellations. I use the axes of race, gender, and age, as well as genre, age-rating, and publication year, to identify patterns in representation. I distil my findings as a set of seven archetypes: The Blithe Child, The Heroic Child, The Human Becoming, The Child Sacrifice, The Side Kid, The Waif, and The Little Monster. This typology is not intended to work against the granular detail of the information recorded in the dataset, but to draw attention to patterns of coherence and divergence that occur between particular examples, as well as to intersections with representational tropes about children identified in other media.
I select four of these seven archetypes to structure my autoethnographic close readings. While content analysis is a useful tool for documenting the presence, absence, and dominant function of child-characters in games, close reading allows for a more intersectional approach that can attend to the nuances of representation across identity markers, creating opportunities to examine internal contradictions, ironies, and the polysemy generated through interpretive gaps. I develop my own close reading method building on the autoethnographic approaches of Carr (2019), Vossen (2020), McArthur (2018), and Jennings (2021), which I call critical ekphrasis. Chapter one argues that the Blithe Child triangulates 'children', 'toys', and 'paidia'. It suggests that both childhood and play can be conceptualised as a 'magic circle', and that the immateriality of the Blithe Child implies childhood can be a mode of being unconnected to anatomical markers or chronological age. Chapter two explores how the Heroic Child challenges the apparent affinity between video games and traditional hero narratives. It argues that the dependence of the childly protagonist undermines dualistic thinking and instead celebrates cooperation, compromise, and connection. Chapter three compares the Child Sacrifice to the woman-in-the-refrigerator trope, arguing that it functions to justify aggressive, hypermasculine, militarised violence. The final chapter compares the Little Monster and the Waif to examine how the uncanny child raises metareferential questions about autonomy in interactive media and agency in intergenerational relationships.
My research project concludes by suggesting that virtual children in simulated worlds point to the active construction and delimitation of 'the child' in society and can reveal that much of what is assumed to be natural, obvious, and universal about the figure of 'the child' is in fact ideological. It hints at the possibility that just as virtual children are used as rhetorical figures to explain and justify the rules, mechanics, and moral systems of a digital game, so too is the figure of 'the child' used to routinise and vindicate the rules, workings, and moral systems of Euro-American culture.