753 research outputs found

    Comprehensive Information Integration Modeling Framework for Video Titling

    Full text link
    In e-commerce, consumer-generated videos, which generally convey consumers' individual preferences for different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos seldom come with appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is highly useful and in demand, it has been addressed far less than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware, multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs, and recognizes the intra-actions in each graph through Graph Neural Networks (GNNs). A global-local aggregation module is then proposed to model inter-actions across graphs and aggregate the heterogeneous graphs into a holistic graph representation. Abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to exploit the interactions between products and backgrounds and generate the story-line topic of the video. We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform, and will make the desensitized version publicly available to nourish further development of the research community. Comment: 11 pages, 6 figures, to appear in KDD 2020 proceedings
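    The granular-level interaction modeling above hinges on message passing within each graph. As a toy illustration only (not the paper's architecture; the dictionary-based nodes, scalar features, and mean aggregation are all assumptions for the sketch), one GNN propagation step might look like:

```python
def gnn_layer(features, adjacency):
    """One message-passing step with mean aggregation: each node's new
    feature vector is the average of its own and its neighbours'
    feature vectors."""
    updated = {}
    for node, feat in features.items():
        neigh = [features[n] for n in adjacency.get(node, [])] + [feat]
        dim = len(feat)
        updated[node] = [sum(f[d] for f in neigh) / len(neigh) for d in range(dim)]
    return updated
```

    Stacking several such steps lets information about intra-actions spread through each of the three graphs before the aggregation module combines them.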

    A Neural Attention Model for Abstractive Sentence Summarization

    Full text link
    Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.Comment: Proceedings of EMNLP 201
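    The core of such a model is an attention step that scores each input position against the current decoder state and forms a weighted context vector. A minimal sketch in plain Python (the dot-product scoring and toy vectors are illustrative assumptions; the paper's model learns these representations end-to-end):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(query, keys, values):
    """One attention step: dot-product score of each input position
    against the decoder state, softmax-normalise, then take the
    weighted sum of the value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

    At each decoding step the summary word is generated conditioned on such a context, which is what lets the model attend to different parts of the input sentence for different output words.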

    L-Eval: Instituting Standardized Evaluation for Long Context Language Models

    Full text link
    Recently, there has been growing interest in extending the context length of instruction-following models in order to effectively process single-turn long input (e.g. summarizing a paper) and conversations with more extensive histories. While proprietary models such as GPT-4 and Claude have demonstrated considerable advancements in handling tens of thousands of tokens of context, open-source models are still in the early stages of experimentation. It also remains unclear whether developing these long context models can offer substantial gains on practical downstream tasks over retrieval-based methods or models simply trained on chunked contexts. To address this challenge, we propose to institute standardized evaluation for long context language models. Concretely, we develop L-Eval, which contains 411 long documents and over 2,000 query-response pairs manually annotated and checked by the authors, encompassing areas such as law, finance, school lectures, lengthy conversations, news, long-form novels, and meetings. L-Eval also adopts diverse evaluation methods and instruction styles, enabling a more reliable assessment of Long Context Language Models (LCLMs). Our findings indicate that while open-source models typically lag behind their commercial counterparts, they still exhibit impressive performance. LLaMA2 achieves the best results (a 45% win rate vs. turbo-16k) on open-ended tasks with only a 4k context length, and ChatGLM2 achieves the best results on closed-ended tasks with 8k input tokens. We release our new evaluation suite, code, and all generation results, including predictions from all open-source LCLMs, GPT4-32k, and Claude-100k, at https://github.com/OpenLMLab/LEval

    Videoscapes: Exploring Unstructured Video Collections

    No full text

    Extending Context Window of Large Language Models via Positional Interpolation

    Full text link
    We present Position Interpolation (PI), which extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 tokens with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization, from LLaMA 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least ~600× smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimizations and infrastructure.
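    The down-scaling itself is a one-line transformation applied before computing rotary angles. A sketch under assumed window sizes (2048 trained, 8192 extended; the RoPE dimension and base below are illustrative, not LLaMA's actual values):

```python
def rope_angles(position, dim=8, base=10000.0):
    """Rotary-embedding rotation angles for one token position
    (dim and base are toy values for illustration)."""
    return [position / base ** (2 * i / dim) for i in range(dim // 2)]

def interpolated_position(position, trained_len=2048, extended_len=8192):
    """Position Interpolation: linearly down-scale position indices so
    every index in the extended window maps into the trained range."""
    return position * trained_len / extended_len

# Token 6000 of an 8192-token window is treated as position 1500 of
# the original 2048-token window before computing its RoPE angles.
angles = rope_angles(interpolated_position(6000))
```

    Because every index stays inside the trained range, attention never sees angle magnitudes beyond those encountered in pretraining, which is the stability argument the abstract makes against extrapolation.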

    Image Diversification Based on Query-Adaptive Clustering

    Get PDF
    As technology advances, major transformations take place, whether purely social or technological, and images have had a direct impact on many of them. Images are used in diverse contexts, such as medical systems, biodiversity systems, and digital libraries (Torres & Falcao, 2006). Over the years, much has therefore been done to improve the effectiveness with which these images are retrieved and analysed. One such family of techniques is content-based image retrieval (Veltkamp & Tanase, 2002). These techniques essentially attempt to retrieve images similar to a specification or pattern defined by the user (for example, a shape sketch or an example image) (Torres & Falcao, 2006). The information retrieval process requires that certain aspects be handled with care, such as ambiguity, redundancy, relevance, and diversity. Moreover, the images shown to a given user should be those considered relevant, the ones that offer useful information. However, although relevance is an effective criterion, it has been observed that in certain situations it does not fully satisfy queries that demand visual diversity, e.g. Chang et al. (2016); Chang & Wang (2016); Fan et al. (2008). A commonly explored way to mitigate this problem is data clustering, which seeks groups of mutually similar objects without relying on prior information about the data. Many clustering algorithms, however, require a reference value for the number of groups to be generated. Determining the number of groups is a demanding task, as it involves a set of properties and characteristics of the images.
Previous work tried to find a fixed number of groups regardless of the query being executed (Ferreira et al., 2016; Tollari, 2016), instead of using adaptive methods. In other work, even when the exact number of clusters was simulated for each query (based on the ground truth), the results were not satisfactory (Araujo, 2016). In light of this, this study sought to formulate an approach that assists in the automatic detection of the number of groups, adapting to each query.
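    The query-adaptive idea, choosing the number of groups per query rather than fixing it in advance, can be sketched with a toy 1-D k-means plus a silhouette criterion (both the clustering routine and the selection rule here are illustrative assumptions, not the study's actual method):

```python
import random

def kmeans_1d(xs, k, iters=50, seed=0):
    """Toy 1-D k-means; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            clusters[min(range(k), key=lambda c: abs(x - centers[c]))].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def silhouette(clusters):
    """Mean silhouette coefficient; singleton clusters score 0."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for x in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            a = sum(abs(x - y) for y in cluster) / (len(cluster) - 1)
            b = min(sum(abs(x - y) for y in c) / len(c)
                    for j, c in enumerate(clusters) if j != ci and c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def best_k(xs, k_range=range(2, 5)):
    """Pick the number of groups that maximises the silhouette —
    an adaptive, per-query choice rather than a fixed k."""
    return max(k_range, key=lambda k: silhouette(kmeans_1d(xs, k)[1]))
```

    Run per query over the retrieved images' feature values, such a criterion picks a different group count for each query, which is the behaviour the fixed-k baselines cited above lack.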

    Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval

    Get PDF
    Document coherence describes how much sense text makes in terms of its logical organisation and discourse flow. Even though coherence is a relatively difficult notion to quantify precisely, it can be approximated automatically. This type of coherence modelling is not only interesting in itself, but also useful for a number of other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both keyword-based standard ranking and by spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
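    Computing a metric directly on the bipartite sentence-entity graph, without a one-mode projection, can be sketched as follows (the adjacent-sentence overlap proxy below is an illustrative assumption, not one of the paper's three metrics; entity extraction is assumed to be done upstream):

```python
def bipartite_grid(sentences):
    """Bipartite document graph: sentence indices on one side, entities
    on the other, with an edge wherever an entity occurs in a sentence.
    Input is a list of per-sentence entity collections."""
    return {i: set(ents) for i, ents in enumerate(sentences)}

def entity_overlap_coherence(edges):
    """A simple coherence proxy computed directly on the bipartite
    structure: mean number of entities shared by adjacent sentences."""
    n = len(edges)
    if n < 2:
        return 0.0
    shared = [len(edges[i] & edges[i + 1]) for i in range(n - 1)]
    return sum(shared) / (n - 1)
```

    The point of working on the bipartite structure is that which entity links two sentences remains visible, information that a one-mode sentence-sentence projection collapses into a bare edge weight.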