Comprehensive Information Integration Modeling Framework for Video Titling
In e-commerce, consumer-generated videos, which generally convey consumers'
individual preferences for different aspects of certain products, are massive
in volume. To recommend these videos to potential consumers more effectively,
diverse and catchy video titles are critical. However, consumer-generated
videos are seldom accompanied by appropriate titles. To bridge this
gap, we integrate comprehensive sources of information, including the content
of consumer-generated videos, the narrative comment sentences supplied by
consumers, and the product attributes, in an end-to-end modeling framework.
Although automatic video titling is highly useful and in demand, it has
received far less attention than video captioning. The latter focuses on
generating sentences that describe videos as a whole, while our task requires
product-aware, multi-grained video analysis. To tackle this issue, the proposed method
consists of two processes, i.e., granular-level interaction modeling and
abstraction-level story-line summarization. Specifically, the granular-level
interaction modeling first utilizes temporal-spatial landmark cues, descriptive
words, and abstractive attributes to build three individual graphs, and
recognizes the intra-actions in each graph through Graph Neural Networks (GNNs).
Then the global-local aggregation module is proposed to model inter-actions
across graphs and aggregate heterogeneous graphs into a holistic graph
representation. The abstraction-level story-line summarization further
considers both frame-level video features and the holistic graph, exploiting
the interactions between products and backgrounds to generate the story-line
topic of the video. We collect a large-scale dataset accordingly from
real-world data in Taobao, a world-leading e-commerce platform, and will make
the desensitized version publicly available to nourish further development of
the research community. Comment: 11 pages, 6 figures, to appear in the KDD 2020 proceedings.
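The granular-level interaction modelling above hinges on message passing within each graph. As an illustration only (not the paper's actual model), a single mean-aggregation GNN layer over a toy graph can be sketched in NumPy; the adjacency matrix, node features, and weights here are all invented for the example:

```python
import numpy as np

def gnn_layer(A, X, W):
    """One round of mean-aggregation message passing: each node
    averages its neighbours' features, then applies a linear map
    followed by a ReLU non-linearity."""
    deg = A.sum(axis=1, keepdims=True)     # node degrees
    A_norm = A / np.maximum(deg, 1)        # row-normalised adjacency
    H = A_norm @ X                         # aggregate neighbour features
    return np.maximum(H @ W, 0)            # linear transform + ReLU

# Toy graph of 3 "landmark" nodes with 2-d features.
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
W = np.eye(2)                              # identity weights for clarity
H = gnn_layer(A, X, W)
print(H.shape)  # (3, 2)
```

Stacking several such layers, one per graph, is one plausible way to capture the intra-actions the abstract refers to; the paper's aggregation across the three heterogeneous graphs is a separate module not shown here.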
A Neural Attention Model for Abstractive Sentence Summarization
Summarization based on text extraction is inherently limited, but
generation-style abstractive methods have proven challenging to build. In this
work, we propose a fully data-driven approach to abstractive sentence
summarization. Our method utilizes a local attention-based model that generates
each word of the summary conditioned on the input sentence. While the model is
structurally simple, it can easily be trained end-to-end and scales to a large
amount of training data. The model shows significant performance gains on the
DUC-2004 shared task compared with several strong baselines. Comment: Proceedings of EMNLP 2015.
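The core of such a local attention-based model is a context vector computed as a softmax-weighted sum of input representations, conditioned on the current decoder state; each summary word is then predicted from that context. A minimal NumPy sketch with invented toy embeddings (not the paper's trained parameters):

```python
import numpy as np

def attention_context(query, keys, values):
    """Soft attention: score each input position against the decoder
    state, softmax the scores, and return the weighted sum of values."""
    scores = keys @ query                  # dot-product alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over input positions
    return weights @ values                # context vector for this step

# Toy "input sentence" of 4 positions with 3-d embeddings.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
values = keys.copy()
query = keys[2]                            # one key as a stand-in decoder state
ctx = attention_context(query, keys, values)
print(ctx.shape)  # (3,)
```

In the full model this context vector feeds a softmax over the output vocabulary at every decoding step, which is what makes the whole system trainable end-to-end.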
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Recently, there has been growing interest in extending the context length of
instruction-following models in order to effectively process single-turn long
input (e.g. summarizing a paper) and conversations with more extensive
histories. While proprietary models such as GPT-4 and Claude have demonstrated
considerable advancements in handling tens of thousands of tokens of context,
open-sourced models are still in the early stages of experimentation. It also
remains unclear whether developing these long context models can offer
substantial gains on practical downstream tasks over retrieval-based methods or
models simply trained on chunked contexts. To address this challenge, we
propose to institute standardized evaluation for long context language models.
Concretely, we develop L-Eval which contains 411 long documents and over 2,000
query-response pairs manually annotated and checked by the authors encompassing
areas such as law, finance, school lectures, lengthy conversations, news,
long-form novels, and meetings. L-Eval also adopts diverse evaluation methods
and instruction styles, enabling a more reliable assessment of Long Context
Language Models (LCLMs). Our findings indicate that while open-source models
typically lag behind their commercial counterparts, they still exhibit
impressive performance. LLaMA2 achieves the best results (45% win rate vs.
turbo-16k) on open-ended tasks with only a 4k context length, and ChatGLM2
achieves the best results on closed-ended tasks with 8k input tokens. We
release our new evaluation suite, code, and all generation results including
predictions from all open-sourced LCLMs, GPT4-32k, and Claude-100k, at
https://github.com/OpenLMLab/LEval
Extending Context Window of Large Language Models via Positional Interpolation
We present Position Interpolation (PI), which extends the context window sizes
of RoPE-based pretrained LLMs such as LLaMA models up to 32768 tokens with minimal
fine-tuning (within 1000 steps), while demonstrating strong empirical results
on various tasks that require long context, including passkey retrieval,
language modeling, and long document summarization from LLaMA 7B to 65B.
Meanwhile, models extended by Position Interpolation preserve quality
relatively well on tasks within their original context window. To achieve this
goal, Position Interpolation linearly down-scales the input position indices to
match the original context window size, rather than extrapolating beyond the
trained context length, which may lead to catastrophically high attention scores
that completely ruin the self-attention mechanism. Our theoretical study shows
that the upper bound of the interpolated attention score is much smaller than
that of extrapolation, further demonstrating its stability. Models extended via
Position Interpolation retain their original architecture and can reuse most
pre-existing optimizations and infrastructure. Comment: Fix template issue.
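The linear down-scaling at the heart of Position Interpolation is simple to state: position indices for a long input are rescaled to fit inside the original training range before the rotary embedding is applied, instead of running past it. A minimal sketch, assuming a model pre-trained with a 2048-token window being extended to 8192 (the window sizes are chosen for the example):

```python
import numpy as np

def interpolate_positions(position_ids, train_len, target_len):
    """Position Interpolation: linearly down-scale position indices so
    a sequence of length `target_len` is squeezed into the `train_len`
    positions the RoPE model saw during pre-training, rather than
    extrapolating beyond them."""
    scale = train_len / target_len         # e.g. 2048 / 8192 = 0.25
    return position_ids * scale

pos = np.arange(8192)                      # positions of a long-context input
scaled = interpolate_positions(pos, train_len=2048, target_len=8192)
print(scaled.max())  # 2047.75 — every index stays inside [0, 2048)
```

Because the scaled (now fractional) indices never exceed the trained range, attention scores stay in the regime the model was optimized for, which is why only a short fine-tuning run is needed to adapt to the denser position spacing.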
Image Diversification Based on Query-Adaptive Clustering
As technology advances, major transformations take place, whether purely social or technological in scope. In this context, images have had a direct impact on many of these transformations. Images are used in diverse settings, such as medical systems, biodiversity systems, and digital libraries (Torres & Falcao, 2006). Over the years, much has therefore been done to improve the effectiveness with which these images are retrieved and analysed. One such family of techniques is content-based image retrieval (Veltkamp & Tanase, 2002). These techniques essentially try to retrieve images similar to a specification or pattern defined by the user (for example, a shape sketch or a template image) (Torres & Falcao, 2006). The information retrieval process requires that certain aspects be handled with care, such as ambiguity, redundancy, relevance, and diversity. Moreover, the images that should be shown to a given user are those considered relevant, i.e., those that offer useful information. However, although relevance is an effective criterion, it has been observed that in certain situations it does not fully satisfy queries that demand visual diversity; see, e.g., Chang et al. (2016); Chang & Wang (2016); Fan et al. (2008). A commonly explored solution to mitigate this problem is data clustering, which seeks to find groups of objects with a certain similarity without relying on prior information about the data. However, many algorithms require a reference value to determine the number of clusters to be generated. Determining the number of clusters is a demanding task, as it involves a whole set of properties and characteristics of the images.
Previous work tried to find a fixed number of clusters regardless of the query being executed (Ferreira et al., 2016; Tollari, 2016), instead of using adaptive methods. In other work, even when the exact number of clusters for each query was simulated (based on the ground truth), the results were not satisfactory (Araujo, 2016). In light of this, this study sought to formulate an approach that helps detect the number of clusters automatically, adapting to each query.
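The query-adaptive idea described above can be illustrated with a standard model-selection heuristic: cluster the retrieved results for several candidate values of k and keep the k with the best silhouette score. The k-means and silhouette implementations below are generic illustrations (not the study's actual method), and the two-blob data is invented:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point
    initialisation (illustrative, not production-grade)."""
    centers = [X[0]]
    for _ in range(k - 1):                     # spread out initial centers
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)  # recompute cluster means
    return labels

def silhouette(X, labels):
    """Mean silhouette score: near 1 when clusters are tight and well
    separated, lower when they overlap."""
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=2))
    uniq = set(labels.tolist())
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                        # exclude the point itself
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b) if same.any() else 0.0)
    return float(np.mean(scores))

def adaptive_k(X, k_range=range(2, 6)):
    """Query-adaptive cluster count: keep the k with the best silhouette."""
    return max(k_range, key=lambda k: silhouette(X, kmeans(X, k)))

# Two well-separated toy "result sets" for a single query.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
print(adaptive_k(X))  # 2
```

Running this per query, rather than once globally, is what makes the cluster count adaptive: a query whose results form two visual themes gets k = 2, while a more heterogeneous result set would select a larger k.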
Exploiting the Bipartite Structure of Entity Grids for Document Coherence and Retrieval
Document coherence describes how much sense text makes in terms of its logical organisation and discourse flow. Even though coherence is a relatively difficult notion to quantify precisely, it can be approximated automatically. This type of coherence modelling is not only interesting in itself, but also useful for a number of other text processing tasks, including Information Retrieval (IR), where adjusting the ranking of documents according to both their relevance and their coherence has been shown to increase retrieval effectiveness. The state of the art in unsupervised coherence modelling represents documents as bipartite graphs of sentences and discourse entities, and then projects these bipartite graphs into one-mode undirected graphs. However, one-mode projections may incur significant loss of the information present in the original bipartite structure. To address this, we present three novel graph metrics that compute document coherence on the original bipartite graph of sentences and entities. Evaluation on standard settings shows that: (i) one of our coherence metrics beats the state of the art in terms of coherence accuracy; and (ii) all three of our coherence metrics improve retrieval effectiveness because, as closer analysis reveals, they capture aspects of document quality that go undetected by both keyword-based standard ranking and by spam filtering. This work contributes document coherence metrics that are theoretically principled, parameter-free, and useful to IR.
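To make the bipartite representation concrete, the sketch below builds a sentence–entity bipartite graph for a toy document and computes one simple coherence proxy (average entity degree). The entity list is supplied by hand and the metric is an illustration of the data structure, not one of the paper's three metrics:

```python
from collections import defaultdict

def bipartite_entity_graph(sentences, entities):
    """Bipartite sentence-entity graph of an entity grid: an edge links
    entity e to sentence s whenever e is mentioned in s.  Mentions are
    found by naive substring matching; a real system would use an
    entity/coreference tagger."""
    edges = defaultdict(set)
    for s_id, sent in enumerate(sentences):
        for e in entities:
            if e.lower() in sent.lower():
                edges[e].add(s_id)
    return edges

def avg_entity_degree(edges):
    """A toy coherence proxy on the bipartite graph: the average number
    of sentences each discourse entity connects to.  Entities recurring
    across sentences suggest a smoother discourse flow."""
    if not edges:
        return 0.0
    return sum(len(s) for s in edges.values()) / len(edges)

doc = ["John bought a book.",
       "John read the book at home.",
       "The book was long."]
g = bipartite_entity_graph(doc, ["John", "book", "home"])
print(avg_entity_degree(g))  # 2.0
```

Computing metrics directly on this structure, rather than on its one-mode projection onto sentences alone, is precisely what avoids the information loss the abstract describes: the projection would record only that sentences share some entity, discarding which entity and how often.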