
    Attentive Aspect Modeling for Review-aware Recommendation

    In recent years, many studies have extracted aspects from user reviews and integrated them with ratings to improve recommendation performance. Aspects mentioned in both a user's reviews and a product's reviews indicate indirect connections between the user and the product. However, these aspect-based methods suffer from two problems. First, the common aspects are usually very sparse, owing to the sparsity of user-product interactions and the diversity of individual users' vocabularies. Second, a user's interest in an aspect can differ across products, whereas existing methods usually assume it to be static. In this paper, we propose an Attentive Aspect-based Recommendation Model (AARM) to tackle these challenges. For the first problem, to enrich the aspect connections between user and product, AARM models interactions between synonymous and similar aspects in addition to common aspects. For the second problem, a neural attention network that simultaneously considers user, product and aspect information is constructed to capture a user's attention towards aspects when examining different products. Extensive quantitative and qualitative experiments show that AARM effectively alleviates both problems and significantly outperforms several state-of-the-art recommendation methods on the top-N recommendation task.
    Comment: Camera-ready manuscript for TOI
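    As a rough illustration of the second idea, a neural attention network that scores a user's aspects conditioned on both the user and the candidate product could look like the following PyTorch sketch; all module names and tensor shapes here are illustrative assumptions, not AARM's published code.

```python
import torch
import torch.nn as nn

class UserProductAspectAttention(nn.Module):
    """Illustrative sketch: attention over a user's aspects, conditioned on
    both user and product embeddings (an assumption, not AARM's exact design)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, user, product, aspects):
        # user, product: (batch, dim); aspects: (batch, n_aspects, dim)
        n = aspects.size(1)
        context = torch.cat([user, product], dim=-1)            # (batch, 2*dim)
        context = context.unsqueeze(1).expand(-1, n, -1)        # (batch, n, 2*dim)
        logits = self.score(torch.cat([context, aspects], -1))  # (batch, n, 1)
        weights = torch.softmax(logits, dim=1)                  # product-dependent attention
        return (weights * aspects).sum(dim=1)                   # (batch, dim) aspect summary
```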

    Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

    In this study, we propose global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, a large-scale speaker verification corpus collected in the wild. This lightweight block can easily be incorporated into a CNN model with little additional computational cost, and it improves speaker verification performance by a large margin over the baseline ResNet-LDE model and the Squeeze&Excitation block. Detailed ablation studies are also performed to analyze the factors that may impact the performance of the proposed modules. With the proposed L2-tf-GTFC transformation block, the Equal Error Rate decreases from 4.56% to 3.07%, a relative 32.68% reduction, along with a relative 27.28% improvement in the DCF score. The results indicate that our global context guided transformation modules can efficiently improve the learned speaker representations through time-frequency and channel-wise feature recalibration.
    Comment: Accepted to Interspeech 202
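    For intuition, a global-context channel gate in this spirit can be sketched as below. This is a simplified squeeze-and-excitation-style approximation of channel recalibration, not the paper's exact L2-tf-GTFC design; the class name and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class GlobalContextChannelGate(nn.Module):
    """Sketch of global-context-guided channel recalibration on a
    (batch, C, F, T) time-frequency feature map (illustrative only)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        # Global context: average over the time-frequency plane.
        ctx = x.mean(dim=(2, 3))                      # (batch, C)
        gate = self.fc(ctx)                           # (batch, C) channel importance
        return x * gate.unsqueeze(-1).unsqueeze(-1)   # enhance salient channels
```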

    SCAN: A Spatial Context Attentive Network for Joint Multi-Agent Intent Prediction

    Safe navigation of autonomous agents in human-centric environments requires the ability to understand and predict the motion of neighboring pedestrians. However, predicting pedestrian intent is a complex problem: pedestrian motion is governed by complex social navigation norms, depends on neighbors' trajectories, and is multimodal in nature. In this work, we propose SCAN, a Spatial Context Attentive Network that jointly predicts socially acceptable multiple future trajectories for all pedestrians in a scene. SCAN encodes the influence of spatially close neighbors using a novel spatial attention mechanism that relies on fewer assumptions, is parameter efficient, and is more interpretable than state-of-the-art spatial attention approaches. Through experiments on several datasets, we demonstrate that our approach also quantitatively outperforms state-of-the-art trajectory prediction methods in terms of the accuracy of predicted intent.
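    As a loose illustration of attending over spatially close neighbors, one generic formulation is dot-product attention with a distance penalty, sketched below; the scoring scheme and names are assumptions for exposition, not SCAN's specific mechanism.

```python
import torch
import torch.nn as nn

class SpatialNeighborAttention(nn.Module):
    """Sketch: pool neighbor states with attention weights that decay with
    distance (a generic stand-in, not SCAN's published attention)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, agent, neighbors, distances):
        # agent: (batch, dim); neighbors: (batch, n, dim); distances: (batch, n)
        q = self.query(agent).unsqueeze(1)               # (batch, 1, dim)
        k = self.key(neighbors)                          # (batch, n, dim)
        logits = (q * k).sum(-1) / k.size(-1) ** 0.5     # (batch, n) scaled scores
        logits = logits - distances                      # down-weight far neighbors
        w = torch.softmax(logits, dim=-1).unsqueeze(-1)  # (batch, n, 1)
        return (w * neighbors).sum(dim=1)                # pooled social context
```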

    On Conditional and Compositional Language Model Differentiable Prompting

    Prompts have been shown to be an effective method to adapt a frozen Pretrained Language Model (PLM) to perform well on downstream tasks. Prompts can be represented by a human-engineered word sequence or by a learned continuous embedding. In this work, we investigate conditional and compositional differentiable prompting. We propose a new model, Prompt Production System (PRopS), which learns to transform task instructions or input metadata into continuous prompts that elicit task-specific outputs from the PLM. Our model uses a modular network structure based on our neural formulation of Production Systems, which allows the model to learn discrete rules -- neural functions that specialize in transforming particular prompt input patterns -- making it suitable for compositional transfer learning and few-shot learning. We present extensive empirical and theoretical analysis and show that PRopS consistently surpasses other PLM adaptation techniques, and often improves upon fully fine-tuned models, on compositional generalization tasks, controllable summarization and multilingual translation, while needing fewer trainable parameters.
    Comment: Accepted at International Joint Conference on Artificial Intelligence (IJCAI) 202
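    The underlying idea, conditional continuous prompting of a frozen PLM, can be sketched minimally as follows. The generator here is a plain linear map for brevity, whereas PRopS uses a modular production-system structure; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalPromptGenerator(nn.Module):
    """Sketch: map an encoded task instruction/metadata vector into a sequence
    of continuous prompt vectors prepended to the frozen PLM's input embeddings."""
    def __init__(self, cond_dim, plm_dim, prompt_len):
        super().__init__()
        self.prompt_len = prompt_len
        self.proj = nn.Linear(cond_dim, prompt_len * plm_dim)

    def forward(self, condition, input_embeds):
        # condition: (batch, cond_dim); input_embeds: (batch, seq, plm_dim)
        prompts = self.proj(condition).view(-1, self.prompt_len,
                                            input_embeds.size(-1))
        # Only self.proj is trained; the PLM consuming this sequence stays frozen.
        return torch.cat([prompts, input_embeds], dim=1)
```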

    Image Retrieval via CNNs in TensorFlow2

    This thesis addresses the problem of instance-level image retrieval in large-scale image collections, aiming to find the largest set of images corresponding to a query object. Convolutional neural networks (CNNs) have demonstrated their ability to provide effective descriptors for content-based image retrieval (CBIR). We therefore focus on using fine-tuned CNNs to extract global descriptors for the image retrieval problem. First, we examined the current state of the art in image retrieval, including methods such as GeM and DELF. The key contribution of this thesis is a TensorFlow 2 implementation of an extendable, highly customizable CBIR framework based on the work of Radenović et al. This approach produces retrieval results comparable to the state of the art while using relatively short descriptors. For validation, we trained the networks on the SfM120k landmark images dataset and performed experiments on two standard image retrieval benchmarks (revisited Oxford5k and Paris6k), using different training strategies, network architectures and loss functions to comprehensively evaluate the implemented approach. The final project code was merged into the official TensorFlow repository managed by Google, as part of the DELF research library.
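    For reference, GeM (generalized-mean) pooling, one of the methods examined, reduces a CNN feature map to a global descriptor via a learnable power mean: p = 1 recovers average pooling and large p approaches max pooling. A minimal PyTorch sketch of the standard formula follows (the thesis itself implements the framework in TensorFlow 2).

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized-mean pooling over a (batch, C, H, W) feature map,
    following the formulation of Radenović et al.; p is learnable."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)          # elementwise x^p
        return x.mean(dim=(2, 3)).pow(1.0 / self.p)    # (batch, C) descriptor
```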

    Hierarchical Context Encoder Leveraging Context Information and Memory Attention for Natural Language Processing

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: 정교민 (Kyomin Jung).
    Recently, the standard architecture for Natural Language Processing (NLP) has evolved from the Recurrent Neural Network to the Transformer architecture. The Transformer consists of attention layers, which show their strength at finding correlations between tokens and incorporating that information to generate proper output. While many studies leveraging the Transformer architecture report new state-of-the-art performance on various NLP tasks, these recent improvements pose a new challenge to the deep learning community: exploiting additional context information beyond the given input. Because human intelligence perceives signals in everyday life together with rich contextual information (e.g. additional memory, visual information, and common sense), exploiting context information is a step toward the ultimate goal of Artificial Intelligence. In this dissertation, I propose novel methodologies and analyses to improve the context-awareness of the Transformer architecture, focusing on the attention mechanism, for various natural language processing tasks. The proposed methods utilize additional context information, which is not limited to the modality of natural language, alongside the given input. First, I propose the Hierarchical Memory Context Encoder (HMCE), which efficiently embeds the contextual information of preceding sentences via a hierarchical Transformer architecture and fuses the embedded context representation into the input representation via a memory attention mechanism. The proposed HMCE outperforms the original Transformer, which does not leverage additional context information, on various context-aware machine translation tasks, and it achieves the best BLEU score among the baselines that use additional context. Then, to improve the attention mechanism between context representation and input representation, I analyze in depth the representational similarity between the two. Based on these analyses of representational similarity inside the Transformer architecture, I propose a method for optimizing Centered Kernel Alignment (CKA) between internal representations of the Transformer. The proposed CKA optimization method increases the performance of the Transformer on various machine translation and language modelling tasks. Lastly, I extend the CKA optimization method to a Modality Alignment method for multi-modal scenarios where the context information takes the modality of visual information. This Modality Alignment method enhances the cross-modal attention mechanism by maximizing the representational similarity between visual and natural language representations, resulting in improvements of more than 3.5% accuracy on video question answering tasks.
    Table of contents:
    1 Introduction
    2 Backgrounds
    3 Context-aware Hierarchical Transformer Architecture
      3.1 Related Works
        3.1.1 Using Multiple Sentences for Context-awareness in Machine Translation
        3.1.2 Structured Neural Machine Translation Models for Context-awareness
        3.1.3 Evaluating Context-awareness with Generated Translation
      3.2 Proposed Approach: Context-aware Hierarchical Text Encoder with Memory Networks
        3.2.1 Context-aware NMT Encoders
        3.2.2 Hierarchical Memory Context Encoder
      3.3 Experiments
        3.3.1 Data
        3.3.2 Hyperparameters and Training Details
        3.3.3 Overall BLEU Evaluation
        3.3.4 Model Complexity Analysis
        3.3.5 BLEU Evaluation on Helpful/Unhelpful Context
        3.3.6 Qualitative Analysis
        3.3.7 Limitations and Future Directions
      3.4 Conclusion
    4 Optimizing Representational Diversity of Transformer Architecture
      4.1 Related Works
        4.1.1 Analyses of Diversity in Multi-Head Attention
        4.1.2 Similarities between Deep Neural Representations
      4.2 Similarity Measures for Multi-Head Attention
        4.2.1 Multi-Head Attention
        4.2.2 Singular Vector Canonical Correlation Analysis (SVCCA)
        4.2.3 Centered Kernel Alignment (CKA)
      4.3 Proposed Approach: Controlling Inter-Head Diversity
        4.3.1 HSIC Regularizer
        4.3.2 Orthogonality Regularizer
        4.3.3 Drophead
      4.4 Inter-Head Similarity Analyses
        4.4.1 Experimental Details for Similarity Analysis
        4.4.2 Applying SVCCA and CKA
        4.4.3 Analyses on Inter-Model Similarity
        4.4.4 Does Multi-Head Strategy Diversify a Model's Representation Subspaces?
      4.5 Experiments on Controlling Inter-Head Similarity Methods
        4.5.1 Experimental Details
        4.5.2 Analysis on Controlling Inter-Head Diversity
        4.5.3 Quantitative Evaluation
        4.5.4 Limitations and Future Directions
      4.6 Conclusions
    5 Modality Alignment for Cross-modal Attention
      5.1 Related Works
        5.1.1 Representation Similarity between Modalities
        5.1.2 Video Question Answering
      5.2 Proposed Approach: Modality Alignment between Multi-modal Representations
        5.2.1 Centered Kernel Alignment Review
        5.2.2 Why CKA is Proper for Modality Alignment
        5.2.3 Proposed Method
      5.3 Experiments
        5.3.1 Cosine Similarity Learning with CKA
        5.3.2 Modality Alignment on the Video Question Answering Task
      5.4 Conclusion
    6 Conclusion
    Abstract (In Korean)
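    Linear CKA, the similarity measure central to chapters 4 and 5 above, has a standard closed form (Kornblith et al.). A minimal sketch follows; this is the textbook definition, not the dissertation's own code.

```python
import torch

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between representation matrices
    x: (n, d1) and y: (n, d2), rows indexing the same n examples."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    num = ((x.t() @ y) ** 2).sum()                        # ||X^T Y||_F^2
    den = torch.norm(x.t() @ x) * torch.norm(y.t() @ y)   # Frobenius norms
    return num / den                                      # in [0, 1]
```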

    Generative Models as Distributions of Functions

    Generative models are typically trained on grid-like data such as images. As a result, the size of these models usually scales directly with the underlying grid resolution. In this paper, we abandon discretized grids and instead parameterize individual data points by continuous functions. We then build generative models by learning distributions over such functions. By treating data points as functions, we can abstract away from the specific type of data we train on and construct models that are agnostic to discretization. To train our model, we use an adversarial approach with a discriminator that acts on continuous signals. Through experiments on a wide variety of data modalities including images, 3D shapes and climate data, we demonstrate that our model can learn rich distributions of functions independently of data type and resolution.
    Comment: Added experiments for learning distributions of functions on manifolds. Added more 3D experiments and comparisons to baseline
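    To make the function view concrete, a single image can be parameterized as a coordinate-to-RGB MLP and then sampled on a grid of any resolution. The sketch below illustrates this general setup under assumed names and layer sizes; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoordinateMLP(nn.Module):
    """Sketch: a data point represented as a continuous function mapping
    (x, y) coordinates in [0, 1]^2 to RGB values."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords):       # coords: (n_points, 2)
        return self.net(coords)      # (n_points, 3) RGB values

# The same function can be queried at any resolution: the "sample" is the
# function itself, not a fixed pixel grid.
f = CoordinateMLP()
for res in (32, 64):
    xs = torch.linspace(0, 1, res)
    grid = torch.stack(torch.meshgrid(xs, xs, indexing="ij"), dim=-1).reshape(-1, 2)
    rgb = f(grid)                    # (res*res, 3), discretization-agnostic
```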