A Comprehensive Trainable Error Model for Sung Music Queries
We propose a model for errors in sung queries, a variant of the hidden Markov
model (HMM). This is a solution to the problem of identifying the degree of
similarity between a (typically error-laden) sung query and a potential target
in a database of musical works, an important problem in the field of music
information retrieval. Similarity metrics are a critical component of
query-by-humming (QBH) applications which search audio and multimedia databases
for strong matches to oral queries. Our model comprehensively expresses the
types of error or variation between target and query: cumulative and
non-cumulative local errors, transposition, tempo and tempo changes,
insertions, deletions and modulation. The model is not only expressive, but
automatically trainable, or able to learn and generalize from query examples.
We present results of simulations, designed to assess the discriminatory
potential of the model, and tests with real sung queries, to demonstrate
relevance to real-world applications.
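As a rough illustration of the kind of alignment such an error model performs, the sketch below scores a sung pitch sequence against candidate targets with a toy left-to-right HMM whose transitions stand in for insertions, deletions, and local errors. The Gaussian emission model, function names, and parameter values are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch (not the authors' implementation): score a sung query against
# a target melody with a toy HMM whose hidden states are target note indices
# and whose emissions model local pitch error.
import numpy as np

def forward_log_likelihood(query, target, self_loop=0.1, skip=0.1, sigma=1.0):
    """Log P(query | target) under a left-to-right HMM over target notes."""
    n, m = len(target), len(query)
    # Transition log-probs: stay (insertion-like), advance, skip one (deletion-like).
    advance = 1.0 - self_loop - skip
    log_trans = np.log([self_loop, advance, skip])

    # Emission: Gaussian on pitch difference, a crude stand-in for the paper's
    # trainable error distributions.
    def log_emit(q_pitch, t_pitch):
        return -0.5 * ((q_pitch - t_pitch) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

    log_alpha = np.full(n, -np.inf)
    log_alpha[0] = log_emit(query[0], target[0])
    for t in range(1, m):
        prev = log_alpha
        log_alpha = np.full(n, -np.inf)
        for j in range(n):
            candidates = [prev[j] + log_trans[0]]              # stay on same note
            if j >= 1:
                candidates.append(prev[j - 1] + log_trans[1])  # advance one note
            if j >= 2:
                candidates.append(prev[j - 2] + log_trans[2])  # skip a note
            log_alpha[j] = np.logaddexp.reduce(candidates) + log_emit(query[t], target[j])
    return np.logaddexp.reduce(log_alpha)

# Rank targets by the likelihood of having generated the (error-laden) query.
query = [60, 62, 64, 65]                                       # MIDI pitches sung by the user
targets = {"tune_a": [60, 62, 64, 65, 67], "tune_b": [60, 59, 57, 55]}
scores = {name: forward_log_likelihood(query, t) for name, t in targets.items()}
print(max(scores, key=scores.get))
```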
Modeling Time-Series and Spatial Data for Recommendations and Other Applications
With the research directions described in this thesis, we seek to address the
critical challenges in designing recommender systems that can understand the
dynamics of continuous-time event sequences (CTES). We follow a ground-up approach,
i.e., first, we address the problems that may arise due to the poor quality of
CTES data being fed into a recommender system. Later, we handle the task of
designing accurate recommender systems. To improve the quality of the CTES
data, we address a fundamental problem of overcoming missing events in temporal
sequences. Moreover, to provide accurate sequence modeling frameworks, we
design solutions for points-of-interest (POI) recommendation, i.e., models that can
handle spatial mobility data of users to various POI check-ins and recommend
candidate locations for the next check-in. Lastly, we highlight that the
capabilities of the proposed models can have applications beyond recommender
systems, and we extend their abilities to design solutions for large-scale CTES
retrieval and human activity prediction. A significant part of this thesis uses
the idea of modeling the underlying distribution of CTES via neural marked
temporal point processes (MTPP). Traditional MTPP models are stochastic
processes that utilize a fixed formulation to capture the generative mechanism
of a sequence of discrete events localized in continuous time. In contrast,
neural MTPPs combine the underlying ideas from the point process literature with
modern deep learning architectures. The ability of deep-learning models as
accurate function approximators has led to a significant gain in the predictive
prowess of neural MTPP models. In this thesis, we utilize and present several
neural network-based enhancements for the current MTPP frameworks for the
aforementioned real-world applications.
Comment: Ph.D. Thesis (2022)
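For readers unfamiliar with neural MTPPs, the following is a minimal sketch, assuming PyTorch, of the general recipe referred to above: a recurrent network summarizes the event history and parameterizes a conditional intensity and a mark distribution. The module, its names, and the constant-intensity likelihood approximation are illustrative simplifications, not the thesis's models.

```python
# A toy neural marked temporal point process: a GRU cell encodes past events,
# and its hidden state parameterizes the intensity and the next-mark distribution.
import torch
import torch.nn as nn

class NeuralMTPP(nn.Module):
    def __init__(self, num_marks, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(num_marks, hidden)
        self.rnn = nn.GRUCell(hidden + 1, hidden)      # input: mark embedding + inter-event time
        self.intensity = nn.Linear(hidden, 1)          # conditional intensity after each event
        self.mark_head = nn.Linear(hidden, num_marks)  # distribution of the next mark

    def forward(self, times, marks):
        """times: (T,) event timestamps, marks: (T,) integer mark ids."""
        h = torch.zeros(1, self.rnn.hidden_size)
        log_lik = 0.0
        prev_t = torch.tensor(0.0)
        for t, k in zip(times, marks):
            dt = (t - prev_t).view(1, 1)
            lam = nn.functional.softplus(self.intensity(h))               # intensity at the event
            mark_logp = torch.log_softmax(self.mark_head(h), dim=-1)[0, k]
            # Constant-intensity approximation of the compensator over (prev_t, t].
            log_lik = log_lik + torch.log(lam).squeeze() - (lam * dt).squeeze() + mark_logp
            h = self.rnn(torch.cat([self.embed(k.view(1)), dt], dim=-1), h)
            prev_t = t
        return log_lik  # maximize this log-likelihood over observed sequences

# Example: ll = NeuralMTPP(num_marks=5)(torch.tensor([0.5, 1.2, 2.0]), torch.tensor([0, 2, 1]))
```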
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic
speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Unifying Token and Span Level Supervisions for Few-Shot Sequence Labeling
Few-shot sequence labeling aims to identify novel classes based on only a few
labeled samples. Existing methods solve the data scarcity problem mainly by
designing token-level or span-level labeling models based on metric learning.
However, these methods are only trained at a single granularity (i.e., either
token level or span level) and have some weaknesses of the corresponding
granularity. In this paper, we first unify token and span level supervisions
and propose a Consistent Dual Adaptive Prototypical (CDAP) network for few-shot
sequence labeling. CDAP contains the token-level and span-level networks,
jointly trained at different granularities. To align the outputs of two
networks, we further propose a consistent loss to enable them to learn from
each other. During the inference phase, we propose a consistent greedy
inference algorithm that first adjusts the predicted probability and then
greedily selects non-overlapping spans with maximum probability. Extensive
experiments show that our model achieves new state-of-the-art results on three
benchmark datasets.
Comment: Accepted by ACM Transactions on Information Systems
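A hedged sketch of what the consistent greedy inference step could look like: span probabilities from the two granularities are first combined (a simple weighted average is assumed here as the adjustment), then spans are selected greedily by probability while discarding overlaps. The function name and combination rule are illustrative, not taken from the paper.

```python
# Greedy selection of non-overlapping labeled spans from adjusted probabilities.
def consistent_greedy_inference(span_probs, token_probs, alpha=0.5):
    """span_probs / token_probs: dicts mapping (start, end, label) -> probability."""
    adjusted = {
        key: alpha * span_probs.get(key, 0.0) + (1 - alpha) * token_probs.get(key, 0.0)
        for key in set(span_probs) | set(token_probs)
    }
    selected, occupied = [], set()
    for (start, end, label), p in sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True):
        positions = set(range(start, end + 1))
        if positions & occupied:
            continue  # overlaps an already-selected span, so skip it
        selected.append((start, end, label, p))
        occupied |= positions
    return selected

spans = {(0, 1, "PER"): 0.9, (1, 2, "ORG"): 0.6, (3, 3, "LOC"): 0.8}
tokens = {(0, 1, "PER"): 0.8, (3, 3, "LOC"): 0.7}
print(consistent_greedy_inference(spans, tokens))
```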
MetaRec: Meta-Learning Meets Recommendation Systems
Artificial neural networks (ANNs) have recently received increasing attention as powerful modeling tools to improve the performance of recommendation systems. Meta-learning, on the other hand, is a paradigm that has re-surged in popularity within the broader machine learning community over the past several years. In this thesis, we will explore the intersection of these two domains and work on developing methods for integrating meta-learning to design more accurate and flexible recommendation systems.
In the present work, we propose a meta-learning framework for the design of collaborative filtering methods in recommendation systems, drawing from ideas, models, and solutions from modern approaches in both the meta-learning and recommendation system literature, applying them to recommendation tasks to obtain improved generalization performance.
Our proposed framework, MetaRec, includes and unifies the main state-of-the-art models in recommendation systems, extending them to be flexibly configured and to operate efficiently with limited data. We empirically test the architectures created under our MetaRec framework on several recommendation benchmark datasets using a plethora of evaluation metrics and find that by taking a meta-learning approach to the collaborative filtering problem, we observe notable gains in predictive performance.
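As an illustration of how meta-learning can be wired into collaborative filtering, the sketch below (assuming PyTorch 2.x) treats each user as a task and performs a MAML-style inner adaptation on a few support ratings before computing the meta-loss on query ratings. This is a generic pattern for illustration, not the MetaRec implementation.

```python
# MAML-style meta-learning for a simple rating model: adapt per user, then
# update shared parameters from the adapted models' query losses.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # item features -> rating
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def adapted_params(params, x_support, y_support, inner_lr=0.01):
    """One inner-loop gradient step on a user's few support ratings."""
    preds = torch.func.functional_call(model, params, (x_support,))
    grads = torch.autograd.grad(loss_fn(preds, y_support), tuple(params.values()), create_graph=True)
    return {name: p - inner_lr * g for (name, p), g in zip(params.items(), grads)}

def meta_step(user_tasks):
    """user_tasks: list of (x_support, y_support, x_query, y_query) tensors."""
    meta_opt.zero_grad()
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in user_tasks:
        params = dict(model.named_parameters())
        fast = adapted_params(params, x_s, y_s)
        meta_loss = meta_loss + loss_fn(torch.func.functional_call(model, fast, (x_q,)), y_q)
    (meta_loss / len(user_tasks)).backward()
    meta_opt.step()

# Usage: meta_step([(torch.randn(5, 16), torch.randn(5, 1), torch.randn(3, 16), torch.randn(3, 1))])
```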
Review: Deep learning in electron microscopy
Deep learning is transforming most areas of science and technology, including electron microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity with deep learning. For context, we review popular applications of deep learning in electron microscopy. Next, we discuss hardware and software needed to get started with deep learning and interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; (c) highlight some future emphases and relatively recent
research issues that arise due to the increasing synergy between NLG and other
artificial intelligence areas, such as computer vision, text, and
computational creativity.
Comment: Accepted by ACM Computing Surveys (CSUR) 202
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning
Vision-language pre-training and instruction tuning have demonstrated
general-purpose capabilities in 2D visual reasoning tasks by aligning visual
encoders with state-of-the-art large language models (LLMs). In this paper, we
introduce a simple, yet effective, cross-modality framework built atop frozen
LLMs that allows the integration of various modalities without extensive
modality-specific customization. To facilitate instruction-modality
fine-tuning, we collect high-quality instruction tuning data in an automatic
and scalable manner, composed of 24K QA samples for audio and 250K QA samples
for 3D. Leveraging instruction-aware representations, our model performs
comparably with leading-edge counterparts without the need for extensive
modality-specific pre-training or customization. Furthermore, our approach
demonstrates cross-modal reasoning abilities across two or more input
modalities, despite each modality projection being trained individually. To
study the model's cross-modal abilities, we contribute a novel Discriminative
Cross-modal Reasoning (DisCRn) evaluation task, comprising 9K audio-video QA
samples and 28K image-3D QA samples that require the model to reason
discriminatively across disparate input modalities.
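To make the "frozen LLM plus per-modality projection" pattern concrete, here is a minimal sketch, assuming PyTorch, in which a small trainable projector maps frozen-encoder features to a handful of soft tokens that are prefixed to the instruction embeddings. Module names, dimensions, and token counts are illustrative assumptions rather than the paper's exact architecture.

```python
# Project frozen modality-encoder features into the LLM embedding space and
# prepend them to the instruction's token embeddings; only the projector trains.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Maps frozen-encoder features to a few soft tokens in the LLM embedding space."""
    def __init__(self, enc_dim, llm_dim, num_tokens=8):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim * num_tokens)
        self.num_tokens, self.llm_dim = num_tokens, llm_dim

    def forward(self, feats):                       # feats: (batch, enc_dim)
        out = self.proj(feats)                      # (batch, llm_dim * num_tokens)
        return out.view(-1, self.num_tokens, self.llm_dim)

def build_llm_inputs(modality_tokens, instruction_embeds):
    """Prefix the projected modality tokens to the instruction's token embeddings."""
    return torch.cat([modality_tokens, instruction_embeds], dim=1)

projector = ModalityProjector(enc_dim=512, llm_dim=4096)
audio_feats = torch.randn(2, 512)                   # from a frozen audio encoder
instruction = torch.randn(2, 16, 4096)              # embedded instruction text
inputs_embeds = build_llm_inputs(projector(audio_feats), instruction)
# inputs_embeds would then be fed to the frozen LLM via its inputs_embeds argument.
```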
Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity
This survey addresses the crucial issue of factuality in Large Language
Models (LLMs). As LLMs find applications across diverse domains, the
reliability and accuracy of their outputs become vital. We define the
Factuality Issue as the probability that LLMs produce content inconsistent
with established facts. We first delve into the implications of these
inaccuracies, highlighting the potential consequences and challenges posed by
factual errors in LLM outputs. Subsequently, we analyze the mechanisms through
which LLMs store and process facts, seeking the primary causes of factual
errors. Our discussion then transitions to methodologies for evaluating LLM
factuality, emphasizing key metrics, benchmarks, and studies. We further
explore strategies for enhancing LLM factuality, including approaches tailored
for specific domains. We focus on two primary LLM configurations, standalone
LLMs and Retrieval-Augmented LLMs that utilize external data, and we detail
their unique challenges and potential enhancements. Our survey offers a
structured guide for researchers aiming to fortify the factual reliability of
LLMs.
Comment: 62 pages; 300+ references
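To make the distinction between the two configurations concrete, the sketch below contrasts a standalone LLM prompt with a retrieval-augmented one that conditions the answer on fetched evidence. The `generate` and `retrieve` callables are placeholder interfaces assumed for illustration, not any specific library's API.

```python
# Standalone vs. retrieval-augmented answering: the latter grounds the prompt
# on retrieved passages before generation.
from typing import Callable, List

def standalone_answer(question: str, generate: Callable[[str], str]) -> str:
    # The model answers from its parametric knowledge alone.
    return generate(f"Answer factually:\n{question}")

def retrieval_augmented_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    # Fetch external evidence, then condition the prompt on it.
    passages = retrieve(question, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the evidence below and cite passage numbers.\n"
        f"Evidence:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```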