Maximum Likelihood Pitch Estimation Using Sinusoidal Modeling
The aim of the work presented in this thesis is to automatically extract the fundamental frequency of a periodic signal from noisy observations, a task commonly referred to as pitch estimation. An algorithm for optimal pitch estimation using a maximum likelihood formulation is presented. The speech waveform is modeled using sinusoidal basis functions that are harmonically tied together to explicitly capture the periodic structure of voiced speech. The problem of pitch estimation is cast as a model selection problem, and the Akaike Information Criterion is used to estimate the pitch. The algorithm is compared with several existing pitch detection algorithms (PDAs) on a reference pitch database, and the results indicate that it outperforms most of the PDAs. The application of parametric modeling to single-channel speech segregation and the use of mel-frequency cepstral coefficients for sequential grouping are analyzed on the speech separation challenge database.
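As a rough illustration of this kind of approach, the sketch below fits harmonically related sinusoids to a signal frame by least squares for each candidate pitch and selects the candidate that minimizes the Akaike Information Criterion. The pitch grid, harmonic count, and Gaussian-noise AIC form are assumptions chosen for illustration, not the thesis's exact formulation.

```python
# Hedged sketch: harmonic sinusoidal model + AIC-based pitch selection.
import numpy as np

def harmonic_basis(f0, n_harmonics, n_samples, fs):
    """Design matrix of cosines/sines at integer multiples of f0."""
    t = np.arange(n_samples) / fs
    cols = []
    for h in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * h * f0 * t))
        cols.append(np.sin(2 * np.pi * h * f0 * t))
    return np.column_stack(cols)

def estimate_pitch(frame, fs, f0_grid=np.arange(60.0, 400.0, 1.0)):
    """Pick the candidate f0 whose harmonic least-squares fit minimizes AIC."""
    n = len(frame)
    best_f0, best_aic = None, np.inf
    for f0 in f0_grid:
        n_harm = max(1, int((fs / 2) // f0))        # harmonics below Nyquist
        basis = harmonic_basis(f0, n_harm, n, fs)
        coeffs, *_ = np.linalg.lstsq(basis, frame, rcond=None)
        rss = np.sum((frame - basis @ coeffs) ** 2)
        k = basis.shape[1] + 1                      # amplitudes + noise variance
        aic = n * np.log(rss / n + 1e-12) + 2 * k   # AIC up to an additive constant
        if aic < best_aic:
            best_f0, best_aic = f0, aic
    return best_f0
```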
High Resolution Numerical Methods for Coupled Non-linear Multi-physics Simulations with Applications in Reactor Analysis
The modeling of nuclear reactors involves the solution of a multi-physics problem with widely varying time and length scales. This translates mathematically to solving a system of coupled, non-linear, and stiff partial differential equations (PDEs). Multi-physics applications possess the added complexity that most of the solution fields participate in various physics components, potentially yielding spatial and/or temporal coupling errors. This dissertation deals with the verification aspects associated with such a multi-physics code, i.e., the substantiation that the mathematical description of the multi-physics equations is solved correctly (both in time and space). Conventional paradigms used in reactor analysis problems to couple various physics components are often non-iterative and can be inconsistent in their treatment of the non-linear terms. This leads to the use of smaller time steps to maintain stability and accuracy requirements, thereby increasing the overall computational time for simulation. The inconsistencies of these weakly coupled solution methods can be overcome using tighter coupling strategies, which yield a better approximation to the coupled non-linear operator by resolving the dominant spatial and temporal scales involved in the multi-physics simulation. A multi-physics framework, KARMA (K(c)ode for Analysis of Reactor and other Multi-physics Applications), is presented. KARMA uses tight coupling strategies for various physical models, based on a Matrix-free Nonlinear-Krylov (MFNK) framework, in order to attain high-order spatio-temporal accuracy for all solution fields within reasonable wall-clock times for various test problems. The framework also utilizes traditional loosely coupled methods as lower-order solvers, which serve as efficient preconditioners for the tightly coupled solution. Since the software platform employs both lower- and higher-order coupling strategies, it can easily be used to test and evaluate different coupling strategies and numerical methods and to compare their efficiency for problems of interest. Multi-physics code verification efforts pertaining to reactor applications are described, and associated numerical results obtained using the developed multi-physics framework are provided. The versatility of the numerical methods used here for coupled problems, and the feasibility of general non-linear solvers with appropriate physics-based preconditioners in the KARMA framework, offer efficient techniques to solve multi-physics problems in reactor analysis.
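The tight-coupling idea can be illustrated with a toy monolithic residual solved by a matrix-free Newton-Krylov method. The sketch below uses SciPy's newton_krylov as a stand-in for KARMA's MFNK framework; the two-field "physics" is invented purely for illustration and is not a reactor model.

```python
# Hedged sketch: tight coupling of two fields via a matrix-free Newton-Krylov solve.
import numpy as np
from scipy.optimize import newton_krylov

N = 50                                # grid points per field (toy problem)
x = np.linspace(0.0, 1.0, N)

def residual(u):
    """Monolithic residual for two coupled fields: phi (flux-like) and T (temperature-like)."""
    phi, T = u[:N], u[N:]
    r_phi = phi - np.exp(-T) - 0.1 * np.sin(np.pi * x)   # phi depends non-linearly on T
    r_T = T - 0.5 * phi**2                               # T depends non-linearly on phi
    return np.concatenate([r_phi, r_T])

# Tight coupling: one Newton-Krylov solve over the full residual, with
# Jacobian-vector products formed matrix-free by finite differences.
u0 = np.zeros(2 * N)
u = newton_krylov(residual, u0, f_tol=1e-10)
phi, T = u[:N], u[N:]
print("max residual:", np.abs(residual(u)).max())
```

A loosely coupled (operator-split) scheme would instead alternate separate solves for phi and T, which is the kind of lower-order method the framework reuses as a preconditioner.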
Multiple-Question Multiple-Answer Text-VQA
We present Multiple-Question Multiple-Answer (MQMA), a novel approach to text-VQA in encoder-decoder transformer models. The text-VQA task requires a model to answer a question by understanding multi-modal content: text (typically from OCR) and an associated image. To the best of our knowledge, almost all previous approaches for text-VQA process a single question and its associated content to predict a single answer, so answering multiple questions about the same image requires feeding each question and the content into the model separately. In contrast, our proposed MQMA approach takes multiple questions and the content as input at the encoder and predicts multiple answers at the decoder in an auto-regressive manner at the same time. We make several novel architectural modifications to standard encoder-decoder transformers to support MQMA. We also propose a novel MQMA denoising pre-training task designed to teach the model to align and delineate multiple questions and content with their associated answers. The MQMA pre-trained model achieves state-of-the-art results on multiple text-VQA datasets, each with strong baselines: absolute improvements of +2.5% on OCR-VQA, +1.4% on TextVQA, +0.6% on ST-VQA, and +1.1% on DocVQA over the previous state-of-the-art approaches.
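A minimal sketch of the multiple-question, multiple-answer packing idea is given below. It uses an off-the-shelf T5 encoder-decoder and text-only OCR content as a stand-in; the delimiter scheme, example strings, and the absence of MQMA's architectural modifications and image features are all assumptions for illustration.

```python
# Hedged sketch: pack several questions plus shared OCR content into one encoder
# input and supervise the decoder to emit all answers in one sequence.
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ocr_text = "TOTAL 12.50 CASH 20.00 CHANGE 7.50"          # OCR tokens from the image
questions = ["What is the total?", "How much change was given?"]

# Pack all questions and the shared content into a single encoder input.
source = " ".join(f"question {i}: {q}" for i, q in enumerate(questions))
source += " context: " + ocr_text

# Target: all answers in one auto-regressive sequence, delimited per question.
target = "answer 0: 12.50 answer 1: 7.50"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss                # one forward pass for all questions
print(float(loss))
```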
Learning Optimal Seeds for Diffusion-Based Salient Object Detection
In diffusion-based saliency detection, an image is partitioned into superpixels and mapped to a graph, with superpixels as nodes and edge strengths proportional to superpixel similarity. Saliency information is then propagated over the graph using a diffusion process, whose equilibrium state yields the object saliency map. The optimal solution is the product of a propagation matrix and a saliency seed vector that contains a prior saliency assessment. This is obtained from either a bottom-up saliency detector or some heuristics. In this work, we propose a method to learn optimal seeds for object saliency. Two types of features are computed per superpixel: the bottom-up saliency of the superpixel region and a set of mid-level vision features informative of how likely the superpixel is to belong to an object. The combination of features that best discriminates between object and background saliency is then learned, using a large-margin formulation of the discriminant saliency principle. The propagation of the resulting saliency seeds, using a diffusion process, is finally shown to outperform the state of the art on a number of salient object detection datasets.
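The diffusion step can be sketched as the product of a propagation matrix and a seed vector, with the seed computed as a learned linear combination of per-superpixel features. The affinity normalization, the diffusion constant alpha, and the weight vector w below are placeholders, not the paper's trained model.

```python
# Hedged sketch: diffusion of learned saliency seeds over a superpixel graph.
import numpy as np

def diffuse_saliency(features, affinity, w, alpha=0.99):
    """features: (n, d) per-superpixel descriptors
    affinity: (n, n) symmetric, non-negative superpixel similarity graph
    w: (d,) seed weights, assumed learned with a large-margin objective."""
    seeds = features @ w                              # learned saliency seed vector
    d = affinity.sum(axis=1)
    S = affinity / np.sqrt(np.outer(d, d))            # symmetrically normalized graph
    n = affinity.shape[0]
    propagation = np.linalg.inv(np.eye(n) - alpha * S)
    saliency = propagation @ seeds                    # equilibrium of the diffusion
    return (saliency - saliency.min()) / (saliency.ptp() + 1e-12)
```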
DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models
Encoder-decoder transformer models have achieved great success on various vision-language (VL) tasks, but they suffer from high inference latency. Typically, the decoder accounts for most of the latency because of the auto-regressive decoding. To accelerate inference, we propose an approach of performing Dynamic Early Exit on Decoder (DEED). We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions. In addition, we leverage simple yet practical techniques, including a shared generation head and adaptation modules, to maintain accuracy when exiting at shallow decoder layers. Based on the multi-exit model, we perform step-level dynamic early exit during inference, where the model may decide to use fewer decoder layers based on its confidence at the current layer at each individual decoding step. Because different numbers of decoder layers may be used at different decoding steps, we compute the deeper-layer decoder features of previous decoding steps just-in-time, which ensures that the features from different decoding steps are semantically aligned. We evaluate our approach with two state-of-the-art encoder-decoder transformer models on various VL tasks and show that it can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to the baselines.
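A conceptual sketch of the step-level exit decision is given below. The decoder-layer and generation-head interfaces are assumed, the confidence measure is a plain max-softmax threshold, and the just-in-time recomputation of deeper-layer features for earlier steps is only noted in a comment rather than implemented.

```python
# Hedged sketch: step-level dynamic early exit with a shared generation head.
import torch

def decode_step(decoder_layers, shared_head, hidden, encoder_out, threshold=0.9):
    """Run decoder layers for one step, exiting once the prediction is confident."""
    for depth, layer in enumerate(decoder_layers):
        hidden = layer(hidden, encoder_out)          # assumed layer signature
        logits = shared_head(hidden[:, -1])          # shared generation head at every exit
        probs = torch.softmax(logits, dim=-1)
        confidence, token = probs.max(dim=-1)
        if confidence.item() >= threshold:           # confident enough: exit early
            return token, depth + 1                  # number of layers actually used
    return token, len(decoder_layers)

# During generation, if a later step needs deeper layers than an earlier step
# used, the deeper-layer features of the earlier steps would be recomputed
# just-in-time so self-attention sees semantically aligned features.
```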
- …