Improving Efficiency in Deep Learning for Large Scale Visual Recognition
Recent large-scale visual recognition methods, and in particular deep Convolutional Neural Networks (CNNs), promise to revolutionize many computer-vision-based artificial intelligence applications, such as autonomous driving and online image retrieval systems. One of the main challenges in large-scale visual recognition is the complexity of the corresponding algorithms. This is further exacerbated by the fact that in most real-world scenarios they need to run in real time and on platforms with limited computational resources. This dissertation focuses on improving the efficiency of such large-scale visual recognition algorithms from several perspectives. First, to reduce the complexity of large-scale classification to sub-linear in the number of classes, a probabilistic label tree framework is proposed. A test sample is classified by traversing the label tree from the root node, where each node in the tree is associated with a probabilistic estimate over all labels. The tree is learned recursively with iterative maximum likelihood optimization. Compared to the hard label partitions proposed previously, the probabilistic framework performs classification more accurately with similar efficiency. Second, we explore the redundancy of parameters in CNNs and employ sparse decomposition to significantly reduce both the number of parameters and the computational complexity. Both inter-channel and inner-channel redundancy are exploited to achieve more than 90% sparsity with approximately a 1% drop in classification accuracy. We also propose an efficient CPU-based sparse matrix multiplication algorithm to reduce the actual running time of CNN models with sparse convolutional kernels. Third, we propose a multi-stage framework based on CNNs to achieve better efficiency than a single traditional CNN model. By combining a cascade model with the label tree framework, the proposed method divides the input images in both the image space and the label space, and processes each image with the CNN models that are most suitable and efficient for it. The average complexity of the framework is significantly reduced, while the overall accuracy remains the same as that of the single complex model.
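The sparse-kernel idea in the second contribution lends itself to a compact illustration. The sketch below is a minimal, hypothetical example rather than the dissertation's actual CPU kernel: it prunes the smallest-magnitude weights to roughly 90% sparsity and then computes the convolution as a SciPy CSR sparse-dense product over im2col patches.

```python
# Minimal sketch (not the dissertation's implementation) of convolution with
# sparse kernels: once most weights are zeroed, the convolution reduces to a
# sparse-dense matrix product over im2col patches.
import numpy as np
from scipy.sparse import csr_matrix

def sparse_conv2d(x, kernels, sparsity=0.9):
    """x: (C, H, W) input; kernels: (K, C, kh, kw) dense weights.
    Zeroes the smallest-magnitude weights to reach the target sparsity, then
    convolves via CSR sparse matrix multiplication (valid padding, stride 1)."""
    K, C, kh, kw = kernels.shape
    # Simple magnitude pruning stands in for the paper's sparse decomposition.
    thresh = np.quantile(np.abs(kernels).ravel(), sparsity)
    pruned = np.where(np.abs(kernels) >= thresh, kernels, 0.0)
    W = csr_matrix(pruned.reshape(K, -1))            # (K, C*kh*kw), ~90% zeros

    # im2col: gather every (C*kh*kw) receptive-field patch as one column.
    _, H, Wd = x.shape
    oh, ow = H - kh + 1, Wd - kw + 1
    cols = np.empty((C * kh * kw, oh * ow))
    idx = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                cols[idx] = x[c, i:i + oh, j:j + ow].ravel()
                idx += 1
    return np.asarray(W @ cols).reshape(K, oh, ow)   # sparse-dense product

# Usage: y = sparse_conv2d(np.random.randn(3, 32, 32), np.random.randn(8, 3, 3, 3))
```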
Visual Instruction Tuning with Polite Flamingo
Recent research has demonstrated that the multi-task fine-tuning of
multi-modal Large Language Models (LLMs) using an assortment of annotated
downstream vision-language datasets significantly enhances their performance.
Yet, during this process, a side effect that we term the "multi-modal
alignment tax" surfaces. This side effect negatively impacts the model's
ability to format responses appropriately -- for instance, its "politeness" --
due to the overly succinct and unformatted nature of raw annotations, resulting
in reduced human preference. In this paper, we introduce Polite Flamingo, a
multi-modal response rewriter that transforms raw annotations into a more
appealing, "polite" format. Polite Flamingo is trained to reconstruct
high-quality responses from their automatically distorted counterparts and is
subsequently applied to a vast array of vision-language datasets for response
rewriting. After rigorous filtering, we generate the PF-1M dataset and further
validate its value by fine-tuning a multi-modal LLM with it. Combined with
novel methodologies including U-shaped multi-stage tuning and multi-turn
augmentation, the resulting model, Clever Flamingo, demonstrates its advantages
in both multi-modal understanding and response politeness according to
automated and human evaluations. Comment: In AAAI-2
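The distort-then-reconstruct recipe described above can be sketched in a few lines. The example below is purely illustrative: the specific degradations and function names are assumptions, not the paper's actual pipeline, but they show how (distorted, polite) training pairs for a rewriter could be generated automatically.

```python
# Hypothetical sketch of building supervision for a response rewriter:
# high-quality responses are automatically degraded, and the rewriter is
# trained to map each degraded response back to its polite original.
import random
import re

def distort(response: str) -> str:
    """Apply one or more cheap text degradations to mimic terse raw annotations."""
    ops = [
        lambda s: re.sub(r"\n+", " ", s),                # drop paragraph structure
        lambda s: s.split(".")[0].strip() + ".",         # keep only the first sentence
        lambda s: s.lower(),                             # flatten casing
        lambda s: re.sub(r"\b(please|thank you|sure,?)\b", "", s, flags=re.I),  # remove polite phrasing
    ]
    for op in random.sample(ops, k=random.randint(1, len(ops))):
        response = op(response)
    return response.strip()

def build_rewriter_pairs(high_quality_responses):
    """Yield (input, target) pairs for supervised fine-tuning of the rewriter."""
    for target in high_quality_responses:
        yield distort(target), target

# Usage:
# pairs = list(build_rewriter_pairs(["Sure, here is a detailed answer.\n\nFirst, ..."]))
```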
An experimental study of the natural characteristics of the wheeled and the tracked self-propelled guns
Taking wheeled and tracked self-propelled guns as the research objects, this study carries out experimental modal analysis using both the traditional hammering method and the operational modal analysis method, obtains the low-order natural frequencies of the guns, and thus lays the foundation for further research on the vibration characteristics of wheeled and tracked self-propelled guns. By contrasting the low-order natural characteristics of wheeled and tracked self-propelled guns, the following conclusions can be drawn: the modal shapes (from low to high) of wheeled and tracked self-propelled guns are pitch, translation and roll; when the modal shapes are identical, the natural frequency of the tracked self-propelled gun is greater than that of the wheeled self-propelled gun, which accords with the test results of the guns' suspension equivalent stiffness; for wheeled self-propelled guns, the natural characteristics can be measured accurately by either the traditional or the operational modal analysis method, while for tracked self-propelled guns, the operational modal analysis method is more accurate.
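The link drawn between suspension equivalent stiffness and natural frequency follows the standard single-degree-of-freedom relation; the expression below is a textbook identity used to make that reasoning step explicit, not a result from the study.

```latex
% Undamped single-degree-of-freedom natural frequency: a stiffer equivalent
% suspension (larger k) at comparable mass m gives a higher natural frequency,
% consistent with the tracked gun measuring above the wheeled gun for the
% same mode shape.
\[
  f_n = \frac{1}{2\pi}\sqrt{\frac{k_{\mathrm{eq}}}{m}},
  \qquad
  k_{\mathrm{eq}}^{\text{tracked}} > k_{\mathrm{eq}}^{\text{wheeled}}
  \;\Rightarrow\;
  f_n^{\text{tracked}} > f_n^{\text{wheeled}}.
\]
```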
Towards Objectively Benchmarking Social Intelligence for Language Agents at Action Level
Prominent large language models have exhibited human-level performance in
many domains, even enabling the derived agents to simulate human and social
interactions. While practical works have substantiated the practicability of
grounding language agents in sandbox simulation or embodied simulators, current
social intelligence benchmarks either stay at the language level or use
subjective metrics. In pursuit of a more realistic and objective evaluation, we
introduce the Social Tasks in Sandbox Simulation (STSS) benchmark, which
assesses language agents objectively at the action level by
scrutinizing the goal achievements within the multi-agent simulation.
Additionally, we sample conversation scenarios to build a language-level
benchmark to provide an economically prudent preliminary evaluation and align
with prevailing benchmarks. To gauge the significance of agent architecture, we
implement a target-driven planning (TDP) module as an adjunct to the existing
agent. Our evaluative findings highlight that the STSS benchmark is challenging
for state-of-the-art language agents. Furthermore, it effectively discriminates
between distinct language agents, suggesting its usefulness as a benchmark for
evaluating both language models and agent architectures.
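Action-level scoring of the kind the abstract describes can be made concrete with a small sketch. The interfaces below (Episode, goal_check) are placeholders and not the benchmark's actual API: the point is that each episode is judged by whether the agent's actions achieved the task goal in the simulated world state, rather than by rating dialogue text.

```python
# Hypothetical sketch of action-level evaluation in the spirit of STSS:
# score each sandbox episode by a task-specific goal predicate over the
# final simulation state, then report the goal-achievement rate.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Episode:
    task_id: str
    final_state: dict                     # world state after the multi-agent rollout
    goal_check: Callable[[dict], bool]    # task-specific goal predicate over the state

def action_level_score(episodes: List[Episode]) -> float:
    """Fraction of episodes whose goal predicate holds on the final simulation state."""
    if not episodes:
        return 0.0
    achieved = sum(1 for ep in episodes if ep.goal_check(ep.final_state))
    return achieved / len(episodes)

# Usage with a toy "deliver the invitation" task:
# ep = Episode("invite_friend", {"invitation_delivered": True},
#              lambda s: s.get("invitation_delivered", False))
# print(action_level_score([ep]))   # 1.0
```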
- …