Accurate Human Motion Capture and Modeling using Low-cost Sensors
Motion capture technologies, especially those that combine multiple sensing modalities to capture both kinematic and dynamic information, are widely used in fields such as biomechanics, robotics, and health. However, many existing systems are intrusive, restrictive, and expensive.
This dissertation explores two aspects of motion capture systems that are low-cost, non-intrusive, highly accurate, and easy for ordinary users to use: full-body kinematics and dynamics capture, and user-specific hand modeling.
More specifically, we present a new method for full-body motion capture that uses input data captured by three depth cameras and a pair of pressure-sensing shoes. Our system is appealing because it is fully automatic and can accurately reconstruct both full-body kinematic and dynamic data. We introduce a highly accurate tracking process that automatically reconstructs 3D skeletal poses using depth data, foot pressure data, and detailed full-body geometry. We also develop an efficient physics-based motion reconstruction algorithm for solving internal joint torques and contact forces based on contact pressure information and 3D poses from the kinematic tracking process.
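As a rough illustration of the kind of physics-based reconstruction step described above, the sketch below recovers joint torques from tracked poses and measured contact forces via the standard manipulator equation. Here mass_matrix, bias_forces, and contact_jacobian are hypothetical stand-ins for a rigid-body dynamics library, not components of the dissertation's system.

import numpy as np

def inverse_dynamics(q, qd, qdd, f_contact, mass_matrix, bias_forces, contact_jacobian):
    """Solve M(q) qdd + h(q, qd) = tau + Jc(q)^T f_contact for the joint torques tau.

    q, qd, qdd: joint positions, velocities, accelerations, each shape (n,)
    f_contact: stacked contact forces, shape (m,)
    The three callables are assumed stand-ins for a rigid-body dynamics backend.
    """
    M = mass_matrix(q)            # (n, n) joint-space inertia matrix
    h = bias_forces(q, qd)        # (n,) Coriolis, centrifugal and gravity terms
    Jc = contact_jacobian(q)      # (m, n) maps joint velocities to contact-point velocities
    # Rearranged manipulator equation: tau = M qdd + h - Jc^T f_contact
    return M @ qdd + h - Jc.T @ f_contact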
In addition, we present a novel low-dimensional parametric model for 3D hand modeling and synthesis. We construct a low-dimensional parametric model to compactly represent hand shape variations across individuals and enhance it by adding Linear Blend Skinning (LBS) for pose deformation. We also introduce an efficient iterative approach to learn the parametric model from a large unaligned scan database. Our model is compact, expressive, and produces a natural-looking LBS model for pose deformation, which allows for a variety of applications ranging from user-specific hand modeling to skinning weights transfer and model-based hand tracking.
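For readers unfamiliar with LBS, the following minimal sketch shows the pose-deformation step such a model builds on: each vertex is deformed by a weighted blend of per-joint rigid transforms, on top of a low-dimensional linear shape space. Variable names and shapes are illustrative, not taken from the dissertation.

import numpy as np

def lbs(vertices, weights, joint_transforms):
    """Linear Blend Skinning.

    vertices: (V, 3) rest-pose vertices
    weights: (V, J) skinning weights (rows sum to 1)
    joint_transforms: (J, 4, 4) rigid transform of each joint
    returns: (V, 3) posed vertices
    """
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)     # (V, 4) homogeneous coords
    blended = np.einsum('vj,jab->vab', weights, joint_transforms)  # per-vertex blended transform
    deformed = np.einsum('vab,vb->va', blended, homo)              # apply it to each vertex
    return deformed[:, :3]

def hand_shape(mean_shape, shape_basis, beta):
    """Low-dimensional shape model: mean shape plus a linear combination of shape directions.

    mean_shape: (V, 3), shape_basis: (K, V, 3), beta: (K,) shape coefficients
    """
    return mean_shape + np.einsum('k,kvd->vd', beta, shape_basis)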
DIME-FM: DIstilling Multimodal and Efficient Foundation Models
Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and
Florence, are trained on large-scale datasets of image-caption pairs and
achieve superior transferability and robustness on downstream tasks, but they
are difficult to use in many practical applications due to their large size,
high latency and fixed architectures. Unfortunately, recent work shows training
a small custom VLFM for resource-limited applications is currently very
difficult using public and smaller-scale data. In this paper, we introduce a
new distillation mechanism (DIME-FM) that allows us to transfer the knowledge
contained in large VLFMs to smaller, customized foundation models using a
relatively small amount of inexpensive, unpaired images and sentences. We
transfer the knowledge from the pre-trained CLIP-ViTL/14 model to a ViT-B/32
model, with only 40M public images and 28.4M unpaired public sentences. The
resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained
on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves
similar results in terms of zero-shot and linear-probing performance on both
ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also
displays comparable robustness when evaluated on five datasets with natural
distribution shifts from ImageNet.
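The sketch below illustrates one simple form of such distillation on unpaired data: the student is trained so that its image-text similarity matrix matches a frozen teacher's on the same (unpaired) batch. This is a schematic simplification, not DIME-FM's exact objective.

import torch
import torch.nn.functional as F

def similarity_distill_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.07):
    """All inputs are (B, D) embeddings; teacher embeddings come from a frozen VLFM.

    The images and sentences in the batch need not be paired: the teacher's
    similarity matrix over them serves as the soft target.
    """
    s_img = F.normalize(student_img, dim=-1)
    s_txt = F.normalize(student_txt, dim=-1)
    t_img = F.normalize(teacher_img, dim=-1)
    t_txt = F.normalize(teacher_txt, dim=-1)
    student_sim = s_img @ s_txt.t() / tau          # (B_img, B_txt) student similarities
    teacher_sim = t_img @ t_txt.t() / tau          # (B_img, B_txt) teacher similarities
    # Match the row-wise distributions of the teacher (soft targets).
    return F.kl_div(F.log_softmax(student_sim, dim=-1),
                    F.softmax(teacher_sim, dim=-1),
                    reduction='batchmean')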
Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search
Recent work in network quantization has substantially reduced the time and
space complexity of neural network inference, enabling their deployment on
embedded and mobile devices with limited computational and memory resources.
However, existing quantization methods often represent all weights and
activations with the same precision (bit-width). In this paper, we explore a
new dimension of the design space: quantizing different layers with different
bit-widths. We formulate this problem as a neural architecture search problem
and propose a novel differentiable neural architecture search (DNAS) framework
to efficiently explore its exponential search space with gradient-based
optimization. Experiments show we surpass the state-of-the-art compression of
ResNet on CIFAR-10 and ImageNet. Our quantized models, with a 21.1x smaller model
size or 103.9x lower computational cost, can still outperform the baseline quantized
or even full-precision models.
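A minimal sketch of the core DNAS idea for one layer is given below: candidate bit-widths are mixed with Gumbel-softmax weights over learnable architecture parameters, so the bit-width choice can be optimized by gradient descent alongside the weights. This illustrates the general technique, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w, bits):
    """Uniform symmetric fake quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class MixedPrecisionLinear(nn.Module):
    """Linear layer whose weight bit-width is chosen by differentiable search."""
    def __init__(self, in_f, out_f, bit_choices=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.01)
        self.bit_choices = bit_choices
        self.alpha = nn.Parameter(torch.zeros(len(bit_choices)))  # architecture parameters

    def forward(self, x, temperature=1.0):
        # Soft one-hot over candidate bit-widths; gradients flow into alpha.
        gates = F.gumbel_softmax(self.alpha, tau=temperature)
        w = sum(g * fake_quantize(self.weight, b)
                for g, b in zip(gates, self.bit_choices))
        return F.linear(x, w)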
Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation
Traditional computer vision models are trained to predict a fixed set of
predefined categories. Recently, natural language has been shown to be a
broader and richer source of supervision that provides finer descriptions to
visual concepts than supervised "gold" labels. Previous works, such as CLIP,
use InfoNCE loss to train a model to predict the pairing between images and
text captions. CLIP, however, is data hungry and requires more than 400M
image-text pairs for training. The inefficiency can be partially attributed to
the fact that the image-text pairs are noisy. To address this, we propose OTTER
(Optimal TransporT distillation for Efficient zero-shot Recognition), which
uses online entropic optimal transport to find a soft image-text match as
labels for contrastive learning. Based on pretrained image and text encoders,
models trained with OTTER achieve strong performance with only 3M image-text
pairs. Compared with InfoNCE loss, label smoothing, and knowledge distillation,
OTTER consistently outperforms these baselines in zero-shot evaluation on
Google Open Images (19,958 classes) and multi-labeled ImageNet 10K (10032
classes) from Tencent ML-Images. Over 42 evaluations on 7 different
dataset/architecture settings x 6 metrics, OTTER outperforms (32) or ties (2)
all baselines in 34 of them.
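The sketch below shows one simple way entropic optimal transport (via Sinkhorn iterations) can turn a noisy in-batch similarity matrix into soft matching targets for a contrastive loss, in the spirit of OTTER; it is a schematic simplification, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def sinkhorn_soft_targets(sim, eps=0.05, n_iters=50):
    """sim: (B, B) image-text similarities -> (B, B) soft matching targets."""
    K = torch.exp(sim / eps)                           # Gibbs kernel from similarities
    u = torch.ones(sim.size(0), device=sim.device)
    v = torch.ones(sim.size(1), device=sim.device)
    for _ in range(n_iters):                           # alternate row/column scaling
        u = 1.0 / (K @ v)
        v = 1.0 / (K.t() @ u)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)         # (approximately) doubly stochastic plan
    return plan / plan.sum(dim=1, keepdim=True)        # row-normalize into soft labels

def soft_contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Contrastive loss with OT-derived soft targets instead of one-hot pairings."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t()
    with torch.no_grad():
        targets = sinkhorn_soft_targets(sim)
    logits = sim / tau
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()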
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference
Vision Transformers (ViTs) have shown impressive performance but still
require a high computation cost compared to convolutional neural networks
(CNNs). One reason is that ViTs' attention measures global similarities and
thus has quadratic complexity in the number of input tokens. Existing
efficient ViTs adopt local attention (e.g., Swin) or linear attention (e.g.,
Performer), which sacrifice ViTs' capabilities of capturing either global or
local context. In this work, we ask an important research question: Can ViTs
learn both global and local context while being more efficient during
inference? To this end, we propose a framework called Castling-ViT, which
trains ViTs using both linear-angular attention and masked softmax-based
quadratic attention, but then switches to having only linear-angular attention
during ViT inference. Our Castling-ViT leverages angular kernels to measure the
similarities between queries and keys via spectral angles. We further
simplify it with two techniques: (1) a novel linear-angular attention
mechanism: we decompose the angular kernels into linear terms and high-order
residuals, and only keep the linear terms; and (2) we adopt two parameterized
modules to approximate high-order residuals: a depthwise convolution and an
auxiliary masked softmax attention to help learn both global and local
information, where the masks for softmax attention are regularized to gradually
become zeros and thus incur no overhead during ViT inference. Extensive
experiments and ablation studies on three tasks consistently validate the
effectiveness of the proposed Castling-ViT, e.g., achieving up to a 1.8% higher
accuracy or 40% MACs reduction on ImageNet classification and 1.2 higher mAP on
COCO detection under comparable FLOPs, as compared to ViTs with vanilla
softmax-based attentions.
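As a rough illustration of the linear-attention side of this design, the sketch below keeps only the linear term of an angular kernel (here taken as 1/2 + (q.k)/pi for unit-normalized q and k, from the Taylor expansion of 1 - arccos(q.k)/pi), which lets attention be computed in time linear in the number of tokens. The exact decomposition and the auxiliary training-time branches in Castling-ViT differ; this is only a sketch.

import math
import torch
import torch.nn.functional as F

def linear_angular_attention(q, k, v):
    """q, k: (B, N, D); v: (B, N, Dv). Truncated angular kernel 1/2 + (q . k)/pi,
    which is linear in k, so the N x N attention matrix is never formed."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    kv = torch.einsum('bnd,bne->bde', k, v)              # sum_j k_j v_j^T, shape (B, D, Dv)
    v_sum = v.sum(dim=1, keepdim=True)                   # sum_j v_j, shape (B, 1, Dv)
    numer = 0.5 * v_sum + torch.einsum('bnd,bde->bne', q, kv) / math.pi
    k_sum = k.sum(dim=1, keepdim=True)                   # sum_j k_j, shape (B, 1, D)
    denom = 0.5 * k.shape[1] + (q * k_sum).sum(dim=-1, keepdim=True) / math.pi
    return numer / denom                                 # (B, N, Dv)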
Effect of Soybean Protein Isolate and Tea Polyphenol Stabilized High Interior Phase Pickering Emulsion Replacing Fat on Meatball Quality
To reduce the harm that a high intake of saturated fat causes to human health, this study investigated high interior phase Pickering emulsions (HIPEs) stabilized by soybean protein isolate and soybean oil, with tea polyphenols added as a functional ingredient, and evaluated the two resulting HIPEs as pork back fat (PBF) replacers in meatballs. Six formulations were prepared by replacing PBF with water, HIPEs, or HIPEs loaded with tea polyphenols, and the physical, chemical, and sensory indexes of the meatballs were assessed. Compared with the control group, there were significant differences in all indexes except pH (P<0.05). Reduced-fat meatballs with HIPEs showed a higher cooking yield (93.59%), moisture content (66.91%), and protein content (14.48%), and a lower fat content (8.42%). The hardness, elasticity, cohesiveness, and chewability of the meatballs increased with the addition of emulsion. Adding HIPEs increased the L* value of the meatballs, while a* and b* increased with the addition of tea polyphenols. The HIPEs loaded with tea polyphenols gave the greatest improvement, with the lowest moisture (1.57%) and fat (0.019%) exudation and the lowest TBA value (4.12 mg/kg), and their sensory evaluation scores were higher than those of the control group. Using HIPEs as a fat substitute can effectively reduce the fat content of meatballs while improving their yield and quality, and this study provides a reference for the development of fat-reduced meat products.