Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction
Deep learning (DL) based predictive models from electronic health records
(EHR) deliver impressive performance in many clinical tasks. Large training
cohorts, however, are often required to achieve high accuracy, hindering the
adoption of DL-based models in scenarios with limited training data size.
Recently, bidirectional encoder representations from transformers (BERT) and
related models have achieved tremendous successes in the natural language
processing domain. The pre-training of BERT on a very large training corpus
generates contextualized embeddings that can boost the performance of models
trained on smaller datasets. We propose Med-BERT, which adapts the BERT
framework for pre-training contextualized embedding models on structured
diagnosis data from an EHR dataset of 28,490,650 patients. Fine-tuning experiments
are conducted on two disease-prediction tasks: (1) prediction of heart failure
in patients with diabetes and (2) prediction of pancreatic cancer from two
clinical databases. Med-BERT substantially improves prediction accuracy,
boosting the area under the receiver operating characteristic curve (AUC) by
2.02-7.12%. In particular, pre-trained Med-BERT substantially improves the
performance of tasks with very small fine-tuning training sets (300-500
samples), boosting the AUC by more than 20%, which is equivalent to the AUC
achieved with a training set ten times larger. We believe that Med-BERT will benefit disease-prediction
studies with small local training datasets, reduce data collection expenses,
and accelerate the pace of artificial intelligence aided healthcare.
Comment: L.R., X.Y., and Z.X. share first authorship of this work.
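The gains above are reported as AUC. As a reference point, AUC can be computed directly from risk scores and binary outcomes via the rank-based (Mann-Whitney) formulation; the sketch below is a generic illustration of the metric, not code from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels: iterable of 0/1 outcomes (e.g. heart-failure onset).
    scores: model risk scores, higher = more likely positive.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both classes to compute AUC")
    # sum the ranks of the positives, averaging ranks over tied scores
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            if pairs[k][1] == 1:
                rank_sum_pos += avg_rank
        i = j
    # Mann-Whitney U statistic normalized by the number of pos/neg pairs
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A 20% absolute AUC gain on a 300-500-sample fine-tuning set, as reported above, would move a model from near-chance toward the range of a much larger training cohort.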
Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification
In this study, we propose a timbre-reserved adversarial attack approach for
speaker identification (SID) to not only exploit the weakness of the SID model
but also preserve the timbre of the target speaker in a black-box attack
setting. Particularly, we generate timbre-reserved fake audio by adding an
adversarial constraint during the training of the voice conversion model. Then,
we leverage a pseudo-Siamese network architecture to learn from the black-box
SID model, constraining both intrinsic similarity and structural similarity
simultaneously. The intrinsic similarity loss encourages learning an intrinsic
invariance, while the structural similarity loss ensures that the substitute
SID model shares a decision boundary similar to that of the fixed black-box
SID model. The substitute model can be used as a proxy to generate
timbre-reserved fake audio for attacking. Experimental results on the Audio
Deepfake Detection (ADD) challenge dataset indicate that the attack success
rate of our proposed approach reaches up to 60.58% and 55.38% in the white-box
and black-box scenarios, respectively, and that the attack can deceive both
human beings and machines.
Comment: 5 pages.
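The two-part objective described above can be sketched numerically. The function below is our toy rendering, not the paper's implementation: the specific cosine and cross-entropy losses and the weighting are assumptions. It combines an intrinsic term pulling the substitute's embedding toward the black-box model's embedding and a structural term pushing the substitute's decision toward the black-box model's predicted speaker label.

```python
import numpy as np

def combined_similarity_loss(sub_emb, blackbox_emb, sub_logits,
                             blackbox_labels, alpha=0.5):
    """Toy combined objective for training a substitute SID model.

    sub_emb, blackbox_emb: (N, d) embeddings of the same utterances.
    sub_logits: (N, C) substitute-model speaker logits.
    blackbox_labels: (N,) speaker labels predicted by the black-box model.
    alpha: illustrative trade-off weight (an assumption, not a paper value).
    """
    # intrinsic similarity: 1 - cosine similarity between the two embeddings
    cos = np.sum(sub_emb * blackbox_emb, axis=1) / (
        np.linalg.norm(sub_emb, axis=1) * np.linalg.norm(blackbox_emb, axis=1))
    intrinsic = np.mean(1.0 - cos)
    # structural similarity: cross-entropy against the black-box hard labels,
    # encouraging a matching decision boundary
    probs = np.exp(sub_logits - sub_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(blackbox_labels)), blackbox_labels]
    structural = -np.mean(np.log(picked + 1e-12))
    return alpha * intrinsic + (1 - alpha) * structural
```

When the substitute matches the black-box model's embeddings and decisions, both terms approach zero; divergence in either raises the loss.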
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Voice conversion for highly expressive speech is challenging. Current
approaches struggle to balance speaker similarity, intelligibility, and
expressiveness. To address this problem, we propose
Expressive-VC, a novel end-to-end voice conversion framework that leverages
advantages from both neural bottleneck feature (BNF) approach and information
perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav
encoder to form a content extractor to learn linguistic and para-linguistic
features respectively, where BNFs come from a robust pre-trained ASR model and
the perturbed wave becomes speaker-irrelevant after signal perturbation. We
further fuse the linguistic and para-linguistic features through an attention
mechanism, where speaker-dependent prosody features serve as the attention
query; these features come from a prosody encoder whose inputs are the target
speaker embedding and the normalized pitch and energy of the source speech. Finally,
the decoder consumes the integrated features and the speaker-dependent prosody
feature to generate the converted speech. Experiments demonstrate that
Expressive-VC is superior to several state-of-the-art systems, achieving both
high expressiveness captured from the source speech and high speaker similarity
with the target speaker, while intelligibility is well maintained.
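The attention fusion described above can be sketched as standard scaled dot-product attention; the sketch below is our illustration under that assumption, not the released model. The prosody feature serves as the query, and the linguistic (BNF) and para-linguistic (perturbed-wav) features are concatenated along the time axis as keys and values.

```python
import numpy as np

def fuse_with_attention(prosody_q, linguistic_kv, paralinguistic_kv):
    """Fuse linguistic and para-linguistic features with a prosody query.

    prosody_q: (Tq, d) speaker-dependent prosody features (the query).
    linguistic_kv: (T1, d) BNF-derived features.
    paralinguistic_kv: (T2, d) perturbed-wav features.
    Returns: (Tq, d) fused features.
    """
    # keys/values: both feature streams stacked along time
    kv = np.concatenate([linguistic_kv, paralinguistic_kv], axis=0)
    d = prosody_q.shape[-1]
    scores = prosody_q @ kv.T / np.sqrt(d)             # (Tq, T1+T2)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ kv
```

Because the query is speaker-dependent prosody, the fused representation is weighted toward whichever stream best explains the target prosody at each frame.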
Preserving background sound in noise-robust voice conversion via multi-task learning
Background sound is an informative form of art that helps provide a more
immersive experience in real-application voice conversion (VC) scenarios.
However, prior VC research, which focuses mainly on clean voices, pays little
attention to VC with background sound. The critical problems for preserving
background sound in VC are the inevitable speech distortion introduced by the
neural separation model and the cascade mismatch between the source separation
model. In this paper, we propose an end-to-end framework via multi-task
learning which sequentially cascades a source separation (SS) module, a
bottleneck feature extraction module and a VC module. Specifically, the source
separation task explicitly considers critical phase information and confines
the distortion caused by the imperfect separation process. The source
separation task, the typical VC task, and the unified task share a uniform
reconstruction loss constrained by joint training to reduce the mismatch
between the SS and VC modules. Experimental results demonstrate that our
proposed framework significantly outperforms the baseline systems while
achieving comparable quality and speaker similarity to the VC models trained
with clean data.
Comment: Submitted to ICASSP 202
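The cascaded pipeline and its shared losses can be sketched as follows. The function names, the simple L2 losses, and the re-mixing form of the unified task are our assumptions for illustration; `separate`, `extract_bnf`, and `convert` stand in for the SS, bottleneck, and VC modules.

```python
import numpy as np

def multitask_losses(noisy_mix, clean_speech, background, converted_target,
                     separate, extract_bnf, convert):
    """Joint loss for the SS -> bottleneck -> VC cascade (illustrative).

    separate(mix) -> (est_speech, est_background)
    extract_bnf(speech) -> bottleneck features
    convert(bnf) -> converted speech
    """
    est_speech, est_bg = separate(noisy_mix)       # source separation module
    bnf = extract_bnf(est_speech)                  # bottleneck feature module
    est_converted = convert(bnf)                   # VC module
    # SS task: reconstruct both the speech and the background
    ss_loss = (np.mean((est_speech - clean_speech) ** 2)
               + np.mean((est_bg - background) ** 2))
    # VC task: match the converted-speech target
    vc_loss = np.mean((est_converted - converted_target) ** 2)
    # unified task: converted voice re-mixed with the separated background
    unified = np.mean(((est_converted + est_bg)
                       - (converted_target + background)) ** 2)
    return ss_loss + vc_loss + unified
```

Sharing one reconstruction objective across all three tasks is what lets joint training reduce the mismatch between the SS and VC modules.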
A novel image fusion algorithm based on bandelet transform
A novel image fusion algorithm based on bandelet transform is proposed. Bandelet transform can take advantage of the geometrical regularity of image structure and represent sharp image transitions such as edges efficiently in image fusion. For reconstructing the fused image, the maximum rule is used to
select the source images’ geometric flow and bandelet coefficients. Experimental results indicate that the bandelet-based fusion algorithm represents edge and detail information well and outperforms the wavelet-based and Laplacian pyramid-based fusion algorithms, especially when the source images contain abundant texture and edges.
Supported by the Navigation Science Foundation (No. 05F07001) and the National Natural Science Foundation of China (No. 60472081).
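The maximum selection rule mentioned above is simple to state: at each position, keep the transform coefficient with the larger magnitude. The sketch below shows it on plain numpy arrays; the paper applies it to bandelet coefficients and geometric flow, which this illustration does not reproduce.

```python
import numpy as np

def fuse_max_rule(coeffs_a, coeffs_b):
    """Maximum (absolute-value) selection rule for transform-domain fusion.

    For each coefficient position, keep whichever source image's
    coefficient has the larger magnitude, preserving strong transitions
    such as edges in the fused result.
    """
    return np.where(np.abs(coeffs_a) >= np.abs(coeffs_b), coeffs_a, coeffs_b)
```

In a full pipeline, this rule would be applied to each subband of the two decomposed source images before the inverse transform reconstructs the fused image.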
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Style voice conversion aims to transform the style of source speech to a
desired style according to real-world application demands. However, current
style voice conversion approaches rely on pre-defined labels or reference
speech to control the conversion process, which limits style diversity and
falls short in the intuitiveness and interpretability of the style
representation. In this study, we propose PromptVC, a novel style voice
conversion approach that employs a latent diffusion model to generate a style
vector driven by natural language prompts. Specifically, the style vector is
extracted by a style encoder during training, and then the latent diffusion
model is trained independently to sample the style vector from noise, with this
process being conditioned on natural language prompts. To improve style
expressiveness, we leverage HuBERT to extract discrete tokens and replace them
with the K-Means center embedding to serve as the linguistic content, which
minimizes residual style information. Additionally, we deduplicate consecutive
identical discrete tokens and employ a differentiable duration predictor to
re-predict the duration of each token, which adapts the duration of the same linguistic
content to different styles. The subjective and objective evaluation results
demonstrate the effectiveness of our proposed system.
Comment: Submitted to ICASSP 202
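The token deduplication step described above is a run-length collapse of consecutive identical discrete tokens, keeping the original run lengths so a duration predictor can re-expand them; the helper names below are ours, and the learned duration predictor itself is not shown.

```python
def deduplicate_tokens(tokens):
    """Collapse runs of identical discrete tokens into (unit, duration) pairs.

    [5, 5, 7, 7, 7, 2] -> units [5, 7, 2], durations [2, 3, 1]
    """
    units, durations = [], []
    for t in tokens:
        if units and units[-1] == t:
            durations[-1] += 1        # extend the current run
        else:
            units.append(t)           # start a new run
            durations.append(1)
    return units, durations

def expand_tokens(units, durations):
    """Inverse operation used at synthesis time with re-predicted durations."""
    out = []
    for u, d in zip(units, durations):
        out.extend([u] * d)
    return out
```

Because the deduplicated sequence is style-agnostic, re-predicting the durations is what lets the same linguistic content stretch or compress to fit different target styles.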
Unsupervised Deep Representation Learning Enables Phenotype Discovery for Genetic Association Studies of Brain Imaging
Understanding the genetic architecture of brain structure is challenging, partly due to difficulties in designing robust, non-biased descriptors of brain morphology. Until recently, brain measures for genome-wide association studies (GWAS) consisted of traditionally expert-defined or software-derived image-derived phenotypes (IDPs) that are often based on theoretical preconceptions or computed from limited amounts of data. Here, we present an approach to derive brain imaging phenotypes using unsupervised deep representation learning. We train a 3-D convolutional autoencoder model with reconstruction loss on 6130 UK Biobank (UKBB) participants' T1 or T2-FLAIR (T2) brain MRIs to create a 128-dimensional representation known as Unsupervised Deep learning derived Imaging Phenotypes (UDIPs). GWAS of these UDIPs in held-out UKBB subjects (n = 22,880 discovery and n = 12,359/11,265 replication cohorts for T1/T2) identified 9457 significant SNPs organized into 97 independent genetic loci, of which 60 loci were replicated. Twenty-six loci were not reported in earlier T1- and T2-IDP-based UK Biobank GWAS. We developed a perturbation-based decoder interpretation approach to show that these loci are associated with UDIPs mapped to multiple relevant brain regions. Our results establish that unsupervised deep learning can derive robust, unbiased, heritable, and interpretable brain imaging phenotypes.
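The perturbation-based decoder interpretation described above can be sketched generically: nudge one latent dimension, decode both latents, and inspect where the output changes. The helper below is our toy rendering; the linear decoder in the test and the step size are stand-ins for the trained convolutional decoder and whatever perturbation scale the study used.

```python
import numpy as np

def latent_saliency(decode, z, dim, eps=1.0):
    """Voxel-wise effect map of perturbing one latent dimension.

    decode: function mapping a latent vector to a (flattened) image.
    z: latent vector, e.g. a 128-dimensional UDIP.
    dim: index of the latent dimension to perturb.
    eps: perturbation step (illustrative, not a paper value).
    """
    z_pert = z.copy()
    z_pert[dim] += eps
    # absolute difference shows where this dimension acts in image space
    return np.abs(decode(z_pert) - decode(z))
```

Mapping each latent dimension to such a difference image is one way to link a GWAS hit on a UDIP dimension back to specific brain regions.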
Disentangling the effects of vapor pressure deficit on northern terrestrial vegetation productivity
The impact of atmospheric vapor pressure deficit (VPD) on plant photosynthesis has long been acknowledged, but large interactions with air temperature (T) and soil moisture (SM) still hinder a complete understanding of the influence of VPD on vegetation production across various climate zones. Here, we found a diverging response of productivity to VPD in the Northern Hemisphere by excluding the interactive effects of VPD with T and SM. The interactions between VPD and T/SM not only offset the potential positive impact of warming on vegetation productivity but also amplify the negative effect of soil drying. Notably, for high-latitude ecosystems, a pronounced shift in vegetation productivity's response to VPD occurs during the growing season when VPD surpasses a threshold of 3.5 to 4.0 hectopascals. These results yield previously unknown insights into the role of VPD in terrestrial ecosystems and enhance our comprehension of the terrestrial carbon cycle's response to global warming.
Deep Learning Approach for Brain Machine Interface
Objective: Brain machine interface (BMI) or Brain Computer Interface (BCI) provides a direct pathway between the brain and an external device to help people suffering from severely impaired motor function by decoding brain activities and translating human intentions into control signals. Conventionally, the decoding pipeline for BMIs consists of a chain of different stages of feature extraction, time-frequency analysis and statistical learning models. Each of these stages uses a different algorithm trained in a sequential manner, which makes the whole system difficult to make adaptive. Our goal is to create differentiable signal processing modules and plug them together to build an adaptive online system. The system can be trained with a single objective function and a single learning algorithm, so that each component can be updated in parallel to increase performance in a robust manner. We use deep neural networks to address these needs. Main Results: We predicted the finger trajectory using electrocorticography (ECoG) signals and compared results for Least Angle Regression (LARS), a Convolutional Long Short Term Memory Network (Conv-LSTM), Random Forest (RF), and a pipeline consisting of band-pass filtering, energy extraction, feature selection and linear regression. The results showed that the deep learning models performed better than the commonly used linear model. The deep learning models not only gave smoother and more realistic trajectories but also learned the transition between movement and rest states. We also estimated the source connectivity of the brain signals using a Recurrent Neural Network (RNN), which correctly estimated the order and sparsity level of the underlying Multivariate Auto-regressive (MVAR) process. The time course of the source connectivity was also recovered. Significance: We replace the conventional signal processing pipeline with differentiable modules so that the whole BMI system is adaptive.
The study of the decoding system demonstrated a model for BMI that involved a convolutional and recurrent neural network. It integrated the feature extraction pipeline into the convolution and pooling layers and used a Long Short Term Memory (LSTM) layer to capture the state transitions. The decoding network eliminated the need to separately train the model at each step in the decoding pipeline. The whole system can be jointly optimized using stochastic gradient descent and is capable of online learning. The study of the source connectivity estimation demonstrated a generative RNN model that can estimate the un-mixing matrix and the MVAR coefficients of the source activity at the same time. Our method addressed the issue of estimation and inference of the non-stationary MVAR coefficients and the un-mixing matrix in the presence of non-Gaussian noise. More importantly, this model can easily be plugged into the BMI decoding system as a differentiable feature extraction module.
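The MVAR process mentioned above is straightforward to state: each sample is a lagged linear combination of previous samples plus noise, x_t = sum_k A_k x_{t-k} + e_t. The toy generator below is ours, for illustrating the dynamics the RNN is said to recover; the study estimates, rather than fixes, the lag matrices A_k.

```python
import numpy as np

def simulate_mvar(coeffs, n_steps, noise_std=0.1, rng=None):
    """Simulate a multivariate autoregressive (MVAR) process.

    coeffs: list of (d, d) lag matrices [A_1, ..., A_p] (model order p).
    n_steps: number of samples to generate after the zero-initialized lags.
    noise_std: standard deviation of the Gaussian innovation noise.
    Returns: (n_steps, d) array of simulated source activity.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    order = len(coeffs)
    d = coeffs[0].shape[0]
    x = np.zeros((n_steps + order, d))
    for t in range(order, n_steps + order):
        # lagged linear dynamics plus innovation noise
        x[t] = sum(A @ x[t - k - 1] for k, A in enumerate(coeffs))
        x[t] += noise_std * rng.standard_normal(d)
    return x[order:]
```

A connectivity estimator is then judged by how well it recovers the order p, the sparsity pattern of the A_k matrices, and their time course from such data.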
Decoding of finger trajectory from ECoG using deep learning
Objective. The conventional decoding pipeline for brain-machine interfaces (BMIs) consists of a chain of different stages of feature extraction, time-frequency analysis and statistical learning models. Each of these stages uses a different algorithm trained in a sequential manner, which makes it difficult to make the whole system adaptive. The goal was to create an adaptive online system with a single objective function and a single learning algorithm so that the whole system can be trained in parallel to increase the decoding performance. Here, we used deep neural networks consisting of convolutional neural networks (CNN) and a special kind of recurrent neural network (RNN) called long short term memory (LSTM) to address these needs. Approach. We used electrocorticography (ECoG) data collected by Kubanek et al. The task consisted of individual finger flexions upon a visual cue. Our model combined a hierarchical feature-extractor CNN and an RNN that was able to process sequential data and recognize temporal dynamics in the neural data. The CNN was used as the feature extractor and the LSTM was used as the regression algorithm to capture the temporal dynamics of the signal. Main results. We predicted the finger trajectory using ECoG signals and compared results for least angle regression (LARS), CNN-LSTM, random forest, an LSTM model (LSTM_HC, using hard-coded features) and a decoding pipeline consisting of band-pass filtering, energy extraction, feature selection and linear regression. The results showed that the deep learning models performed better than the commonly used linear model. The deep learning models not only gave smoother and more realistic trajectories but also learned the transition between movement and rest states. Significance. This study demonstrated a decoding network for BMI that involved a convolutional and recurrent neural network model.
It integrated the feature extraction pipeline into the convolution and pooling layers and used an LSTM layer to capture the state transitions. The discussed network eliminated the need to separately train the model at each step in the decoding pipeline. The whole system can be jointly optimized using stochastic gradient descent and is capable of online learning.