Methods for Comparing a DNA Sequence with a Protein Sequence
We describe two methods for constructing an optimal global alignment of, and an optimal local alignment between, a DNA sequence and a protein sequence. The alignment model of the methods addresses the problems of frameshifts and introns in the DNA sequence. The methods require computer memory proportional to the sequence lengths, so they can rigorously process very long sequences. Simplified versions of the methods were implemented as computer programs named NAP and LAP. Experimental results demonstrate that the programs are sensitive and powerful tools for finding genes by DNA-protein sequence homology.
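As an illustration of the alignment model, a minimal dynamic-programming sketch of global DNA-protein alignment with frameshift penalties might look as follows. This is not the NAP algorithm itself (which also handles introns and runs in linear space); the scoring values and the truncated codon table are toy assumptions for the demo.

```python
# Minimal DP sketch: globally align a DNA sequence against a protein
# sequence, allowing frameshifts (codons with one base inserted/deleted).
# Not the NAP/LAP implementation; scores and codon table are toy values.

CODON = {  # deliberately truncated codon table, enough for the demo
    "ATG": "M", "GCT": "A", "TGG": "W", "AAA": "K", "TAA": "*",
}

MATCH, MISMATCH, GAP, FRAMESHIFT = 2, -1, -4, -10

def align_score(dna, prot):
    n, m = len(dna), len(prot)
    NEG = float("-inf")
    # D[i][j] = best score aligning the first i bases with the first j residues
    D = [[NEG] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if D[i][j] == NEG:
                continue
            s = D[i][j]
            if j < m:
                if i + 3 <= n:  # consume a full codon for one residue
                    aa = CODON.get(dna[i:i + 3], "X")
                    D[i + 3][j + 1] = max(D[i + 3][j + 1],
                                          s + (MATCH if aa == prot[j] else MISMATCH))
                if i + 2 <= n:  # frameshift: codon with a deleted base
                    D[i + 2][j + 1] = max(D[i + 2][j + 1], s + FRAMESHIFT)
                if i + 1 <= n:  # frameshift: two bases deleted
                    D[i + 1][j + 1] = max(D[i + 1][j + 1], s + FRAMESHIFT)
                D[i][j + 1] = max(D[i][j + 1], s + GAP)   # gap in DNA
            if i + 3 <= n:                                # gap in protein
                D[i + 3][j] = max(D[i + 3][j], s + GAP)
    return D[n][m]
```

The quadratic table here is exactly the memory cost the linear-space technique in the paper avoids.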
Optimal Clustering with Noisy Queries via Multi-Armed Bandit
Motivated by many applications, we study clustering with a faulty oracle. In this problem, there are n items belonging to k unknown clusters, and the algorithm is allowed to ask the oracle whether two items belong to the same cluster or not. However, the answer from the oracle is correct only with probability 1/2 + δ/2. The goal is to recover the hidden clusters with the minimum number of noisy queries. Previous works have shown that the problem can be solved with O(nk log n / δ²) queries, while Ω(nk/δ²) queries are known to be necessary. So, for any values of k and δ, there is still a non-trivial gap between upper and lower bounds. In this work, we obtain the first matching upper and lower bounds for a wide range of parameters. In particular, a new polynomial-time algorithm with O(n(k + log n)/δ² + poly(k, 1/δ, log n)) queries is proposed. Moreover, we prove a new lower bound of Ω(n log n/δ²), which, combined with the existing Ω(nk/δ²) bound, matches our upper bound up to an additive poly(k, 1/δ, log n) term. To obtain the new results, our main ingredient is an interesting connection between our problem and the multi-armed bandit problem, which might provide useful insights for other similar problems.
Comment: ICML 2022
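The basic faulty-oracle setting can be simulated in a few lines. The sketch below uses the naive strategy of repeating each pairwise query and taking a majority vote, which already recovers the clusters with high probability but spends far more queries than the bounds discussed above; it is a toy baseline, not the paper's bandit-based algorithm.

```python
import random

def noisy_oracle(labels, delta, rng):
    """Oracle that answers same-cluster queries correctly w.p. 1/2 + delta/2."""
    def query(i, j):
        truth = labels[i] == labels[j]
        return truth if rng.random() < 0.5 + delta / 2 else not truth
    return query

def same_cluster(query, i, j, repeats):
    # Majority vote over repeated queries boosts per-pair accuracy;
    # O(log n / delta^2) repeats per pair suffice w.h.p., so this naive
    # scheme spends far more than the optimal query counts above.
    votes = sum(query(i, j) for _ in range(repeats))
    return votes * 2 > repeats

def recover_clusters(n, query, repeats):
    clusters = []
    for i in range(n):
        for c in clusters:
            if same_cluster(query, i, c[0], repeats):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Comparing each new item against one representative per cluster keeps the pair count at O(nk), which is where the nk factor in the bounds comes from.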
Semantically Enhanced Software Traceability Using Deep Learning Techniques
In most safety-critical domains the need for traceability is prescribed by certifying bodies. Trace links are generally created among requirements, design, source code, test cases, and other artifacts; however, creating such links manually is time-consuming and error-prone. Automated solutions use information retrieval and machine learning techniques to generate trace links; however, current techniques fail to understand the semantics of the software artifacts or to integrate domain knowledge into the tracing process, and therefore tend to deliver imprecise and inaccurate results. In this paper, we present a solution that uses deep learning to incorporate requirements artifact semantics and domain knowledge into the tracing solution. We propose a tracing network architecture that utilizes Word Embedding and Recurrent Neural Network (RNN) models to generate trace links. Word embedding learns word vectors that represent knowledge of the domain corpus, and the RNN uses these word vectors to learn the sentence semantics of requirements artifacts. We trained 360 different configurations of the tracing network using existing trace links in the Positive Train Control domain and identified the Bidirectional Gated Recurrent Unit (BI-GRU) as the best model for the tracing task. BI-GRU significantly outperformed state-of-the-art tracing methods including the Vector Space Model and Latent Semantic Indexing.
Comment: 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE 2017)
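The embedding-based ranking idea behind such tracing tools can be sketched in pure Python. This toy version simply averages word vectors and ranks candidate artifacts by cosine similarity, closer to the VSM-style baselines than to the paper's BI-GRU, which learns sentence semantics instead of averaging; the vectors and vocabulary are invented for the demo.

```python
import math

# Toy word vectors; a real system learns these from a domain corpus.
EMB = {
    "brake":  [0.9, 0.1, 0.0],
    "train":  [0.8, 0.2, 0.1],
    "speed":  [0.7, 0.3, 0.2],
    "report": [0.1, 0.9, 0.3],
    "log":    [0.0, 0.8, 0.4],
}

def embed(text):
    # Average the word vectors (the paper feeds them to a BI-GRU instead).
    vecs = [EMB[w] for w in text.split() if w in EMB]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_links(requirement, artifacts):
    # Rank candidate artifacts by similarity to the requirement text.
    q = embed(requirement)
    return sorted(((cosine(q, embed(a)), a) for a in artifacts), reverse=True)
```

Averaging discards word order, which is precisely the limitation that motivates a recurrent encoder.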
The Long-Term Effects of Blood Urea Nitrogen Levels on Cardiovascular Disease and All-Cause Mortality in Diabetes: A Prospective Cohort Study
BACKGROUND: The long-term effects of blood urea nitrogen (BUN) in patients with diabetes remain unknown. Current studies reporting the target BUN level in patients with diabetes are also limited. Hence, this prospective study aimed to explore the relationship of BUN with all-cause and cardiovascular mortalities in patients with diabetes.
METHODS: In total, 10,507 participants with diabetes from the National Health and Nutrition Examination Survey (1999-2018) were enrolled. The causes and numbers of deaths were determined based on the National Death Index mortality data from the date of NHANES interview until follow-up (December 31, 2019). Multivariate Cox proportional hazard regression models were used to calculate the hazard ratios (HRs) and 95% confidence intervals (CIs) of mortality.
RESULTS: Of the adult participants with diabetes, 4963 (47.2%) were female. The median (interquartile range) BUN level of participants was 5 (3.93-6.43) mmol/L. After 86,601 person-years of follow-up, 2,441 deaths were documented. After adjusting for variables, the HRs of cardiovascular disease (CVD) and all-cause mortality in the highest BUN level group were 1.52 and 1.35, respectively, compared with those in the lowest BUN level group. With a one-unit increment in BUN levels, the HRs of all-cause and CVD mortality rates were 1.07 and 1.08, respectively. The results remained robust when several sensitivity and stratified analyses were performed. Moreover, BUN showed a nonlinear association with all-cause and CVD mortality, and the curves for both outcomes showed inflection points close to a BUN level of 5 mmol/L.
CONCLUSION: BUN had a nonlinear association with all-cause and CVD mortality in patients with diabetes. The inflection point was at 5 mmol/L.
ASR: Attention-alike Structural Re-parameterization
The structural re-parameterization (SRP) technique is a novel deep learning technique that achieves interconversion between different network architectures through equivalent parameter transformations. By applying these transformations at inference time, SRP removes the extra costs, such as parameter size and inference time, that were incurred for performance improvement during training, and therefore it has great potential for industrial and practical applications. Existing SRP methods have successfully covered many commonly used architectures, such as normalizations, pooling methods, and multi-branch convolution. However, the widely used self-attention modules cannot be directly implemented by SRP, because these modules usually act on the backbone network in a multiplicative manner and their output is input-dependent during inference, which limits the application scenarios of SRP. In this paper, we conduct extensive experiments from a statistical perspective and discover an interesting phenomenon, the Stripe Observation, which reveals that channel attention values quickly approach some constant vectors during training. This observation inspires us to propose a simple-yet-effective attention-alike structural re-parameterization (ASR) that achieves SRP for a given network while enjoying the effectiveness of the self-attention mechanism. Extensive experiments conducted on several standard benchmarks demonstrate the effectiveness of ASR in generally improving the performance of existing backbone networks, self-attention modules, and SRP methods without any elaborate model crafting. We also analyze the limitations and provide experimental or theoretical evidence for the strong robustness of the proposed ASR.
Comment: Technical report
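The core folding idea can be shown concretely: if a channel-attention vector has converged to a constant a (the Stripe Observation), then a[o] * (conv(x) + b[o]) can be absorbed into the convolution weights and bias at inference time. The sketch below does this for a naive 1-D convolution with toy numbers; it illustrates the transformation, not the full ASR method.

```python
# Fold a constant per-channel attention vector into convolution weights,
# so that attention costs nothing at inference. Toy 1-D, single-input-
# channel convolution; weights and attention values are made up.

def conv1d(x, w, b):
    # naive 1-D convolution: one output row per filter in w
    out = []
    for o in range(len(w)):
        taps = len(w[o])
        row = [sum(w[o][t] * x[i + t] for t in range(taps)) + b[o]
               for i in range(len(x) - taps + 1)]
        out.append(row)
    return out

def fold_attention(w, b, a):
    # scale each output channel's weights and bias by its attention value:
    # a[o] * (w[o] . x + b[o]) == (a[o] * w[o]) . x + a[o] * b[o]
    w2 = [[a[o] * t for t in w[o]] for o in range(len(w))]
    b2 = [a[o] * b[o] for o in range(len(b))]
    return w2, b2
```

Because the attention values are constant, the folded network is exactly equivalent to the attended one, which is what makes the re-parameterization "free" at inference.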
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models
Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, existing models are limited in semantic understanding and commonsense reasoning when the input prompts are concise narratives, resulting in low-quality image generation. To improve the capacity for narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To reach this goal, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. Then, we align the semantic representation of narrative prompts to the complex prompts and transfer knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation so that it can acquire powerful semantic understanding and reasoning capabilities to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments by integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason over concise natural language without image quality degradation. By bridging the semantic gap between simple narrative prompts and complex keyword-based prompts, our approach makes text-to-image diffusion models easier to use, with a better user experience, and has the potential to further advance the development of user-friendly text-to-image generation models. The code is released at https://github.com/Qrange-group/SUR-adapter.
Comment: accepted by ACM MM 2023
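A guess at the shape of the training objective, reading the abstract literally: align the narrative-prompt representation to the keyword-prompt representation, plus a distillation term pulling it toward an LLM embedding. The function names, the use of MSE, and the weighting are assumptions for illustration, not the paper's actual loss.

```python
# Hypothetical sketch of a SUR-adapter-style objective: align the
# narrative prompt's embedding with (a) the matching keyword prompt's
# embedding and (b) an LLM's semantic embedding (knowledge distillation).
# MSE and the alpha weighting are illustrative assumptions.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def sur_adapter_loss(narrative_emb, keyword_emb, llm_emb, alpha=0.5):
    align = mse(narrative_emb, keyword_emb)   # narrative -> complex prompt
    distill = mse(narrative_emb, llm_emb)     # narrative -> LLM knowledge
    return align + alpha * distill
```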
PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition
In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels and mentions for NER, significantly increasing the sequence length. To this end, we introduce Parallel Decoding in LLMs for NER (PaDeLLM-NER), an approach that integrates seamlessly into existing generative model frameworks without necessitating additional modules or architectural modifications. PaDeLLM-NER allows for the simultaneous decoding of all mentions, thereby reducing generation latency. Experiments reveal that PaDeLLM-NER significantly increases inference speed, running 1.76 to 10.22 times faster than the autoregressive approach for both English and Chinese. Simultaneously, it maintains prediction quality, with performance on par with the state-of-the-art across various datasets.
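The latency argument can be made concrete with a toy simulation: instead of autoregressively decoding one long sequence containing every (label, mention) pair, decode each label's mentions as an independent short sequence and merge the results, so wall-clock latency is governed by the longest single answer rather than the sum. The per-label outputs below are stand-ins for LLM generations, not a real model.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for per-label LLM generations (hypothetical outputs).
FAKE_OUTPUTS = {
    "PER": ["Alice"],
    "LOC": ["Paris", "Berlin"],
    "ORG": [],
}

def decode_label(label):
    # In the real system this would be one short generation pass per label.
    return label, FAKE_OUTPUTS[label]

def parallel_ner(labels):
    # Decode every label concurrently and merge into one prediction dict.
    with ThreadPoolExecutor(max_workers=len(labels)) as ex:
        return dict(ex.map(decode_label, labels))
```

In an actual deployment the per-label decodes would run as batched requests to the same model, which is what yields the reported speedups without changing the architecture.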
Endophytic bacterium Pseudomonas protegens suppresses mycelial growth of Botryosphaeria dothidea and decreases its pathogenicity to postharvest fruits
Apple (Malus domestica Borkh.), one of the most economically important fruits widely consumed worldwide, suffers from apple ring rot caused by Botryosphaeria dothidea, which dramatically affects its quality and yield. In the present study, we demonstrated that Pseudomonas protegens, isolated from Chinese leek (Allium tuberosum), significantly suppressed the mycelial growth and propagation of B. dothidea and further displayed a considerable inhibitory effect on apple ring rot of postharvest fruits. In addition, P. protegens significantly improved the total soluble solid/titratable acidity (TSS/TA) and soluble sugar/titratable acidity (SS/TA) ratios and markedly maintained fruit firmness. Further analysis showed that P. protegens substantially induced defense-related genes such as MdGLU, MdPAL, MdPOD, and MdCAL, as well as transcription factors related to resistance to B. dothidea, including MdWRKY15, MdPUB29, MdMyb73, and MdERF11, in apple fruits. Meanwhile, P. protegens considerably restrained the expression of pathogenicity-related genes in B. dothidea, including BdCYP450, BdADH, BdGHY, BdATS, Bdα/β-HY, and BdSTR. These results suggest that P. protegens inhibits apple ring rot on postharvest fruits by activating the defense system of apple fruit and repressing the pathogenic factors of B. dothidea. The study provides a theoretical basis and a potential alternative for managing apple ring rot on postharvest fruits.
UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding
In the era of Large Language Models (LLMs), tremendous strides have been made in the field of multimodal understanding. However, existing advanced algorithms are limited in effectively utilizing the immense representation capabilities and rich world knowledge inherent to these large pre-trained models, and the beneficial connections among tasks within the context of text-rich scenarios have not been sufficiently explored. In this work, we introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities, which are deficient in existing approaches. Moreover, UniDoc capitalizes on the beneficial interactions among tasks to enhance the performance of each individual task. To implement UniDoc, we perform unified multimodal instruction tuning on contributed large-scale instruction-following datasets. Quantitative and qualitative experimental results show that UniDoc sets state-of-the-art scores across multiple challenging benchmarks. To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.