Event-driven Real-time Retrieval in Web Search
Information retrieval in real-time search presents unique challenges distinct
from those encountered in classical web search. These challenges are
particularly pronounced due to the rapid change of user search intent, which is
influenced by the occurrence and evolution of breaking news events, such as
earthquakes, elections, and wars. Previous dense retrieval methods, which
primarily focused on static semantic representation, lack the capacity to
capture immediate search intent, leading to inferior performance in retrieving
the most recent event-related documents in time-sensitive scenarios. To address
this issue, this paper expands the query with event information that represents
real-time search intent. The event information is then integrated with the
query through a cross-attention mechanism, resulting in a time-context query
representation. We further enhance the model's capacity for event
representation through multi-task training. Since publicly available datasets
such as MS-MARCO do not contain any event information on the query side and
have few time-sensitive queries, we design an automatic data collection and
annotation pipeline to address this issue, which includes ModelZoo-based Coarse
Annotation and LLM-driven Fine Annotation processes. In addition, we share
training tricks such as two-stage training and hard negative sampling. Finally,
we conduct a set of offline experiments on a million-scale production dataset
to evaluate our approach and run an A/B test in a real online system to
verify the performance. Extensive experimental results demonstrate that our
proposed approach significantly outperforms existing state-of-the-art baseline
methods.
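The fusion step described above, where event information is combined with the query through cross-attention, can be illustrated with a minimal sketch. This is not the paper's model: it assumes a single attention head, no learned projection matrices, a residual connection, and mean pooling, and the names `cross_attention` and `mean_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, event_tokens):
    """Let every query token attend over the event tokens.

    Assumption: the event tokens act as both keys and values, and the
    attended event context is added back to the query token (residual).
    """
    dim = len(query_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for q in query_tokens:
        # Scaled dot-product score between this query token and each event token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in event_tokens]
        weights = softmax(scores)
        # Weighted sum of event token vectors.
        context = [
            sum(w * k[d] for w, k in zip(weights, event_tokens)) for d in range(dim)
        ]
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def mean_pool(tokens):
    """Average token vectors into one query representation."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


# Toy 3-dimensional embeddings (illustrative values only).
query_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
event_tokens = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
fused = cross_attention(query_tokens, event_tokens)
query_vector = mean_pool(fused)
```

In a trained retriever the query, key, and value projections would be learned, and the pooled vector would be matched against document embeddings.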
Event-Centric Query Expansion in Web Search
In search engines, query expansion (QE) is a crucial technique to improve
search experience. Previous studies often rely on long-term search log mining,
which leads to slow updates and is sub-optimal for time-sensitive news
searches. In this work, we present Event-Centric Query Expansion (EQE), a novel
QE system that addresses these issues by rapidly and accurately mining the best
expansion from a large number of potential events. This system
consists of four stages, i.e., event collection, event reformulation, semantic
retrieval and online ranking. Specifically, we first collect and filter news
headlines from websites. Then we propose a generation model that incorporates
contrastive learning and prompt-tuning techniques to reformulate these
headlines to concise candidates. Additionally, we fine-tune a dual-tower
semantic model to function as an encoder for event retrieval and explore a
two-stage contrastive training approach to enhance the accuracy of event
retrieval. Finally, we rank the retrieved events and select the optimal one as
QE, which is then used to improve the retrieval of event-related documents.
Through offline analysis and online A/B testing, we observe that the EQE system
significantly improves many metrics compared to the baseline. The system has
been deployed in Tencent QQ Browser Search and served hundreds of millions of
users. The dataset and baseline codes are available at
https://open-event-hub.github.io/eqe.
Comment: ACL 2023 Industry Trac
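The semantic retrieval stage described above (a dual-tower encoder trained contrastively) can be sketched as follows. This is an illustrative sketch, not the EQE implementation: it assumes the two towers have already produced embeddings, uses an in-batch softmax contrastive loss with a hypothetical temperature of 0.05, and ranks events by cosine similarity. The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(query_vecs, event_vecs, temperature=0.05):
    """Average cross-entropy where event i is the positive for query i
    and every other event in the batch serves as a negative."""
    queries = [normalize(q) for q in query_vecs]
    events = [normalize(e) for e in event_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, e) / temperature for e in events]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def retrieve(query_vec, event_vecs, top_k=1):
    """Return indices of the top_k events by cosine similarity to the query."""
    q = normalize(query_vec)
    scored = [(dot(q, normalize(e)), idx) for idx, e in enumerate(event_vecs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]
```

The loss is small when each query embedding lies closest to its own event embedding and grows when the pairs are mismatched, which is the signal a contrastive training stage optimizes.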
SlimPajama-DC: Understanding Data Combinations for LLM Training
This paper aims to understand the impacts of various data combinations (e.g.,
web text, Wikipedia, GitHub, books) on the training of large language models
using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source
dataset, which has been refined and further deduplicated to 627B tokens from
the extensive 1.2T-token RedPajama dataset contributed by Together. We term
our research SlimPajama-DC, an empirical analysis designed to uncover
fundamental characteristics and best practices associated with employing
SlimPajama in the training of large language models. During our research with
SlimPajama, two pivotal observations emerged: (1) Global deduplication vs.
local deduplication. We analyze and discuss how global (across different
sources of datasets) and local (within the single source of dataset)
deduplications affect the performance of trained models. (2) Proportions of
high-quality/highly-deduplicated multi-source datasets in the combination. To
study this, we construct six configurations of the SlimPajama dataset and train
a separate 1.3B Cerebras-GPT model with ALiBi and SwiGLU on each. Our best
configuration outperforms the 1.3B model trained on RedPajama using the same
number of training tokens by a significant margin. All our 1.3B models are
trained on a Cerebras 16 CS-2 cluster with a total of 80 PFLOP/s in bf16
mixed precision. We further extend our discoveries (such as the finding that
increasing data diversity is crucial after global deduplication) to a 7B model
with large batch-size training. Our models and the separate SlimPajama-DC
datasets are available at https://huggingface.co/MBZUAI-LLM and
https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Comment: Technical report.
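The difference between local deduplication (within a single source) and global deduplication (across sources) can be made concrete with a small sketch. It uses exact matching on a normalized SHA-256 fingerprint as a simplified stand-in for whatever near-duplicate detection a real pipeline applies. The helper names and the toy corpus are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash of the lowercased, whitespace-collapsed text (exact-match stand-in)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources):
    """Remove duplicates inside each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # reset per source
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources):
    """Remove duplicates across all sources with one shared 'seen' set.

    Sources are processed in dictionary order, so an earlier source keeps
    the document when the same text appears in several sources.
    """
    seen = set()  # shared by every source
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Toy corpus: "Hello world" appears twice in "web" and once in "wiki".
sources = {
    "web": ["Hello world", "hello   world", "Unique web page"],
    "wiki": ["Hello World", "Unique wiki article"],
}
local_result = local_dedup(sources)
global_result = global_dedup(sources)
```

On the toy corpus, local deduplication keeps the copy of "Hello World" in `wiki` because it only looks inside each source, while global deduplication drops it because the same text was already kept from `web`.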
Benchmarking reconstructive spectrometer with multi-resonant cavities
Recent years have seen the rapid development of miniaturized reconstructive
spectrometers (RSs), yet they still confront a range of technical challenges,
such as bandwidth/resolution ratio, sensing speed, and/or power efficiency.
Reported RS designs often suffer from insufficient decorrelation between
sampling channels, which results in limited compressive sampling efficiency, in
essence, due to inadequate engineering of sampling responses. This in turn
leads to poor spectral-pixel-to-channel ratios (SPCRs), typically restricted to
single digits. So far, a general guideline for manipulating RS sampling
responses for effective spectral information acquisition has been lacking.
In this study, we shed light on a fundamental parameter from compressive
sensing theory, the average mutual correlation coefficient v, and provide
insight into how it serves as a critical benchmark in RS design with regard to
the SPCR and reconstruction accuracy. To this end, we propose a novel RS design
with multi-resonant cavities, consisting of a series of partially reflective
interfaces. Such a multi-cavity configuration offers an expansive parameter
space, facilitating the superlative optimization of sampling matrices with
minimized v. As a proof-of-concept demonstration, a single-shot, dual-band RS
is implemented on a SiN platform, tailored for capturing signature spectral
shapes across different wavelength regions, with customized photonic crystal
nanobeam mirrors. Experimentally, the device demonstrates an overall operation
bandwidth of 270 nm and a <0.5 nm resolution with only 15 sampling channels per
band, leading to a record high SPCR of 18.0. Moreover, the proposed
multi-cavity design can be readily adapted to various photonic platforms. For
instance, we showcase that by employing multi-layer coatings, an
ultra-broadband RS can be optimized to exhibit a 700 nm bandwidth with an SPCR
of over 100.
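Two quantities in this abstract lend themselves to a short numerical sketch. The abstract does not define v or the SPCR formula, so both definitions below are assumptions: v is taken as the mean absolute Pearson correlation over all pairs of channel responses, and SPCR as the number of resolvable spectral pixels (bandwidth divided by resolution) per sampling channel. Under that reading, and treating the stated "<0.5 nm" resolution as exactly 0.5 nm, the reported figures reproduce an SPCR of 18.0.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_mutual_correlation(channel_responses):
    """Assumed definition of v: mean |correlation| over all channel pairs.

    Each row is one sampling channel's transmission response across
    wavelength. Lower values mean better decorrelated channels.
    """
    pairs = list(combinations(channel_responses, 2))
    return sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs)


def spcr(bandwidth_nm, resolution_nm, channels_per_band, bands):
    """Assumed definition: spectral pixels divided by total sampling channels."""
    spectral_pixels = bandwidth_nm / resolution_nm
    return spectral_pixels / (channels_per_band * bands)


# Figures from the abstract: 270 nm bandwidth, 0.5 nm resolution,
# 15 channels per band, dual-band device.
reported = spcr(270, 0.5, 15, 2)
```

A design search would then vary the cavity parameters to minimize `average_mutual_correlation` over the resulting sampling matrix.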
Title2Event: Benchmarking Open Event Extraction with a Large-scale Chinese Title Dataset
Event extraction (EE) is crucial to downstream tasks such as news aggregation
and event knowledge graph construction. Most existing EE datasets manually
define fixed event types and design a specific schema for each of them, failing
to cover diverse events emerging from the online text. Moreover, news titles,
an important source of event mentions, have not gained enough attention in
current EE research. In this paper, we present Title2Event, a large-scale
sentence-level dataset benchmarking Open Event Extraction without restricting
event types. Title2Event contains more than 42,000 news titles in 34 topics
collected from Chinese web pages. To the best of our knowledge, it is currently
the largest manually-annotated Chinese dataset for open event extraction. We
further conduct experiments on Title2Event with different models and show that
the characteristics of titles make it challenging for event extraction,
underscoring the need for further study of this problem. The dataset and
baseline codes are available at https://open-event-hub.github.io/title2event.
Comment: EMNLP 202
An improved positioning algorithm in a long-range asymmetric perimeter security system
In this paper, an improved positioning algorithm is proposed for a long-range asymmetric perimeter security system. The algorithm uses the zero-crossing rate to detect the starting point of the disturbance, and then applies an improved empirical mode decomposition to obtain the effective time-frequency distribution of the extracted signal. Finally, cross-correlation is used to estimate the time delay of the effective extracted signal. The scheme is also verified and analyzed experimentally. The field test results demonstrate that, at a sensing length of 75 km, 96.60% of positioning errors fall within ±20 m, which significantly improves the positioning accuracy for long-range asymmetric fence perimeter applications.
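Two of the steps named in this abstract, the zero-crossing rate and cross-correlation delay estimation, can be sketched in a few lines. The sketch omits the improved empirical mode decomposition entirely and makes no claim about how the authors map delay to position. The function names, the brute-force lag search, and the toy signals are assumptions for illustration.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def estimate_delay(reference, delayed, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means `delayed` lags behind `reference`.
    Brute-force search over lags in [-max_lag, max_lag].
    """
    best_lag = 0
    best_score = float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals: the same pulse arrives 4 samples later on the second channel.
pulse = [0.0, 1.0, 3.0, 1.0, 0.0]
reference = [0.0] * 10 + pulse + [0.0] * 15
delayed = [0.0] * 14 + pulse + [0.0] * 11
lag_samples = estimate_delay(reference, delayed, max_lag=8)

# Converting to seconds requires the sampling rate (hypothetical value here).
sample_rate_hz = 1_000_000
delay_seconds = lag_samples / sample_rate_hz
```

In practice the zero-crossing rate would be computed over sliding frames to flag where a disturbance begins, and the delay would be estimated only on the segment that follows that point.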
Nutrient availability contributes to structural and functional diversity of microbiome in Xinjiang oilfield
Indigenous microbial enhanced oil recovery (IMEOR) is a promising alternative way to promote oil recovery. It activates oil recovery microorganisms in the reservoir by adding nutrients to the injected water, utilizing microbial growth and metabolism to enhance recovery. However, few studies have focused on the impact of injected nutrients on reservoir microbial community composition and potential functions, which limits the further strategic development of IMEOR. In this study, we investigated the effects of nutrients on the composition and functions of the reservoir bacterial community in the Qizhong block of Xinjiang Oilfield, China, by constructing a long-core microbial flooding simulation device. The results showed that the microbial community structure of the reservoir changed from an aerobic state to an anaerobic state after nutrient injection. Reducing the nutrient concentration increased the diversity and network stability of the reservoir bacterial community, and the nitrogen metabolism function showed the same response. Overall, these results indicate that nutrients significantly affect the community structure and function of reservoir microorganisms. Injecting low concentrations of nutrients may be more beneficial for improving oil recovery. This study is of great significance for guiding IMEOR technology and saving costs at the field site.
Pioglitazone Improves Mitochondrial Function in the Remnant Kidney and Protects against Renal Fibrosis in 5/6 Nephrectomized Rats
Pioglitazone is a peroxisome proliferator-activated receptor γ (PPARγ) agonist and has been demonstrated to be effective in the treatment of chronic kidney disease (CKD). However, the mechanism underlying the renoprotection of pioglitazone has not been fully revealed. In the present study, the renoprotective mechanism of pioglitazone was investigated in 5/6 nephrectomized (Nx) rats and TGF-β1-exposed HK-2 cells. Pioglitazone attenuated renal injury and improved renal function, as examined by 24 h urinary protein, blood urea nitrogen and plasma creatinine in Nx rats. Renal fibrosis and the enhanced expression of the profibrotic proteins TGF-β1, fibronectin and collagen I caused by Nx were significantly alleviated by pioglitazone. In addition, pioglitazone protected mitochondrial functions by stabilizing the mitochondrial membrane potential, inhibiting ROS generation, maintaining ATP production and the activities of complexes I and III, and preventing cytochrome C leakage from mitochondria. Pioglitazone also upregulated the expression levels of ATP synthase β, COX I and NDUFB8, which were downregulated in the kidney of Nx rats and TGF-β1-exposed HK-2 cells. Furthermore, pioglitazone increased the expression of the fusion proteins Opa-1 and Mfn2 and decreased the expression of the fission protein Drp1. The results imply that pioglitazone may exert its renoprotective effects by modulating the mitochondrial electron transport chain and mitochondrial dynamics in CKD. Finally, these recoveries were completely or partly inhibited by GW9662, which suggests that these effects are at least partly PPARγ-dependent. This study provides evidence for the pharmacological mechanism of pioglitazone in the treatment of CKD.
Simultaneous bond-selective deuterium-based isotopic labeling sensing with disposable ultra-miniature CARS fiber probe
Deuterium-based isotopic labeling is an important technique for tracking cellular metabolism through Raman signal analysis of low-wavenumber (LW) C–D bonds and high-wavenumber (HW) C–H bonds. We propose and demonstrate a disposable ultra-miniature fiber probe that detects LW and HW coherent anti-Stokes Raman scattering (CARS) spectra for simultaneous, bond-selective sensing of deuterated compounds. The 10.78 µm diameter disposable fiber probe, comprising a focusing taper as the probe head and a time-domain walk-off eliminating fiber section of designed length, delivers and focuses dual Stokes pulses with a wide frequency interval. The fiber probe enables quantitative concentration determination with a resolution down to 11 mM. The chemical vibration modes of LW-region C–D bonds and HW-region C–H bonds in mixture samples of organic compounds and their deuterated counterparts in a simulated cell are simultaneously excited and characterized. The disposable CARS fiber probe offers a promising tool for in vivo biochemical detection based on isotopic labeling sensing.
Improving detection and notification of tuberculosis cases in students in Shaanxi province, China: an intervention study
Background: Cooperation between different public and private health institutes involved in tuberculosis (TB) control has proven to enhance TB control in different settings. In China, such a mechanism has not been set up yet between Centers for Disease Control (CDCs) and university hospitals despite an increased TB incidence among students. This study aims to improve arrival of TB suspects identified by universities at the CDCs in order to manage them under standardized, directly observed treatment-short course (DOTS) conditions according to the National Tuberculosis Programme (NTP) guidelines.
Methods: Five matched pairs of universities were randomly assigned to the control and intervention group. After a baseline survey, a cooperation mechanism between local CDCs and university hospitals was set up in the intervention group. The effects on referral of TB suspects to the local CDC, tracing by the local CDC, and arrival at the local CDCs were assessed. Differences were tested by means of the chi-square test.
Results: During the baseline survey, the referral, tracing and arrival rates were between 37% and 46%. After implementation of the cooperation mechanism, these rates had not changed in the control group but increased significantly in the intervention group: the referral, tracing and arrival rates were 97%, 95%, and 93%, respectively.
Conclusions: It is feasible and effective to set up cooperation between CDCs and university hospitals to increase the number of TB suspects examined by CDCs and increase the number of TB patients treated under DOTS conditions. These public-public mix (PPM) activities should be expanded to cover all other university hospitals in China.