103 research outputs found

    Event-driven Real-time Retrieval in Web Search

    Information retrieval in real-time search presents challenges distinct from those encountered in classical web search. These challenges are particularly pronounced because user search intent changes rapidly, influenced by the occurrence and evolution of breaking news events such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focus on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and run an A/B test in a real online system to verify its performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.
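
The cross-attention step mentioned in this abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with no learned projection matrices and a simple residual addition; the function name, the array shapes, and the fusion step are hypothetical and are not taken from the paper.

```python
import numpy as np

def cross_attention(query_tokens, event_tokens):
    """Let each query token attend over the event tokens.

    query_tokens: (num_query_tokens, dim) array
    event_tokens: (num_event_tokens, dim) array
    Returns the attended vectors and the attention weights.
    Assumption: no learned Q/K/V projections, unlike a trained model.
    """
    dim = query_tokens.shape[-1]
    scores = query_tokens @ event_tokens.T / np.sqrt(dim)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ event_tokens, weights

rng = np.random.default_rng(0)
query = rng.standard_normal((4, 8))   # toy query token embeddings
event = rng.standard_normal((3, 8))   # toy event token embeddings
attended, weights = cross_attention(query, event)
fused = query + attended              # hypothetical residual fusion
```

Each row of `weights` sums to 1, so every query token receives a convex combination of event token vectors.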

    Event-Centric Query Expansion in Web Search

    In search engines, query expansion (QE) is a crucial technique for improving the search experience. Previous studies often rely on long-term search log mining, which leads to slow updates and is sub-optimal for time-sensitive news searches. In this work, we present Event-Centric Query Expansion (EQE), a novel QE system that addresses these issues by mining the best expansion from a large number of potential events rapidly and accurately. The system consists of four stages: event collection, event reformulation, semantic retrieval, and online ranking. Specifically, we first collect and filter news headlines from websites. We then propose a generation model that incorporates contrastive learning and prompt-tuning techniques to reformulate these headlines into concise candidates. Additionally, we fine-tune a dual-tower semantic model to function as an encoder for event retrieval and explore a two-stage contrastive training approach to enhance the accuracy of event retrieval. Finally, we rank the retrieved events and select the optimal one as the QE, which is then used to improve the retrieval of event-related documents. Through offline analysis and online A/B testing, we observe that the EQE system significantly improves many metrics compared to the baseline. The system has been deployed in Tencent QQ Browser Search and has served hundreds of millions of users. The dataset and baseline codes are available at https://open-event-hub.github.io/eqe. Comment: ACL 2023 Industry Trac

    SlimPajama-DC: Understanding Data Combinations for LLM Training

    This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within a single source of dataset) deduplication affect the performance of trained models. (2) Proportions of high-quality/highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train each one using a 1.3B Cerebras-GPT model with ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as the finding that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B. Comment: Technical report. Huggingface: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627

    Benchmarking reconstructive spectrometer with multi-resonant cavities

    Recent years have seen the rapid development of miniaturized reconstructive spectrometers (RSs), yet they still confront a range of technical challenges, such as bandwidth/resolution ratio, sensing speed, and/or power efficiency. Reported RS designs often suffer from insufficient decorrelation between sampling channels, which results in limited compressive sampling efficiency, in essence due to inadequate engineering of sampling responses. This in turn leads to poor spectral-pixel-to-channel ratios (SPCRs), typically restricted to single digits. So far, a general guideline for manipulating RS sampling responses to acquire spectral information effectively has been lacking. In this study, we shed light on a fundamental parameter from compressive sensing theory - the average mutual correlation coefficient v - and provide insight into how it serves as a critical benchmark in RS design with regard to the SPCR and reconstruction accuracy. To this end, we propose a novel RS design with multi-resonant cavities, consisting of a series of partially reflective interfaces. Such a multi-cavity configuration offers an expansive parameter space, facilitating the optimization of sampling matrices with minimized v. As a proof-of-concept demonstration, a single-shot, dual-band RS is implemented on a SiN platform, tailored for capturing signature spectral shapes across different wavelength regions, with customized photonic crystal nanobeam mirrors. Experimentally, the device demonstrates an overall operation bandwidth of 270 nm and a <0.5 nm resolution with only 15 sampling channels per band, leading to a record-high SPCR of 18.0. Moreover, the proposed multi-cavity design can be readily adapted to various photonic platforms. For instance, we showcase that by employing multi-layer coatings, an ultra-broadband RS can be optimized to exhibit a 700 nm bandwidth with an SPCR of over 100.
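
The two quantities this abstract relies on can be sketched numerically. The abstract does not give the exact definition of v, so the sketch below assumes it is the mean absolute Pearson correlation between the responses of distinct sampling channels. The SPCR helper likewise assumes one reading of the reported numbers: spectral pixels are bandwidth divided by resolution, and the dual-band device has 2 × 15 = 30 channels. Function names are hypothetical.

```python
import numpy as np

def average_mutual_correlation(responses):
    """Mean absolute correlation between distinct channel responses.

    responses: (num_channels, num_wavelengths) array, one row per
    sampling channel. Assumed definition, not taken from the paper.
    """
    corr = np.corrcoef(responses)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.mean(np.abs(corr[off_diagonal])))

def spcr(bandwidth_nm, resolution_nm, num_channels):
    """Spectral-pixel-to-channel ratio under the assumed reading."""
    spectral_pixels = bandwidth_nm / resolution_nm
    return spectral_pixels / num_channels

# Abstract's figures: 270 nm bandwidth, 0.5 nm resolution, 2 bands x 15 channels.
reported = spcr(270, 0.5, 2 * 15)
```

Under these assumptions the arithmetic gives 540 spectral pixels over 30 channels, which matches the SPCR of 18.0 stated in the abstract.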

    Title2Event: Benchmarking Open Event Extraction with a Large-scale Chinese Title Dataset

    Event extraction (EE) is crucial to downstream tasks such as news aggregation and event knowledge graph construction. Most existing EE datasets manually define fixed event types and design a specific schema for each of them, failing to cover the diverse events emerging from online text. Moreover, news titles, an important source of event mentions, have not gained enough attention in current EE research. In this paper, we present Title2Event, a large-scale sentence-level dataset benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages. To the best of our knowledge, it is currently the largest manually annotated Chinese dataset for open event extraction. We further conduct experiments on Title2Event with different models and show that the characteristics of titles make event extraction challenging, underscoring the significance of further study on this problem. The dataset and baseline codes are available at https://open-event-hub.github.io/title2event. Comment: EMNLP 202

    An improved positioning algorithm in a long-range asymmetric perimeter security system

    In this paper, an improved positioning algorithm is proposed for a long-range asymmetric perimeter security system. The algorithm employs the zero-crossing rate to detect the starting point of a disturbance, then uses an improved empirical mode decomposition to obtain the effective time-frequency distribution of the extracted signal. Finally, cross-correlation is used to estimate the time delay of the effective extracted signal. The scheme is also verified and analyzed experimentally. Field test results demonstrate that, at a sensing length of 75 km, 96.60% of positioning errors fall within ±20 m, which significantly improves the positioning accuracy for the long-range asymmetric fence perimeter application.
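
Two of the building blocks named in this abstract, the zero-crossing rate and cross-correlation delay estimation, can be sketched as follows. This is a generic illustration on synthetic data, not the paper's algorithm: the improved empirical mode decomposition step is omitted, and the function names and test signal are hypothetical.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def estimate_delay(reference, delayed):
    """Lag in samples at which `delayed` best matches `reference`."""
    corr = np.correlate(delayed, reference, mode="full")
    # In "full" mode, index len(reference) - 1 corresponds to zero lag.
    return int(np.argmax(corr)) - (len(reference) - 1)

# Synthetic disturbance: a noise burst that arrives 37 samples later
# on the second channel (values chosen only for illustration).
rng = np.random.default_rng(0)
burst = rng.standard_normal(200)
reference = np.concatenate([burst, np.zeros(100)])
delayed = np.concatenate([np.zeros(37), burst, np.zeros(63)])
lag = estimate_delay(reference, delayed)
```

In a real system the estimated lag would be converted to a position using the sampling rate and the propagation speed in the fiber, neither of which is given in the abstract.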

    Nutrient availability contributes to structural and functional diversity of microbiome in Xinjiang oilfield

    Indigenous microbial enhanced oil recovery (IMEOR) is a promising alternative way to promote oil recovery. It activates oil recovery microorganisms in the reservoir by adding nutrients to the injected water, utilizing microbial growth and metabolism to enhance recovery. However, few studies have focused on the impact of injected nutrients on reservoir microbial community composition and potential functions, which limits the further strategic development of IMEOR. In this study, we investigated the effects of nutrition on the composition and functions of the reservoir bacterial community in the Qizhong block of Xinjiang Oilfield, China, by constructing a long core microbial flooding simulation device. The results showed that the microbial community structure of the reservoir changed from an aerobic state to an anaerobic state after nutrient injection. Reducing the nutrient concentration increased the diversity and network stability of the reservoir bacterial community, and the nitrogen metabolism function showed the same response. Overall, these results indicated that nutrition significantly affected the community structure and function of reservoir microorganisms. Injecting low concentrations of nutrients may be more beneficial for improving oil recovery. This study is of great significance for guiding IMEOR technology and saving costs at the field site.

    Pioglitazone Improves Mitochondrial Function in the Remnant Kidney and Protects against Renal Fibrosis in 5/6 Nephrectomized Rats

    Pioglitazone is a peroxisome proliferator-activated receptor γ (PPARγ) agonist and has been demonstrated to be effective in the treatment of chronic kidney disease (CKD). However, the mechanism underlying the renoprotection of pioglitazone has not been fully revealed. In the present study, the renoprotective mechanism of pioglitazone was investigated in 5/6 nephrectomized (Nx) rats and TGF-β1-exposed HK-2 cells. Pioglitazone attenuated renal injury and improved renal function, as examined by 24 h urinary protein, blood urea nitrogen and plasma creatinine in Nx rats. Renal fibrosis and the enhanced expression of the profibrotic proteins TGF-β1, fibronectin and collagen I caused by Nx were significantly alleviated by pioglitazone. In addition, pioglitazone protected mitochondrial functions by stabilizing the mitochondrial membrane potential, inhibiting ROS generation, maintaining ATP production and the activities of complexes I and III, and preventing cytochrome C leakage from mitochondria. Pioglitazone also upregulated the expression levels of ATP synthase β, COX I and NDUFB8, which were downregulated in the kidney of Nx rats and TGF-β1-exposed HK-2 cells. Furthermore, pioglitazone increased the expression of the fusion proteins Opa-1 and Mfn2 and decreased the expression of the fission protein Drp1. The results imply that pioglitazone may exert its renoprotective effects by modulating the mitochondrial electron transport chain and mitochondrial dynamics in CKD. Finally, these recoveries were completely or partly inhibited by GW9662, which suggests that these effects are at least partly PPARγ-dependent. This study provides evidence for the pharmacological mechanism of pioglitazone in the treatment of CKD.

    Simultaneous bond-selective deuterium-based isotopic labeling sensing with disposable ultra-miniature CARS fiber probe

    Deuterium-based isotopic labeling is an important technique for tracking cellular metabolism through Raman signal analysis of low-wavenumber (LW) C–D bonds and high-wavenumber (HW) C–H bonds. We propose and demonstrate a disposable ultra-miniature fiber probe that detects LW and HW coherent anti-Stokes Raman scattering (CARS) spectra for simultaneous, bond-selective sensing of deuterated compounds. The 10.78 µm diameter disposable fiber probe, comprising a focusing taper as the fiber probe head and a time-domain walk-off eliminating fiber section of designed length, delivers and focuses dual Stokes pulses with a wide frequency interval. The fiber probe enables quantitative concentration determination with a resolution down to 11 mM. The chemical vibration modes of LW-region C–D bonds and HW-region C–H bonds in mixture samples of organic compounds and their deuterated counterparts in a simulated cell are simultaneously excited and characterized. The disposable CARS fiber probe offers a promising tool for in vivo biochemical detection based on isotopic labeling sensing.

    Improving detection and notification of tuberculosis cases in students in Shaanxi province, China: an intervention study

    Background: Cooperation between different public and private health institutes involved in tuberculosis (TB) control has proven to enhance TB control in different settings. In China, such a mechanism has not been set up yet between Centers for Disease Control (CDCs) and university hospitals despite an increased TB incidence among students. This study aims to improve arrival of TB suspects identified by universities at the CDCs in order to manage them under standardized, directly observed treatment-short course (DOTS) conditions according to the National Tuberculosis Programme (NTP) guidelines. Methods: Five matched pairs of universities were randomly assigned to the control and intervention group. After a baseline survey, a cooperation mechanism between local CDCs and university hospitals was set up in the intervention group. The effects on referral of TB suspects to the local CDC, tracing by the local CDC, and arrival at the local CDCs were assessed. Differences were tested by means of the chi-square test. Results: During the baseline survey, the referral, tracing and arrival rates were between 37% and 46%. After implementation of the cooperation mechanism, these rates had not changed in the control group but increased significantly in the intervention group: the referral, tracing and arrival rates were 97%, 95%, and 93%, respectively. Conclusions: It is feasible and effective to set up cooperation between CDCs and university hospitals to increase the number of TB suspects examined by CDCs and increase the number of TB patients treated under DOTS conditions. These public-public mix (PPM) activities should be expanded to cover all other university hospitals in China.