Text Summarization Technique for Punjabi Language Using Neural Networks
In the contemporary world, utilization of digital content has risen exponentially. For example, newspaper and web
articles, status updates, advertisements etc. have become an integral part of our daily routine. Thus, there is a need to build
an automated system to summarize such large text documents in order to save time and effort. Summarizers for languages
such as English have matured since work began in the 1950s, but several languages, Punjabi among them, still need special
attention. Punjabi is morphologically much richer than English and other foreign languages. In this work, we present a
three-phase extractive summarization methodology based on neural networks that produces a concise summary of a single
Punjabi text document. The methodology comprises a pre-processing phase that cleans the text; a processing phase that
extracts statistical and linguistic features; and a classification phase, in which a neural network applies a sigmoid
activation function and gradient-descent optimization for weighted error reduction to generate the output summary. The proposed
summarization system is applied over monolingual Punjabi text corpus from Indian languages corpora initiative phase-II.
The precision, recall and F-measure achieved are 90.0%, 89.28% and 89.65% respectively, which is reasonably good in
comparison to the performance of existing summarizers for other Indian languages.
This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R).
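The classification phase described above can be sketched as a single sigmoid unit trained by gradient descent on sentence feature vectors. This is a minimal illustration under assumed toy features and labels, not the authors' actual model or data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    """Fit a single sigmoid unit by gradient descent on squared error."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # sentence scores in (0, 1)
        grad = (p - y) * p * (1 - p)      # d(squared error)/d(pre-activation)
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Hypothetical per-sentence features: [tf-idf score, position weight, length ratio]
X = np.array([[0.9, 1.0, 0.8],
              [0.2, 0.1, 0.3],
              [0.8, 0.9, 0.7],
              [0.1, 0.2, 0.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])        # 1 = sentence belongs in the summary

w, b = train(X, y)
scores = sigmoid(X @ w + b)
summary_idx = np.argsort(scores)[::-1][:2]  # keep the top-2 scoring sentences
```

In an extractive setting like this, the network only ranks existing sentences; the summary is assembled from the highest-scoring ones in document order.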
Focused Transformer: Contrastive Training for Context Scaling
Large language models have an exceptional capability to incorporate new
information in a contextual manner. However, the full potential of such an
approach is often restrained due to a limitation in the effective context
length. One solution to this issue is to endow an attention layer with access
to an external memory, which comprises (key, value) pairs. Yet, as the
number of documents increases, the proportion of relevant keys to irrelevant
ones decreases, leading the model to focus more on the irrelevant keys. We
identify a significant challenge, dubbed the distraction issue, where keys
linked to different semantic values might overlap, making them hard to
distinguish. To tackle this problem, we introduce the Focused Transformer
(FoT), a technique that employs a training process inspired by contrastive
learning. This novel approach enhances the structure of the (key, value) space,
enabling an extension of the context length. Our method allows for fine-tuning
pre-existing, large-scale models to lengthen their effective context, as we
demonstrate by fine-tuning OpenLLaMA checkpoints. The resulting models, which
we name LongLLaMA, exhibit advancements in tasks requiring a long context. We
further illustrate that our LongLLaMA models adeptly manage extended context
lengths for passkey retrieval.
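The distraction issue can be illustrated numerically: as irrelevant keys accumulate in an external memory, the softmax attention mass on the one relevant key shrinks. This is a toy sketch of that dilution effect, not the FoT training procedure itself:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 16
rng = np.random.default_rng(1)
query = rng.normal(size=d)
relevant = query + 0.1 * rng.normal(size=d)  # one key well aligned with the query

def relevant_mass(n_irrelevant):
    """Attention mass on the relevant key amid n_irrelevant random keys."""
    keys = np.vstack([relevant] + [rng.normal(size=d) for _ in range(n_irrelevant)])
    attn = softmax(keys @ query / np.sqrt(d))  # scaled dot-product attention
    return attn[0]

small_memory = relevant_mass(10)
large_memory = relevant_mass(1000)
# the relevant key receives far less attention as the memory grows
```

Contrastive training counteracts exactly this: by structuring the (key, value) space so that keys tied to different semantic values stay separable, the relevant key keeps winning the softmax even in a large memory.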
Heterodyne Receiver Development at the Caltech Submillimeter Observatory
The Caltech Submillimeter Observatory (CSO) operates at the summit of Mauna Kea, Hawaii, at an elevation of 4200 m. The site was chosen for its very dry climate and stable atmosphere, enabling submillimeter observations in the astrophysically important 1.3 mm to 300 μm atmospheric windows. Ever since its inception, the CSO has proven itself to be a productive test-bed for new detector technologies. In this paper we review the heterodyne (coherent) receiver development at the CSO, and highlight some of the ways it has helped to shape the field of submillimeter and terahertz high spectral resolution far-infrared astronomy.
SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents
The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB
Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision
Clinical trials are essential for drug development but are extremely
expensive and time-consuming to conduct. It is beneficial to study similar
historical trials when designing a clinical trial. However, lengthy trial
documents and lack of labeled data make trial similarity search difficult. We
propose a zero-shot clinical trial retrieval method, Trial2Vec, which learns
through self-supervision without annotating similar clinical trials.
Specifically, the meta-structure of trial documents (e.g., title, eligibility
criteria, target disease) along with clinical knowledge (e.g., UMLS knowledge
base https://www.nlm.nih.gov/research/umls/index.html) are leveraged to
automatically generate contrastive samples. In addition, Trial2Vec encodes
trial documents with their meta-structure taken into account, producing compact
embeddings that aggregate multi-aspect information from the whole document. We
show through visualization that our method yields medically interpretable
embeddings, and that it achieves a 15% average improvement over the best
baselines in precision/recall for trial retrieval, evaluated on 1,600 trial
pairs that we labeled. Furthermore, we show that the pre-trained embeddings
benefit the downstream trial outcome prediction task over 240k trials. Software
is available at https://github.com/RyanWangZf/Trial2Vec.
Comment: Findings of EMNLP 202
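Once trial documents are encoded into compact embeddings, zero-shot retrieval reduces to nearest-neighbor search. A hedged sketch (the embeddings below are random stand-ins, not Trial2Vec outputs) of cosine-similarity ranking over a small corpus:

```python
import numpy as np

def cosine_sim(query_vec, corpus_mat):
    """Cosine similarity between one query vector and each corpus row."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_mat / np.linalg.norm(corpus_mat, axis=1, keepdims=True)
    return C @ q

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 64))               # stand-in trial embeddings
query = corpus[2] + 0.05 * rng.normal(size=64)  # a query close to trial 2

ranked = np.argsort(cosine_sim(query, corpus))[::-1]  # most similar first
```

In practice the corpus matrix would hold one precomputed embedding per historical trial, so a new trial design can be compared against the full registry with a single matrix-vector product.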