Text Summarization Technique for Punjabi Language Using Neural Networks
In the contemporary world, utilization of digital content has risen exponentially. For example, newspaper and web
articles, status updates, advertisements etc. have become an integral part of our daily routine. Thus, there is a need to build
an automated system to summarize such large text documents in order to save time and effort. Summarizers for languages
such as English have matured since work began in the 1950s, but several languages, Punjabi among them, still need special
attention. Punjabi is morphologically much richer than English and other foreign languages. In this work, we present a
three-phase extractive summarization methodology based on neural networks that produces a concise summary of a single
Punjabi text document. The methodology comprises a pre-processing phase that cleans the text; a processing phase that
extracts statistical and linguistic features; and a classification phase, in which a neural network applies a sigmoid
activation function and gradient-descent optimization for weighted error reduction to generate the output summary. The proposed
summarization system is applied over monolingual Punjabi text corpus from Indian languages corpora initiative phase-II.
The precision, recall and F-measure achieved are 90.0%, 89.28% and 89.65% respectively, which is reasonably good in
comparison to the performance of existing summarizers for other Indian languages.
This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R).
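The classification phase described above can be sketched as a single sigmoid unit trained by gradient descent on sentence feature vectors. This is a minimal illustration under assumed toy features and labels, not the authors' actual model or data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lr=0.5, epochs=500):
    """Fit a single sigmoid unit by gradient descent on squared error."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # sentence scores in (0, 1)
        grad = (p - y) * p * (1 - p)      # d(squared error)/d(pre-activation)
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

# Hypothetical per-sentence features: [tf-idf score, position weight, length ratio]
X = np.array([[0.9, 1.0, 0.8],
              [0.2, 0.1, 0.3],
              [0.8, 0.9, 0.7],
              [0.1, 0.2, 0.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])        # 1 = sentence belongs in the summary

w, b = train(X, y)
scores = sigmoid(X @ w + b)
summary_idx = np.argsort(scores)[::-1][:2]  # keep the top-2 scoring sentences
```

In an extractive setting like this, the network only ranks existing sentences; the summary is assembled from the highest-scoring ones in document order.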
Focused Transformer: Contrastive Training for Context Scaling
Large language models have an exceptional capability to incorporate new
information in a contextual manner. However, the full potential of such an
approach is often restrained due to a limitation in the effective context
length. One solution to this issue is to endow an attention layer with access
to an external memory, which comprises (key, value) pairs. Yet, as the
number of documents increases, the proportion of relevant keys to irrelevant
ones decreases, leading the model to focus more on the irrelevant keys. We
identify a significant challenge, dubbed the distraction issue, where keys
linked to different semantic values might overlap, making them hard to
distinguish. To tackle this problem, we introduce the Focused Transformer
(FoT), a technique that employs a training process inspired by contrastive
learning. This novel approach enhances the structure of the (key, value) space,
enabling an extension of the context length. Our method allows for fine-tuning
pre-existing, large-scale models to lengthen their effective context, as we
demonstrate by fine-tuning OpenLLaMA checkpoints. The resulting models, which
we name LongLLaMA, exhibit advancements in tasks requiring a long context. We
further illustrate that our LongLLaMA models adeptly manage extended context
lengths for passkey retrieval.
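The distraction issue can be illustrated numerically: as irrelevant keys accumulate in an external memory, the softmax attention mass on the one relevant key shrinks. This is a toy sketch of that dilution effect, not the FoT training procedure itself:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 16
rng = np.random.default_rng(1)
query = rng.normal(size=d)
relevant = query + 0.1 * rng.normal(size=d)  # one key well aligned with the query

def relevant_mass(n_irrelevant):
    """Attention mass on the relevant key amid n_irrelevant random keys."""
    keys = np.vstack([relevant] + [rng.normal(size=d) for _ in range(n_irrelevant)])
    attn = softmax(keys @ query / np.sqrt(d))  # scaled dot-product attention
    return attn[0]

small_memory = relevant_mass(10)
large_memory = relevant_mass(1000)
# the relevant key receives far less attention as the memory grows
```

Contrastive training counteracts exactly this: by structuring the (key, value) space so that keys tied to different semantic values stay separable, the relevant key keeps winning the softmax even in a large memory.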
Heterodyne Receiver Development at the Caltech Submillimeter Observatory
The Caltech Submillimeter Observatory (CSO) operates at the summit of Mauna Kea, Hawaii, at an elevation of 4200 m. The site was chosen for its very dry climate and stable atmosphere, enabling submillimeter observations in the astrophysically important 1.3 mm to 300 μm atmospheric windows. Ever since its inception, the CSO has proven itself to be a productive test-bed for new detector technologies. In this paper we review the heterodyne (coherent) receiver development at the CSO, and highlight some of the ways it has helped to shape the field of submillimeter and terahertz high spectral resolution far-infrared astronomy.
SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents
The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB
Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision
Clinical trials are essential for drug development but are extremely
expensive and time-consuming to conduct. It is beneficial to study similar
historical trials when designing a clinical trial. However, lengthy trial
documents and lack of labeled data make trial similarity search difficult. We
propose a zero-shot clinical trial retrieval method, Trial2Vec, which learns
through self-supervision without annotating similar clinical trials.
Specifically, the meta-structure of trial documents (e.g., title, eligibility
criteria, target disease) along with clinical knowledge (e.g., UMLS knowledge
base https://www.nlm.nih.gov/research/umls/index.html) are leveraged to
automatically generate contrastive samples. In addition, Trial2Vec encodes
trial documents with their meta-structure taken into account, producing compact
embeddings that aggregate multi-aspect information from the whole document. We
show through visualization that our method yields medically interpretable
embeddings, and that it achieves a 15% average improvement over the best
baselines in precision/recall for trial retrieval, evaluated on 1,600 trial
pairs that we labeled. Furthermore, we show that the pre-trained embeddings
benefit the downstream trial outcome prediction task over 240k trials. Software
is available at https://github.com/RyanWangZf/Trial2Vec.
Comment: Findings of EMNLP 202
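Once trial documents are encoded into compact embeddings, zero-shot retrieval reduces to nearest-neighbor search. A hedged sketch (the embeddings below are random stand-ins, not Trial2Vec outputs) of cosine-similarity ranking over a small corpus:

```python
import numpy as np

def cosine_sim(query_vec, corpus_mat):
    """Cosine similarity between one query vector and each corpus row."""
    q = query_vec / np.linalg.norm(query_vec)
    C = corpus_mat / np.linalg.norm(corpus_mat, axis=1, keepdims=True)
    return C @ q

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 64))               # stand-in trial embeddings
query = corpus[2] + 0.05 * rng.normal(size=64)  # a query close to trial 2

ranked = np.argsort(cosine_sim(query, corpus))[::-1]  # most similar first
```

In practice the corpus matrix would hold one precomputed embedding per historical trial, so a new trial design can be compared against the full registry with a single matrix-vector product.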