802 research outputs found
Hindi language text search: a literature review
The literature review focuses on the major problems of Hindi text searching over the web. The review reveals the availability of a number of techniques and search engines that have been developed to facilitate Hindi text searching. Among the many problems, a dominant one arises when the text being searched is formed from combinatorial characters or words.
Development of Multilingual Resource Management Mechanisms for Libraries
Multilingualism is an important concept for any library. This study is built on global recommendations and local requirements for individual libraries: selecting the multilingual components for setting up a multilingual cluster in different libraries for each user, and developing a multilingual environment for accessing and retrieving library resources by users as well as library professionals. The methodology for integrating Google Indic Transliteration into libraries follows six steps: (i) selection of transliteration tools, (ii) comparison of the tools, (iii) integration methods in Koha, (iv) development of Google Indic Transliteration in Koha for users, (v) testing, and (vi) results. Development of a multilingual framework is also an important task in an integrated library system, and this section follows these steps: (i) Bengali language installation in Koha, (ii) setting multilingual system preferences in Koha, (iii) translating the modules, and (iv) the Bengali interface in Koha. The study also shows the Bengali data entry process in Koha, namely data entry through iBus Avro phonetics and data entry through a virtual keyboard. Multilingual digital resource management for libraries is developed using DSpace and Greenstone. Multilingual management is addressed in different areas, such as federated searching (the VuFind multilingual discovery tool; multilingual retrieval through OAI-PMH; multilingual data import through a Z39.50 server), and multilingual bibliographic data are edited through MarcEdit for better management of the integrated library management system.
The study also creates and edits content using a content management system tool for efficient and effective retrieval of multilingual digital content resources among users.
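Bengali records entered through a phonetic keyboard or virtual keyboard can be sanity-checked before cataloguing by verifying that the text really falls in the Bengali Unicode block (U+0980 to U+09FF). A minimal sketch of such a check, assuming a simple majority-of-letters heuristic (the helper is illustrative and not part of Koha itself):

```python
# Minimal sketch: decide whether a string is (mostly) Bengali-script Unicode
# before accepting it as Bengali bibliographic data.
# The Bengali block is U+0980..U+09FF. Combining vowel signs (category Mc/Mn)
# are not counted by str.isalpha(), so only base letters are considered.

def is_bengali(text: str, threshold: float = 0.5) -> bool:
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return bengali / len(letters) >= threshold

# Example: a Bengali-script title versus a Latin-script one.
print(is_bengali("বাংলা ভাষা"))       # Bengali script
print(is_bengali("Bengali language"))  # Latin script
```

The threshold allows mixed records (for example a Bengali title with a Latin-script series statement) to pass, which is a design choice rather than a Koha requirement.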
Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval
Dense retrieval models have predominantly been studied for English, where
models have shown great success, due to the availability of human-labeled
training pairs. However, there has been limited success for multilingual
retrieval so far, as training data is uneven or scarcely available across
multiple languages. Synthetic training data generation is promising (e.g.,
InPars or Promptagator), but has been investigated only for English. Therefore,
to study model capabilities across both cross-lingual and monolingual retrieval
tasks, we develop SWIM-IR, a synthetic retrieval training dataset containing 33
(high to very-low resource) languages for training multilingual dense retrieval
models without requiring any human supervision. To construct SWIM-IR, we
propose SAP (summarize-then-ask prompting), where the large language model
(LLM) generates a textual summary prior to the query generation step. SAP
assists the LLM in generating informative queries in the target language. Using
SWIM-IR, we explore synthetic fine-tuning of multilingual dense retrieval
models and evaluate them robustly on three retrieval benchmarks: XOR-Retrieve
(cross-lingual), XTREME-UP (cross-lingual) and MIRACL (monolingual). Our
models, called SWIM-X, are competitive with human-supervised dense retrieval
models such as mContriever, showing that SWIM-IR can cheaply substitute for
expensive human-labeled retrieval training data.
Comment: Data released at https://github.com/google-research-datasets/swim-i
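The two-step SAP scheme described above can be sketched as a pair of chained prompts: the model first summarizes the passage, then generates a query in the target language conditioned on both the passage and its summary. The `generate` function below is a hypothetical stand-in for a real LLM call, and the prompt wording is illustrative rather than the exact SWIM-IR prompt:

```python
# Sketch of SAP (summarize-then-ask prompting) for synthetic query generation.
# `generate` is a placeholder for a real LLM call; a production system would
# call an actual model API here.

def generate(prompt: str) -> str:
    # Placeholder LLM: echoes a tag so the control flow is testable offline.
    return "<llm output for: " + prompt[:40] + "...>"

def sap_query(passage: str, target_language: str) -> str:
    # Step 1: summarize the passage, so the model focuses on salient content.
    summary = generate(
        f"Summarize the following passage in 2-3 sentences:\n\n{passage}"
    )
    # Step 2: generate a search query in the target language, conditioned on
    # both the original passage and its summary.
    query = generate(
        f"Passage: {passage}\n"
        f"Summary: {summary}\n"
        f"Write a natural search query in {target_language} that this "
        f"passage would answer:"
    )
    return query
```

Pairing each generated query with its source passage then yields the (query, positive passage) training pairs used to fine-tune a dense retriever.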
Text Summarization Technique for Punjabi Language Using Neural Networks
In the contemporary world, the use of digital content has risen exponentially: newspaper and web
articles, status updates, advertisements, etc. have become an integral part of our daily routine. There is therefore a need
for an automated system that summarizes such large text documents in order to save time and effort. Summarizers for
languages such as English, where work began in the 1950s, have by now reached a mature stage, but several languages,
such as Punjabi, still need special attention. The Punjabi language is much richer in morphological structure than English
and other foreign languages. In this work, we provide a three-phase extractive summarization methodology using neural
networks that induces a compendious summary of a single Punjabi text document. The methodology incorporates a
pre-processing phase that cleans the text, a processing phase that extracts statistical and linguistic features, and a
classification phase. The classification neural network applies a sigmoid activation function and gradient-descent
optimization for weighted error reduction to generate the resultant output summary. The proposed summarization system
is applied to a monolingual Punjabi text corpus from the Indian Languages Corpora Initiative phase II. The precision,
recall and F-measure achieved are 90.0%, 89.28% and 89.65% respectively, which is reasonably good in comparison to
the performance of other existing Indian-language summarizers.
This research is partially funded by the Ministry of Economy, Industry and Competitiveness, Spain (CSO2017-86747-R)
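The three-phase pipeline can be sketched end to end: pre-processing splits and cleans the text, feature extraction computes per-sentence statistics, and a sigmoid over a weighted feature sum scores each sentence. The two features and fixed weights below are illustrative stand-ins; the paper's system extracts a richer statistical and linguistic feature set and learns the weights by gradient descent on labeled data:

```python
import math
import re

# Sketch of the three-phase extractive pipeline: (1) pre-processing,
# (2) feature extraction, (3) sigmoid classification of sentences.
# Features and weights here are illustrative; the actual system trains
# its weights with gradient-descent optimization.

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def summarize(text: str, n_sentences: int = 2, weights=(1.0, 1.0)) -> str:
    # Phase 1: pre-processing -- split into sentences and strip whitespace.
    # The Punjabi danda '।' is included as a sentence delimiter.
    sentences = [s.strip() for s in re.split(r"[.?!।]", text) if s.strip()]

    scored = []
    for i, s in enumerate(sentences):
        # Phase 2: feature extraction -- sentence position and relative length.
        position = 1.0 - i / max(len(sentences), 1)          # earlier = higher
        length = len(s.split()) / max(len(text.split()), 1)  # longer = higher
        # Phase 3: classification -- sigmoid of the weighted feature sum.
        score = sigmoid(weights[0] * position + weights[1] * length)
        scored.append((score, i, s))

    # Keep the top-scoring sentences, restored to document order.
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return ". ".join(s for _, _, s in top)
```

With trained weights and richer features (keywords, cue phrases, noun counts), the same sigmoid scoring step decides which sentences enter the final summary.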