
    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP for Indic languages. Comment: Accepted to the Transactions of the Association for Computational Linguistics (TACL).
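
    A rough sketch of the mining step described above, in Python: it assumes sentences on both sides have already been embedded into a shared multilingual space (the embeddings below are random placeholders), uses exact rather than approximate nearest-neighbour search for brevity, and the function name and threshold are illustrative rather than taken from the Samanantar pipeline.

    # Margin-based parallel sentence mining sketch; embeddings are placeholders
    # for the output of a multilingual representation model.
    import numpy as np

    def mine_pairs(en_vecs, indic_vecs, k=4, threshold=1.05):
        """Keep English/Indic pairs whose similarity clearly exceeds the
        average similarity of their k nearest neighbours (margin criterion)."""
        # cosine similarity via normalised dot products (a real pipeline would
        # use approximate nearest neighbour search over millions of sentences)
        en = en_vecs / np.linalg.norm(en_vecs, axis=1, keepdims=True)
        xx = indic_vecs / np.linalg.norm(indic_vecs, axis=1, keepdims=True)
        sim = en @ xx.T                                  # (n_en, n_indic)

        # mean similarity to the k nearest neighbours, in both directions
        fwd_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per English sentence
        bwd_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per Indic sentence

        pairs = []
        for i in range(sim.shape[0]):
            j = int(sim[i].argmax())                     # best Indic candidate
            margin = sim[i, j] / (0.5 * (fwd_knn[i] + bwd_knn[j]))
            if margin >= threshold:
                pairs.append((i, j, float(margin)))
        return pairs

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        en_vecs = rng.normal(size=(100, 64))             # stand-in English embeddings
        indic_vecs = rng.normal(size=(120, 64))          # stand-in Indic embeddings
        print(mine_pairs(en_vecs, indic_vecs)[:5])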

    Interchanging Discrete Event Simulation Process Interaction Models Using the Web Ontology Language - OWL

    Discrete event simulation development requires significant investments in time and resources. Descriptions of discrete event simulation models are associated with world views, including the process interaction orientation. Historically, these models have been encoded using high-level programming languages or special purpose, typically vendor-specific, simulation languages. These approaches complicate simulation model reuse and interchange. The current document-centric World Wide Web is evolving into a Semantic Web that communicates information using ontologies. The Web Ontology Language (OWL) was used to encode a Process Interaction Modeling Ontology for Discrete Event Simulations (PIMODES). The PIMODES ontology was developed using ontology engineering processes. Software was developed to demonstrate the feasibility of interchanging models from commercial simulation packages using PIMODES as an intermediate representation. The purpose of PIMODES is to provide a vendor-neutral open representation to support model interchange. Model interchange enables reuse and provides an opportunity to improve simulation quality, reduce development costs, and reduce development times.
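
    As an illustration of the interchange idea, the following sketch encodes a toy process-interaction model as OWL triples and round-trips it through Turtle using rdflib; the namespace, class, and property names are hypothetical placeholders, not the actual PIMODES vocabulary.

    # Ontology-as-interchange-format sketch using rdflib as a generic OWL toolkit.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS
    from rdflib.namespace import OWL

    PIM = Namespace("http://example.org/pimodes#")   # placeholder namespace

    g = Graph()
    g.bind("pim", PIM)

    # Hypothetical ontology classes for a process-interaction world view
    g.add((PIM.Model, RDF.type, OWL.Class))
    g.add((PIM.Process, RDF.type, OWL.Class))
    g.add((PIM.Resource, RDF.type, OWL.Class))
    g.add((PIM.hasProcess, RDF.type, OWL.ObjectProperty))

    # A toy single-server queue expressed as individuals of those classes
    g.add((PIM.MM1Queue, RDF.type, PIM.Model))
    g.add((PIM.ArrivalProcess, RDF.type, PIM.Process))
    g.add((PIM.Server, RDF.type, PIM.Resource))
    g.add((PIM.MM1Queue, PIM.hasProcess, PIM.ArrivalProcess))
    g.add((PIM.ArrivalProcess, RDFS.label, Literal("exponential inter-arrival times")))

    # Serialise to Turtle (the interchange artefact) and parse it back elsewhere
    ttl = g.serialize(format="turtle")
    imported = Graph().parse(data=ttl, format="turtle")
    print(len(imported), "triples round-tripped")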

    Ontology-based patterns for the integration of business processes and enterprise application architectures

    Increasingly, enterprises are using Service-Oriented Architecture (SOA) as an approach to Enterprise Application Integration (EAI). SOA has the potential to bridge the gap between business and technology and to improve the reuse of existing applications and the interoperability with new ones. In addition to service architecture descriptions, architecture abstractions like patterns and styles capture design knowledge and allow the reuse of successfully applied designs, thus improving the quality of software. Knowledge gained from integration projects can be captured to build a repository of semantically enriched, experience-based solutions. Business patterns identify the interaction and structure between users, business processes, and data. Specific integration and composition patterns at a more technical level address enterprise application integration and capture reliable architecture solutions. We use an ontology-based approach to capture architecture and process patterns. Ontology techniques for pattern definition, extension and composition are developed and their applicability in business process-driven application integration is demonstrated.
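
    A minimal sketch of what such a semantically enriched pattern repository could look like, assuming integration patterns are recorded as ontology individuals with a hypothetical "addresses" relation and retrieved with SPARQL; all names are illustrative, not the paper's pattern catalogue.

    # Pattern-repository sketch: patterns as ontology individuals, queried by concern.
    from rdflib import Graph, Namespace, RDF, Literal

    PAT = Namespace("http://example.org/patterns#")   # placeholder namespace
    g = Graph()
    g.bind("pat", PAT)

    # Record two experience-based integration patterns and the concerns they address
    for name, concern in [("MessageBroker", "application decoupling"),
                          ("ProcessOrchestrator", "business process coordination")]:
        g.add((PAT[name], RDF.type, PAT.IntegrationPattern))
        g.add((PAT[name], PAT.addresses, Literal(concern)))

    # Retrieve patterns relevant to a given integration concern
    q = """
    PREFIX pat: <http://example.org/patterns#>
    SELECT ?p WHERE {
      ?p a pat:IntegrationPattern ;
         pat:addresses ?c .
      FILTER(CONTAINS(?c, "process"))
    }
    """
    for row in g.query(q):
        print(row.p)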

    Semantic model-driven development of service-centric software architectures

    Service-oriented architecture (SOA) is a recent architectural paradigm that has received much attention. The prevalent focus on platforms such as Web services, however, needs to be complemented by appropriate software engineering methods. We propose the model-driven development of service-centric software systems. We present in particular an investigation into the role of enriched semantic modelling for a model-driven development framework for service-centric software systems. Ontologies as the foundations of semantic modelling and its enhancement through architectural pattern modelling are at the core of the proposed approach. We introduce foundations and discuss the benefits and also the challenges in this context.

    BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

    The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling, which enables the pre-training of language models in roughly half the number of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget. Our models are available at https://huggingface.co/bertin-project. Comment: Published at Procesamiento del Lenguaje Natural.
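
    A small sketch of what perplexity sampling can look like in practice, assuming per-document perplexities from some reference language model are already available; the Gaussian weighting around the median is one plausible choice and not necessarily the exact scheme used in the paper, and the perplexities below are synthetic.

    # Perplexity-sampling sketch: favour mid-perplexity documents when subsampling.
    import numpy as np

    def perplexity_sample(perplexities, n_keep, rng=None):
        """Draw n_keep document indices, down-weighting very low-perplexity
        (often boilerplate) and very high-perplexity (noisy) documents."""
        rng = rng or np.random.default_rng(0)
        ppl = np.asarray(perplexities, dtype=float)
        mid, spread = np.median(ppl), np.std(ppl)
        weights = np.exp(-0.5 * ((ppl - mid) / spread) ** 2)   # Gaussian around the median
        weights /= weights.sum()
        return rng.choice(len(ppl), size=n_keep, replace=False, p=weights)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        fake_ppl = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)  # stand-in perplexities
        keep = perplexity_sample(fake_ppl, n_keep=2_000, rng=rng)
        print(len(keep), "documents kept; mean perplexity",
              round(float(fake_ppl[keep].mean()), 1))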

    A Survey of Location Prediction on Twitter

    Locations, e.g., countries, states, cities, and point-of-interests, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the world-wide coverage of its users and real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts are spent on dealing with new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim at offering an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing the Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we also briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure.
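
    To make the task setup concrete, the following sketch trains a toy content-based classifier that maps a user's concatenated tweets to a home-city label; it only illustrates how tweet content serves as one input to the home-location task and is not any specific method surveyed in the paper. The toy tweets are made up.

    # Tweet-content baseline sketch for the user home-location task.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each training example: all tweets of one user joined together, plus a city label
    users = [
        "great bagels this morning near central park subway delays again",
        "bears game tonight deep dish with the crew lakefront run",
        "traffic on the 405 beach day tacos in venice",
        "rooftop views midtown commute on the L train pizza slice",
    ]
    cities = ["new_york", "chicago", "los_angeles", "new_york"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(users, cities)

    # Predict the home city of an unseen user from their tweet content alone
    print(model.predict(["stuck on the subway again, bagels after work"]))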

    WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking

    We present WISER, a new semantic search engine for expert finding in academia. Our system is unsupervised and it jointly combines classical language modeling techniques, based on textual evidence, with the Wikipedia Knowledge Graph, via entity linking. WISER indexes each academic author through a novel profiling technique which models her expertise with a small, labeled and weighted graph drawn from Wikipedia. Nodes in this graph are the Wikipedia entities mentioned in the author's publications, whereas the weighted edges express the semantic relatedness among these entities computed via textual and graph-based relatedness functions. Every node is also labeled with a relevance score which models the pertinence of the corresponding entity to the author's expertise, and is computed by means of a proper random-walk calculation over that graph; and with a latent vector representation which is learned via entity and other kinds of structural embeddings derived from Wikipedia. At query time, experts are retrieved by combining classic document-centric approaches, which exploit the occurrences of query terms in the author's documents, with a novel set of profile-centric scoring strategies, which compute the semantic relatedness between the author's expertise and the query topic via the above graph-based profiles. The effectiveness of our system is established over a large-scale experimental test on a standard dataset for this task. We show that WISER achieves better performance than all the other competitors, thus proving the effectiveness of modelling the author's profile via our "semantic" graph of entities. Finally, we comment on the use of WISER for indexing and profiling the whole research community within the University of Pisa, and its application to technology transfer in our University.
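
    A compact sketch of the graph-based profiling idea, assuming an author's entity graph has already been built: relevance scores come from a weighted PageRank as a stand-in for the paper's random-walk calculation, and the entities, weights, and relatedness values are made up rather than computed with the paper's relatedness functions or embeddings.

    # Author-profile sketch: Wikipedia entities as nodes, relatedness as edge weights.
    import networkx as nx

    # Toy entity graph for one author, weighted by semantic relatedness
    G = nx.Graph()
    G.add_edge("Information_retrieval", "Search_engine", weight=0.9)
    G.add_edge("Information_retrieval", "Entity_linking", weight=0.7)
    G.add_edge("Entity_linking", "Wikipedia", weight=0.8)
    G.add_edge("Search_engine", "Wikipedia", weight=0.4)
    G.add_edge("Graph_theory", "Entity_linking", weight=0.3)

    # Random-walk relevance of each entity to the author's expertise
    scores = nx.pagerank(G, weight="weight")
    for entity, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{entity:25s} {score:.3f}")

    # At query time, a profile-centric score could compare these entities with
    # the entities linked in the query, e.g. by summing relatedness over matches.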

    Modeling Documents as Mixtures of Persons for Expert Finding

    In this paper we address the problem of searching for knowledgeable persons within the enterprise, known as the expert finding (or expert search) task. We present a probabilistic algorithm using the assumption that terms in documents are produced by people who are mentioned in them. We represent documents retrieved for a query as mixtures of candidate experts' language models. Two methods for extracting personal language models are proposed, as well as a way of combining them with other evidence of expertise. Experiments conducted with the TREC Enterprise collection demonstrate the superiority of our approach over the best existing solutions.
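
    The following sketch illustrates the general idea of ranking candidates by the likelihood of a query under a personal language model mixed from the documents that mention them; the documents, mention counts, and additive smoothing below are illustrative and do not reproduce the paper's exact estimation procedure.

    # Expert-finding sketch: candidate language models mixed from associated documents.
    from collections import Counter

    docs = {
        "d1": "neural ranking models for enterprise search",
        "d2": "expert search with probabilistic language models",
        "d3": "ontology engineering for business process integration",
    }
    # association strength between candidates and documents (e.g. mention counts)
    assoc = {"alice": {"d1": 2, "d2": 1}, "bob": {"d3": 3}}

    def candidate_lm(cand, mu=0.5):
        """Smoothed term distribution for one candidate, mixed from associated docs."""
        counts, total = Counter(), 0
        for doc_id, weight in assoc[cand].items():
            for term in docs[doc_id].split():
                counts[term] += weight
                total += weight
        vocab = {t for text in docs.values() for t in text.split()}
        return {t: (counts[t] + mu) / (total + mu * len(vocab)) for t in vocab}

    def score(cand, query):
        """Query likelihood under the candidate's personal language model."""
        lm = candidate_lm(cand)
        p = 1.0
        for term in query.split():
            p *= lm.get(term, min(lm.values()))
        return p

    query = "probabilistic expert search"
    print(sorted(assoc, key=lambda c: -score(c, query)))   # best-matching experts first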