
    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP for Indic languages. Comment: Accepted to the Transactions of the Association for Computational Linguistics (TACL).
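
    A rough sketch of the mining step described above, in Python: it assumes sentences on both sides have already been embedded into a shared multilingual space (the embeddings below are random placeholders), uses exact rather than approximate nearest-neighbour search for brevity, and the function name and threshold are illustrative rather than taken from the Samanantar pipeline.

    # Margin-based parallel sentence mining sketch; embeddings are placeholders
    # for the output of a multilingual representation model.
    import numpy as np

    def mine_pairs(en_vecs, indic_vecs, k=4, threshold=1.05):
        """Keep English/Indic pairs whose similarity clearly exceeds the
        average similarity of their k nearest neighbours (margin criterion)."""
        # cosine similarity via normalised dot products (a real pipeline would
        # use approximate nearest neighbour search over millions of sentences)
        en = en_vecs / np.linalg.norm(en_vecs, axis=1, keepdims=True)
        xx = indic_vecs / np.linalg.norm(indic_vecs, axis=1, keepdims=True)
        sim = en @ xx.T                                  # (n_en, n_indic)

        # mean similarity to the k nearest neighbours, in both directions
        fwd_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per English sentence
        bwd_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per Indic sentence

        pairs = []
        for i in range(sim.shape[0]):
            j = int(sim[i].argmax())                     # best Indic candidate
            margin = sim[i, j] / (0.5 * (fwd_knn[i] + bwd_knn[j]))
            if margin >= threshold:
                pairs.append((i, j, float(margin)))
        return pairs

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        en_vecs = rng.normal(size=(100, 64))             # stand-in English embeddings
        indic_vecs = rng.normal(size=(120, 64))          # stand-in Indic embeddings
        print(mine_pairs(en_vecs, indic_vecs)[:5])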

    Interchanging Discrete Event Simulation Process Interaction Models Using the Web Ontology Language - OWL

    Discrete event simulation development requires significant investments in time and resources. Descriptions of discrete event simulation models are associated with world views, including the process interaction orientation. Historically, these models have been encoded using high-level programming languages or special purpose, typically vendor-specific, simulation languages. These approaches complicate simulation model reuse and interchange. The current document-centric World Wide Web is evolving into a Semantic Web that communicates information using ontologies. The Web Ontology Language (OWL) was used to encode a Process Interaction Modeling Ontology for Discrete Event Simulations (PIMODES). The PIMODES ontology was developed using ontology engineering processes. Software was developed to demonstrate the feasibility of interchanging models from commercial simulation packages using PIMODES as an intermediate representation. The purpose of PIMODES is to provide a vendor-neutral open representation to support model interchange. Model interchange enables reuse and provides an opportunity to improve simulation quality, reduce development costs, and reduce development times.
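
    As an illustration of the interchange idea, the following sketch encodes a toy process-interaction model as OWL triples and round-trips it through Turtle using rdflib; the namespace, class, and property names are hypothetical placeholders, not the actual PIMODES vocabulary.

    # Ontology-as-interchange-format sketch using rdflib as a generic OWL toolkit.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS
    from rdflib.namespace import OWL

    PIM = Namespace("http://example.org/pimodes#")   # placeholder namespace

    g = Graph()
    g.bind("pim", PIM)

    # Hypothetical ontology classes for a process-interaction world view
    g.add((PIM.Model, RDF.type, OWL.Class))
    g.add((PIM.Process, RDF.type, OWL.Class))
    g.add((PIM.Resource, RDF.type, OWL.Class))
    g.add((PIM.hasProcess, RDF.type, OWL.ObjectProperty))

    # A toy single-server queue expressed as individuals of those classes
    g.add((PIM.MM1Queue, RDF.type, PIM.Model))
    g.add((PIM.ArrivalProcess, RDF.type, PIM.Process))
    g.add((PIM.Server, RDF.type, PIM.Resource))
    g.add((PIM.MM1Queue, PIM.hasProcess, PIM.ArrivalProcess))
    g.add((PIM.ArrivalProcess, RDFS.label, Literal("exponential inter-arrival times")))

    # Serialise to Turtle (the interchange artefact) and parse it back elsewhere
    ttl = g.serialize(format="turtle")
    imported = Graph().parse(data=ttl, format="turtle")
    print(len(imported), "triples round-tripped")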

    Ontology-based patterns for the integration of business processes and enterprise application architectures

    Increasingly, enterprises are using Service-Oriented Architecture (SOA) as an approach to Enterprise Application Integration (EAI). SOA has the potential to bridge the gap between business and technology and to improve the reuse of existing applications and the interoperability with new ones. In addition to service architecture descriptions, architecture abstractions like patterns and styles capture design knowledge and allow the reuse of successfully applied designs, thus improving the quality of software. Knowledge gained from integration projects can be captured to build a repository of semantically enriched, experience-based solutions. Business patterns identify the interaction and structure between users, business processes, and data. Specific integration and composition patterns at a more technical level address enterprise application integration and capture reliable architecture solutions. We use an ontology-based approach to capture architecture and process patterns. Ontology techniques for pattern definition, extension and composition are developed and their applicability in business process-driven application integration is demonstrated.
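
    A minimal sketch of what such a semantically enriched pattern repository could look like, assuming integration patterns are recorded as ontology individuals with a hypothetical "addresses" relation and retrieved with SPARQL; all names are illustrative, not the paper's pattern catalogue.

    # Pattern-repository sketch: patterns as ontology individuals, queried by concern.
    from rdflib import Graph, Namespace, RDF, Literal

    PAT = Namespace("http://example.org/patterns#")   # placeholder namespace
    g = Graph()
    g.bind("pat", PAT)

    # Record two experience-based integration patterns and the concerns they address
    for name, concern in [("MessageBroker", "application decoupling"),
                          ("ProcessOrchestrator", "business process coordination")]:
        g.add((PAT[name], RDF.type, PAT.IntegrationPattern))
        g.add((PAT[name], PAT.addresses, Literal(concern)))

    # Retrieve patterns relevant to a given integration concern
    q = """
    PREFIX pat: <http://example.org/patterns#>
    SELECT ?p WHERE {
      ?p a pat:IntegrationPattern ;
         pat:addresses ?c .
      FILTER(CONTAINS(?c, "process"))
    }
    """
    for row in g.query(q):
        print(row.p)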

    Semantic model-driven development of service-centric software architectures

    Service-oriented architecture (SOA) is a recent architectural paradigm that has received much attention. The prevalent focus on platforms such as Web services, however, needs to be complemented by appropriate software engineering methods. We propose the model-driven development of service-centric software systems. We present in particular an investigation into the role of enriched semantic modelling for a model-driven development framework for service-centric software systems. Ontologies as the foundations of semantic modelling and its enhancement through architectural pattern modelling are at the core of the proposed approach. We introduce foundations and discuss the benefits and also the challenges in this context.

    BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

    The pre-training of large language models usually requires massive amounts of resources, both in terms of computation and data. Frequently used web sources such as Common Crawl might contain enough noise to make this pre-training sub-optimal. In this work, we experiment with different sampling methods from the Spanish version of mC4, and present a novel data-centric technique which we name perplexity sampling, which enables the pre-training of language models in roughly half the number of steps and using one fifth of the data. The resulting models are comparable to the current state-of-the-art, and even achieve better results for certain tasks. Our work is proof of the versatility of Transformers, and paves the way for small teams to train their models on a limited budget. Our models are available at https://huggingface.co/bertin-project. Comment: Published at Procesamiento del Lenguaje Natural.
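
    A small sketch of what perplexity sampling can look like in practice, assuming per-document perplexities from some reference language model are already available; the Gaussian weighting around the median is one plausible choice and not necessarily the exact scheme used in the paper, and the perplexities below are synthetic.

    # Perplexity-sampling sketch: favour mid-perplexity documents when subsampling.
    import numpy as np

    def perplexity_sample(perplexities, n_keep, rng=None):
        """Draw n_keep document indices, down-weighting very low-perplexity
        (often boilerplate) and very high-perplexity (noisy) documents."""
        rng = rng or np.random.default_rng(0)
        ppl = np.asarray(perplexities, dtype=float)
        mid, spread = np.median(ppl), np.std(ppl)
        weights = np.exp(-0.5 * ((ppl - mid) / spread) ** 2)   # Gaussian around the median
        weights /= weights.sum()
        return rng.choice(len(ppl), size=n_keep, replace=False, p=weights)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        fake_ppl = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)  # stand-in perplexities
        keep = perplexity_sample(fake_ppl, n_keep=2_000, rng=rng)
        print(len(keep), "documents kept; mean perplexity",
              round(float(fake_ppl[keep].mean()), 1))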

    A Survey of Location Prediction on Twitter

    Locations, e.g., countries, states, cities, and point-of-interests, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on a daily basis. Due to the world-wide coverage of its users and real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts are spent on dealing with new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim at offering an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing the Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we also briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure.
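
    To make the task setup concrete, the following sketch trains a toy content-based classifier that maps a user's concatenated tweets to a home-city label; it only illustrates how tweet content serves as one input to the home-location task and is not any specific method surveyed in the paper. The toy tweets are made up.

    # Tweet-content baseline sketch for the user home-location task.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each training example: all tweets of one user joined together, plus a city label
    users = [
        "great bagels this morning near central park subway delays again",
        "bears game tonight deep dish with the crew lakefront run",
        "traffic on the 405 beach day tacos in venice",
        "rooftop views midtown commute on the L train pizza slice",
    ]
    cities = ["new_york", "chicago", "los_angeles", "new_york"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(users, cities)

    # Predict the home city of an unseen user from their tweet content alone
    print(model.predict(["stuck on the subway again, bagels after work"]))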

    WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking

    We present WISER, a new semantic search engine for expert finding in academia. Our system is unsupervised and it jointly combines classical language modeling techniques, based on textual evidence, with the Wikipedia Knowledge Graph, via entity linking. WISER indexes each academic author through a novel profiling technique which models her expertise with a small, labeled and weighted graph drawn from Wikipedia. Nodes in this graph are the Wikipedia entities mentioned in the author's publications, whereas the weighted edges express the semantic relatedness among these entities computed via textual and graph-based relatedness functions. Every node is also labeled with a relevance score which models the pertinence of the corresponding entity to the author's expertise, and is computed by means of a proper random-walk calculation over that graph; and with a latent vector representation which is learned via entity and other kinds of structural embeddings derived from Wikipedia. At query time, experts are retrieved by combining classic document-centric approaches, which exploit the occurrences of query terms in the author's documents, with a novel set of profile-centric scoring strategies, which compute the semantic relatedness between the author's expertise and the query topic via the above graph-based profiles. The effectiveness of our system is established over a large-scale experimental test on a standard dataset for this task. We show that WISER achieves better performance than all the other competitors, thus proving the effectiveness of modelling the author's profile via our "semantic" graph of entities. Finally, we comment on the use of WISER for indexing and profiling the whole research community within the University of Pisa, and its application to technology transfer in our University.
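
    A compact sketch of the graph-based profiling idea, assuming an author's entity graph has already been built: relevance scores come from a weighted PageRank as a stand-in for the paper's random-walk calculation, and the entities, weights, and relatedness values are made up rather than computed with the paper's relatedness functions or embeddings.

    # Author-profile sketch: Wikipedia entities as nodes, relatedness as edge weights.
    import networkx as nx

    # Toy entity graph for one author, weighted by semantic relatedness
    G = nx.Graph()
    G.add_edge("Information_retrieval", "Search_engine", weight=0.9)
    G.add_edge("Information_retrieval", "Entity_linking", weight=0.7)
    G.add_edge("Entity_linking", "Wikipedia", weight=0.8)
    G.add_edge("Search_engine", "Wikipedia", weight=0.4)
    G.add_edge("Graph_theory", "Entity_linking", weight=0.3)

    # Random-walk relevance of each entity to the author's expertise
    scores = nx.pagerank(G, weight="weight")
    for entity, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{entity:25s} {score:.3f}")

    # At query time, a profile-centric score could compare these entities with
    # the entities linked in the query, e.g. by summing relatedness over matches.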

    Modeling Documents as Mixtures of Persons for Expert Finding

    In this paper we address the problem of searching for knowledgeable persons within the enterprise, known as the expert finding (or expert search) task. We present a probabilistic algorithm using the assumption that terms in documents are produced by people who are mentioned in them. We represent documents retrieved for a query as mixtures of candidate experts' language models. Two methods for extracting personal language models are proposed, as well as a way of combining them with other evidence of expertise. Experiments conducted with the TREC Enterprise collection demonstrate the superiority of our approach over the best existing solutions.
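
    The following sketch illustrates the general idea of ranking candidates by the likelihood of a query under a personal language model mixed from the documents that mention them; the documents, mention counts, and additive smoothing below are illustrative and do not reproduce the paper's exact estimation procedure.

    # Expert-finding sketch: candidate language models mixed from associated documents.
    from collections import Counter

    docs = {
        "d1": "neural ranking models for enterprise search",
        "d2": "expert search with probabilistic language models",
        "d3": "ontology engineering for business process integration",
    }
    # association strength between candidates and documents (e.g. mention counts)
    assoc = {"alice": {"d1": 2, "d2": 1}, "bob": {"d3": 3}}

    def candidate_lm(cand, mu=0.5):
        """Smoothed term distribution for one candidate, mixed from associated docs."""
        counts, total = Counter(), 0
        for doc_id, weight in assoc[cand].items():
            for term in docs[doc_id].split():
                counts[term] += weight
                total += weight
        vocab = {t for text in docs.values() for t in text.split()}
        return {t: (counts[t] + mu) / (total + mu * len(vocab)) for t in vocab}

    def score(cand, query):
        """Query likelihood under the candidate's personal language model."""
        lm = candidate_lm(cand)
        p = 1.0
        for term in query.split():
            p *= lm.get(term, min(lm.values()))
        return p

    query = "probabilistic expert search"
    print(sorted(assoc, key=lambda c: -score(c, query)))   # best-matching experts first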