582 research outputs found

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus

Author: A. B. Bodomo
A. Deumert
C. Dürscheid
C. Thurlow
D. Crystal
D. Pietrini
F. Liu
I. Hutchby
K. -l. Zhou
M. D. Back
M. Žic Fuchs
Min-Yen Kan
P. G. Ipeirotis
R. Ling
R. Rettie
S. Herring
S. Sotillo
Tao Chen
W. Liu
Y. Sun
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data have not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors' concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their senders as possible, so as to enable different types of analyses. To date, we have collected about 60,000 messages, focusing on English and Mandarin Chinese. Comment: It contains 31 pages, 6 figures, and 10 tables. It has been submitted to Language Resource and Evaluation Journal.

arXiv.org e-Print Archive

CiteSeerX

Crossref

ScholarBank@NUS

NLSC: Unrestricted Natural Language-based Service Composition through Sentence Embeddings

Author: Akoju Sushma A.
Dangi Ankit
Romero Oscar J.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 04/06/2019

Current approaches for service composition (assemblies of atomic services) require developers to use: (a) domain-specific semantics to formalize services that restrict the vocabulary for their descriptions, and (b) translation mechanisms for service retrieval to convert unstructured user requests to strongly-typed semantic representations. In our work, we argue that the effort of developing service descriptions, request translations, and matching mechanisms could be reduced using unrestricted natural language, allowing both: (1) end-users to intuitively express their needs using natural language, and (2) service developers to develop services without relying on syntactic/semantic description languages. Although there are some natural language-based service composition approaches, they restrict service retrieval to syntactic/semantic matching. With recent developments in machine learning and natural language processing, we motivate the use of sentence embeddings by leveraging richer semantic representations of sentences for service description, matching and retrieval. Experimental results show that service composition development effort may be reduced by more than 44% while keeping a high precision/recall when matching high-level user requests with low-level service method invocations. Comment: This paper will appear on SCC'19 (IEEE International Conference on Services Computing) on July 1
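The matching step described in this abstract can be illustrated with a minimal sketch: embed the user request and each service description, then pick the service whose vector is closest by cosine similarity. Everything below is an assumption made for illustration. The `embed` function is a toy bag-of-words stand-in, not the sentence-embedding model used in the paper, and the service names and descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.

    A real system would call a trained sentence-embedding model here.
    """
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_service(request, services):
    """Return the name of the service whose description best matches the request."""
    request_vec = embed(request)
    return max(services, key=lambda name: cosine(request_vec, embed(services[name])))


# Hypothetical service catalogue: name -> free-text description.
SERVICES = {
    "get_weather": "return the weather forecast for a city",
    "book_flight": "book a flight ticket between two cities",
    "send_email": "send an email message to a contact",
}

match = best_service("what is the weather forecast in Lisbon", SERVICES)
```

Swapping `embed` for a dense sentence encoder would leave `cosine` and `best_service` unchanged in structure, which is the point of describing services in unrestricted natural language.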

arXiv.org e-Print Archive

Crossref

VOCE Corpus: Ecologically Collected Speech Annotated with Physiological and Psychological Stress Assessments.

Author: Abrudan T
Aguiar A
Almeida PR
Cunha M
Kaiseler M
Meinedo H
Silva J
Publication date: 01/01/2014

Public speaking is a widely requested professional skill, and at the same time an activity that causes one of the most common adult phobias (Miller and Stone, 2009). It is also known that the study of stress under laboratory conditions, as is most commonly done, may provide only limited ecological validity (Wilhelm and Grossman, 2010). Previously, we introduced an inter-disciplinary methodology to enable collecting a large amount of recordings under consistent conditions (Aguiar et al., 2013). This paper introduces the VOCE corpus of speech annotated with stress indicators under naturalistic public speaking (PS) settings. The novelty of this corpus is that the recordings are carried out in objectively stressful PS situations, as recommended by Zanstra and Johnston (2011). The current database contains a total of 38 recordings, 13 of which contain full psychological and physiologic annotation. We show that the collected recordings validate the assumptions of the methodology, namely that participants experience stress during the PS events. We describe the various metrics that can be used for physiologic and psychological annotation, and we characterise the sample collected so far, providing evidence that demographics do not affect the relevant psychological or physiologic annotation. The collection activities are on-going, and we expect to increase the number of complete recordings in the corpus to 30 by June 2014.

Repositório Aberto da Universidade do Porto

Leeds Beckett Repository

Smartphone picture organization: a hierarchical approach

Author: Dimiccoli Mariella
Lonn Stefan
Radeva Petia
Publication venue: Elsevier BV
Publication date: 01/01/2019

We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which typically are pre-processed by the user who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they present a larger variability compared to pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose here a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach successfully estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated by using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate major user satisfaction with respect to state-of-the-art solutions in terms of organization.

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Digital.CSIC

Gaming variables in linguistic research. Italian scale validation and a Minecraft pilot study

Author: Ardolino Fabio
Piccardi Duccio
Publication venue: Milano
Publication date: 01/01/2021

This paper deals with the concept of gamified science and its recent applications to the linguistic field. We argue that, albeit promising, this paradigm still lacks analytical tools to model the effects of the peculiar experimental setting on the results obtained. After a theoretical introduction to the User Engagement and Gaming Literacy constructs, we present two validated Italian translations of scales representing them. Lastly, we test these two gaming variables in a pilot study on the postvocalic realizations of /k t/ in the Florentine variety. Results show that both variables positively condition the production of non-continuants (i.e., emphasized words) but through different underlying mechanisms.

Archivio della Ricerca - Università degli Studi di Siena

DARIAH and the Benelux

Author: Backes Marianne
Chambers Sally
Hoogerwerf Maarten
Van der West Jan
Publication venue: Department of Applied Linguistics, Translators and Interpreters, University of Antwerp
Publication date: 01/01/2015

Ghent University Academic Bibliography

Machine-assisted translation by Human-in-the-loop Crowdsourcing for Bambara

Author: Tapo Allahsera Auguste
Publication venue: RIT Scholar Works
Publication date: 01/08/2020

Language is more than a tool for conveying information; it is utilized in all aspects of our lives. Yet only a small number of the 7,000 languages worldwide are highly resourced by human language technologies (HLT). Despite African languages representing over 2,000 languages, only a few African languages are highly resourced, for which there exists a considerable amount of parallel digital data. We present a novel approach to machine translation (MT) for under-resourced languages by improving the quality of the model using a paradigm called "humans in the loop." This thesis describes the work carried out to create a Bambara-French MT system including data discovery, data preparation, model hyper-parameter tuning, the development of a crowdsourcing platform for humans in the loop, vocabulary sizing, and segmentation. We achieved a BLEU (bilingual evaluation understudy) score of 17.5. The results confirm that MT for Bambara, despite our small data set, is viable. This work has the potential to contribute to the reduction of language barriers between the people of Sub-Saharan Africa and the rest of the world.
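For readers unfamiliar with the BLEU metric mentioned in this abstract, the following is a minimal sketch of a sentence-level variant: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only, not the evaluation code used in the thesis. It assumes a single reference, whitespace tokenization, and no smoothing, and it returns a value in [0, 1], whereas published scores such as 17.5 are presumably reported on the common 0 to 100 scale at corpus level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence pair, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates that are not longer than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
perfect = sentence_bleu(reference, reference)
partial = sentence_bleu("the cat sat on the mat today", reference)
```

In practice a maintained implementation with smoothing and corpus-level aggregation would be preferable; this sketch only shows what the number measures.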

RIT Scholar Works

A Topic Modeling Guided Approach for Semantic Knowledge Discovery in e-Commerce

Author: Anoop V S
Asharaf S
Publication venue: Universidad Internacional de La Rioja
Publication date: 10/09/2021

The task of mining large unstructured text archives, extracting useful patterns and then organizing them into a knowledgebase has attracted great attention due to its vast array of immediate applications in business. Businesses thus demand new and efficient algorithms for leveraging potentially useful patterns from heterogeneous data sources that produce huge volumes of unstructured data. Due to their ability to bring out hidden themes from large text repositories, topic modeling algorithms have attracted significant attention in the recent past. This paper proposes an efficient and scalable method, guided by topic modeling, for extracting concepts and relationships from e-commerce product descriptions and organizing them into a knowledgebase. Semantic graphs can be generated from such a knowledgebase, on which a meaning-aware product discovery experience can be built for potential buyers. Extensive experiments using the proposed unsupervised algorithms with e-commerce product descriptions collected from the open web show that our proposed method outperforms some of the existing methods of leveraging concepts and relationships, so that efficient knowledgebase construction is possible.

Re-UNIR

(Re)presenting Science in Research Articles and Press Releases

Author: Laura Di Ferrante
Publication venue: Department of Foreign Languages and Literatures at the University of Verona
Publication date: 01/12/2023

Science communication is a powerful supplier of scientific knowledge for the public (see Harmatiy 2021; Kueffer and Larson 2014). While popularization discourse has been explored in depth (see for example, Calsamiglia and Van Dijk 2004; Garzone 2014, 2020; Gotti 2014; Luzón 2013; Myers 2003), the ways specific linguistic strategies impact content and the communication of science still need to be fully explored. The general purpose of this study is to explore how titles of scientific articles are transformed into headlines of press releases. Specifically, it aims first to identify recurring discursive patterns in the adaptation of titles in scientific discourse to headlines in science communication. Second, it investigates whether these patterns have an impact on the way scientific knowledge is presented. Two matching corpora were used: one of titles of research articles and one of headlines of research-based university press releases. The unique feature of these two corpora is that they have a bijective relation, so that each of the 210 titles of scientific papers matches one of the 210 university press release headlines. Results show that many linguistic strategies in science journalism are the mirror image of scientific discourse: three strategies were identified that contribute to two different representations of science, as an ongoing process in academic titles and a conclusive fact in press release headlines: 1) the validity-endorsement strategy; 2) the V-ing construction; 3) the opposition between unspecified association vs. explicit relation.
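The bijective relation between the two corpora can be stated as a simple check: every title is paired with exactly one headline, and every headline with exactly one title. The sketch below is a hypothetical illustration of that property, not part of the study. The identifiers (`title_1`, `headline_1`, and so on) are invented placeholders.

```python
def is_bijection(pairs):
    """Return True if no left item and no right item occurs in more than one pair."""
    lefts = [left for left, _ in pairs]
    rights = [right for _, right in pairs]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


# Hypothetical alignment of 210 article titles with 210 press release headlines.
alignment = [(f"title_{i}", f"headline_{i}") for i in range(1, 211)]
aligned_ok = is_bijection(alignment)

# A headline reused for two titles breaks the one-to-one property.
broken = alignment + [("title_211", "headline_1")]
broken_ok = is_bijection(broken)
```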

Directory of Open Access Journals