104 research outputs found

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

Author: Kudo Taku
Richardson John
Publication date: 01/01/2018

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece. Comment: Accepted as a demo paper at EMNLP201

arXiv.org e-Print Archive

Crossref
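
The SentencePiece abstract above says subword models can be trained directly from raw sentences, without pre-tokenizing into words. A minimal sketch of that idea follows, using a toy byte-pair-encoding trainer in plain Python. It is not the SentencePiece implementation or its API: the function names (train_bpe, merge_pair, encode, decode) are hypothetical, the space marker character is this sketch's choice, and the sketch assumes the input text does not already contain that marker. Unlike a production tokenizer, it freely merges across spaces.

```python
from collections import Counter

# Visible stand-in for whitespace so spaces survive tokenization (sketch's choice).
SPACE = "\u2581"


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw sentences (no word splitting)."""
    # Each raw sentence becomes one character sequence; spaces are ordinary symbols.
    corpus = [list(s.replace(" ", SPACE)) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges


def encode(text, merges):
    """Segment raw text by replaying the learned merges in order."""
    seq = list(text.replace(" ", SPACE))
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


def decode(pieces):
    """Detokenize: concatenate pieces and restore spaces."""
    return "".join(pieces).replace(SPACE, " ")


if __name__ == "__main__":
    rules = train_bpe(["hello world", "hello there"], num_merges=5)
    pieces = encode("hello world there", rules)
    print(pieces)
    print(decode(pieces))
```

Because whitespace is kept as a symbol inside the pieces, decoding is a plain concatenation followed by one replacement, so no language-specific detokenization rules are needed in this sketch.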

Conversion Prediction Using Multi-task Conditional Attention Networks to Support the Creation of Effective Ad Creative

Author: Bahdanau Dzmitry
Kingma Diederik P
Kudo Taku
Lin Zhouhan
Luong Thang
Thomaidou Stamatina
Xu Kelvin
Yang Hongxia
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/05/2019

Accurately predicting conversions in advertisements is generally a challenging task, because such conversions do not occur frequently. In this paper, we propose a new framework to support creating high-performing ad creatives, including the accurate prediction of ad creative text conversions before delivering to the consumer. The proposed framework includes three key ideas: multi-task learning, conditional attention, and attention highlighting. Multi-task learning is an idea for improving the prediction accuracy of conversion, which predicts clicks and conversions simultaneously, to solve the difficulty of data imbalance. Furthermore, conditional attention focuses attention of each ad creative with the consideration of its genre and target gender, thus improving conversion prediction accuracy. Attention highlighting visualizes important words and/or phrases based on conditional attention. We evaluated the proposed framework with actual delivery history data (14,000 creatives displayed more than a certain number of times from Gunosy Inc.), and confirmed that these ideas improve the prediction performance of conversions, and visualize noteworthy words according to the creatives' attributes. Comment: 9 pages, 6 figures. Accepted at The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019) as an applied data science paper.

arXiv.org e-Print Archive

Crossref

Causal relationship between eWOM topics and profit of rural tourism at Japanese Roadside Stations "MICHINOEKI"

Author: Alemán Carreón Elisa Claire
Kudo Taku
Nonaka Hirofumi
O'Connor Brendan
Ohe Yasuo
Shimizu Shohei
Yokota Toshiyuki
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/05/2019

Affected by urbanization, centralization and the decrease of overall population, Japan has been making efforts to revitalize the rural areas across the country. One particular effort is to increase tourism to these rural areas via regional branding, using local farm products as tourist attractions across Japan. Particularly, a program subsidized by the government called Michinoeki, which stands for 'roadside station', was created 20 years ago and it strives to provide a safe and comfortable space for cultural interaction between road travelers and the local community, as well as offering refreshment and relevant information to travelers. However, despite its importance in the revitalization of the Japanese economy, studies with newer technologies and methodologies are lacking. Using sales data from establishments in the Kyushu area of Japan, we used a Support Vector classifier to classify content from Twitter into relevant topics and studied their causal relationship to the sales for each establishment using LiNGAM, a linear non-Gaussian acyclic model built for causal structure analysis, to perform an improved market analysis considering more than just correlation. Under the hypotheses stated by the LiNGAM model, we discovered a positive causal relationship between the number of tweets mentioning those establishments, especially those mentioning desserts, a need for better access and traffic options, and a potentially untapped customer base in motorcycle biker groups.

arXiv.org e-Print Archive

Crossref

Product gas analysis of laminar premixed ammonia-methane flames in stagnation flows

Author: Hayakawa Akihiro
Kobayashi Hideaki
Kovaleva Marina
Kudo Taku

Ammonia is a promising hydrogen energy vector and a carbon-free fuel; hence the use of ammonia-hydrocarbon fuel blends can be viewed as an intermediate step towards a hydrogen economy. The characterization of methane-ammonia emissions is essential for designing combustors for a broader range of fuels while fulfilling strict NOx emission requirements and global warming targets. The product gas trends of laminar premixed ammonia-methane flames at atmospheric pressure were studied for 0.1 to 0.6 ammonia heat ratios at the operable range of equivalence ratios. Gases including NO, N2O, NO2, HCN, CO and NH3 were measured using the dual dilution gas method and compared against numerical predictions. Experimental results showed the highest NO emissions at approximately 8,000 ppm for the 0.3 and 0.4 ammonia heat ratios, reducing twofold at the extreme heat ratio conditions. The optimal condition for reducing NOx emissions while maintaining low unburnt NH3 was found to occur at a 1.20 equivalence ratio for higher ammonia ratios, moving incrementally closer towards 1.35 as the methane ratio was increased. These results can aid a further reaction model analysis due to the availability of strain stabilised stagnation flame models in numerical software.

Online Research @ Cardiff

Numerical and experimental study of product gas characteristics in premixed ammonia/methane/air laminar flames stabilised in a stagnation flow

Author: Colson Sophie
Hayakawa Akihiro
Kobayashi Hideaki
Kovaleva Marina
Kudo Taku
Okafor Ekenechukwu C.
Valera Medina Agustin
Publication venue: 'Elsevier BV'
Publication date: 31/03/2022

The adoption of ammonia/hydrocarbon fuel blends can be viewed as an intermediate step towards a hydrogen economy, hence the characterization of methane/ammonia flame product gas trends is essential for designing combustors for a broader range of low-carbon fuel blends while fulfilling strict NOx requirements. This paper describes the product gas content of laminar premixed ammonia/methane flames for a range of equivalence ratios and ammonia heat ratios ranging from 10% to 60%, using a strain stabilized burner at atmospheric pressure and room temperature. The optimal condition to reduce NOx emissions while maintaining below 100 ppm of unburnt NH3 emissions was found to be at equivalence ratio of 1.20 for higher ammonia ratios, moving incrementally closer over 1.35 as the methane fuel content was increased. Meanwhile, the highest measured NO values were ∼6,950 ppm at an equivalence ratio of 0.9, peaking at heat ratios of 30% to 40% at this equivalence ratio. Detailed reaction mechanisms were evaluated against the experimental data and rate constants of NO production/consumption steps featuring both NH and HNO intermediates and thermal NOx reactions were updated for Okafor's mechanism. Changes in reaction rate constants improved the mechanism accuracy for NO emissions in lean to stoichiometric flames. Meanwhile, in the rich region, modelled NO values were less responsive to changes in reaction constants, suggesting the need for an alternative approach to improve NO predictions for rich, high methane content flames. However, N2O performance in the rich region could be improved, highlighting the significance of the HNO + CO ⇌ NH + CO2 reaction.

Online Research @ Cardiff

GREEK-BERT: The Greeks visiting Sesame Street

Author: Chalkidis Ilias
Devlin Jacob
Gage Philip
Koehn Philipp
Kudo Taku
Lafferty D.
Lample Guillaume
Lan Zhenzhong
Mikolov Tomas
Ortiz Suárez Pedro Javier
Outsios Stamatis
P.
Prokopidis Prokopis
Ruder Sebastian
Vaswani Ashish
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/09/2020

Transformer-based language models, such as BERT and its variants, have achieved state-of-the-art performance in several downstream natural language processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE). However, these models have mostly been applied to the resource-rich English language. In this paper, we present GREEK-BERT, a monolingual BERT-based language model for modern Greek. We evaluate its performance in three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining state-of-the-art performance. Interestingly, in two of the benchmarks GREEK-BERT outperforms two multilingual Transformer-based models (M-BERT, XLM-R), as well as shallower neural baselines operating on pre-trained word embeddings, by a large margin (5%-10%). Most importantly, we make both GREEK-BERT and our training code publicly available, along with code illustrating how GREEK-BERT can be fine-tuned for downstream NLP tasks. We expect these resources to boost NLP research and applications for modern Greek. Comment: 8 pages, 1 figure, 11th Hellenic Conference on Artificial Intelligence (SETN 2020)

arXiv.org e-Print Archive

Crossref

Revisiting Low Resource Status of Indian Languages in Machine Translation

Author: Arora Sanjeev
Barrault Loïc
Bañón Marta
Dabre Raj
Goyal Vikrant
Jha Girish Nath
Koehn Philipp
Kudo Taku
Kunchukuttan Anoop
Nakazawa Toshiaki
Papineni Kishore
Parida Shantipriya
Post Matt
Ramasamy Loganathan
Rudrabha Mukhopadhyay Prajwal KR
Schwenk Holger
Sennrich Rico
Siripragada Shashank
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 04/11/2020

Indian language machine translation performance is hampered by the lack of large-scale multilingual sentence-aligned corpora and robust benchmarks. Through this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module that is used to work with publicly available websites such as press releases by the government. The main contribution towards this effort is to obtain an incremental method that uses the above pipeline to iteratively improve the size of the corpus as well as improve each of the components of our system. Through our work, we also evaluate the design choices such as the choice of pivoting language and the effect of iterative incremental increase in corpus size. Our work, in addition to providing an automated framework, also results in generating a relatively larger corpus as compared to existing corpora that are available for Indian languages. This corpus helps us obtain substantially improved results on the publicly available WAT evaluation benchmark and other standard evaluation benchmarks. Comment: 10 pages, few figures, Preprint under review

arXiv.org e-Print Archive

Crossref

Study on N2O production mechanisms of lean ammonia/hydrogen/air premixed laminar flames

Author: Colson Sophie Valerie Anne
Gotama Gabriel Jeremy
Hayakawa Akihiro
Hayashi Masao
Kobayashi Hideaki
Kovaleva Marina
Kudo Taku
Mashruk Syed
Okafor Ekenechukwu Chijioke
Valera Medina Agustin

Application of ammonia as fuel is a potential candidate to achieve carbon neutrality. As the laminar burning velocity of ammonia is slow, hydrogen addition is also considered to improve combustion characteristics with no carbon emission. In this study, we experimentally investigated product gas characteristics of strain stabilized ammonia/hydrogen/air premixed laminar flames under atmospheric pressure for various equivalence ratios. In lean conditions, a large amount of N2O production was observed. To clarify N2O production mechanisms, numerical simulations were conducted using a reaction mechanism developed by Gotama et al. In the Gotama reaction mechanism, the major N2O production path was NH+NO=N2O+H and the major N2O consumption paths were N2O+H=N2+OH and N2O(+M)=N2+O(+M). It was clarified that a decrease in N2O consumption via N2O(+M)=N2+O(+M) increases N2O emission for lean and strained conditions.

Online Research @ Cardiff