RIGOTRIO at SemEval-2017 Task 9: Combining Machine Learning and Grammar Engineering for AMR Parsing and Generation
By addressing both text-to-AMR parsing and AMR-to-text generation, SemEval-2017 Task 9 established AMR as a powerful semantic interlingua. We strengthen the interlingual aspect of AMR by applying the multilingual Grammatical Framework (GF) to AMR-to-text generation. Our current rule-based GF approach fully covered only 12.3% of the test AMRs; we therefore combined it with the state-of-the-art JAMR Generator to see whether the combination increases or decreases overall performance. The combined system achieved an automatic BLEU score of 18.82 and a human TrueSkill score of 107.2, to be compared with the plain JAMR Generator results. As for AMR parsing, we added NER extensions to our SemEval-2016 general-domain AMR parser to handle the biomedical genre, which is rich in organic compound names, achieving Smatch F1=54.0%.
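As a rough illustration of how a Smatch-style F1 figure like the one above is formed, the sketch below computes precision, recall and F1 over AMR triples. It is a simplification under stated assumptions: real Smatch also searches for the best variable mapping between the two graphs, which is skipped here by assuming both graphs already use the same variable names. The function name `triple_f1` and the toy triples are hypothetical and do not come from the paper.

```python
from typing import Iterable, Tuple

# An AMR graph flattened to (relation, source, target) triples.
Triple = Tuple[str, str, str]


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """F1 over exactly matching triples.

    Assumption: variables in both graphs are already aligned, so no
    variable-mapping search (the hard part of Smatch) is performed.
    """
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    if matched == 0:
        return 0.0
    precision = matched / len(pred_set)
    recall = matched / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example: the parser gets one concept wrong out of three triples.
gold = [("instance", "b", "bind-01"), ("instance", "p", "protein"), ("ARG1", "b", "p")]
pred = [("instance", "b", "bind-01"), ("instance", "p", "enzyme"), ("ARG1", "b", "p")]
score = triple_f1(pred, gold)
```

Here two of the three triples match on each side, so precision and recall are both 2/3 and so is the F1.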
Character-level neural translation for multilingual media monitoring in the SUMMA project
The paper steps outside the comfort zone of traditional NLP tasks such as
automatic speech recognition (ASR) and machine translation (MT) to address
two novel problems arising in automated multilingual news monitoring:
segmentation of the TV and radio program ASR transcripts into individual
stories, and clustering of the individual stories coming from various sources
and languages into storylines. Storyline clustering of stories covering the
same events is an essential task for inquisitorial media monitoring. We address
these two problems jointly by engaging the low-dimensional semantic
representation capabilities of the sequence to sequence neural translation
models. To enable joint multi-task learning for multilingual neural translation
of morphologically rich languages we replace the attention mechanism with the
sliding-window mechanism and operate the sequence to sequence neural
translation model on the character-level rather than on the word-level. The
story segmentation and storyline clustering problem is tackled by examining the
low-dimensional vectors produced as a side-product of the neural translation
process. The results of this paper describe a novel approach to the automatic
story segmentation and storyline clustering problem.
Comment: LREC-2016 submission
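The abstract does not say how the low-dimensional vectors are compared, so the following is only a minimal sketch of one plausible reading: stories whose vectors have high cosine similarity are grouped into the same storyline. The greedy single-pass strategy, the `threshold` value, the function names and the toy vectors are all assumptions made for illustration, not the authors' method.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_stories(vectors: Sequence[Sequence[float]], threshold: float = 0.8) -> List[List[int]]:
    """Greedy single-pass clustering of story vectors.

    Each story joins the first cluster whose seed (first member) is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns clusters as lists of story indices.
    """
    clusters: List[List[int]] = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine(vector, vectors[cluster[0]]) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters


# Hypothetical 3-dimensional story vectors: two stories per event.
story_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
storylines = cluster_stories(story_vectors)
```

With these toy vectors the first two stories fall into one storyline and the last two into another. A real system would more likely use a proper clustering algorithm over much higher-dimensional vectors.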
dBaby: Grounded Language Teaching through Games and Efficient Reinforcement Learning
This paper outlines a project proposal to be submitted to EC H2020 call ICT-29-2018. The purpose of the project is to create a digital Baby (dBaby) - an agent perceiving and interacting with the 3D world and communicating with its Teacher via natural language phrases to achieve the goals set by the Teacher. The novelty of the approach is that neither language nor visual capabilities are hard-coded in dBaby - instead, the Teacher defines a language learning Game grounded in the 3D world, and dBaby learns the language as a byproduct of reinforcement learning from raw pixels and character strings while maximizing the rewards in the Game. So far, such an approach has been successfully demonstrated only in a virtual 3D world with pre-programmed Games, where it requires millions of episodes to learn a dozen words. Moving to a human Teacher and a real 3D environment requires an order-of-magnitude improvement in the data efficiency of the reinforcement learning. A novel Episodic Control based pre-training is demonstrated as a promising approach for bootstrapping data-efficient reinforcement learning.
SUMMA: Integrating Multiple NLP Technologies into an Open-source Platform for Multilingual Media Monitoring
The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few minutes.
It assembles numerous state-of-the-art NLP technologies into a fully automated media ingestion pipeline that can record live broadcasts, detect and transcribe spoken content, translate from several languages (original text or transcribed speech) into English, recognize Named Entities, detect topics, cluster and summarize documents across language barriers, and extract and store factual claims in these news items.
This paper describes the intended use cases and discusses the system design decisions that allowed us to integrate state-of-the-art NLP modules into an effective workflow with comparatively little effort.
The SUMMA Platform: A Scalable Infrastructure for Multi-lingual Multi-media Monitoring
The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few minutes.
The Platform offers a fully automated media ingestion pipeline capable of recording live broadcasts, detection and transcription of spoken content, translation of all text (original or transcribed) into English, recognition and linking of Named Entities, topic detection, clustering and crosslingual multi-document summarization of related media items, and last but not least, extraction and storage of factual claims in these news items. Browser-based graphical user interfaces provide humans with aggregated information as well as structured access to individual news items stored in the Platform’s database.
This paper describes the intended use cases and provides an overview of the system’s implementation.