17 research outputs found
FrameNet CNL: a Knowledge Representation and Information Extraction Language
The paper presents a FrameNet-based information extraction and knowledge
representation framework, called FrameNet-CNL. The framework is used on natural
language documents and represents the extracted knowledge in a tailor-made
Frame-ontology from which unambiguous FrameNet-CNL paraphrase text can be
generated automatically in multiple languages. This approach brings together
the fields of information extraction and CNL, because a source text can be
considered as belonging to FrameNet-CNL if the information extraction parser
produces the correct knowledge representation as a result. We describe a
state-of-the-art information extraction parser used by a national news agency
and speculate that FrameNet-CNL eventually could shape the natural language
subset used for writing newswire articles.
Comment: CNL-2014 camera-ready version. The final publication is available at
link.springer.com
dBaby: Grounded Language Teaching through Games and Efficient Reinforcement Learning
This paper outlines a project proposal to be submitted to the EC H2020 call ICT-29-2018. The purpose of the project is to create a digital Baby (dBaby) - an agent perceiving and interacting with the 3D world and communicating with its Teacher via natural language phrases to achieve the goals set by the Teacher. The novelty of the approach is that neither language nor visual capabilities are hard-coded in dBaby - instead, the Teacher defines a language learning Game grounded in the 3D world, and dBaby learns the language as a byproduct of reinforcement learning from raw pixels and character strings while maximizing the rewards in the Game. So far, such an approach has been demonstrated successfully only in a virtual 3D world with pre-programmed Games, where it requires millions of episodes to learn a dozen words. Moving to a human Teacher and a real 3D environment requires an order-of-magnitude improvement in the data efficiency of the reinforcement learning. A novel Episodic Control based pre-training is demonstrated as a promising approach for bootstrapping data-efficient reinforcement learning.
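The abstract names Episodic Control pre-training as the key to data efficiency. As a hedged, minimal sketch of that idea (a toy tabular version, not dBaby's actual model), episodic control stores the best return ever observed for each state-action pair and acts greedily on it, so a single good episode immediately shapes behaviour:

```python
import random
from collections import defaultdict

class EpisodicControl:
    """Toy tabular episodic control: remember the best return ever
    observed for each (state, action) and act greedily on that memory.
    Illustrative only; dBaby operates on raw pixels and strings."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.q = defaultdict(float)   # best return seen so far
        self.actions = actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # epsilon-greedy over the episodic memory
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, episode_return):
        # keep only the best return ever achieved from this pair
        key = (state, action)
        self.q[key] = max(self.q[key], episode_return)
```

A single rewarding episode is enough to bias the policy, which is the data-efficiency argument in miniature.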
Discrete Denoising Diffusion Approach to Integer Factorization
Integer factorization is a famous computational problem for which it is not
known whether it can be solved in polynomial time. With the rise of deep neural
networks, it is interesting to ask whether they can facilitate faster
factorization. We present an
approach to factorization utilizing deep neural networks and discrete denoising
diffusion that works by iteratively correcting errors in a partially-correct
solution. To this end, we develop a new seq2seq neural network architecture,
employ a relaxed categorical distribution, and adapt the reverse diffusion
process to cope better with inaccuracies in the denoising step. The approach is
able to find factors for integers up to 56 bits long. Our analysis indicates that
investment in training leads to an exponential decrease of sampling steps
required at inference to achieve a given success rate, thus counteracting an
exponential run-time increase depending on the bit-length.
Comment: International Conference on Artificial Neural Networks ICANN 202
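The abstract's core idea, iteratively correcting errors in a partially-correct solution, can be caricatured without any neural network: greedily flip whichever single bit of a candidate factor pair most reduces the reconstruction error. The sketch below is purely illustrative (the function names and the exhaustive greedy search stand in for the paper's learned denoiser):

```python
import random

def factor_error(n, p, q):
    """Absolute reconstruction error of a candidate factor pair."""
    return abs(n - p * q)

def iterative_correction(n, bits=8, steps=200, seed=0):
    """Toy 'denoising' loop: start from a random candidate pair and
    repeatedly flip the single bit that most reduces |n - p*q|.
    A learned model would propose corrections instead of this search."""
    rng = random.Random(seed)
    p = rng.randrange(1, 1 << bits)
    q = rng.randrange(1, 1 << bits)
    for _ in range(steps):
        best = (factor_error(n, p, q), p, q)
        for i in range(bits):
            for cand in ((p ^ (1 << i), q), (p, q ^ (1 << i))):
                if cand[0] > 0 and cand[1] > 0:
                    err = factor_error(n, *cand)
                    if err < best[0]:
                        best = (err, *cand)
        _, p, q = best
        if factor_error(n, p, q) == 0:
            break
    return p, q

p, q = iterative_correction(143, bits=4)  # 143 = 11 * 13
```

Unlike the paper's method, this greedy search can stall in local minima; the diffusion formulation exists precisely to escape such partially-correct states.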
RIGOTRIO at SemEval-2017 Task 9: Combining Machine Learning and Grammar Engineering for AMR Parsing and Generation
By addressing both text-to-AMR parsing and AMR-to-text generation, SemEval-2017 Task 9 established AMR as a powerful semantic interlingua. We strengthen the interlingual aspect of AMR by applying the multilingual Grammatical Framework (GF) to AMR-to-text generation. Our current rule-based GF approach completely covered only 12.3% of the test AMRs; therefore, we combined it with the state-of-the-art JAMR Generator to see whether the combination increases or decreases the overall performance. The combined system achieved an automatic BLEU score of 18.82 and a human TrueSkill score of 107.2, to be compared to the plain JAMR Generator results. As for AMR parsing, we added NER extensions to our SemEval-2016 general-domain AMR parser to handle the biomedical genre, which is rich in organic compound names, achieving Smatch F1=54.0%.
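The abstract does not spell out how the GF and JAMR outputs were combined; one plausible, purely illustrative scheme (the function names here are hypothetical) is coverage-based fallback, where the rule-based realisation is used whenever it fully covers the AMR and the statistical generator fills the gaps:

```python
def combine_generators(amr, gf_generate, jamr_generate):
    """Fallback combination sketch: prefer the rule-based GF
    realisation when it exists (i.e. fully covers the AMR),
    otherwise fall back to the statistical JAMR generator.
    Illustrative only; not necessarily the paper's strategy."""
    text = gf_generate(amr)           # returns None when GF has no coverage
    return text if text is not None else jamr_generate(amr)
```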
Character-level neural translation for multilingual media monitoring in the SUMMA project
The paper steps outside the comfort zone of traditional NLP tasks like
automatic speech recognition (ASR) and machine translation (MT) to address
two novel problems arising in automated multilingual news monitoring:
segmentation of the TV and radio program ASR transcripts into individual
stories, and clustering of the individual stories coming from various sources
and languages into storylines. Storyline clustering of stories covering the
same events is an essential task for inquisitorial media monitoring. We address
these two problems jointly by leveraging the low-dimensional semantic
representation capabilities of sequence-to-sequence neural translation
models. To enable joint multi-task learning for multilingual neural translation
of morphologically rich languages, we replace the attention mechanism with a
sliding-window mechanism and operate the sequence-to-sequence neural
translation model at the character level rather than at the word level. The
story segmentation and storyline clustering problem is tackled by examining the
low-dimensional vectors produced as a side-product of the neural translation
process. The results of this paper describe a novel approach to the automatic
story segmentation and storyline clustering problems.
Comment: LREC-2016 submission
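The sliding-window replacement for attention can be illustrated as a boolean mask under the assumption of near-monotonic character-level alignment between source and target. This is a hedged sketch of the general idea, not the SUMMA implementation:

```python
import numpy as np

def sliding_window_mask(src_len, tgt_len, window=5):
    """Boolean mask where target position t may attend only to source
    positions within a fixed window around its (roughly) aligned
    source position, replacing learned soft attention."""
    mask = np.zeros((tgt_len, src_len), dtype=bool)
    for t in range(tgt_len):
        # map target index to a source centre, assuming the
        # character-level alignment is close to monotonic
        centre = int(round(t * (src_len - 1) / max(tgt_len - 1, 1)))
        lo = max(0, centre - window // 2)
        hi = min(src_len, centre + window // 2 + 1)
        mask[t, lo:hi] = True
    return mask

m = sliding_window_mask(10, 10, window=3)
```

Compared with full attention, the mask is linear rather than quadratic in sequence length, which matters for long character-level inputs.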
The SUMMA Platform: Scalable Understanding of Multilingual Media
We present the latest version of the SUMMA platform, an open-source software platform for monitoring and interpreting multilingual media, from written news published on the internet to live media broadcasts via satellite or internet streaming. This work was conducted within the scope of the Research and Innovation Action SUMMA, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688139.
SUMMA: Integrating Multiple NLP Technologies into an Open-source Platform for Multilingual Media Monitoring
The open-source SUMMA Platform is a highly scalable distributed architecture for monitoring a large number of media broadcasts in parallel, with a lag behind actual broadcast time of at most a few minutes.
It assembles numerous state-of-the-art NLP technologies into a fully automated media ingestion pipeline that can record live broadcasts, detect and transcribe spoken content, translate from several languages (original text or transcribed speech) into English, recognize Named Entities, detect topics, cluster and summarize documents across language barriers, and extract and store factual claims in these news items.
This paper describes the intended use cases and discusses the system design decisions that allowed us to integrate state-of-the-art NLP modules into an effective workflow with comparatively little effort.
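To make the ingestion-pipeline idea concrete, here is a hedged sketch of a staged workflow. The NewsItem schema, stage names, and placeholder heuristics are all illustrative assumptions, not the SUMMA Platform's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class NewsItem:
    """Minimal stand-in for an ingested media item."""
    text: str
    language: str = "unknown"
    annotations: dict = field(default_factory=dict)

def detect_language(item):
    # placeholder heuristic; a real pipeline calls a language-ID model
    item.language = "de" if "der" in item.text.split() else "en"
    return item

def translate_to_english(item):
    # placeholder; a real pipeline calls an MT system
    if item.language != "en":
        item.annotations["translation"] = f"[en] {item.text}"
    return item

def tag_entities(item):
    # placeholder NER: uppercase-initial tokens as candidate entities
    item.annotations["entities"] = [w for w in item.text.split() if w[:1].isupper()]
    return item

# each document flows through the stages in order, accumulating annotations
PIPELINE = [detect_language, translate_to_english, tag_entities]

def ingest(item):
    for stage in PIPELINE:
        item = stage(item)
    return item
```

The design point this mirrors is that each NLP module only reads and writes annotations on a shared document, so modules can be added or swapped with little integration effort.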