NOWJ1@ALQAC 2023: Enhancing Legal Task Performance with Classic Statistical Models and Pre-trained Language Models

Author: Hoang Minh-Quan
Mai Ngoc-Duy
Nguyen Ha-Thanh
Nguyen Hoang-Viet
Nguyen Tan-Minh
Nguyen Van-Huan
Nguyen Xuan-Hoa
Vuong Thi-Hai-Yen
Publication date: 16/09/2023

This paper describes the NOWJ1 Team's approach for the Automated Legal Question Answering Competition (ALQAC) 2023, which focuses on enhancing legal task performance by integrating classical statistical models and Pre-trained Language Models (PLMs). For the document retrieval task, we implement a pre-processing step to overcome input limitations and apply learning-to-rank methods to consolidate features from various models. The question-answering task is split into two sub-tasks: sentence classification and answer extraction. We incorporate state-of-the-art models to develop distinct systems for each sub-task, utilizing both classical statistical models and PLMs. Experimental results demonstrate the promising potential of our proposed methodology in the competition.

Comment: ISAILD@KSE 202

arXiv.org e-Print Archive

A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing

Author: Dras Mark
Johnson Mark
Nguyen Dat Quoc
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at: https://github.com/datquocnguyen/jPTDP

Comment: v2: also include universal POS tagging, UAS and LAS accuracies w.r.t. gold-standard segmentation on Universal Dependencies 2.0 - CoNLL 2017 shared task test data; in CoNLL 201

arXiv.org e-Print Archive

Crossref

An improved neural network model for joint POS tagging and dependency parsing

Author: Nguyen Dat Quoc
Verspoor Karin
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2018

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores at 94.51% and 92.87%, respectively, producing 1.5+% absolute improvements to the BIST graph-based parser, and also obtaining a state-of-the-art POS tagging accuracy at 97.97%. Furthermore, experimental results on parsing 61 "big" Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Straková, 2017) with 0.8% higher average POS tagging score and 3.6% higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at: https://github.com/datquocnguyen/jPTDP

Comment: 11 pages; In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, to appea

arXiv.org e-Print Archive

Crossref

Miko Team: Deep Learning Approach for Legal Question Answering in ALQAC 2022

Author: Nguyen Dat
Nguyen Minh Le
Nguyen Phuong Minh
Van Hieu Nguyen
Publication date: 03/11/2022

We introduce efficient deep learning-based methods for legal document processing, covering the Legal Document Retrieval and Legal Question Answering tasks in the Automated Legal Question Answering Competition (ALQAC 2022). In this competition, we achieved 1st place in the first task and 3rd place in the second task. Our method is based on the XLM-RoBERTa model, which is pre-trained on a large unlabeled corpus before being fine-tuned on the specific tasks. The experimental results show that our method works well in legal information retrieval tasks with limited labeled data. In addition, this method can be applied to other information retrieval tasks in low-resource languages.

arXiv.org e-Print Archive

Effective combination of pretrained models - KIT@IWSLT2022

Author: Liu Danni
Mullov Carlos
Nguyen Thai-Binh
Nguyen Tuan Nam
Niehues Jan
Pham Ngoc-Quan
Waibel Alexander
Publication venue: Association for Computational Linguistics
Publication date: 21/06/2022

KITopen

Bridging the Semantic Gap in Information Retrieval: Advanced Image Search Using Topic Models (情報検索における意味的ギャップの解消 : トピックモデルを用いた先進的画像探索)

Author: Nguyen Cam Tu
Publication date: 15/09/2011

Tohoku University 徳山豪課

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Institutional Repositories DataBase (IRDB)

Handling cross and out-of-domain samples in Thai word segmentation

Author: Chuangsuwanich Ekapol
Limkonchotiwat Peerat
Nutanong Sarana
Phatthiyaphaibun Wannaphong
Sarwar Raheem
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 06/05/2021

© 2021 The Authors. Published by ACL. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://aclanthology.org/2021.findings-acl.86

While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu since there can be domains with insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problems. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also proposed a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method’s generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.

Wolverhampton Intellectual Repository and E-theses