2,089 research outputs found

    Ways of not reading Gertrude Stein

    Get PDF
    I situate the controversial critical strategies of “distant reading” and “surface reading” in the reception history of Gertrude Stein, an author whose work was frequently declared “unreadable.” I argue that an early twentieth-century history of compromised forms of reading, including women’s reading and information work, subtends both the technology with which distant reading may be carried out and the ways in which an author’s work comes to be understood as a “corpus.”

    LawBench: Benchmarking Legal Knowledge of Large Language Models

    Full text link
    Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark. LawBench has been meticulously crafted to provide a precise assessment of the LLMs' legal capabilities at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge application: whether LLMs can properly utilize their legal knowledge and perform the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
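    The evaluation protocol the abstract describes — comparing model outputs against gold labels per task — can be sketched as a minimal scoring harness for the single-label classification (SLC) tasks. This is an illustrative sketch, not the official LawBench code; the task names and labels below are invented:

    ```python
    # Minimal sketch of per-task accuracy scoring for SLC-style tasks:
    # each task maps to a list of gold labels, and a model's predictions
    # are compared position by position.
    def score_slc(predictions, gold):
        """predictions/gold: dicts mapping task_id -> list of labels."""
        scores = {}
        for task_id, gold_labels in gold.items():
            preds = predictions.get(task_id, [])
            correct = sum(p == g for p, g in zip(preds, gold_labels))
            scores[task_id] = correct / len(gold_labels) if gold_labels else 0.0
        return scores

    # Hypothetical example: one task at the memorization level, one at
    # the application level (names invented for illustration).
    gold = {"charge_prediction": ["theft", "fraud"], "article_recall": ["Art. 264"]}
    preds = {"charge_prediction": ["theft", "robbery"], "article_recall": ["Art. 264"]}
    print(score_slc(preds, gold))  # {'charge_prediction': 0.5, 'article_recall': 1.0}
    ```

    A real harness would also need per-type metrics (e.g. F1 for multi-label tasks, ROUGE for generation), but the per-task aggregation pattern is the same.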

    Quality in subtitling: theory and professional reality

    Get PDF
    The issue of quality is of great importance in translation studies and, although some studies have been conducted in the field of subtitling, most discussions have been limited to aspects such as how to become a good subtitler and how to produce quality subtitles. Little research has been carried out to investigate other potential factors that may influence the quality of subtitling output in practice. In recent years, some subtitling courses at postgraduate level have attempted to bridge the gap between academia and industry, not only by incorporating the teaching of linguistic and technical skills into the curriculum but also by informing students about ethics, working conditions, market competition, and other relevant professional issues. This instruction is intended to prepare them for promising careers in the subtitling industry, where a progressively deteriorating trend has been observed by some professional subtitlers. The main aim and objective of this study is to explore both theoretical and practical aspects of subtitling quality. The study aspires to call attention to the factors influencing the quality of subtitles and also to provide suggestions to improve the state of affairs within the subtitling industry in terms of quality. In order to examine the potential factors that influence the perception of subtitling quality, particularly in the professional context, two rounds of online surveys were conducted to establish the working conditions of subtitlers. Despite the fact that the participants in the first survey were based in thirty-nine different countries, the data collected is more representative of the situation in Europe, where subtitling is a relatively mature industry compared to other parts of the world. The second survey targeted subtitlers working with the Chinese language in an attempt to study the burgeoning Chinese audiovisual market. 
This thesis provides a systematic analysis of the numerous parameters that have an impact on the quality of subtitling, both in theory and in professional reality, and offers a detailed insight into the working environment of subtitlers. At the same time, it endeavours to draw attention to the need to ensure decent working conditions in the industry. The general findings are discussed in terms of their implications for the development of the profession as well as for subtitler training and education.

    Advanced document data extraction techniques to improve supply chain performance

    Get PDF
    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework was designed, consisting of various processes including XML processing in the case of an existing OCR engine, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Any fields the system successfully extracts are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and later a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase the information extraction accuracy over that of existing methodologies. Finally, the study evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
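    The rule-based stage of the pipeline described above (text clean-up, pattern-based matching, type checking, key-value output) can be sketched as follows. The field names and regular expressions are hypothetical, not the thesis's actual rules:

    ```python
    import re

    # Hypothetical field patterns applied to OCR'd text from a bounding box.
    PATTERNS = {
        "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)", re.I),
        "total": re.compile(r"Total\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.I),
    }

    def clean(text):
        # Text clean-up step: collapse runs of whitespace.
        return re.sub(r"\s+", " ", text).strip()

    def extract_fields(ocr_text):
        """Return whichever fields match, in key-value form."""
        text = clean(ocr_text)
        fields = {}
        for name, pattern in PATTERNS.items():
            m = pattern.search(text)
            if m:
                value = m.group(1)
                if name == "total":  # type check: the total must parse as a number
                    try:
                        value = float(value.replace(",", ""))
                    except ValueError:
                        continue
                fields[name] = value
        return fields

    print(extract_fields("Invoice No: A1023\n  Total: $1,250.00"))
    # {'invoice_number': 'A1023', 'total': 1250.0}
    ```

    The learning mechanism and GAN-based detection sit upstream of this stage; here only the deterministic matching and type-checking steps are illustrated.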

    Working Styles of Student Translators in Revision and Post-editing: an Empirical-Experimental Study with Eye-tracking, Keylogging and Cue-based Retrospection

    Get PDF
    In today’s translation profession, being skilful at revision (including self-revision and other-revision) and post-editing tasks is becoming essential for translators. The exploration of the working styles of student translators in the revision and post-editing processes is vital in helping us to understand the nature of these tasks, and may help in improving pedagogy. Drawing on theories from translation-related studies, cognitive psychology, and text comprehension and production, the aims of this research were to: (1) identify the basic types of reading and typing activity (physical activities) of student translators in the processes of revision and post-editing, and to measure statistically and compare the duration of these activities within and across tasks; (2) identify the underlying purposes (mental activities) behind each type of reading and typing activity; (3) categorise the basic types of working style of student translators and compare the frequency of use of each working style both within and across tasks; (4) identify the personal working styles of student translators in carrying out different tasks, and (5) identify the most efficient working style in each task. Eighteen student translators from Durham University, with Chinese as L1 and English as L2, were invited to participate in the experiment. They were asked to translate, self-revise, other-revise and post-edit three comparable texts in Translog-II with the eye-tracking plugin activated. A cue-based retrospective interview was carried out after each session to collect the student translators’ subjective and conscious data for qualitative analysis. The raw logging data were transformed into User Activity Data and were analysed both quantitatively and qualitatively. This study identified seven types of reading and typing activity in the processes of self-revision, other-revision and post-editing. Three revision phases were defined and four types of working style were recognised. 
The student translators’ personal working styles were compared in all three tasks. In addition, a tentative model of their cognitive processes in self-revision, other-revision and post-editing was developed, and the efficiency of the four working styles in each task was tested.
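    The segmentation of raw logging data into reading and typing activities can be illustrated with a minimal sketch. This is an assumption-laden toy (the threshold and unit names are invented), not the study's actual User Activity Data pipeline:

    ```python
    # Toy segmentation of keystroke timestamps into activity units:
    # keystrokes closer together than a threshold form one typing burst;
    # longer gaps are treated as pause (reading) units.
    def segment_activities(keystroke_times, pause_threshold=2.0):
        """keystroke_times: sorted timestamps in seconds -> list of (kind, start, end)."""
        if not keystroke_times:
            return []
        units = []
        burst_start = prev = keystroke_times[0]
        for t in keystroke_times[1:]:
            if t - prev > pause_threshold:
                units.append(("typing", burst_start, prev))
                units.append(("pause", prev, t))
                burst_start = t
            prev = t
        units.append(("typing", burst_start, prev))
        return units

    print(segment_activities([0.0, 0.5, 1.0, 5.0, 5.3]))
    # [('typing', 0.0, 1.0), ('pause', 1.0, 5.0), ('typing', 5.0, 5.3)]
    ```

    A real analysis would additionally align each pause with concurrent eye-tracking fixations to decide whether the translator was reading the source, the target, or neither.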

    Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu

    Get PDF
    With the development of international information technology, we are producing a huge amount of information all the time, and the scarcest resource is gradually no longer information itself but the capacity to process information in each language. Obtaining the most effective information from such a large and complex body of multilingual textual information is a major goal of multilingual information processing. Multilingual text classification helps users to break the language barrier, accurately locate the required information and triage it. At the same time, the rapid development of the Internet has accelerated communication among users of various languages, giving rise to a large number of multilingual texts, such as book and movie reviews, online chats and product introductions, which contain a large amount of valuable implicit information and urgently need automated tools for categorization and processing. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), performed within the context of Baidu, a leading Chinese AI company with a strong Internet base, whose NLP division led the industry in bringing deep learning technology online in Machine Translation (MT) and search. Multilingual text classification is an important module in NLP machine translation and a basic module in NLP tasks. It can be applied to many fields, such as fake-review detection, news-headline categorization, and the analysis of positive and negative reviews. In the following work, we first define the AI model paradigm of 'pre-training and fine-tuning' in deep learning as used in the Baidu NLP department, and then investigate the application scenarios of multilingual text classification. Most of the text classification systems currently available on the Chinese market are designed for a single language, such as Alibaba's text classification system.
    If users need to classify texts of the same category in multiple languages, they need to train multiple single-language text classification systems and then classify the texts one by one. However, many internationalized products do not have a single text language, for example AliExpress's cross-border e-commerce business and Airbnb's B&B business. Industry needs to understand and classify users’ reviews in various languages and to conduct in-depth statistics and marketing strategy development, and multilingual text classification is particularly important in this scenario. We therefore focus on interpreting the methodology of the multilingual text classification model for machine translation in the Baidu NLP department: we capture multilingual datasets of reviews, news headlines and other data for manual classification and labelling, use the labelling results to fine-tune the multilingual text classification model, and report quality evaluation data for the fine-tuned Baidu multilingual text classification model. We discuss whether pre-training and fine-tuning of a large model can substantially improve the quality and performance of multilingual text classification. Finally, based on the machine translation-multilingual text classification model, we derive the application method of the pre-training and fine-tuning paradigm in current cutting-edge deep learning AI models under the NLP system, and verify the generality of the pre-training and fine-tuning paradigm in the deep learning and intelligent search field.
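    The 'pre-training and fine-tuning' paradigm the abstract centres on can be illustrated with a minimal, self-contained sketch: a frozen stand-in for a pretrained encoder plus a small trainable classification head. Everything here (hashed-trigram features, logistic head, toy data) is invented for illustration and bears no relation to Baidu's actual models:

    ```python
    import math

    DIM = 64  # feature dimensionality of the toy "pretrained" encoder

    def pretrained_features(text):
        # Stand-in for a frozen pretrained encoder: normalized hashed
        # character trigrams, which work across languages and scripts.
        vec = [0.0] * DIM
        for i in range(len(text) - 2):
            vec[hash(text[i:i + 3]) % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def fine_tune(examples, labels, epochs=300, lr=0.5):
        # Fine-tuning: train only a logistic-regression head (binary case)
        # on top of the frozen features, via plain SGD.
        w, b = [0.0] * DIM, 0.0
        feats = [pretrained_features(t) for t in examples]
        for _ in range(epochs):
            for x, y in zip(feats, labels):
                z = sum(wi * xi for wi, xi in zip(w, x)) + b
                g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of logistic loss
                w = [wi - lr * g * xi for wi, xi in zip(w, x)]
                b -= lr * g

        def classify(text):
            x = pretrained_features(text)
            return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
        return classify

    # Toy multilingual fine-tuning data: 1 = positive review, 0 = negative.
    texts = ["great product", "terrible product", "produto excelente", "produto terrivel"]
    labels = [1, 0, 1, 0]
    clf = fine_tune(texts, labels)
    print(clf("great product"))
    ```

    The design point is that the expensive encoder is trained once and shared across languages, while fine-tuning touches only the small task-specific head — the same division of labour the abstract attributes to the large-model paradigm.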

    Design principles of integrated information platform for emergency responses: The case of 2008 Beijing Olympic Games

    Get PDF
    This paper investigates the challenges faced in designing an integrated information platform for emergency response management, using the Beijing Olympic Games as a case study. The research methods are grounded in action research, participatory design, and situation-awareness-oriented design. An industrial secondment of more than two years and six months of field studies ensured that a full understanding of user requirements had been obtained. A service-centered architecture was proposed to satisfy these user requirements. The proposed architecture consists mainly of information gathering, database management, and decision support services. The decision support services include a situational overview, instant risk assessment, emergency response preplanning, and disaster development prediction. Abstracting from the experience obtained while building this system, we outline a set of design principles in the general domain of information systems (IS) development for emergency management. These design principles form a contribution to the information systems literature because they provide guidance to developers aiming to support emergency response, a need that has not yet been adequately met by any existing type of IS. We are proud that the information platform developed was deployed in the real world and used in the 2008 Beijing Olympic Games. © 2012 INFORMS
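    The service-centered composition described above — independent information-gathering and decision-support services registered on one platform — can be sketched as below; the service names and payloads are hypothetical, not the deployed system's API:

    ```python
    # Minimal service registry: each service is a plain callable registered
    # under a name, and the decision-support layer composes them by name.
    class Platform:
        def __init__(self):
            self._services = {}

        def register(self, name, service):
            self._services[name] = service

        def call(self, name, *args, **kwargs):
            return self._services[name](*args, **kwargs)

    platform = Platform()
    # Hypothetical information-gathering and risk-assessment services.
    platform.register("gather", lambda sensor: {"sensor": sensor, "level": 3})
    platform.register("risk_assessment",
                      lambda report: "high" if report["level"] >= 3 else "low")

    report = platform.call("gather", "stadium-north")
    print(platform.call("risk_assessment", report))  # high
    ```

    Keeping services behind a registry like this is one way to let the decision-support layer evolve (adding preplan or prediction services) without changing the gathering and database layers.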

    A corpus-based study of Chinese and English translation of international economic law: an interdisciplinary study

    Get PDF
    International Economic Law (IEL), a sub-discipline of International Law, is concerned with the regulation of international economic relations and the behaviours of States, international organisations, and firms operating in the international arena. Due to the increase in commercial intercourse, the translation of International Economic Law has become an important factor in promoting cross-cultural communication. The translation of IEL is not purely a technical exercise that simply involves linguistic translation from one language to another, but rather a social and cultural act. This research sets out to examine the translation of terminology used in International Economic Law (IEL) – drawing on data from a bespoke self-built Parallel Corpus of International Economic Law (PCIEL) using a corpus-based, systematic micro-level framework – to analyse the subject matter and to discuss the feasibility of translating these legal terms at the word level, and at the sentence and discourse level, with a particular focus on the impact of cultural influences. The study presents findings, from the Chinese translator’s perspective, on translating International Economic Law between English and Chinese, with a focus on the areas of law, economics, and culture. The contribution made by a corpus-based approach applied to the interdisciplinary subject of IEL is explored. In particular, this establishes a link between linguistic and non-linguistic study in translating legal texts, especially IEL. The corpus data are organised in different semantic fields, and the translation analysis covers lexical, sentential and cultural perspectives. This research demonstrates that not only linguistic factors but also cultural factors make clear contributions to the translation of terminology in PCIEL.
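    One kind of micro-level query such a parallel corpus supports — counting how a source-language legal term is rendered in the target language — can be sketched as follows. The sentence pairs and terms are invented for illustration; PCIEL itself is not reproduced here:

    ```python
    from collections import Counter

    # Count candidate target-language renderings of a source term across
    # sentence-aligned pairs (src, tgt).
    def term_renderings(pairs, source_term, candidate_targets):
        counts = Counter()
        for src, tgt in pairs:
            if source_term in src.lower():
                for cand in candidate_targets:
                    if cand in tgt:
                        counts[cand] += 1
        return counts

    # Invented English-Chinese pairs for illustration only.
    pairs = [
        ("The parties shall submit to arbitration.", "双方应提交仲裁。"),
        ("Arbitration awards are final.", "仲裁裁决是终局的。"),
        ("Disputes may go to litigation.", "争议可诉诸诉讼。"),
    ]
    print(term_renderings(pairs, "arbitration", ["仲裁", "诉讼"]))
    # Counter({'仲裁': 2})
    ```

    Real corpus work would use word-aligned data and proper tokenization rather than substring matching, but the counting logic behind a terminology consistency study is of this shape.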

    Translation, interpreting, cognition

    Get PDF
    Cognitive aspects of the translation process have become central in Translation and Interpreting Studies in recent years, further establishing the field of Cognitive Translatology. Empirical and interdisciplinary studies investigating translation and interpreting processes promise a hitherto unprecedented predictive and explanatory power. This collection contains such studies which observe behaviour during translation and interpreting. The contributions cover a vast area and investigate behaviour during translation and interpreting – with a focus on training of future professionals, on language processing more generally, on the role of technology in the practice of translation and interpreting, on translation of multimodal media texts, on aspects of ergonomics and usability, on emotions, self-concept and psychological factors, and finally also on revision and post-editing. For the present publication, we selected a number of contributions presented at the Second International Congress on Translation, Interpreting and Cognition hosted by the Tra&Co Lab at the Johannes Gutenberg University of Mainz.

    The way out of the box

    Get PDF
    Synopsis: Cognitive aspects of the translation process have become central in Translation and Interpreting Studies in recent years, further establishing the field of Cognitive Translatology. Empirical and interdisciplinary studies investigating translation and interpreting processes promise a hitherto unprecedented predictive and explanatory power. This collection contains such studies which observe behaviour during translation and interpreting. The contributions cover a vast area and investigate behaviour during translation and interpreting – with a focus on training of future professionals, on language processing more generally, on the role of technology in the practice of translation and interpreting, on translation of multimodal media texts, on aspects of ergonomics and usability, on emotions, self-concept and psychological factors, and finally also on revision and post-editing. For the present publication, we selected a number of contributions presented at the Second International Congress on Translation, Interpreting and Cognition hosted by the Tra&Co Lab at the Johannes Gutenberg University of Mainz.
