2,089 research outputs found

    Ways of not reading Gertrude Stein

    Get PDF
    I situate the controversial critical strategies of “distant reading” and “surface reading” in the reception history of Gertrude Stein, an author whose work was frequently declared “unreadable.” I argue that an early twentieth-century history of compromised forms of reading, including women’s reading and information work, subtends both the technology with which distant reading may be carried out and the ways in which an author’s work comes to be understood as a “corpus.”

    LawBench: Benchmarking Legal Knowledge of Large Language Models

    Full text link
    Large language models (LLMs) have demonstrated strong capabilities in various aspects. However, when applying them to the highly specialized, safety-critical legal domain, it is unclear how much legal knowledge they possess and whether they can reliably perform legal-related tasks. To address this gap, we propose LawBench, a comprehensive evaluation benchmark. LawBench has been meticulously crafted to provide a precise assessment of the LLMs' legal capabilities at three cognitive levels: (1) Legal knowledge memorization: whether LLMs can memorize needed legal concepts, articles and facts; (2) Legal knowledge understanding: whether LLMs can comprehend entities, events and relationships within legal text; (3) Legal knowledge application: whether LLMs can properly utilize their legal knowledge and perform the necessary reasoning steps to solve realistic legal tasks. LawBench contains 20 diverse tasks covering 5 task types: single-label classification (SLC), multi-label classification (MLC), regression, extraction and generation. We perform extensive evaluations of 51 LLMs on LawBench, including 20 multilingual LLMs, 22 Chinese-oriented LLMs and 9 legal-specific LLMs. The results show that GPT-4 remains the best-performing LLM in the legal domain, surpassing the others by a significant margin. While fine-tuning LLMs on legal-specific text brings certain improvements, we are still a long way from obtaining usable and reliable LLMs for legal tasks. All data, model predictions and evaluation code are released at https://github.com/open-compass/LawBench/. We hope this benchmark provides an in-depth understanding of LLMs' domain-specific capabilities and speeds up the development of LLMs in the legal domain.
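    The evaluation protocol the abstract describes — comparing model outputs against gold labels per task — can be sketched as a minimal scoring harness for the single-label classification (SLC) tasks. This is an illustrative sketch, not the official LawBench code; the task names and labels below are invented:

    ```python
    # Minimal sketch of per-task accuracy scoring for SLC-style tasks:
    # each task maps to a list of gold labels, and a model's predictions
    # are compared position by position.
    def score_slc(predictions, gold):
        """predictions/gold: dicts mapping task_id -> list of labels."""
        scores = {}
        for task_id, gold_labels in gold.items():
            preds = predictions.get(task_id, [])
            correct = sum(p == g for p, g in zip(preds, gold_labels))
            scores[task_id] = correct / len(gold_labels) if gold_labels else 0.0
        return scores

    # Hypothetical example: one task at the memorization level, one at
    # the application level (names invented for illustration).
    gold = {"charge_prediction": ["theft", "fraud"], "article_recall": ["Art. 264"]}
    preds = {"charge_prediction": ["theft", "robbery"], "article_recall": ["Art. 264"]}
    print(score_slc(preds, gold))  # {'charge_prediction': 0.5, 'article_recall': 1.0}
    ```

    A real harness would also need per-type metrics (e.g. F1 for multi-label tasks, ROUGE for generation), but the per-task aggregation pattern is the same.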

    Quality in subtitling: theory and professional reality

    Get PDF
    The issue of quality is of great importance in translation studies and, although some studies have been conducted in the field of subtitling, most discussions have been limited to aspects such as how to become a good subtitler and how to produce quality subtitles. Little research has been carried out to investigate other potential factors that may influence the quality of subtitling output in practice. In recent years, some subtitling courses at postgraduate level have attempted to bridge the gap between academia and industry, not only by incorporating the teaching of linguistic and technical skills into the curriculum but also by informing students about ethics, working conditions, market competition, and other relevant professional issues. This instruction is intended to prepare them for promising careers in the subtitling industry, where a progressively deteriorating trend has been observed by some professional subtitlers. The main aim and objective of this study is to explore both theoretical and practical aspects of subtitling quality. The study aspires to call attention to the factors influencing the quality of subtitles and also to provide suggestions to improve the state of affairs within the subtitling industry in terms of quality. In order to examine the potential factors that influence the perception of subtitling quality, particularly in the professional context, two rounds of online surveys were conducted to establish the working conditions of subtitlers. Despite the fact that the participants in the first survey were based in thirty-nine different countries, the data collected is more representative of the situation in Europe, where subtitling is a relatively mature industry compared to other parts of the world. The second survey targeted subtitlers working with the Chinese language in an attempt to study the burgeoning Chinese audiovisual market. 
This thesis provides a systematic analysis of the numerous parameters that have an impact on the quality of subtitling, both in theory and in professional reality, and offers a detailed insight into the working environment of subtitlers. At the same time, it endeavours to draw attention to the need to ensure decent working conditions in the industry. The general findings are discussed in terms of their implications for the development of the profession as well as for subtitler training and education.

    Advanced document data extraction techniques to improve supply chain performance

    Get PDF
    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework was designed, consisting of various processes including XML processing in the case of an existing OCR engine, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Any fields the system successfully extracts are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and later a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase the information extraction accuracy over that of existing methodologies. Finally, the study evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
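    The rule-based stage of the pipeline described above (text clean-up, pattern-based matching, type checking, key-value output) can be sketched as follows. The field names and regular expressions are hypothetical, not the thesis's actual rules:

    ```python
    import re

    # Hypothetical field patterns applied to OCR'd text from a bounding box.
    PATTERNS = {
        "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(\w+)", re.I),
        "total": re.compile(r"Total\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.I),
    }

    def clean(text):
        # Text clean-up step: collapse runs of whitespace.
        return re.sub(r"\s+", " ", text).strip()

    def extract_fields(ocr_text):
        """Return whichever fields match, in key-value form."""
        text = clean(ocr_text)
        fields = {}
        for name, pattern in PATTERNS.items():
            m = pattern.search(text)
            if m:
                value = m.group(1)
                if name == "total":  # type check: the total must parse as a number
                    try:
                        value = float(value.replace(",", ""))
                    except ValueError:
                        continue
                fields[name] = value
        return fields

    print(extract_fields("Invoice No: A1023\n  Total: $1,250.00"))
    # {'invoice_number': 'A1023', 'total': 1250.0}
    ```

    The learning mechanism and GAN-based detection sit upstream of this stage; here only the deterministic matching and type-checking steps are illustrated.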

    Working Styles of Student Translators in Revision and Post-editing: an Empirical-Experimental Study with Eye-tracking, Keylogging and Cue-based Retrospection

    Get PDF
    In today’s translation profession, being skilful at revision (including self-revision and other-revision) and post-editing tasks is becoming essential for translators. The exploration of the working styles of student translators in the revision and post-editing processes is vital in helping us to understand the nature of these tasks, and may help in improving pedagogy. Drawing on theories from translation-related studies, cognitive psychology, and text comprehension and production, the aims of this research were to: (1) identify the basic types of reading and typing activity (physical activities) of student translators in the processes of revision and post-editing, and to measure statistically and compare the duration of these activities within and across tasks; (2) identify the underlying purposes (mental activities) behind each type of reading and typing activity; (3) categorise the basic types of working style of student translators and compare the frequency of use of each working style both within and across tasks; (4) identify the personal working styles of student translators in carrying out different tasks, and (5) identify the most efficient working style in each task. Eighteen student translators from Durham University, with Chinese as L1 and English as L2, were invited to participate in the experiment. They were asked to translate, self-revise, other-revise and post-edit three comparable texts in Translog-II with the eye-tracking plugin activated. A cue-based retrospective interview was carried out after each session to collect the student translators’ subjective and conscious data for qualitative analysis. The raw logging data were transformed into User Activity Data and were analysed both quantitatively and qualitatively. This study identified seven types of reading and typing activity in the processes of self-revision, other-revision and post-editing. Three revision phases were defined and four types of working style were recognised. 
The student translators’ personal working styles were compared in all three tasks. In addition, a tentative model of their cognitive processes in self-revision, other-revision and post-editing was developed, and the efficiency of the four working styles in each task was tested.
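    The segmentation of raw logging data into reading and typing activities can be illustrated with a minimal sketch. This is an assumption-laden toy (the threshold and unit names are invented), not the study's actual User Activity Data pipeline:

    ```python
    # Toy segmentation of keystroke timestamps into activity units:
    # keystrokes closer together than a threshold form one typing burst;
    # longer gaps are treated as pause (reading) units.
    def segment_activities(keystroke_times, pause_threshold=2.0):
        """keystroke_times: sorted timestamps in seconds -> list of (kind, start, end)."""
        if not keystroke_times:
            return []
        units = []
        burst_start = prev = keystroke_times[0]
        for t in keystroke_times[1:]:
            if t - prev > pause_threshold:
                units.append(("typing", burst_start, prev))
                units.append(("pause", prev, t))
                burst_start = t
            prev = t
        units.append(("typing", burst_start, prev))
        return units

    print(segment_activities([0.0, 0.5, 1.0, 5.0, 5.3]))
    # [('typing', 0.0, 1.0), ('pause', 1.0, 5.0), ('typing', 5.0, 5.3)]
    ```

    A real analysis would additionally align each pause with concurrent eye-tracking fixations to decide whether the translator was reading the source, the target, or neither.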

    Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu

    Get PDF
    With the development of international information technology, we are producing a huge amount of information all the time, and the scarcest resource is gradually no longer information itself but the capacity to process information in each language. Obtaining the most effective information from such a large and complex body of multilingual textual information is a major goal of multilingual information processing. Multilingual text classification helps users to break the language barrier, accurately locate the required information and triage it. At the same time, the rapid development of the Internet has accelerated communication among users of various languages, giving rise to a large number of multilingual texts, such as book and movie reviews, online chats and product introductions, which contain a large amount of valuable implicit information and urgently need automated tools for categorization and processing. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), performed within the context of Baidu, a leading Chinese AI company with a strong Internet base, whose NLP division led the industry in bringing deep learning technology online in Machine Translation (MT) and search. Multilingual text classification is an important module in NLP machine translation and a basic module in NLP tasks. It can be applied to many fields, such as fake-review detection, news-headline categorization, and the analysis of positive and negative reviews. In the following work, we first define the AI model paradigm of 'pre-training and fine-tuning' in deep learning as used in the Baidu NLP department, and then investigate the application scenarios of multilingual text classification. Most of the text classification systems currently available on the Chinese market are designed for a single language, such as Alibaba's text classification system.
    If users need to classify texts of the same category in multiple languages, they need to train multiple single-language text classification systems and then classify the texts one by one. However, many internationalized products do not have a single text language, for example AliExpress's cross-border e-commerce business and Airbnb's B&B business. Industry needs to understand and classify users’ reviews in various languages and to conduct in-depth statistics and marketing strategy development, and multilingual text classification is particularly important in this scenario. We therefore focus on interpreting the methodology of the multilingual text classification model for machine translation in the Baidu NLP department: we capture multilingual datasets of reviews, news headlines and other data for manual classification and labelling, use the labelling results to fine-tune the multilingual text classification model, and report quality evaluation data for the fine-tuned Baidu multilingual text classification model. We discuss whether pre-training and fine-tuning of a large model can substantially improve the quality and performance of multilingual text classification. Finally, based on the machine translation-multilingual text classification model, we derive the application method of the pre-training and fine-tuning paradigm in current cutting-edge deep learning AI models under the NLP system, and verify the generality of the pre-training and fine-tuning paradigm in the deep learning and intelligent search field.
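    The 'pre-training and fine-tuning' paradigm the abstract centres on can be illustrated with a minimal, self-contained sketch: a frozen stand-in for a pretrained encoder plus a small trainable classification head. Everything here (hashed-trigram features, logistic head, toy data) is invented for illustration and bears no relation to Baidu's actual models:

    ```python
    import math

    DIM = 64  # feature dimensionality of the toy "pretrained" encoder

    def pretrained_features(text):
        # Stand-in for a frozen pretrained encoder: normalized hashed
        # character trigrams, which work across languages and scripts.
        vec = [0.0] * DIM
        for i in range(len(text) - 2):
            vec[hash(text[i:i + 3]) % DIM] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def fine_tune(examples, labels, epochs=300, lr=0.5):
        # Fine-tuning: train only a logistic-regression head (binary case)
        # on top of the frozen features, via plain SGD.
        w, b = [0.0] * DIM, 0.0
        feats = [pretrained_features(t) for t in examples]
        for _ in range(epochs):
            for x, y in zip(feats, labels):
                z = sum(wi * xi for wi, xi in zip(w, x)) + b
                g = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of logistic loss
                w = [wi - lr * g * xi for wi, xi in zip(w, x)]
                b -= lr * g

        def classify(text):
            x = pretrained_features(text)
            return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
        return classify

    # Toy multilingual fine-tuning data: 1 = positive review, 0 = negative.
    texts = ["great product", "terrible product", "produto excelente", "produto terrivel"]
    labels = [1, 0, 1, 0]
    clf = fine_tune(texts, labels)
    print(clf("great product"))
    ```

    The design point is that the expensive encoder is trained once and shared across languages, while fine-tuning touches only the small task-specific head — the same division of labour the abstract attributes to the large-model paradigm.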

    Design principles of integrated information platform for emergency responses: The case of 2008 Beijing Olympic Games

    Get PDF
    This paper investigates the challenges faced in designing an integrated information platform for emergency response management, using the Beijing Olympic Games as a case study. The research methods are grounded in action research, participatory design, and situation-awareness-oriented design. An industrial secondment of more than two years and six months of field studies ensured that a full understanding of user requirements had been obtained. A service-centered architecture was proposed to satisfy these user requirements. The proposed architecture consists mainly of information gathering, database management, and decision support services. The decision support services include a situational overview, instant risk assessment, emergency response preplanning, and disaster development prediction. Abstracting from the experience obtained while building this system, we outline a set of design principles in the general domain of information systems (IS) development for emergency management. These design principles form a contribution to the information systems literature because they provide guidance to developers aiming to support emergency response, a need that has not yet been adequately met by any existing type of IS. We are proud that the information platform developed was deployed in the real world and used in the 2008 Beijing Olympic Games. © 2012 INFORMS
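    The service-centered composition described above — independent information-gathering and decision-support services registered on one platform — can be sketched as below; the service names and payloads are hypothetical, not the deployed system's API:

    ```python
    # Minimal service registry: each service is a plain callable registered
    # under a name, and the decision-support layer composes them by name.
    class Platform:
        def __init__(self):
            self._services = {}

        def register(self, name, service):
            self._services[name] = service

        def call(self, name, *args, **kwargs):
            return self._services[name](*args, **kwargs)

    platform = Platform()
    # Hypothetical information-gathering and risk-assessment services.
    platform.register("gather", lambda sensor: {"sensor": sensor, "level": 3})
    platform.register("risk_assessment",
                      lambda report: "high" if report["level"] >= 3 else "low")

    report = platform.call("gather", "stadium-north")
    print(platform.call("risk_assessment", report))  # high
    ```

    Keeping services behind a registry like this is one way to let the decision-support layer evolve (adding preplan or prediction services) without changing the gathering and database layers.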

    A corpus-based study of Chinese and English translation of international economic law: an interdisciplinary study

    Get PDF
    International Economic Law (IEL), a sub-discipline of International Law, is concerned with the regulation of international economic relations and the behaviours of States, international organisations, and firms operating in the international arena. Due to the increase in commercial intercourse, the translation of International Economic Law has become an important factor in promoting cross-cultural communication. The translation of IEL is not purely a technical exercise that simply involves linguistic translation from one language to another, but rather a social and cultural act. This research sets out to examine the translation of terminology used in International Economic Law (IEL) – drawing on data from a bespoke self-built Parallel Corpus of International Economic Law (PCIEL) using a corpus-based, systematic micro-level framework – to analyse the subject matter and to discuss the feasibility of translating these legal terms at the word level, and at the sentence and discourse level, with a particular focus on the impact of cultural influences. The study presents findings, from the Chinese translator’s perspective, on translating International Economic Law between English and Chinese, with a focus on the areas of law, economics, and culture. The contribution made by a corpus-based approach applied to the interdisciplinary subject of IEL is explored. In particular, this establishes a link between linguistic and non-linguistic study in translating legal texts, especially IEL. The corpus data are organised in different semantic fields, and the translation analysis covers lexical, sentential and cultural perspectives. This research demonstrates that not only linguistic factors but also cultural factors make clear contributions to the translation of terminology in PCIEL.
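    One kind of micro-level query such a parallel corpus supports — counting how a source-language legal term is rendered in the target language — can be sketched as follows. The sentence pairs and terms are invented for illustration; PCIEL itself is not reproduced here:

    ```python
    from collections import Counter

    # Count candidate target-language renderings of a source term across
    # sentence-aligned pairs (src, tgt).
    def term_renderings(pairs, source_term, candidate_targets):
        counts = Counter()
        for src, tgt in pairs:
            if source_term in src.lower():
                for cand in candidate_targets:
                    if cand in tgt:
                        counts[cand] += 1
        return counts

    # Invented English-Chinese pairs for illustration only.
    pairs = [
        ("The parties shall submit to arbitration.", "双方应提交仲裁。"),
        ("Arbitration awards are final.", "仲裁裁决是终局的。"),
        ("Disputes may go to litigation.", "争议可诉诸诉讼。"),
    ]
    print(term_renderings(pairs, "arbitration", ["仲裁", "诉讼"]))
    # Counter({'仲裁': 2})
    ```

    Real corpus work would use word-aligned data and proper tokenization rather than substring matching, but the counting logic behind a terminology consistency study is of this shape.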

    Translation, interpreting, cognition

    Get PDF
    Cognitive aspects of the translation process have become central in Translation and Interpreting Studies in recent years, further establishing the field of Cognitive Translatology. Empirical and interdisciplinary studies investigating translation and interpreting processes promise a hitherto unprecedented predictive and explanatory power. This collection contains such studies which observe behaviour during translation and interpreting. The contributions cover a vast area and investigate behaviour during translation and interpreting – with a focus on training of future professionals, on language processing more generally, on the role of technology in the practice of translation and interpreting, on translation of multimodal media texts, on aspects of ergonomics and usability, on emotions, self-concept and psychological factors, and finally also on revision and post-editing. For the present publication, we selected a number of contributions presented at the Second International Congress on Translation, Interpreting and Cognition hosted by the Tra&Co Lab at the Johannes Gutenberg University of Mainz.

    The way out of the box

    Get PDF
    Synopsis: Cognitive aspects of the translation process have become central in Translation and Interpreting Studies in recent years, further establishing the field of Cognitive Translatology. Empirical and interdisciplinary studies investigating translation and interpreting processes promise a hitherto unprecedented predictive and explanatory power. This collection contains such studies which observe behaviour during translation and interpreting. The contributions cover a vast area and investigate behaviour during translation and interpreting – with a focus on training of future professionals, on language processing more generally, on the role of technology in the practice of translation and interpreting, on translation of multimodal media texts, on aspects of ergonomics and usability, on emotions, self-concept and psychological factors, and finally also on revision and post-editing. For the present publication, we selected a number of contributions presented at the Second International Congress on Translation, Interpreting and Cognition hosted by the Tra&Co Lab at the Johannes Gutenberg University of Mainz.
