    Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu

    With the development of international information technology, we produce enormous amounts of information all the time, and the scarcest resource is no longer information itself but the ability to process information in each language. Obtaining the most useful information from such a large and complex body of multilingual text is a major goal of multilingual information processing. Multilingual text classification helps users break the language barrier and accurately locate and triage the information they need. At the same time, the rapid development of the Internet has accelerated communication among users of many languages, giving rise to large volumes of multilingual text, such as book and movie reviews, online chats and product introductions, that contain valuable implicit information and urgently require automated tools for categorization and processing. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), performed within the context of Baidu, a leading Chinese AI company with a strong Internet base, whose NLP division led the industry in bringing deep learning technology online for Machine Translation (MT) and search. Multilingual text classification is an important module in machine translation and a basic building block of NLP tasks. It can be applied to many fields, such as fake review detection, news headline categorization, and analysis of positive and negative reviews. In the following work, we first define the 'pre-training and fine-tuning' AI model paradigm of deep learning as used in Baidu's NLP department, and then investigate the application scenarios of multilingual text classification. Most text classification systems currently available on the Chinese market are designed for a single language, such as Alibaba's text classification system. Users who need to classify texts of the same category in multiple languages must train a separate single-language classification system for each language and then classify the texts one by one. However, many internationalized products, such as AliExpress's cross-border e-commerce business and Airbnb's lodging business, do not have a single text language. Industry needs to understand and classify user reviews in many languages to support in-depth statistics and the development of marketing strategies, and multilingual text classification is particularly important in this scenario. Therefore, we focus on interpreting the methodology of the multilingual text classification model used for machine translation in Baidu's NLP department: we collect multilingual datasets of reviews, news headlines and other texts for manual classification and labeling, use the labeled results to fine-tune the multilingual text classification model, and report quality evaluation data for the fine-tuned Baidu multilingual text classification model. We discuss whether pre-training and fine-tuning of the large model can substantially improve the quality and performance of multilingual text classification.
Finally, based on the machine translation multilingual text classification model, we derive how the pre-training and fine-tuning paradigm can be applied to current cutting-edge deep learning AI models within the NLP system, and we verify the generality and state-of-the-art standing of the pre-training and fine-tuning paradigm in the deep learning and intelligent search field.
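    The pipeline the abstract describes (collect multilingual texts, label them manually, fine-tune a pre-trained model, evaluate) is the standard 'pre-training and fine-tuning' recipe. Below is a minimal sketch of that recipe; Baidu's internal pre-trained model is not public, so the publicly available bert-base-multilingual-cased checkpoint, the toy review data and the two-class label set are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of fine-tuning a pre-trained multilingual encoder for text
# classification, assuming the Hugging Face transformers/datasets libraries.
# The checkpoint, example reviews and labels are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labeled multilingual reviews: 1 = positive, 0 = negative.
train_data = Dataset.from_dict({
    "text": ["Great product, fast shipping!",
             "Produto chegou quebrado, não recomendo.",
             "质量很好，推荐购买。"],
    "label": [1, 0, 1],
})

checkpoint = "bert-base-multilingual-cased"  # public stand-in for an internal model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # The shared multilingual vocabulary covers all of the languages above.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

# Fine-tuning: the pre-trained encoder weights are updated on the labeled data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mtc-finetuned",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()
```

    A held-out labeled set scored after training (e.g., via trainer.evaluate()) would play the role of the quality evaluation data the abstract mentions.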

    Cultural institutions and Web 2.0

    This report gives the results of an exploratory survey of the approaches that Australian cultural institutions are implementing to meet Web 2.0 challenges. For the purposes of this study, cultural institutions are those organizations open to the general public that house information artefacts representative of national culture, namely galleries, museums, libraries and archives. The aim was to undertake a brief survey of the strategies being implemented by Australian cultural institutions to come to terms with Web 2.0 developments and meet their challenges. This has been complemented by consideration of the management and technical issues reported in the literature. The work leads to findings that should inform both the institutions and the Australian research and development community of issues and opportunities relating to enhanced provision of access to Australian cultural heritage.

    Emerging risks identification on food and feed - EFSA

    The European Food Safety Authority (EFSA) has established procedures for the identification of emerging risks in food and feed. The main objectives are to: (i) carry out activities aimed at identifying, assessing and disseminating information on emerging issues, and ensure coordination with relevant networks and international organisations; (ii) promote the identification of data sources and data collection and/or data generation on prioritised emerging issues; and (iii) evaluate the collected information and identify emerging risks. The objective of the Standing Working Group on Emerging Risks (SWG‐ER) is to collaborate with EFSA on the emerging risks identification (ERI) procedure and to provide strategic direction for EFSA's work, building on past and ongoing projects related to the ERI procedure. The SWG‐ER considered the ERI methodologies in place and the results obtained by EFSA. It concluded that a systematic approach to the identification of emerging issues based on expert networks is the major strength of the procedure, but that at present the procedure is mainly focused on single issues over short to medium time horizons, applies no consistent weighting or ranking, and lacks clear governance of emerging risks with follow‐up actions. The analysis also highlighted weaknesses in data collection, analysis and integration. No methodology is in place to estimate the value of the procedure's outputs in terms of avoided risk, and there is an urgent need for a communication strategy that addresses the lack of data, knowledge uncertainty and risk perception issues. Recommendations were given in three areas: (i) further develop a food system‐based approach, including the integration of the social sciences to improve understanding of interactions and dynamics between actors and drivers, and the development of horizon scanning protocols; (ii) improve data processing pipelines to prepare for big data analytics, implement a data validation system and develop data sharing agreements to explore mutual benefits; and (iii) revise the EFSA procedure for emerging risk identification to increase transparency and improve communication.

    A corpus-based study of ethically sensitive issues in EU directives, national transposition measures and the press

    This paper is set in the framework of the Eurolect Observatory Project, which is studying the differences between the EU varieties of legislative language (Eurolects) and their corresponding national legal varieties in 11 languages (Mori 2018). In this paper, our focus is on ethics and legislation: more specifically, the research question is whether any differences can be detected in the discursive construction of ethically sensitive issues in the English version of EU directives, their related national transposition measures adopted in the UK, and press articles reporting on the introduction, revision or implementation of such laws. In this sense, news reports and comments are seen as sitting at the end of a genre chain covering the whole spectrum of knowledge dissemination, from the expert level (legislation) to the popularising level (newspaper articles). The ethically sensitive issues in question concern human health and animal welfare, and the corpora used for the study were selected from the English section of the EOMC (Eurolect Observatory Multilingual Corpus) and from the LexisNexis database of press articles.
