156 research outputs found

    Persian Semantic Role Labeling Using Transfer Learning and BERT-Based Models

    Full text link
    Semantic role labeling (SRL) is the process of detecting the predicate-argument structure of each predicate in a sentence. SRL plays a crucial role as a pre-processing step in many NLP applications such as topic and concept extraction, question answering, summarization, machine translation, sentiment analysis, and text mining. Recently, in many languages, unified SRL has attracted considerable attention owing to its outstanding performance, which results from overcoming the error propagation problem. Regarding the Persian language, however, all previous work has focused on traditional SRL methods, leading to a drop in accuracy and imposing expensive feature extraction steps in terms of financial resources, time, and energy consumption. In this work, we present an end-to-end SRL method that not only eliminates the need for feature extraction but also outperforms existing methods when facing new samples in practical situations. The proposed method does not employ any auxiliary features and shows an improvement of more than 16 percentage points in accuracy (83.16%) over previous methods under similar circumstances. Comment: 17 pages, 4 figures, 10 tables, to appear in Digital Scholarship in the Humanities journal
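
    As a concrete illustration of the end-to-end framing, the sketch below casts SRL as BERT-based token classification, predicting one argument tag per token with no auxiliary features. The model name and the toy label set are assumptions for illustration, not the paper's exact configuration, and the weights here are untrained.

    # Minimal sketch: end-to-end SRL as token classification over a BERT
    # encoder. MODEL_NAME and LABELS are illustrative assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT encoder
    LABELS = ["O", "B-ARG0", "I-ARG0", "B-ARG1", "I-ARG1", "B-V"]

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS)
    )

    def tag_sentence(sentence: str) -> list[tuple[str, str]]:
        """Predict one SRL tag per subword token (untrained weights: demo only)."""
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits  # shape (1, seq_len, num_labels)
        tags = logits.argmax(dim=-1)[0].tolist()
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        return [(tok, LABELS[t]) for tok, t in zip(tokens, tags)]

    print(tag_sentence("He opened the door."))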

    Automatic Scaling of Text for Training Second Language Reading Comprehension

    Get PDF
    For children learning their first language, reading is one of the most effective ways to acquire new vocabulary. Studies link students who read more with larger and more complex vocabularies. For second language learners, there is a substantial barrier to reading: even the books written for early first language readers assume a base vocabulary of nearly 7,000 word families and a nuanced understanding of grammar. This project looks at ways technology can help second language learners overcome this high barrier to entry, and at the effectiveness of learning through reading for adults acquiring a foreign language. Through the implementation of Dokusha, an automatic graded reader generator for Japanese, this project explores how advancements in natural language processing can be used to automatically simplify text for extensive reading in Japanese as a foreign language.
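
    The core check behind a graded-reader generator can be sketched as a vocabulary-coverage test: flag a passage for simplification when too many of its words fall outside the learner's known vocabulary. The function names and the 95% threshold below are illustrative assumptions (extensive-reading research commonly cites 95-98% coverage targets), not Dokusha's actual implementation.

    # Minimal sketch: decide whether a passage needs simplification by
    # measuring known-vocabulary coverage. The threshold is an assumption.
    def coverage(text: str, known_words: set[str]) -> float:
        words = [w.strip(".,!?").lower() for w in text.split()]
        words = [w for w in words if w]
        if not words:
            return 1.0
        return sum(w in known_words for w in words) / len(words)

    def needs_simplification(text: str, known_words: set[str],
                             target: float = 0.95) -> bool:
        return coverage(text, known_words) < target

    known = {"the", "cat", "sat", "on", "mat", "a"}
    print(needs_simplification("The cat sat on the mat.", known))           # False
    print(needs_simplification("The felinoid reposed on the rug.", known))  # True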

    Sentiment classification with case-base approach

    Get PDF
    The increasing growth of social networks, blogs, and user review sites makes the Internet a huge source of data, especially about how people think, feel, and act toward different issues. These days, people's opinions play an important role in politics, industry, education, and beyond. Governments, large and small industries, academic institutes, companies, and individuals are therefore looking to automatic techniques to extract the information they need from large amounts of data. Sentiment analysis is one answer to this need. It is an application of natural language processing and computational linguistics that draws on advanced techniques such as machine learning and language models to capture evaluative judgments (positive, negative, or neutral, with or without their strength) from plain text. In this thesis we study a case-based, cross-domain approach to sentiment analysis at the document level. Our case-based algorithm generates a binary classifier that uses a set of processed cases and five different sentiment lexicons to extract the polarity, along with the corresponding scores, from reviews. Since sentiment analysis is inherently a domain-dependent task, which makes it difficult and expensive, we use a cross-domain approach, training our classifier on six different domains instead of limiting it to one. To improve the accuracy of the classifier, we add negation detection as a part of our algorithm. Moreover, to improve the performance of our approach, some innovative modifications are applied. It is worth mentioning that our approach allows for further developments by adding more sentiment lexicons and data sets in the future.
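
    To make the lexicon-plus-negation idea concrete, the sketch below scores a review against a sentiment lexicon and flips the polarity of a word that follows a negator. The tiny lexicon and the one-word negation window are illustrative assumptions; the thesis combines five full lexicons across six domains.

    # Minimal sketch: lexicon-based polarity with simple negation handling.
    LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
    NEGATORS = {"not", "no", "never"}

    def polarity(review: str) -> str:
        score, negate = 0.0, False
        for tok in review.lower().split():
            word = tok.strip(".,!?")
            if word in NEGATORS:
                negate = True               # flip the next sentiment word
                continue
            if word in LEXICON:
                score += -LEXICON[word] if negate else LEXICON[word]
                negate = False
        return "positive" if score >= 0 else "negative"

    print(polarity("The plot was not good and the acting was awful."))  # negative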

    External Reasoning: Towards Multi-Large-Language-Models Interchangeable Assistance with Human Feedback

    Full text link
    Memory is identified as a crucial human faculty that allows for the retention of visual and linguistic information within the hippocampus and neurons in the brain, which can subsequently be retrieved to address real-world challenges that arise through a lifetime of learning. The resolution of complex AI tasks through the application of acquired knowledge represents a stride toward the realization of artificial general intelligence. However, although Large Language Models (LLMs) like GPT-3.5 and GPT-4 have displayed remarkable capabilities in language comprehension, generation, interaction, and reasoning, they are inhibited by constraints on context length that preclude the processing of extensive, continually evolving knowledge bases. This paper proposes that LLMs could be augmented through the selective integration of knowledge from external repositories, and in doing so introduces a novel methodology for External Reasoning, exemplified by ChatPDF. Central to this approach is the establishment of a tiered policy for External Reasoning based on Multiple LLM Interchange Assistance, in which the level of support rendered is modulated across entry, intermediate, and advanced tiers based on the complexity of the query, with adjustments made in response to human feedback. A comprehensive evaluation of this methodology is conducted using multiple LLMs, and the results indicate state-of-the-art performance, surpassing existing solutions including ChatPDF.com. Moreover, the paper emphasizes that this approach is more efficient than direct processing of the full text by LLMs. Comment: technical report
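
    The tiered policy can be pictured as a router that escalates harder queries to stronger models and external retrieval. Aside from the entry/intermediate/advanced tier names, which follow the paper, everything below is an illustrative assumption: the model identifiers, the complexity heuristic, and the feedback bump are placeholders, not the paper's actual mechanism.

    # Minimal sketch: tiered routing with a human-feedback escalation knob.
    from dataclasses import dataclass

    @dataclass
    class Tier:
        name: str
        model: str            # placeholder model identifier
        use_retrieval: bool   # whether to attach external context

    TIERS = [
        Tier("entry", "small-llm", use_retrieval=False),
        Tier("intermediate", "mid-llm", use_retrieval=True),
        Tier("advanced", "large-llm", use_retrieval=True),
    ]

    def estimate_complexity(query: str) -> int:
        """Crude stand-in for the paper's complexity judgment: 0, 1, or 2."""
        cues = sum(c in query.lower() for c in ("compare", "why", "across"))
        return min(cues + (len(query.split()) > 20), 2)

    def route(query: str, feedback_bump: int = 0) -> Tier:
        """Human feedback can bump a query up one or more tiers."""
        level = min(estimate_complexity(query) + feedback_bump, len(TIERS) - 1)
        return TIERS[level]

    print(route("What is the title of section 2?").name)                   # entry
    print(route("Compare the methods used across chapters 3 and 5").name)  # advanced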

    Linking genes to literature: text mining, information extraction, and retrieval applications for biology

    Get PDF
    Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for the manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to automatically extract biologically relevant information from electronic texts. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet.

    Grounding event references in news

    Get PDF
    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference, so their computational processing proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task which, analogous to named entity linking or disambiguation, models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking aims to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
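
    The retrieval step can be sketched as a weighted combination of term overlap and temporal proximity between an event reference and each candidate archival article. The weights and the 30-day decay below are illustrative assumptions, not the values learned from the hyperlink corpora.

    # Minimal sketch: score candidate archive articles for an event reference.
    from datetime import date

    def term_overlap(reference: str, article: str) -> float:
        ref, art = set(reference.lower().split()), set(article.lower().split())
        return len(ref & art) / max(len(ref), 1)

    def temporal_score(ref_date: date, art_date: date) -> float:
        days = abs((ref_date - art_date).days)
        return 1.0 / (1.0 + days / 30.0)   # decay on a scale of months

    def link_score(reference, ref_date, article, art_date,
                   w_term=0.7, w_time=0.3):
        return (w_term * term_overlap(reference, article)
                + w_time * temporal_score(ref_date, art_date))

    candidates = [
        ("Sydney storm floods CBD streets", date(2010, 3, 6)),
        ("Budget debate continues in Canberra", date(2010, 5, 11)),
    ]
    ref_text, ref_when = "the storm that flooded Sydney", date(2010, 3, 8)
    best = max(candidates, key=lambda c: link_score(ref_text, ref_when, *c))
    print(best[0])   # the storm article wins on both features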

    Designing coherent and engaging open-domain conversational AI systems

    Get PDF
    Designing conversational AI systems able to engage in open-domain ‘social’ conversation is extremely challenging and a frontier of current research. Such systems are required to have extensive awareness of the dialogue context and world knowledge, as well as of user intents and interests, requiring more complicated language understanding, dialogue management, and state and topic tracking mechanisms than traditional task-oriented dialogue systems. Given the wide coverage of topics in open-domain dialogue, the conversation can span multiple turns in which a number of complex linguistic phenomena (e.g. ellipsis and anaphora) are present and should be resolved for the system to be contextually aware. Such systems also need to be engaging, keeping the users’ interest over long conversations. These are only some of the challenges that open-domain dialogue systems face. This thesis therefore focuses on designing dialogue systems able to hold extensive open-domain conversations in a coherent, engaging, and appropriate manner over multiple turns. First, different types of dialogue system architecture and design decisions are discussed for social open-domain conversations, along with relevant evaluation metrics. A modular architecture for ensemble-based conversational systems is presented, called Alana, a finalist in the Amazon Alexa Prize Challenge in 2017 and 2018, able to tackle many of the challenges of open-domain social conversation. The system combines features such as topic tracking, contextual natural language understanding, entity linking, user modelling, information retrieval, and response ranking, using a rich representation of dialogue state. The thesis next analyses the performance of the 2017 system and describes the upgrades developed for the 2018 system. This leads to an analysis and comparison of the real-user data collected in both years with different system configurations, allowing assessment of the impact of different design decisions and modules. Finally, Alana was integrated into an embodied robotic platform and enhanced with the ability to also perform tasks. This system was deployed and evaluated in a shopping mall in Finland. Further analysis of the added embodiment is presented and discussed, as well as the challenges of translating open-domain dialogue systems into other languages. Data analysis of the collected real-user data shows the importance of the variety of features developed and decisions made in the design of the Alana system.
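
    The ensemble-plus-ranking design can be sketched as follows: each bot proposes a candidate reply with a confidence, and a ranker scores candidates against the dialogue state. The two features below (a repetition penalty and a mild length bonus) and the example bots are illustrative assumptions, not Alana's actual ranking function.

    # Minimal sketch: rank candidate replies from an ensemble of bots.
    def rank_responses(candidates: list[dict], history: list[str]) -> str:
        def score(cand: dict) -> float:
            reply = cand["text"]
            s = cand["confidence"]
            # Penalise repeating something already said in the dialogue.
            if any(reply.lower() == turn.lower() for turn in history):
                s -= 1.0
            # Mildly prefer longer, more contentful replies.
            s += min(len(reply.split()), 15) * 0.02
            return s
        return max(candidates, key=score)["text"]

    history = ["Hi! What would you like to talk about?"]
    candidates = [
        {"bot": "news", "confidence": 0.8,
         "text": "I read an interesting article about Mars today."},
        {"bot": "chitchat", "confidence": 0.9,
         "text": "Hi! What would you like to talk about?"},
    ]
    print(rank_responses(candidates, history))  # the news bot's reply wins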

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we survey the impact and legal consequences of these technical advances and point out future directions of research.

    A rules based system for named entity recognition in modern standard Arabic

    Get PDF
    The amount of textual information available electronically has made it difficult for many users to find and access the right information within acceptable time. Research communities in the natural language processing (NLP) field are developing tools and techniques to alleviate these problems and help users exploit these vast resources. These techniques include Information Retrieval (IR) and Information Extraction (IE). The work described in this thesis concerns IE and, more specifically, named entity extraction in Arabic. The Arabic language is of significant interest to the NLP community mainly due to its political and economic significance, but also due to its interesting characteristics. Text usually contains all kinds of names, such as person names, company names, city and country names, sports teams, chemicals, and many other names from specific domains. These names are called Named Entities (NEs), and Named Entity Recognition (NER), one of the main tasks of IE systems, seeks to locate and automatically classify these names into predefined categories. NER systems are developed for different applications and can benefit other information management technologies, as they can be built on top of an IR system or used as the base module of a data mining application. In this thesis we propose an efficient and effective framework for extracting Arabic NEs from text using a rule-based approach. Our approach makes use of Arabic contextual and morphological information to extract named entities. The context is represented by means of words that are used as clues for each named entity type. Morphological information is used to detect the part of speech of each word given to the morphological analyzer. Subsequently, we developed and implemented rules to recognise each position of a named entity. Finally, our system implementation, evaluation metrics, and experimental results are presented.
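
    The clue-word idea at the heart of the rule-based approach can be sketched as follows (shown with transliterated English text for readability). The clue lists and the single contextual rule are illustrative assumptions; the thesis pairs such rules with morphological analysis of Arabic, which has no capitalisation to lean on.

    # Minimal sketch: contextual clue words trigger entity recognition.
    CLUES = {
        "PERSON": {"mr", "dr", "sheikh", "president"},
        "LOCATION": {"city", "province", "republic"},
    }

    def extract_entities(tokens: list[str]) -> list[tuple[str, str]]:
        """Rule: the word right after a clue word is tagged as an entity."""
        entities = []
        for i, tok in enumerate(tokens[:-1]):
            nxt = tokens[i + 1]
            for label, clues in CLUES.items():
                if tok.lower().strip(".") in clues:
                    entities.append((nxt.strip(".,"), label))
        return entities

    sentence = "President Mubarak visited the city Alexandria yesterday".split()
    print(extract_entities(sentence))
    # [('Mubarak', 'PERSON'), ('Alexandria', 'LOCATION')]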