124 research outputs found

    Datasets for Large Language Models: A Comprehensive Survey

    Full text link
    This paper embarks on an exploration into the Large Language Model (LLM) datasets, which play a crucial role in the remarkable advancements of LLMs. The datasets serve as the foundational infrastructure analogous to a root system that sustains and nurtures the development of LLMs. Consequently, examination of these datasets emerges as a critical topic in research. In order to address the current lack of a comprehensive overview and thorough analysis of LLM datasets, and to gain insights into their current status and future trends, this survey consolidates and categorizes the fundamental aspects of LLM datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5) Traditional Natural Language Processing (NLP) Datasets. The survey sheds light on the prevailing challenges and points out potential avenues for future investigation. Additionally, a comprehensive review of the existing available dataset resources is also provided, including statistics from 444 datasets, covering 8 language categories and spanning 32 domains. Information from 20 dimensions is incorporated into the dataset statistics. The total data size surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for other datasets. We aim to present the entire landscape of LLM text datasets, serving as a comprehensive reference for researchers in this field and contributing to future studies. Related resources are available at: https://github.com/lmmlzn/Awesome-LLMs-Datasets.Comment: 181 pages, 21 figure

    Automatic Extraction and Assessment of Entities from the Web

    Get PDF
    The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time consuming for a user to ïŹnd all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The ïŹndings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75–90 % (facts and entities respectively) after using assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research ïŹelds, such as question answering, named entity recognition, and information retrieval

    Study on open science: The general state of the play in Open Science principles and practices at European life sciences institutes

    Get PDF
    Nowadays, open science is a hot topic on all levels and also is one of the priorities of the European Research Area. Components that are commonly associated with open science are open access, open data, open methodology, open source, open peer review, open science policies and citizen science. Open science may a great potential to connect and influence the practices of researchers, funding institutions and the public. In this paper, we evaluate the level of openness based on public surveys at four European life sciences institute

    Understanding interaction mechanics in touchless target selection

    Get PDF
    Indiana University-Purdue University Indianapolis (IUPUI)We use gestures frequently in daily life—to interact with people, pets, or objects. But interacting with computers using mid-air gestures continues to challenge the design of touchless systems. Traditional approaches to touchless interaction focus on exploring gesture inputs and evaluating user interfaces. I shift the focus from gesture elicitation and interface evaluation to touchless interaction mechanics. I argue for a novel approach to generate design guidelines for touchless systems: to use fundamental interaction principles, instead of a reactive adaptation to the sensing technology. In five sets of experiments, I explore visual and pseudo-haptic feedback, motor intuitiveness, handedness, and perceptual Gestalt effects. Particularly, I study the interaction mechanics in touchless target selection. To that end, I introduce two novel interaction techniques: touchless circular menus that allow command selection using directional strokes and interface topographies that use pseudo-haptic feedback to guide steering–targeting tasks. Results illuminate different facets of touchless interaction mechanics. For example, motor-intuitive touchless interactions explain how our sensorimotor abilities inform touchless interface affordances: we often make a holistic oblique gesture instead of several orthogonal hand gestures while reaching toward a distant display. Following the Gestalt theory of visual perception, we found similarity between user interface (UI) components decreased user accuracy while good continuity made users faster. Other findings include hemispheric asymmetry affecting transfer of training between dominant and nondominant hands and pseudo-haptic feedback improving touchless accuracy. The results of this dissertation contribute design guidelines for future touchless systems. Practical applications of this work include the use of touchless interaction techniques in various domains, such as entertainment, consumer appliances, surgery, patient-centric health settings, smart cities, interactive visualization, and collaboration

    In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability

    Get PDF
    Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advancements in LBD research, previous studies contain several open problems and shortcomings that are hindering its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can further strengthen the existing understanding of the LBD work ow. In accordance with this goal, the thesis aims to enhance the LBD work ow with a view to ensuring its widespread applicability. The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas that are closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD work ow to problems outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora is valid across a vast range of problem settings. Secondly, the idea of widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These `new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature has engendered huge challenges to academics, irrespective of their domain. This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the rst systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD work ow, focusing on their role and potential impact on the information retrieval cycle of LBD. The discovery component section aims to intermingle two research directions that have been under-investigated in the LBD literature, `modern word embedding techniques' and `temporal dimension' by proposing diachronic semantic inferences. Their potential positive in uence in knowledge discovery is veri ed through both direct and indirect uses. The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-e cient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities. Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD work ow. The thesis offers new insights which future LBD research could further explore and expand to create more eficient, widely applicable LBD models to enable broader community benefits.Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202

    Feature Papers of Drones - Volume II

    Get PDF
    [EN] The present book is divided into two volumes (Volume I: articles 1–23, and Volume II: articles 24–54) which compile the articles and communications submitted to the Topical Collection ”Feature Papers of Drones” during the years 2020 to 2022 describing novel or new cutting-edge designs, developments, and/or applications of unmanned vehicles (drones). Articles 24–41 are focused on drone applications, but emphasize two types: firstly, those related to agriculture and forestry (articles 24–35) where the number of applications of drones dominates all other possible applications. These articles review the latest research and future directions for precision agriculture, vegetation monitoring, change monitoring, forestry management, and forest fires. Secondly, articles 36–41 addresses the water and marine application of drones for ecological and conservation-related applications with emphasis on the monitoring of water resources and habitat monitoring. Finally, articles 42–54 looks at just a few of the huge variety of potential applications of civil drones from different points of view, including the following: the social acceptance of drone operations in urban areas or their influential factors; 3D reconstruction applications; sensor technologies to either improve the performance of existing applications or to open up new working areas; and machine and deep learning development
    • 

    corecore