124 research outputs found
Datasets for Large Language Models: A Comprehensive Survey
This paper embarks on an exploration into the Large Language Model (LLM)
datasets, which play a crucial role in the remarkable advancements of LLMs. The
datasets serve as the foundational infrastructure analogous to a root system
that sustains and nurtures the development of LLMs. Consequently, examination
of these datasets emerges as a critical topic in research. In order to address
the current lack of a comprehensive overview and thorough analysis of LLM
datasets, and to gain insights into their current status and future trends,
this survey consolidates and categorizes the fundamental aspects of LLM
datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction
Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5)
Traditional Natural Language Processing (NLP) Datasets. The survey sheds light
on the prevailing challenges and points out potential avenues for future
investigation. Additionally, a comprehensive review of the existing available
dataset resources is also provided, including statistics from 444 datasets,
covering 8 language categories and spanning 32 domains. Information from 20
dimensions is incorporated into the dataset statistics. The total data size
surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for
other datasets. We aim to present the entire landscape of LLM text datasets,
serving as a comprehensive reference for researchers in this field and
contributing to future studies. Related resources are available at:
https://github.com/lmmlzn/Awesome-LLMs-Datasets.
Comment: 181 pages, 21 figure
Automatic Extraction and Assessment of Entities from the Web
The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it more time consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually-crafted ontology. The precision of the extracted information was found to be between 75–90% (facts and entities respectively) after using assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research fields, such as question answering, named entity recognition, and information retrieval
Study on open science: The general state of the play in Open Science principles and practices at European life sciences institutes
Nowadays, open science is a hot topic at all levels and is also one of the priorities of the European Research Area. Components that are commonly associated with open science are open access, open data, open methodology, open source, open peer review, open science policies and citizen science. Open science may have great potential to connect and influence the practices of researchers, funding institutions and the public. In this paper, we evaluate the level of openness based on public surveys at four European life sciences institute
Understanding interaction mechanics in touchless target selection
Indiana University-Purdue University Indianapolis (IUPUI). We use gestures frequently in daily life to interact with people, pets, or objects. But interacting with computers using mid-air gestures continues to challenge the design of touchless systems. Traditional approaches to touchless interaction focus on exploring gesture inputs and evaluating user interfaces. I shift the focus from gesture elicitation and interface evaluation to touchless interaction mechanics. I argue for a novel approach to generate design guidelines for touchless systems: to use fundamental interaction principles, instead of a reactive adaptation to the sensing technology. In five sets of experiments, I explore visual and pseudo-haptic feedback, motor intuitiveness, handedness, and perceptual Gestalt effects. Particularly, I study the interaction mechanics in touchless target selection. To that end, I introduce two novel interaction techniques: touchless circular menus that allow command selection using directional strokes and interface topographies that use pseudo-haptic feedback to guide steering–targeting tasks. Results illuminate different facets of touchless interaction mechanics. For example, motor-intuitive touchless interactions explain how our sensorimotor abilities inform touchless interface affordances: we often make a holistic oblique gesture instead of several orthogonal hand gestures while reaching toward a distant display. Following the Gestalt theory of visual perception, we found similarity between user interface (UI) components decreased user accuracy while good continuity made users faster. Other findings include hemispheric asymmetry affecting transfer of training between dominant and nondominant hands and pseudo-haptic feedback improving touchless accuracy. The results of this dissertation contribute design guidelines for future touchless systems.
Practical applications of this work include the use of touchless interaction techniques in various domains, such as entertainment, consumer appliances, surgery, patient-centric health settings, smart cities, interactive visualization, and collaboration
In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability
Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge
linkages in existing scientific literature to provide impetus to innovation and research
productivity. Despite significant advancements in LBD research, previous studies contain
several open problems and shortcomings that are hindering its progress. The overarching
goal of this thesis is to address these issues, not only to enhance the discovery
component of LBD, but also to shed light on new directions that can further strengthen
the existing understanding of the LBD workflow. In accordance with this goal, the thesis
aims to enhance the LBD workflow with a view to ensuring its widespread applicability.
The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of
the proposed solutions to a diverse range of problem settings. These problem settings
are not necessarily application areas that are closely related to the LBD context, but
could include a wide range of problems beyond the typical scope of LBD, which has traditionally
been applied to scientific literature. Adapting the LBD workflow to problems
outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of
LBD research, which is discovering novel linkages in text corpora, is valid across a vast
range of problem settings.
Secondly, the idea of widespread applicability also denotes the capability of the proposed
solutions to be executed in new environments. These 'new environments' are various
academic disciplines (i.e., cross-domain knowledge discovery) and publication languages
(i.e., cross-lingual knowledge discovery). The application of LBD models to new environments
is timely, since the massive growth of the scientific literature has engendered
huge challenges to academics, irrespective of their domain.
This thesis is divided into five main research objectives that address the following topics:
literature synthesis, the input component, the discovery component, reusability, and
portability. The objective of the literature synthesis is to address the gaps in existing
LBD reviews by conducting the first systematic literature review. The input component
section aims to provide generalised insights on the suitability of various input types in the
LBD workflow, focusing on their role and potential impact on the information retrieval
cycle of LBD.
The discovery component section aims to intermingle two research directions that have
been under-investigated in the LBD literature, 'modern word embedding techniques'
and 'temporal dimension', by proposing diachronic semantic inferences. Their potential
positive influence in knowledge discovery is verified through both direct and indirect
uses. The reusability section aims to present a new, distinct viewpoint on these LBD
models by verifying their reusability in a timely application area using a methodical reuse
plan. The last section, portability, proposes an interdisciplinary LBD framework that
can be applied to new environments. While highly cost-efficient and easily pluggable,
this framework also gives rise to a new perspective on knowledge discovery through its
generalisable capabilities.
Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main
research objectives, enhancing the existing understanding of the LBD workflow. The
thesis offers new insights which future LBD research could further explore and expand
to create more efficient, widely applicable LBD models to enable broader community
benefits.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
Feature Papers of Drones - Volume II
[EN] The present book is divided into two volumes (Volume I: articles 1–23, and Volume II: articles 24–54) which compile the articles and communications submitted to the Topical Collection "Feature Papers of Drones" during the years 2020 to 2022 describing novel or new cutting-edge designs, developments, and/or applications of unmanned vehicles (drones). Articles 24–41 are focused on drone applications, but emphasize two types: firstly, those related to agriculture and forestry (articles 24–35) where the number of applications of drones dominates all other possible applications. These articles review the latest research and future directions for precision agriculture, vegetation monitoring, change monitoring, forestry management, and forest fires. Secondly, articles 36–41 address the water and marine application of drones for ecological and conservation-related applications with emphasis on the monitoring of water resources and habitat monitoring. Finally, articles 42–54 look at just a few of the huge variety of potential applications of civil drones from different points of view, including the following: the social acceptance of drone operations in urban areas or their influential factors; 3D reconstruction applications; sensor technologies to either improve the performance of existing applications or to open up new working areas; and machine and deep learning development