Dataset search: a survey
Generating value from data requires the ability to find, access and make
sense of datasets. There are many efforts underway to encourage data sharing
and reuse, from scientific publishers asking authors to submit data alongside
manuscripts to data marketplaces, open data portals and data communities.
Google recently beta released a search service for datasets, which allows users
to discover data stored in various online repositories via keyword queries.
These developments foreshadow an emerging research field around dataset search
or retrieval that broadly encompasses frameworks, methods and tools that help
match a user data need against a collection of datasets. Here, we survey the
state of the art of research and commercial systems in dataset retrieval. We
identify what makes dataset search a research field in its own right, with
unique challenges and methods, and highlight open problems. We look at
approaches and implementations from related areas dataset search is drawing
upon, including information retrieval, databases, entity-centric and tabular
search in order to identify possible paths to resolve these open problems as
well as immediate next steps that will take the field forward.
Comment: 20 pages, 153 references
Interactive Machine Learning with Applications in Health Informatics
Recent years have witnessed unprecedented growth of health data, including millions of biomedical research publications, electronic health records, patient discussions on health forums and social media, fitness tracker trajectories, and genome sequences. Information retrieval and machine learning techniques are powerful tools to unlock invaluable knowledge in these data, yet they need to be guided by human experts. Unlike training machine learning models in other domains, labeling and analyzing health data requires highly specialized expertise, and the time of medical experts is extremely limited. How can we mine big health data with little expert effort? In this dissertation, I develop state-of-the-art interactive machine learning algorithms that bring together human intelligence and machine intelligence in health data mining tasks. By making efficient use of human experts' domain knowledge, we can achieve high-quality solutions with minimal manual effort.
I first introduce a high-recall information retrieval framework that helps human users efficiently harvest not just one but as many relevant documents as possible from a searchable corpus. This is a common need in professional search scenarios such as medical search and literature review. Then I develop two interactive machine learning algorithms that leverage human experts' domain knowledge to combat the curse of "cold start" in active learning, with applications in clinical natural language processing. A consistent empirical observation is that the overall learning process can be reliably accelerated by a knowledge-driven "warm start", followed by machine-initiated active learning. As a theoretical contribution, I propose a general framework for interactive machine learning. Under this framework, a unified optimization objective explains many existing algorithms used in practice, and inspires the design of new algorithms.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/147518/1/raywang_1.pd
Exploring the State of the Art in Legal QA Systems
Answering questions related to the legal domain is a complex task, primarily
due to the intricate nature and diverse range of legal document systems.
Providing an accurate answer to a legal query typically necessitates
specialized knowledge in the relevant domain, which makes this task all the
more challenging, even for human experts. Question answering (QA) systems are
designed to generate answers to questions asked in human languages. They use
natural language processing to understand questions and search through
information to find relevant answers. QA has various practical applications,
including customer service, education, research, and cross-lingual
communication. However, they face challenges such as improving natural language
understanding and handling complex and ambiguous questions. At this time,
there is a lack of surveys that discuss legal question
answering. To address this problem, we provide a comprehensive survey that
reviews 14 benchmark datasets for question answering in the legal field and
the state-of-the-art deep learning models for legal question answering. We
cover the different architectures and
techniques used in these studies and the performance and limitations of these
models. Moreover, we have established a public GitHub repository where we
regularly upload the most recent articles, open data, and source code. The
repository is available at:
https://github.com/abdoelsayed2016/Legal-Question-Answering-Review
Hybrid deep neural network for Bangla automated image descriptor
Automated image-to-text generation is a computationally challenging computer vision task that requires sufficient comprehension of both the syntactic and the semantic meaning of an image to generate a meaningful description. Until recently, it has been studied only to a limited extent because of the lack of visual-descriptor datasets and of functional models that capture the intrinsic complexities of image features. In this study, a novel dataset called Bangla Natural Language Image to Text (BNLIT) was constructed by generating Bangla textual descriptors from visual input; it incorporates 100 annotated classes. A deep neural network-based image captioning model was proposed to generate image descriptions. The model employs a Convolutional Neural Network (CNN) to classify the whole dataset, while a Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) capture the sequential semantic representation of text-based sentences and generate pertinent descriptions based on the modular complexities of an image. When tested on the new dataset, the model achieves a significant performance improvement on the image semantic retrieval task. For the experiments on that task, we implemented a hybrid image captioning model, which achieved a remarkable result on the new self-made dataset; the task itself was new from the Bangladesh perspective. In brief, the model provided benchmark precision in characteristic Bangla syntax reconstruction, and we present a comprehensive numerical analysis of the model's results on the dataset.