Towards automatic construction of diverse, high-quality image dataset
University of Technology Sydney, Faculty of Engineering and Information Technology. The availability of labeled image datasets has been shown to be critical for high-level image understanding, which continuously drives progress in feature design and model development. However, manual labeling is both time-consuming and labor-intensive. To reduce the cost of manual annotation, there has been increasing research interest in automatically constructing image datasets by exploiting web images. Datasets constructed by existing methods tend to suffer from low accuracy and low diversity, and consequently from weak domain adaptation ability, which is known as the “dataset bias problem”.
This research aims to automatically collect accurate and diverse images for given queries from the Web, and to construct a domain-robust image dataset. Within this thesis, various methods are developed and presented to address the following research challenges. The first is that retrieved web images are usually noisy: how can noise be removed to construct a relatively high-accuracy dataset? The second is that collected web images often have low diversity: how can the dataset bias problem be addressed to construct a domain-robust dataset?
In Chapter 3, a framework is presented to address the problem of polysemy in the process of constructing a high-accuracy dataset. Visual polysemy means that a word has several semantic (text) senses that are visually (image) distinct. Resolving polysemy helps to choose appropriate visual senses for sense-specific image collection, thereby improving the accuracy of the collected images. Unlike previous methods, which leveraged human-curated knowledge such as Wikipedia or dictionaries to handle polysemy, we propose to automate the process of discovering and distinguishing multiple visual senses from untagged corpora.
In Chapter 4, a domain-robust framework is presented for image dataset construction. To address the dataset bias problem, our framework consists of three stages. First, we obtain candidate query expansions by searching the Google Books Ngram Corpus. Then, by treating word-word (semantic) and visual-visual (visual) distances as features from two different views, we formulate the pruning of noisy query expansions as a multi-view learning problem. Finally, by treating each selected query expansion as a “bag” and the images therein as “instances”, we formulate image selection and noise removal as a multi-instance learning problem. In this way, images from different distributions are kept while noise is filtered out.
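The bag-and-instance formulation in the final stage can be sketched as follows. This is a minimal illustration only: `score_image`, both thresholds, and the keep rule are assumptions standing in for the thesis's learned multi-instance model.

```python
# Sketch of multi-instance selection: each query expansion is a "bag",
# the images retrieved for it are "instances". A bag survives only if
# enough of its instances look relevant; surviving bags keep only their
# relevant images. score_image stands in for any learned relevance model.

def select_images(bags, score_image, inst_thresh=0.5, bag_thresh=0.3):
    kept = {}
    for expansion, images in bags.items():
        relevant = [img for img in images if score_image(img) >= inst_thresh]
        # keep the bag only if a sufficient fraction of its instances passed
        if images and len(relevant) / len(images) >= bag_thresh:
            kept[expansion] = relevant
    return kept
```

Because whole bags (query expansions) are accepted or rejected, images drawn from different distributions survive as long as their bag is sound, while instance-level filtering removes the noise within each surviving bag.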
Chapter 5 details a method for removing noisy images and selecting accurate ones. The accuracy of the selected images is limited by two issues: noisy query expansions that are not filtered out, and indexing errors of the image search engine. To deal with noisy query expansions, we divide them into two types and propose to remove noise based on visual consistency and relevancy, respectively. To handle noise induced by indexing errors, we classify the noisy images into three categories and filter out each by a separate mechanism.
Chapter 6 proposes an approach for enhancing classifier learning by using the collected web images. Unlike previous works, our approach greatly reduces the dependence on time and labor while improving the accuracy and robustness of the classifier. Specifically, we propose a new instance-level MIL model to select a subset of training images from each selected set of privileged information and simultaneously learn the optimal classifiers based on the selected images.
Chapter 7 concludes the thesis and outlines the scope of future work.
Statistical and Computational Models for Whole Word Morphology
The goal of this work is to formulate a machine-learning approach to language morphology in which the latter is modeled as string transformations on whole words, rather than as the decomposition of words into smaller structural units. The contribution consists of two main parts. First, a computational model is formulated in which morphological rules are defined as functions on strings. Such functions can easily be translated into finite-state transducers, which provides a solid algorithmic foundation for the approach. Second, a statistical model for graphs of word derivations is introduced. Inference in this model is carried out using the Monte Carlo Expectation Maximization algorithm, and expectations over graphs are approximated with a Metropolis-Hastings sampler. The model is evaluated on a range of practical tasks: clustering of inflected forms, learning lemmatization, predicting the part of speech of unknown words, and generating new words.
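The idea of morphological rules as functions on whole words can be illustrated with a small sketch. The German-style plural rule below is an invented example; in the thesis such rules would be compiled to finite-state transducers rather than interpreted directly.

```python
import re

def make_rule(pattern, replacement):
    """A morphological rule as a partial function on strings:
    returns the transformed word, or None if the rule does not apply."""
    rx = re.compile(pattern)

    def rule(word):
        return rx.sub(replacement, word) if rx.search(word) else None

    return rule

# illustrative rule: derive a German-style plural, -ung -> -ungen
pluralize = make_rule(r"ung$", r"ungen")
```

Applied to "Ableitung" the rule yields "Ableitungen", while words it does not match map to None; this partiality is what lets such rules be composed into graphs of word derivations.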
Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications
The size of the existing academic literature corpus and the incredible rate of new publications
create both a need and an opportunity to harness computational approaches to data and
knowledge extraction across all research fields. Elements of this challenge can be met by
developments in automation for retrieval of electronic documents, document classification
and knowledge extraction. In this thesis, I detail studies of these processes in three related
chapters. Although the focus of each chapter is distinct, they contribute to my aim of
developing a generalisable pipeline for clinical applications in Natural Language Processing
in the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source
system developed in Python to generate corpora of biomedical text from the published
literature. Cadmus comprises three main steps: search query and metadata collection,
document retrieval, and parsing of the retrieved text. I present an example of full-text
retrieval for a corpus of over two hundred thousand articles using a gene-based search query
with quality control metrics for this retrieval process and a high-level illustration of the utility
of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate
was 85.2% with institutional subscription access and 54.4% without. Chapter two details
the development of a custom-built Naïve Bayes supervised machine learning document classifier.
This binary classifier is based on calculating the relative enrichment of biomedical terms
between two classes of documents in a training set.
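A minimal sketch of the idea is given below, assuming per-document binary term presence and Laplace smoothing; the actual classifier's feature set and estimation details may differ.

```python
import math
from collections import Counter

def train(docs_pos, docs_neg):
    # docs are token lists; count in how many documents of each class a term occurs
    pos, neg = Counter(), Counter()
    for d in docs_pos:
        pos.update(set(d))
    for d in docs_neg:
        neg.update(set(d))
    vocab = set(pos) | set(neg)
    # log-ratio of smoothed per-class term probabilities: the "relative enrichment"
    weights = {t: math.log((pos[t] + 1) / (len(docs_pos) + 2))
                - math.log((neg[t] + 1) / (len(docs_neg) + 2))
               for t in vocab}
    prior = math.log(len(docs_pos) / len(docs_neg))
    return weights, prior

def classify(doc, weights, prior):
    # positive total log-odds -> assign the positive class
    score = prior + sum(weights.get(t, 0.0) for t in set(doc))
    return score > 0
```

Terms strongly enriched in one class receive large-magnitude weights, so a document's class is decided by summing the enrichment evidence of the terms it contains.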
The classifier is trained and tested on a manually classified set of over 8,000 abstract and
full-text articles to identify articles containing human phenotype descriptions. Ten-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision
of 76%, an F1 score of 0.82, and accuracy of 90%. Chapter three illustrates the clinical
applications of automated retrieval, processing, and classification by considering the
published literature on Paediatric COVID-19. Case reports and similar articles were classified
into “severe” and “non-severe” classes, and term enrichment was evaluated to find
biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series
analysis was employed to illustrate emerging disease entities like the Multisystem
Inflammatory Syndrome in Children (MIS-C) and consider unrecognised trends through
literature-based discovery.
Developing Methods and Resources for Automated Processing of the African Language Igbo
Natural Language Processing (NLP) research is still in its infancy in Africa. Most
languages in Africa have few or no NLP resources available, and Igbo is among those
with none. In this study, we develop NLP resources to support NLP-based research in
the Igbo language. The springboard is the development of a new part-of-speech (POS)
tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to
account for language-internal features not recognized in EAGLES. The tagset comes in three
granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The
medium-grained tagset strikes a balance between the other two for practical
purposes. This is followed by the preprocessing of Igbo electronic texts through normalization
and tokenization. The tokenizer is developed in this study using the tagset's
definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million
tokens.
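A normalization-and-tokenization step of this kind can be sketched as follows; the regular expression only approximates the IgbTS definition of a word token, which this sketch does not reproduce.

```python
import re
import unicodedata

def normalize(text):
    # NFC normalization so precomposed and combining diacritics
    # (common in Igbo orthography, e.g. ụ, ị, ọ) compare consistently
    return unicodedata.normalize("NFC", text)

def tokenize(text):
    # words (allowing internal hyphens/apostrophes) and punctuation marks
    # as separate tokens; a stand-in for the tagset's word-token definition
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", normalize(text))
```

Normalizing before tokenizing matters for Igbo because the same diacritized letter can be encoded either precomposed or as a base letter plus combining mark, and the corpus should treat both identically.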
This IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus
(IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an
inter-annotation agreement (IAA) exercise was undertaken, which led to the revision of the
IgbTS where necessary. A novel automatic method was developed to bootstrap the manual
annotation process by exploiting the by-products of this IAA exercise, to improve
the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach
was adopted to propose erroneous instances in the IgbTC for correction. A novel automatic
method was also developed and applied that uses knowledge of affixes to flag and correct
all morphologically-inflected words in the IgbTC whose tags wrongly mark them as not
being morphologically-inflected.
Experiments towards the development of an automatic POS tagging system for Igbo
using IgbTC show good accuracy scores comparable to other languages that these taggers
have been tested on, such as English. Accuracy on the words previously unseen during
the taggers' training (also called unknown words) is considerably low, and much lower
on the unknown words that are morphologically-complex, which indicates difficulty in
handling morphologically-complex words in Igbo. This was improved by adopting a
morphological reconstruction method (a linguistically-informed segmentation into stems
and affixes) that reformatted these morphologically-complex words into patterns learnable
by machines. This enables taggers to use the knowledge of stems and associated affixes
of these morphologically-complex words during the tagging process to predict their
appropriate tags. Interestingly, this method outperforms the methods existing
taggers use to handle unknown words, achieving an impressive increase in
accuracy on morphologically-inflected unknown words and on unknown words overall.
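The reconstruction idea can be sketched as below. The affix lists and stems are invented placeholders, not actual Igbo morphology; the point is only the reformatting of a complex word into separately learnable units.

```python
# Sketch of morphological reconstruction: split a complex word into
# prefix + stem + suffix so a tagger can learn the pattern. The affix
# inventory here is an invented placeholder, not real Igbo data.
PREFIXES = ["a", "e", "o"]
SUFFIXES = ["la", "ra", "ghi"]

def reconstruct(word, stems):
    for p in [""] + PREFIXES:
        if not word.startswith(p):
            continue
        for s in [""] + SUFFIXES:
            if not word.endswith(s):
                continue
            end = len(word) - len(s) if s else len(word)
            core = word[len(p):end]
            if core in stems:
                # emit prefix / stem / suffix as separate learnable units
                return [m for m in (p, core, s) if m]
    return [word]  # leave unsegmentable words whole
```

After this reformatting, a tagger sees a known stem plus affixes it has encountered on other words, instead of one unseen surface form.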
These developments constitute the first NLP toolkit for the Igbo language and a step towards
achieving the objective of a Basic Language Resource Kit (BLARK) for the language. This
IgboNLP toolkit will be made available to the NLP community and should encourage
further research and development for the language.
Human-competitive automatic topic indexing
Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance.
Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples.
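The two-stage scheme can be sketched as follows; the n-gram candidate generator and the hand-set weights are illustrative stand-ins for the thesis's candidate filters and learned ranking model.

```python
import re
from collections import Counter

def candidates(text, max_len=3):
    # stage 1: every n-gram of up to max_len words is a candidate topic
    words = re.findall(r"[a-z]+", text.lower())
    phrases = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrases[" ".join(words[i:i + n])] += 1
    return phrases

def rank(text, top=5):
    # stage 2: score candidates on simple properties and rank them;
    # the weights below are illustrative, not a trained model
    phrases = candidates(text)
    total = sum(phrases.values())
    lowered = text.lower()

    def score(p):
        freq = phrases[p] / total                    # statistical: frequency
        first = lowered.find(p) / max(len(text), 1)  # positional: earlier is better
        length = len(p.split())                      # specificity: longer phrases
        return 2.0 * freq + 1.0 * (1 - first) + 0.5 * length

    return sorted(phrases, key=score, reverse=True)[:top]
```

In the thesis the per-candidate properties also include semantic, domain-specific and encyclopedic features, and the weighting is learned from human indexing examples rather than fixed by hand.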
This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as the humans' topics are with each other. The approach is generalizable, requires little training data, and applies across different domains and languages.