
    CWI-evaluation - Progress Report 1993-1998


    Part-of-speech Tagging: A Machine Learning Approach based on Decision Trees

    The study and application of general Machine Learning (ML) algorithms to the classical ambiguity problems in the area of Natural Language Processing (NLP) is currently a very active area of research. This trend is sometimes called Natural Language Learning. Within this framework, the present work explores the application of a concrete machine learning technique, namely decision-tree induction, to a very basic NLP problem, namely part-of-speech disambiguation (POS tagging). Its main contributions fall in the NLP field, while the topics addressed are approached from the artificial intelligence perspective rather than from a linguistic point of view.
    A relevant property of the system we propose is the clear separation between the acquisition of the language model and its application within a concrete disambiguation algorithm, with the aim of constructing two components which are as independent as possible. Such an approach has many advantages: for instance, the language models obtained can easily be adapted to previously existing tagging formalisms, and the two modules can be improved and extended separately.
    As a first step, we have experimentally shown that decision trees (DTs) provide a flexible (by allowing a rich feature representation), efficient and compact way of acquiring, representing and accessing the information about POS ambiguities. In addition, DTs provide proper estimates of the conditional probabilities of tags and words in their particular contexts. Additional machine learning techniques, based on the combination of classifiers, have been applied to address some particular weaknesses of our tree-based approach and to further improve accuracy in the most difficult cases.
    As a second step, the acquired models have been used to construct simple, accurate and effective taggers based on different paradigms. In particular, we present three different taggers that include the tree-based models: RTT, STT and RELAX, which have shown different properties regarding speed, flexibility, accuracy, etc. The idea is that the particular user needs and environment will determine the most appropriate tagger in each situation. Although we have observed slight differences, the accuracy results for the three taggers, tested on the WSJ benchmark corpus, are uniformly very high and at least as good as, if not better than, those of a number of current taggers based on automatic acquisition (a qualitative comparison with the most relevant current work is also reported).
    Additionally, our approach has been adapted to annotate a general Spanish corpus, with the particular limitation of learning from small training sets. A new technique, based on tagger combination and bootstrapping, has been proposed to address this problem and to improve accuracy. Experimental results showed that very high accuracy is possible for Spanish tagging with relatively low manual effort. The success of this real application has also confirmed the validity of our approach, and of the previously presented portability argument in favour of automatically acquired taggers.
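    The decision-tree acquisition described above can be sketched in miniature. The following is an illustrative ID3-style induction over toy context features for one ambiguity class; the features, tagset, data and back-off scheme are assumptions for illustration only, not the thesis's actual RTT/STT/RELAX components.

```python
# Minimal sketch of decision-tree induction for POS disambiguation.
import math
from collections import Counter

def entropy(tags):
    n = len(tags)
    return -sum(c / n * math.log2(c / n) for c in Counter(tags).values())

def build_tree(examples, features):
    tags = [t for _, t in examples]
    if len(set(tags)) == 1 or not features:
        # Leaf: conditional tag distribution for this context.
        n = len(tags)
        return {t: c / n for t, c in Counter(tags).items()}
    def gain(f):  # information gain of splitting on feature f
        split = {}
        for feats, tag in examples:
            split.setdefault(feats[f], []).append(tag)
        return entropy(tags) - sum(
            len(s) / len(tags) * entropy(s) for s in split.values())
    best = max(features, key=gain)
    groups = {}
    for ex in examples:
        groups.setdefault(ex[0][best], []).append(ex)
    rest = [f for f in features if f != best]
    return {"feature": best,
            "children": {v: build_tree(g, rest) for v, g in groups.items()}}

def classify(tree, feats):
    while "feature" in tree:
        child = tree["children"].get(feats[tree["feature"]])
        if child is None:          # unseen value: stop and back off
            break
        tree = child
    if "feature" in tree:          # crude back-off: pool all leaf tags
        leaves, stack = Counter(), [tree]
        while stack:
            t = stack.pop()
            if "feature" in t:
                stack.extend(t["children"].values())
            else:
                leaves.update(t)
        total = sum(leaves.values())
        return {k: v / total for k, v in leaves.items()}
    return tree

# Toy ambiguity class for the word "can": modal (MD) vs. noun (NN).
data = [
    ({"prev": "PRP", "next": "swim"}, "MD"),
    ({"prev": "DT", "next": "of"}, "NN"),
    ({"prev": "NNS", "next": "fly"}, "MD"),
    ({"prev": "DT", "next": "opener"}, "NN"),
]
tree = build_tree(data, ["prev", "next"])
print(classify(tree, {"prev": "PRP", "next": "go"}))
```

    Note that the leaves store full conditional tag distributions rather than single tags, mirroring the point above that decision trees provide probability estimates usable by different tagging algorithms.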

    Enhancing solids deposit prediction in gully pots with explainable hybrid models: A review

    Urban flooding has made it necessary to better understand how gully pots perform when overwhelmed by solids deposition driven by various climatic and anthropogenic variables. This study investigates solids deposition in gully pots through a review of eight models: four deterministic models, two hybrid models, a statistical model and a conceptual model, representing a wide spectrum of solids depositional processes. Traditional models capture and manage the impact of climatic and anthropogenic variables on solids deposition, but they are prone to uncertainty due to inadequate handling of complex, non-linear variables, restricted applicability, inflexibility and data bias. Hybrid models, which integrate traditional models with data-driven approaches, have been shown to improve predictions and support the development of models that are robust to uncertainty. Despite their effectiveness, hybrid models lack explainability. Hence, this study presents the significance of eXplainable Artificial Intelligence (XAI) tools in addressing the challenges associated with hybrid models. Finally, crossovers between the various models and a representative workflow for solids deposition modelling in gully pots are suggested. The paper concludes that explainable hybrid modelling can serve as a valuable tool for gully pot management, as it addresses key limitations of existing models.
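    The hybrid idea reviewed above (a traditional model corrected by a data-driven component) can be sketched as follows. The physical formula, coefficients and observations below are all invented for illustration; a real gully pot model would be far richer.

```python
# Sketch of a hybrid model: a toy deterministic deposition estimate plus a
# least-squares residual correction learned from (synthetic) observations.

def deterministic_deposit(conc, flow, k=0.8):
    # Toy first-order settling estimate: more solids settle at low flow.
    return k * conc / (1.0 + flow)

# Hypothetical observations: (concentration, flow, measured deposit).
obs = [(10, 1.0, 4.6), (20, 1.0, 8.9), (10, 2.0, 3.2), (20, 2.0, 6.1)]

# Data-driven part: fit residual = a*flow + b by ordinary least squares.
xs = [flow for _, flow, _ in obs]
rs = [d - deterministic_deposit(c, f) for c, f, d in obs]
n = len(obs)
mx, mr = sum(xs) / n, sum(rs) / n
a = sum((x - mx) * (r - mr) for x, r in zip(xs, rs)) \
    / sum((x - mx) ** 2 for x in xs)
b = mr - a * mx

def hybrid_deposit(conc, flow):
    # Deterministic prediction corrected by the learned residual term.
    return deterministic_deposit(conc, flow) + a * flow + b
```

    The split mirrors the review's point: the deterministic component encodes process knowledge, while the data-driven term absorbs the systematic error the physical formula cannot express.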

    Doctor of Philosophy

    Neuroscientists are developing new imaging techniques and generating large volumes of data in an effort to understand the complex structure of the nervous system. The complexity and size of this data makes human interpretation a labor-intensive task. To aid in the analysis, new segmentation techniques for identifying neurons in these feature-rich datasets are required. However, the extremely anisotropic resolution of the data makes segmentation and tracking across slices difficult. Furthermore, the thickness of the slices can make the membranes of the neurons hard to identify. Similarly, structures can change significantly from one section to the next due to slice thickness, which makes tracking difficult. This thesis presents a complete method for segmenting many neurons at once in two-dimensional (2D) electron microscopy images and reconstructing and visualizing them in three dimensions (3D). First, we present an advanced method for identifying neuron membranes in 2D, necessary for whole-neuron segmentation, using a machine learning approach. The method uses a series of artificial neural networks (ANNs) combined with a feature vector composed of image and context intensities sampled over a stencil neighborhood. Several ANNs are applied in series, allowing each ANN to use the classification context provided by the previous network to improve detection accuracy. To further improve membrane detection, we use information from a nonlinear alignment of sequential learned membrane images in a final ANN that refines the detection in each section. The final output, the detected membranes, is used to obtain 2D segmentations of all the neurons in an image.
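    The serial-classifier framework described above can be sketched in miniature: each stage sees the raw stencil plus the previous stage's classifications over the same stencil. Tiny perceptrons stand in for the ANNs here, and the one-dimensional "image" and labels are invented for illustration only.

```python
# Sketch of cascaded classifiers with stencil features: stage 2 consumes
# stage 1's outputs as classification context.

def stencil(values, i, r=1):
    # Neighborhood of radius r, clamped at the borders.
    return [values[max(0, min(len(values) - 1, i + d))] for d in range(-r, r + 1)]

def train_perceptron(X, y, epochs=200, lr=0.1):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for feats, target in zip(X, y):
            pred = 1 if sum(wi * f for wi, f in zip(w, feats)) + b > 0 else 0
            err = target - pred
            w = [wi + lr * err * f for wi, f in zip(w, feats)]
            b += lr * err
    return w, b

def predict(w, b, feats):
    return 1 if sum(wi * f for wi, f in zip(w, feats)) + b > 0 else 0

# Toy 1-D "image": low intensity = membrane (label 1).
image = [0.9, 0.8, 0.1, 0.9, 0.85, 0.15, 0.9, 0.9, 0.2, 0.8]
labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]

# Stage 1: raw intensity stencil only.
X1 = [stencil(image, i) for i in range(len(image))]
w1, b1 = train_perceptron(X1, labels)
out1 = [predict(w1, b1, f) for f in X1]

# Stage 2: raw stencil plus stage-1 classification context.
X2 = [stencil(image, i) + stencil(out1, i) for i in range(len(image))]
w2, b2 = train_perceptron(X2, labels)
out2 = [predict(w2, b2, f) for f in X2]
```

    The point of the cascade is that stage 2's feature vector is twice as long: half raw intensities, half the previous stage's decisions over the same neighborhood.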
We also present a method that constructs 3D neuron representations by formulating the problem of finding paths through sets of sections as an optimal-path computation: a cost function scores the identification of a cell from one section to the next, and the resulting optimization problem is solved with Dijkstra's algorithm. This formulation accounts for variability and inconsistencies between sections and prioritizes cells based on the evidence of their connectivity. Finally, we present a tool that combines these techniques with a visual user interface that enables users to quickly segment whole neurons in large volumes.
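    The optimal-path formulation can be sketched as follows: candidate 2D segments are nodes in a layered graph, edge costs score possible links between adjacent sections, and Dijkstra's algorithm finds the cheapest path. The centroid-based cost and the candidate data below are assumptions for illustration, not the thesis's actual cost function.

```python
# Sketch of linking 2-D segments across sections via Dijkstra's algorithm.
import heapq

# Hypothetical candidate cells per section: name -> 2-D centroid.
sections = [
    {"a0": (10, 10), "b0": (40, 40)},
    {"a1": (12, 11), "b1": (41, 38)},
    {"a2": (13, 13), "b2": (60, 60)},
]

def link_cost(p, q):
    # Toy cost: squared centroid displacement between sections.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def best_path(start):
    # Dijkstra over the layered graph, from `start` in section 0 to any
    # candidate in the last section.
    dist = {(0, start): 0.0}
    prev = {}
    heap = [(0.0, 0, start)]
    while heap:
        d, layer, cell = heapq.heappop(heap)
        if d > dist.get((layer, cell), float("inf")):
            continue  # stale heap entry
        if layer == len(sections) - 1:
            path = [cell]          # reconstruct the settled optimal path
            while (layer, cell) in prev:
                layer, cell = prev[(layer, cell)]
                path.append(cell)
            return list(reversed(path)), d
        for nxt, centroid in sections[layer + 1].items():
            nd = d + link_cost(sections[layer][cell], centroid)
            if nd < dist.get((layer + 1, nxt), float("inf")):
                dist[(layer + 1, nxt)] = nd
                prev[(layer + 1, nxt)] = (layer, cell)
                heapq.heappush(heap, (nd, layer + 1, nxt))
    return None, float("inf")

path, cost = best_path("a0")
```

    Because the first final-layer node popped from the heap is already settled, the search can return immediately, which is what makes this formulation efficient over many candidate cells.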

    Assessment of a multi-measure functional connectivity approach

    Efforts to find differences in the brain activity patterns of subjects with neurological and psychiatric disorders that could help in their diagnosis and prognosis have been increasing in recent years, and promise to revolutionise clinical practice and our understanding of such illnesses in the future. Resting-state functional magnetic resonance imaging (rs-fMRI) data have been increasingly used to evaluate this activity and to characterize the connectivity between distinct brain regions, commonly organized in functional connectivity (FC) matrices. Here, machine learning methods were used to assess the extent to which multiple FC matrices, each determined with a different statistical method, could change classification performance relative to when only one matrix is used, as is common practice. The statistical methods used include correlation, coherence, mutual information, transfer entropy and non-linear correlation, as implemented in the MULAN toolbox. Classification was performed with random forest and support vector machine (SVM) classifiers. Besides this main objective, the study had three other goals: to investigate which of these statistical methods individually yields the best classification performance, to confirm the importance of the blood-oxygen-level-dependent (BOLD) signal in the 0.009-0.08 Hz frequency range for FC-based classification, and to assess the impact of feature selection in SVM classifiers. Publicly available rs-fMRI data from the Addiction Connectome Preprocessed Initiative (ACPI) and ADHD-200 databases were used to classify controls vs subjects with Attention-Deficit/Hyperactivity Disorder (ADHD). Maximum accuracy and macro-averaged f-measure values of 0.744 and 0.677 were achieved on the ACPI dataset, and of 0.678 and 0.648 on the ADHD-200 dataset.
Results show that combining matrices can significantly improve classification accuracy and macro-averaged f-measure, provided feature selection is applied. The results also suggest that mutual information methods may play an important role in FC-based classification, at least when classifying subjects with ADHD.
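    The multi-measure setup can be sketched as follows: each subject's FC matrices (one per statistical method) are flattened and concatenated, a simple group-difference criterion selects features, and a classifier is applied. A nearest-centroid classifier on synthetic data stands in for the study's SVM/random-forest pipeline; everything below is invented for illustration and, unlike the study, is evaluated on its own training data.

```python
# Sketch of combining several FC matrices into one feature vector with
# feature selection, on synthetic data.
import random

random.seed(0)
N_REGIONS, MEASURES = 4, ["correlation", "coherence", "mutual_info"]

def synth_subject(adhd):
    # One flattened upper triangle per connectivity measure, concatenated.
    feats = []
    for m in range(len(MEASURES)):
        for i in range(N_REGIONS):
            for j in range(i + 1, N_REGIONS):
                # Synthetic group effect confined to the mutual-info block.
                base = 0.5 + (0.2 if adhd and m == 2 else 0.0)
                feats.append(base + random.gauss(0, 0.05))
    return feats

subjects = [(synth_subject(False), 0) for _ in range(20)] + \
           [(synth_subject(True), 1) for _ in range(20)]

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

c0 = centroid([f for f, y in subjects if y == 0])
c1 = centroid([f for f, y in subjects if y == 1])

# Feature selection: keep the k features with the largest group difference.
k = 5
keep = sorted(range(len(c0)), key=lambda i: -abs(c1[i] - c0[i]))[:k]

def classify(feats):
    d0 = sum((feats[i] - c0[i]) ** 2 for i in keep)
    d1 = sum((feats[i] - c1[i]) ** 2 for i in keep)
    return 0 if d0 < d1 else 1

acc = sum(classify(f) == y for f, y in subjects) / len(subjects)
```

    With the discriminative signal planted in the mutual-information block, the selection step recovers exactly those features, echoing the study's observation about mutual information.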

    New Paradigms for Active Learning

    In traditional active learning, learning algorithms (or learners) mainly focus on the performance of the final model built and on the total number of queries needed to learn a good model. However, in many real-world applications, active learners have to focus on the learning process itself in order to achieve finer goals, such as minimizing the number of mistakes made when predicting unlabeled examples. Such learning goals are common and important in real-world applications. For example, in direct marketing, a sales agent (the learner) has to focus on the process of selecting customers to approach, and tries to make correct predictions (i.e., fewer mistakes) about the customers who will buy the product. Traditional active learning algorithms cannot achieve these finer goals because their focus is different. In this thesis, we study how to control the learning process in active learning so that such goals can be accomplished. According to the various learning tasks and goals, we introduce four new active learning paradigms, as follows. The first paradigm is learning actively and conservatively. Under this paradigm, the learner iteratively selects and predicts the most certain example (thus, conservatively) during the learning process. The goal is to minimize the number of mistakes in predicting unlabeled examples during active learning. Intuitively, the conservative strategy is less likely to make mistakes, i.e., more likely to achieve the learning goal. We apply this strategy in educational software, as well as in direct marketing. The second paradigm is learning actively and aggressively. Under this paradigm, unlabeled examples and multiple oracles are available. The learner actively selects the best set of oracles to label the most uncertain example (thus, aggressively) during the learning process. The learning goal is to learn a good model with guaranteed label quality.
The third paradigm is learning actively with a conservative-aggressive tradeoff. Under this paradigm, firstly, unlabeled examples are available and learners are allowed to select examples actively to learn from. Secondly, to obtain the labels, two actions can be considered: querying oracles and making predictions. Lastly, a cost has to be paid for querying oracles or for making wrong predictions. Trading off between the two actions is necessary for achieving the learning goal: minimizing the total cost of obtaining the labels. The last paradigm is learning actively with minimal/maximal effort. Under this paradigm, the labels of all examples are provided and learners are allowed to select examples actively to learn from. The goal is to control the learning process by selecting examples actively such that learning can be accomplished with minimal effort, or a good model can be built quickly with maximal effort. For each of the four learning paradigms, we propose effective learning algorithms and demonstrate empirically that related learning problems in real applications can be solved well and the learning goals can be accomplished. In summary, this thesis focuses on controlling the learning process to achieve finer goals in active learning. Motivated by various real application tasks, we propose four novel learning paradigms, and for each paradigm we propose efficient learning algorithms to solve the corresponding learning problems. The experimental results show that our learning algorithms outperform other state-of-the-art learning algorithms.
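    The first paradigm (learning actively and conservatively) can be sketched in miniature: at each step the learner predicts the unlabeled example it is currently most certain about, counts its mistakes, and feeds its own prediction back into the model. The 1-nearest-neighbour "model", the margin-based certainty score and the one-dimensional data are assumptions for illustration, not the thesis's algorithms.

```python
# Sketch of conservative active learning: always predict the most certain
# unlabeled example next, so early mistakes are unlikely.

# Labeled seed set and unlabeled pool: (feature, true label).
labeled = [(0.0, 0), (10.0, 1)]
pool = [(1.0, 0), (9.0, 1), (4.9, 0), (5.1, 1), (2.0, 0), (8.0, 1)]

def predict_with_margin(x):
    # 1-NN prediction; margin = gap between distances to the two classes.
    d0 = min(abs(x - f) for f, y in labeled if y == 0)
    d1 = min(abs(x - f) for f, y in labeled if y == 1)
    return (0 if d0 < d1 else 1), abs(d0 - d1)

mistakes = 0
while pool:
    # Conservative choice: largest margin = most certain example.
    x, true_y = max(pool, key=lambda ex: predict_with_margin(ex[0])[1])
    pool.remove((x, true_y))
    pred, _ = predict_with_margin(x)
    mistakes += pred != true_y
    labeled.append((x, pred))  # self-labeled; informs later predictions

print(mistakes)
```

    Easy examples far from the decision boundary are consumed first, so by the time the ambiguous middle points are reached the model has seen most of the data, which is exactly the intuition behind minimizing prediction mistakes during learning.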

    Learning categorial grammars

    In 1967 E. M. Gold published a paper in which the language classes from the Chomsky hierarchy were analyzed in terms of learnability, in the technical sense of identification in the limit. His results were mostly negative, and perhaps because of this his work had little impact on linguistics. In the early eighties there was renewed interest in the paradigm, mainly because of work by Angluin and Wright. Around the same time, Arikawa and his co-workers refined the paradigm by applying it to so-called Elementary Formal Systems. Using this approach, Takeshi Shinohara was able to come up with an impressive result: any class of context-sensitive grammars with a bound on the number of rules is learnable. Some linguistically motivated work on learnability also appeared from this point on, most notably Wexler & Culicover 1980 and Kanazawa 1994. The latter investigates the learnability of various classes of categorial grammar, inspired by work by Buszkowski and Penn, and raises some interesting questions. We follow up on this work by exploring complexity issues relevant to learning these classes, answering an open question from Kanazawa 1994, and applying the same kind of approach to obtain (non)learnable classes of Combinatory Categorial Grammars, Tree Adjoining Grammars, Minimalist Grammars, Generalized Quantifiers, and some variants of Lambek Grammars. We also discuss work on learning tree languages and its application to learning Dependency Grammars. Our main conclusions are: - formal learning theory is relevant to linguistics, - identification in the limit is feasible for non-trivial classes, - the `Shinohara approach' - i.e., placing a numerical bound on the complexity of a grammar - can lead to a learnable class, but this completely depends on the specific nature of the formalism and the notion of complexity. We give examples of natural classes of commonly used linguistic formalisms that resist this kind of approach, - learning is hard work.
Our results indicate that learning even `simple' classes of languages requires a lot of computational effort, - dealing with structure (derivation, dependency) languages instead of string languages offers a useful and promising approach to learnability in a linguistic context.
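    The notion of identification in the limit can be illustrated with the classic learnable class of finite languages: a learner that conjectures exactly the set of strings seen so far converges, on any presentation that eventually shows every string of a finite target language, to the target and never changes again. The target language and presentation below are invented for illustration.

```python
# Sketch of Gold-style identification in the limit for finite languages.

def learner(text):
    # `text` is a finite prefix of a presentation (strings, with repeats).
    # The learner's conjecture after each datum is the set of strings seen.
    seen = set()
    conjectures = []
    for s in text:
        seen.add(s)
        conjectures.append(frozenset(seen))
    return conjectures

target = {"a", "ab", "abb"}
presentation = ["a", "ab", "a", "abb", "ab", "a", "abb"]
conjectures = learner(presentation)
```

    Once every string of the target has appeared, the conjecture stabilizes: this is exactly what "identification in the limit" requires, and it is why the class of all finite languages is learnable while richer Chomsky classes, by Gold's results, are not.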