Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population
2013 - 2014
The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The suggested method combines distributional semantic approaches and NL formalization theories in order to develop a framework that relies upon deep linguistic analysis... [edited by author] XIII n.s.
Darstellung und stochastische Auflösung von Ambiguität in constraint-basiertem Parsing (Representation and Stochastic Resolution of Ambiguity in Constraint-Based Parsing)
This thesis investigates two complementary approaches to coping with ambiguities in natural language processing. It first presents methods that allow many competing interpretations to be stored compactly in one shared data structure. It then suggests approaches to scoring the different interpretations using stochastic models. This leads to the problem of estimating the probabilities of rare events that have not been observed in the training data, for which novel methods are proposed.
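To make the rare-event problem concrete, here is a minimal sketch of add-one (Laplace) smoothing, a standard textbook technique and not one of the novel methods the thesis proposes. The rule names, counts, and the function `laplace_probability` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def laplace_probability(counts, event, num_event_types, alpha=1.0):
    """Smoothed estimate that gives unseen events a non-zero probability.

    counts          -- Counter of observed event frequencies
    num_event_types -- number of distinct events that could occur (assumed known)
    alpha           -- pseudo-count added to every event (1.0 = add-one smoothing)
    """
    total = sum(counts.values())
    return (counts[event] + alpha) / (total + alpha * num_event_types)


# Hypothetical training counts: the third rule never occurred in training.
observed = Counter({"NP -> Det N": 8, "NP -> N": 2})
rules = ["NP -> Det N", "NP -> N", "NP -> Det Adj N"]

probs = {rule: laplace_probability(observed, rule, len(rules)) for rule in rules}
for rule, p in probs.items():
    print(f"{rule}: {p:.3f}")
```

A plain relative-frequency estimate would assign the unseen rule a probability of zero, so any interpretation using it could never be preferred. The smoothed estimate avoids this while still summing to one over the listed rules.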
Active Learning - An Explicit Treatment of Unreliable Parameters
Institute for Communicating and Collaborative Systems
Active learning reduces annotation costs for supervised learning by concentrating labelling efforts on the most informative data. Most active learning methods assume that the model structure is fixed in advance and focus upon improving parameters within that structure. However, this is not appropriate for natural language processing, where the model structure and associated parameters are determined using labelled data. Applying traditional active learning methods to natural language processing can fail to produce the expected reductions in annotation cost. We show that one of the reasons for this problem is that active learning can only select examples which are already covered by the model. In this thesis, we better tailor active learning to the needs of natural language processing as follows. We formulate the Unreliable Parameter Principle:
Active learning should explicitly and additionally address unreliably trained
model parameters in order to optimally reduce classification error. In order
to do so, we should target both missing events and infrequent events.
We demonstrate the effectiveness of such an approach for a range of natural language processing tasks: prepositional phrase attachment, sequence labelling, and syntactic parsing. For prepositional phrase attachment, the explicit selection of unknown prepositions significantly improves coverage and classification performance for all examined active learning methods. For sequence labelling, we introduce a novel active learning method which explicitly targets unreliable parameters by selecting sentences with many unknown words and a large number of unobserved transition probabilities. For parsing, targeting unparseable sentences significantly improves coverage and f-measure in active learning.
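The sequence-labelling selection criterion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual method: the score is a simple sum of unknown words and unseen tag transitions, the tags are assumed to come from some current tagger, and all names (`unreliability_score`, `select_for_annotation`) and data are hypothetical.

```python
def unreliability_score(words, predicted_tags, known_words, seen_transitions):
    """Count unknown words plus tag transitions never observed in training."""
    unknown = sum(1 for word in words if word not in known_words)
    unseen = sum(
        1
        for pair in zip(predicted_tags, predicted_tags[1:])
        if pair not in seen_transitions
    )
    return unknown + unseen


def select_for_annotation(pool, known_words, seen_transitions, k):
    """Return the k unlabelled sentences with the highest unreliability score."""
    ranked = sorted(
        pool,
        key=lambda s: unreliability_score(s[0], s[1], known_words, seen_transitions),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical state of the model after training on a tiny labelled set.
known_words = {"the", "dog", "barks"}
seen_transitions = {("DET", "NOUN"), ("NOUN", "VERB")}

# Unlabelled pool: (words, tags predicted by the current model).
pool = [
    (["the", "dog", "barks"], ["DET", "NOUN", "VERB"]),
    (["the", "axolotl", "regenerates", "quickly"], ["DET", "NOUN", "VERB", "ADV"]),
    (["dog", "barks", "loudly"], ["NOUN", "VERB", "ADV"]),
]

selected = select_for_annotation(pool, known_words, seen_transitions, k=1)
print(selected[0][0])
```

The fully covered first sentence scores zero and is never preferred, which reflects the abstract's point that selection should target events the current model has not yet seen.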
Combining Bayesian and support vector machines learning to automatically complete syntactical information for HPSG-like formalisms
Learning Bayesian Belief Networks (BBN) from corpora and combining the extracted inference knowledge with a Support Vector Machines (SVM) classifier has been applied to the automatic acquisition of verb subcategorization frames for Modern Greek. We have made use of minimal linguistic resources, such as basic morphological tagging and phrase chunking, to demonstrate that verb subcategorization, which is of great significance for developing robust natural language human-computer interaction systems, can be achieved using large corpora, without any general-purpose syntactic parser at all. Moreover, by taking advantage of the plethora of unlabeled data found in text corpora in addition to some available labeled examples, we avoid the expensive task of annotating the whole set of training data, and the performance of the subcategorization frames learner is increased. We argue that a classifier generated from BBN and SVM is well suited for learning to identify verb subcategorization frames, and empirical results support this claim. Performance has been methodically evaluated using two different corpora, one balanced and one domain-specific, in order to determine the unbiased behavior of the trained models. Limited training data prove sufficient to yield satisfactory results. We have been able to achieve precision exceeding 90% on the identification of subcategorization frames which were not known beforehand. The obtained valid frames have been used to fill out the subcategorization field of verb entries in an HPSG-like lexicon using the LKB grammar development environment.
Deciphering clinical text: concept recognition in primary care text notes
Electronic patient records, containing data about the health and care of a patient, are a valuable source of information for longitudinal clinical studies. The General Practice Research Database (GPRD) has collected patient records from UK primary care practices since the late 1980s. These records contain both structured data (in the form of codes and numeric values) and free text notes. While the structured data have been used extensively in clinical studies, there are significant practical obstacles in extracting information from the free text notes. The main obstacles are data access restrictions, due to the presence of sensitive information, and the specific language of medical practitioners, which renders standard language processing tools ineffective.
The aim of this research is to investigate approaches for computer analysis of free text notes. The research involved designing a primary care text corpus (the Harvey Corpus) annotated with syntactic chunks and clinically-relevant semantic entities, developing a statistical chunking model, and devising a novel method for applying machine learning for entity recognition based on chunk annotation. The tools produced would facilitate reliable information extraction from primary care patient records, needed for the development of clinically-related research. The three medical concept types targeted in this thesis could contribute to epidemiological studies by enhancing the detection of co-morbidities, and better analysing the descriptions of patient experiences and treatments.
The main contributions of the research reported in this thesis are: guidelines for chunk and concept annotation of clinical text, an approach to maximising agreement between human annotators, the Harvey Corpus, a method for using a standard part-of-speech tagging model in clinical text chunking, and a novel approach to recognising clinically relevant medical concepts.
Aspects of emergent cyclicity in language and computation
This thesis has four parts, which correspond to the presentation and development of a theoretical framework for the study of cognitive capacities qua physical phenomena, and a case study of locality conditions over natural languages.
Part I deals with computational considerations, setting the tone of the rest of the thesis, and introducing and defining critical concepts like "grammar", "automaton", and the relations between them. Fundamental questions concerning the place of formal language theory in linguistic inquiry, as well as the expressibility of linguistic and computational concepts in common terms, are raised in this part.
Part II further explores the issues addressed in Part I with particular emphasis on how
grammars are implemented by means of automata, and the properties of the formal languages
that these automata generate. We will argue against the equation between effective computation
and function-based computation, and introduce examples of computable procedures which are
nevertheless impossible to capture using traditional function-based theories. The connection
with cognition will be made in the light of dynamical frustrations: the irreconcilable tension
between mutually incompatible tendencies that hold for a given dynamical system. We will
provide arguments in favour of analyzing natural language as emerging from a tension between
different systems (essentially, semantics and morpho-phonology) which impose orthogonal
requirements over admissible outputs. The concept of level of organization or scale comes to
the foreground here; and apparent contradictions and incommensurabilities between concepts
and theories are revisited in a new light: that of dynamical nonlinear systems which are
fundamentally frustrated. We will also characterize the computational system that emerges from
such an architecture: the goal is to get a syntactic component which assigns the simplest
possible structural description to sub-strings, in terms of its computational complexity. A
system which can oscillate back and forth in the hierarchy of formal languages in assigning
structural representations to local domains will be referred to as a computationally mixed
system.
Part III is where the really fun stuff starts. Field theory is introduced, and its applicability to
neurocognitive phenomena is made explicit, with all due scale considerations. Physical and
mathematical concepts are permanently interacting as we analyze phrase structure in terms of
pseudo-fractals (in Mandelbrot's sense) and define syntax as a (possibly unary) set of
topological operations over completely Hausdorff (CH) ultrametric spaces. These operations, which make field perturbations interfere, transform that initial completely Hausdorff
ultrametric space into a metric, Hausdorff space with a weaker separation axiom. Syntax, in this
proposal, is not "generative" in any traditional sense (except the "fully explicit theory" one):
rather, it partitions (technically, "parametrizes") a topological space. Syntactic dependencies are
defined as interferences between perturbations over a field, which reduce the total entropy of
the system per cycle, at the cost of introducing further dimensions where attractors
corresponding to interpretations for a phrase marker can be found.
Part IV is a sample of what we can gain by further pursuing the physics of language approach,
both in terms of empirical adequacy and theoretical elegance, not to mention the unlimited
possibilities of interdisciplinary collaboration. In this section we set our focus on island
phenomena as defined by Ross (1967), critically revisiting the most relevant literature on this
topic, and establishing a typology of constructions that are strong islands, which cannot be
violated. These constructions are particularly interesting because they limit the phase space of
what is expressible via natural language, and thus reveal crucial aspects of its underlying
dynamics. We will argue that a dynamically frustrated system which is characterized by
displaying mixed computational dependencies can provide straightforward characterizations of
cyclicity in terms of changes in dependencies in local domains.