461 research outputs found
Evaluating Parsers with Dependency Constraints
Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency. We examine the constrained and cascading impact, representing the direct and indirect effects of errors on parsing accuracy. This identifies errors that are the underlying source of problems in parses, compared to those which are a consequence of those problems. Kummerfeld et al. (2012) propose a static post-parsing analysis to categorise groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or limitations which may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to correctly analyse each class. We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely-used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error, and differences in the distribution of errors between constrained and cascaded impact. Our work allows us to contrast the implementations of each parser, and how they respond to constraint application. Using our analysis, we experiment with new features for dependency parsing, which encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by and extend on the work of Bansal and Klein (2011). We target these features at the most notable errors, and show how they address some, but not all of the difficult attachments across newswire and web text. CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used for creating training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms which are relatively straightforward for constituency and dependency parsers are non-trivial to implement in CCG. This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications
Text Mining of Patient Demographics and Diagnoses from Psychiatric Assessments
Automatic extraction of patient demographics and psychiatric diagnoses from clinical notes allows for the collection of patient data on a large scale. This data could be used for a variety of research purposes including outcomes studies or developing clinical trials. However, current research has not yet discussed the automatic extraction of demographics and psychiatric diagnoses in detail. The aim of this study is to apply text mining to extract patient demographics - age, gender, marital status, education level, and admission diagnoses from the psychiatric assessments at a mental health hospital and also assign codes to each category. Gender is coded as either Male or Female, marital status is coded as either Single, Married, Divorced, or Widowed, and education level can be coded starting with Some High School through Graduate Degree (PhD/JD/MD etc. Level). Classifications for diagnoses are based on the DSM-IV. For each category, a rule-based approach was developed utilizing keyword-based regular expressions as well as constituency trees and typed dependencies. We employ a two-step approach that first maximizes recall through the development of keyword-based patterns and if necessary, maximizes precision by using NLP-based rules to handle the problem of ambiguity. To develop and evaluate our method, we annotated a corpus of 200 assessments, using a portion of the corpus for developing the method and the rest as a test set. F-score was satisfactory for each category (Age: 0.997; Gender: 0.989; Primary Diagnosis: 0.983; Marital Status: 0.875; Education Level: 0.851) as was coding accuracy (Age: 1.0; Gender: 0.989; Primary Diagnosis: 0.922; Marital Status: 0.889; Education Level: 0.778). These results indicate that a rule-based approach could be considered for extracting these types of information in the psychiatric field. At the same time, the results showed a drop in performance from the development set to the test set, which is partly due to the need for more generality in the rules developed
Distinguishing between factual information and insulting or abusive messages bearing words or phrases in news articles
This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2006.Cataloged from PDF version of thesis report.Includes bibliographical references (page 75).Since Internet has become the leading source of information for the users,
flames or abusive messages have also become the prominent factors of time
wasting for retrieving information. Moreover, a text can contain factual
information as well as abusive or insulting contents. This paper describes a
new approach for an automated system to distinguish between information
and personal attack containing insulting or abusive messages in a given
document. In NLP, flames or abusive messages are considered as extreme
subjective language, which refers to detect personal opinions or emotions in
a news article. Insulting or abusive messages are viewed as extreme subset
of the subjective language because of its extreme nature. We defined some
rules to extract the semantic information of a given sentence from the
general semantic structure of that sentence.Altaf MahmudKazi Zubair AhmedB. Computer Science and Engineerin
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie
Information Extraction on Para-Relational Data.
Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly
relational data that shares the important qualities of relational data but does not
present itself in a relational format. Para-relational data often conveys highly valuable
information and is widely used in many different areas. If we can convert para-relational
data into the relational format, many existing tools can be leveraged for a
variety of interesting applications, such as data analysis with relational query systems
and data integration applications.
This dissertation aims to convert para-relational data into a high-quality relational
form with little user assistance. We have developed four standalone systems, each
addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that extracts relational information from a large
number of spreadsheets. Anthias is an extension of the Senbazuru system to convert
a broader range of spreadsheets into a relational format. Lyretail is an extraction
system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is
a web-based search system that obtains a large number of diagrams automatically
extracted from web-crawled PDFs. Together, these four systems demonstrate that
converting para-relational data into the relational format is possible today, and also
suggest directions for future systems.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
Proceedings
Proceedings of the Workshop on Annotation and
Exploitation of Parallel Corpora AEPC 2010.
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 98 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15893
- …