Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
Resources for Evaluation of Summarization Techniques
We report on two corpora to be used in the evaluation of component systems
for the tasks of (1) linear segmentation of text and (2) summary-directed
sentence extraction. We present characteristics of the corpora, methods used in
the collection of user judgments, and an overview of the application of the
corpora to evaluating the component system. Finally, we discuss the problems
and issues with construction of the test set which apply broadly to the
construction of evaluation resources for language technologies.
Comment: LaTeX source, 5 pages, US Letter, uses lrec98.st
Developing professionalism in new IT graduates? Who needs it?
A new graduate may require a period of ‘acclimatisation’ through a process of ‘developing their professionalism’ to fit into their work environment. The e-Skills UK Technology Counts Insights 2010 report suggests that 110,500 new entrants a year are required to fill IT & Telecoms professional job roles, with 20,800 coming from education (predominantly graduate level and higher). However, 43% of recruiters were reporting a lack of suitable candidates for IT & Telecoms posts where growing importance will be placed on relationship management, business process analysis and design, project and programme management. IT & Telecoms professionals are increasingly expected to be multi-skilled, with sophisticated business and interpersonal skills as well as technical competence. As the report also says: ‘UK growth will continue to be primarily in high-value roles with an increasing need for customer and business-oriented skills as well as sophisticated technical competencies.’
The diverse needs and requirements of the IT sector, as specified by various employer groups and professional bodies including BCS, IET, eSkills, the CBI and the SFIA Foundation, are discussed. According to the CBI, ‘62% of entrants to the IT sector need to draw on managerial and professional business skills almost immediately.’ For organisations to succeed, their IT graduate recruits must supplement their IT skills with managerial and professional business skills. Well-considered CPD will ensure that recent graduates can enhance their ‘academic’ skills with the necessary work-based skills for the benefit of both themselves and their new employer. The focus of the improvement will balance the student-centred needs for development and the engaging employer’s commercial needs.
Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process with existing natural language analysis tools since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017
Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work.
In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland).
The chapters of this volume report the results of this endeavour that were first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante’s work definitely deserves “many hands” as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success.