16 research outputs found
The gene normalization task in BioCreative III
BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k).
RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively.
CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
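The relative improvements quoted in the RESULTS paragraph can be reproduced from the reported TAP-k scores. The Python sketch below is only an arithmetic check on those published numbers; the variable and function names are illustrative assumptions and are not part of any BioCreative III evaluation software.

```python
# Reported TAP-k scores on the 50 gold-standard articles (taken from the abstract).
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# One rounded percentage per value of k.
improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}

for k, gain in improvements.items():
    print(f"k={k}: {gain}%")
```

The rounded values agree with the 12.4%, 21.8%, and 26.6% stated in the abstract.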
Extração de informação aplicada a comentários da área do turismo
Motivation: The primary motivation of this dissertation was to show
that it is possible to construct an NLP solution for the Portuguese
language capable of helping in the hotel industry.
Objective(s): The main objective of this dissertation was to extract
useful information from hotel commentaries using NLP.
Method: An NLP pipeline was created to extract useful information,
and then sentiment analysis was used to characterise that information.
Results: After processing all the commentaries of a hotel, it was possible
to extract what people like or dislike about it.
Conclusions: The two main conclusions were that it is possible to create
a Portuguese NLP pipeline for the hotel industry, and that it is possible
to extract useful information from thousands of commentaries.
Motivation: The main motivation behind this thesis was to show that
it is possible to write an NLP program using the Portuguese language.
Objective(s): The main objective of this thesis was to extract
information from the commentaries made about hotels using NLP.
Method: An NLP pipeline was created to extract useful information.
Sentiment analysis was then used to characterise that information.
Results: After all the commentaries had been processed, it was
possible to discover what people like or dislike about a
hotel.
Conclusions: The two main conclusions were that it was possible
to do NLP in Portuguese and that it was possible to extract useful
information from thousands of commentaries.
Master's in Electronics and Telecommunications Engineering
Theory and Applications for Advanced Text Mining
Thanks to the growth of computer and web technologies, large amounts of text data can now be collected and stored easily, and such data can be expected to contain useful knowledge. Text mining techniques for extracting that knowledge have been studied intensively since the late 1990s. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs of various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to techniques for under-resourced or less-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields.
Recommended from our members
B!SON: A Tool for Open Access Journal Recommendation
Finding a suitable open access journal to publish scientific work is a complex task: Researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.
Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques.
The impact and dynamics of Internet-based resources for Arabic-speaking users are increasing in significance, depth and breadth at a higher pace than ever, and thus require updated mechanisms for computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation for the analysis and improvement of available automatic processing techniques, such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels.
This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique would require data in order to be used and critically reviewed and assessed, and here an attempt to develop a labeled Arabic corpus is also proposed. This thesis is composed of three parts: 1- Arabic corpus development, 2- proposing, improving and implementing root extraction techniques, and 3- proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic.
This thesis first develops an Arabic corpus prepared for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (which appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method. It also adds the algorithm that handles irregular cases to all of these methods and compares the performance of the proposed methods with the original ones. The thesis further develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The technique with the best accuracy in extracting the correct stem and root for the respective words in texts, which is an enhanced rule-based method, is used in the third part of this thesis. This thesis finally proposes and implements a variant term frequency inverse document frequency weighting method, and investigates the effect of using different choices of features in document representation on single-label text classification performance (words, stems or roots, as well as these choices extended with their respective phrases). This thesis applies forty-seven classifiers to all proposed representations and compares their performance. One challenge for researchers in Arabic text processing is that the root extraction techniques reported in the literature are either not accessible or take a long time to reproduce, while a labeled benchmark Arabic text corpus is not fully available online. Also, few machine learning techniques have so far been investigated on Arabic, and those used the usual preprocessing steps before classification. Such challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques.
The results show that proposing and implementing an algorithm that handles irregular words in Arabic improved the performance of all implemented root extraction techniques. The performance of the algorithm that handles such irregular cases is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and is found empirically to be linear in time for document lengths less than about 8,000. Among the implemented root extraction methods, the rule-based technique improves the most when the irregular-case handling algorithm is included. This thesis validates that choosing roots or stems instead of words in document representations indeed improves single-label classification performance significantly for most of the classifiers used. However, extending such representations with their respective phrases yields no significant improvement in single-label text classification performance. Many classifiers, such as the ripple-down rule classifier, had not yet been tested for Arabic. The comparison of the classifiers' performance concludes that the Bayesian network classifier is significantly the best in terms of accuracy, training time, and root mean square error values for all proposed and implemented representations.
Petra University, Amman (Jordan)
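For readers unfamiliar with term frequency inverse document frequency weighting, the Python sketch below shows the common textbook form of the scheme. It is an assumption-laden illustration only: the abstract does not describe the thesis's variant, so this is not that method, and the function name `tf_idf` and the toy tokens are hypothetical placeholders standing in for words, stems or roots.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook TF-IDF: (count / document length) * log(N / document frequency).
    This is NOT the variant proposed in the thesis.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical placeholder tokens; any of the three feature choices
# (words, stems or roots) could be fed in the same way.
docs = [["ktb", "qlm", "ktb"], ["qlm", "dftr"]]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document (here `qlm`) receives a weight of zero, which is why the choice of token unit can change which features carry weight in classification.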
Anthropology of Color
The field of color categorization has been intrinsically multi- and inter-disciplinary since its beginnings in the nineteenth century. The main contribution of this book is to foster a new level of integration among different approaches to the anthropological study of color. The editors have put great effort into bringing together research from anthropology, linguistics, psychology, semiotics, and a variety of other fields, by promoting the exploration of the different but interacting and complementary ways in which these various perspectives model the domain of color experience. By so doing, they significantly promote the emergence of a coherent field of the anthropology of color.
Graduate School: Course Descriptions, 1972-73
Official publication of Cornell University V.64 1972/7
Extrapolating Subjectivity Research to Other Languages
Socrates articulated it best, "Speak, so I may see you." Indeed, language represents an invisible probe into the mind. It is the medium through which we express our deepest thoughts, our aspirations, our views, our feelings, our inner reality. From the beginning of artificial intelligence, researchers have sought to impart human-like understanding to machines. Much of our language is a form of self-expression, capturing thoughts, beliefs, evaluations, opinions, and emotions that are not available for scrutiny by an outside observer. In the field of natural language processing, research involving these aspects has crystallized under the name of subjectivity and sentiment analysis. While subjectivity classification labels text as either subjective or objective, sentiment classification further divides subjective text into positive, negative or neutral. In this thesis, I investigate techniques for generating tools and resources for subjectivity analysis that do not rely on an existing natural language processing infrastructure in a given language. This constraint is motivated by the fact that the vast majority of human languages are resource-scarce from an electronic point of view: they lack basic tools such as part-of-speech taggers and parsers, or basic resources such as electronic text, annotated corpora or lexica. This severely limits the implementation of techniques on par with those developed for English. By applying methods that make lighter use of text processing infrastructure, we are able to conduct multilingual subjectivity research in these languages as well. Since my aim is also to minimize the amount of manual work required to develop lexica or corpora in these languages, the techniques proposed employ a lever approach, where English often acts as the donor language (the fulcrum in a lever) and makes it possible, with a relatively small amount of effort, to establish preliminary subjectivity research in a target language.