
    Support Vector Machines (SVM) in Text Extraction

    Text categorization is the process of grouping documents or words into predefined categories, where each category consists of documents or words with similar attributes. Numerous algorithms exist for text categorization, including Naive Bayes, k-nearest-neighbor classifiers, and decision trees. In this project, Support Vector Machines (SVM) are studied and evaluated through the implementation of a textual extractor. The algorithm extracts the important points of a lengthy document: it classifies each word in the document under its relevant category and constructs the structure of the summary from the categorized words. The extractor's performance is evaluated on the same corpus against an existing summarizer that takes a different approach. Summarization is a part of text categorization; it is considered an essential component of today's information-led society and has been a growing area of research for over 40 years. The project's objective is to create a summarizer, or extractor, based on two machine learning algorithms: SVM and K-Means. Each word in a document is processed by both algorithms to determine its actual occurrence in the document. K-Means first clusters the words into categories based on part of speech (verb, noun, adjective); SVM then determines the actual occurrence of each word within each cluster, taking into account whether a word shares its meaning with other words in the cluster. The corpus chosen for evaluation is the Reuters-21578 dataset of newspaper articles. The application is evaluated against a system-generated extract already on the market, the Microsoft Word AutoSummarizer, by measuring how many sentences overlap between the two outputs. Results show that the Text Extractor performs best at compression rates of 10-20% and 35-45%.
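
    A minimal sketch of the clustering-then-classification pipeline described above (Python with scikit-learn; the feature choices, the sentence-level granularity, and the function name extract_summary are illustrative assumptions, not the authors' implementation):

        # Sketch: K-Means groups items into rough lexical clusters, then a
        # linear SVM scores them for inclusion at a given compression rate.
        # Requires training labels (1 = summary-worthy, 0 = not) from some
        # annotated examples; here they are assumed to be given.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
        from sklearn.svm import SVC

        def extract_summary(sentences, labels, compression=0.2, n_clusters=3):
            vec = TfidfVectorizer()
            X = vec.fit_transform(sentences)
            # Cluster assignment stands in for the paper's POS-based grouping.
            clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
            feats = np.hstack([X.toarray(), np.eye(n_clusters)[clusters]])
            svm = SVC(kernel="linear", probability=True).fit(feats, labels)
            scores = svm.predict_proba(feats)[:, 1]
            k = max(1, int(len(sentences) * compression))
            keep = sorted(np.argsort(scores)[-k:])  # restore document order
            return [sentences[i] for i in keep]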

    Generating indicative-informative summaries with SumUM

    We present and evaluate SumUM, a text summarization system that takes a raw technical text as input and produces an indicative-informative summary. The indicative part of the summary identifies the topics of the document, and the informative part elaborates on some of those topics according to the reader's interest. SumUM motivates the topics, describes entities, and defines concepts; it is a first step toward exploring dynamic summarization. This is accomplished through a process of shallow syntactic and semantic analysis, concept identification, and text regeneration. Our method was developed through the study of a corpus of abstracts written by professional abstractors. Relying on human judgment, we have evaluated the indicativeness, informativeness, and text acceptability of the automatic summaries. The results so far indicate good performance compared with other summarization technologies.
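
    As a rough illustration of the indicative/informative split (a sketch only; SumUM's shallow syntactic and semantic analysis is far richer, and the cue patterns and function names below are invented for the example):

        # Indicative pass: surface topic-announcing sentences.
        # Informative pass: elaborate only the topics the reader selects.
        import re

        TOPIC_CUES = re.compile(r"\b(we present|this paper|describes|introduces)\b",
                                re.IGNORECASE)

        def indicative_part(sentences):
            return [s for s in sentences if TOPIC_CUES.search(s)]

        def informative_part(sentences, chosen_topics):
            out = []
            for topic in chosen_topics:
                terms = {w.lower() for w in topic.split() if len(w) > 4}
                # Keep sentences that mention the topic's key terms.
                out += [s for s in sentences
                        if terms & {w.lower().strip(".,;") for w in s.split()}]
            return out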

    Content-Based Book Recommending Using Learning for Text Categorization

    Recommender systems improve access to relevant products and information by making personalized suggestions based on previous examples of a user's likes and dislikes. Most existing recommender systems use social filtering methods that base recommendations on other users' preferences. By contrast, content-based methods use information about an item itself to make suggestions. This approach has the advantage of being able to recommend previously unrated items to users with unique interests and to provide explanations for its recommendations. We describe a content-based book recommending system that utilizes information extraction and a machine-learning algorithm for text categorization. Initial experimental results demonstrate that this approach can produce accurate recommendations. Comment: 8 pages, 3 figures; submitted to the Fourth ACM Conference on Digital Libraries.
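
    A hedged sketch of a content-based recommender in this spirit (scikit-learn; the abstract does not name the learning algorithm, so the naive Bayes classifier and the function recommend below are assumptions for illustration):

        # Learn a text classifier from book descriptions the user has rated,
        # then rank unrated candidates by predicted probability of a "like".
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        def recommend(rated_texts, ratings, candidate_texts, top_n=5):
            # ratings: 1 = liked, 0 = disliked (both classes must appear)
            model = make_pipeline(TfidfVectorizer(stop_words="english"),
                                  MultinomialNB())
            model.fit(rated_texts, ratings)
            scores = model.predict_proba(candidate_texts)[:, 1]
            ranked = sorted(zip(scores, candidate_texts), reverse=True)
            return [text for _, text in ranked[:top_n]]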

    Chi-square-based scoring function for categorization of MEDLINE citations

    Objectives: Text categorization has been used in biomedical informatics to identify documents on relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine how likely a MEDLINE citation is to contain a genetics-relevant topic. Methods: Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH descriptors assigned to MEDLINE citations for this categorization task, comparing the frequencies of MeSH descriptors between the two corpora with the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the method is a list of scores for all citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87, with 0.69 recall and 0.64 precision. We evaluated it against three machine learning algorithms (support vector machines, decision trees, naïve Bayes); although the differences were not statistically significant, the results show that the chi-square scoring performs as well as the compared machine learning algorithms. Conclusions: We suggest that chi-square scoring is an effective way to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process. Comment: 34 pages, 2 figures.
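
    The scoring procedure lends itself to a compact sketch (Python with SciPy; the aggregation rule of summing the chi-square values of a citation's positive-indicator descriptors is our reading of the abstract, not the published algorithm):

        # Each document is represented as the set of its MeSH descriptors.
        from collections import Counter
        from scipy.stats import chi2_contingency

        def descriptor_scores(genetic_docs, nongenetic_docs):
            g, n = Counter(), Counter()
            for d in genetic_docs:
                g.update(d)
            for d in nongenetic_docs:
                n.update(d)
            G, N = len(genetic_docs), len(nongenetic_docs)
            scores = {}
            for mesh in set(g) | set(n):
                # 2x2 table: descriptor present/absent in each corpus.
                table = [[g[mesh], G - g[mesh]], [n[mesh], N - n[mesh]]]
                chi2 = chi2_contingency(table)[0]
                # Positive indicator: relatively more frequent in the
                # genetic corpus.
                if g[mesh] / G > n[mesh] / N:
                    scores[mesh] = chi2
            return scores

        def score_citation(descriptors, scores):
            return sum(scores.get(m, 0.0) for m in descriptors)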

    FOSTER D2.1 - Technical protocol for rich metadata categorization and content classification

    FOSTER aims to set in place sustainable mechanisms for EU researchers to foster open science in their daily workflow, supporting researchers in optimizing their research visibility and impact and in adopting EU open access policies in line with the EU objectives on Responsible Research & Innovation.

    More specifically, the FOSTER objectives are to:

    • support different stakeholders, especially young researchers, in adopting open access in the context of the European Research Area (ERA) and in complying with the open access policies and rules of participation set out for Horizon 2020;
    • integrate open access principles and practice into the current research workflow by targeting the young-researcher training environment;
    • strengthen institutional training capacity to foster compliance with the open access policies of the ERA and Horizon 2020 (beyond the FOSTER project);
    • facilitate the adoption, reinforcement and implementation of open access policies from other European funders, in line with the EC's recommendation, in partnership with the PASTEUR4OA project.

    As stated in the project Description of Work (DoW), these objectives will be pursued through a combination of three main activities: content identification, repackaging and creation; creation of the FOSTER Portal; and delivery of training.

    The core activity of Task T2.1 is to define a basic quality-control protocol for content and to map available content by target group and content type, in parallel with WP3 Task 3.1. Training materials include the full range of classical (structured presentation slides) and multimedia content (short videos, interactive e-books, etc.) that clearly and succinctly frames a problem and offers a working solution, in support of the learning objectives of each target group and the range of learning options to be used in WP4 (e-learning, blended learning, self-learning).

    The map of existing content metadata will be delivered to WP3 to inform the choice of system requirements for continuous and sustainable content aggregation, enhancement and delivery via "Task 3.2 e-Learning Portal" and "Task 3.4 Content Upload". The resulting content compilation will be tailored to each target group and delivered to WP4.