Text Summarization System with Bayesian Theorem on Oil & Gas Drilling Topic

Abstract

Text summarization is the process of identifying the important sentences or words in an article, which are then combined to generate a summary. Numerous algorithms have been applied to text summarization, including Support Vector Machines, the k-nearest neighbor classifier, and decision trees. In this project, Bayes' theorem is studied and tested through the implementation of a text summarizer. The algorithm extracts the important points from a lengthy document by assigning each word in the document a probability of being included in the summary, with a corpus of expert-written summaries providing the initial probabilities. As the application is used, it learns and keeps track of the probability of each keyword so that it can predict the chance that a given keyword will be included in future summaries. The objectives of this project are to survey the current state of text summarization research, to study the statistical approach to automatic summary generation, and to build a simple text summarization tool that takes the existing research into account. Because the application domain is specific, namely oil and gas drilling, a ready-made corpus for that domain is not easy to find. The articles were collected from journals, news and other information sources related to the topic. The application is evaluated against an existing automatic summarizer that is already on the market, the Word Auto Summarizer. Human-made summaries are used as the ideal, or reference, summaries for evaluating the performance of both systems: the Text Summarization system and the Word Auto Summarizer.
Current results show that the Text Summarization system performs better than the Word Auto Summarizer at compression rates of 60% and 70% (2/3 of the articles' length), by 11.31% and 10.80% respectively. The optimum value for overall performance is 85.82%.
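
The word-level Bayesian scoring described in this abstract can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names (`word_probabilities`, `summarize`), the `keep_ratio` parameter (the fraction of sentences kept), the simple tokenizer, and the choice to score a sentence by the average probability of its words are all assumptions made for illustration. The sketch estimates P(summary | word) = P(word | summary) P(summary) / P(word) from a corpus of (article, expert summary) pairs and then keeps the highest-scoring sentences in their original order.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_probabilities(corpus):
    """Estimate P(summary | word) for every word seen in the articles.

    corpus: list of (article, expert_summary) string pairs (hypothetical format).
    """
    article_counts, summary_counts = Counter(), Counter()
    for article, summary in corpus:
        article_counts.update(tokenize(article))
        summary_counts.update(tokenize(summary))

    n_article = sum(article_counts.values())
    n_summary = sum(summary_counts.values())
    if n_article == 0 or n_summary == 0:
        return {}

    prior = n_summary / n_article  # P(summary)
    probs = {}
    for word, count in article_counts.items():
        likelihood = summary_counts[word] / n_summary  # P(word | summary)
        evidence = count / n_article                   # P(word)
        # Bayes' theorem, clamped to 1.0 in case a summary repeats a word
        # more often than its source article does.
        probs[word] = min(1.0, likelihood * prior / evidence)
    return probs


def summarize(text, probs, keep_ratio=0.3):
    """Keep the top-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    scored = []
    for index, sentence in enumerate(sentences):
        words = tokenize(sentence)
        # Assumed scoring rule: mean word probability; unseen words score 0.
        score = sum(probs.get(w, 0.0) for w in words) / len(words) if words else 0.0
        scored.append((score, index))

    keep = max(1, round(len(sentences) * keep_ratio))
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:keep]
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda t: t[1]))
```

In this sketch, the learning behaviour mentioned in the abstract would correspond to adding new (article, summary) pairs to the corpus and recomputing the counts, so that the probability of each keyword is updated over time.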
