3,610 research outputs found

    Document highlighting - message classification in printed business letters

    Get PDF
    This paper presents the INFOCLAS system applying statistical methods of information retrieval primarily for the classification of German business letters into corresponding message types such as order, offer, confirmation, etc. INFOCLAS is a first step towards understanding of documents. Actually, it is composed of three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types) and the focuser (highlighting relevant letter parts). The system employs several knowledge sources including a database of about 100 letters, word frequency statistics for German, message type specific words, morphological knowledge as well as the underlying document model. As output, the system evaluates a set of weighted hypotheses about the type of letter at hand, or highlights relevant text (text focus), respectively. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis

    A NEW ANOMALOUS TEXT DETECTION APPROACH USING UNSUPERVISED METHODS

    Get PDF
    Increasing size of text data in databases requires appropriate classification and analysis in order to acquire knowledge and improve the quality of decision-making in organizations. The process of discovering the hidden patterns in the data set, called data mining, requires access to quality data in order to receive a valid response from the system. Detecting and removing anomalous data is one of the pre-processing steps and cleaning data in this process. Methods for anomalous data detection are generally classified into three groups including supervised, semi-supervised, and unsupervised. This research tried to offer an unsupervised approach for spotting the anomalous data in text collections. In the proposed method, a combination of two approaches (i.e., clustering-based and distance-based) is used for detecting anomaly in the text data. In order to evaluate the efficiency of the proposed approach, this method is applied on four labeled data sets. The accuracy of Na¨ıve Bayes classification algorithms and decision tree are compared before and after removal of anomalous data with the proposed method and some other methods such as Density-based spatial clustering of applications with noise (DBSCAN). Our proposed method shows that accuracy of more than 92.39% can be achieved. In general, the results revealed that in most cases the proposed method has a good performance

    An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

    Get PDF
    A focused crawler is topic-specific and aims selectively to collect web pages that are relevant to a given topic from the Internet. However, the performance of the current focused crawling can easily suffer the impact of the environments of web pages and multiple topic web pages. In the crawling process, a highly relevant region may be ignored owing to the low overall relevance of that page, and anchor text or link-context may misguide crawlers. In order to solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on improved term weighting approach (ITFIDF), in order to gain highly relevant web pages. In addition, this paper introduces an evaluation approach of the link, link priority evaluation (LPE), which combines web page content block partition algorithm and the strategy of joint feature evaluation (JFE), to better judge the relevance between URLs on the web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and our focused crawler is superior to other focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawler

    Review of Feature Selection and Optimization Strategies in Opinion Mining

    Get PDF
    Opinion mining and sentiment analysis methods has become a prerogative models in terms of gaining insights from the huge volume of data that is being generated from vivid sources. There are vivid range of data that is being generated from varied sources. If such veracity and variety of data can be explored in terms of evaluating the opinion mining process, it could help the target groups in getting the public pulse which could support them in taking informed decisions. Though the process of opinion mining and sentiment analysis has been one of the hot topics focused upon by the researchers, the process has not been completely revolutionary. In this study the focus has been upon reviewing varied range of models and solutions that are proposed for sentiment analysis and opinion mining. From the vivid range of inputs that are gathered and the detailed study that is carried out, it is evident that the current models are still in complex terms of evaluation and result fetching, due to constraints like comprehensive knowledge and natural language limitation factors. As a futuristic model in the domain, the process of adapting scope of evolutionary computational methods and adapting hybridization of such methods for feature extraction as an idea is tossed in this paper
    corecore