3 research outputs found

    Mining Web Content Outliers by using Term Weighting Technique and Rank Correlation Coefficient Approach

    On the Internet, the World Wide Web (WWW) holds a voluminous amount of information, much of it in redundant and irrelevant web pages. Outliers are data that differ significantly from the rest of the data. Web content mining is a subarea of web mining that extracts required and useful knowledge or information from web page content. Web content outlier mining concentrates on finding outliers, such as irrelevant and redundant pages, among web pages. The web contains unstructured and semi-structured documents, so web content mining algorithms need to handle both. The proposed system is based on big web data, and its objective is to obtain more accurate results. In this proposal, the Term Frequency Inverse Document Frequency (TF.IDF) technique, based on full-word matching with a domain dictionary, is used to remove irrelevant documents from the unstructured web documents according to the user's input query. Removing outliers (irrelevant and redundant content) from the web not only reduces indexing space and time complexity but also improves the accuracy of search results. Documents that share very few words with the user's input query are treated as web outliers. A mathematical approach, Spearman's rank correlation coefficient, is then used to remove redundant web documents and to retrieve ranked relevant web documents.
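
The two steps named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names `tf_idf_scores`, `ranks`, and `spearman` are all assumptions made for illustration. The abstract does not say which quantities are ranked, so the usage note assumes term counts over a shared vocabulary.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms):
    """Score each document by the summed TF-IDF weight of the query terms it contains.

    Tokenization (lowercase + whitespace split) and the "+ 1" IDF smoothing
    are illustrative assumptions, not taken from the paper.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        score = 0.0
        for term in query_terms:
            if counts[term]:
                tf = counts[term] / total
                idf = math.log(n_docs / df[term]) + 1.0
                score += tf * idf
        scores.append(score)
    return scores


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 is a convention
    return cov / (sx * sy)
```

Under these assumptions, a document whose score from `tf_idf_scores` falls below a chosen threshold would be a candidate outlier, and two documents whose term-count vectors give a `spearman` value close to 1 would be candidates for redundancy. Both thresholds are left to the reader, since the abstract does not state them.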

    Dissimilarity algorithm on conceptual graphs to mine text outliers

    Graphical text representation methods such as Conceptual Graphs (CGs) attempt to capture the structure and semantics of documents. As such, they are the preferred text representation approach for a wide range of problems, notably in natural language processing, information retrieval and text mining. In a number of these applications, it is necessary to measure the dissimilarity (or similarity) between the knowledge represented in the CGs. In this paper, we present a dissimilarity algorithm to detect outliers in a collection of text represented in the Conceptual Graph Interchange Format (CGIF). To avoid the NP-complete graph matching problem, we introduce the use of a standard CG in the dissimilarity computation. We evaluate our method in the context of analyzing real-world financial statements to identify outlying performance indicators. For evaluation purposes, we compare the proposed dissimilarity function with a Dice-coefficient similarity function used in related previous work. Experimental results indicate that our method outperforms the existing method and correlates better with human judgements. In comparison with other text outlier detection methods, this approach captures the semantics of documents through the use of CGs and detects outliers conveniently through a simple dissimilarity function. Furthermore, our proposed algorithm retains linear complexity as the number of CGs increases.
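
The abstract does not give the proposed dissimilarity function, so the sketch below only illustrates the Dice-coefficient baseline it mentions and the idea of comparing every graph against one standard graph. Representing a conceptual graph as a set of (concept, relation, concept) triples, using `1 - Dice` as a stand-in dissimilarity, and the function names are all assumptions for illustration.

```python
def dice_similarity(a, b):
    """Dice coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical (a convention)
    return 2 * len(a & b) / (len(a) + len(b))


def dissimilarity_to_standard(graphs, standard):
    """Compare every graph with one standard graph.

    Each graph is visited exactly once, so the number of comparisons grows
    linearly with the number of graphs, rather than quadratically as in
    all-pairs matching. `1 - Dice` is a placeholder for the paper's own
    dissimilarity function, which the abstract does not specify.
    """
    return [1.0 - dice_similarity(g, standard) for g in graphs]


# Hypothetical triples loosely modelled on financial performance statements.
standard = {("Revenue", "attr", "Increase"), ("Profit", "attr", "Increase")}
graphs = [
    {("Revenue", "attr", "Increase"), ("Profit", "attr", "Increase")},
    {("Revenue", "attr", "Increase")},
    {("Revenue", "attr", "Decrease")},
]
scores = dissimilarity_to_standard(graphs, standard)
```

In this toy setup, the graph that shares no triple with the standard receives the highest dissimilarity and would be the outlier candidate.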

    Deviation detection in text using conceptual graph interchange format and error tolerance dissimilarity function

    The rapid increase in the amount of textual data has brought a growing research interest in mining text to detect deviations. Specialized methods for specific domains have emerged to satisfy various needs in discovering rare patterns in text. This paper focuses on a graph-based approach to text representation and presents a novel error tolerance dissimilarity algorithm for deviation detection. We resolve two non-trivial problems: the semantic representation of text and the complexity of graph matching. We employ the conceptual graph interchange format (CGIF), a knowledge representation formalism, to capture the structure and semantics of sentences. We propose a novel error tolerance dissimilarity algorithm to detect deviations in the CGIFs. We evaluate our method in the context of analyzing real-world financial statements to identify deviating performance indicators. We show that our method performs better than two related text-based graph similarity measures. Our proposed method identifies deviating sentences and correlates strongly with expert judgments. Furthermore, it offers error-tolerant matching of CGIFs and retains linear complexity as the number of CGIFs increases.