
    Context-aware Document-clustering Technique

    Document clustering is an intentional act that should reflect individuals’ preferences with regard to the semantic coherency or relevant categorization of documents and should conform to the context of a target task under investigation. Thus, effective document-clustering techniques need to take into account a user’s categorization context defined by or relevant to the target task under consideration. However, existing document-clustering techniques generally anchor in pure content-based analysis and therefore cannot facilitate context-aware document clustering. In response, we propose a Context-Aware document-Clustering (CAC) technique that takes into consideration a user’s categorization preference (expressed as a list of anchoring terms) relevant to the context of a target task and subsequently generates a set of document clusters from this specific contextual perspective. Our empirical evaluation results suggest that our proposed CAC technique outperforms the pure content-based document-clustering technique.
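The abstract's core idea — letting user-supplied anchoring terms steer clustering instead of pure content similarity — can be sketched in a toy form. This is not the authors' implementation; the grouping rule, function name, and example terms are all illustrative:

```python
from collections import Counter

def context_aware_clusters(docs, anchoring_terms):
    """Group documents by the anchoring term they emphasize most.

    A toy sketch: the user's categorization preference (the anchoring
    terms) defines the clusters; documents matching no term fall into
    a catch-all bucket keyed by None.
    """
    clusters = {term: [] for term in anchoring_terms}
    clusters[None] = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        best = max(anchoring_terms, key=lambda t: counts[t])
        if counts[best] == 0:
            clusters[None].append(doc)
        else:
            clusters[best].append(doc)
    return clusters

docs = [
    "python code and python tests",
    "budget report and budget forecast",
    "holiday photos",
]
print(context_aware_clusters(docs, ["python", "budget"]))
```

The same documents would cluster differently under a different list of anchoring terms, which is the contextual effect the paper targets.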

    Sequential and Unsupervised Document Authorial Clustering Based on Hidden Markov Model

    © 2017 IEEE. Document clustering groups documents with similar characteristics into one cluster. Document clustering has shown advantages for the organization, retrieval, navigation and summarization of the huge number of text documents on the Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we propose to extract contextual correlations to depict the writing style hidden among the sentences of each document for clustering the documents. For this purpose, we build a Hidden Markov Model (HMM) to represent the relations of sequential sentences, and a two-level, unsupervised framework is constructed. Our proposed approach is evaluated on four benchmark datasets widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms the state-of-the-art approaches.
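The notion of modeling writing style as transitions between consecutive sentences can be sketched with a first-order Markov model over coarse sentence-length states. This is a toy stand-in for the paper's HMM, under assumed thresholds (words per sentence bucketed into three states):

```python
import re

def transition_matrix(text, n_states=3):
    """First-order Markov model over coarse sentence-length states.

    Each sentence is mapped to a short/medium/long state by word count,
    and a document's "style" is the matrix of transition probabilities
    between the states of consecutive sentences. The bucket width (8
    words) and state count are illustrative choices.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    states = [min(len(s.split()) // 8, n_states - 1) for s in sentences]
    counts = [[0.0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1.0
    for row in counts:  # normalize each row to probabilities
        total = sum(row)
        if total:
            for j in range(n_states):
                row[j] /= total
    return counts

def style_distance(m1, m2):
    # Euclidean distance between flattened transition matrices; documents
    # whose matrices lie close together would be grouped as one author.
    return sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)) ** 0.5
```

Documents could then be clustered by running any standard clustering algorithm on these pairwise style distances.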

    Collaborative Filtering-based Context-Aware Document-Clustering (CF-CAC) Technique

    Document clustering is an intentional act that should reflect an individual’s preference with regard to the semantic coherency or relevant categorization of documents and should conform to the context of a target task under investigation. Thus, effective document clustering techniques need to take into account a user’s categorization context. In response, Yang & Wei (2007) propose a Context-Aware document Clustering (CAC) technique that takes into consideration a user’s categorization preference relevant to the context of a target task and subsequently generates a set of document clusters from this specific contextual perspective. However, the CAC technique encounters the problem of small-sized anchoring terms. To overcome this shortcoming, we extend the CAC technique and propose a Collaborative Filtering-based Context-Aware document-Clustering (CF-CAC) technique that considers not only a target user’s but also other users’ anchoring terms when approximating the categorization context of the target user. Our empirical evaluation results suggest that our proposed CF-CAC technique outperforms the CAC technique.
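The collaborative-filtering step — padding out a target user's small anchoring-term list with terms from similar users — can be sketched with a simple Jaccard-based neighbor lookup. The similarity measure, `k`, and all names here are illustrative assumptions, not the paper's method:

```python
def expand_anchoring_terms(target_terms, other_users_terms, k=1):
    """Borrow anchoring terms from the k most similar users.

    Sketch of the CF idea: when the target user supplies too few
    anchoring terms, rank the other users by Jaccard overlap with the
    target's term set and merge in the top-k users' terms.
    """
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    target = set(target_terms)
    ranked = sorted(other_users_terms,
                    key=lambda u: jaccard(target, set(u)), reverse=True)
    expanded = set(target)
    for user_terms in ranked[:k]:
        expanded |= set(user_terms)
    return expanded
```

The expanded term set would then be fed to the CAC-style clustering in place of the user's original short list.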

    Harnessing Deep Learning Techniques for Text Clustering and Document Categorization

    This research paper delves into the realm of deep text clustering algorithms with the aim of enhancing the accuracy of document classification. In recent years, the fusion of deep learning techniques and text clustering has shown promise in extracting meaningful patterns and representations from textual data. This paper provides an in-depth exploration of various deep text clustering methodologies, assessing their efficacy in improving document classification accuracy. Delving into the core of deep text clustering, the paper investigates various feature representation techniques, ranging from conventional word embeddings to contextual embeddings furnished by BERT and GPT models. By critically reviewing and comparing these algorithms, we shed light on their strengths, limitations, and potential applications. Through this comprehensive study, we offer insights into the evolving landscape of document analysis and classification, driven by the power of deep text clustering algorithms. Through an original synthesis of existing literature, this research serves as a beacon for researchers and practitioners in harnessing the prowess of deep learning to enhance the accuracy of document classification endeavors.
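The common backbone of the pipelines this survey covers is clustering in an embedding space. A minimal sketch, assuming the embeddings are already computed by some deep encoder (in practice BERT- or GPT-derived vectors; here toy 2-D points so the example stays self-contained):

```python
def kmeans(vectors, k, iters=20):
    """Tiny k-means over precomputed sentence/document embeddings.

    Deterministic first-k initialization for reproducibility; real
    pipelines would use a library implementation and learned vectors.
    """
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            groups[j].append(v)
        for c, g in enumerate(groups):
            if g:  # recompute each centroid as the mean of its members
                centers[c] = [sum(col) / len(g) for col in zip(*g)]
    return [min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors]
```

Swapping the toy points for contextual embeddings is what turns this into the "deep text clustering" setting the paper reviews.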

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of the system that relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used for both deriving the topics that run through the system as well as their clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers to represent a module as a dependency graph where the nodes correspond to identifiers and the edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects, and show that by introducing contexts for identifiers, the quality of the modularization of the software systems is improved. Both of the context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that inferred topics through performing topic analysis on the contextual representations are more meaningful compared to the plain representation of the documents. 
The proposed approach of introducing context models for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization, and topic analysis.
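The second context model — identifiers as nodes, data dependencies as edges — can be sketched with a plain graph grouping. Here a "module" is simply a connected component, a deliberate simplification of the clustering the paper actually performs; identifier names are made up:

```python
def modularize(dependencies):
    """Group identifiers into modules by data-flow connectivity.

    `dependencies` is a list of (identifier, identifier) pairs; the
    sketch returns the connected components of the resulting
    undirected dependency graph.
    """
    adj = {}
    for a, b in dependencies:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, modules = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:  # depth-first walk of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        modules.append(comp)
    return modules
```

A richer model would weight the edges and cut weak ones, which is where quality measures like authoritativeness come in.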

    Optical tomography: Image improvement using mixed projection of parallel and fan beam modes

    Mixed parallel and fan beam projection is a technique used to increase the quality of images. This research focuses on enhancing image quality in optical tomography. Image quality can be defined by measuring the Peak Signal to Noise Ratio (PSNR) and Normalized Mean Square Error (NMSE) parameters. The findings of this research prove that by combining parallel and fan beam projection, the image quality can be increased by more than 10% in terms of its PSNR value and more than 100% in terms of its NMSE value compared to a single parallel beam.
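The two quality metrics named in the abstract have standard definitions, sketched below for flattened pixel lists (the peak value of 255 assumes 8-bit images; the abstract does not state the bit depth used):

```python
import math

def psnr(ref, img, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    mse = sum((r - x) ** 2 for r, x in zip(ref, img)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def nmse(ref, img):
    """Normalized Mean Square Error; lower means closer to the reference."""
    return sum((r - x) ** 2 for r, x in zip(ref, img)) / sum(r ** 2 for r in ref)
```

A reconstruction from the mixed projection would thus show a higher PSNR and a lower NMSE than a single-beam reconstruction of the same scene.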

    Con-S2V: A Generic Framework for Incorporating Extra-Sentential Context into Sen2Vec

    We present a novel approach to learn distributed representations of sentences from unlabeled data by modeling both the content and the context of a sentence. The content model learns sentence representation by predicting its words. On the other hand, the context model comprises a neighbor prediction component and a regularizer to model distributional and proximity hypotheses, respectively. We propose an online algorithm to train the model components jointly. We evaluate the models in a setup where contextual information is available. The experimental results on tasks involving classification, clustering, and ranking of sentences show that our model outperforms the best existing models by a wide margin across multiple datasets.
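The content-plus-context split can be illustrated in a heavily simplified form: a bag-of-words vector stands in for the word-prediction content model, and nudging each vector toward its neighboring sentences mimics the proximity regularizer. The mixing weight `alpha` and all names are illustrative, not Con-S2V's actual parameters:

```python
def sentence_vectors(sentences, vocab, alpha=0.3):
    """Toy content + context representation for sentences.

    Content: a bag-of-words count vector per sentence. Context: each
    vector is blended with the mean of its adjacent sentences' vectors,
    so nearby sentences end up with similar representations.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    content = []
    for s in sentences:
        v = [0.0] * len(vocab)
        for w in s.split():
            if w in idx:
                v[idx[w]] += 1.0
        content.append(v)
    mixed = []
    for i, v in enumerate(content):
        nbrs = [content[j] for j in (i - 1, i + 1) if 0 <= j < len(content)]
        if not nbrs:  # lone sentence: no context to blend in
            mixed.append(v)
            continue
        mean = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        mixed.append([(1 - alpha) * a + alpha * b for a, b in zip(v, mean)])
    return mixed
```

The real model learns both parts jointly online rather than smoothing after the fact, but the blending shows why extra-sentential context changes the representation.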