8 research outputs found

    Revisiting the challenges and surveys in text similarity matching and detection methods

    Get PDF
    The massive amount of information available on the internet has revolutionized the field of natural language processing. One of its challenges is estimating the similarity between texts, which remains an open research problem even though various studies have proposed new methods over the years. This paper surveyed and traced the primary studies in the field of text similarity. The aim was to give a broad overview of existing issues, applications, and methods in text similarity research. The paper identified four issues and several applications of text similarity matching, and classified current studies into intrinsic, extrinsic, and hybrid approaches. The methods were then classified into lexical-similarity, syntactic-similarity, semantic-similarity, structural-similarity, and hybrid categories. Furthermore, this study also analyzed and discussed method improvements, current limitations, and open challenges on this topic for future research directions.
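
    To make the method categories concrete, here is a minimal sketch of one lexical-similarity measure of the kind the survey classifies: Jaccard similarity over word sets. It is illustrative only and is not taken from the paper; the surveyed literature covers far richer syntactic, semantic, and structural measures.

        # Minimal sketch of a lexical-similarity measure (Jaccard over word sets).
        # Illustrative only; one example of the "lexical-similarity" class.

        def jaccard_similarity(text_a: str, text_b: str) -> float:
            """Ratio of shared words to all distinct words in the two texts."""
            tokens_a = set(text_a.lower().split())
            tokens_b = set(text_b.lower().split())
            if not tokens_a and not tokens_b:
                return 1.0
            return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

        print(jaccard_similarity("the cat sat on the mat", "a cat sat on a mat"))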

    A tree based keyphrase extraction technique for academic literature

    Get PDF
    Automatic keyphrase extraction techniques aim to extract quality keyphrases that summarize a document at a higher level. Among the existing techniques, some are domain-specific and require application-domain knowledge, some are based on higher-order statistical methods and are computationally expensive, and some require large training data sets that are rare for many applications. To overcome these issues, this thesis proposes a new unsupervised automatic keyphrase extraction technique, named TeKET (Tree-based Keyphrase Extraction Technique), which is domain-independent, employs limited statistical knowledge, and requires no training data. The proposed technique also introduces a new variant of the binary tree, called the KeyPhrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. Depending on the candidate keyphrases, the KePhEx tree structure is expanded, shrunk, or maintained. In addition, a measure called the Cohesiveness Index (CI) is derived, which denotes the degree of cohesiveness of a given node with respect to the root; it is used to extract final keyphrases from the resultant tree in a flexible manner and to rank keyphrases alongside term frequency. The effectiveness of the proposed technique is evaluated experimentally on a benchmark corpus, SemEval-2010, with a total of 244 training and test articles, and compared with other relevant unsupervised techniques, taking representatives from both statistical (Term Frequency-Inverse Document Frequency and YAKE) and graph-based techniques (PositionRank, CollabRank (SingleRank), TopicRank, and MultipartiteRank) into account. Three evaluation metrics, namely precision, recall, and F1 score, are considered in the experiments. The obtained results demonstrate the improved performance of the proposed technique over other similar techniques in terms of precision, recall, and F1 score.
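
    The core ranking idea, term frequency weighted by how cohesively a phrase's words hold together, can be sketched in a few lines. The sketch below is a deliberately simplified stand-in: it uses bigram candidates and a crude co-occurrence ratio as a cohesion proxy, not the KePhEx tree or the paper's actual Cohesiveness Index.

        # Simplified illustration of TF-plus-cohesion ranking. NOT the paper's
        # KePhEx tree or exact CI formula; the cohesion proxy here is just how
        # often a phrase's words co-occur as that phrase.
        from collections import Counter
        import re

        def rank_candidates(text: str, top_k: int = 5):
            words = re.findall(r"[a-z]+", text.lower())
            bigrams = [" ".join(p) for p in zip(words, words[1:])]
            word_tf, phrase_tf = Counter(words), Counter(bigrams)
            scores = {}
            for phrase, tf in phrase_tf.items():
                w1, w2 = phrase.split()
                # cohesion proxy: phrase frequency relative to its rarer word
                cohesion = tf / min(word_tf[w1], word_tf[w2])
                scores[phrase] = tf * cohesion
            return sorted(scores, key=scores.get, reverse=True)[:top_k]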

    State of the art document clustering algorithms based on semantic similarity

    Get PDF
    The continued growth of the Internet has hugely increased the number of text documents in electronic form, and techniques to group these documents into meaningful clusters have become a critical task. Traditional clustering methods were based on statistical features and clustered documents using syntactic rather than semantic notions. These techniques, however, often gathered dissimilar documents in the same group because of polysemy and synonymy problems. An important solution to this issue is document clustering based on semantic similarity, in which documents are grouped according to their meaning rather than their keywords. In this research, eighty papers that use semantic similarity in different fields have been reviewed; forty of them, which apply semantic similarity to document clustering and were published between 2014 and 2020, were selected for in-depth study. A comprehensive literature review of all the selected papers is presented, with detailed analysis and comparison of their clustering algorithms, the tools they utilize, and their methods of evaluation. This helps in the implementation and evaluation of document clustering. The reviewed research informs the direction of the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
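
    A minimal sketch of the baseline pipeline the review contrasts against, assuming scikit-learn is available: keyword-based (TF-IDF) vectors clustered with k-means. The synonymy problem shows up directly; the semantic approaches surveyed replace the vectorizer with meaning-aware representations (e.g. ontology- or embedding-based).

        # Keyword-based clustering baseline (assumes scikit-learn). Synonyms
        # share few keywords, so TF-IDF may split documents with the same meaning.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = [
            "the car drove down the road",
            "an automobile sped along the highway",  # same meaning as doc 0, few shared words
            "the bank approved the loan",
        ]
        vectors = TfidfVectorizer().fit_transform(docs)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
        print(labels)  # keyword clustering may separate docs 0 and 1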

    Automated detection and classification of tumor histotypes on dynamic PET imaging data through machine-learning driven voxel classification

    Get PDF
    2-deoxy-2-fluorine-(18F)fluoro-D-glucose Positron Emission Tomography/Computed Tomography (18F-FDG-PET/CT) is widely used in oncology, mainly for the diagnosis and staging of various cancer types, including lung cancer, the most common cancer worldwide. Since histopathologic subtypes of lung cancer show different degrees of 18F-FDG uptake, there are to date diagnostic limits and uncertainties that hinder an 18F-FDG-PET-driven classification of the histologic subtypes of lung cancers. On the other hand, since activated macrophages, neutrophils, fibroblasts, and granulation tissue also show increased 18F-FDG activity, infectious and/or inflammatory processes and post-surgical and post-radiation changes may cause false-positive results, especially in lymph-node assessment. Here we propose a model-free, machine-learning based algorithm for the automated classification of adenocarcinoma, the most common type of lung cancer, versus other tumor types. The input to the algorithm is dynamic PET acquisitions (dPET), providing a spatially and temporally resolved characterization of the uptake kinetics. The algorithm consists of a trained Random Forest classifier which, relying contextually on several spatial and temporal features of 18F-FDG uptake, generates probability maps that distinguish adenocarcinoma from other lung histotypes and identify metastatic lymph nodes, ultimately increasing the specificity of the technique. Its performance, evaluated on a dPET dataset of 19 patients affected by primary lung cancer, yields a probability of 0.943 ± 0.090 for the detection of adenocarcinoma. This algorithm enables more accurate, automatic localization and discrimination of tumors, also providing a powerful tool for detecting the extent to which a tumor has spread beyond the primary site into the lymphatic system.
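
    A hedged sketch of the voxel-classification idea, assuming scikit-learn and NumPy: a Random Forest is trained on per-voxel time-activity features from dynamic PET, and its class probabilities can be reshaped into probability maps. The data below is synthetic and the features are deliberately crude; the paper's actual spatial and temporal features are far richer.

        # Sketch: Random Forest on synthetic per-voxel time-activity curves,
        # producing per-voxel class probabilities (a flattened probability map).
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        n_voxels, n_frames = 1000, 20
        # synthetic time-activity curves; class 1 accumulates tracer faster
        X = rng.normal(size=(n_voxels, n_frames)).cumsum(axis=1)
        y = rng.integers(0, 2, n_voxels)
        X[y == 1] += np.linspace(0.0, 3.0, n_frames)  # steeper uptake for class 1

        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        prob_map = clf.predict_proba(X)[:, 1]  # per-voxel probability of class 1
        print(prob_map[:5])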

    An efficient Wikipedia semantic matching approach to text document classification

    Full text link
    © 2017 Elsevier Inc. A traditional classification approach based on keyword matching represents each text document as a set of keywords without considering semantic information, thereby reducing the accuracy of classification. To solve this problem, a classification approach based on Wikipedia matching was proposed, which represents each document as a concept vector in the Wikipedia semantic space so as to capture the text semantics, and which has been shown to improve classification accuracy. However, the immense Wikipedia semantic space greatly reduces the efficiency of generating a concept vector, negatively affecting the availability of the approach in an online environment. In this paper, we propose an efficient Wikipedia semantic matching approach to document classification. First, we define several heuristic selection rules to quickly pick out the concepts related to a document from the Wikipedia semantic space, making it unnecessary to match all the concepts in the semantic space and thus greatly improving the efficiency of concept-vector generation. Second, based on the semantic representation of each text document, we compute the similarity between documents so as to classify them accurately. Finally, evaluation experiments demonstrate the effectiveness of our approach, showing that it can improve the classification efficiency of Wikipedia matching without compromising classification accuracy.
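
    A toy sketch of the concept-vector idea: each document is scored against a small "concept space" (here, hand-made keyword bags standing in for Wikipedia concepts), and documents are compared by cosine similarity of their concept vectors. The hypothetical concepts and the selection step are placeholders; the paper's heuristic rules for pruning the real Wikipedia space are far more involved.

        # Toy concept-vector matching; the two "concepts" are invented stand-ins
        # for Wikipedia concepts, not part of the paper's method.
        import math

        concepts = {
            "sports": {"game", "team", "score", "player"},
            "finance": {"bank", "loan", "market", "stock"},
        }

        def concept_vector(doc: str):
            words = set(doc.lower().split())
            return [len(words & bag) for bag in concepts.values()]

        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
            return dot / norm if norm else 0.0

        d1, d2 = "the team won the game", "players score in every game"
        print(cosine(concept_vector(d1), concept_vector(d2)))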

    Navigating Copyright for Libraries

    Get PDF
    Much of the information that libraries make available is protected by copyright or subject to the terms of license agreements. This reader presents an overview of current issues in copyright law reform. The chapters present salient points, overviews of the law and legal concepts, selected comparisons of approaches around the world, the significance of the topic, opportunities for reform and advocacy, and other related resources.

    Navigating Copyright for Libraries – Purpose and Scope

    Get PDF
    Information is a critical resource for personal, economic, and social development. Libraries and archives are the primary access point to information for individuals and communities, with much of that information protected by copyright or licence terms. In this complex legal environment, librarians and information professionals operate at the fulcrum of copyright's balance, ensuring understanding of and compliance with copyright legislation and enabling access to knowledge in the pursuit of research, education, and innovation. This book, produced on behalf of the IFLA Copyright and other Legal Matters (CLM) Advisory Committee, provides basic and advanced information about copyright, outlines limitations and exceptions, discusses communicating with users, and highlights emerging copyright issues. The chapters note the significance of the topic; describe salient points of the law and legal concepts; present selected comparisons of approaches around the world; highlight opportunities for reform and advocacy; and help libraries and librarians find their way through the copyright maze.