28,846 research outputs found

    Integrating Semantic Knowledge to Tackle Zero-shot Text Classification

    Insufficient or even unavailable training data for emerging classes is a big challenge in many classification tasks, including text classification. Recognising text documents of classes that have never been seen in the learning stage, so-called zero-shot text classification, is therefore difficult, and only limited previous work has tackled this problem. In this paper, we propose a two-phase framework together with data augmentation and feature augmentation to solve this problem. Four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and a general knowledge graph) are incorporated into the proposed framework to deal with instances of unseen classes effectively. Experimental results show that each phase, as well as the combination of the two phases, achieves the best overall accuracy compared with baselines and recent approaches in classifying real-world texts under the zero-shot scenario. Comment: Accepted NAACL-HLT 201
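As a concrete illustration of one of the knowledge sources named above, a minimal zero-shot classifier can compare a document's averaged word embedding against the embedding of each class name and pick the nearest class. The sketch below uses tiny hand-made vectors in place of real pretrained embeddings; all words and values are illustrative, and this is not the paper's actual two-phase framework:

```python
import numpy as np

# Toy word embeddings; a real system would load pretrained vectors
# (e.g. GloVe or word2vec). All names and values here are illustrative.
EMBEDDINGS = {
    "goal":    np.array([0.9, 0.1, 0.0]),
    "match":   np.array([0.8, 0.2, 0.1]),
    "stock":   np.array([0.1, 0.9, 0.2]),
    "market":  np.array([0.0, 0.8, 0.3]),
    "sports":  np.array([1.0, 0.0, 0.0]),
    "finance": np.array([0.0, 1.0, 0.1]),
}

def embed(text):
    """Average the embeddings of the known words in a text."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(text, classes):
    """Assign the class whose name embedding is closest to the document."""
    doc = embed(text)
    return max(classes, key=lambda c: cosine(doc, EMBEDDINGS[c]))

print(zero_shot_classify("goal in the match", ["sports", "finance"]))  # sports
```

Because the class is chosen by embedding similarity rather than by a trained output layer, the classifier can assign documents to classes that never appear in any training data, which is the core of the zero-shot setting.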

    New instances classification framework on Quran ontology applied to question answering system

    Instance classification with a small dataset is a current research problem in Quran ontology development. The existing classification approach used machine learning, namely a Backpropagation Neural Network. However, this method has a drawback: if the training set is small, classifier accuracy can decline. Unfortunately, the Holy Quran corpus is small. Based on this problem, our study aims to formulate a new instance classification framework for a small training corpus, applied to a semantic question answering system. The resulting framework consists of several essential components: pre-processing, morphology analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function Network algorithm, and a transformation module. This algorithm was chosen because it is robust to noisy data and performs well on small datasets. Furthermore, the document processing module of the question answering system is used to access the instance classification results in the Quran ontology.
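A minimal sketch of the Radial Basis Function Network classifier named above, assuming Gaussian basis functions and a least-squares fit of the output weights; the centres would normally come from clustering the training data, and the toy data here is purely illustrative:

```python
import numpy as np

def rbf_features(X, centers, gamma=1.0):
    """Gaussian RBF activations: one hidden feature per centre."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-gamma * d ** 2)

def train_rbf(X, y, centers, gamma=1.0):
    """Fit the output weights by least squares on one-hot targets."""
    Phi = rbf_features(X, centers, gamma)
    Y = np.eye(y.max() + 1)[y]                 # one-hot labels
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return W

def predict_rbf(X, centers, W, gamma=1.0):
    return rbf_features(X, centers, gamma).dot(W).argmax(axis=1)

# Tiny 2-class example; real inputs would be document feature vectors.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
centers = np.array([[0.1, 0.05], [0.95, 1.05]])
W = train_rbf(X, y, centers)
print(predict_rbf(X, centers, W))  # [0 0 1 1]
```

Because the hidden layer is fixed and only the linear output weights are fitted, training needs no iterative backpropagation, which is one reason RBF networks behave well on small corpora.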

    Enhancing Sensitivity Classification with Semantic Features using Word Embeddings

    Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. Therefore, there is a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. However, sensitivity is typically a product of the relations between combinations of terms, such as who said what about whom; automatic sensitivity classification is therefore a difficult task. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also benefit sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents than the text classification baseline.
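The feature-extension idea described above can be sketched as a simple concatenation of term/n-gram counts with averaged word-embedding features. The embeddings and vocabulary below are illustrative stand-ins for real pretrained vectors and a learned vocabulary, not the paper's actual feature set:

```python
import numpy as np

# Toy embeddings standing in for pretrained word vectors (illustrative only).
EMB = {"minister":   np.array([0.9, 0.1]),
       "criticised": np.array([0.2, 0.8]),
       "ambassador": np.array([0.8, 0.2]),
       "report":     np.array([0.1, 0.1])}

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurise(text, vocab):
    """Concatenate term/bigram counts with averaged embedding features."""
    toks = text.lower().split()
    grams = toks + ngrams(toks, 2)             # unigram + bigram terms
    counts = np.array([grams.count(v) for v in vocab], dtype=float)
    vecs = [EMB[t] for t in toks if t in EMB]
    sem = np.mean(vecs, axis=0) if vecs else np.zeros(2)
    return np.concatenate([counts, sem])       # term + semantic features

vocab = ["minister", "criticised", "minister criticised"]
v = featurise("minister criticised ambassador", vocab)
print(v)  # three counts of 1, followed by the averaged embedding
```

The bigram counts capture term combinations like "minister criticised" (the who-said-what-about-whom relations), while the averaged embedding contributes the latent semantic signal; any standard classifier can then be trained on the concatenated vector.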

    ON FEATURE EXTRACTION FOR ENGLISH HOLY QURAN TAFSEER TEXT CLASSIFICATION

    Numerous previous works have classified text corpora by topic, sentiment, genre, or author. This work investigates a different kind of text corpus: the tafseer of Holy Quran verses by Al-Jalalayn. The Holy Quran dataset was selected as the corpus for this study because its content is sometimes difficult to separate even for a human judge. The number of distinctive words is small, while the number of noise words is relatively high. A further challenge in classifying the Holy Quran is that some verses have implicit meaning. To overcome the inability to recognise implicit meaning in the text, the WordNet thesaurus is used in a semantic similarity approach. In this research, several processes were performed to classify a document: pre-processing, feature extraction, semantic weighting, classifier training, and evaluation. Feature extraction produced the following features: Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), Part-of-Speech Tagging (POSTAG), and Bigram. The proposed method performs a weight calculation called document-to-class semantic similarity. The measure used in this calculation combines the Wu and Palmer (WUP) method and the shortest-path semantic similarity method, with minor modifications. This was followed by classifier training, in which classification was performed using a modified Multinomial Naive Bayes classifier: the likelihood probability is modified using the weights from the preceding document-to-class semantic similarity step. During evaluation, we assessed classifier performance on the Holy Quran dataset we created. For comparison, we also used an Amazon review dataset, a Yelp review dataset, and an IMDB review dataset. The measures used in the evaluation were Accuracy, Precision, Recall, and F1-Measure.
The F1-Measure for the Holy Quran dataset using the feature combination of POSTAG, BIGRAM, and TF was 60.5%; for POSTAG, BIGRAM, and TF-IDF it was 58.6%; and for POSTAG, BIGRAM, and the proposed weighted TF it was 66.4%.
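As a sketch of the WUP component of the document-to-class similarity described above: Wu-Palmer similarity scores two concepts by the depth of their lowest common subsumer (LCS) relative to their own depths, WUP(a, b) = 2 * depth(LCS) / (depth(a) + depth(b)). The toy is-a hierarchy below stands in for WordNet, and this shows only the plain WUP part, not the authors' combined and modified measure:

```python
# A toy is-a hierarchy; a real system would query WordNet (e.g. via NLTK).
PARENT = {"dog": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal",
          "animal": "entity"}

def path_to_root(node):
    """Return the chain of hypernyms, e.g. dog -> canine -> animal -> entity."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def depth(node):
    return len(path_to_root(node))        # the root has depth 1

def wup(a, b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("dog", "cat"))  # LCS is "animal": 2*2 / (4+4) = 0.5
```

In the paper's pipeline, scores of this kind are aggregated into a document-to-class weight that then rescales the likelihood term of the Multinomial Naive Bayes classifier.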

    Towards the Automatic Classification of Documents in User-generated Classifications

    There is a huge amount of information scattered across the World Wide Web. Because information flows through the WWW at high speed, it needs to be organised so that users can access it easily. Previously, information was generally organised manually by matching document contents to pre-defined categories. There are two approaches to this text-based categorisation: manual and automatic. In the manual approach, a human expert performs the classification task; in the automatic case, supervised classifiers are used to classify resources. In supervised classification, manual interaction is required to create training data before the automatic classification task takes place. In our new approach, we propose the automatic classification of documents through semantic keywords and the generation of formulas from these keywords. We can thus reduce human participation by combining the knowledge of a given classification with the knowledge extracted from the data. The main focus of this PhD thesis, supervised by Prof. Fausto Giunchiglia, is the automatic classification of documents into user-generated classifications. The key benefits foreseen from this automatic document classification relate not only to search engines but also to many other fields, such as document organization, text filtering, and semantic index management.
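In its simplest form, the keyword-based approach outlined above could assign a document to the category whose semantic keywords it matches most often. The categories and keyword sets below are invented for illustration and do not come from the thesis:

```python
# Each user-generated category carries a set of semantic keywords
# (illustrative only); a document goes to the best-matching category.
CATEGORIES = {
    "travel":  {"flight", "hotel", "itinerary", "visa"},
    "cooking": {"recipe", "oven", "ingredient", "bake"},
}

def classify(text):
    tokens = set(text.lower().split())
    scores = {c: len(tokens & kws) for c, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)

print(classify("a simple bread recipe to bake in the oven"))  # cooking
```

No labelled training documents are needed here: the category definitions themselves supply the classification knowledge, which is the kind of reduction in human participation the abstract aims for.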