
    Hybrid Technique for Arabic Text Compression

    Arabic content on the Internet and other digital media is growing exponentially, and the number of Arab users of these media has increased more than 20-fold over the past five years. There is a real need to reduce the space allocated to this content and to allow more efficient usage, searching, and information-retrieval operations on it. Techniques borrowed from other languages, and general-purpose data compression techniques that ignore the distinctive features of Arabic, have had limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of the Arabic language to improve the compression ratio of Arabic texts. This technique works in phases. In the first phase, the text file is split into four different files using a multilayer model-based approach. In the second phase, each of these four files is compressed using the Burrows-Wheeler compression algorithm.
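    The abstract does not detail the multilayer model used for the split, so the sketch below is only a minimal Python illustration under assumptions: the four streams are approximated by a simple character-class partition (Arabic letters, diacritics, digits, other), and Python's bz2 module stands in for a Burrows-Wheeler-based codec (bzip2 applies a BWT internally).

```python
import bz2

def split_streams(text: str) -> dict:
    """Hypothetical stand-in for the paper's multilayer model-based split.

    The actual four streams are not specified in the abstract; here we
    illustrate with a letters/diacritics/digits/other partition.
    """
    streams = {"letters": [], "marks": [], "digits": [], "other": []}
    for ch in text:
        if "\u0621" <= ch <= "\u064A":      # Arabic letters
            streams["letters"].append(ch)
        elif "\u064B" <= ch <= "\u0652":    # Arabic diacritics (harakat)
            streams["marks"].append(ch)
        elif ch.isdigit():
            streams["digits"].append(ch)
        else:
            streams["other"].append(ch)
    return {name: "".join(chars) for name, chars in streams.items()}

def compress_streams(text: str) -> dict:
    # Phase 2: compress each stream separately. bz2 (bzip2) serves as a
    # readily available Burrows-Wheeler-based compressor.
    return {name: bz2.compress(s.encode("utf-8"))
            for name, s in split_streams(text).items()}

if __name__ == "__main__":
    sample = "مرحبا بالعالم 123"
    for name, blob in compress_streams(sample).items():
        print(name, len(blob), "bytes")
```

    The intuition behind splitting before compressing is that each homogeneous stream has lower entropy than the mixed text, which the BWT stage can exploit.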

    Deep Learning Based Abstractive Text Summarization: Approaches, Datasets, Evaluation Measures, and Challenges

    In recent years, the volume of textual data has rapidly increased, generating a valuable resource for extracting and analysing information. To retrieve useful knowledge within a reasonable time period, this information must be summarised. This paper reviews recent approaches to abstractive text summarisation using deep learning models. In addition, existing datasets for training and validating these approaches are reviewed, and their features and limitations are presented. The Gigaword dataset is commonly employed for single-sentence summary approaches, while the Cable News Network (CNN)/Daily Mail dataset is commonly employed for multi-sentence summary approaches. Furthermore, the measures used to evaluate summarisation quality are investigated, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1, ROUGE-2, and ROUGE-L) is determined to be the most commonly applied family of metrics. The challenges encountered during the summarisation process and the solutions proposed in each approach are analysed. The analysis of these approaches shows that recurrent neural networks with an attention mechanism and long short-term memory (LSTM) are the most prevalent techniques for abstractive text summarisation. The experimental results show that text summarisation with a pretrained encoder model achieved the highest ROUGE-1, ROUGE-2, and ROUGE-L values (43.85, 20.34, and 39.9, respectively). Furthermore, most abstractive text summarisation models face challenges such as the unavailability of a golden token at testing time, out-of-vocabulary (OOV) words, summary-sentence repetition, inaccurate sentences, and fake facts.
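    As a concrete illustration of the evaluation setup described above, the snippet below computes ROUGE-1, ROUGE-2, and ROUGE-L with the open-source rouge-score package. This is one common implementation, not necessarily the one used in the reviewed papers, and the reference/candidate strings are invented for the example.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Score a candidate summary against a reference using the three metrics
# highlighted in the review: ROUGE-1, ROUGE-2, and ROUGE-L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

reference = "the cabinet approved the new budget on monday"
candidate = "cabinet approves new budget"

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # Each result carries precision, recall, and F-measure.
    print(f"{name}: P={result.precision:.3f} "
          f"R={result.recall:.3f} F1={result.fmeasure:.3f}")
```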

    SemanticGraph2Vec: Semantic graph embedding for text representation

    Graph embedding is an important representational technique that aims to preserve the structure of a graph while learning low-dimensional representations of its vertices. Semantic relationships between vertices carry essential information about the meaning of the represented graph, yet most graph embedding methods do not consider these relationships during the learning process. In this paper, we propose a novel semantic graph embedding approach called SemanticGraph2Vec. SemanticGraph2Vec learns mappings of vertices into a low-dimensional feature space that preserves the most important semantic relationships between graph vertices. The proposed approach extends and enhances prior work based on random walks over graph vertices by using semantic walks instead of random walks, which yields more useful embeddings for text graphs. A set of experiments is conducted to evaluate the performance of SemanticGraph2Vec, which is employed on a part-of-speech tagging task. Experimental results demonstrate that SemanticGraph2Vec outperforms two state-of-the-art baseline methods in terms of precision and F1 score.
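    The abstract does not specify how semantic walks are generated, so the following is only a minimal sketch under assumptions: edges carry hypothetical relation labels (subject/object/modifier), a walk greedily follows the highest-priority relation at each step, and the resulting walks are fed to gensim's skip-gram Word2Vec in the DeepWalk/node2vec style that the paper extends.

```python
# pip install networkx gensim
import networkx as nx
from gensim.models import Word2Vec

def semantic_walks(graph: nx.DiGraph, walk_len: int = 10):
    # Hypothetical semantic walk: instead of sampling neighbours uniformly
    # (as in DeepWalk/node2vec), follow the edge whose semantic relation
    # label has the highest priority. The labels below are assumptions.
    priority = {"subject": 0, "object": 1, "modifier": 2}
    walks = []
    for start in graph.nodes:
        walk, node = [start], start
        for _ in range(walk_len - 1):
            neighbours = list(graph.successors(node))
            if not neighbours:
                break
            # Pick the neighbour reached via the highest-priority relation.
            node = min(neighbours,
                       key=lambda n: priority.get(
                           graph.edges[node, n].get("rel", ""), 99))
            walk.append(node)
        walks.append(walk)
    return walks

# Toy text graph with semantic relations stored on the edges.
g = nx.DiGraph()
g.add_edge("dog", "bites", rel="subject")
g.add_edge("bites", "man", rel="object")
g.add_edge("dog", "brown", rel="modifier")

# Feed the walks to skip-gram, as in DeepWalk-style embedding training.
model = Word2Vec(sentences=semantic_walks(g),
                 vector_size=32, window=2, min_count=1, sg=1)
print(model.wv["dog"][:5])
```

    The design point is that the walk generator, not the embedding model, is what changes: by making walks follow semantic relations, vertices that play similar semantic roles end up in similar walk contexts and hence receive similar vectors.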