
    Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages

    High-dimensional data is known to reduce the efficiency of clustering algorithms, and clustering Arabic text is particularly challenging because capturing its semantics requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential for identifying the appropriate features and reducing the high-dimensional space. A suitable design for feature selection and reduction is needed to produce a more relevant, meaningful and compact representation of Arabic texts that eases the clustering process. This research developed three methods for analyzing the features of Arabic Web text. The first is a hybrid feature selection method that selects the informative term representation within the Arabic Web pages; it incorporates three feature selection methods, Chi-square, Mutual Information and Term Frequency–Inverse Document Frequency (TF-IDF), to build a hybrid model. The second is a latent document vectorization method that represents documents as probability distributions in the vector space, addressing high dimensionality by reducing the dimensional space; two document vectorizers, a Bayesian vectorizer and a semantic vectorizer, were implemented to extract the best features. The third is an Arabic semantic feature analysis that improves the capability of Arabic Web analysis and supports a clustering design that optimizes clustering ability when analyzing these Web pages, by overcoming the problems of term representation, semantic modeling and dimensional reduction. Experiments were carried out with k-means clustering on two different data sets. The methods reduced the high-dimensional data and identified the semantic features shared by similar Arabic Web pages grouped in the same cluster; pages were clustered according to the semantic similarities between them, yielding a small Davies–Bouldin index and high accuracy. This study contributes to research on clustering algorithms by developing three methods to identify the most relevant features of Arabic Web pages.
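    The hybrid selection and clustering steps above can be pictured with a short, generic sketch. The snippet below is not the paper's exact pipeline: it scores each term with Chi-square, Mutual Information and mean TF-IDF, merges the three rankings, keeps the top-ranked terms, then runs k-means and reports the Davies–Bouldin index. The toy documents, the rank-averaging combination, and the use of coarse topic labels for the supervised scores are all assumptions made for illustration.

```python
# Illustrative sketch of hybrid term scoring followed by k-means clustering.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

docs = [
    "stock market prices rise on trade news",
    "investors watch market shares and prices",
    "football team wins the league final match",
    "the coach praised the team after the match",
]
topics = np.array([0, 0, 1, 1])  # coarse labels used only to score terms

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Three per-term scores, each turned into a rank so they can be combined.
chi_scores, _ = chi2(X, topics)
mi_scores = mutual_info_classif(X, topics, discrete_features=True)
tfidf_scores = np.asarray(X.mean(axis=0)).ravel()

def to_rank(scores):
    return np.argsort(np.argsort(scores))  # higher score -> higher rank

hybrid = to_rank(chi_scores) + to_rank(mi_scores) + to_rank(tfidf_scores)
top_k = np.argsort(hybrid)[-10:]  # keep the 10 best-ranked terms

X_reduced = X[:, top_k].toarray()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_reduced)
print("Davies-Bouldin:", davies_bouldin_score(X_reduced, km.labels_))
```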

    Arabic web page clustering: a review

    Clustering is the method employed to group Web pages containing related information into clusters, which facilitates locating relevant information. Clustering performance depends largely on the characteristics of the text features. The Arabic language has a complex morphology and is highly inflected, so selecting appropriate features has a strong positive effect on clustering performance. Many studies have addressed the clustering problem for Web pages with Arabic content. There are three main challenges in applying text clustering to Arabic Web page content. The first is identifying significant term features that represent the original content while taking hidden knowledge into account. The second is reducing data dimensionality without losing essential information. The third is designing a suitable model for clustering Arabic text that can improve clustering performance. This paper presents an overview of existing Arabic Web page clustering methods, with the goals of clarifying existing problems and examining feature selection and reduction techniques for solving clustering difficulties. In line with the objectives and scope of this study, the present research is a joint effort to improve feature selection and vectorization frameworks in order to enhance current text analysis techniques that can be applied to Arabic Web pages.
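    As a generic illustration of the second challenge (reducing dimensionality without losing essential information), the sketch below applies a common off-the-shelf approach, TF-IDF followed by truncated SVD (latent semantic analysis), rather than any specific method from the surveyed works; the toy documents and the component count are placeholders.

```python
# Generic dimensionality-reduction sketch: TF-IDF then truncated SVD (LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "arabic web pages discuss news and politics",
    "news portals publish arabic political articles",
    "recipes describe cooking steps and ingredients",
    "cooking sites list ingredients for each recipe",
]

lsa = make_pipeline(
    TfidfVectorizer(),                             # high-dimensional sparse term space
    TruncatedSVD(n_components=2, random_state=0),  # project onto 2 latent dimensions
)
X_low = lsa.fit_transform(docs)
print(X_low.shape)  # (4, 2): each page is now a dense 2-dimensional vector
```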

    A survey on extremism analysis using natural language processing: definitions, literature review, trends and challenges

    Extremism has grown into a global problem for society in recent years, especially after the emergence of movements such as jihadism. These and other extremist groups have taken advantage of different channels, such as social media, to spread their ideology, promote their acts and recruit followers. The extremist discourse is therefore reflected in the language used by these groups. Natural language processing (NLP) provides a way of detecting this type of content, and several authors use it to describe and discriminate the discourse of these groups, with the final objective of detecting and preventing its spread. Following this approach, this survey reviews the contributions of NLP to the field of extremism research, providing the reader with a comprehensive picture of the state of the art in this research area. The content includes a first conceptualization of the term extremism, the elements that compose an extremist discourse, and how it differs from related terms. After that, the frequently used NLP techniques are described and compared, including how they were applied, the insights they provided, the most frequently used NLP software tools, descriptive and classification applications, and the availability of datasets and data sources for research. Finally, research questions are approached and answered with highlights from the review, while future trends, challenges and directions derived from these highlights are suggested to stimulate further research in this area. Open Access funding was provided thanks to the CRUE-CSIC agreement with Springer Nature.

    Text Classification Using Novel Term Weighting Scheme-Based Improved TF-IDF for Internet Media Reports

    With the rapid development of internet technology, large amounts of internet text data have become available. Text classification (TC) plays a very important role in processing massive text data, but classification accuracy is directly affected by the performance of term weighting. Because term frequency-inverse document frequency (TF-IDF) was originally designed for information retrieval (IR), it is not effective enough for TC, especially for text data with unbalanced distributions such as internet media reports. Therefore, the variance between the DF value of a particular term and the average of all DFs, namely the document frequency variance (ADF), is proposed to improve the handling of text data with unbalanced distributions. The standard TF-IDF is then modified with the proposed ADF in four different ways, namely TF-IADF, TF-IADF+, TF-IADFnorm, and TF-IADF+norm, for processing unbalanced text collections. As a result, an effective model can be established for the TC task on internet media reports. A series of simulations was carried out to evaluate the performance of the proposed methods. Compared with TF-IDF on state-of-the-art classification algorithms, the effectiveness and feasibility of the proposed methods are confirmed by the simulation results.
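    The core quantity described above, the deviation of a term's document frequency (DF) from the average DF, can be sketched directly. The snippet below is only one plausible reading of that idea and does not reproduce the paper's TF-IADF formulas; the squared-deviation form and the toy corpus are assumptions made for illustration.

```python
# Plain TF-IDF plus an ADF-style deviation of DF from the mean DF (illustrative only).
import numpy as np

docs = [
    ["economy", "market", "growth"],
    ["market", "shares", "market"],
    ["flood", "rescue", "market"],
]
vocab = sorted({t for d in docs for t in d})
n_docs = len(docs)

# Document frequency of each term and its deviation from the average DF.
df = np.array([sum(term in d for d in docs) for term in vocab], dtype=float)
adf = (df - df.mean()) ** 2  # ASSUMPTION: squared deviation, not the paper's formula

def tf_idf(doc):
    counts = np.array([doc.count(t) for t in vocab], dtype=float)
    tf = counts / counts.sum()
    idf = np.log(n_docs / df)
    return tf * idf

for d in docs:
    print(np.round(tf_idf(d), 3))
print("per-term DF deviation (ADF-style):", np.round(adf, 3))
```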

    Question Answering System: A Review on Question Analysis, Document Processing, and Answer Extraction Techniques

    A question answering system can automatically provide an answer to a question posed by a human in natural language. Such a system consists of question analysis, document processing, and answer extraction modules. The question analysis module translates the query into a form that can be processed by the document processing module. Document processing identifies candidate documents containing answers relevant to the user query. The answer extraction module then receives the set of passages from the document processing module and determines the best answers to return to the user. The challenge in optimizing a question answering framework is to increase the performance of all modules in the framework; modules that have not been optimized lead to less accurate answers. Based on these issues, the objective of this study is to review the current state of question analysis, document processing, and answer extraction techniques. Results from this study reveal potential research issues, namely morphology analysis, question classification, and term weighting algorithms for question classification.
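    The three modules can be pictured with a toy end-to-end skeleton. Everything below (the keyword-overlap retrieval, the prefix-based question typing, the sample documents) is a simplified stand-in for illustration, not a technique drawn from the reviewed systems.

```python
# Toy question answering pipeline: question analysis -> document processing -> answer extraction.
import re

DOCS = [
    "Kuala Lumpur is the capital of Malaysia.",
    "The Nile is the longest river in Africa.",
]

def analyse_question(question):
    # Keep longer content words as the query; tag a coarse expected answer type.
    terms = [w.lower() for w in re.findall(r"\w+", question) if len(w) > 3]
    qtype = "LOCATION" if question.lower().startswith(("where", "what is the capital")) else "OTHER"
    return terms, qtype

def retrieve(terms, docs):
    # Rank documents by how many query terms they contain; return the best one.
    return max(docs, key=lambda d: sum(t in d.lower() for t in terms))

def extract_answer(passage, terms):
    # Return the sentence sharing the most terms with the question.
    sentences = re.split(r"(?<=\.)\s+", passage)
    return max(sentences, key=lambda s: sum(t in s.lower() for t in terms))

terms, qtype = analyse_question("What is the capital of Malaysia?")
passage = retrieve(terms, DOCS)
print(qtype, "->", extract_answer(passage, terms))
```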

    Being Young in Arab Detroit: Media and Identity in Post-9/11 America.

    More than ten years after the events of 9/11, the Arab American community of the Dearborn and Detroit, Michigan, area continues to feel the effects of the nation's intense scrutiny of their lives and identities. As formal government and informal communal surveillance, threats of violence and deportation, and general anxieties have escalated, the Arab Detroit community has been at the center of various efforts to understand Arab and Muslim Americans. Through an engagement with post-9/11 national news media discourses, participant-observation work at the Arab American National Museum (founded in 2005 and located in Dearborn, MI), interviews and focus groups with Arab American youth, and digital ethnography of Arab American youths' online cultural productions, this dissertation examines what it means to be young and Arab American in Dearborn. Moving beyond well-worn distinctions between mainstream and grassroots media, it examines news discourse and television programming in relation to various media produced and circulated by Arab American youth. In doing so, this dissertation contributes to scholarship in a number of related areas, including media representations of race and ethnicity, geography and identity, the increasingly vexed relationship between race and religion, and youth culture.
    PhD, Communication Studies, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111598/1/cjhaddad_1.pd

    Towards Designing a Multipurpose Cybercrime Intelligence Framework

    With the widespread reach of the Internet and the increasing popularity of social networks that provide prompt and easy communication, several criminal and radical groups have adopted them as a medium of operation. Existing literature in the area of cybercrime intelligence focuses on several research questions and addresses them with multiple methods, using techniques such as social network analysis. In this paper, we study the broad state-of-the-art research in cybercrime intelligence in order to identify existing research gaps. Our core aim is designing and developing a multipurpose framework that is able to fill these gaps using a wide range of techniques. We present an outline of a framework designed to aid law enforcement in detecting, analysing and making sense of cybercrime data.
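    As a small illustration of the kind of social network analysis the paper cites as a common cybercrime-intelligence technique, the sketch below computes betweenness centrality on a toy interaction graph with networkx; the graph and the choice of measure are placeholders, not part of the proposed framework.

```python
# Toy social network analysis: rank actors in a "who messages whom" graph.
import networkx as nx

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
g = nx.Graph(edges)

# Betweenness centrality highlights actors that bridge otherwise separate groups.
centrality = nx.betweenness_centrality(g)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```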

    Saad Elkhadem's The Plague in English: A Study of the Translation Strategies used to Recreate the Egyptian Ethos

    This thesis focuses on translation as a transcultural activity. It studies the foreignizing and domesticating translation strategies used to recreate the Egyptian ethos in the translation of Elkhadem's The Plague from Arabic to English. Five theories are incorporated in the analysis. These are Venuti's Domesticating and Foreignizing Theory; Toury's DTS; Genette's Paratexts; Pedersen's taxonomy of strategies for rendering culture-bound references and his classification of culture-bound elements; and Vermeer's Skopos Theory. Three types of analysis are conducted: a literary analysis of the source text; a microanalysis of the target text, further divided into an analysis of the novel's paratexts and a descriptive analysis of ninety-eight culture-bound references; and finally, a macro-analysis of the overall norms and of the skopos of the translation, showing how both affect the transmission of the Egyptian ethos. Overall, this thesis provides some insight into the influence of translation on cultural identity.