36 research outputs found

    Probabilistic Reference to Suspect or Victim in Nationality Extraction from Unstructured Crime News Documents

    There is valuable information in unstructured crime news documents that crime analysts must otherwise search for manually. Several information extraction models have been implemented to address this, all of which leave room for improvement. This gap motivates an enhanced information extraction model that uses named entity recognition to extract nationality mentions from crime news documents and coreference resolution to associate each nationality with either the suspect or the victim. After the proposed model extracts a nationality, it references it to the suspect or victim by looking up all victim-related and suspect-related keywords in the text, along with their distances from the position of the nationality keyword. Based on these total distances, a probability score algorithm decides whether the nationality is more likely to belong to the victim or the suspect. Two experiments were conducted to evaluate the model's nationality extractor component and reference identification component. The former achieved 90%, 94%, and 91% for precision, recall, and F-measure respectively; the latter achieved 65%, 68%, and 66%. Overall, the model achieved promising results. Keywords: information extraction, named entity recognition, coreference resolution, crime domain
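    The distance-based reference step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the keyword lists, the inverse-distance weighting, and the normalization into a probability are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a distance-based reference scorer. The role keyword
# lists and the scoring formula are illustrative assumptions, not the
# paper's exact algorithm.

def reference_probability(tokens, nationality_idx, suspect_kw, victim_kw):
    """Score whether a nationality mention refers to the suspect or victim.

    Closer role keywords contribute more weight (inverse token distance),
    and the weights are normalized into a probability per role.
    """
    def total_weight(keywords):
        # Sum of inverse token distances from the nationality mention
        return sum(
            1.0 / abs(i - nationality_idx)
            for i, tok in enumerate(tokens)
            if tok.lower() in keywords and i != nationality_idx
        )

    w_suspect = total_weight(suspect_kw)
    w_victim = total_weight(victim_kw)
    total = w_suspect + w_victim
    if total == 0:
        return {"suspect": 0.5, "victim": 0.5}  # no evidence either way
    return {"suspect": w_suspect / total, "victim": w_victim / total}


text = "Police said the suspect , a Nigerian man , attacked the victim".split()
probs = reference_probability(text, text.index("Nigerian"),
                              suspect_kw={"suspect", "accused"},
                              victim_kw={"victim", "deceased"})
```

    Here "Nigerian" sits closer to "suspect" than to "victim", so the suspect receives the larger share of the probability mass.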

    Text Classification for Authorship Attribution Using Naive Bayes Classifier with Limited Training Data

    Authorship attribution (AA) is the task of identifying the authors of disputed or anonymous texts. It can be seen as a single-label, multi-class text classification task, concerned with writing style rather than topic. The scalability issue in traditional AA studies concerns the effect of data size, i.e. the amount of data per candidate author. This has not yet been probed in much depth, since most stylometry research focuses on long texts or multiple short texts per author, because stylistic choices occur less frequently in very short texts. This paper investigates authorship attribution on short historical Arabic texts written by 10 different authors. Several experiments are conducted on these texts by extracting various lexical and character features of each author's writing style, using word-level (1, 2, 3, and 4) and character-level (1, 2, 3, and 4) n-grams as text representations. A Naive Bayes (NB) classifier is then employed to attribute the texts to their authors, demonstrating the robustness of NB for AA on very short texts compared with Support Vector Machines (SVMs). Using a dataset (called AAAT) consisting of 3 short texts per author's book, our method is shown to be at least as effective as Information Gain (IG) for selecting the most significant n-grams. Moreover, the significance of punctuation marks for distinguishing between authors is explored, showing that they can increase performance. The NB classifier achieved high accuracy: experiments on the AAAT dataset reach a best classification accuracy of 96% using word-level 1-grams. Keywords: Authorship attribution, Text classification, Naive Bayes classifier, Character n-grams features, Word n-grams features
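    The core of the approach, a multinomial Naive Bayes classifier over character n-gram counts, can be sketched in a few lines. This is a toy illustration under assumed data: the training texts below are invented placeholders, not the AAAT dataset, and the bigram size is one of the n-gram settings the paper explores.

```python
# Toy multinomial Naive Bayes over character n-gram counts, sketching the
# paper's approach; the training texts and n-gram size are illustrative.
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesAA:
    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # author -> n-gram counts
        self.doc_counts = Counter()         # author -> number of texts

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            self.counts[author].update(char_ngrams(text, self.n))
            self.doc_counts[author] += 1
        self.vocab = {g for c in self.counts.values() for g in c}

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for author, counts in self.counts.items():
            # log prior + Laplace-smoothed log likelihood of each n-gram
            score = math.log(self.doc_counts[author] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for g in char_ngrams(text, self.n):
                score += math.log((counts[g] + 1) / denom)
            scores[author] = score
        return max(scores, key=scores.get)


model = NaiveBayesAA(n=2)
model.fit(
    ["alas and alack the moon", "alack and alas again",
     "zigzag puzzle pieces", "buzzing jazz puzzle"],
    ["A", "A", "B", "B"],
)
```

    Laplace smoothing keeps unseen n-grams from zeroing out a score, which matters with so little training text per author.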

    Normalized Google Distance for Collocation Extraction from Islamic Domain

    This study investigates the properties of Arabic collocations and classifies them according to their structural patterns in the Islamic domain. Based on linguistic information, the patterns and variations of the collocations have been identified. Then, a system that extracts collocations from the Islamic domain based on statistical measures is described. For candidate ranking, the normalized Google distance has been adapted to measure the associations between the words in the candidate set. Finally, n-best evaluation, which selects the n-best lists for each association measure, has been used to annotate all candidates in these lists manually. The following association measures have been utilized in the candidate ranking step for comparison with the normalized Google distance in Arabic collocation extraction: log-likelihood ratio, t-score, mutual information, and enhanced mutual information. In the experiments of this work, the normalized Google distance achieved the highest precision value, 93%, compared with the other association measures. This strengthens our motivation to use the normalized Google distance to measure the relatedness between the constituent words of collocations instead of the frequency-based association measures used in state-of-the-art methods. Keywords: normalized Google distance, collocation extraction, Islamic domain
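    The normalized Google distance itself is computed from page (or document) counts. The sketch below uses the standard formula; the counts passed in are illustrative placeholders, not figures from this study.

```python
# Normalized Google Distance computed from occurrence counts:
#   NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
#               / (log N - min(log f(x), log f(y)))
# where f(x) and f(y) are the counts of pages containing each term,
# f(x, y) the count of pages containing both, and N the index size.
import math

def ngd(fx, fy, fxy, n_pages):
    """NGD from counts; smaller values mean more strongly associated words."""
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return ((max(log_fx, log_fy) - log_fxy) /
            (math.log(n_pages) - min(log_fx, log_fy)))


# Illustrative: a strongly co-occurring pair scores lower (closer) than a
# pair that rarely appears together.
close = ngd(9000, 8000, 6000, 10**9)
far = ngd(9000, 8000, 5, 10**9)
```

    For candidate ranking, each collocation candidate would be scored this way and the candidates sorted by ascending distance.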

    i-JEN: Visual interactive Malaysia crime news retrieval system

    Supporting crime news investigation involves a mechanism to help monitor the current and past status of criminal events. We believe this could be well facilitated by focusing on the user interface and the crime event model. In this paper we discuss the development of the Visual Interactive Malaysia Crime News Retrieval System (i-JEN) and describe the approach, planned user studies, the system architecture, and future plans. Our main objectives are to construct crime-based events; investigate the use of crime-based events in improving classification and clustering; develop an interactive crime news retrieval system; visualize crime news in an effective and interactive way; integrate these into a usable and robust system; and evaluate its usability and performance. The system will serve as a news monitoring system that automatically organizes, retrieves, and presents crime news in a way that supports effective monitoring, searching, and browsing for the target user groups: the general public, news analysts, and police or crime investigators. The study will contribute to a better understanding of crime data consumption in the Malaysian context, as well as a system with visualization features to address crime data, with the eventual goal of combating crime

    Recognition of Sarcasm in Tweets Based on Concept Level Sentiment Analysis and Supervised Learning Approaches


    GIS-based urban village regional fire risk assessment and mapping

    Fires in residential areas are one of the 13 disaster threats recognized in Indonesia. Classified by cause, they are disasters resulting from human negligence. This research aims to identify residential fire incidents, assess fire risk levels, and map the risk level. We used a geographic information system (GIS) analysis approach and direct observation of the study area. The research location was the Tamansari sub-district in Bandung City, which consists of 20 neighborhood units (rukun warga/RW) with 22,995 people and 6,598 households. We conducted a field survey from December 2019 to March 2020 and used a spatial approach, with GIS, to map urban-village fire incidents and assess the risk level. Four fire hazard variables were used: population density, building density, building quality, and road class. The vulnerability variables are based on the community's social parameters: population density, percentage of elderly people and children under five, people with disabilities, and the population's sex ratio. Overlaying the hazard and vulnerability maps showed three RWs with a high risk of residential fire, eight RWs with moderate risk, and nine RWs with low risk. Areas in the low-risk category must remain vigilant because the roads there are relatively narrow
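    The hazard-vulnerability overlay step can be sketched as a per-RW score combination. This is a minimal illustration under assumed conventions: the ordinal 1-3 scores, the multiplication, and the class thresholds are placeholders, not the study's actual weighting, and the RW scores below are invented.

```python
# Minimal sketch of a hazard x vulnerability overlay classifying each
# neighborhood unit (RW). Scores, thresholds, and RW values are
# illustrative assumptions, not the study's data.

def risk_class(hazard, vulnerability):
    """Combine ordinal hazard and vulnerability scores (1=low .. 3=high)."""
    score = hazard * vulnerability
    if score >= 6:
        return "high"
    if score >= 3:
        return "moderate"
    return "low"

# Overlay the two scores per RW, as the hazard and vulnerability maps do.
rw_scores = {"RW01": (3, 3), "RW02": (2, 2), "RW03": (1, 2)}
rw_risk = {rw: risk_class(h, v) for rw, (h, v) in rw_scores.items()}
```

    In practice each variable (building density, road class, and so on) would first be scored and weighted per RW in the GIS before this final overlay.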

    A Comprehensive Instrument for Measuring Knowledge Management System Satisfaction

    This paper measures user satisfaction with Knowledge Resources for Science and Technology Excellence Malaysia (KRSTE.my), a medium for managing knowledge in Science, Technology and Innovation (STI), among its registered users. As a Knowledge Management System (KMS), KRSTE.my collects STI-related material, provides a platform for community collaboration and discussion, and receives the latest inventions in STI. This study proposes an integrated instrument for the empirical evaluation of user satisfaction with a KMS, consolidating factors from several instruments developed by previous researchers. The result is a comprehensive instrument for measuring user satisfaction with a knowledge management system. The instrument consists of six knowledge factors, namely content, map, manipulation, community, usefulness, and security, which measure the level of user satisfaction with the system, and includes 22 items measuring user satisfaction with KRSTE.my. A total of 271 Malaysian registered subscribers who have accessed the system took part in the study. Quantitative research methods were employed in a data collection process conducted over seven weeks. Statistical analysis was used to determine the significant factors that measure user satisfaction with KRSTE.my. The results indicate that the instrument is reliable, with all items measuring the six dimensions correlated. The findings show that knowledge content and knowledge map give a high level of satisfaction based on mean scores, while knowledge security, knowledge manipulation, knowledge usefulness, and knowledge community are at a moderate level of satisfaction. Overall, user satisfaction with KRSTE.my is high, with a mean score of 3.49 (out of a maximum of 5). This study makes an important contribution in determining the level of user satisfaction with KRSTE.my as a KMS and, in addition, produced a reliable instrument

    A review of Arabic text recognition dataset

    Building a robust Optical Character Recognition (OCR) system for languages with cursive scripts, such as Arabic, has always been challenging. These challenges increase if the text contains diacritics of different sizes on characters and words. Apart from the complexity of the fonts used, these challenges must be addressed in recognizing the text of the Holy Quran. To solve them, an OCR system has to go through several phases, and each problem must be addressed with a different approach; researchers are therefore studying these challenges and proposing various solutions. This motivated the present study to review Arabic OCR datasets, because the dataset plays a major role in determining the nature of an OCR system. State-of-the-art approaches in segmentation and recognition employ recurrent neural networks (Long Short-Term Memory, LSTM, and Gated Recurrent Unit, GRU) with Connectionist Temporal Classification (CTC), including deep learning models and implementations of the GRU in the Arabic domain. This paper contributes by profiling Arabic text recognition datasets, thus characterizing the OCR systems developed, and identifies research directions in building Arabic text recognition datasets