5,299 research outputs found

    Feature extraction using regular expression in detecting proper noun for Malay news articles based on KNN algorithm

    No abstract. Keywords: data mining; named entity recognition; regular expression; natural language processing

    An Enhanced Malay Named Entity Recognition Using Clustering and Classification Approach For Crime Textual Data Analysis

    Named Entity Recognition (NER) is one of the tasks in information extraction. NER extracts and classifies words or entities that belong to the proper noun category in text data, such as a person's name, location, organization, date, etc. Today, online sources such as web pages, blogs, Facebook, Twitter, Instagram and online newspapers are among the major sources for information extraction. These resources contain various types of unstructured data such as text. However, the amount of work done to process this type of data is limited for Malay Named Entity Recognition (MNER). The lack of Malay text analytics has made it difficult to extract information for decision making. This research presents a Malay Named Entity Recognition technique that focuses on the analysis of crime data in the Malay language, extracted from the Polis Diraja Malaysia (PDRM) news web page. The proposed MNER technique uses multi-staged clustering and classification methods, namely Fuzzy C-Means and the K-Nearest Neighbors algorithm. The methods involve multi-layer feature extraction to recognize entities such as person name, location, organization, date and crime type. The multi-staged technique obtained 95.24% accuracy in recognizing named entities for text analysis, particularly in Malay. The proposed technique can improve the accuracy of named entity recognition on crime data by selecting features suited to the Malay language.

    Projecting named entity tags from a resource rich language to a resource poor language

    Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented.
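
The string-similarity step described in this abstract can be illustrated with a minimal sketch. It assumes character bigrams and the standard Dice coefficient, 2|X ∩ Y| / (|X| + |Y|); the function names, the example lexicon, and the 0.6 threshold are hypothetical and are not taken from the paper, whose domain-specific rules are not reproduced here.

```python
from collections import Counter


def bigrams(word):
    """Return the multiset of character bigrams of a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(a, b):
    """Dice coefficient over character bigrams: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to have bigrams; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * sum((x & y).values()) / total


def project_tag(word, lexicon, threshold=0.6):
    """Return (tag, score) of the most similar lexeme, or (None, score).

    `lexicon` maps a lexeme to an entity tag (hypothetical structure);
    `threshold` is an assumed cut-off, not a value from the paper.
    """
    best_tag, best_score = None, 0.0
    for lexeme, tag in lexicon.items():
        score = dice(word, lexeme)
        if score > best_score:
            best_tag, best_score = tag, score
    if best_score < threshold:
        return None, best_score
    return best_tag, best_score


# Hypothetical lexicon for illustration only.
lexicon = {"Malaysia": "LOC", "Interpol": "ORG"}
tag, score = project_tag("Malaysian", lexicon)
```

Here `project_tag("Malaysian", lexicon)` matches the lexeme `Malaysia` because the two words share seven of their bigrams, while a word with no shared bigrams falls below the threshold and receives no tag.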

    Named Entity Recognition using Fuzzy C-Means Clustering Method for Malay Textual Data Analysis

    The Named Entity Recognition (NER) task is among the important tasks in analysing unstructured textual data as a way to gain important and valuable information from a text document. This task is very useful in Natural Language Processing (NLP) for analysing various languages with distinctive styles of writing, characteristics and word structures. Social media act as the primary source from which most information and unstructured textual data are obtained. In this paper, unstructured textual data were analysed through the NER task, focusing on the Malay language. The analysis investigated the impact of the set of text feature transformations used for recognising entities in unstructured Malay textual data with the fuzzy c-means method. It uses Bernama Malay news as a dataset and proceeds through several steps, namely pre-processing, text feature transformation, experiment and evaluation. In conclusion, the overall accuracy based on cluster matching was markedly good at 98.57%. This accuracy was derived from the precision and recall evaluation of both classes. The precision for the NON_ENTITY class is 98.39% with 100.00% recall, whereas for the ENTITY class, the precision and recall are 100.00% and 88.97%, respectively.

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

    Indonesian and Malay are underrepresented in the development of natural language processing (NLP) technologies, and available resources are difficult to find. A clear picture of existing work can invigorate and inform how researchers conceptualise worthwhile projects. Using an education sector project to motivate the study, we conducted a wide-ranging overview of Indonesian and Malay human language technologies and corpus work. We charted 657 included studies according to Hirschberg and Manning's 2015 description of NLP, concluding that the field was dominated by exploratory corpus work, machine reading of text gathered from the Internet, and sentiment analysis. In this paper, we identify the most published authors and research hubs, and make a number of recommendations to encourage future collaboration and efficiency within NLP in Indonesian and Malay.

    Probabilistic Reference to Suspect or Victim in Nationality Extraction from Unstructured Crime News Documents

    There is valuable information in unstructured crime news documents that crime analysts must search for manually. To solve this issue, several information extraction models have been implemented, all of which can be enhanced. This gap motivates an enhanced information extraction model that uses named entity recognition to extract the nationality from crime news documents and coreference resolution to associate the nationality with either the suspect or the victim. After the proposed model extracts the nationality, it references it to the suspect or victim by looking up all of the victim-related keywords and the suspect-related keywords within the text, and their corresponding distances from the position of the nationality keyword. Based on their total distances, a probability score algorithm decides whether the nationality is more likely to belong to the victim or the suspect. Two experiments were conducted to evaluate the nationality extractor component and the reference identification component used by the model. The former experiment achieved 90%, 94%, and 91% for precision, recall, and F-measure respectively. The latter experiment achieved 65%, 68%, and 66% for precision, recall, and F-measure respectively. The model achieved promising results after evaluation. Keywords: information extraction, named entity recognition, coreference resolution, crime domain
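
The distance-based referencing idea in this abstract can be sketched as follows. The abstract does not give the scoring formula, so the normalisation used here (each role's score is the other role's share of the combined total distance, which means the closer role scores higher) is an assumption, as are the function names, keyword sets, and example sentence.

```python
def total_distance(anchor, positions):
    """Sum of token distances between the anchor index and each position."""
    return sum(abs(anchor - p) for p in positions)


def role_scores(tokens, nationality_idx, suspect_words, victim_words):
    """Score how likely the nationality at `nationality_idx` refers to each role.

    Assumed formula (not from the paper): the role whose keywords lie
    closer in total receives the larger share of the probability mass.
    Returns None when no role keyword occurs in the text.
    """
    suspect_pos = [i for i, t in enumerate(tokens) if t.lower() in suspect_words]
    victim_pos = [i for i, t in enumerate(tokens) if t.lower() in victim_words]
    if not suspect_pos and not victim_pos:
        return None
    if not suspect_pos:
        return {"suspect": 0.0, "victim": 1.0}
    if not victim_pos:
        return {"suspect": 1.0, "victim": 0.0}
    d_suspect = total_distance(nationality_idx, suspect_pos)
    d_victim = total_distance(nationality_idx, victim_pos)
    combined = d_suspect + d_victim
    if combined == 0:
        # Degenerate case: every keyword coincides with the nationality token.
        return {"suspect": 0.5, "victim": 0.5}
    return {"suspect": d_victim / combined, "victim": d_suspect / combined}


# Hypothetical example sentence and keyword sets.
tokens = "police arrested an Indonesian suspect after the victim was robbed".split()
scores = role_scores(
    tokens,
    nationality_idx=tokens.index("Indonesian"),
    suspect_words={"suspect", "arrested"},
    victim_words={"victim", "robbed"},
)
```

In this example the suspect keywords sit at a total distance of 3 tokens from the nationality and the victim keywords at 10, so the suspect role receives the higher score. Because the abstract speaks of total rather than average distance, a role with many keywords is penalised in this sketch; a real system may normalise differently.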

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.