28 research outputs found

    Automatic Identification of Close Languages – Case Study: Malay and Indonesian.

    Identifying the language of an unknown text is not a new problem, but what is new is the task of identifying close languages. Malay and Indonesian, like many other languages, are very similar, and it is therefore genuinely difficult to search, retrieve, classify, and above all translate texts written in one of the two languages

    The Notion Of Instrument In Malay Language.

    In Malay, the official language of Malaysia, the notion of instrument is expressed in five ways. In the first two expressions, the instrument noun is introduced by either the preposition dengan 'with' or the preposition melalui 'through, via': <X Z=Action {dengan, melalui} Y=Instrument> (e.g. Remaja pukul ibunya dengan batang paip 'An adolescent hit his mother with a pipe', menghantar bantahan melalui e-mel kepada Dr. X 'to send a protest through email to Dr. X')

    Which Extractive Summarization Method For Malay Texts?

    The number of texts written in Malay increases every day. When these texts are lengthy, interested readers tend to skim through them. Automatic text summarization may help these readers reach the important parts of the texts without scanning them from beginning to end. To date, only a few Malay text summarizers have been presented in the literature. Therefore, a comparative study of three extractive summarization methods (Luhn's method, Edmundson's method, and the LexRank method) was undertaken, and the results are reported in this paper. The aim of the study is to determine an adequate extractive method. Several experiments were conducted comparing the results of the three extractive methods with human extracts as well as human abstracts. It appears that Luhn's method, one of the oldest automatic extractive summarization methods, performs well when tested on 14 Malay abstract summaries and 20 Malay extractive summaries
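
    To make the idea behind Luhn-style extraction concrete, here is a minimal sketch, not the implementation evaluated in the paper. It assumes a simplified variant: words that occur at least `min_count` times and are not stop words count as significant, and each sentence is scored by its densest cluster of significant words (significant count squared, divided by cluster length). The stop-word list, thresholds, and function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only; a real system
# would need a proper Malay stop-word list.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "dengan", "pada"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def significant_words(text, min_count=2):
    """Non-stop words that occur at least min_count times in the whole text."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Score of the best cluster: (significant words in cluster)^2 / cluster length.

    A cluster starts and ends with a significant word and never contains
    more than max_gap consecutive insignificant words.
    """
    tokens = tokenize(sentence)
    positions = [i for i, w in enumerate(tokens) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            # Gap too large: close the current cluster and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 0
        count += 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    significant = significant_words(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: luhn_score(sentences[i], significant),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

    Calling `summarize(text, n_sentences=2)` returns the two best-scoring sentences in document order; the score thresholds would need tuning against human extracts such as those used in the study.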

    Design and Implementation of PIAK: A Personalized Internet Access System for Kids

    The Internet plays an important role in delivering information worldwide. However, not all of the huge amount of information available online is appropriate for children. This paper presents the design and implementation of PIAK, a Personalized Internet Access system for Kids. It aims to assist and teach children in using the Internet within a single, safe environment. PIAK features four personalized components: a cross-platform user interface, multilingual support, educative and assistive mediums, and web content filtering. Its design is based on children's needs as inferred from survey findings. This makes Internet access more appealing to children, as they can explore the Internet in a controlled environment

    Identifying And Classifying Unknown Words In Malay Texts.

    In this paper, we propose a method based on a chain of filters to handle the problem of identifying and classifying unknown words in Malay texts. A word is identified as unknown when it is not listed in the lexicon
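
    The lexicon test and the chain-of-filters idea can be illustrated with a short sketch. The abstract does not list the paper's actual filters, so the two filters below (numbers, capitalised proper-noun candidates), the mini-lexicon, and all names are hypothetical stand-ins chosen only to show the control flow: a token missing from the lexicon is passed through the filters in order, and the first one that matches assigns its class.

```python
import re

# Hypothetical mini-lexicon; a real system would load a full Malay lexicon.
LEXICON = {"saya", "makan", "nasi", "di", "rumah"}


def is_number(token):
    """True for digit strings such as '2', '1,000' or '3.5'."""
    return re.fullmatch(r"\d+([.,]\d+)*", token) is not None


def looks_like_proper_noun(token):
    """True when the token starts with an uppercase letter."""
    return token[:1].isupper()


# Filters are tried in order; these two are illustrative assumptions,
# not the filters used in the paper.
FILTERS = [
    ("number", is_number),
    ("proper-noun candidate", looks_like_proper_noun),
]


def classify_unknown(tokens, lexicon=LEXICON):
    """Return (token, label) pairs for every token not found in the lexicon."""
    result = []
    for token in tokens:
        if token.lower() in lexicon:
            continue  # known word: nothing to classify
        label = "unclassified"
        for name, predicate in FILTERS:
            if predicate(token):
                label = name
                break
        result.append((token, label))
    return result
```

    For example, `classify_unknown("Saya makan nasi di Kuching 2 kali".split())` skips the four lexicon words and labels the remaining three tokens. The order of the filters matters, since the first match wins.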

    Using TEI XML Schema to Encode the Structures of Sarawak Gazette

    Automatic extraction of information from old printed documents that have been digitised injudiciously ends up requiring a lot of human correction. To overcome this problem, one possible solution is to annotate the documents with markup. This paper presents the encoding of a digitised sample of the Sarawak Gazette, published from 1903 until 1939, using the standard TEI XML schema. The output of the work is a set of six TEI XML templates that are considered to represent the different layout structures found in the studied samples
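
    For readers unfamiliar with TEI, the following sketch builds a minimal TEI document skeleton with Python's standard library. It is not one of the paper's six templates, whose structure is not given in the abstract. It only shows the basic shape TEI expects: a `teiHeader` with a file description, followed by a `text` body. The function name, the `type="article"` attribute, and the placeholder strings are assumptions.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialise TEI elements with a default namespace instead of an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, heading, paragraphs):
    """Build a minimal TEI tree: header (fileDesc) plus one div in the body."""
    root = ET.Element(tei("TEI"))

    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Illustrative sample, not a published edition."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Placeholder source description."

    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    # Hypothetical layout unit: one article with a heading and paragraphs.
    div = ET.SubElement(body, tei("div"), {"type": "article"})
    ET.SubElement(div, tei("head")).text = heading
    for para in paragraphs:
        ET.SubElement(div, tei("p")).text = para
    return root
```

    `ET.tostring(build_tei("Sample issue", "Notes", ["First paragraph."]), encoding="unicode")` serialises the tree; validating the result against the TEI schema would require an external validator.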

    Minimizing Human Labelling Effort for Annotating Named Entities in Historical Newspaper

    To accelerate the annotation of named entities (NEs) in historical newspapers such as the Sarawak Gazette, only two choices are possible: an automatic approach or a semi-automatic approach. This paper presents a fully automatic annotation of NEs occurring in the Sarawak Gazette. At the initial stage, a subset of the historical newspapers is fed to an established rule-based named entity recognizer (NER), namely ANNIE. The pre-annotated corpus is then used as training and testing data for three supervised-learning NER models, based on the Naïve Bayes, J48 decision tree, and SVM-SMO methods. These methods are not always accurate, and it appears that SVM-SMO and J48 perform better than Naïve Bayes. A thorough study of the errors made by SVM-SMO and J48 therefore led to the creation of ad hoc rules to correct the errors automatically. The proposed approach is promising, even though more experiments are still needed to refine the rules

    Wiki SaGa: an Interactive Timeline to Visualize Historical Documents

    Searching for information inside a repository of digitised historical documents is a very common task. A timeline interface that represents the historical content and supports the same search function can give researchers better results. This paper presents the integration of SIMILE Timeline within a wiki, named Wiki SaGa, containing a digitised version of the Sarawak Gazette. Compared with a traditional list of documents, the proposed approach displays events and supports searching for relevant information

    Comparative Studies of Ontologies on Sarawak Gazette

    This paper discusses the experience and process of the initial stage of ontology building in the history domain. Its objective is to create a manual semantic annotation process to determine the concepts that will be used in the historical news ontology. It describes the tasks that facilitate the analysis of missing concepts in Sarawak Gazette (SAGA) documents. Semantically annotating SAGA documents makes it possible to enrich the concepts and relations taken from existing ontologies. Furthermore, an initial result is provided to observe the performance gain due to domain-specific annotations. Finally, we conclude on the importance of the semantic annotation process in the construction of an ontology

    Inducing a Semantically Rich Nested Event Model

    Research has revealed that obtaining data with named entity (NE) labels is labour-intensive and costly. This paper proposes two approaches that enable NE classes to be added to the semantic role label (SRL) predicate-argument structure of the Nested Event Model. The first approach, named SRL-NER, associates SRL with Named Entity Recognition (NER) to tag the appropriate entity class on the simple arguments of the model. The second approach, called SRL-ACE-NER, associates SRL with NER by fine-tuning entities in complex argument structures using the Automatic Content Extraction (ACE) structure. The Stanford NER tool is used as the benchmark for evaluation. The results show that the proposed approaches are able to recognize more PERSON entities. However, they are not able to recognize LOCATION/PLACE as efficiently as the benchmark. It is also observed that the benchmark tool is sometimes unable to tag as comprehensively as the proposed approaches. This paper demonstrates the potential of using a semantically enriched Nested Event Model as an alternative NER technique. SRL-ACE-NER achieved an average precision of 92% in recognising PERSON, LOCATION/PLACE, TIME, and ORGANIZATION