Automatic Identification of Close Languages – Case Study: Malay and Indonesian.
Identifying the language of an unknown text is not a new problem, but what is new is the task of identifying close languages. Malay and Indonesian, like many other languages, are very similar, and it is therefore genuinely difficult to search, retrieve, classify, and above all translate texts written in one of the two languages
The Notion Of Instrument In Malay Language.
In Malay, the official language of Malaysia, the notion of instrument is expressed in five ways. In the first two expressions, the instrument noun is introduced by either the preposition dengan 'with' or the preposition melalui 'through, via': <X Z=Action {dengan, melalui} Y=Instrument> (e.g. Remaja pukul ibunya dengan batang paip 'An adolescent hit his mother with a pipe', menghantar bantahan melalui e-mel kepada Dr. X 'to send a protest through email to Dr. X')
Which Extractive Summarization Method For Malay Texts?
The number of texts written in Malay increases every day. When these texts are lengthy, interested readers tend to skim through them. Automatic text summarization may help these readers reach the important parts of the texts without scanning from beginning to end. To date, only a few Malay text summarizers have been presented in the literature. Therefore, a comparative study of three extractive summarization methods (Luhn's method, Edmundson's method, and the LexRank method) was undertaken, and the results are reported in this paper. The aim of the study is to determine an adequate extractive method. Several experiments were conducted, comparing the results of the three extractive methods with human extracts as well as human abstracts. It appears that Luhn's method, one of the oldest automatic extractive summarization methods, performs well when tested on 14 Malay abstract summaries and 20 Malay extractive summaries
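To make the idea behind Luhn's method concrete, here is a minimal sketch in Python. It is not the implementation evaluated in the paper: the stop-word list, the `min_count` frequency threshold, and the `max_gap` cluster distance are illustrative assumptions, and the sentence splitter is deliberately naive. A sentence is scored by its densest cluster of significant (frequent, non-stop) words, as the squared number of significant words divided by the cluster length.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative Malay stop-word list; a real list would be much longer.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk", "dengan", "pada"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", sentence.lower())


def significant_words(sentences, min_count=2):
    """Words that are not stop words and occur at least min_count times."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    words = tokenize(sentence)
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Still inside the same cluster.
            count += 1
        else:
            # Close the current cluster and start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    significant = significant_words(sentences)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: luhn_score(sentences[i], significant),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

Calling `summarize(text, n_sentences=2)` on a short Malay passage returns the two sentences with the densest clusters of frequent content words, which is the kind of extract that can then be compared against human extracts.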
Design and Implementation of PIAK: A Personalized Internet Access System for Kids
The Internet plays an important role in delivering information worldwide, but the huge amount of online information available is not all appropriate for children. This paper presents the design and implementation of PIAK, a Personalized Internet Access system for Kids. It aims to assist children and teach them to use the Internet in a single, safe environment. PIAK features four personalized components: a cross-platform user interface, multilingual support, educative and assistive mediums, and web content filtering. Its design is based on children's needs as inferred from survey findings. This makes Internet access more appealing to children, as they can explore the Internet in a controlled environment
Identifying And Classifying Unknown Words In Malay Texts.
In this paper, we propose a method based on a chain of filters to handle the problem of identifying and classifying
unknown words in Malay texts. A word is identified as unknown when it is not listed in the lexicon
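The lexicon test and the chain-of-filters idea can be illustrated with a small Python sketch. The abstract does not list the actual filters, so the three filters below, their class labels, and the mini-lexicon are all hypothetical. The sketch only shows the control flow: a word absent from the lexicon is passed through the filters in order until one of them assigns a class.

```python
import re

# Assumption: a toy lexicon for illustration; the paper's lexicon is not described here.
LEXICON = {"saya", "makan", "nasi", "di", "pada"}


def is_unknown(word, lexicon=LEXICON):
    """A word is unknown when it is not listed in the lexicon."""
    return word.lower() not in lexicon


# Each hypothetical filter returns a class label, or None to pass the word on.
def numeric_filter(word):
    return "NUMBER" if re.fullmatch(r"\d+([.,]\d+)*", word) else None


def abbreviation_filter(word):
    if len(word) > 1 and word.isalpha() and word.isupper():
        return "ABBREVIATION"
    return None


def proper_noun_filter(word):
    return "PROPER_NOUN_CANDIDATE" if word[:1].isupper() else None


FILTERS = [numeric_filter, abbreviation_filter, proper_noun_filter]


def classify_unknown(word, filters=FILTERS):
    """Run the word through the chain; the first filter that answers wins."""
    for apply_filter in filters:
        label = apply_filter(word)
        if label is not None:
            return label
    return "UNCLASSIFIED"


def process(tokens):
    """Map every unknown token to the class assigned by the filter chain."""
    return {t: classify_unknown(t) for t in tokens if is_unknown(t)}
```

Ordering matters in such a chain: the abbreviation filter has to run before the proper-noun filter here, because an all-capitals token also starts with a capital letter.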
Using TEI XML Schema to Encode the Structures of Sarawak Gazette
Automatic extraction of information from old
printed documents that have been digitised injudiciously will
end up requiring a lot of human correction. To overcome the problem,
one possible solution is to annotate the documents with
markup. This paper presents the encoding of the digitised
sample of Sarawak Gazette published from 1903 until 1939
using the standard TEI XML schema. The output of the work is
a set of six TEI XML templates that is considered to represent
the different layout structures found in the studied samples
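As a rough illustration of what a TEI-encoded page can look like, the following Python sketch builds a minimal TEI skeleton with the standard library. It does not reproduce any of the six templates, which are not given here: the function name `build_tei`, the `type="article"` value, and the placeholder publication statement are assumptions. Only the overall shape (a `teiHeader` with a `fileDesc`, followed by `text`, `body`, and `div` elements) follows the general TEI P5 structure.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title, source_note, articles):
    """Return a minimal TEI document as a string.

    articles is a list of (heading, [paragraph, ...]) pairs; this layout is
    a hypothetical simplification, not one of the paper's templates.
    """
    # Serialize TEI elements in the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None, **attrs):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
        node.text = text
        return node

    root = ET.Element(f"{{{TEI_NS}}}TEI")

    # Header: fileDesc needs a title, a publication and a source statement.
    file_desc = el(el(root, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished illustrative sample.")
    el(el(file_desc, "sourceDesc"), "p", source_note)

    # Body: one div per article, each with a heading and its paragraphs.
    body = el(el(root, "text"), "body")
    for heading, paragraphs in articles:
        div = el(body, "div", type="article")
        el(div, "head", heading)
        for paragraph in paragraphs:
            el(div, "p", paragraph)

    return ET.tostring(root, encoding="unicode")
```

A call such as `build_tei("Sarawak Gazette sample", "Digitised page.", [("Notice", ["First paragraph."])])` yields a single well-formed XML string that downstream extraction tools can query by element name instead of guessing at the page layout.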
Minimizing Human Labelling Effort for Annotating Named Entities in Historical Newspaper
To accelerate the annotation of named entities
(NEs) in historical newspapers like Sarawak Gazette, only two choices are possible: an automatic approach or a semi-automatic approach. This paper presents a fully automatic annotation of NEs occurring in Sarawak Gazette. At the initial stage, a subset of the historical newspapers is fed to an established rule-based
named entity recognizer (NER), namely ANNIE. Then, the pre-annotated corpus is used as training and testing data for three supervised-learning NER methods, based on Naïve Bayes, J48 decision trees, and SVM-SMO. These methods are not always accurate, and SVM-SMO and J48 appear to perform better than Naïve Bayes. A thorough study of the errors made by SVM-SMO and J48 therefore led to the creation of ad hoc rules that correct the errors automatically. The proposed approach is promising, even though more experiments are still needed to refine the rules
Wiki SaGa: an Interactive Timeline to Visualize Historical Documents
Searching for information inside a repository of digitised historical
documents is a very common task. A timeline interface that represents the historical
content and offers the same search function can present results more clearly
to researchers. This paper presents the integration of SIMILE Timeline within a
wiki, named Wiki SaGa, containing a digitised version of Sarawak Gazette. Compared
to a traditional list of documents, the proposed approach allows events to be
displayed and relevant information to be searched
Comparative Studies of Ontologies on Sarawak Gazette
This paper discusses the experience and process of the initial stage of ontology building in the history domain. Its objective is to create a manual semantic annotation process to determine the concepts that will be used in the historical news ontology. It describes the tasks that facilitate the analysis of missing concepts in Sarawak Gazette (SAGA) documents. Semantically annotating SAGA documents makes it possible to enrich the concepts and relations taken from existing ontologies. Furthermore, an initial result is provided to observe the performance gain due to domain-specific annotations. Finally, we conclude on the importance of the semantic annotation process in the construction of an ontology
Inducing a Semantically Rich Nested Event Model
Research has revealed that obtaining data with named entity (NE)
labels is labour-intensive and costly. This paper proposes two approaches
to enable NE classes to be added to the semantic role label (SRL) predicate-argument
structure of the Nested Event Model. The first approach, named SRL-NER, associates SRL with
Named Entity Recognition (NER) to tag the
appropriate entity class to the simple argument of the model. The second
approach associates SRL with NER by fine-tuning entities in complex argument
structures with the Automatic Content Extraction (ACE) structure. This approach is
called SRL-ACE-NER. The Stanford NER tool is used as the benchmark for evaluation.
The result shows that the proposed approaches are able to recognize
more PERSON entities. However, the approaches are not able to recognize
LOCATION/PLACE as efficiently as the benchmark. It is also observed that the
benchmark tool is sometimes not able to tag as comprehensively as the proposed
approaches. This paper has successfully demonstrated the potential of using a
semantically enriched Nested Event Model as an alternative NER technique.
SRL-ACE-NER has achieved an average precision of 92% in recognising
PERSON, LOCATION/PLACE, TIME, and ORGANIZATION
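
For readers unfamiliar with the metric, the short Python sketch below shows one common way an average precision over entity classes can be computed. The abstract does not say whether the reported figure is a macro or a micro average, so the macro average used here is an assumption, and any counts passed to these functions are hypothetical rather than taken from the paper.

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing was tagged."""
    tagged = true_positives + false_positives
    return true_positives / tagged if tagged else 0.0


def macro_precision(counts):
    """Unweighted mean of per-class precision.

    counts maps a class name to a (true_positives, false_positives) pair.
    """
    scores = [precision(tp, fp) for tp, fp in counts.values()]
    return sum(scores) / len(scores)
```

With made-up counts such as `{"PERSON": (9, 1), "TIME": (8, 2)}`, the per-class precisions are 0.9 and 0.8, and the macro average is 0.85.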