7 research outputs found

    A new technology on translating Indonesian spoken language into Indonesian sign language system

    People with hearing disabilities are unable to hear and therefore cannot communicate using spoken language. This research offers a one-way translation technology that interprets spoken Indonesian into the Indonesian sign language system (SIBI). Speech recognition captures the sentences (audio) spoken by hearing people and converts them to text. A text-processing step then selects the input text, which is stemmed into prefixes, base words, and suffixes. Each word is indexed and matched to SIBI. The system then arranges the words into SIBI sentences that follow the original sentences, so that people with hearing disabilities can obtain the information carried by the spoken language. The technology was evaluated using a confusion matrix, which gave a precision of 76%, an accuracy of 78%, and a recall of 79%. It was tested at SMP-LB Karya Mulya with nine seventh-grade students, 86% of whom stated that the technology runs very well.
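
The three figures reported above are standard confusion-matrix metrics. As a minimal sketch of how such values are derived, assuming a binary confusion matrix: the class name `ConfusionCounts` and the counts in `example` are invented for illustration and are not the study's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cell counts of a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of predicted positives that are correct."""
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    """Share of actual positives that were found."""
    return _ratio(c.tp, c.tp + c.fn)


def accuracy(c: ConfusionCounts) -> float:
    """Share of all decisions that are correct."""
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


# Made-up counts for illustration only; NOT taken from the study.
example = ConfusionCounts(tp=6, fp=2, fn=4, tn=8)
print(f"precision={precision(example):.2f}",
      f"recall={recall(example):.2f}",
      f"accuracy={accuracy(example):.2f}")
```

Because precision and recall use different denominators, a system can score differently on each, as the reported 76% and 79% do.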

    Current Implementation and Future Prospects of Santi-Morf V.1.0

    SANTI-Morf (Prihantoro, 2021) is a new morphological analyser for Indonesian. In the SANTI-Morf annotation scheme (Prihantoro, 2019, "A new tagset for morphological analysis of Indonesian"), morpheme tokens are linked to their annotations. The tokens are presented in both their orthographic and citation forms to allow (allo)morph-based or morpheme-based searches. Users can also retrieve items by formal and functional morphological criteria, because the SANTI-Morf tagset encodes analyses of morphemes’ forms (e.g. roots, clitics, affix types) and functions (e.g. passive voice, active voice, adjective degrees). Currently, the scheme is implemented in NooJ (Silberztein, 2003), a linguistic development environment. It enables users to index and annotate Indonesian texts on their local PC, and later to perform searches based on the morphological criteria and/or tokens defined by the SANTI-Morf scheme.
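
To illustrate why storing both orthographic and citation forms supports the two kinds of search described above, here is a minimal sketch. The `MorphemeToken` class, the tag labels, the tiny hand-made `CORPUS`, and the `search` helper are all assumptions made for this example; they are not the SANTI-Morf tagset or the NooJ query interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MorphemeToken:
    orthographic: str               # surface (allo)morph as written in the text
    citation: str                   # citation form of the morpheme
    form: str                       # formal category (label invented here)
    function: Optional[str] = None  # functional tag (label invented here)


# Hand-made analyses: the prefix meN- surfaces as "mem" before "baca"
# and as "men" before "dengar".
CORPUS = {
    "membaca": [MorphemeToken("mem", "meN-", "PREFIX", "ACTIVE"),
                MorphemeToken("baca", "baca", "ROOT")],
    "mendengar": [MorphemeToken("men", "meN-", "PREFIX", "ACTIVE"),
                  MorphemeToken("dengar", "dengar", "ROOT")],
    "dibaca": [MorphemeToken("di", "di-", "PREFIX", "PASSIVE"),
               MorphemeToken("baca", "baca", "ROOT")],
}


def search(**criteria: str) -> List[str]:
    """Return words with at least one morpheme matching every criterion."""
    return [
        word
        for word, tokens in CORPUS.items()
        if any(all(getattr(t, key) == value for key, value in criteria.items())
               for t in tokens)
    ]


print(search(orthographic="mem"))   # allomorph-based search
print(search(citation="meN-"))      # morpheme-based search
print(search(function="PASSIVE"))   # functional search
```

A search on the orthographic form finds only one allomorph, while a search on the citation form finds every realisation of the same morpheme.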

    Konstruksi Kausatif Bahasa Batak Toba dan Bahasa Mandailing: Kajian Tipologis Bahasa

    This research is a typological study of the Toba Batak and Mandailing languages. The two languages are cognate, with similar structure and typology, and both are still actively used in the North Sumatra region. The study specifically compares causative constructions in Toba Batak and Mandailing by selecting the same verbs in both languages and comparing them. The research is qualitative. Data were obtained using the observation (simak) and conversation (cakap) methods, then examined using the identity (padan) method and the distributional (agih) method, and checked with triangulation techniques. The results show that causatives in Toba Batak (BT) and Mandailing (BM) generally share the same forms. Both languages have lexical, morphological, and analytic causatives. Lexical causatives in BT and BM have subtype (2), a subtype of verbs that are unique, and subtype (3), in which a different verb forms the causative construction. Both languages also distinguish direct and indirect causatives. Morphological causatives in Toba Batak are marked by the causative affixes (-hon), (-i), (pa-/par-), (pa-hon), and (pa-), whereas in Mandailing they are marked by (ma-kon), (pa-kon), (pa-on), (pa-), and (tar-). Both languages also have analytic causatives, whose constructions are formed by predicates containing verbs (intransitive and transitive), adjectives, and nouns, and which express a causal event with two separate predicates (cause and effect).

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not yet well developed: most studies deal with detecting duplicate or near-duplicate documents, do not address the problem of retrieving source documents, or tend to measure document similarity globally. The resulting systems are therefore incapable of pointing to the exact locations of "similar passage" pairs. Moreover, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address the weaknesses of earlier research, this thesis develops a plagiarism detection system that executes various methods for the plagiarism detection stages in a workflow system. In the retrieval stage, a novel document feature called the phraseword is introduced and used alongside word unigrams and character n-grams to address the problem of retrieving source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which uses a two-step paragraph-based comparison, aims to detect and locate source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally weighted significant terms, in order to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers and through algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimal recall rate of 1.
In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at four levels of measurement: character, passage, document, and case. The experimental results showed that the methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaken, and paraphrased passages. However, Alvi's recognition rate on summarized passages is higher than our system's, though not significantly. The third scenario showed the same tendency, except that the precision rates of Alvi's algorithm at the character and paragraph levels are higher than those of our system. The fact that some methods in our system produce higher Plagdet scores than Alvi's algorithm shows that this study has fulfilled its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test document corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus. This study has thus contributed a standard evaluation corpus for assessing PDS for Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of these applies the local word weighting used in the text summarization field to select seeds both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which adds to its strength.
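
The abstract reports Plagdet scores without defining the measure. The sketch below assumes the definition commonly used in the PAN plagiarism detection evaluations, where the F1 score of precision and recall is divided by log2(1 + granularity), and granularity is the average number of detections reported per detected case (1 means no multiple detection). Whether the thesis computes it in exactly this way is an assumption; the function names are chosen for this example.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Plagdet as assumed here: F1 discounted by detection granularity.

    granularity >= 1; a value of 1 means each plagiarism case is
    reported exactly once, so the score equals plain F1.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    return f1_score(precision, recall) / math.log2(1 + granularity)


# Illustrative values only; not results from the thesis.
print(plagdet(0.8, 0.8, 1.0))  # no multiple detection
print(plagdet(0.8, 0.8, 3.0))  # same F1, penalised for fragmented detections
```

Under this definition, an algorithm with almost no multiple detection keeps its granularity close to 1 and so loses almost nothing of its F1 score.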

    An automatic morphological analysis system for Indonesian

    This thesis reports the creation of SANTI-morf (Sistem Analisis Teks Indonesia – morfologi), a rule-based system that performs morphological annotation for Indonesian. The system was built in three stages: preliminaries, annotation scheme creation (the linguistic aspect of the project), and system implementation (the computational aspect of the project). The preliminary matters covered include the necessary key concepts in morphology and Natural Language Processing (NLP), as well as a concise description of Indonesian morphology (largely based on the two primary reference grammars of Indonesian, Alwi et al. 1998 and Sneddon et al. 2010, together with work in the linguistic literature on Indonesian morphology, e.g. Kridalaksana 1989; Chaer 2008). As part of this preliminary stage, I created a testbed corpus for evaluation purposes. The design of the testbed is justified by considering the design of existing evaluation corpora, such as the testbed used by the English Constraint Grammar (EngCG) system (Voutilainen 1992), the British National Corpus (BNC) 1994 evaluation data, and the training data used by MorphInd (Larasati et al. 2011), a morphological analyser (MA) for Indonesian. The dataset for this testbed was created by narrowing down an existing very large but unbalanced collection of texts (drawn from the Leipzig corpora; see Goldhahn et al. 2012). The initial collection was reduced to a corpus composed of nine domains, following the domain categorisation of the BNC. A set of texts from each domain, proportional in size, was extracted and combined to form a testbed that complies with the design informed by the prior literature. The second stage, scheme creation, involved the creation of a new Morphological Annotation Scheme (MAS) for Indonesian, for use in the SANTI-morf system.
First, a review of MASs in different languages (Finnish, Turkish, Arabic, Indonesian) as well as the Universal Dependencies MAS identifies the best practices in the field. From these, 15 design principles for the novel MAS were devised. This MAS consists of a morphological tagset, together with comprehensive justification of the morphological analyses used in the system. It achieves full morpheme-level annotation, presenting each morpheme’s orthographic and citation forms in the defined output, accompanied by robust morphological analyses, both formal and functional; to my knowledge, this is the first MAS of its kind for Indonesian. The MAS’s design is based not only on reference grammars of Indonesian and other linguistic sources, but also on the anticipated needs of researchers and other users of texts and corpora annotated using this scheme of analysis. The new MAS aims at [...] The third stage of the project, implementation, consisted of three parts: a benchmarking evaluation exercise, a survey of frameworks and tools, and finally the actual implementation and evaluation of SANTI-morf. MorphInd (Larasati et al. 2011) is the prior state-of-the-art MA for Indonesian. That being the case, I evaluated MorphInd’s performance against the aforementioned testbed, both as justification of the need for an improved system and to serve as a benchmark for SANTI-morf. MorphInd scored 93% on lexical coverage and 89% on tagging accuracy. Next, I surveyed existing MA frameworks and tools. This survey justifies my choice of the rule-based approach (inspired by Koskenniemi’s 1983 Two-Level Morphology) and NooJ (Silberztein 2003) as, respectively, the framework and the software tool for SANTI-morf. After selection of this approach and tool, the language resources that constitute the SANTI-morf system were created. These are, primarily, a number of lexicons and sets of analysis rules, as well as the necessary NooJ system configuration files.
SANTI-morf’s 3 lexicon files (86,590 entries in total) and 15 rule files (659 rules in total) are organised into four modules, namely the Annotator, the Guesser, the Improver and the Disambiguator. These modules are applied one after another in a pipeline. The Annotator provides initial morpheme-level annotation for Indonesian words by identifying the morphological processes by which they have been built (affixation, reduplication, compounding, and cliticisation). The Guesser ensures that words the Annotator cannot analyse, because they are missing from its lexicons, receive best guesses as to the correct analysis through the application of a set of probable but not exceptionless rules. The Improver improves the existing annotation by adding probable analyses that the Annotator might have missed. Finally, the Disambiguator resolves ambiguities, that is, words for which the earlier elements of the pipeline have generated two or more possible analyses in terms of the morphemes identified or their annotation. NooJ annotations are saved in a binary file, but for evaluation purposes, plain-text output is required. I thus developed a system for data export using an in-NooJ mapping to and from a modified, exportable expression of the MAS, and wrote a small program to enable re-conversion of the output in plain-text format. For purposes of the evaluation, I created a 10,000-word, manually annotated gold-standard SANTI-morf dataset. The outcome of the evaluation is that SANTI-morf has 100% coverage (because a best-guess analysis is always provided for unrecognised word forms), and 99% precision and recall for the morphological annotations, with a 1% rate of remaining ambiguity in the final output. SANTI-morf is thus shown to present a number of advancements over MorphInd, the state-of-the-art MA for Indonesian, exhibiting more robust annotation and better coverage.
Other performance indicators, namely the high precision and recall, make SANTI-morf a concrete advance in the field of automated morphological annotation for Indonesian, and in consequence a substantive contribution to the field of Indonesian linguistics overall.
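
The four-module pipeline described in this abstract can be sketched as a chain of functions, each taking and returning a mapping from word forms to candidate analyses. This is only a toy illustration under stated assumptions: the real system is built from NooJ lexicons and rule files, whereas the `TOY_LEXICON`, the tag labels, the single reduplication rule, and the "prefer non-guesses" heuristic below are all invented here. The real Disambiguator also leaves about 1% of ambiguity, which this sketch does not model.

```python
from typing import Dict, List

Analyses = Dict[str, List[str]]  # word form -> candidate analyses

# Hypothetical toy lexicon and tag labels, invented for this sketch.
TOY_LEXICON: Dict[str, List[str]] = {
    "buku": ["buku<ROOT>"],
    "makanan": ["makan<ROOT>+an<SUFFIX>", "makanan<ROOT>"],
}


def annotator(words: List[str]) -> Analyses:
    """Initial annotation: look each word up in the lexicon."""
    return {w: list(TOY_LEXICON.get(w, [])) for w in words}


def guesser(analyses: Analyses) -> Analyses:
    """Give every uncovered word a best-guess analysis (total coverage)."""
    return {w: (a if a else [f"{w}<GUESS>"]) for w, a in analyses.items()}


def improver(analyses: Analyses) -> Analyses:
    """Add a probable analysis that earlier stages may have missed.

    Toy rule: X-X, where X is in the lexicon, is read as full reduplication.
    """
    improved: Analyses = {}
    for word, candidates in analyses.items():
        left, sep, right = word.partition("-")
        extra = []
        if sep and left == right and left in TOY_LEXICON:
            extra = [f"{left}<ROOT>+<REDUP>"]
        improved[word] = candidates + [x for x in extra if x not in candidates]
    return improved


def disambiguator(analyses: Analyses) -> Analyses:
    """Keep one analysis per word: prefer non-guesses, then the first listed."""
    resolved: Analyses = {}
    for word, candidates in analyses.items():
        preferred = [c for c in candidates if not c.endswith("<GUESS>")]
        resolved[word] = (preferred or candidates)[:1]
    return resolved


def run_pipeline(words: List[str]) -> Analyses:
    """Apply the four modules one after another."""
    return disambiguator(improver(guesser(annotator(words))))


for word, analysis in run_pipeline(["buku", "buku-buku", "makanan", "xyz"]).items():
    print(word, analysis)
```

Keeping every stage to the same input and output type is what lets the modules be applied one after another, and the Guesser stage is what guarantees that every word receives some analysis.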

    A Two-Level Morphological Analyser for the Indonesian Language
