    EVALUASI ESAI OTOMATIS DENGAN ALGORITMA NAZIEF & ADRIANI DAN WINNOWING

    Information technology is expected to ease day-to-day activities, and the evaluation of students' learning outcomes by lecturers is no exception. Grading essay exams manually is far from effective or efficient, so this study uses fingerprints obtained from the hash values of text collections with the Winnowing algorithm and then computes their similarity with the Jaccard similarity coefficient. Before this computation, the Indonesian-language answer texts undergo preprocessing, with the Nazief & Adriani algorithm as the stemmer. The resulting system evaluated 30 student answers against an answer key in 1.62 seconds, with an average similarity score of 81.20%.
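
    As an illustration of this pipeline, the minimal Python sketch below fingerprints text with Winnowing and compares two fingerprints with the Jaccard coefficient. The k-gram and window sizes are illustrative assumptions (the abstract does not state the paper's parameters), and Python's built-in hash stands in for the rolling hash a production implementation would use.

```python
# Sketch of Winnowing fingerprinting + Jaccard similarity.
# k and window are illustrative values, not the paper's settings.

def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    """Return a Winnowing fingerprint: one minimal hash per sliding window."""
    text = "".join(text.lower().split())  # strip case and whitespace
    grams = [hash(text[i:i + k]) for i in range(len(text) - k + 1)]
    fingerprint = set()
    for i in range(len(grams) - window + 1):
        fingerprint.add(min(grams[i:i + window]))  # keep each window's minimum
    return fingerprint

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

key = winnow("sistem evaluasi esai otomatis")
answer = winnow("evaluasi esai secara otomatis")
print(f"similarity: {jaccard(key, answer):.2%}")
```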

    Plagiarism detection for Indonesian texts

    As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for an automatic plagiarism checker is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most existing work deals with detecting duplicate or near-duplicate documents, does not address the problem of retrieving source documents, or tends to measure document similarity only globally. Systems resulting from that research are therefore incapable of pointing to the exact locations of "similar passage" pairs. Moreover, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address these weaknesses, this thesis develops a plagiarism detection system that executes the various stages of plagiarism detection as a workflow. In the retrieval stage, a novel document feature coined the "phraseword" is introduced and used alongside word unigrams and character n-grams to retrieve source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed at detecting and locating source-obfuscated passage pairs; the seeds for matching such pairs are based on locally weighted significant terms, so as to capture paraphrased and summarized passages. In addition to the system, an evaluation corpus was created, partly through simulation by human writers and partly by algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval, some methods using phraseword and token features achieved the optimum recall rate of 1. In the second scenario, which evaluated detection performance, the system was compared with Alvi's algorithm at four levels of measurement: character, passage, document, and case. The experiments showed that methods using tokens as seeds score higher than Alvi's algorithm at all four levels, on both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, "shaked", and paraphrased passages, while Alvi's recognition rate on summarized passages is only insignificantly higher than ours. The third scenario showed the same tendency, except that Alvi's precision at the character and paragraph levels was higher than ours. The higher Plagdet scores produced by some of our methods show that this study fulfilled its objective of implementing a competitive, state-of-the-art algorithm for detecting plagiarism in Indonesian texts. Run on our test corpus, Alvi's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus, so this study also contributes a standard evaluation corpus for assessing PDS on Indonesian documents. Finally, the study contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies, one of which applies the local word weighting used in text summarization to select seeds both for discriminating candidate paragraph pairs and for the matching process. The proposed detection algorithm produces almost no multiple detections, which adds to its strength.
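
    The abstract does not define how phrasewords are constructed, so the sketch below illustrates only the general shape of the retrieval stage: plain character n-grams stand in as the document feature, and candidate sources are ranked by feature overlap with the suspicious document. Both choices are simplifying assumptions for illustration, not the thesis's method.

```python
# Hypothetical source-retrieval sketch: character n-grams as document
# features, candidates ranked by overlap with the suspicious document.

def char_ngrams(text: str, n: int = 8) -> set[str]:
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def rank_sources(suspicious: str, sources: dict[str, str], n: int = 8):
    """Return candidate source documents sorted by feature overlap."""
    query = char_ngrams(suspicious, n)
    scores = {name: len(query & char_ngrams(doc, n)) / len(query)
              for name, doc in sources.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

corpus = {"src1": "teks sumber pertama tentang deteksi plagiarisme",
          "src2": "dokumen lain yang tidak berkaitan"}
print(rank_sources("teks sumber pertama yang disalin sebagian", corpus))
```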

    User stories collection via interactive chatbot to support requirements gathering

    Nowadays, software products have become an essential part of human life. To build software, developers must have a good understanding of its requirements, yet they tend to jump-start system construction without a clear and detailed understanding of those requirements. The user story is one practice of requirements elicitation. This paper presents work on an Android chatbot application that supports the requirements elicitation activity in software engineering, making the work less time-consuming and more structured even for users not accustomed to requirements engineering. The chatbot uses the Nazief & Adriani stemming algorithm to preprocess the natural language it receives from users and Artificial Intelligence Markup Language (AIML) as the knowledge base from which the bot's responses are produced. A preliminary acceptance test based on the technology acceptance model yields an 83.03% score for users' behavioral intention to use.
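
    A minimal sketch of this pipeline is shown below, assuming the PySastrawi library (whose stemmer implements the Nazief & Adriani algorithm) for preprocessing; a plain dictionary of stemmed patterns stands in for the AIML knowledge base the paper uses.

```python
# Sketch: stem Indonesian input, then match it against stemmed patterns.
# The knowledge base entries here are hypothetical examples.
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()

# Hypothetical stemmed pattern -> elicitation follow-up question.
knowledge_base = {
    stemmer.stem("pengguna masuk"): "Who needs to log in, and why?",
    stemmer.stem("laporan bulanan"): "What should the monthly report contain?",
}

def respond(utterance: str) -> str:
    """Stem the user's utterance and return the first matching response."""
    stemmed = stemmer.stem(utterance)
    for pattern, reply in knowledge_base.items():
        if pattern in stemmed:
            return reply
    return "Could you describe that requirement in more detail?"

print(respond("Para pengguna memasuki sistem"))
```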

    Clustering topic groups of documents using K-Means algorithm: Australian Embassy Jakarta media releases 2006-2016

    Introduction. The Australian Embassy in Jakarta stores a wide array of media release documents. Analyzing the particular, vital patterns in this collection is worthwhile, as it yields new insight into and knowledge of the significant topic groups of the documents. Methodology. The K-Means algorithm was used as a non-hierarchical clustering method that partitions data objects into clusters by minimizing data variation within clusters and maximizing it between clusters. Data analysis. Of the documents issued between 2006 and 2016, 839 were examined to determine term frequencies and to generate clusters; the clustering result was validated by a nominated expert. Results and discussion. There were 57 meaningful terms grouped into 3 clusters, and "people to people links", "economic cooperation", and "human development" were chosen to represent the topics of the Australian Embassy Jakarta media releases from 2006 to 2016. Conclusions. Text mining can be used to cluster topic groups of documents; it provides a more systematic clustering process, as the text analysis proceeds through a number of stages with specifically set parameters.
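
    The clustering step can be sketched with scikit-learn as below. The toy documents and the top-terms inspection are illustrative stand-ins for the 839 media releases and the expert validation; the three-cluster setting follows the paper's result.

```python
# Sketch of TF-IDF + K-Means document clustering with per-cluster top terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "scholarship exchange students visit",      # placeholder texts
    "trade investment economic partnership",
    "health education development program",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Show the highest-weighted terms for each cluster centroid.
terms = tfidf.get_feature_names_out()
for c, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]
    print(f"cluster {c}:", [terms[i] for i in top])
```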

    Aplicação de técnicas de Clustering ao contexto da Tomada de Decisão em Grupo

    Nowadays, decisions made by executives and managers are primarily made in groups. Group decision-making is thus a process in which a group of people, called participants, work together to analyze a set of variables and to consider and evaluate a set of alternatives in order to select one or more solutions. Many problems are associated with group decision-making, notably when the participants cannot meet for some reason, ranging from schedule incompatibility to being in different countries with different time zones. To support this process, Group Decision Support Systems (GDSS) evolved into what we now call web-based GDSS. In a GDSS, argumentation is valuable because it makes it easier to use justifications and explanations in the interactions between decision-makers, so that they can sustain their opinions. Aspect-Based Sentiment Analysis (ABSA) is a subfield of Argument Mining closely related to Natural Language Processing; it classifies opinions at the aspect level and identifies the elements of an opinion. Applying ABSA techniques to the group decision-making context results in, for example, the automatic identification of alternatives and criteria. This automatic identification is essential for reducing the time decision-makers need to set themselves up in a GDSS and for offering them insights into the discussion in which they participate, one such insight being the arguments decision-makers use about an alternative. This dissertation therefore proposes a methodology that uses an unsupervised technique, clustering, to segment the participants of a discussion based on the arguments they use, producing knowledge from the information currently in the GDSS. The methodology can be hosted in a web service following a micro-service architecture and combines data preprocessing and intra-sentence segmentation with clustering. Word embedding is needed to transform natural language text into vectors usable by the clustering techniques, and dimensionality reduction techniques were also tested to improve the results. Keeping the preprocessing steps fixed while varying the clustering technique, word embedder, and dimensionality reduction technique yielded the best approach: KMeans++ clustering with SBERT as the word embedder and UMAP reducing the number of dimensions to 2. This setup achieved a silhouette score of 0.63 with 8 clusters on the baseball dataset, which gave good clusters on manual review and inspection of word clouds, and a silhouette score of 0.59 with 16 clusters on the car brand dataset, which was used to validate the approach.
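
    The reported best configuration can be sketched as follows. The SBERT model name, the toy argument texts, and the small cluster count are assumptions (the dissertation reports k=8 on its baseball dataset, which needs more samples than a toy list provides).

```python
# Sketch: SBERT embeddings -> UMAP (2 dims) -> k-means++ -> silhouette score.
from sentence_transformers import SentenceTransformer
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

arguments = [
    "I prefer alternative A because it is cheaper.",
    "Alternative B has better long-term support.",
    "The cost of A outweighs its benefits.",
    # ... one entry per argument extracted from the discussion
]

# Model name is an assumption; the dissertation only says "SBERT".
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(arguments)

# n_neighbors lowered only so the toy dataset fits; reduce to 2 dimensions.
reduced = umap.UMAP(n_components=2, n_neighbors=2,
                    random_state=42).fit_transform(embeddings)

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(reduced)

print("silhouette:", silhouette_score(reduced, labels))
```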

    Role of images on World Wide Web readability

    As the Internet and World Wide Web have grown, they have brought many benefits. Anyone with access to a computer can find a great deal of information quickly and easily; electronic devices can store and retrieve vast amounts of data in seconds; products and services that once had to be obtained in person can be obtained from home; and documents can be translated from English to Urdu, or converted from text to speech, almost instantly, making it easier for people of different cultures and abilities to communicate. As technology improves, web developers and website visitors expect more animation, colour, and interactivity, and as computers process images ever faster, developers use graphics more and more. For users who can see them, colour, pictures, animation, and images can aid comprehension and readability and improve the Web experience; pictures can also help people who have trouble reading or whose first language is not the language of the website. Not all images, however, help people understand and read the text they accompany: images used purely for decoration, or picked arbitrarily by a site's creators, do not. Several factors can also affect how easy graphical content is to read, such as low image resolution, a poor aspect ratio, a poor colour combination within the image, or a small font size, and the WCAG provides guidelines for each of these problems, recommending alternative text, the right combination of colours, sufficient contrast, and higher resolution. One of the biggest problems is that images unrelated to the text of a web page can make that text harder to read, whereas relevant images can make the page easier to read. This thesis proposes a method for determining how relevant the images on a website are from the point of view of web readability. The method combines several ways of extracting information from images, using the Cloud Vision API and Optical Character Recognition (OCR), and reads the text of the website to find the relevance between the two; preprocessing techniques are applied to the extracted information, and Natural Language Processing (NLP) techniques determine how the images and text of a web page relate to each other. The tool was applied to the images of fifty educational websites to assess their relevance. The results show that images unrelated to a page's content, and low-quality images, lead to lower relevancy scores. A user study evaluated the hypothesis that relevant images enhance web readability through two evaluations: an evaluation by 1024 end users of the page and a heuristic evaluation by 32 accessibility experts, with questions covering what users know, how they feel, and what they can do. The results support the idea that images relevant to a page make it easier to read. This method will help web designers make pages easier to read by examining only the essential parts of a page rather than relying on their own judgment. Doctoral Programme in Computer Science and Technology, Universidad Carlos III de Madrid. Committee president: José Luis López Cuadrado; secretary: Divakar Yadav; member: Arti Jai
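
    A rough sketch of the relevance computation is shown below, assuming the google-cloud-vision Python client for OCR and TF-IDF cosine similarity as a stand-in for the thesis's exact NLP relevance measure; credential setup is omitted, and the file name is hypothetical.

```python
# Sketch: OCR an image with Cloud Vision, then score relevance against the
# page text with TF-IDF cosine similarity (an assumed relevance measure).
from google.cloud import vision
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def image_text(path: str) -> str:
    """Run Cloud Vision OCR on an image file and return the detected text."""
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.text_detection(image=image)
    annotations = response.text_annotations
    return annotations[0].description if annotations else ""

def relevance(image_path: str, page_text: str) -> float:
    """Cosine similarity between an image's OCR text and the page text."""
    ocr = image_text(image_path)
    if not ocr.strip():
        return 0.0  # no recoverable text in the image
    X = TfidfVectorizer().fit_transform([ocr, page_text])
    return float(cosine_similarity(X[0], X[1])[0, 0])

print(relevance("banner.png", "Admissions open for the computer science programme"))
```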

    Understanding user behavior aspects on emergency mobile applications during emergency communications using NLP and text mining techniques

    Abstract. The use of mobile devices has been skyrocketing in our society. Through these devices and their social media applications, users can access and share any type of information in a timely manner, increasing their awareness of ongoing events such as election campaigns, sports updates, movie releases, disaster occurrences, and studies. Their attractiveness, affordability, and two-way communication capabilities have made mobile devices that support social media platforms central to emergency communication as well, which makes a mobile emergency application an attractive communication tool during emergencies. The emergence of mobile-based emergency communication prompted us to study user behavior around these applications. Our study covers emergency apps in Nordic countries, namely Finland, Sweden, and Norway. To understand users' views on the usage of emergency mobile applications, we leveraged various Natural Language Processing and text mining techniques. The VADER sentiment tool was used to predict and track the polarity of users' reviews of a particular application over time. Then, to identify the factors that affect users' sentiments, we employed topic modeling, specifically the Latent Dirichlet Allocation (LDA) model, which identifies the themes discussed in the user reviews and represents each theme by a weighted set of words from the corpus. Although LDA succeeds in highlighting user-related factors, it fails to identify user aspects, and the topic definitions it produces are vague. Hence we leveraged Aspect-Based Sentiment Analysis (ABSA) methods to extract user aspects from the reviews; for this task we fine-tuned DeBERTa, a variant of BERT (Bidirectional Encoder Representations from Transformers), an architecture that allows the model to learn context in text. Following this, we performed a sentence-pair sentiment classification task using different variants of BERT. We then drilled into the different sentiments to highlight the factors and categories that most affect user behavior, leveraging the Empath categorization technique. Finally, we constructed word associations from ontological vocabularies related to mobile applications and to emergency response and management systems. The insights from this study can be used to identify user aspect terms, predict the sentiment of an aspect term in a given review, and determine how an aspect term affects users' perspectives on the usage of mobile emergency applications.
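
    The first two analysis steps can be sketched as follows; the review texts and the two-topic setting are illustrative assumptions.

```python
# Sketch: VADER review-polarity scoring followed by LDA topic modeling.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "The alert came instantly, great app for emergencies",
    "App crashes every time I open the warning map",
    "Push notifications are late and the UI is confusing",
]

# Step 1: compound polarity in [-1, 1] for each review.
analyzer = SentimentIntensityAnalyzer()
for r in reviews:
    print(analyzer.polarity_scores(r)["compound"], r)

# Step 2: LDA topics over a bag-of-words representation.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = vec.get_feature_names_out()
for t, dist in enumerate(lda.components_):
    print(f"topic {t}:", [words[i] for i in dist.argsort()[::-1][:4]])
```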

    Perbandingan dokumen Modul Ajar Mata Pelajaran Informatika Sekolah Menengah Atas (SMA) berbasis Similarity

    The government released the Merdeka Curriculum (Kurikulum Merdeka) policy in 2022, replacing the previous curriculum, the 2013 curriculum. In the implementation of the Merdeka Curriculum at the senior high school (SMA) level, namely in Phase E, the informatics subject has eight basic content elements that each school may develop independently. In practice, when teachers develop teaching modules in schools, there is no measurable evaluation from the government of whether the developed content conforms to the existing standard curriculum. This research proposes a similarity analysis of informatics teaching module documents against the Computer Science Curricula 2013 document as the benchmark. The proposed Cosine Similarity method yields an average similarity of 0.29606, while the Word2Vec method yields an average similarity of 0.98449. In the Knowledge Area fulfillment stage, the similarity scores obtained with the Cosine Similarity method give an average fulfillment of 29.6% across all tested Knowledge Areas, while the Word2Vec method gives an average fulfillment of 98.44%. In the accuracy measurement stage, the Cosine Similarity method attains an average accuracy of 0.6525 across all tested Knowledge Areas, and the Word2Vec method an average accuracy of 0.5895.
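
    The two similarity measures can be contrasted in the short sketch below. The toy texts, the whitespace tokenization, and training Word2Vec on only two documents are illustrative assumptions; the study works with the actual module and curriculum texts.

```python
# Sketch: TF-IDF cosine similarity vs. cosine similarity of averaged
# Word2Vec vectors for a document pair.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

module = "algoritma pemrograman dasar dan struktur data"
curriculum = "fundamentals of algorithms programming and data structures"

# Measure 1: TF-IDF cosine similarity.
X = TfidfVectorizer().fit_transform([module, curriculum])
print("tfidf cosine:", cosine_similarity(X[0], X[1])[0, 0])

# Measure 2: cosine similarity of mean Word2Vec vectors.
tokens = [module.split(), curriculum.split()]
w2v = Word2Vec(tokens, vector_size=50, min_count=1, seed=0)
vecs = [np.mean([w2v.wv[t] for t in doc], axis=0) for doc in tokens]
print("word2vec cosine:", cosine_similarity([vecs[0]], [vecs[1]])[0, 0])
```

    Averaging word vectors tends to push cosine scores toward 1 for almost any pair of texts, which may help explain the large gap between the two reported similarity averages (0.29606 vs. 0.98449).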