6 research outputs found

    Rekomendasi Pengembangan Fasilitas Wisata Tugu Pahlawan Surabaya Melalui Visualisasi Dashboard Hasil Klasifikasi Analisis Sentimen Ulasan Pengunjung

    Tugu Pahlawan Surabaya is one of the flagship tourist attractions of Surabaya, whose management treats visitor reviews as a reference for evaluation. However, the managers lack technology capable of collecting, processing, and analyzing all review data to produce concise information. One solution is sentiment analysis at the aspect level, across the education, facilities, cleanliness, service, and general aspects, with the results presented as a dashboard. Sentiment analysis was carried out using a Support Vector Machine on 2,180 reviews from the last two years collected from Google Reviews. The facilities aspect received the most reviews, 538 in total, with a sentiment distribution of 285 positive, 95 negative, and 158 neutral reviews. Recommendations based on current strengths and weaknesses include providing more realistic areas or photo objects with historical, hero-themed nuances, and installing open ventilation or standing coolers in several areas. Based on the confusion matrix, the F1-score, rather than accuracy, is used to judge how well the model classifies the data, because the dataset is imbalanced, which makes precision or recall errors likely. Most misclassifications occur in the neutral sentiment class. The overall classification results are presented as a dashboard with a SUS score of 77.5, indicating that the dashboard is well received by the respondents as its users.
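The pipeline described above — SVM classification of review text, evaluated with F1 rather than accuracy because of class imbalance — can be sketched as follows. This is a minimal illustration, not the study's implementation: the review texts and labels are invented placeholders, and scikit-learn stands in for whatever tooling the authors used.

```python
# Hypothetical sketch: TF-IDF features, a linear SVM, and macro F1 as the
# headline metric for an imbalanced three-class sentiment dataset.
# The reviews and labels below are illustrative placeholders only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, confusion_matrix

reviews = ["the museum is clean and educational", "toilets were dirty",
           "nice place to learn history", "parking is hard to find",
           "great diorama exhibits", "an average visit overall",
           "friendly and helpful staff", "too hot inside the hall"]
labels = ["positive", "negative", "positive", "negative",
          "positive", "neutral", "positive", "negative"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(reviews, labels)
pred = model.predict(reviews)

# Macro F1 weights each class equally, so a minority class such as
# "neutral" cannot be hidden behind high accuracy on the majority class.
print(f1_score(labels, pred, average="macro"))
print(confusion_matrix(labels, pred,
                       labels=["positive", "negative", "neutral"]))
```

In practice the model would be fit on a training split and scored on held-out reviews; the confusion matrix then makes the neutral-class errors the abstract mentions directly visible.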

    Pengelompokan dan Klasifikasi Laporan Masyarakat di Situs Media Center Surabaya Menggunakan Metode K-Means Clustering dan Support Vector Machine

    The Media Center is an integrated public-service system for the residents of Surabaya. Through it, residents can participate by submitting reports about the city. The Media Center still groups these reports manually, so an alternative method is needed to simplify the grouping. In this final project, the author performs clustering and classification of community reports using K-Means Clustering and Support Vector Machine. The reports used in this research are unlabeled, so they must first be clustered with K-Means to assign each report a label based on its cluster. The labeled data can then be classified with SVM to build a classification model. Clustering 1,948 report records yielded 10 clusters as the best configuration, with a silhouette coefficient of 0.61. Using 1,568 training records and 380 test records, the classification model achieved an accuracy of 83.42%.
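The two-stage approach above — K-Means to pseudo-label unlabeled reports, then an SVM trained on those labels — can be sketched like this. The report texts are invented placeholders, and the cluster count here is far smaller than the paper's 10 clusters.

```python
# Hypothetical sketch of the two-stage pipeline: K-Means assigns
# pseudo-labels to unlabeled reports (scored with the silhouette
# coefficient), then an SVM is trained on those labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

reports = ["broken streetlight on main road", "garbage not collected",
           "pothole near the market", "streetlight flickering at night",
           "request for new trash bins", "road damaged by flooding"] * 5

X = TfidfVectorizer().fit_transform(reports)

# Stage 1: cluster the unlabeled reports and score the clustering.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
pseudo_labels = kmeans.labels_
print("silhouette:", silhouette_score(X, pseudo_labels))

# Stage 2: train an SVM classifier on the pseudo-labeled data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, pseudo_labels, test_size=0.2, random_state=0)
svm = SVC(kernel="linear").fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, svm.predict(X_te)))
```

As in the paper, the silhouette coefficient would be compared across different cluster counts to pick the best configuration before training the classifier.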

    Sehaa: A big data analytics tool for healthcare symptoms and diseases detection using Twitter, Apache Spark, and Machine Learning

    Smartness, which underpins smart cities and societies, is defined by our ability to engage with our environments, analyze them, and make decisions, all in a timely manner. Healthcare is the prime candidate needing the transformative capability of this smartness. Social media could enable a ubiquitous and continuous engagement between healthcare stakeholders, leading to better public health. Current works are limited in their scope, functionality, and scalability. This paper proposes Sehaa, a big data analytics tool for healthcare in the Kingdom of Saudi Arabia (KSA) using Twitter data in Arabic. Sehaa uses Naive Bayes, Logistic Regression, and multiple feature extraction methods to detect various diseases in the KSA. Sehaa found that the top five diseases in Saudi Arabia in terms of actual afflicted cases are dermal diseases, heart diseases, hypertension, cancer, and diabetes. Riyadh and Jeddah need to do more in creating awareness about the top diseases. Taif is the healthiest city in the KSA in terms of the detected diseases and awareness activities. Sehaa is developed over Apache Spark, allowing true scalability. The dataset used comprises 18.9 million tweets collected from November 2018 to September 2019. The results are evaluated using well-known numerical criteria (Accuracy and F1-Score) and are validated against externally available statistics.
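The classifier combination Sehaa is built on — Naive Bayes and Logistic Regression over extracted text features — can be sketched as below. This is only an illustration: English placeholder tweets stand in for the Arabic Twitter data, and plain scikit-learn stands in for the Apache Spark deployment the paper uses for scalability.

```python
# Hypothetical sketch: Naive Bayes and Logistic Regression over
# bag-of-words features to flag health-related tweets. The tweets and
# labels are invented placeholders, not Sehaa's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["my diabetes checkup went well", "traffic is terrible today",
          "blood pressure still too high", "great football match tonight",
          "this skin rash will not go away", "new cafe opened downtown"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = health-related, 0 = not

nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(tweets, labels)
lr = make_pipeline(CountVectorizer(), LogisticRegression()).fit(tweets, labels)

for text in ["hypertension runs in my family", "the weather is lovely"]:
    print(text, "->", nb.predict([text])[0], lr.predict([text])[0])
```

At Sehaa's scale (18.9 million tweets) the same pipeline shape would be expressed with Spark's distributed ML primitives rather than in-memory scikit-learn.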

    Uticaj klasifikacije teksta na primene u obradi prirodnih jezika

    The main goal of this dissertation is to place different text classification tasks in the same frame by mapping the input data into a common vector space of linguistic attributes. Several classification problems of great importance for natural language processing are then solved by applying appropriate classification algorithms. The dissertation deals with the problem of validating bilingual translation pairs: the goal is to construct a classifier that substitutes for human evaluation and decides, using a variety of linguistic information and methods, whether a pair is a proper translation between the two languages. In dictionaries, it is useful to have a sentence that demonstrates the use of a particular dictionary entry; this task is called the classification of good dictionary examples. A method is developed that automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short-message classification are also discussed. In the first, the classes are the authors of the messages, and the task is to assign each message to its author from a fixed set; this task is called authorship identification. The second is opinion mining, or sentiment analysis: assuming a short message carries a positive or negative attitude about some topic, or is purely informative, the classes are positive, negative, and neutral. These tasks are of great importance in natural language processing, and the proposed solutions are language-independent, based on machine learning methods: support vector machines, decision trees, and gradient boosting.
    For all of these tasks, the effectiveness of the proposed methods is demonstrated on the Serbian language.
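The dissertation's unifying idea — one shared feature space feeding interchangeable learners (SVM, decision tree, gradient boosting) across tasks — can be sketched as follows. The messages, labels, and choice of character n-gram features are illustrative assumptions, not the dissertation's actual attribute set.

```python
# Hypothetical sketch: map texts into one shared vector space (here,
# character n-gram TF-IDF), then swap in different classifiers.
# Character n-grams are language-independent, which matters for a
# morphologically rich language like Serbian.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

messages = ["what a wonderful day", "this is awful", "meeting at noon",
            "i love this song", "worst service ever", "the bus leaves at 5"]
sentiment = ["positive", "negative", "neutral",
             "positive", "negative", "neutral"]

# Shared representation used by every task and every learner.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(messages)

# The same feature matrix feeds three interchangeable classifiers.
for clf in (LinearSVC(), DecisionTreeClassifier(),
            GradientBoostingClassifier()):
    clf.fit(X, sentiment)
    print(type(clf).__name__,
          clf.predict(vec.transform(["awful day"]))[0])
```

Switching from sentiment analysis to, say, authorship identification then only changes the label set, not the representation or the learners.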

    Definición de un framework para el análisis predictivo de datos no estructurados

    The amount of information generated continuously on the Internet increases in volume and variety every day. Web 2.0, the Internet of Things, and mobile devices are just some of the elements behind this increase in data volume. In the near future, the introduction of 5G technology will lead to an exponential increase in data generation by allowing greater Gb/s transfer rates. Research in this area should therefore establish guidelines for data-analysis methodologies, as well as the means to process the data. However, the size and diversity of these data mean that several scientific disciplines must be combined to analyze them and obtain relevant findings: not only traditional techniques will be applied, but other areas of science must be brought together to extract the so-called 'hidden information' behind these data. Within this growing availability of data, Web 2.0 contributes the paradigm of social networks and the (unstructured) data they generate, commonly free text. This free text may be associated with other elements depending on its source; for example, it may be tied to a rating scale for a product or service.
    For all the above, this thesis proposes the definition of a framework for analyzing unstructured social-network data using machine learning, natural language processing, and big data techniques. The main features of this framework are:
    - The framework is divided into two phases, each consisting of a set of stages defined to analyze either a small data volume (below what is considered big data) or a large one (big data).
    - The central element of phase one is the machine learning model, which consists of two parts: (i) a series of natural language processing techniques for data preprocessing and (ii) a series of machine learning algorithms for classifying the information.
    - The machine learning model built in the first phase is intended to be reused in the second (big data) phase to analyze the same data source at a much larger volume.
    - The machine learning model is not tied to specific algorithms, which makes it versatile to adopt.
    The research is therefore multidisciplinary, combining diverse scientific disciplines for a common purpose: solving the problem of analyzing unstructured social-network data requires uniting heterogeneous techniques from various areas of science and engineering. The research methodology for this doctoral thesis consisted of:
    1. State of the Art: a selection of studies by other authors in Big Data, Machine Learning, and Natural Language Processing, together with the union of these topics with sentiment analysis and social-network rating systems, plus a comparison of what other authors have proposed when combining the three areas covered by the framework.
    2. State of the Technique: an analysis of the elements that make up the framework and a theoretical retrospective on them, addressing more technical issues and surveying the technologies used in current research.
    3. Proposed Solution: the proposed framework, analyzed from two perspectives: the theoretical aspects of each phase and the implementation aspects, including the complexity of carrying out each phase in a real situation.
    4. Evaluation and Validation: a series of tests defined to verify the hypotheses established at the beginning of the research and demonstrate the validity of the proposed model.
    5. Documentation and Conclusions: documenting all aspects related to this thesis and presenting the conclusions that emerge at the end of the research.
    A framework was thus built with two phases for analyzing a set of unstructured data; a distinguishing feature is the construction of a machine learning model during the first phase that serves as the basis for the second, which is characterized by large-volume data processing. To validate this thesis, Yelp data from the hotel sector was used. The framework was evaluated by running several tests with machine learning classifiers, obtaining high prediction percentages in the binary classification carried out in both the non-big-data and the big data environments. The conclusions obtained after designing the framework and analyzing and validating the results show that the model can analyze unstructured social-network data at both a smaller (non-big-data) and a larger (big data) scale. Interesting challenges and future lines of research remain, both in extending the model to other types of information and in integrating and adapting the machine learning model from the first phase to the second.
    Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Presidente: Alejandro Calderón Mateos. Secretario: Alejandro Rodríguez González. Vocal: Mario Graff Guerrer
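The framework's central idea — train an NLP-plus-classifier model on a modest sample in phase one, then reuse the same fitted model on a much larger volume in phase two — can be sketched as below. The Yelp-style hotel reviews and labels are invented placeholders, and plain scikit-learn stands in for whatever big data tooling the thesis's second phase would use at scale.

```python
# Hypothetical sketch of the two-phase framework: phase one fits the
# model on a small (non-big-data) sample; phase two reuses the fitted
# model unchanged on a much larger batch of the same data source.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Phase 1: build the model on a small sample (invented reviews).
sample_reviews = ["great hotel, spotless rooms", "rude staff, never again",
                  "lovely breakfast and view", "dirty bathroom, avoid",
                  "comfortable beds, kind staff", "noisy and overpriced"]
sample_labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sample_reviews, sample_labels)

# Phase 2: apply the same fitted model to a larger incoming volume.
large_batch = ["spotless rooms and kind staff",
               "overpriced and dirty"] * 1000
predictions = model.predict(large_batch)
print(len(predictions))
```

The design choice the thesis emphasizes is visible here: the phase-one artifact is the model itself, so scaling to big data only changes where `predict` runs, not how the model was built.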