34 research outputs found

    An Investigation into the Pedagogical Features of Documents

    Characterizing the content of a technical document in terms of its learning utility can be useful for applications related to education, such as generating reading lists from large collections of documents. We refer to this learning utility as the "pedagogical value" of the document to the learner. While pedagogical value is an important concept that has been studied extensively within the education domain, there has been little work exploring it from a computational, i.e., natural language processing (NLP), perspective. To allow a computational exploration of this concept, we introduce the notion of "pedagogical roles" of documents (e.g., Tutorial and Survey) as an intermediary component for the study of pedagogical value. Given the lack of available corpora for our exploration, we create the first annotated corpus of pedagogical roles and use it to test baseline techniques for automatic prediction of such roles. Comment: 12th Workshop on Innovative Use of NLP for Building Educational Applications (BEA) at EMNLP 2017; 12 pages

    Experiments on the Use of Feature Selection and Machine Learning Methods in Automatic Malay Text Categorization

    Due to the rapid growth of documents in digital form, research on automatically categorizing text into predefined categories has attracted booming interest. Although a wide range of supervised machine learning methods has been applied to categorize English text, relatively few studies have addressed Malay text categorization. This paper reports our comparative evaluation of three machine learning methods on Malay text categorization. Two feature selection methods (Information Gain (IG) and Chi-square) and three machine learning methods (K-Nearest Neighbor (k-NN), Naive Bayes (NB) and N-gram) were investigated. The three supervised machine learning models were evaluated on a categorized Malay corpus, and experimental results showed that k-NN with Chi-square feature selection gave the best performance (Macro-F1 = 96.14).
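    The Chi-square feature selection named in this abstract can be illustrated with a minimal sketch. It uses the standard 2x2 contingency form of the statistic for one term and one category; it is not taken from the paper. The function name `chi_square_score` and the representation of documents as sets of tokens are assumptions made for illustration.

```python
def chi_square_score(docs, labels, term, category):
    """Chi-square association between `term` and `category`.

    docs   -- list of token sets, one per document (assumed representation)
    labels -- list of category names, aligned with docs
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0          # term or category never (or always) occurs
    return n * (a * d - c * b) ** 2 / denom


def top_terms(docs, labels, category, limit):
    """Rank every term in the vocabulary by its score for `category`."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: chi_square_score(docs, labels, t, category),
        reverse=True,
    )
    return ranked[:limit]
```

    Terms with the highest scores would be kept as features before training a classifier such as k-NN; how many to keep is a tuning choice that this abstract does not state.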

    A New Weighted k-Nearest Neighbor Algorithm Based on Newton's Gravitational Force

    The kNN algorithm has three main advantages that make it appealing to the community: it is easy to understand, it regularly offers competitive performance, and its structure can be easily tuned to adapt to the needs of researchers to achieve better results. One of the variations is weighting the instances based on their distance. In this paper we propose a weighting based on Newton's gravitational force, so that a mass (or relevance) has to be assigned to each instance. We evaluated this idea in the kNN context over 13 benchmark data sets used for binary and multi-class classification experiments. Results in F1 score, statistically validated, suggest that our proposal outperforms the original version of kNN and is statistically competitive with the distance weighted kNN version as well. This research was partially supported by CONACYT-Mexico (project FC-2410). The work of Paolo Rosso has been partially funded by the SomEMBED TIN2015-71147-C2-1-P MINECO research project. Aguilera, J.; González, LC.; Montes-Y-Gómez, M.; Rosso, P. (2019). A New Weighted k-Nearest Neighbor Algorithm Based on Newton's Gravitational Force. Lecture Notes in Computer Science. 11401:305-313. https://doi.org/10.1007/978-3-030-13469-3_36
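    One way to picture a gravity-style weighting is the sketch below, in which each of the k nearest neighbors votes with weight mass / distance squared. This is an illustration only and not the authors' exact formulation: the abstract does not say how masses are assigned, so the `masses` argument, the uniform default, and the `eps` smoothing term are all assumptions.

```python
import math
from collections import defaultdict


def gravitational_knn_predict(train_points, train_labels, query,
                              k=3, masses=None, eps=1e-9):
    """Classify `query` with a gravity-like weighted vote.

    Each of the k nearest training points contributes mass / distance**2
    to its own label. Masses default to 1.0 (assumption), which reduces
    the rule to inverse-square distance weighting.
    """
    if masses is None:
        masses = [1.0] * len(train_points)

    # Euclidean distance from the query to every training point.
    scored = [
        (math.dist(point, query), label, mass)
        for point, label, mass in zip(train_points, train_labels, masses)
    ]
    scored.sort(key=lambda item: item[0])

    # Accumulate the "attraction" each label exerts on the query.
    votes = defaultdict(float)
    for dist, label, mass in scored[:k]:
        votes[label] += mass / (dist * dist + eps)  # eps avoids division by zero

    return max(votes, key=votes.get)
```

    With uniform masses, nearby neighbors dominate the vote; giving a distant instance a large mass lets it outweigh closer ones, which is the role the abstract assigns to "mass (or relevance)".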

    Improving the Predictive Performances of k Nearest Neighbors Learning by Efficient Variable Selection

    This paper computationally demonstrates a sharp improvement in predictive performance for k nearest neighbors thanks to an efficient forward selection of the predictor variables. We show on both simulated and real-world data that this novel approach repeatedly outperforms regression models under stepwise selection. Comment: 11 pages, 7 figures
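    Forward selection for k nearest neighbors can be sketched as a greedy loop that adds one variable at a time while a validation score keeps improving. The version below is a plain greedy baseline for classification scored by leave-one-out accuracy. It is not the paper's "efficient" procedure, whose details the abstract does not give, and the function names are hypothetical.

```python
import math
from collections import Counter


def loo_knn_accuracy(X, y, features, k=3):
    """Leave-one-out accuracy of k-NN restricted to the given feature indices."""
    correct = 0
    for i, xi in enumerate(X):
        neighbors = []
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the point being predicted
            dist = math.sqrt(sum((xi[f] - xj[f]) ** 2 for f in features))
            neighbors.append((dist, y[j]))
        neighbors.sort(key=lambda item: item[0])
        vote = Counter(label for _, label in neighbors[:k])
        if vote.most_common(1)[0][0] == y[i]:
            correct += 1
    return correct / len(X)


def forward_select(X, y, k=3):
    """Greedily add the feature that most improves leave-one-out accuracy."""
    remaining = list(range(len(X[0])))
    selected = []
    best_score = 0.0
    while remaining:
        scored = [
            (loo_knn_accuracy(X, y, selected + [f], k), f) for f in remaining
        ]
        score, feature = max(scored, key=lambda item: item[0])
        if score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

    Because k-NN distances are computed over all included variables, an uninformative variable can pull unrelated points together, which is why dropping it can raise accuracy.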

    IOT Security Against Network Anomalies through Ensemble of Classifiers Approach

    IoT networks are used to monitor critical environments of all types, and the volume of data they transfer has greatly expanded in recent years due to a large rise in all forms of data. Since so many devices are connected to the Internet of Things (IoT), network and device security is of paramount importance. Network dynamics and complexity are still the biggest challenges to detecting IoT attacks. The dynamic nature of the network makes it challenging to categorise attacks using a single classifier. To identify the abnormalities, we therefore propose an ensemble classifier in this study. The proposed ensemble classifier combines the independent classifiers ELM, Naive Bayes (NB), and k-nearest neighbour (KNN) in bagging and boosting configurations. The proposed technique is evaluated and compared using MQTTset, a dataset focused on the MQTT protocol, which is frequently utilised in IoT networks. The analysis demonstrates that the proposed classifier outperforms the baseline classifiers in terms of classification accuracy, precision, recall, and F-score.

    Machine Learning versus Deep Learning for Malware Detection

    It is often claimed that the primary advantage of deep learning is that such models can continue to learn as more data is available, provided that sufficient computing power is available for training. In contrast, for other forms of machine learning it is claimed that models "saturate," in the sense that no additional learning can occur beyond some point, regardless of the amount of data or computing power available. In this research, we compare the accuracy of deep learning to other forms of machine learning for malware detection, as a function of the training dataset size. We experiment with a wide variety of hyperparameters for our deep learning models, and we compare these models to results obtained using k-nearest neighbors. In these experiments, we use a subset of a large and diverse malware dataset that was collected as part of a recent research project.

    Analisis Sentimen Kurikulum 2013 Pada Twitter Menggunakan Ensemble Feature Dan Metode K-Nearest Neighbor

    The 2013 Curriculum (Kurikulum 2013) is a new curriculum in the Indonesian education system that the government introduced to replace the 2006 curriculum, or Kurikulum Tingkat Satuan Pendidikan (KTSP). Its introduction in recent years has sparked various controversies in Indonesian education, especially among students: demands that students be more active, additional class hours, and other changes have given rise to a range of opinions in society, particularly on the social media platform Twitter. An estimated 200 million Twitter users post 400 million tweets per day. In this study, sentiment analysis is performed to identify these opinions, which are divided into positive and negative opinions. The features and method used are an ensemble feature and the K-Nearest Neighbor (K-NN) classification method. The ensemble feature is a combined feature consisting of the statistical Bag of Words (BoW) feature and semantic features (Twitter-specific, textual, PoS, and lexicon-based features). Based on a series of tests, combining features improves the accuracy of the K-Nearest Neighbor (K-NN) method in determining positive or negative opinions. Combining the features compensates for the weaknesses of each, so the final accuracy obtained by combining the two reaches 96%. By contrast, when the features are used independently, accuracy reaches only 80% for the Bag of Words (BoW) feature and 82% for the ensemble feature without Bag of Words (BoW).

    Comparing tagging suggestion models on discrete corpora

    This paper aims to investigate the methods for the prediction of tags on a textual corpus that describes diverse data sets based on short messages; as an example, the authors demonstrate the usage of methods based on hotel staff inputs in a ticketing system as well as the publicly available StackOverflow corpus. The aim is to improve the tagging process and find the most suitable method for suggesting tags for a new text entry.

    Predictive model for detecting fake reviews: Exploring the possible enhancements of using word embeddings

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Fake data contaminates the insights that can be obtained about a product or service and ultimately hurts both businesses and consumers. Being able to correctly identify the truthful reviews will ensure consumers are able to more effectively find products that suit their needs. The following paper aims to develop a predictive model for detecting fake hotel reviews using Natural Language Processing techniques and applying various Machine Learning models. The current research in this area has primarily focused on sentiment analysis and the detection of fake reviews using various text mining methods including bag of words, tokenization, POS tagging and TF-IDF. The research mostly looks at some combination of quantitative and qualitative information. The text component is only analyzed with regards to which words appear in the review, while the semantic relationship is ignored. This research attempts to develop a higher level of performance by implementing pretrained word embeddings during the preprocessing of the text data. The goal is to introduce some context to the text data and see how each model’s performance changes. Traditional text mining models were applied to the dataset to provide a benchmark. Subsequently, GloVe, Word2Vec and BERT word embeddings were implemented and the performance of 8 models was reviewed. The analysis shows a somewhat lower performance obtained by the word embeddings. It seems that in texts of short length, the appearance of words is more indicative of a fake review than the semantic meaning of those words.