4 research outputs found

    Online content clustering using variant K-Means Algorithms

    Thesis (MTech)--Cape Peninsula University of Technology, 2019.

    We live at a time when enormous amounts of information are created, and much of it is redundant. There is a huge amount of online information in the form of news articles that discuss similar stories, and the number of articles is projected to grow. This growth makes it difficult for a person to process all of that information in order to stay current on a subject. A solution is needed that can organize this similar information into specific themes. One such solution comes from the branch of Artificial Intelligence (AI) called machine learning (ML), using clustering algorithms to group similar pieces of information into containers. Once the information is clustered, people can be presented with information on their subject of interest grouped together, and the information in a group can be further processed into a summary.

    This research focuses on unsupervised learning. The literature identifies K-Means as one of the most widely used unsupervised clustering algorithms: it is easy to learn, easy to implement, and efficient. However, there are many variants of K-Means. This research seeks a variant of K-Means that can cluster duplicate or similar news articles into the correct semantic groups with acceptable performance.

    The research is an experiment. News articles were collected from the internet using gocrawler, a program that takes Uniform Resource Locators (URLs) as arguments and collects a story from the website each URL points to. The URLs are read from a repository. The collected stories come riddled with adverts and images from the web page; this is referred to as dirty text. The dirty text is sanitized, that is, cleaned by removing the adverts and images. The clean text is stored in a repository and forms one input to the algorithm. The other input is the K value: all K-Means-based variants take a K value that defines the number of clusters to be produced.

    The stories were manually classified and labelled so that the accuracy of the machine clustering could be checked; each story is labelled with the class to which it belongs. The data collection process itself was therefore not unsupervised, but the algorithms used to cluster are totally unsupervised. A total of 45 stories were collected and 9 manual clusters were identified; under each manual cluster there are sub-clusters of stories about one specific event. The performance of all the variants was compared by checking each algorithm's clustering results against the manual classification. Each K-Means variant was run on the same data set (the 45 stories) and the same settings, namely:

    • dimensionality of the feature vectors;
    • window size;
    • maximum distance between the current and predicted word in a sentence;
    • minimum word frequency;
    • a specified range of words to ignore;
    • number of threads used to train the model;
    • the training algorithm, either distributed memory (PV-DM) or distributed bag of words (PV-DBOW);
    • the initial learning rate, which decreases to a minimum alpha as training progresses;
    • number of iterations per cycle;
    • final learning rate;
    • number of clusters to form;
    • the number of times the algorithm will be run;
    • the method used for initialization.

    The results obtained show that K-Means can perform better than K-Modes; they are tabulated and presented in graphs in chapter six. Clustering could be improved by incorporating Named Entity Recognition (NER) into the K-Means algorithms. Results could also be improved by implementing a multi-stage clustering technique, in which initial clustering is performed and each resulting cluster group is clustered again to achieve finer clustering results.
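    The thesis does not reproduce its source code, but the settings listed above correspond closely to the parameters of gensim's Doc2Vec and scikit-learn's KMeans. The sketch below shows, under assumed parameter values and library choices, the kind of pipeline the abstract describes: embed the sanitized stories as paragraph vectors, cluster them with K-Means, and compare the machine clustering against the manual labels. All names and values here are illustrative, not the author's.

    ```python
    # Minimal sketch (assumed libraries: gensim, scikit-learn) of the pipeline
    # described above: Doc2Vec embeddings -> K-Means -> comparison with labels.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def cluster_stories(stories, manual_labels, k=9):
        """stories: sanitized article strings; manual_labels: hand-assigned classes."""
        corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
                  for i, s in enumerate(stories)]

        # Settings mirror the list above; the values are illustrative guesses.
        model = Doc2Vec(
            vector_size=100,   # dimensionality of the feature vectors
            window=5,          # max distance between current and predicted word
            min_count=2,       # minimum word frequency
            workers=4,         # number of threads used to train the model
            dm=1,              # training algorithm: 1 = PV-DM, 0 = PV-DBOW
            alpha=0.025,       # initial learning rate
            min_alpha=0.0001,  # final learning rate (minimum alpha)
            epochs=40,         # iterations per cycle
        )
        model.build_vocab(corpus)
        model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
        vectors = [model.dv[i] for i in range(len(stories))]

        km = KMeans(n_clusters=k,      # number of clusters to form (the K value)
                    n_init=10,         # number of times the algorithm is run
                    init="k-means++")  # initialization method
        predicted = km.fit_predict(vectors)

        # Agreement between the manual classification and the machine clustering
        return predicted, adjusted_rand_score(manual_labels, predicted)
    ```

    With the 45 labelled stories this would yield cluster assignments plus an adjusted Rand score against the 9 manual clusters; swapping the KMeans class for a K-Modes or other variant implementation would reproduce the comparison the experiment performs.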

    An Investigation of the Digital Sublime in Video Game Production

    This research project examines how video games can be programmed to generate the sense of the digital sublime. The digital sublime is a term proposed by this research to describe experiences where the combination of code and art produces games that appear boundless and autonomous. The definition of this term is arrived at by building on various texts and literature, such as the work of Kant, Deleuze and Wark, and on video games such as Proteus, Minecraft and Love. The research is based on the investigative practice of my work as an artist-programmer and demonstrates how games can be produced to encourage digitally sublime scenarios. In the three games developed for this thesis I employ computer code as an artistic medium to generate games that explore permutational complexity and present experiences that walk the margins between confusion and control.

    The structure of this thesis begins with a reading of the Kantian sublime, which I introduce as the foundation for my definition of the digital sublime. I then combine this reading with elements of contemporary philosophy and computational theory to establish a definition applicable to the medium of digital games. This definition is used to guide my art practice in the development of three games that examine different aspects of the digital sublime, such as autonomy, abstraction, complexity and permutation. The production of these games is at the core of my research methodology, and their development and analysis are used to produce contributions in the following areas:

    1. New models for artist-led game design. This includes methods that re-contextualise existing aesthetic forms such as futurism, synaesthesia and romantic landscape through game design and coding. It also presents techniques that merge visuals and mechanics into a format developed for artistic and philosophical enquiry.
    2. The development of new procedural and generative techniques in the programming of video games. This includes the implementation of a real-time marching cubes algorithm that generates fractal-noise-filtered terrain, and a versatile three-dimensional space-packing architectural construction algorithm.
    3. A new reading of the digital sublime. This reading draws from the Kantian sublime and the writings of Deleuze, Wark and De Landa in order to present an understanding of the digital sublime specific to the domain of art practice within video games.

    These contributions are evidenced in the writing of this thesis and in the construction of the associated portfolio of games.
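    The thesis's game code is not included here; purely as a rough illustration of the technique named in contribution 2, the sketch below builds a fractal Brownian motion (fBm) density field whose zero isosurface a marching cubes pass would polygonise into terrain. Every detail, including the hash constants, octave parameters, and function names, is an assumption for illustration, not the author's implementation.

    ```python
    # Illustrative sketch: an fBm-filtered 3-D density field of the kind a
    # real-time marching cubes implementation could polygonise into terrain.
    import math

    def _lattice(ix, iy, iz, seed=0):
        """Deterministic pseudo-random value in [0, 1) for an integer lattice point."""
        h = (ix * 374761393 + iy * 668265263 + iz * 2147483647 + seed * 144665) & 0xFFFFFFFF
        h = ((h ^ (h >> 13)) * 1274126177) & 0xFFFFFFFF
        return (h & 0xFFFF) / 65536.0

    def value_noise(x, y, z, seed=0):
        """Trilinearly interpolated value noise at a 3-D point."""
        ix, iy, iz = math.floor(x), math.floor(y), math.floor(z)
        fx, fy, fz = x - ix, y - iy, z - iz
        # Smoothstep fade so the interpolation is C1-continuous across cells.
        fx, fy, fz = (3 - 2*fx)*fx*fx, (3 - 2*fy)*fy*fy, (3 - 2*fz)*fz*fz
        def lerp(a, b, t):
            return a + (b - a) * t
        c = [[[_lattice(ix+dx, iy+dy, iz+dz, seed) for dz in (0, 1)]
              for dy in (0, 1)] for dx in (0, 1)]
        return lerp(
            lerp(lerp(c[0][0][0], c[0][0][1], fz), lerp(c[0][1][0], c[0][1][1], fz), fy),
            lerp(lerp(c[1][0][0], c[1][0][1], fz), lerp(c[1][1][0], c[1][1][1], fz), fy), fx)

    def fbm(x, y, z, octaves=5, lacunarity=2.0, gain=0.5):
        """Fractal Brownian motion: sum noise octaves at rising frequency, falling amplitude."""
        amp, freq, total = 1.0, 1.0, 0.0
        for _ in range(octaves):
            total += amp * value_noise(x * freq, y * freq, z * freq)
            amp *= gain
            freq *= lacunarity
        return total

    def terrain_density(x, y, z):
        """Signed density: positive below the surface, negative above.
        Marching cubes would extract the zero isosurface of this field."""
        return fbm(x * 0.05, 1.7, z * 0.05) * 20.0 - y
    ```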

    An enhanced performance model for metamorphic computer virus classification and detection

    A metamorphic computer virus employs various code mutation techniques to change its code into new generations. These generations have similar behaviour and functionality, and yet they cannot be detected by most commercial antivirus products, because those products depend on a signature database and make use of string signature-based detection methods, and such detection engines can be evaded by metamorphism techniques. The purpose of this study is to develop a performance model for computer virus classification and detection, one able to examine portable executable files and classify and detect the metamorphic computer viruses within them.

    A Hidden Markov Model implemented on portable executable files was employed to classify and detect the metamorphic viruses. The proposed model, which captures common statistical patterns of a virus, was evaluated by comparing its results with previous related work and with well-known commercial antivirus products. This was done by investigating metamorphic computer viruses and their features, and the existing classification and detection methods. Specifically, the model was applied to the binary format of portable executable files and was able to classify whether a file belonged to a virus family. The performance of the model, practically implemented and tested, was also evaluated in terms of detection rate and overall accuracy. The findings indicate that the proposed model is able to classify and detect metamorphic virus variants in portable executable file format with a high average detection rate of 99.7%. The implementation of the model is proven useful and applicable for antivirus programs.
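    The abstract does not give implementation detail; in the metamorphic-virus HMM literature, a model is typically trained (e.g. with Baum-Welch) on opcode sequences extracted from a virus family, and new files are scored with the forward algorithm. The sketch below shows only that scoring step in log space; the trained parameters, the opcode extraction from the portable executable, and the decision threshold are all assumptions, not the thesis's code.

    ```python
    # Minimal sketch of HMM likelihood scoring for family classification.
    import numpy as np
    from scipy.special import logsumexp

    def log_forward(obs, log_pi, log_A, log_B):
        """Log-space forward algorithm: returns log P(obs | model).

        obs    : integer observation symbols (e.g. opcode IDs from a disassembled file)
        log_pi : (N,)   log initial-state probabilities
        log_A  : (N, N) log transition matrix, log_A[i, j] = log P(state j | state i)
        log_B  : (N, M) log emission matrix over an alphabet of M opcodes
        """
        alpha = log_pi + log_B[:, obs[0]]
        for symbol in obs[1:]:
            # alpha_t(j) = logsum_i( alpha_{t-1}(i) + log A[i, j] ) + log B[j, o_t]
            alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, symbol]
        return logsumexp(alpha)

    def is_family_member(obs, model, threshold):
        """Flag a file as a family member when its length-normalised
        log-likelihood under the family's trained HMM clears a threshold
        (the threshold is assumed, e.g. set from scores of known benign files)."""
        log_pi, log_A, log_B = model
        score = log_forward(list(obs), log_pi, log_A, log_B) / len(obs)
        return score > threshold
    ```

    Normalising the log-likelihood by sequence length is the usual way to make scores comparable across files of different sizes before thresholding.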

    Fast head profile estimation using curvature, derivatives and deep learning methods

    Fast estimation of head profile and posture has applications across many disciplines; for example, it can be used in sleep apnoea screening and orthodontic examination, or could support a suitable physiotherapy regime. Consequently, this thesis focuses on the investigation of methods to estimate head profile and posture efficiently and accurately, and results in the development and evaluation of datasets, features and deep learning models that can achieve this.

    Accordingly, this thesis initially investigated properties of contour curves that could act as effective features for training machine learning models. Features based on curvature and the first and second Gaussian derivatives were evaluated. These outperformed established features used in the literature to train a long short-term memory recurrent neural network, and produced a significant speedup in execution time where pre-filtering of a sampled dataset was required. Following on from this, a new dataset of head profile contours was generated and annotated with anthropometric cranio-facial landmarks, and a novel method of automatically improving the accuracy of the landmark positions was developed using ideas based on the curvature of a plane curve. The features identified here were extracted from the new head profile contour dataset and used to train long short-term memory recurrent neural networks. The best network, using Gaussian derivative features, achieved an accuracy of 91% and a macro F1 score of 91%, improvements of 51% and 71% respectively when compared with the unprocessed contour feature. When using Gaussian derivative features, the network was able to regress landmarks accurately, with mean absolute errors ranging from 0 to 5.3 pixels and standard deviations ranging from 0 to 6.9.

    End-to-end machine learning approaches, where a deep neural network learns the best features to use from the raw input data, were also investigated. Such an approach, using a one-dimensional temporal convolutional network, was able to match the previous classifiers in terms of accuracy and macro F1 score, and showed comparable regression abilities. However, this was at the expense of increased training and inference times: this network was an order of magnitude slower when classifying and regressing contours.
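    The thesis's feature-extraction code is not given here; the sketch below shows the standard way to compute the curvature of a plane curve from first and second Gaussian derivatives of the contour coordinates, which matches the features the abstract names. The formula is the classical plane-curve curvature; the use of SciPy, the sigma value, and the function name are assumptions, and the windowing of such features into sequences for the LSTM is not shown.

    ```python
    # Sketch: curvature of a sampled contour via Gaussian derivative filters.
    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def contour_curvature(x, y, sigma=3.0):
        """Curvature of a plane contour (x(t), y(t)) at scale sigma.

        Convolving each coordinate with the first and second derivatives of a
        Gaussian (order=1, order=2) yields smoothed x', y', x'', y'', from which
            kappa = (x' * y'' - y' * x'') / (x'^2 + y'^2)^(3/2)
        follows directly.
        """
        x1 = gaussian_filter1d(x, sigma, order=1)
        y1 = gaussian_filter1d(y, sigma, order=1)
        x2 = gaussian_filter1d(x, sigma, order=2)
        y2 = gaussian_filter1d(y, sigma, order=2)
        denom = (x1 ** 2 + y1 ** 2) ** 1.5
        # Guard against division by zero on degenerate (stationary) samples.
        return (x1 * y2 - y1 * x2) / np.maximum(denom, 1e-12)
    ```

    Because the Gaussian derivative is a single separable convolution per coordinate, features at several scales can be computed far faster than repeated explicit smoothing followed by finite differencing, which is consistent with the speedup the abstract reports.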