564 research outputs found

    Knowledge modeling of phishing emails

    Get PDF
    This dissertation investigates whether or not malicious phishing emails are detected better when a meaningful representation of the email bodies is available. The natural language processing theory of Ontological Semantics Technology is used for its ability to model the knowledge representation present in the email messages. Known good and phishing emails were analyzed and their meaning representations fed into machine learning binary classifiers. Unigram language models of the same emails were used as a baseline for comparing the performance of the meaningful data. The end results show how a binary classifier trained on meaningful data is better at detecting phishing emails than a unigram language model binary classifier at least using some of the selected machine learning algorithms

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    Get PDF
    2021 Spring.Includes bibliographical references.Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. This includes efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification

    Contributions to the study of Austism Spectrum Brain conectivity

    Get PDF
    164 p.Autism Spectrum Disorder (ASD) is a largely prevalent neurodevelopmental condition with a big social and economical impact affecting the entire life of families. There is an intense search for biomarkers that can be assessed as early as possible in order to initiate treatment and preparation of the family to deal with the challenges imposed by the condition. Brain imaging biomarkers have special interest. Specifically, functional connectivity data extracted from resting state functional magnetic resonance imaging (rs-fMRI) should allow to detect brain connectivity alterations. Machine learning pipelines encompass the estimation of the functional connectivity matrix from brain parcellations, feature extraction and building classification models for ASD prediction. The works reported in the literature are very heterogeneous from the computational and methodological point of view. In this Thesis we carry out a comprehensive computational exploration of the impact of the choices involved while building these machine learning pipelines

    On the adaptive advantage of always being right (even when one is not)

    Get PDF
    We propose another positive illusion – overconfidence in the generalisability of one’s theory – that fits with McKay & Dennett’s (M&D’s) criteria for adaptive misbeliefs. This illusion is pervasive in adult reasoning but we focus on its prevalence in children’s developing theories. It is a strongly held conviction arising from normal functioning of the doxastic system that confers adaptive advantage on the individual

    An analysis of the efficiency of ontology and symbolic learning algorithms in indigenous knowledge representation

    Get PDF
    It is without a doubt that machine learning has been the area of focus in early days of artificial intelligence, but the early neural networks approach suffered some shortcomings and this led to a temporary decline in research capacity. New symbolic learning techniques have emerged since then which have yielded promising results and have led to a revival in research in machine learning. This has seen many researchers focusing on these techniques and experimenting with them by comparing their performances for different applications. With that in mind, the research thus decided to make an analysis of the symbolic approach against other approaches such as the neural network (connectionist) to evaluate the power of the former approach. This was done by first generating an ontology that acted as a representation of some collected indigenous knowledge. It is from this ontology that a dataset was generated. The dataset was made ambiguous to see the learning power of classifiers in such data. Two experiments were done, one using WEKA and the other using Orange as tools. The reason why the two experiments were used is because there was not a single tool which contained all the required learning algorithms. The research wanted to make use of ID3 and CN2 symbolic algorithm. However, WEKA had ID3 and not CN2 while Orange had CN2 and not ID3. The most important attributes from the ontology regarding the indigenous knowledge were the name of the plant, the province it is found and the disease the plant treats. Therefore the dataset had two features which were disease and province and one label which was the name of the plant. The learning algorithm was to use the two features to generate rules used to predict the label. However, there was ambiguity on the dataset. The challenge was that two different labels would contain the same features, thus leading to wrongful classification. This was the core of the research. Even though the learning model concluded this situation as wrongful classification, in real time, a system using the same learning model would provide desired and correct results. The only flow which was there is that the learning model simply used one label to predict under and ignore the other label with similar features. This was identified as a flow for both symbolic and non-symbolic algorithms. There is no way of giving suggestions in the case a user wants a different plant but with similar features. Therefore for classification using an ambiguous dataset, both these approaches proved to have the fore mentions flow. The research then decided to use recall to analyze the power of these approaches. It was discovered that ID3 has better recall than Multilayer perceptron and NaĂŻve Bayes algorithms when using a training set. ID3 managed to recall clearly and effectively three of its classes by a probability of 1 while Bayes Net had only one class with recall probability of 1. To further investigate the issue of recall, cross validation was used to contrast the competence of recall of the three algorithms to strengthen the assertion that indeed ID3 has a better recall as compared to the other two algorithms. Three stages of cross-validation were done, one stage using 10 fold, the other 20 fold, and the last using 50 fold. For all the different stages of crossvalidation, Bayes Net proved to perform better in terms of recall than the other two algorithms. In cross-validation, MLP could recall approximately above 88% of the instances available in contrast to when using training set where the algorithm recall only two out of 18 instances. In overall the symbolic approach proved to be a commendable approach for use over the nonsymbolic approach. The study of machine learning involves the building of learning algorithms, improving upon learning algorithms or making comparisons of machine learning algorithms. The research raised awareness on some improvements that need to be done on not only symbolic algorithms but non-symbolic ones as well. Some improvements include improving on or coming up with algorithms that suggest alternative predictions in cases of ambiguity instead of doing wrongful classification and not reflect on other possibilities

    Exploring the value of big data analysis of Twitter tweets and share prices

    Get PDF
    Over the past decade, the use of social media (SM) such as Facebook, Twitter, Pinterest and Tumblr has dramatically increased. Using SM, millions of users are creating large amounts of data every day. According to some estimates ninety per cent of the content on the Internet is now user generated. Social Media (SM) can be seen as a distributed content creation and sharing platform based on Web 2.0 technologies. SM sites make it very easy for its users to publish text, pictures, links, messages or videos without the need to be able to program. Users post reviews on products and services they bought, write about their interests and intentions or give their opinions and views on political subjects. SM has also been a key factor in mass movements such as the Arab Spring and the Occupy Wall Street protests and is used for human aid and disaster relief (HADR). There is a growing interest in SM analysis from organisations for detecting new trends, getting user opinions on their products and services or finding out about their online reputation. Companies such as Amazon or eBay use SM data for their recommendation engines and to generate more business. TV stations buy data about opinions on their TV programs from Facebook to find out what the popularity of a certain TV show is. Companies such as Topsy, Gnip, DataSift and Zoomph have built their entire business models around SM analysis. The purpose of this thesis is to explore the economic value of Twitter tweets. The economic value is determined by trying to predict the share price of a company. If the share price of a company can be predicted using SM data, it should be possible to deduce a monetary value. There is limited research on determining the economic value of SM data for “nowcasting”, predicting the present, and for forecasting. This study aims to determine the monetary value of Twitter by correlating the daily frequencies of positive and negative Tweets about the Apple company and some of its most popular products with the development of the Apple Inc. share price. If the number of positive tweets about Apple increases and the share price follows this development, the tweets have predictive information about the share price. A literature review has found that there is a growing interest in analysing SM data from different industries. A lot of research is conducted studying SM from various perspectives. Many studies try to determine the impact of online marketing campaigns or try to quantify the value of social capital. Others, in the area of behavioural economics, focus on the influence of SM on decision-making. There are studies trying to predict financial indicators such as the Dow Jones Industrial Average (DJIA). However, the literature review has indicated that there is no study correlating sentiment polarity on products and companies in tweets with the share price of the company. The theoretical framework used in this study is based on Computational Social Science (CSS) and Big Data. Supporting theories of CSS are Social Media Mining (SMM) and sentiment analysis. Supporting theories of Big Data are Data Mining (DM) and Predictive Analysis (PA). Machine learning (ML) techniques have been adopted to analyse and classify the tweets. In the first stage of the study, a body of tweets was collected and pre-processed, and then analysed for their sentiment polarity towards Apple Inc., the iPad and the iPhone. Several datasets were created using different pre-processing and analysis methods. The tweet frequencies were then represented as time series. The time series were analysed against the share price time series using the Granger causality test to determine if one time series has predictive information about the share price time series over the same period of time. For this study, several Predictive Analytics (PA) techniques on tweets were evaluated to predict the Apple share price. To collect and analyse the data, a framework has been developed based on the LingPipe (LingPipe 2015) Natural Language Processing (NLP) tool kit for sentiment analysis, and using R, the functional language and environment for statistical computing, for correlation analysis. Twitter provides an API (Application Programming Interface) to access and collect its data programmatically. Whereas no clear correlation could be determined, at least one dataset was showed to have some predictive information on the development of the Apple share price. The other datasets did not show to have any predictive capabilities. There are many data analysis and PA techniques. The techniques applied in this study did not indicate a direct correlation. However, some results suggest that this is due to noise or asymmetric distributions in the datasets. The study contributes to the literature by providing a quantitative analysis of SM data, for example tweets about Apple and its most popular products, the iPad and iPhone. It shows how SM data can be used for PA. It contributes to the literature on Big Data and SMM by showing how SM data can be collected, analysed and classified and explore if the share price of a company can be determined based on sentiment time series. It may ultimately lead to better decision making, for instance for investments or share buyback

    Data-driven design of intelligent wireless networks: an overview and tutorial

    Get PDF
    Data science or "data-driven research" is a research approach that uses real-life data to gain insight about the behavior of systems. It enables the analysis of small, simple as well as large and more complex systems in order to assess whether they function according to the intended design and as seen in simulation. Data science approaches have been successfully applied to analyze networked interactions in several research areas such as large-scale social networks, advanced business and healthcare processes. Wireless networks can exhibit unpredictable interactions between algorithms from multiple protocol layers, interactions between multiple devices, and hardware specific influences. These interactions can lead to a difference between real-world functioning and design time functioning. Data science methods can help to detect the actual behavior and possibly help to correct it. Data science is increasingly used in wireless research. To support data-driven research in wireless networks, this paper illustrates the step-by-step methodology that has to be applied to extract knowledge from raw data traces. To this end, the paper (i) clarifies when, why and how to use data science in wireless network research; (ii) provides a generic framework for applying data science in wireless networks; (iii) gives an overview of existing research papers that utilized data science approaches in wireless networks; (iv) illustrates the overall knowledge discovery process through an extensive example in which device types are identified based on their traffic patterns; (v) provides the reader the necessary datasets and scripts to go through the tutorial steps themselves

    Intelligent Energy Management with IoT Framework in Smart Cities Using Intelligent Analysis: An Application of Machine Learning Methods for Complex Networks and Systems

    Full text link
    Smart buildings are increasingly using Internet of Things (IoT)-based wireless sensing systems to reduce their energy consumption and environmental impact. As a result of their compact size and ability to sense, measure, and compute all electrical properties, Internet of Things devices have become increasingly important in our society. A major contribution of this study is the development of a comprehensive IoT-based framework for smart city energy management, incorporating multiple components of IoT architecture and framework. An IoT framework for intelligent energy management applications that employ intelligent analysis is an essential system component that collects and stores information. Additionally, it serves as a platform for the development of applications by other companies. Furthermore, we have studied intelligent energy management solutions based on intelligent mechanisms. The depletion of energy resources and the increase in energy demand have led to an increase in energy consumption and building maintenance. The data collected is used to monitor, control, and enhance the efficiency of the system
    • …
    corecore