
    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are detecting unknown attacks and reducing the false positive ratio. This research work targeted both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns; high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic), and high accuracy was achieved in all three scenarios, even though the malicious activity differs significantly from one domain to another. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. Results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy.
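    The third of these techniques can be illustrated with a minimal sketch, assuming synthetic flow features, a small labeled subset used only for feature selection, and an arbitrary error percentile as the anomaly cutoff; the thesis's actual datasets, architecture and thresholds are not reproduced here.

        # Hedged sketch: unsupervised anomaly detection with an autoencoder,
        # using a (supervised) random forest only to select features.
        # The data, feature count and 95th-percentile cutoff are assumptions.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.neural_network import MLPRegressor
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 40))      # stand-in for network-flow features
        y = rng.integers(0, 2, size=2000)    # labeled subset, for selection only

        # 1) Supervised feature selection: keep the top-16 features by RF importance.
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        top = np.argsort(rf.feature_importances_)[::-1][:16]
        X_sel = StandardScaler().fit_transform(X[:, top])

        # 2) Unsupervised autoencoder: an MLP trained to reconstruct its own input.
        ae = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
        ae.fit(X_sel, X_sel)

        # 3) Flag the largest reconstruction errors as likely intrusions.
        errors = np.mean((ae.predict(X_sel) - X_sel) ** 2, axis=1)
        flagged = np.where(errors > np.percentile(errors, 95))[0]
        print(f"{len(flagged)} flows flagged as anomalous")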

    LEVERAGING MACHINE LEARNING TO IDENTIFY QUALITY ISSUES IN THE MEDICAID CLAIM ADJUDICATION PROCESS

    Medicaid is the largest health insurance program in the U.S. It provides health coverage to over 68 million individuals, costs the nation over $600 billion a year, and is subject to improper payments (fraud, waste, and abuse) and inaccurate payments (claims processed erroneously). Medicaid programs partially use Fee-For-Service (FFS) to provide coverage to beneficiaries by adjudicating claims, and they leverage traditional inferential statistics to verify the quality of adjudicated claims. These quality methods only provide an interval estimate of the quality errors and are incapable of detecting most claim adjudication errors, representing potentially millions of dollars in opportunity costs. This dissertation studied a method of applying supervised learning to detect erroneous payments in the entire population of adjudicated claims in each Medicaid Management Information System (MMIS), focusing on two specific claim types: inpatient and outpatient. A synthesized source of adjudicated claims generated by the Centers for Medicare & Medicaid Services (CMS) was used to create the original dataset. Quality reports from California FFS Medicaid were used to extract the underlying statistical pattern of claim adjudication errors in each Medicaid FFS and to label the data, using goodness-of-fit and Anderson-Darling tests. Principal Component Analysis (PCA) and business knowledge were applied for dimensionality reduction, resulting in the selection of sixteen (16) features for the outpatient and nineteen (19) features for the inpatient claims models. Ten (10) supervised learning algorithms were trained and tested on the labeled data: Decision Tree with two configurations (Entropy and Gini), Random Forest with two configurations (Entropy and Gini), Naïve Bayes, K Nearest Neighbors, Logistic Regression, Neural Network, Discriminant Analysis, and Gradient Boosting. Five-fold (5) cross-validation and event-based sampling were applied during the training process (with oversampling using the SMOTE method and stratification within oversampling). The prediction power (Gini importance) of the selected features was measured using the Mean Decrease in Impurity (MDI) method across three algorithms. A one-way ANOVA and Tukey and Fisher LSD pairwise comparisons were conducted. Results show that the Claim Payment Amount significantly outperforms the rest in prediction power (highest mean F-value for Gini importance at the α = 0.05 significance level) for both claim types. Finally, all algorithms' recall and F1-scores were measured for both claim types (inpatient and outpatient), with and without oversampling. A one-way ANOVA and Tukey and Fisher LSD pairwise comparisons were conducted. The results show a statistically significant difference in the algorithms' performance in detecting quality issues in the outpatient and inpatient claims. Gradient Boosting and Decision Tree (with various configurations and sampling strategies) outperform the rest of the algorithms in recall and F1-measure on both datasets. Logistic Regression shows better recall on the outpatient than the inpatient data, and Naïve Bayes performs considerably better in recall and F1-score on outpatient data. Medicaid FFS programs and consultants, Medicaid administrators, and researchers could use this study to develop machine learning models that detect quality issues in Medicaid FFS claim datasets at scale, potentially saving millions of dollars.
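    As a rough illustration of the training setup described above, the sketch below runs SMOTE oversampling inside five-fold cross-validation and reads out MDI (Gini) importances from a gradient boosting model; the synthetic data, 3% error rate and feature count are assumptions, not the dissertation's actual claims data.

        # Hedged sketch: SMOTE inside each CV training fold (to avoid leakage),
        # gradient boosting, recall/F1 scoring, and MDI feature importance.
        import numpy as np
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.model_selection import cross_validate

        rng = np.random.default_rng(1)
        X = rng.normal(size=(5000, 16))            # e.g. 16 outpatient features
        y = (rng.random(5000) < 0.03).astype(int)  # rare adjudication errors

        model = Pipeline([
            ("smote", SMOTE(random_state=1)),
            ("gbm", GradientBoostingClassifier(random_state=1)),
        ])
        scores = cross_validate(model, X, y, cv=5, scoring=["recall", "f1"])
        print("recall:", scores["test_recall"].mean(),
              "f1:", scores["test_f1"].mean())

        # MDI (Gini) importance, e.g. to compare a feature like Claim Payment
        # Amount against the rest; refit on all data just to read it out.
        gbm = GradientBoostingClassifier(random_state=1).fit(X, y)
        print("top feature index by MDI:", int(np.argmax(gbm.feature_importances_)))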

    Sports Data Mining Technology Used in Basketball Outcome Prediction

    Driven by the increasingly comprehensive data in sports datasets and the success of data mining techniques in other areas, sports data mining has emerged and enables us to find hidden knowledge that can impact the sport industry. Predicting the outcomes of sporting events has always been challenging and attractive work, and it is therefore drawing wide interest as a research field. This project focuses on using machine learning algorithms to build a model for predicting NBA game outcomes; the algorithms involved are the Simple Logistic Classifier, Artificial Neural Networks, SVM and Naïve Bayes. To obtain a convincing result, data from 5 regular NBA seasons was collected for model training and data from 1 NBA regular season was used as the scoring dataset. After automated data collection and cloud-enabled data management, a data mart containing NBA statistics was built. The machine learning models mentioned above were then trained and tested on data from the data mart. After applying the scoring dataset to evaluate model accuracy, the Simple Logistic Classifier yielded the best result, with an accuracy of 69.67%. The results obtained are compared to other methods from different sources. The results of this project are more persuasive because such a vast quantity of data was applied, and they can serve as a reference for future work.
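    A minimal sketch of the winning approach, a simple logistic classifier over per-game team statistics, is shown below; the synthetic features, dataset sizes and column meanings are assumptions standing in for the project's data mart.

        # Hedged sketch: logistic regression predicting home-team wins from
        # aggregate team statistics. Data shapes mirror the 5-season training /
        # 1-season scoring split described above; the features are synthetic.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        rng = np.random.default_rng(2)
        X_train = rng.normal(size=(6000, 10))    # ~5 regular seasons of games
        y_train = rng.integers(0, 2, size=6000)  # 1 = home win
        X_score = rng.normal(size=(1200, 10))    # 1 held-out season for scoring
        y_score = rng.integers(0, 2, size=1200)

        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print("scoring accuracy:", accuracy_score(y_score, clf.predict(X_score)))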

    Opinion mining with the SentiWordNet lexical resource

    Sentiment classification concerns the application of automatic methods for predicting the orientation of sentiment present in text documents. It is an important subject in opinion mining research, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval. SentiWordNet is a lexical resource of sentiment information for terms in the English language designed to assist in opinion mining tasks, where each term is associated with numerical scores for positive and negative sentiment information. A resource that makes term-level sentiment information readily available could be of use in building more effective sentiment classification methods. This research presents the results of an experiment that applied the SentiWordNet lexical resource to the problem of automatic sentiment classification of film reviews. First, a data set of relevant features extracted from text documents using SentiWordNet was designed and implemented. The resulting feature set was then used as input for training a support vector machine classifier to predict the sentiment orientation of the underlying film review. Several scenarios were executed, exploring variations on the parameters that generate the data set, outlier removal and feature selection. The results obtained are compared to other methods documented in the literature. They are in line with other experiments that propose similar approaches and use the same data set of film reviews, indicating SentiWordNet could become an important resource for the task of sentiment classification. Considerations on future improvements are also presented, based on a detailed analysis of the classification results.
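    The feature-extraction step can be sketched with NLTK's SentiWordNet interface: aggregate positive and negative scores per document, then feed the result to an SVM. Averaging over all senses of a word and the two-feature document representation are simplifying assumptions; the thesis evaluates richer variants.

        # Hedged sketch: mean SentiWordNet positive/negative scores per document,
        # used as SVM input. Sense disambiguation is skipped (all senses averaged).
        import nltk
        from nltk.corpus import sentiwordnet as swn
        from sklearn.svm import SVC

        nltk.download("wordnet", quiet=True)
        nltk.download("sentiwordnet", quiet=True)

        def doc_features(text):
            """Mean positive and negative SentiWordNet scores over the tokens."""
            pos = neg = n = 0.0
            for word in text.lower().split():
                synsets = list(swn.senti_synsets(word))
                if synsets:  # average the scores of every sense of the word
                    pos += sum(s.pos_score() for s in synsets) / len(synsets)
                    neg += sum(s.neg_score() for s in synsets) / len(synsets)
                    n += 1
            return [pos / n, neg / n] if n else [0.0, 0.0]

        docs = ["a wonderful touching film", "dull plot and terrible acting"]
        labels = [1, 0]  # 1 = positive review
        clf = SVC(kernel="linear").fit([doc_features(d) for d in docs], labels)
        print(clf.predict([doc_features("a sadly wasted wonderful cast")]))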

    SENTIMENT AND BEHAVIORAL ANALYSIS IN EDISCOVERY

    A suspect or person-of-interest during legal case review or forensic evidence review can exhibit signs of their individual personality through the digital evidence collected for the case. Such personality traits of interest can be analytically harvested for case investigators or case reviewers. However, manual review of evidence for such flags takes time and contributes to increased costs. This study focuses on certain use-case scenarios of behavior and sentiment analysis as a critical requirement for a legal case's success. It aims to quicken the review and analysis phase and offers a software prototype as a proof-of-concept. The study starts with the build and storage of Electronically Stored Information (ESI) datasets for three separate fictitious legal cases, using publicly available data such as emails, Facebook posts, tweets, text messages and a few custom MS Word documents. The next step leverages statistical algorithms and automation to propose approaches for identifying human sentiments and behaviors, such as evidence of financial fraud or of sexual harassment by a suspect or person-of-interest, from the case ESI. The last stage automates these approaches via custom software and presents a user interface for eDiscovery teams and digital forensic investigators.
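    One such analysis pass might look like the sketch below, which scores each ESI item's sentiment and flags matches against a keyword list for a behavior of interest. The VADER analyzer, the keyword list and the flagging rule are illustrative stand-ins, not the study's actual statistical methods.

        # Hedged sketch: flag ESI items by negative sentiment or by hits on an
        # assumed fraud-related keyword list. Thresholds are arbitrary.
        import nltk
        from nltk.sentiment import SentimentIntensityAnalyzer

        nltk.download("vader_lexicon", quiet=True)
        sia = SentimentIntensityAnalyzer()

        FRAUD_TERMS = {"wire", "offshore", "shred", "invoices"}  # assumed list

        esi_items = [
            "Shred those invoices before the audit, please.",
            "Great game last night! See you at lunch.",
        ]
        for item in esi_items:
            score = sia.polarity_scores(item)["compound"]
            tokens = set(item.lower().replace(",", " ").replace(".", " ").split())
            hits = FRAUD_TERMS & tokens
            if hits or score < -0.5:  # assumed flagging rule
                print(f"FLAG ({score:+.2f}, terms={sorted(hits)}): {item}")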

    Big data techniques in auditing research and practice: current trends and future opportunities

    This paper analyzes the use of big data techniques in auditing, and finds that the practice is not as widespread as it is in other related fields. We first introduce contemporary big data techniques to promote understanding of their potential application. Next, we review existing research on big data in accounting and finance. In addition to auditing, our analysis shows that existing research extends across three other genealogies: financial distress modelling, financial fraud modelling, and stock market prediction and quantitative modelling. Auditing is lagging behind the other research streams in the use of valuable big data techniques. A possible explanation is that auditors are reluctant to use techniques that are far ahead of those adopted by their clients, but we refute this argument. We call for more research and a greater alignment to practice. We also outline future opportunities for auditing in the context of real-time information and in collaborative platforms and peer-to-peer marketplaces.

    Unknown Threat Detection With Honeypot Ensemble Analysis Using Big Data Security Architecture

    The amount of data that is being generated continues to grow rapidly in size and complexity. Frameworks such as Apache Hadoop and Apache Spark are evolving at a rapid rate as organizations build data-driven applications to gain competitive advantages. Data analytics frameworks decompose our problems so we can build applications that do more than inference, helping make predictions as well as prescriptions in real time instead of in batch processes. Information security is becoming more important to organizations as the Internet and cloud technologies become more integrated with their internal processes. The number of attacks and attack vectors has been increasing steadily over the years. Border defense measures (e.g. Intrusion Detection Systems) are no longer enough to identify and stop attackers. Data-driven information security is not a new approach; however, there is an increased emphasis on combining heterogeneous sources to gain a broader view of the problem instead of relying on isolated systems. Stitching together multiple alerts into a cohesive system can increase the number of True Positives. With the increased concern about unknown insider threats and zero-day attacks, identifying unknown attack vectors becomes more difficult. Previous research has shown that with as little as 10 commands it is possible to identify a masquerade attack against a user's profile. This thesis looks at a data-driven information security architecture that relies on both behavioral analysis of SSH profiles and bad actor data collected from an SSH honeypot to identify bad actor attack vectors. Honeypots should collect data only from bad actors and therefore have a high True Positive rate. Using Apache Spark and Apache Hadoop, we can create a real-time data-driven architecture that collects and analyzes new bad actor behaviors from honeypot data and monitors legitimate user accounts to create predictive and prescriptive models. Previously unidentified attack vectors can be cataloged for review.
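    The honeypot side of such an architecture could be sketched as below: Spark loads SSH honeypot events and builds per-source command profiles. The log path and the Cowrie-style event schema (eventid, src_ip, input) are assumptions about the deployment, not details given in the thesis.

        # Hedged sketch: profile attacker commands from SSH honeypot JSON logs
        # with Apache Spark. Path and field names assume a Cowrie-like format.
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("honeypot-profiles").getOrCreate()
        logs = spark.read.json("hdfs:///security/honeypot/ssh/*.json")

        # Per-attacker command frequencies; prior work suggests ~10 commands can
        # be enough to characterize a masquerade attempt against a profile.
        profiles = (
            logs.where(F.col("eventid") == "cowrie.command.input")
                .groupBy("src_ip", "input")
                .count()
                .orderBy(F.desc("count"))
        )
        profiles.show(20, truncate=False)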

    Exploring the value of big data analysis of Twitter tweets and share prices

    Over the past decade, the use of social media (SM) such as Facebook, Twitter, Pinterest and Tumblr has dramatically increased. Using SM, millions of users are creating large amounts of data every day. According to some estimates, ninety per cent of the content on the Internet is now user generated. SM can be seen as a distributed content creation and sharing platform based on Web 2.0 technologies. SM sites make it very easy for their users to publish text, pictures, links, messages or videos without the need to be able to program. Users post reviews on products and services they bought, write about their interests and intentions or give their opinions and views on political subjects. SM has also been a key factor in mass movements such as the Arab Spring and the Occupy Wall Street protests and is used for humanitarian aid and disaster relief (HADR). There is a growing interest in SM analysis from organisations for detecting new trends, getting user opinions on their products and services or finding out about their online reputation. Companies such as Amazon or eBay use SM data for their recommendation engines and to generate more business. TV stations buy data about opinions on their TV programs from Facebook to gauge the popularity of a certain TV show. Companies such as Topsy, Gnip, DataSift and Zoomph have built their entire business models around SM analysis. The purpose of this thesis is to explore the economic value of Twitter tweets. The economic value is determined by trying to predict the share price of a company. If the share price of a company can be predicted using SM data, it should be possible to deduce a monetary value. There is limited research on determining the economic value of SM data for “nowcasting”, predicting the present, and for forecasting. This study aims to determine the monetary value of Twitter by correlating the daily frequencies of positive and negative tweets about the Apple company and some of its most popular products with the development of the Apple Inc. share price. If the number of positive tweets about Apple increases and the share price follows this development, the tweets carry predictive information about the share price. A literature review has found that there is a growing interest in analysing SM data from different industries. A great deal of research has been conducted studying SM from various perspectives. Many studies try to determine the impact of online marketing campaigns or to quantify the value of social capital. Others, in the area of behavioural economics, focus on the influence of SM on decision-making. There are studies trying to predict financial indicators such as the Dow Jones Industrial Average (DJIA). However, the literature review indicated that there is no study correlating sentiment polarity on products and companies in tweets with the share price of the company. The theoretical framework used in this study is based on Computational Social Science (CSS) and Big Data. Supporting theories of CSS are Social Media Mining (SMM) and sentiment analysis. Supporting theories of Big Data are Data Mining (DM) and Predictive Analysis (PA). Machine learning (ML) techniques have been adopted to analyse and classify the tweets. In the first stage of the study, a body of tweets was collected and pre-processed, and then analysed for their sentiment polarity towards Apple Inc., the iPad and the iPhone. Several datasets were created using different pre-processing and analysis methods.
    The tweet frequencies were then represented as time series. The time series were analysed against the share price time series using the Granger causality test to determine whether one time series has predictive information about the share price time series over the same period. For this study, several Predictive Analytics (PA) techniques on tweets were evaluated to predict the Apple share price. To collect and analyse the data, a framework was developed based on the LingPipe (LingPipe 2015) Natural Language Processing (NLP) tool kit for sentiment analysis, and on R, the functional language and environment for statistical computing, for correlation analysis. Twitter provides an API (Application Programming Interface) to access and collect its data programmatically. Whereas no clear correlation could be determined, at least one dataset was shown to have some predictive information about the development of the Apple share price; the other datasets showed no predictive capability. There are many data analysis and PA techniques. The techniques applied in this study did not indicate a direct correlation; however, some results suggest that this is due to noise or asymmetric distributions in the datasets. The study contributes to the literature by providing a quantitative analysis of SM data, namely tweets about Apple and its most popular products, the iPad and iPhone. It shows how SM data can be used for PA, and it contributes to the literature on Big Data and SMM by showing how SM data can be collected, analysed and classified, and by exploring whether the share price of a company can be determined based on sentiment time series. It may ultimately lead to better decision-making, for instance for investments or share buybacks.
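    The Granger step can be sketched with statsmodels: test whether the daily positive-tweet series adds predictive information about the share price series. The column names, the differencing step and the maximum lag are assumptions; the thesis's actual datasets came from LingPipe output and were analysed in R.

        # Hedged sketch: does pos_tweets Granger-cause apple_close? The data
        # here is synthetic; in grangercausalitytests the second column is
        # tested as a predictor of the first.
        import numpy as np
        import pandas as pd
        from statsmodels.tsa.stattools import grangercausalitytests

        rng = np.random.default_rng(3)
        df = pd.DataFrame({
            "apple_close": rng.normal(size=250).cumsum() + 100,  # daily close
            "pos_tweets": rng.poisson(200, size=250),            # daily counts
        })
        data = df.diff().dropna()  # difference so both series are ~stationary

        # H0: pos_tweets does NOT Granger-cause apple_close, up to 5 lags.
        results = grangercausalitytests(data[["apple_close", "pos_tweets"]], maxlag=5)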