294 research outputs found

    An Empirical Study of Online Consumer Review Spam: A Design Science Approach

    Because of the sheer volume of consumer reviews posted to the Internet, a manual approach to the detection and analysis of fake reviews is not practical. However, automated detection of fake reviews is a very challenging research problem, given that fake reviews can look just like legitimate ones. Guided by the design science research methodology, one of the main contributions of our work is the development of a novel methodology and an instantiation which can effectively detect untruthful consumer reviews. The results of our experiment confirm that the proposed methodology outperforms well-known baseline methods for detecting untruthful reviews collected from amazon.com. Above all, the designed artifacts enable us to conduct an econometric analysis examining the impact of fake reviews on product sales. To the best of our knowledge, this is the first empirical study conducted to analyze the economic impact of fake consumer reviews.
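    The abstract does not spell out the detection features the methodology uses. As a purely illustrative sketch (the reviews and the similarity threshold below are invented, not taken from the thesis), one common baseline for review-spam detection flags near-duplicate reviews, a frequent signature of spam campaigns, by cosine similarity of their term-count vectors:

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    num = sum(a[t] * b[t] for t in a.keys() & b.keys())
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def flag_near_duplicates(reviews, threshold=0.9):
    # Flag any pair of reviews whose term vectors are nearly identical.
    vecs = [Counter(r.lower().split()) for r in reviews]
    flagged = set()
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) >= threshold:
                flagged |= {i, j}
    return sorted(flagged)

reviews = [
    "great product works perfectly would buy again",
    "great product works perfectly would buy again",  # verbatim repost
    "battery died after two weeks very disappointed",
]
print(flag_near_duplicates(reviews))  # → [0, 1]
```

    Real systems combine many such signals (reviewer behaviour, rating deviation, linguistic cues); duplicate detection alone is only a starting point.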

    Unsupervised learning for anomaly detection in Australian medical payment data

    Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3–8% of total expenditure to fraud. As Australia’s universal public health insurer, Medicare Australia spends approximately A$34 billion per annum on the Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme, so wasted spending of A$1–2.7 billion could be expected. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks. Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogeneous by nature. Increasing volumes of data and rapidly changing patterns bring challenges which require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available. In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods use association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on the costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques shows that the methods perform well.
The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is in use at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field.
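A hedged illustration of the "potentially recoverable costs" ranking idea: the claims data, service codes, and the use of a peer median as the "typical model" below are invented stand-ins, not the thesis's actual association-analysis or topic-model methods. Each provider is scored by how far their billed costs exceed the typical cost of each service:

```python
from statistics import median

# Toy claims: provider -> list of (service_code, cost). All names and figures are illustrative.
claims = {
    "prov_A": [("consult", 80), ("consult", 85), ("scan", 300)],
    "prov_B": [("consult", 80), ("consult", 82)],
    "prov_C": [("consult", 250), ("consult", 260), ("scan", 900)],  # consistently above peers
}

def typical_cost(claims):
    # "Typical model": per-service median cost across all providers.
    by_service = {}
    for items in claims.values():
        for code, cost in items:
            by_service.setdefault(code, []).append(cost)
    return {code: median(costs) for code, costs in by_service.items()}

def recoverable_costs(claims):
    # Score = total billed cost in excess of the typical cost; rank descending.
    typ = typical_cost(claims)
    scores = {
        p: sum(max(cost - typ[code], 0) for code, cost in items)
        for p, items in claims.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(recoverable_costs(claims))  # prov_C ranks first
```

    The thesis's methods replace the simple median here with interpretable, expert-editable models of typical claiming behaviour, but the ranking principle, excess over typical, is the same.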

    Exploring the value of big data analysis of Twitter tweets and share prices

    Over the past decade, the use of social media (SM) such as Facebook, Twitter, Pinterest and Tumblr has dramatically increased. Using SM, millions of users are creating large amounts of data every day. According to some estimates, ninety per cent of the content on the Internet is now user generated. SM can be seen as a distributed content creation and sharing platform based on Web 2.0 technologies. SM sites make it very easy for their users to publish text, pictures, links, messages or videos without needing to be able to program. Users post reviews on products and services they bought, write about their interests and intentions, or give their opinions and views on political subjects. SM has also been a key factor in mass movements such as the Arab Spring and the Occupy Wall Street protests and is used for human aid and disaster relief (HADR). There is growing interest among organisations in SM analysis for detecting new trends, getting user opinions on their products and services, or finding out about their online reputation. Companies such as Amazon or eBay use SM data for their recommendation engines and to generate more business. TV stations buy data about opinions on their TV programs from Facebook to find out how popular a certain TV show is. Companies such as Topsy, Gnip, DataSift and Zoomph have built their entire business models around SM analysis. The purpose of this thesis is to explore the economic value of Twitter tweets. The economic value is determined by trying to predict the share price of a company. If the share price of a company can be predicted using SM data, it should be possible to deduce a monetary value. There is limited research on determining the economic value of SM data for “nowcasting”, predicting the present, and for forecasting.
This study aims to determine the monetary value of Twitter by correlating the daily frequencies of positive and negative Tweets about the Apple company and some of its most popular products with the development of the Apple Inc. share price. If the number of positive tweets about Apple increases and the share price follows this development, the tweets have predictive information about the share price. A literature review has found that there is a growing interest in analysing SM data from different industries. A lot of research is conducted studying SM from various perspectives. Many studies try to determine the impact of online marketing campaigns or try to quantify the value of social capital. Others, in the area of behavioural economics, focus on the influence of SM on decision-making. There are studies trying to predict financial indicators such as the Dow Jones Industrial Average (DJIA). However, the literature review has indicated that there is no study correlating sentiment polarity on products and companies in tweets with the share price of the company. The theoretical framework used in this study is based on Computational Social Science (CSS) and Big Data. Supporting theories of CSS are Social Media Mining (SMM) and sentiment analysis. Supporting theories of Big Data are Data Mining (DM) and Predictive Analysis (PA). Machine learning (ML) techniques have been adopted to analyse and classify the tweets. In the first stage of the study, a body of tweets was collected and pre-processed, and then analysed for their sentiment polarity towards Apple Inc., the iPad and the iPhone. Several datasets were created using different pre-processing and analysis methods. The tweet frequencies were then represented as time series. The time series were analysed against the share price time series using the Granger causality test to determine if one time series has predictive information about the share price time series over the same period of time. 
For this study, several Predictive Analytics (PA) techniques were evaluated on tweets to predict the Apple share price. To collect and analyse the data, a framework was developed based on the LingPipe (LingPipe 2015) Natural Language Processing (NLP) tool kit for sentiment analysis, and on R, the functional language and environment for statistical computing, for correlation analysis. Twitter provides an API (Application Programming Interface) to access and collect its data programmatically. While no clear correlation could be determined, at least one dataset was shown to have some predictive information about the development of the Apple share price. The other datasets did not show any predictive capability. There are many data analysis and PA techniques. The techniques applied in this study did not indicate a direct correlation. However, some results suggest that this is due to noise or asymmetric distributions in the datasets. The study contributes to the literature by providing a quantitative analysis of SM data, in this case tweets about Apple and its most popular products, the iPad and iPhone. It shows how SM data can be used for PA. It contributes to the literature on Big Data and SMM by showing how SM data can be collected, analysed and classified, and by exploring whether the share price of a company can be determined based on sentiment time series. This may ultimately lead to better decision making, for instance for investments or share buybacks.
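The Granger causality step can be sketched as a lagged-regression F-test: does adding yesterday's sentiment improve an autoregressive model of today's price? This is a simplified stand-in for the thesis's R-based analysis; the simulated series, seed, and coefficients below are invented for illustration only.

```python
import numpy as np

def granger_f(price, sentiment, lag=1):
    """F-statistic testing whether lagged sentiment improves a 1-lag AR model of price."""
    y = price[lag:]
    x_r = np.column_stack([np.ones(len(y)), price[:-lag]])   # restricted: own lag only
    x_u = np.column_stack([x_r, sentiment[:-lag]])           # unrestricted: + sentiment lag
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(x_r), rss(x_u)
    df = len(y) - x_u.shape[1]
    return (rss_r - rss_u) / (rss_u / df)   # one restriction, so numerator df = 1

# Simulate a price series that genuinely reacts to the previous day's sentiment.
rng = np.random.default_rng(0)
sent = rng.normal(size=200)
price = np.zeros(200)
for t in range(1, 200):
    price[t] = 0.5 * price[t - 1] + 0.8 * sent[t - 1] + 0.1 * rng.normal()

print(granger_f(price, sent) > 10)  # → True: a large F flags predictive information
```

    In the study's real setting, the tweet-frequency and share-price series are far noisier than this simulation, which is consistent with only one dataset showing predictive information.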

    Applying Bayesian Growth Modeling In Machine Learning For Longitudinal Data

    There has been increasing interest in the use of Bayesian growth modeling in a machine learning environment to answer questions relating to patterns of change in trends of social and human behavior in longitudinal data. It is well understood that machine learning works properly with “big data,” because large sample sizes offer machines a better opportunity to “learn” the pattern/structure of the data from a training set and predict performance on an unseen testing set. Unfortunately, not all researchers have access to large samples, and there is a lack of methodological research addressing the utility of machine learning with longitudinal data based on small sample sizes. Additionally, there is limited methodological research on the moderating effect that priors have on other data conditions. Therefore, the purpose of the current study was to understand: (a) the interactive relationship between priors and sample sizes in longitudinal predictive modeling, (b) the interactive relationship between priors and the number of waves of data, and (c) the interactive relationship between priors and the proportion of cases in the two levels of a dichotomous time-invariant predictor for Bayesian growth modeling in a machine learning environment. Monte Carlo simulation was adopted to assess the above aspects, and data were generated based on alumni donation data from a university in the mid-Atlantic region, with model parameters set to mimic “real life” data as closely as possible. Results show that although all main and interaction effects are statistically significant, only the main effects of sample size and waves of data, and the interaction between them, show meaningful effect sizes. Additionally, under the prior conditions of the study, informative priors did not show any higher prediction accuracy than non-informative priors.
The indifference between informative and non-informative priors appears to be associated with model complexity and with competition between strongly informative and weakly informative priors. This study was one of the first known studies to examine Bayesian estimation in the context of machine learning. Results of the current study suggest that capitalizing on the advantages offered jointly by these two modeling approaches shows promise. Although much is still unknown and in need of investigation regarding the conditions under which a combination of Bayesian modeling and machine learning affects prediction accuracy, the current dissertation provides a first step in that direction.
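A minimal sketch of the kind of Monte Carlo setup described: simulate linear growth data for a small sample over a few waves, then compare a weakly informative and an informative prior via a simple conjugate normal update on the average slope. This is a two-stage simplification standing in for full Bayesian growth modeling, and every population value below is invented.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_growth(n=30, waves=4):
    """Linear growth data: y_it = intercept_i + slope_i * t + noise (illustrative values)."""
    intercepts = rng.normal(50, 5, n)
    slopes = rng.normal(2.0, 0.5, n)        # true mean slope = 2.0
    t = np.arange(waves)
    return intercepts[:, None] + slopes[:, None] * t + rng.normal(0, 1, (n, waves))

def posterior_mean_slope(y, prior_mean, prior_var, noise_var=1.0):
    """Conjugate normal update for the average slope, using per-person OLS slopes as data."""
    t = np.arange(y.shape[1])
    ols = ((y - y.mean(axis=1, keepdims=True)) * (t - t.mean())).sum(axis=1) \
          / ((t - t.mean()) ** 2).sum()
    n, xbar = len(ols), ols.mean()
    post_var = 1 / (1 / prior_var + n / noise_var)
    return post_var * (prior_mean / prior_var + n * xbar / noise_var)

y = simulate_growth()
flat = posterior_mean_slope(y, prior_mean=0.0, prior_var=100.0)  # weakly informative prior
info = posterior_mean_slope(y, prior_mean=2.0, prior_var=0.1)    # informative, centred at truth
print(round(flat, 2), round(info, 2))  # both recover a slope near 2.0
```

    Even in this toy version, both priors land near the true slope once a modest amount of data arrives, which echoes the study's finding that informative priors did not yield higher prediction accuracy under the conditions examined.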

    Proceedings of the Making Sense of Microposts Workshop (#Microposts2015) at the World Wide Web Conference


    An Investigation of Autism Support Groups on Facebook

    Autism-affected users, such as autism patients, caregivers, parents, family members, and researchers, currently seek informational support and social support from communities on social media. To reveal the information needs of autism-affected users, this study centers on users’ interactions and information sharing within autism communities on social media. It aims to understand how autism-affected users utilize support groups on Facebook. A systematic method was proposed to aid the data analysis, including social network analysis, topic modeling, sentiment analysis, and inferential analysis. The social network analysis method was adopted to reveal the interaction patterns appearing in the groups, and the topic modeling method was employed to uncover the discussion themes that users were concerned with in their daily lives. The sentiment analysis method helped analyze the emotional characteristics of the content that users expressed in the groups. The inferential analysis method was applied to compare the similarities and differences among different autism support groups found on Facebook. This study collected user-generated content from five sampled support groups (an awareness group, a treatment group, a parents group, a research group, and a local support group) on Facebook. Findings show that the discussion topics varied across groups. Influential users in each Facebook support group were identified through analysis of the interaction network. The results indicated that the influential users not only attracted more attention from other group members but also led the discussion topics in the group. In addition, the analysis showed that autism support groups on Facebook offered a supportive emotional atmosphere for group members. The findings of this study revealed the characteristics of user interactions and information exchanges in autism support groups on social media.
Theoretically, the findings demonstrated the significance of social media for autism-affected users. The unique implication of this study is to identify support groups on Facebook as a source of informational, social, and emotional support for autism-affected users. The methodology applied in this study presented a systematic approach to evaluating information exchange in health-related support groups on social media. Further, it investigated the potential role of technology in the social lives of autism-affected users. The outcomes of this study can contribute to improving online intervention programs by highlighting effective communication approaches.
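A hedged sketch of the influence analysis: one simple way to identify influential members in an interaction network is to rank users by replies received, a degree-centrality proxy. The edge list and user names below are invented; the study's actual network analysis is richer than this.

```python
from collections import Counter

# Toy interaction edges (commenter, original_poster); all names are illustrative.
interactions = [
    ("amy", "lee"), ("bob", "lee"), ("cal", "lee"),
    ("amy", "bob"), ("lee", "amy"),
]

def influential_users(edges, top=2):
    """Rank users by replies received — in-degree as a simple proxy for influence."""
    received = Counter(target for _, target in edges)
    return [user for user, _ in received.most_common(top)]

print(influential_users(interactions))  # 'lee' receives the most replies and ranks first
```

    In practice one would also weight edges by interaction frequency and combine centrality with the topic-modeling output to see which themes influential users drive.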

    Digital Forensics AI: on Practicality, Optimality, and Interpretability of Digital Evidence Mining Techniques

    Digital forensics as a field has progressed alongside technological advancements over the years, just as digital devices have gotten more robust and sophisticated. However, criminals and attackers have devised means of exploiting the vulnerabilities or sophistication of these devices to carry out malicious activities in unprecedented ways. Their belief is that electronic crimes can be committed without identities being revealed or trails being established. Several applications of artificial intelligence (AI) have demonstrated interesting and promising solutions to seemingly intractable societal challenges. This thesis aims to advance the concept of applying AI techniques in digital forensic investigation. Our approach involves experimenting with a complex case scenario in which suspects corresponded by e-mail and suspiciously deleted certain communications, presumably to conceal evidence. The purpose is to demonstrate the efficacy of Artificial Neural Networks (ANN) in learning and detecting communication patterns over time, and then predicting the possibility of missing communication(s) along with potential topics of discussion. To do this, we developed a novel approach and included other existing models. The accuracy of our results is evaluated, and their performance on previously unseen data is measured. Second, we proposed conceptualizing the term “Digital Forensics AI” (DFAI) to formalize the application of AI in digital forensics. The objective is to highlight the instruments that facilitate the best evidential outcomes and presentation mechanisms that are adaptable to the probabilistic output of AI models. Finally, we enhanced our notion in support of the application of AI in digital forensics by recommending methodologies and approaches for bridging trust gaps through the development of interpretable models that facilitate the admissibility of digital evidence in legal proceedings.

    Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

    The size of the existing academic literature corpus and the incredible rate of new publications offer a great need and opportunity to harness computational approaches to data and knowledge extraction across all research fields. Elements of this challenge can be met by developments in automated retrieval of electronic documents, document classification, and knowledge extraction. In this thesis, I detail studies of these processes in three related chapters. Although the focus of each chapter is distinct, they contribute to my aim of developing a generalisable pipeline for clinical applications of Natural Language Processing in the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source system written in Python to generate corpora of biomedical text from the published literature. Cadmus comprises three main steps: search query and metadata collection, document retrieval, and parsing of the retrieved text. I present an example of full-text retrieval for a corpus of over two hundred thousand articles using a gene-based search query, with quality-control metrics for the retrieval process and a high-level illustration of the utility of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate was 85.2% with institutional subscription access and 54.4% without. Chapter two details the development of a custom-built Naïve Bayes supervised machine learning document classifier. This binary classifier is based on calculating the relative enrichment of biomedical terms between two classes of documents in a training set. The classifier is trained and tested on a manually classified set of over 8,000 abstract and full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision of 76%, F1 score of 0.82 and accuracy of 90%.
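The term-enrichment idea behind the classifier can be sketched as follows. This is a simplified stand-in for the thesis's actual classifier; the example documents, tokenisation, and add-one smoothing are assumptions for illustration.

```python
from collections import Counter
import math

def train_enrichment(docs_pos, docs_neg):
    """Per-term log-enrichment between two document classes, with add-one smoothing."""
    pos = Counter(w for d in docs_pos for w in d.lower().split())
    neg = Counter(w for d in docs_neg for w in d.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {
        w: math.log((pos[w] + 1) / (n_pos + len(vocab)))
           - math.log((neg[w] + 1) / (n_neg + len(vocab)))
        for w in vocab
    }

def classify(doc, scores):
    # Positive total enrichment -> predicted to contain phenotype descriptions.
    return sum(scores.get(w, 0.0) for w in doc.lower().split()) > 0

pheno = ["patient presented with microcephaly and seizures",
         "proband showed short stature and developmental delay"]
other = ["the protein binds dna in vitro",
         "expression was measured by western blot"]
scores = train_enrichment(pheno, other)
print(classify("child with seizures and developmental delay", scores))  # → True
```

    Summing per-term log-ratios in this way is equivalent to a Naïve Bayes decision with equal class priors, which is why term enrichment makes a natural basis for such a classifier.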
Chapter three illustrates the clinical applications of automated retrieval, processing, and classification by considering the published literature on paediatric COVID-19. Case reports and similar articles were classified into “severe” and “non-severe” classes, and term enrichment was evaluated to find biomarkers associated with, or predictive of, severe paediatric COVID-19. Time series analysis was employed to illustrate emerging disease entities such as the Multisystem Inflammatory Syndrome in Children (MIS-C) and to consider unrecognised trends through literature-based discovery.

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.
