ANALYZING TEMPORAL PATTERNS IN PHISHING EMAIL TOPICS
In 2020, the Federal Bureau of Investigation (FBI) found phishing to be the most common cybercrime, with a record number of complaints from Americans reporting losses exceeding $4.1 billion. Various phishing prevention methods exist; however, these methods are usually reactive, as they activate only after a phishing campaign has been launched. Priming people ahead of time with knowledge of which phishing topics are more likely to occur could be an effective proactive prevention strategy. It has been noted that the volume of phishing emails tends to increase around key calendar dates and during times of uncertainty. This thesis aimed to create a classifier to predict which phishing topics have an increased likelihood of occurring in relation to an external event. After around 1.2 million phishing emails were distilled until only meaningful words remained, a latent Dirichlet allocation (LDA) topic model uncovered 90 latent phishing topics. On average, human evaluators agreed with the composition of a topic 74% of the time in one of the phishing topic evaluation tasks, indicating that the topics produced by the LDA model accord with human judgment. Each topic was turned into a time series by creating a frequency count over the dataset's two-year timespan. This time series was then converted into an intensity count to highlight the days of increased phishing activity. All phishing topics were analyzed and reviewed for influencing events. After the review, ten topics were identified as having external events that could have influenced their respective intensities. After an intervention analysis was performed, none of the selected topics were found to correlate with the identified external event. The analysis stopped here, and no predictive classifiers were pursued. With this dataset, temporal patterns coupled with external events were not able to predict the likelihood of a phishing attack.
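The step from topic assignments to a per-topic time series can be illustrated with a minimal sketch. The abstract does not define how the intensity count was computed, so the standardized score below is only an assumption, and the function names, topic labels, and dates are hypothetical toy values rather than anything taken from the thesis.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import mean, pstdev


def daily_counts(events, topic, start, end):
    """Count emails assigned to `topic` on each day in [start, end], inclusive."""
    counts = Counter(day for day, t in events if t == topic)
    n_days = (end - start).days + 1
    return [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]


def intensity(series):
    """Standardize the counts so unusually busy days stand out.

    Assumption: a z-score stands in for the thesis's undefined intensity count.
    """
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


# Hypothetical (date, topic) pairs, as if produced by a topic model.
events = [
    (date(2020, 3, 1), "tax"),
    (date(2020, 3, 1), "delivery"),
    (date(2020, 3, 2), "tax"),
    (date(2020, 3, 2), "tax"),
    (date(2020, 3, 2), "tax"),
    (date(2020, 3, 4), "tax"),
]

series = daily_counts(events, "tax", date(2020, 3, 1), date(2020, 3, 4))
scores = intensity(series)
busiest_day_index = scores.index(max(scores))
```

Days with no emails for the topic are kept as explicit zeros, so the series has one entry per calendar day and can be aligned with the dates of external events.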
Mining Frequency of Drug Side Effects Over a Large Twitter Dataset Using Apache Spark
Despite clinical trials by pharmaceutical companies and current FDA reporting systems, some drug side effects still go undetected. One way to obtain a larger sample of reports is to mine online social media. With its widespread use, social media such as Twitter has given rise to massive amounts of data, which can serve as reports of drug side effects. To process these large datasets, Apache Spark has become popular for fast, distributed batch processing. In this work, we improve on previous pipelines for sentiment analysis-based mining, processing, and extraction of tweets that describe drug-caused side effects. We also add a new ensemble classifier that uses a combination of sentiment analysis features to increase the accuracy of identifying drug-caused side effects. In addition, a frequency count for the side effects is provided. Furthermore, we implement the same pipeline in Apache Spark, which speeds up the processing of tweets by a factor of 2.5 and supports the processing of large tweet datasets. As the frequency count of drug side effects opens the door to further analysis, we present a preliminary study on this issue, including the side effects of using two drugs simultaneously and the potential danger of using less common combinations of drugs. We believe the pipeline design and the results presented in this work will have significant implications for the study of drug side effects and for big data analysis in general.
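The frequency count described above, including the count for two drugs used together, can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout, drug names, and side-effect names are hypothetical, and the classifier that decides whether a tweet reports a drug-caused side effect is taken as given. In Spark, the same counting would typically be expressed as a `flatMap` that emits keys followed by a `reduceByKey` that sums them.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: (drugs mentioned, side effects mentioned) per tweet,
# assumed to have already passed the side-effect classifier.
reports = [
    ({"drug_a"}, {"nausea"}),
    ({"drug_a", "drug_b"}, {"nausea", "dizziness"}),
    ({"drug_b"}, {"headache"}),
    ({"drug_a", "drug_b"}, {"dizziness"}),
]


def side_effect_counts(reports):
    """Count how often each (drug, side effect) pair is reported."""
    counts = Counter()
    for drugs, effects in reports:
        for drug in drugs:
            for effect in effects:
                counts[(drug, effect)] += 1
    return counts


def pair_side_effect_counts(reports):
    """Count side effects reported when two drugs appear in the same tweet."""
    counts = Counter()
    for drugs, effects in reports:
        # Sort so that (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(drugs), 2):
            for effect in effects:
                counts[(pair, effect)] += 1
    return counts


single = side_effect_counts(reports)
paired = pair_side_effect_counts(reports)
```

Sorting the drug names before forming pairs keeps the key for a combination stable regardless of the order in which the drugs were mentioned.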
Calibration of Natural Language Understanding Models with Venn-ABERS Predictors
Transformers, currently the state of the art in natural language
understanding (NLU) tasks, are prone to generating uncalibrated predictions or
extreme probabilities, which makes it difficult to base decisions on their
output. In this paper we propose to build several
inductive Venn-ABERS predictors (IVAP), which are guaranteed to be well
calibrated under minimal assumptions, based on a selection of pre-trained
transformers. We test their performance over a set of diverse NLU tasks and
show that they are capable of producing well-calibrated probabilistic
predictions that are uniformly spread over the [0,1] interval, all while
retaining the original model's predictive accuracy.
Comment: Accepted at the 11th Symposium on Conformal and Probabilistic
Prediction with Applications - COPA 202
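For readers unfamiliar with inductive Venn-ABERS predictors, a minimal pure-Python sketch of the usual binary-classification construction follows. It is not the paper's implementation: the calibration scores are hypothetical toy values standing in for transformer outputs, and the final merge of the two probabilities into a single number is one commonly used choice that the abstract does not confirm. The idea is to fit isotonic regression on the calibration set twice, once with the test point labeled 0 and once labeled 1, and to read off the fitted value at the test score each time.

```python
def isotonic_fit(points):
    """Pool-adjacent-violators fit; returns {score: fitted value}."""
    grouped = {}  # tied scores are pooled: score -> (label sum, weight)
    for s, y in points:
        total, w = grouped.get(s, (0.0, 0))
        grouped[s] = (total + y, w + 1)
    blocks = []  # each block holds [label sum, weight, scores in block]
    for s in sorted(grouped):
        total, w = grouped[s]
        blocks.append([total, w, [s]])
        # Merge backwards while the means would violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, w, scores = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += w
            blocks[-1][2].extend(scores)
    return {s: total / w for total, w, scores in blocks for s in scores}


def venn_abers(calibration, test_score):
    """Return (p0, p1), the lower and upper probability of label 1."""
    p0 = isotonic_fit(calibration + [(test_score, 0)])[test_score]
    p1 = isotonic_fit(calibration + [(test_score, 1)])[test_score]
    return p0, p1


def merge(p0, p1):
    """Combine the pair into one probability (assumed merge rule)."""
    return p1 / (1.0 - p0 + p1)


# Hypothetical calibration set of (classifier score, true label) pairs.
calibration = [(0.1, 0), (0.2, 0), (0.35, 1), (0.4, 0),
               (0.6, 1), (0.8, 1), (0.9, 1)]

p0, p1 = venn_abers(calibration, 0.7)
p = merge(p0, p1)
```

The width of the interval between `p0` and `p1` gives a rough indication of how much the calibration data constrain the prediction at that score. Refitting for every test point, as done here, is simple but slow; precomputing the fits is the usual optimization for larger calibration sets.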