41 research outputs found
$100,000 prize jackpot. Call now!: Identifying the pertinent features of SMS spam
Mobile SMS spam is on the rise and is a prevalent problem. While recent work has shown that simple machine learning techniques can distinguish between ham and spam with high accuracy, this paper explores the individual contributions of various textual features in the classification process. Our results reveal the surprising finding that simple is better: using the largest spam corpus of which we are aware, we find that using simple textual features is sufficient to provide accuracy that is nearly identical to that achieved by the best known techniques, while achieving a twofold speedup.
A Survey of Email Spam Filtering Methods
E-mail is one of the most secure media for online communication and for transferring data or messages over the web. With its ever-growing popularity, the volume of unsolicited data has also increased rapidly. To filter this data, different approaches exist that automatically detect and remove the unwanted messages. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes and so on. This paper surveys existing email spam filtering systems that use Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation and comparison of different email spam filtering systems. Keywords: e-mail spam, spam filtering methods, machine learning technique, classification, SVM, AN
Handling Continuous Features with Feature Discretization Based on Expectation Maximization Clustering for Email Spam Classification Using the ID3 Algorithm (Penanganan Fitur Kontinyu dengan Feature Discretization Berbasis Expectation Maximization Clustering untuk Klasifikasi Spam Email Menggunakan Algoritma ID3)
Use of the internet is growing very rapidly, and one example is the sending of electronic mail, or email. Email spam has recently been widely discussed. Email spam is unsolicited and unwanted email from strangers that is sent in bulk to mailing lists, usually with a commercial purpose. Spam reduces employee productivity because employees must spend time deleting spam messages. To address this problem, an email filter is needed that detects spam so that it does not appear in the inbox. Many researchers have tried to build email filters using a variety of methods, but none has yet achieved maximal accuracy. In this study, classification is performed with the Iterative Dichotomiser 3 (ID3) decision tree algorithm, because ID3 is the most widely used decision tree algorithm and is known for its high classification speed, strong learning ability and easy construction. However, ID3 cannot handle continuous features, so classification cannot be carried out on them directly. In this study, feature discretization based on Expectation Maximization (EM) clustering is used to convert continuous features into discrete features so that email spam classification can be performed. Experimental results show that ID3 can classify email spam with an accuracy of 91.96% when 90% of the data is used for training. This is an improvement of 28.05% over ID3 classification using binning
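The discretization step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-dimensional Gaussian mixture fitted by EM, and the function name `em_discretize`, the number of clusters `k`, and the initialization scheme are all hypothetical choices made for illustration. Each continuous value is mapped to the index of its most likely cluster, which a decision tree such as ID3 could then treat as a discrete attribute value.

```python
import math


def em_discretize(values, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with EM; return one cluster index per value.

    Illustrative only: k, iters and the initialization are assumptions,
    not settings taken from the paper.
    """
    lo, hi = min(values), max(values)
    # Spread the initial means evenly across the observed range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    spread = (hi - lo) / k or 1.0
    variances = [spread ** 2] * k
    weights = [1.0 / k] * k

    def density(x, m, v):
        return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

    def responsibilities(x):
        p = [w * density(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(p) or 1e-300  # guard against underflow to zero
        return [pi / total for pi in p]

    for _ in range(iters):
        # E-step: soft assignment of every value to every cluster.
        resp = [responsibilities(x) for x in values]
        # M-step: re-estimate mean, variance and weight of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
            weights[j] = nj / len(values)

    # Hard assignment: the most responsible cluster becomes the discrete value.
    labels = []
    for x in values:
        r = responsibilities(x)
        labels.append(r.index(max(r)))
    return labels
```

Calling `em_discretize(column, k=3)` on one continuous feature column would give a list of cluster indices of the same length, which can replace the raw values before the tree is built.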
Active Multi-Field Learning for Spam Filtering
Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem and proposes a universal approach to fighting various kinds of spam messages. The proposed active multi-field learning approach is based on two observations: 1) obtaining a label for a real-world spam filter is costly, which suggests an active learning idea; and 2) different messages often have a similar multi-field text structure, which suggests a multi-field learning idea. The multi-field learning framework combines the results predicted by the field classifiers using a novel compound weight, and each field classifier calculates the arithmetic average of multiple conditional probabilities predicted from feature strings according to a string-frequency index data structure. By comparing the current variance of the field classification results with the historical variance, the active learner evaluates the classification confidence and regards the more uncertain message as the more informative sample for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements in both email spam filtering and short text spam filtering. Our active multi-field learning performance, on the standard (1-ROCA) % measurement, even exceeds the full feedback performance of some advanced individual classifying algorithm
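The variance-based label request described in this abstract can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's algorithm: the class name `VarianceActiveLearner`, the rule "request a label when the current variance exceeds the mean of past variances", and the unweighted treatment of fields are all hypothetical. The paper's compound weight is not reproduced here.

```python
from statistics import mean, pvariance


class VarianceActiveLearner:
    """Toy active learner that compares field disagreement with its history."""

    def __init__(self):
        self.history = []  # variances observed for earlier messages

    @staticmethod
    def field_score(probabilities):
        # Arithmetic average of the per-feature spam probabilities of one field.
        return mean(probabilities)

    def should_request_label(self, fields):
        """fields maps a field name to a non-empty list of spam probabilities."""
        scores = [self.field_score(p) for p in fields.values()]
        current = pvariance(scores)  # disagreement between field classifiers
        # Assumed rule: with no history yet, any disagreement counts as uncertain.
        historical = mean(self.history) if self.history else 0.0
        self.history.append(current)
        return current > historical
```

A message whose fields (for example subject and body) disagree more strongly than earlier messages did is treated as uncertain, so a label would be requested for it.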
Automating the Classification of CEO Turnover Causes from News for Predicting Stock Volatility after CEO Turnover (대표이사 변경 이후 주가 변동성 예측을 위한 뉴스의 대표이사 변경 사유 분류 자동화)
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, 2023. 2. 조성준.
A CEO turnover is an event that significantly influences a company. The role of the CEO is to manage the firm's overall operations, so a change in CEO can affect not only the firm's strategic direction but also consumer perception, investment decisions and eventually the share price. Shareholders and investors therefore keep an eye on CEO changes, especially on the reason why the CEO has changed. The causes of a CEO turnover can be inferred from detailed information about the firm, such as firm performance and the stock price prior to the event. However, financial news related to a CEO turnover describes the motivation for the turnover more specifically. In this paper, machine learning techniques such as the TF-IDF method and a fine-tuned DistilBERT language model are used to classify turnover causes from financial news related to CEO turnover. The main contribution of this paper is to automate the manual labeling process, helping shareholders and investors capture investment opportunities in a timely manner. A contextualized embedding of news articles obtained from the language model is then used as an additional feature for predicting the post-event stock volatility of a firm.
1. Introduction
1.1 Background
1.2 Problem Description
1.3 Research Motivation and Contribution
1.4 Organization of the Thesis
2. Literature Review
2.1 CEO Turnover and Volatility
2.2 Machine Learning for Text Classification
2.3 Pretrained Language Model for Text Classification
3. Proposed Method
3.1 Overall Architecture
3.2 Machine Learning Text Classification
3.3 Fine-Tuning DistilBERT for Text Classification
3.4 Regression Model for Stock Volatility Prediction
4. Experiments and Results
4.1 Data
4.1.1 Label Engineering & Imbalance Dataset
4.2 Evaluation
4.3 Results
5. Conclusion
Bibliography
Abstract in Korean
Acknowledgments
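The TF-IDF method named in the thesis abstract above can be illustrated with a minimal sketch. It assumes the common variant of term frequency multiplied by log(N / document frequency), with whitespace tokenization. The thesis does not state which variant or tokenizer it uses, so the function `tfidf` and the sample headlines in the usage note are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This variant is an assumption; other TF-IDF formulations exist.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors
```

For invented headlines such as "ceo resigns due to health" and "ceo dismissed after poor performance", a term that occurs in every headline (here "ceo") receives weight zero, while cause-specific terms such as "health" or "dismissed" receive positive weights that a downstream classifier could use.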
That ain't you: Blocking spearphishing through behavioral modelling
One of the ways in which attackers steal sensitive information from corporations is by sending spearphishing emails. A typical spearphishing email appears to be sent by one of the victim's coworkers or business partners, but has instead been crafted by the attacker. A particularly insidious type of spearphishing emails are the ones that do not only claim to be written by a certain person, but are also sent by that person's email account, which has been compromised. Spearphishing emails are very dangerous for companies, because they can be the starting point to a more sophisticated attack or cause intellectual property theft, and lead to high financial losses. Currently, there are no effective systems to protect users against such threats. Existing systems leverage adaptations of anti-spam techniques. However, these techniques are often inadequate to detect spearphishing attacks. The reason is that spearphishing has very different characteristics from spam and even traditional phishing. To fight the spearphishing threat, we propose a change of focus in the techniques that we use for detecting malicious emails: instead of looking for features that are indicative of attack emails, we look for emails that claim to have been written by a certain person within a company, but were actually authored by an attacker. We do this by modelling the email-sending behavior of users over time, and comparing any subsequent email sent by their accounts against this model. Our approach can block advanced email attacks that traditional protection systems are unable to detect, and is an important step towards detecting advanced spearphishing attacks