41 research outputs found

    100,000 prize jackpot. Call now!: Identifying the pertinent features of SMS spam

    ABSTRACT Mobile SMS spam is on the rise and is a prevalent problem. While recent work has shown that simple machine learning techniques can distinguish between ham and spam with high accuracy, this paper explores the individual contributions of various textual features in the classification process. Our results reveal the surprising finding that simple is better: using the largest spam corpus of which we are aware, we find that using simple textual features is sufficient to provide accuracy that is nearly identical to that achieved by the best known techniques, while achieving a twofold speedup

    A Survey of Email Spam Filtering Methods

    E-mail is one of the most secure media for online communication and for transferring data or messages over the web. As its popularity has grown, the volume of unsolicited messages has also increased rapidly. To filter such messages, different approaches exist that automatically detect and remove them. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes and so on. This paper surveys existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation and comparison of different email spam filtering systems. Keywords: e-mail spam, spam filtering methods, machine learning technique, classification, SVM, AN

    Handling Continuous Features with Expectation Maximization Clustering-Based Feature Discretization for Email Spam Classification Using the ID3 Algorithm

    Use of the internet is currently growing very rapidly, one example being the sending of electronic mail, or email. Spam email has recently been widely discussed. Spam email is unsolicited and unwanted email from strangers that is sent in bulk to mailing lists, usually with a commercial purpose. Spam reduces employee productivity because employees must spend time deleting spam messages. To address this problem, an email filter is needed that detects spam so that it does not appear in the inbox. Many researchers have tried to build email filters with a variety of methods, but none has achieved maximal accuracy. In this study, classification is performed with the Decision Tree Iterative Dichotomiser 3 (ID3) algorithm, because ID3 is the most widely used decision tree algorithm and is known for its high classification speed, strong learning ability and easy construction. However, ID3 cannot handle continuous features, so classification cannot be carried out on them directly. In this study, feature discretization based on Expectation Maximization (EM) clustering is used to convert continuous features into discrete features so that spam email classification can be performed. The experimental results show that ID3 can classify spam email with an accuracy of 91.96% when 90% of the data is used for training. This is an improvement of 28.05% over ID3 classification using binning
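    The discretization step described in this abstract can be illustrated with a minimal sketch. It assumes a one-dimensional Gaussian mixture fitted by EM, with each value assigned to its most likely component; the function name `em_discretize`, the number of components and the deterministic initialisation are illustrative choices, not details taken from the paper.

```python
import math


def em_discretize(values, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with EM and return one bin index per value."""
    lo, hi = min(values), max(values)
    # Spread the initial means evenly over the data range (deterministic start).
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    variances = [max(((hi - lo) / k) ** 2, 1e-6)] * k
    weights = [1.0 / k] * k

    def responsibilities(x):
        # Weighted Gaussian density of x under each component, normalised to sum to 1.
        dens = [
            w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(dens) or 1e-300
        return [d / total for d in dens]

    for _ in range(iters):
        resp = [responsibilities(x) for x in values]  # E-step
        for j in range(k):  # M-step: re-estimate each component
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj,
                1e-6,  # floor keeps a component from collapsing onto one point
            )
            weights[j] = nj / len(values)

    # Rank components by mean so that bin 0 covers the lowest values.
    order = sorted(range(k), key=lambda j: means[j])
    rank = {j: i for i, j in enumerate(order)}
    bins = []
    for x in values:
        r = responsibilities(x)
        bins.append(rank[max(range(k), key=lambda j: r[j])])
    return bins
```

    The returned bin indices are categorical, so a decision tree learner that only accepts discrete attributes, such as ID3, can split on them directly.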

    Active Multi-Field Learning for Spam Filtering

    Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem and proposes a universal approach to fighting various kinds of spam messages. The proposed active multi-field learning approach is based on two observations: 1) obtaining a label for a real-world spam filter is costly, which suggests an active learning idea; and 2) different messages often have a similar multi-field text structure, which suggests a multi-field learning idea. The multi-field learning framework combines the results predicted by the field classifiers using a novel compound weight, and each field classifier calculates the arithmetic average of multiple conditional probabilities predicted from feature strings according to a string-frequency index data structure. By comparing the current variance of the field classification results with the historical variance, the active learner evaluates the classification confidence and treats the more uncertain message as the more informative sample for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements in both email spam filtering and short text spam filtering. The performance of our active multi-field learning, on the standard (1-ROCA)% measurement, even exceeds the full-feedback performance of some advanced individual classifying algorithm
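    The mechanism in this abstract (per-field averaging of conditional probabilities, then a variance check to decide when to ask for a label) can be sketched roughly as follows. Several parts are assumptions made for illustration: the paper's compound weight is not given here, so the fields are combined with a plain mean; the Laplace smoothing, the dictionary layout of the string-frequency index and the names `field_score` and `classify` are hypothetical.

```python
from statistics import mean, pvariance


def field_score(tokens, index):
    """Average P(spam | string) over the feature strings seen in one field.

    `index` maps a feature string to (spam_count, ham_count). Unseen strings
    are skipped, and a field with no known strings scores a neutral 0.5.
    """
    probs = []
    for tok in tokens:
        if tok in index:
            spam, ham = index[tok]
            # Laplace smoothing is an assumption of this sketch.
            probs.append((spam + 1) / (spam + ham + 2))
    return mean(probs) if probs else 0.5


def classify(message, indexes, history):
    """Return (spam_score, needs_label) for a message split into fields.

    `message` maps field name -> list of feature strings, `indexes` maps
    field name -> string-frequency index, `history` collects past variances.
    """
    scores = [field_score(message.get(f, []), idx) for f, idx in indexes.items()]
    combined = mean(scores)  # uniform weights stand in for the compound weight
    current_var = pvariance(scores)
    # Fields that disagree more than they have historically signal low confidence.
    needs_label = bool(history) and current_var > mean(history)
    history.append(current_var)
    return combined, needs_label
```

    A message whose subject looks like spam while its body looks like ham produces a high variance across fields, so under these assumptions it is flagged as worth labelling.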

    λŒ€ν‘œμ΄μ‚¬ λ³€κ²½ 이후 μ£Όκ°€ 변동성 μ˜ˆμΈ‘μ„ μœ„ν•œ λ‰΄μŠ€μ˜ λŒ€ν‘œμ΄μ‚¬ λ³€κ²½ μ‚¬μœ  λΆ„λ₯˜ μžλ™ν™”

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Industrial Engineering, February 2023. 조성준.
    A CEO turnover is an event that significantly influences a company. The role of the CEO at a firm is to manage overall operations, and thus a change in CEO could affect not only the firm's strategic direction but also consumer perception, investment decisions and eventually the share price. Thus, shareholders and investors keep an eye on changes of CEO, especially on the reason why the CEO has changed. CEO turnover causes can be inferred from detailed information about the firm, such as firm performance and stock price prior to the event. However, financial news related to CEO turnover specifically describes the motivation for the turnover. In this paper, machine learning techniques such as the TF-IDF method and a fine-tuned DistilBERT language model were utilized to classify turnover causes from financial news related to CEO turnover. The main contribution of this paper is to automate the manual labeling process to help shareholders and investors capture investment opportunities in a timely manner. A contextualized embedding of news articles obtained from the language model is then further utilized as an additional feature for predicting the post-event stock volatility of a firm.
    Korean abstract (translated): A change of CEO is one of the events that occur at a company, and it has a large impact on that company. The CEO is responsible for the company's overall management strategy, so a change of CEO affects not only management strategy but also consumer perception and investment strategy, and this is reflected in the company's share price. For this reason a CEO change is an event that investors watch closely, and they pay particular attention to the reason for the change. The reason can be roughly inferred from share price movements and company performance before the event, but news about the CEO change states the reason more directly. The CEO may step down voluntarily, without any special reason, because of age or age-related illness, or may be forced out for a specific reason; in other cases the news also reveals information about the successor. This thesis proposes a model that classifies the reason for a CEO change from news using natural language processing. Its significance lies in automating the existing manual labeling approach. A TF-IDF model, which uses term frequency and inverse document frequency, serves as the benchmark for the reason classification model, and embeddings of the news are extracted while fine-tuning a pretrained Transformer-based language model on the task of classifying the reason for the CEO change. By classifying the reason, the model gives investors a signal that, depending on the reason, stock volatility will increase after the change, helping them adjust their investment strategy quickly. In addition, the context-aware vector embeddings obtained from the language model are used to build a model that predicts the company's stock volatility after the event, in order to test the usefulness of the reason classification model.
    Contents: 1. Introduction; 1.1 Background; 1.2 Problem Description; 1.3 Research Motivation and Contribution; 1.4 Organization of the Thesis; 2. Literature Review; 2.1 CEO Turnover and Volatility; 2.2 Machine Learning for Text Classification; 2.3 Pretrained Language Model for Text Classification; 3. Proposed Method; 3.1 Overall Architecture; 3.2 Machine Learning Text Classification; 3.3 Fine-Tuning DistilBERT for Text Classification; 3.4 Regression Model for Stock Volatility Prediction; 4. Experiments and Results; 4.1 Data; 4.1.1 Label Engineering & Imbalance Dataset; 4.2 Evaluation; 4.3 Results; 5. Conclusion; Bibliography; Korean Abstract; Acknowledgments

    That ain’t you: Blocking spearphishing through behavioral modelling

    One of the ways in which attackers steal sensitive information from corporations is by sending spearphishing emails. A typical spearphishing email appears to be sent by one of the victim’s coworkers or business partners, but has instead been crafted by the attacker. A particularly insidious type of spearphishing emails are the ones that do not only claim to be written by a certain person, but are also sent by that person’s email account, which has been compromised. Spearphishing emails are very dangerous for companies, because they can be the starting point to a more sophisticated attack or cause intellectual property theft, and lead to high financial losses. Currently, there are no effective systems to protect users against such threats. Existing systems leverage adaptations of anti-spam techniques. However, these techniques are often inadequate to detect spearphishing attacks. The reason is that spearphishing has very different characteristics from spam and even traditional phishing. To fight the spearphishing threat, we propose a change of focus in the techniques that we use for detecting malicious emails: instead of looking for features that are indicative of attack emails, we look for emails that claim to have been written by a certain person within a company, but were actually authored by an attacker. We do this by modelling the email-sending behavior of users over time, and comparing any subsequent email sent by their accounts against this model. Our approach can block advanced email attacks that traditional protection systems are unable to detect, and is an important step towards detecting advanced spearphishing attacks