Using online linear classifiers to filter spam Emails
The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation complexity of these two classifiers is very low, and they are easily updated adaptively. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
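As a rough illustration of why these two classifiers are cheap to train and update, the sketch below implements the textbook mistake-driven updates: additive for the Perceptron, multiplicative for a basic Winnow variant. It is not the paper's implementation. The binary feature encoding, the threshold `n_features / 2`, the promotion factor `alpha = 2.0` and the toy data are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# One example: (binary feature vector, label) with label 1 = spam, 0 = legitimate.
Example = Tuple[Sequence[int], int]


def perceptron_predict(w: List[float], b: float, x: Sequence[int]) -> int:
    """Predict spam (1) when the weighted sum plus bias is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def train_perceptron(data: Sequence[Example], n_features: int,
                     epochs: int = 10, lr: float = 1.0) -> Tuple[List[float], float]:
    """Additive update: on a mistake, shift the weights toward the true label."""
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, y in data:
            if perceptron_predict(w, b, x) != y:
                step = lr if y == 1 else -lr
                for i, xi in enumerate(x):
                    w[i] += step * xi
                b += step
    return w, b


def winnow_predict(w: List[float], theta: float, x: Sequence[int]) -> int:
    """Predict spam (1) when the weighted sum exceeds the fixed threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0


def train_winnow(data: Sequence[Example], n_features: int,
                 epochs: int = 10, alpha: float = 2.0) -> Tuple[List[float], float]:
    """Multiplicative update: promote or demote only the active features."""
    w = [1.0] * n_features
    theta = n_features / 2.0  # assumed threshold for this sketch
    for _ in range(epochs):
        for x, y in data:
            if winnow_predict(w, theta, x) != y:
                factor = alpha if y == 1 else 1.0 / alpha
                for i, xi in enumerate(x):
                    if xi:
                        w[i] *= factor
    return w, theta


# Hypothetical toy data: feature 0 marks spam, features 1 and 2 are neutral words.
TOY_DATA: List[Example] = [
    ([1, 0, 0], 1),
    ([1, 1, 0], 1),
    ([0, 1, 0], 0),
    ([0, 0, 1], 0),
    ([0, 1, 1], 0),
]

if __name__ == "__main__":
    w, b = train_perceptron(TOY_DATA, n_features=3)
    print([perceptron_predict(w, b, x) for x, _ in TOY_DATA])
    ww, theta = train_winnow(TOY_DATA, n_features=3)
    print([winnow_predict(ww, theta, x) for x, _ in TOY_DATA])
```

Both learners touch the weights only when they misclassify a message, which is what makes per-message adaptive updating inexpensive.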
Students' computational thinking skills through an Internet of Things technology-based learning module
students' computational thinking skills, towards being more creative and critical,
through the use of the Internet of Things Technology-Based Learning Module
(MP-IoT) developed by the researcher. MP-IoT was developed following the
ADDIE model and uses Arduino technology in five hands-on learning
activities. This quantitative, quasi-experimental study was conducted with 52
Form 4 students from two schools in the districts of Batu Pahat, Johor and
Kuala Kangsar, Perak. The data were analysed descriptively and inferentially.
A set of pre- and post-achievement tests was developed as the instrument.
Item analysis using the Difficulty Index (IK) and the Discrimination Index,
together with interpretation of Cronbach's alpha scores, was used to ensure
that the achievement test questions were suitable. During development of the
MP-IoT module, six Computer Science teachers were selected as experts to
assess the suitability of the module's format, content and usability. A
five-point Likert scale was used in this study. Overall, the findings from a
paired-samples t-test show a significant difference in achievement between
the control group, which was taught by conventional methods, and the
treatment group, which was exposed to the MP-IoT module, with a p-value of
.000, which is less than .05 (p < 0.05). In addition, students' level of
computational thinking skills also increased after exposure to the MP-IoT
module.
SMS Spam Filtering: Methods and Data
Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However, it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.
Content-based hybrid SMS spam filtering system
The world has changed. Everybody is connected.
Almost everyone has a mobile phone. Millions of
SMSs travel around the world over mobile networks
every second. But about 113 of them are spam. SMS spam has
become a crucial problem with the increase of mobile
penetration around the world. SMS spam filtering is a
relatively new task which inherits many issues and solutions
from email spam filtering. However, it poses its own specific
challenges. Server-based and mobile-application-based
approaches use content-based and content-less
mechanisms to filter SMS spam. Although such approaches
exist, there is still no hybrid solution
that does general filtering at the server level while
user-specific filtering is done at the mobile level. This paper
presents a hybrid solution for SMS spam filtering that benefits
both feature phone users and smartphone users.
Feature phone users get the general filter, while
smartphone users can configure and filter SMSs based on
their own preferences rather than being limited to a general
filter. The server-level solution consists of a neural network along
with a Bayesian filter, and the device-level filter consists of a
Bayesian filter. We have evaluated the accuracy of the neural
network using a large spam dataset along with some randomly
chosen personal SMSs.
Detecting word substitutions in text
Searching for words on a watchlist is one way in which large-scale surveillance of communication can be done, for example in intelligence and counterterrorism settings. One obvious defense is to replace words that might attract attention to a message with other, more innocuous, words. For example, the sentence "the attack will be tomorrow" might be altered to "the complex will be tomorrow", since 'complex' is a word whose frequency is close to that of 'attack'. Such substitutions are readily detectable by humans since they do not make sense. We address the problem of detecting such substitutions automatically, by looking for discrepancies between words and their contexts, and using only syntactic information. We define a set of measures, each of which is quite weak, but which together produce per-sentence detection rates around 90% with false positive rates around 10%. Rules for combining per-sentence detection into per-message detection can reduce the false positive and false negative rates for messages to practical levels. We test the approach using sentences from the Enron email and Brown corpora, representing informal and formal text respectively.
Development of green skills elements in teaching and learning (PdP) for vocational college lecturers
Green skills are skills grounded in knowledge and competence that are an asset to
every individual before entering any field of employment, in support of sustainable
development. This qualitative study uses an exploratory research design and aims to
develop green skills elements in teaching and learning (PdP) for vocational college
lecturers. In the first phase, the development phase, the researcher interviewed
three (3) lecturers with expertise in PdP and Construction Technology. After the
interview protocol was carried out, the information was organised into themes
through thematic analysis and then analysed through matrix analysis together with a
systematic literature review to identify similarities and differences in the
information. In the second phase, the validation phase, five (5) experts validated
the format and content of the items produced. This phase involved twelve (12)
experts, lecturers with ten (10) or more years of experience, as the main
respondents. Using the Fuzzy Delphi technique as the data analysis procedure, the
study data were analysed to obtain the average values m1 (minimum value), m2 (most
plausible value) and m3 (maximum value), followed by the threshold value d, the 75%
expert group consensus, and the fuzzy evaluation. In this study, only one (1) item,
"using used paper for any assignment", from the green skills element in assessment
and assignments, was rejected because its d value of 0.243 exceeded the threshold
(d ≤ 0.2) and the percentage of agreement did not exceed 75%; however, the construct
as a whole for that element was accepted, with an overall percentage of 97.92% and
d = 0.126. All other items across the elements were accepted by the expert group for
continuing the study. In conclusion, the green skills elements in PdP should be
extended as a guideline in PdP for educators in the future.
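The Fuzzy Delphi acceptance rule mentioned in this abstract (threshold d of at most 0.2 and 75% expert consensus) can be sketched as follows. This is a minimal illustration of the commonly used procedure, not the study's own analysis. The five-point Likert-to-triangular-fuzzy-number mapping, the function name `evaluate_item` and the example ratings are assumptions for this sketch; the abstract does not state which scale or mapping was used.

```python
from math import sqrt
from typing import Dict, List, Tuple

Fuzzy = Tuple[float, float, float]  # (m1 minimum, m2 most plausible, m3 maximum)

# Assumed mapping from a five-point Likert rating to a triangular fuzzy number.
LIKERT_TO_FUZZY: Dict[int, Fuzzy] = {
    1: (0.0, 0.0, 0.2),
    2: (0.0, 0.2, 0.4),
    3: (0.2, 0.4, 0.6),
    4: (0.4, 0.6, 0.8),
    5: (0.6, 0.8, 1.0),
}


def evaluate_item(ratings: List[int], d_max: float = 0.2,
                  consensus_min: float = 75.0) -> Dict[str, object]:
    """Apply the threshold-d and consensus checks to one questionnaire item."""
    fuzzy = [LIKERT_TO_FUZZY[r] for r in ratings]
    n = len(fuzzy)
    # Average fuzzy number (m1, m2, m3) across all experts.
    avg = tuple(sum(f[k] for f in fuzzy) / n for k in range(3))
    # Distance of each expert's fuzzy number from the average.
    distances = [
        sqrt(sum((avg[k] - f[k]) ** 2 for k in range(3)) / 3.0) for f in fuzzy
    ]
    d_item = sum(distances) / n
    # Share of experts whose individual distance is within the threshold.
    consensus = 100.0 * sum(d <= d_max for d in distances) / n
    return {
        "average": avg,
        "d": d_item,
        "consensus_percent": consensus,
        "fuzzy_score": sum(avg) / 3.0,  # simple defuzzified score
        "accepted": d_item <= d_max and consensus >= consensus_min,
    }


if __name__ == "__main__":
    # Hypothetical ratings from twelve experts for a single item.
    print(evaluate_item([5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5]))
```

Under this rule an item with d = 0.243 fails the first check regardless of its consensus percentage, which matches the rejection reported for the single item in the abstract.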
Spam Filter Improvement Through Measurement
This work supports the thesis that sound quantitative evaluation for
spam filters leads to substantial improvement in the classification
of email. To this end, new laboratory testing methods and datasets
are introduced, and evidence is presented that their adoption at the Text
REtrieval Conference (TREC) and elsewhere has led to an improvement in state of the art
spam filtering. While many of these improvements have been discovered
by others, the best-performing method known at this time -- spam filter
fusion -- was demonstrated by the author.
This work describes four principal dimensions of spam filter evaluation
methodology and spam filter improvement. An initial study investigates
the application of twelve open-source filter configurations in a laboratory
environment, using a stream of 50,000 messages captured from a single
recipient over eight months. The study measures the impact of user
feedback and on-line learning on filter performance using methodology
and measures which were released to the research community as the
TREC Spam Filter Evaluation Toolkit.
The toolkit was used as the basis of the TREC Spam Track, which the
author co-founded with Cormack. The Spam Track, in addition to evaluating
a new application (email spam), addressed the issue of testing systems
on both private and public data. While streams of private messages
are most realistic, they are not easy to come by and cannot be shared
with the research community as archival benchmarks. Using the toolkit,
participant filters were evaluated on both, and the differences found
not to substantially confound evaluation; as a result, public corpora
were validated as research tools. Over the course of TREC and similar
evaluation efforts, a dozen or more archival benchmarks --
some private and some public -- have become available.
The toolkit and methodology have spawned improvements in the state
of the art every year since its deployment in 2005. In 2005, 2006,
and 2007, the spam track yielded new best-performing systems based
on sequential compression models, orthogonal sparse bigram features,
logistic regression and support vector machines. Using the TREC participant
filters, we develop and demonstrate methods for on-line filter fusion
that outperform all other reported on-line personal spam filters
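To make the idea of filter fusion concrete, the sketch below combines the spam probabilities reported by several filters by averaging their log-odds. This is one simple fusion rule chosen for illustration. The abstract does not say which fusion rule performed best, and the function names, the clipping constant `eps` and the example scores are assumptions.

```python
from math import exp, log
from typing import Sequence


def log_odds(p: float, eps: float = 1e-6) -> float:
    """Convert a probability to log-odds, clipping to avoid division by zero."""
    p = min(max(p, eps), 1.0 - eps)
    return log(p / (1.0 - p))


def fuse(spam_probs: Sequence[float]) -> float:
    """Average the filters' log-odds and map the result back to a probability."""
    mean = sum(log_odds(p) for p in spam_probs) / len(spam_probs)
    return 1.0 / (1.0 + exp(-mean))


if __name__ == "__main__":
    # Hypothetical scores from three filters for one message.
    print(fuse([0.99, 0.60, 0.70]))
```

Averaging in log-odds space lets a confident filter outweigh an uncertain one, while two equally confident filters that disagree cancel out to 0.5.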
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys