Spam classification for online discussions
Traditionally, spam filtering systems have been built on content-based analysis techniques developed from experience with e-mail spam. Recently, a new kind of Internet platform, social media, has emerged and has also widened the space available to Internet abusers.
In this thesis, we not only evaluated traditional content-based approaches to classifying spam messages but also investigated the possibility of integrating context-based technology with content-based approaches. We built spam classifiers that combine a novelty detection approach with Naïve Bayes, k-Nearest Neighbour and Self-Organizing Map classifiers respectively, and tested each of them on a large amount of experimental data. We also went a step beyond previous research by integrating the Self-Organizing Map with Naïve Bayes to carry out spam classification.
The results of this thesis show that carefully combining context-based approaches with a content-based spam classifier can improve the performance of the content-based classifier in a variety of directions. In addition, the results from the Self-Organizing Map classifier combined with Naïve Bayes show a promising future for data clustering methods in spam filtering.
We therefore believe that this thesis presents a new insight into Natural Language Processing, and that the methods and techniques it proposes give researchers in the spam filtering field a good tool for analyzing context-based spam messages.
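As an illustration of the content-based side of this abstract, the following is a minimal sketch of a multinomial Naïve Bayes spam classifier with Laplace smoothing. It is not the thesis's implementation: the tokenizer, the function names and the toy messages are all assumptions made for the example, and the novelty detection and Self-Organizing Map components are not shown.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()

def train_naive_bayes(messages, labels):
    """Collect per-class word counts, class frequencies and the vocabulary."""
    word_counts = {}                 # label -> Counter of word frequencies
    class_counts = Counter(labels)   # label -> number of training messages
    vocabulary = set()
    for text, label in zip(messages, labels):
        tokens = tokenize(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            count = word_counts[label][token]
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data for illustration only.
messages = [
    "win free money now",
    "free prize click now",
    "meeting notes for the thread",
    "thanks for the helpful reply",
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(messages, labels)
print(classify("free money prize", model))
```

A context-based component, as discussed in the abstract, would add signals beyond the message text itself; how the thesis combines the two is not specified here.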
New techniques for Arabic document classification
Text classification (TC) concerns automatically assigning a class (category) label to
a text document, and has a growing number of applications, particularly in organizing
large document collections for browsing. It is typically achieved
via machine learning, where a model is built on the basis of a typically large collection
of document features. Feature selection is critical in this process, since there
are typically several thousand potential features (distinct words or terms). In text
classification, feature selection aims to improve the computational efficiency and
classification accuracy by removing irrelevant and redundant terms (features), while
retaining features (words) that contain sufficient information that help with the
classification task.
This thesis proposes binary particle swarm optimization (BPSO) hybridized with
either K Nearest Neighbour (KNN) or Support Vector Machines (SVM) for feature
selection in Arabic text classification tasks. Comparison between feature selection
approaches is done on the basis of using the selected features in conjunction with
SVM, Decision Trees (C4.5), and Naive Bayes (NB), to classify a hold-out test
set. Using publicly available Arabic datasets, results show that BPSO/KNN and
BPSO/SVM techniques are promising in this domain. The sets of selected features
(words) are also analyzed to consider the differences between the types of features
that BPSO/KNN and BPSO/SVM tend to choose. This leads to speculation concerning
the appropriate feature selection strategy, based on the relationship between
the classes in the document categorization task at hand.
The thesis also investigates the use of statistically extracted phrases of length
two as terms in Arabic text classification. In comparison with the Bag of Words text
representation, results show that using phrases alone as terms in Arabic TC tasks
decreases the classification accuracy of Arabic TC classifiers significantly, while combining
bag of words and phrase-based representations may increase the classification
accuracy of the SVM classifier slightly.
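The wrapper idea behind BPSO/KNN can be sketched as follows: each particle is a bit mask over the features, and its fitness is the accuracy of a nearest-neighbour classifier restricted to the selected features. This is only a rough sketch under stated assumptions, not the thesis's method: the inertia weight (0.7), acceleration constants (1.5), velocity clamp, leave-one-out 1-NN fitness and the toy term-frequency data are all chosen here for illustration.

```python
import math
import random

def knn_loo_accuracy(rows, labels, mask):
    """Leave-one-out accuracy of 1-nearest-neighbour on the masked features."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, other in enumerate(rows):
            if i == j:
                continue
            dist = sum((row[k] - other[k]) ** 2 for k in selected)
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)

def bpso_select(rows, labels, n_particles=8, n_iters=20, seed=0):
    """Binary PSO over feature masks; returns (best mask, best fitness)."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_fit = [knn_loo_accuracy(rows, labels, p) for p in positions]
    best_idx = max(range(n_particles), key=lambda i: personal_fit[i])
    global_best, global_fit = personal_best[best_idx][:], personal_fit[best_idx]
    for _ in range(n_iters):
        for p in range(n_particles):
            for d in range(n_features):
                # Velocity update; 0.7 and 1.5 are assumed constants.
                v = (0.7 * velocities[p][d]
                     + 1.5 * rng.random() * (personal_best[p][d] - positions[p][d])
                     + 1.5 * rng.random() * (global_best[d] - positions[p][d]))
                velocities[p][d] = max(-4.0, min(4.0, v))
                # Sigmoid transfer: velocity becomes the probability of a 1 bit.
                prob = 1.0 / (1.0 + math.exp(-velocities[p][d]))
                positions[p][d] = 1 if rng.random() < prob else 0
            fit = knn_loo_accuracy(rows, labels, positions[p])
            if fit > personal_fit[p]:
                personal_best[p], personal_fit[p] = positions[p][:], fit
                if fit > global_fit:
                    global_best, global_fit = positions[p][:], fit
    return global_best, global_fit

# Toy term-frequency rows (made up): columns 0-1 separate the classes, 2-3 are noise.
rows = [[3, 0, 1, 2], [4, 1, 2, 0], [3, 0, 0, 1],
        [0, 4, 1, 2], [1, 3, 2, 0], [0, 4, 0, 1]]
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
mask, fitness = bpso_select(rows, labels)
print(mask, fitness)
```

Swapping the fitness function for one based on an SVM would give the BPSO/SVM variant in the same framework; real text collections would also need a penalty on subset size or a cap on the number of selected terms, which this sketch omits.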
Classification of extremist text on the web using sentiment analysis approach
The high volume of extremist material online makes manual classification impractical, so there is a need for automated classification techniques. One set of extremist web pages obtained by the TENE Web-crawler was initially subjected to manual classification. A sentiment-based classification model was then developed to automate the classification of such extremist websites. The classification model measures how well the pages can be automatically matched against their appropriate classes. The method also identifies particular data items whose automated classification differs from their manual classification. The results showed that, overall, web pages were correctly matched against the manual classification with a 93% success rate. In addition, a feature selection algorithm was able to reduce the original 26-feature set by one feature and attain a better overall performance of 94% in classifying the Web data.
Mining mailing lists for content
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Civil and Environmental Engineering, 2003. Includes bibliographical references (leaves 65-67). In large decentralized institutions such as MIT, finding information about events and activities on a campus-wide basis can be a strenuous task. This is mainly due to the ephemeral nature of events and the inability to impose a centralized information system on all event organizers and target audiences. For the purpose of advertising events, Email is the communication medium of choice. In particular, there is widespread use of electronic mailing lists to publicize events and activities. These can be used as a valuable source for information mining. This dissertation proposes two mining architectures to find category-specific event announcements broadcast on public MIT mailing lists. At the center of these mining systems is a text classifier that groups Emails based on their textual content. Classification is followed by information extraction, where labeled data, such as the event date, is identified and stored along with the Email content in a searchable database. The first architecture is based on a probabilistic classification method, namely naive Bayes, while the second uses a rules-based classifier. A case implementation, FreeFood@MIT, was built to expose the results of these classification schemes and is used as a benchmark for recommendations. By Mario A. Harik. M.Eng.
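The classify-then-extract pipeline described in this abstract could look roughly like the sketch below, which uses a keyword rule as the classifier and a regular expression for the event date. The keyword set, the date pattern, the record layout and the sample e-mails are all hypothetical; the actual rules used by FreeFood@MIT are not given in the abstract.

```python
import re

# Hypothetical rule set for the "free food" category.
FOOD_KEYWORDS = {"pizza", "free food", "refreshments", "lunch", "snacks"}

# Matches dates such as "March 5, 2003" or "April 2" (the year is optional).
DATE_PATTERN = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r"\s+(\d{1,2})(?:,\s*(\d{4}))?\b"
)

def is_food_event(email_text):
    """Rule-based classifier: true if any keyword occurs in the text."""
    text = email_text.lower()
    return any(keyword in text for keyword in FOOD_KEYWORDS)

def extract_event_date(email_text):
    """Return the first date found as a dict, or None if there is none."""
    match = DATE_PATTERN.search(email_text)
    if match is None:
        return None
    month, day, year = match.groups()
    return {"month": month, "day": int(day), "year": int(year) if year else None}

def mine(emails):
    """Classify each e-mail, then store extracted fields with the content."""
    records = []
    for text in emails:
        if is_food_event(text):
            records.append({"date": extract_event_date(text), "content": text})
    return records

# Made-up sample messages for illustration only.
emails = [
    "Seminar on bridges, March 5, 2003. Pizza will be served.",
    "Reminder: problem set 3 is due on April 2.",
]
records = mine(emails)
print(records)
```

In the probabilistic architecture, `is_food_event` would be replaced by a trained naive Bayes classifier while the extraction step stays the same.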
Extração de informação de saúde através das redes sociais
Social media has proven to be an excellent resource for connecting people
and creating a parallel community, which makes it a suitable source for extracting
information about real-world events as well as about its users. All of this
information can be carefully re-arranged for social monitoring purposes and for the
good of its community. To extract health evidence from social media, we
started by analyzing and identifying postpartum depression in social media posts.
We participated in an online challenge, eRisk 2020, continuing the previous participation
of BioInfo@UAVR, predicting users at risk of self-harm based on their posts on
Reddit. We built an algorithm based on methods of Natural Language Processing
capable of pre-processing text data and vectorizing it. We make use of linguistic
features based on the frequency of specific sets of words, and other models widely
used that represent whole documents with vectors, such as Tf-Idf and Doc2Vec.
The vectors and the corresponding labels are then passed to a Machine Learning
classifier in order to train it. Based on the patterns it finds, the model predicts
a classification for unlabeled users. We use multiple classifiers to find the one
that behaves best with the data. With the goal of getting the most out of
the model, an optimization step is performed in which we remove stop words and
set the text vectorization algorithms and classifier to be run in parallel. An analysis
of the feature importance is integrated and a validation step is performed.
The results are discussed and presented in various plots, and include a comparison
between different tuning strategies and the relation between the parameters and
the score. We conclude that the choice of parameters is essential for achieving a
better score and that, for finding them, there are strategies more efficient than the
widely used Grid Search. Finally, we compare several approaches for building an
incremental classification based on the post timeline of the users, and conclude
that it is possible to have a chronological perception of certain traits of Reddit
users, specifically evaluating the risk of self-harm with an F1 score of 0.73.
Social networks are an excellent resource for connecting people, creating a
parallel community in which information flows about global events as well as
about its users. All of this information can be processed in order to monitor
the well-being of the community. To find medical evidence in social networks,
we began by analyzing and identifying Reddit posts by mothers at risk of
postpartum depression. We took part in an online challenge, eRisk 2020, to
continue the participation of the BioInfo@UAVR team, in which we predict users
at risk of self-harm by analyzing their Reddit posts. We built an algorithm
based on Natural Language Processing methods capable of pre-processing text
data and vectorizing it. It makes use of linguistic features based on the
frequency of sets of words, and of other widely used models capable of
representing documents as vectors, such as Tf-Idf and Doc2Vec. The vectors and
their respective labels are then passed to Machine Learning algorithms, which
are trained to find patterns among them. We use several classifiers in order
to find the one that behaves best with the data. Based on the patterns they
found, the classifiers predict the classification of users yet to be assessed.
To get the most out of the algorithm, an optimization is performed in which
stop words are removed and the text vectorization algorithms and the classifier
are parallelized. We incorporate an analysis of the importance of the model's
features and the optimization of the hyperparameters in order to obtain a
better result. The results are discussed and presented in multiple plots, and
include a comparison between different optimization strategies; we also observe
the relation between the parameters and their performance. We conclude that
the choice of parameters is essential for achieving better results and that,
to find them, there are strategies more efficient than the usual Grid Search,
such as Random Search and Bayesian Optimization. We also compare several
approaches for building an incremental classification that takes the chronology
of the posts into account. We conclude that it is possible to have a
chronological perception of traits of Reddit users, namely assessing the risk
of self-harm, with an F1 score of 0.73.
Master's in Computer and Telematics Engineering
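The Tf-Idf vectorization step mentioned in this abstract can be illustrated with a small pure-Python sketch. It assumes whitespace tokenization, non-empty documents and the plain weighting tf × log(N / df); libraries often use smoothed variants of this formula, and the function name and sample posts below are made up for the example rather than taken from the thesis.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one Tf-Idf vector per document.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            # Term frequency times inverse document frequency.
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Made-up posts for illustration only.
posts = ["i feel fine today", "i feel so alone", "today was fine"]
vocabulary, vectors = tf_idf_vectors(posts)
print(vocabulary)
print(vectors[1])
```

Each resulting vector, paired with its user label, is the kind of input that would then be handed to a classifier; terms that occur in every document receive a weight of zero under this plain formula.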
Sentiment Analysis for Social Media
Sentiment analysis is a branch of natural language processing concerned with the study of the intensity of the emotions expressed in a piece of text. The automated analysis of the multitude of messages delivered through social media is one of the hottest research fields, both in academia and in industry, due to its extremely high potential applicability in many different domains. This Special Issue describes both technological contributions to the field, mostly based on deep learning techniques, and specific applications in areas like health insurance, gender classification, recommender systems, and cyber aggression detection.
Social media mental health analysis framework through applied computational approaches
Studies have shown that mental illness burdens not only public health and productivity but also established market economies throughout the world. However, mental disorders are difficult to diagnose and monitor through traditional methods, which heavily rely on interviews, questionnaires and surveys, resulting in high under-diagnosis and under-treatment rates. The increasing use of online social media, such as Facebook and Twitter, is now a common part of people’s everyday life. The continuous and real-time user-generated content often reflects feelings, opinions, social status and behaviours of individuals, creating an unprecedented wealth of person-specific information. With advances in data science, social media has already been increasingly employed in population health monitoring and more recently mental health applications to understand mental disorders as well as to develop online screening and intervention tools. However, existing research efforts are still in their infancy, primarily aimed at highlighting the potential of employing social media in mental health research. The majority of work is developed on ad hoc datasets and lacks a systematic research pipeline. [Continues.]