17 research outputs found

    Extraction of opinionated profiles from comments on web news

    Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia. Universidade do Porto. 201

    Cyber Security

    This open access book constitutes the refereed proceedings of the 17th International Annual Conference on Cyber Security, CNCERT 2021, held in Beijing, China, in July 2021. The 14 papers presented were carefully reviewed and selected from 51 submissions. The papers are organized according to the following topical sections: data security; privacy protection; anomaly detection; traffic analysis; social network security; vulnerability detection; text classification.

    A model for automated topic spotting in a mobile chat based mathematics tutoring environment

    Systems of writing have existed for thousands of years. The history of civilisation and the history of writing are so intertwined that it is hard to separate the one from the other. These systems of writing, however, are not static. They change. One of the latest developments in systems of writing is short electronic messages such as seen on Twitter and in MXit. One novel application which uses these short electronic messages is the Dr Math® project. Dr Math is a mobile online tutoring system where pupils can use MXit on their cell phones and receive help with their mathematics homework from volunteer tutors around the world. These conversations between pupils and tutors are held in MXit lingo or MXit language – this cryptic, abbreviated system 0f ryting w1ch l0ks lyk dis. Project μ (pronounced mu and indicating MXit Understander) investigated how topics could be determined in MXit lingo and Project μ's research outputs spot mathematics topics in conversations between Dr Math tutors and pupils. Once the topics are determined, supporting documentation can be presented to the tutors to assist them in helping pupils with their mathematics homework. Project μ made the following contributions to new knowledge: a statistical and linguistic analysis of MXit lingo provides letter frequencies, word frequencies, message length statistics as well as linguistic bases for new spelling conventions seen in MXit based conversations; a post-stemmer for use with MXit lingo removes suffixes from the ends of words taking into account MXit spelling conventions allowing words such as equashun and equation to be reduced to the same root stem; a list of over ten thousand stop words for MXit lingo appropriate for the domain of mathematics; a misspelling corrector for MXit lingo which corrects words such as acount and equates it to account; and a model for spotting mathematical topics in MXit lingo. The model was instantiated and integrated into the Dr Math tutoring platform. 
    Empirical evidence as to the effectiveness of the μ Topic Spotter and the other contributions is also presented. The empirical evidence includes specific statistical tests with MXit lingo, specific tests of the misspelling corrector, stemmer, and feedback mechanism, and an extensive exercise of content analysis with respect to mathematics topics.

    Untangling the Web: A Guide To Internet Research

    [Excerpt] Untangling the Web for 2007 is the twelfth edition of a book that started as a small handout. After more than a decade of researching, reading about, using, and trying to understand the Internet, I have come to accept that it is indeed a Sisyphean task. Sometimes I feel that all I can do is to push the rock up to the top of that virtual hill, then stand back and watch as it rolls down again. The Internet—in all its glory of information and misinformation—is for all practical purposes limitless, which of course means we can never know it all, see it all, understand it all, or even imagine all it is and will be. The more we know about the Internet, the more acute is our awareness of what we do not know. The Internet emphasizes the depth of our ignorance because our knowledge can only be finite, while our ignorance must necessarily be infinite. My hope is that Untangling the Web will add to our knowledge of the Internet and the world while recognizing that the rock will always roll back down the hill at the end of the day.

    SVMAUD: Using textual information to predict the audience level of written works using support vector machines

    Information retrieval systems should seek to match resources with the reading ability of the individual user; similarly, an author must choose vocabulary and sentence structures appropriate for his or her audience. Traditional readability formulas, including the popular Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score, rely on numerical representations of text characteristics, including syllable counts and sentence lengths, to suggest audience level of resources. However, the author’s chosen vocabulary, sentence structure, and even the page formatting can alter the predicted audience level by several levels, especially in the case of digital library resources. For these reasons, the performance of readability formulas when predicting the audience level of digital library resources is very low. Rather than relying on these inputs, machine learning methods, including cosine, Naïve Bayes, and Support Vector Machines (SVM), can suggest the grade level of an essay based on the vocabulary chosen by the author. The audience level prediction and essay grading problems share the same inputs, expert-labeled documents, and outputs, a numerical score representing quality or audience level. After a human expert labels a representative sample of resources with audience level, the proposed SVM-based audience level prediction program, SVMAUD, constructs a vocabulary for each audience level; then, the text in an unlabeled resource is compared with this predefined vocabulary to suggest the most appropriate audience level. Two readability formulas and four machine learning programs are evaluated with respect to predicting human-expert entered audience levels based on the text contained in an unlabeled resource. In a collection containing 10,238 expert-labeled HTML-based digital library resources, the Flesch-Kincaid Reading Age and the Dale-Chall Reading Ease Score predict the specific audience level with F-measures of 0.10 and 0.05, respectively. 
    Conversely, cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD improve these F-measures to 0.57, 0.61, 0.68, and 0.78, respectively. When a term’s weight is adjusted based on the HTML tag in which it occurs, the specific audience level prediction performance of cosine, Naïve Bayes, the Collins-Thompson and Callan method, and SVMAUD improves to 0.68, 0.70, 0.75, and 0.84, respectively. When title, keyword, and abstract metadata is used for training, cosine, Naïve Bayes, the Collins-Thompson and Callan model, and SVMAUD specific audience level prediction F-measures are found to be 0.61, 0.68, 0.75, and 0.86, respectively. When cosine, Naïve Bayes, the Collins-Thompson and Callan method, and SVMAUD are trained and tested using resources from a single subject category, the specific audience level prediction F-measure performance improves to 0.63, 0.70, 0.77, and 0.87, respectively. SVMAUD experiences the highest audience level prediction performance among all methods under evaluation in this study. After SVMAUD is properly trained, it can be used to predict the audience level of any written work.
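
    The vocabulary-comparison idea in this abstract can be illustrated with a minimal Python sketch of the cosine baseline only; it is not SVMAUD and not the authors' code. The function names, the tokenizer, the raw term-count weighting, and the two toy training documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_level_vocab(labeled_docs):
    """Accumulate one term-count vector per audience level.

    labeled_docs is an iterable of (level, text) pairs labeled by an expert.
    """
    vocab = {}
    for level, text in labeled_docs:
        vocab.setdefault(level, Counter()).update(tokenize(text))
    return vocab


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_level(vocab, text):
    """Suggest the level whose vocabulary is most similar to the unlabeled text."""
    query = Counter(tokenize(text))
    return max(vocab, key=lambda level: cosine(vocab[level], query))


# Toy, made-up training data; a real collection would be far larger.
TOY_DOCS = [
    ("elementary", "the cat sat on the mat and the dog ran"),
    ("college", "the enzyme catalyzes hydrolysis of the substrate molecule"),
]
toy_vocab = build_level_vocab(TOY_DOCS)
print(predict_level(toy_vocab, "the dog and the cat"))
```

    Swapping the similarity function for a trained classifier over the same term vectors would be the natural next step toward the SVM-based approach the abstract describes.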

    Filtering Obfuscated Email Spam by means of Phonetic String Matching

    Rule-based email filters mainly rely on the occurrence of critical words to classify spam messages. However, perceptive obfuscation techniques can be used to elude exact pattern matching. In this paper we propose a new technique for filtering obfuscated email spam that performs approximate pattern matching both on the original message and on its phonetic transcription.
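
    A rough Python sketch of the two-channel matching idea follows. The abstract does not specify the paper's phonetic transcription or matching algorithm, so this sketch substitutes standard Soundex codes for the phonetic channel and Levenshtein edit distance for approximate matching on the original text. The function names and the `max_dist` threshold are hypothetical.

```python
import re

# Standard Soundex digit classes.
SOUNDEX_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(word):
    """Return the 4-character Soundex code of a word (non-letters are dropped)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    code = word[0].upper()
    last = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != last:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            last = digit
    return (code + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def matches_keyword(token, keyword, max_dist=1):
    """Flag a token that is close to the keyword in spelling or in sound.

    max_dist is an assumed threshold, not a value from the paper.
    """
    if edit_distance(token.lower(), keyword.lower()) <= max_dist:
        return True
    return soundex(token) == soundex(keyword)


for candidate in ("v1agra", "vyaghraa", "village"):
    print(candidate, matches_keyword(candidate, "viagra"))
```

    In this sketch the spelling channel catches single-character substitutions such as `v1agra`, while the Soundex channel catches respellings that sound alike but differ by several characters.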

    Quantitative analysis of the release order of defensive mechanisms

    PhD Thesis. Dependency on information technology (IT) and computer and information security (CIS) has become a critical concern for many organizations. This concern has essentially centred on protecting secrecy, confidentiality, integrity and availability of information. To overcome this concern, defensive mechanisms, which encompass a variety of services and protections, have been proposed to protect system resources from misuse. Most of these defensive mechanisms, such as CAPTCHAs and spam filters, rely in the first instance on a single algorithm as a defensive mechanism. Attackers would eventually break each mechanism. So, each algorithm would ultimately become useless and the system no longer protected. Although this broken algorithm will be replaced by a new algorithm, no one has shed light on a set of algorithms as a defensive mechanism. This thesis looks at a set of algorithms as a holistic defensive mechanism. Our hypothesis is that the order in which a set of defensive algorithms is released has a significant impact on the time taken by attackers to break the combined set of algorithms. The rationale behind this hypothesis is that attackers learn from their attempts, and that the release schedule of defensive mechanisms can be adjusted so as to impair the learning process. To demonstrate the correctness of our hypothesis, an experimental study involving forty participants was conducted to evaluate the effect of algorithms’ order on the time taken to break them. In addition, this experiment explores how the learning process of attackers could be observed. The results showed that the order in which algorithms are released has a statistically significant impact on the time attackers take to break all algorithms. Based on these results, a model has been constructed using Stochastic Petri Nets, which facilitate theoretical analysis of the release-order approach for a set of algorithms.
    Moreover, a tailored optimization algorithm is proposed using a Markov Decision Process model in order to efficiently obtain the optimal release strategy for any given model by maximizing the time taken to break a set of algorithms. As our hypothesis is based on the ability of attackers to learn while interacting with the system, the Attacker Learning Curve (ALC) concept is developed. Based on empirical results of the ALC, an attack strategy detection approach is introduced and evaluated, which has achieved a detection success rate higher than 70%. The empirical findings in this detection approach provide a new understanding of not only how to detect the attack strategy used, but also how to track the attack strategy through the classification probabilities, which may provide an advantage for optimising the release order of defensive mechanisms.