2,233 research outputs found

    CEAI: CCM based Email Authorship Identification Model

    Full text link
    In this paper we present a model for email authorship identification (EAI) employing a Cluster-based Classification Model (CCM). Traditionally, stylometric features have been successfully employed in various authorship analysis tasks; we extend the traditional feature set with further effective features for email authorship identification (e.g. the last punctuation mark used in an email, the tendency of an author to use capitalization at the start of an email, or the punctuation after a greeting or farewell). We also include content features selected by Information Gain. We observe that the use of such features has a positive impact on the accuracy of the authorship identification task. We performed experiments to justify our arguments and compared the results with other baseline models. Experimental results reveal that the proposed CCM-based email authorship identification model, together with the proposed feature set, outperforms state-of-the-art support vector machine (SVM)-based models as well as the models proposed by Iqbal et al. [1, 2]. The proposed model attains accuracies of 94% for 10 authors, 89% for 25 authors, and 81% for 50 authors on the Enron dataset, and 89.5% on a real email dataset constructed by the authors; the Enron results were obtained with considerably more authors than in the models of Iqbal et al. [1, 2].
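    The email-specific features named in the abstract are simple to compute. The sketch below is a minimal illustration rather than the authors' implementation, showing one plausible way to extract the three example features; the greeting list and feature names are assumptions.

```python
# Illustrative sketch (not the paper's code) of the email-specific
# stylometric features named in the abstract. GREETINGS is an assumed list.
import re

GREETINGS = ("hi", "hello", "dear", "regards", "thanks", "best")

def email_style_features(body: str) -> dict:
    stripped = body.strip()
    # Last punctuation mark used anywhere in the email.
    marks = re.findall(r"[.!?;:,]", stripped)
    # Punctuation (if any) immediately after an opening greeting word.
    m = re.match(r"(?i)(%s)\b\s*([,.!:;]?)" % "|".join(GREETINGS), stripped)
    return {
        "last_punctuation": marks[-1] if marks else "",
        "starts_capitalized": bool(stripped) and stripped[0].isupper(),
        "greeting_punctuation": m.group(2) if m else "",
    }

print(email_style_features("Hi, just checking in on the report. Thanks!"))
# {'last_punctuation': '!', 'starts_capitalized': True, 'greeting_punctuation': ','}
```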

    A systematic survey of online data mining technology intended for law enforcement

    Get PDF
    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspection becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies that gap through a systematic mapping study of online data-mining literature that visibly targets law enforcement applications, applying evidence-based survey practices to produce a replicable analysis whose methodology can be examined for deficiencies.

    Probing the topological properties of complex networks modeling short written texts

    Get PDF
    In recent years, graph theory has been widely employed to probe several language properties. More specifically, the so-called word adjacency model has proven useful for tackling several practical problems, especially those relying on textual stylistic analysis. The most common approach to treating texts as networks has simply considered either large pieces of texts or entire books. This approach has certainly worked well -- many informative discoveries have been made this way -- but it raises an uncomfortable question: could there be important topological patterns in small pieces of texts? To address this problem, the topological properties of subtexts sampled from entire books were probed. Statistical analyses performed on a dataset comprising 50 novels revealed that most of the traditional topological measurements are stable for short subtexts. When the performance of the authorship recognition task was analyzed, it was found that proper sampling yields a discriminability similar to the one found with full texts. Surprisingly, support vector machine classification based on the characterization of short texts outperformed the one performed with entire books. These findings suggest that a local topological analysis of large documents might improve their global characterization. Most importantly, it was verified, as a proof of principle, that short texts can be analyzed with the methods and concepts of complex networks. As a consequence, the techniques described here can be extended in a straightforward fashion to analyze texts as time-varying complex networks.
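    For readers unfamiliar with the word adjacency model, the following minimal sketch (assuming the common convention of linking consecutive words, with deliberately naive preprocessing) builds such a network for a short text and computes two traditional topological measurements.

```python
# Minimal word adjacency network: consecutive words become linked nodes,
# and standard topological measurements are taken on the resulting graph.
import networkx as nx

def word_adjacency_network(text: str) -> nx.Graph:
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    g = nx.Graph()
    g.add_edges_from(zip(words, words[1:]))  # link each word to its successor
    return g

sample = "the quick brown fox jumps over the lazy dog and the fox runs"
g = word_adjacency_network(sample)
print(g.number_of_nodes(), g.number_of_edges())
print(nx.average_clustering(g))              # a traditional local measurement
print(nx.average_shortest_path_length(g))    # and a traditional global one
```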

    E-mail forensic authorship attribution

    Get PDF
    E-mails have become the standard for business as well as personal communication. The inherent security risks within e-mail communication present the problem of anonymity: if the author of an e-mail is not known, the digital forensic investigator needs to determine the authorship of the e-mail using a process that has not been standardised in the e-mail forensic field. This research project examines the problems associated with e-mail communication and the digital forensic domain, more specifically e-mail forensic investigations and the recovery of legally admissible evidence to be presented in a court of law. The research methodology combined a comprehensive literature review with Design Science, resulting in the development of an artifact through intensive research. The Proposed E-mail Forensic Methodology is based on the most current digital forensic investigation process, and the process was further validated through expert reviews conducted with the aid of the Delphi technique; the opinions of the digital forensic experts were an integral part of the validation, adding to the credibility of the study. The Proposed E-mail Forensic Methodology applies a standardised investigation process to an e-mail investigation and takes the South African perspective into account by incorporating various checks against the relevant laws and legislation. By following the Proposed E-mail Forensic Methodology, e-mail forensic investigators can produce evidence that is legally admissible in a court of law.

    A multi-disciplinary framework for cyber attribution

    Get PDF
    Effective cyber security is critical to the prosperity of any nation in the modern world. We have become dependent upon this interconnected network of systems for a number of critical functions within society. As our reliance upon this technology has increased, so have the prospective gains for malicious actors who would abuse these systems for their own personal benefit, at the cost of legitimate users. The result has been an explosion of cyber attacks, or cyber-enabled crimes. The threat from hackers, organised criminals and even nation states is ever increasing. One of the critical enablers of our cyber security is cyber attribution, the ability to tell who is acting against our systems. A purely technical approach to cyber attribution has been found to be ineffective in the majority of cases, taking too narrow an approach to the attribution problem. A purely technical approach will provide Indicators of Compromise (IOCs), which are suitable for the immediate recovery and clean-up after a cyber event, but it fails to ask the deeper questions about the origin of the attack. Those answers can be derived from a wider set of analyses and additional sources of data. Unfortunately, due to the wide range of data types and the highly specialist skills required to perform deep-level analysis, there is currently no common framework for analysts to work together towards resolving the attribution problem. This is further exacerbated by a communication barrier between the highly specialised fields and the lack of obviously compatible data types. The aim of the project is to develop a common framework through which experts from a number of disciplines can add to the overall attribution picture. These experts add their input in the form of a library. First, a process was developed to enable the creation of compatible libraries in different specialist fields. A series of libraries can then be used by an analyst to create an overarching attribution picture. The framework highlights any intelligence gaps, and an analyst can use the list of libraries to suggest a tool or method to fill each gap. By the end of the project a working framework had been developed, with a number of libraries from a wide range of technical attribution disciplines. These libraries were used to feed real-time intelligence to both technical and non-technical analysts, who were then able to use this information to perform in-depth attribution analysis. The pictorial format of the framework was found to help break down the communication barrier between disciplines and was suitable as an intelligence product in its own right, providing a useful visual aid for briefings. The simplicity of the library-based system meant that the process was easy to learn, requiring only a short introduction to the framework.
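    As a concrete illustration of the library-based design described above, the hypothetical sketch below shows how per-discipline libraries might each contribute fields to a shared attribution picture while the framework reports the remaining intelligence gaps. The field names and library classes are invented for illustration, not taken from the project.

```python
# Hypothetical sketch of a library-based attribution framework: each
# specialist "library" fills in part of a shared attribution picture, and
# the framework lists the fields that remain as intelligence gaps.
ATTRIBUTION_FIELDS = ("infrastructure", "malware_family", "language_clues", "motive")

class MalwareAnalysisLibrary:
    def contribute(self, evidence: dict) -> dict:
        if "binary_hash" in evidence:
            return {"malware_family": "family looked up from hash (stub)"}
        return {}

class LinguisticsLibrary:
    def contribute(self, evidence: dict) -> dict:
        if "ransom_note" in evidence:
            return {"language_clues": "stylometric assessment (stub)"}
        return {}

def build_picture(evidence: dict, libraries: list) -> dict:
    picture = {}
    for lib in libraries:                  # each discipline adds its part
        picture.update(lib.contribute(evidence))
    gaps = [f for f in ATTRIBUTION_FIELDS if f not in picture]
    picture["intelligence_gaps"] = gaps    # prompts the analyst to pick
    return picture                         # another tool or library

result = build_picture({"binary_hash": "abc123", "ransom_note": "..."},
                       [MalwareAnalysisLibrary(), LinguisticsLibrary()])
print(result)
```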

    Design of a Controlled Language for Critical Infrastructures Protection

    Get PDF
    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). The project originates from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus.
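    A thesaurus of the kind the abstract invokes is, at its core, a mapping from preferred terms to broader/narrower terms and non-preferred synonyms. The sketch below is a generic illustration of that library-science structure; the CIP entries are invented examples, not terms from the project's vocabulary.

```python
# Generic controlled-vocabulary sketch: preferred terms with broader/narrower
# relations and "use for" synonyms, as in a classic thesaurus. Entries are
# invented for illustration.
THESAURUS = {
    "critical infrastructure": {
        "narrower": ["power grid", "water supply"],
        "use_for": ["vital infrastructure"],
    },
    "power grid": {"broader": ["critical infrastructure"], "use_for": []},
}

def preferred_term(term: str) -> str:
    """Map a free-text term onto the vocabulary's preferred term."""
    term = term.lower()
    for preferred, entry in THESAURUS.items():
        if term == preferred or term in entry.get("use_for", []):
            return preferred
    return term  # unknown terms pass through for later curation

print(preferred_term("Vital infrastructure"))  # -> critical infrastructure
```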

    Essays on Technology in Presence of Globalization

    Get PDF
    Technology has long been known to enable globalization in ways previously not thought possible, with instantaneous communication allowing members of organizations all across the globe to communicate and share information with little to no delay. However, as the effects of globalization have become more prominent, they have in turn helped to shape the very technologies that enable these processes. These three essays analyze three examples of how these two processes – globalization and technological development – impact one another. The first looks at the national policy level, attempting to understand how increased opportunities for inside leakers can force governments to consider asylum requests. The second analyzes the issue at the level of corporations, attempting to understand how and why business leaders choose to hire individuals from other countries. The third and final essay analyzes the issue at the most micro level, studying a potential application that could help analyze linguistic factors that have taken on a more prominent role in a more globalized society.

    Admissibility of Non-U.S. Electronic Evidence

    Get PDF
    After two long years collecting hundreds of gigabytes of e-mail, database reports, and social media posts from countries such as France, South Korea, Argentina, Canada, Australia, and El Salvador, the day of trial has arrived. The trial team has obtained the data at great cost, in dollars as well as person-hours, but is finally ready for trial. First-chair counsel, second-chair counsel, and four paralegals file into the courtroom, not with banker's boxes full of documents as in earlier times, but with laptops, tablet computers, and a data projector. Following opening statements, the first witness takes the stand. After a few questions about the existence of e-mails written by executives of the defendant multinational corporation, a paralegal moves to the projector, as she has rehearsed many times, to flip the switch that will project the e-mails for the jury. She hears, “Objection!” followed immediately by, “Sustained.” Counsel asks for a sidebar. Instead, the judge asks the court officer to take the jury out. She then notes that these e-mails, the production of which she had ruled upon previously, were created outside the U.S. Who will testify to their authenticity? What was the chain of custody—were they altered in some fashion in the office or between the client’s servers and counsel’s laptop? How, exactly, do the e-mails fit into an exception to the hearsay rule? Business records? What is the “business” of this foreign facility that requires the use of e-mail on a regular basis? Counsel asks for a continuance to respond to those questions. “Denied!” the judge says.

    Detecting plagiarism in the forensic linguistics turn

    Get PDF
    This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper; these data are analysed linguistically to identify instances of verbatim, morpho-syntactic, lexical and discursive overlap. Data in the form of surveys were obtained from two higher education institutions in Portugal and another two in the United Kingdom; these data are analysed using a 2x2 between-groups univariate analysis of variance (ANOVA) to reveal cross-cultural divergences in perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection: on the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources; on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive and, if so, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection and identifies strategies that these systems fail to detect. Specifically, a method is proposed for detecting translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues to the plagiarist’s intentionality. Furthermore, the findings show that translingual plagiarism can be detected using the proposed method, and that plagiarism detection software can be improved using existing computer tools.
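    Of the overlap types listed above, verbatim overlap is the simplest to operationalise. The sketch below is a minimal illustration rather than any system from the thesis: it flags verbatim borrowing via shared word n-grams, and by construction it misses the morpho-syntactic, lexical and translingual strategies the study analyses.

```python
# Minimal verbatim-overlap detector: the share of a suspect text's word
# n-grams that also occur in a candidate source text. The choice of n is
# illustrative, not a value from the study.
def ngrams(text: str, n: int) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(suspect: str, source: str, n: int = 4) -> float:
    suspect_grams = ngrams(suspect, n)
    if not suspect_grams:
        return 0.0
    return len(suspect_grams & ngrams(source, n)) / len(suspect_grams)

src = "plagiarism involves borrowing words or ideas without acknowledgement"
sus = "the thesis says plagiarism involves borrowing words or ideas without credit"
print(verbatim_overlap(sus, src))  # 0.4: sizeable verbatim borrowing
```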