2,233 research outputs found
CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification (CCM) technique. Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also included Info Gain feature selection based content features.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other base line models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains accuracy rates of 94%, 89%, and 81% for 10, 25, and 50
authors, respectively, on the Enron dataset, while 89.5% accuracy is achieved
on a real email dataset constructed by the authors. The results on the Enron
dataset were obtained for a considerably larger number of authors than in the
models proposed by Iqbal et al. [1, 2].
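The email-specific stylometric features described above can be illustrated with a minimal sketch. This is not the authors' implementation; the feature names and the greeting regex are assumptions chosen for illustration only.

```python
import re

def email_style_features(body: str) -> dict:
    """Extract a few email-specific stylometric features of the kind
    described in the abstract (illustrative only, not the paper's
    actual feature set)."""
    stripped = body.strip()
    # Last punctuation mark used anywhere in the email (or None)
    punct = [c for c in stripped if c in ".,!?;:"]
    last_punct = punct[-1] if punct else None
    # Tendency to capitalize the first character of the email
    starts_capitalized = bool(stripped) and stripped[0].isupper()
    # Punctuation immediately after a greeting such as "Hi ..." or "Dear ..."
    m = re.match(r"(?:hi|hello|dear)\b[^\n,.:!]*([,.:!])",
                 stripped, re.IGNORECASE)
    greeting_punct = m.group(1) if m else None
    return {
        "last_punct": last_punct,
        "starts_capitalized": starts_capitalized,
        "greeting_punct": greeting_punct,
    }
```

Feature vectors of this kind would then be fed, alongside traditional stylometric and content features, to a classifier such as the CCM or SVM models compared in the paper.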
A systematic survey of online data mining technology intended for law enforcement
As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspections becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists which examines their techniques, applications and rigour. This article remedies this gap through a systematic mapping study describing online data-mining literature that visibly targets law enforcement applications, using evidence-based survey practices to produce a replicable analysis that can be methodologically examined for deficiencies.
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks
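The word adjacency model referred to above can be sketched in a few lines: each distinct word becomes a node, and consecutive words are linked by an edge, after which simple topological measurements are taken. This is a minimal pure-Python sketch under the assumption of an undirected, unweighted network; the study itself uses a richer set of measurements.

```python
from collections import defaultdict

def word_adjacency_network(words):
    """Build an undirected word adjacency network: each distinct word
    is a node, and consecutive words in the text share an edge."""
    edges = defaultdict(set)
    for a, b in zip(words, words[1:]):
        if a != b:
            edges[a].add(b)
            edges[b].add(a)
    return edges

def average_degree(edges):
    """Mean number of neighbours per node -- one of the simple
    topological measurements of the kind reported to remain stable
    even for short subtexts."""
    if not edges:
        return 0.0
    return sum(len(nbrs) for nbrs in edges.values()) / len(edges)
```

For example, `average_degree(word_adjacency_network("the cat sat on the mat".split()))` yields the mean connectivity of a six-word text; comparing such measurements across subtexts sampled from a book is the kind of analysis the abstract describes.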
E-mail forensic authorship attribution
E-mails have become the standard for business as well as personal communication. The inherent security risks within e-mail communication present the problem of anonymity. If the author of an e-mail is not known, the digital forensic investigator needs to determine the authorship of the e-mail using a process that has not been standardised in the e-mail forensic field. This research project examines many problems associated with e-mail communication and the digital forensic domain; more specifically, e-mail forensic investigations and the recovery of legally admissible evidence to be presented in a court of law. The research methodology utilised a comprehensive literature review in combination with Design Science, resulting in the development of an artifact through intensive research. The Proposed E-Mail Forensic Methodology is based on the most current digital forensic investigation process, and further validation of the process was established via expert reviews. The opinions of the digital forensic experts were an integral part of the validation process, which adds to the credibility of the study; this was performed with the aid of the Delphi technique. The Proposed E-Mail Forensic Methodology adopts a standardised investigation process applied to an e-mail investigation and takes into account the South African perspective by incorporating various checks against the relevant laws and legislation. By following the Proposed E-Mail Forensic Methodology, e-mail forensic investigators can produce evidence that is legally admissible in a court of law.
A multi-disciplinary framework for cyber attribution
Effective cyber security is critical to the prosperity of any nation in the modern world. We have become
dependent upon this interconnected network of systems for a number of critical functions within society.
As our reliance upon this technology has increased, so have the prospective gains for malicious actors who
would abuse these systems for their own personal benefit, at the cost of legitimate users. The result has
been an explosion of cyber attacks and cyber-enabled crimes. The threat from hackers, organised criminals
and even nation states is ever increasing. One of the critical enablers of our cyber security is that of cyber
attribution, the ability to tell who is acting against our systems.
A purely technical approach to cyber attribution has been found to be ineffective in the majority of cases,
taking too narrow an approach to the attribution problem. A purely technical approach will provide Indicators
of Compromise (IOCs), which are suitable for the immediate recovery and clean-up of a cyber event. It
fails, however, to ask the deeper questions of the origin of the attack. These can be answered from a wider
set of analysis and additional sources of data. Unfortunately, due to the wide range of data types and the
highly specialist skills required to perform deep-level analysis, there is currently no common framework
for analysts to work together towards resolving the attribution problem. This is further exacerbated by a
communication barrier between the highly specialised fields and no obviously compatible data types.
The aim of the project is to develop a common framework upon which experts from a number of disciplines
can add to the overall attribution picture. These experts will add their input in the form of a library. Firstly
a process was developed to enable the creation of compatible libraries in different specialist fields. A series
of libraries can be used by an analyst to create an overarching attribution picture. The framework will
highlight any intelligence gaps and additionally an analyst can use the list of libraries to suggest a tool or
method to fill that intelligence gap.
By the end of the project a working framework had been developed, with a number of libraries from a
wide range of technical attribution disciplines. These libraries were used to feed real-time intelligence
to both technical and non-technical analysts, who were then able to use this information to perform in-depth
attribution analysis. The pictorial format of the framework was found to assist in breaking down the
communication barrier between disciplines, and was suitable as an intelligence product in its own right,
providing a useful visual aid to briefings. The simplicity of the library-based system meant that the process
was easy to learn, with only a short introduction to the framework required.
Design of a Controlled Language for Critical Infrastructures Protection
We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates
from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically
represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of
traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an
analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus.
Essays on Technology in Presence of Globalization
Technology has long been known to enable globalization in ways previously not thought possible, with instantaneous communication allowing members of organizations all across the globe to communicate and share information with little to no delay. However, as the effects of globalization have become more prominent, they have in turn helped to shape the very technologies that enable these processes. These three essays analyze three examples of how these two processes – globalization and technological development – impact one another. The first looks at a national policy level, attempting to understand how increased possibilities for inside leakers can force governments to consider asylum requests. The second analyzes the issue at the level of corporations, attempting to understand how and why business leaders choose to hire individuals from other countries. The third and final essay analyzes the issue at the most micro level, studying a potential application that could help analyze linguistic factors that have taken a more prominent role in a more globalized society
Admissibility of Non-U.S. Electronic Evidence
After two long years collecting hundreds of gigabytes of e-mail, database reports, and social media posts from countries in Europe, Asia, and South America, such as France, South Korea, Argentina, Canada, Australia, and El Salvador, the day of trial has arrived. The trial team has obtained the data at great cost, in dollars as well as person-hours, but is finally ready for trial. First-chair counsel, second-chair counsel, and four paralegals file into the courtroom, not with bankers' boxes full of documents as in earlier times, but with laptops, tablet computers, and a data projector. Following opening statements, the first witness takes the stand. After a few questions about the existence of e-mails written by executives of the defendant multinational corporation, a paralegal moves to the projector, as she rehearsed many times, to flip on the switch that will project the e-mails for the jury. She hears, “Objection!” followed immediately by, “Sustained.” Counsel asks for a sidebar. Instead, the judge asks the court officer to take the jury out. She then notes that these e-mails, the production of which she had ruled upon previously, were created outside the U.S. Who will testify to their authenticity? What was the chain of custody—were they altered in some fashion in the office or between the client’s servers and counsel’s laptop? How, exactly, do the e-mails fit into an exception to the hearsay rule? Business records? What is the “business” of this foreign facility that requires the use of e-mail on a regular basis? Counsel asks for a continuance to respond to those questions. “Denied!” the judge says.
Detecting plagiarism in the forensic linguistics turn
This study investigates plagiarism detection, with an application in forensic contexts. Two types of data were collected for the purposes of this study. Data in the form of written texts were obtained from two Portuguese universities and from a Portuguese newspaper. These data are analysed linguistically to identify instances of verbatim, morpho-syntactical, lexical and discursive overlap. Data in the form of surveys were obtained from two higher education institutions in Portugal, and another two in the United Kingdom. These data are analysed using a 2 by 2 between-groups univariate analysis of variance (ANOVA), to reveal cross-cultural divergences in the perceptions of plagiarism. The study discusses the legal and social circumstances that may contribute to adopting a punitive approach to plagiarism, or, conversely, to rejecting punishment. The research adopts a critical approach to plagiarism detection. On the one hand, it describes the linguistic strategies adopted by plagiarists when borrowing from other sources, and, on the other hand, it discusses the relationship between these instances of plagiarism and the context in which they appear. A focus of this study is whether plagiarism involves an intention to deceive, and, in this case, whether forensic linguistic evidence can provide clues to this intentionality. It also evaluates current computational approaches to plagiarism detection, and identifies strategies that these systems fail to detect. Specifically, a method is proposed to detect translingual plagiarism. The findings indicate that, although cross-cultural aspects influence the different perceptions of plagiarism, a distinction needs to be made between intentional and unintentional plagiarism. The linguistic analysis demonstrates that linguistic elements can contribute to finding clues for the plagiarist’s intentionality.
Furthermore, the findings show that translingual plagiarism can be detected using the proposed method, and that plagiarism detection software can be improved using existing computer tools.
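The verbatim and near-verbatim overlap that such computational approaches target can be sketched with a simple character n-gram comparison. This is a crude illustrative proxy only, not the method proposed in the study; the n-gram length and the Jaccard measure are assumptions.

```python
def char_ngram_overlap(text_a: str, text_b: str, n: int = 8) -> float:
    """Jaccard similarity of character n-gram sets -- a crude proxy
    for verbatim and near-verbatim textual overlap (illustrative
    only; the n-gram length is an assumed parameter)."""
    def ngrams(text):
        # Normalise case and whitespace before extracting n-grams
        text = " ".join(text.lower().split())
        return {text[i:i + n] for i in range(len(text) - n + 1)}
    a, b = ngrams(text_a), ngrams(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A score near 1.0 signals near-verbatim copying, while morpho-syntactical or translingual borrowing of the kind the study analyses would score low here, which is precisely why purely surface-level tools miss it.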