Identifying duplicate content using statistically improbable phrases
Motivation: Document similarity metrics such as PubMed's ‘Find related articles’ feature, which have been primarily used to identify studies with similar topics, can now also be used to detect duplicated or potentially plagiarized papers within literature reference databases. However, the CPU-intensive nature of document comparison has limited MEDLINE text similarity studies to the comparison of abstracts, which constitute only a small fraction of a publication's total text. Extending searches to include text archived by online search engines would drastically increase comparison coverage. For large-scale studies, submitting short phrases enclosed in direct quotes to search engines for exact matches would be optimal for both individual queries and programmatic interfaces. We have derived a method of analyzing statistically improbable phrases (SIPs) to assist in identifying duplicate content.
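The core idea behind SIPs can be sketched as scoring each phrase in a document by how much more frequent it is there than in a background corpus; phrases that are essentially unseen in the background are the most "improbable" and make the best exact-match search queries. The function below is a minimal illustration of that scoring scheme, not the authors' actual method; the background frequency table and parameter names are assumptions for the example.

```python
from collections import Counter

def statistically_improbable_phrases(text, background_freq, n=3, top_k=5):
    """Score each n-gram by the ratio of its in-document frequency to its
    background-corpus frequency; the highest-ratio phrases are candidate SIPs.
    `background_freq` maps phrases to relative frequencies (an assumption for
    this sketch; a real system would use large corpus counts)."""
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    scored = []
    for phrase, c in counts.items():
        doc_p = c / total
        bg_p = background_freq.get(phrase, 1e-9)  # unseen => highly improbable
        scored.append((doc_p / bg_p, phrase))
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]
```

A phrase common in the background corpus (low ratio) is filtered out, while phrases absent from it rise to the top and could then be quoted verbatim in a search-engine query.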
An IR-based Approach Utilising Query Expansion for Plagiarism Detection in MEDLINE
The identification of duplicated and plagiarised passages of text has become an increasingly active area of research. In this paper we investigate methods for plagiarism detection that aim to identify potential sources of plagiarism from MEDLINE, particularly when the original text has been modified through the replacement of words or phrases. A scalable approach based on Information Retrieval is used to perform candidate document selection - the identification of a subset of potential source documents given a suspicious text - from MEDLINE. Query expansion is performed using the UMLS Metathesaurus to deal with situations in which original documents are obfuscated. Various approaches to Word Sense Disambiguation are investigated to deal with cases where there are multiple Concept Unique Identifiers (CUIs) for a given term. Results using the proposed IR-based approach outperform a state-of-the-art baseline based on Kullback-Leibler Distance.
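Candidate document selection with query expansion can be illustrated with a toy retriever: expand the suspicious text's terms through a synonym map (standing in here for UMLS Metathesaurus lookups, which in reality involve CUIs and disambiguation), then rank corpus documents by IDF-weighted term overlap. The synonym map and scoring details below are assumptions for the sketch, not the paper's implementation.

```python
import math

# Hypothetical synonym map standing in for UMLS Metathesaurus lookups.
SYNONYMS = {"kidney": ["renal"], "tumor": ["neoplasm"]}

def expand(tokens):
    """Add known synonyms so obfuscated sources (word substitutions)
    can still be matched."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def candidate_documents(suspicious, corpus, top_k=2):
    """Return indices of the top_k corpus documents ranked by
    IDF-weighted overlap with the expanded query terms."""
    query = set(expand(suspicious.lower().split()))
    docs = [set(d.lower().split()) for d in corpus]
    n = len(docs)

    def idf(term):
        df = sum(term in d for d in docs)
        return math.log((n + 1) / (df + 1)) + 1

    scores = [sum(idf(t) for t in query if t in d) for d in docs]
    return sorted(range(n), key=lambda i: -scores[i])[:top_k]
```

Because "kidney" expands to "renal" and "tumor" to "neoplasm", documents that paraphrase the suspicious text with substituted words can still surface as candidates for closer comparison.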
Philosophy’s gender gap and argumentative arena: an empirical study
While the empirical evidence pointing to a gender gap in professional, academic philosophy in the English-speaking world is widely accepted, explanations of this gap are less so. In this paper, we aim to make a modest contribution to the literature on the gender gap in academic philosophy by taking a quantitative, corpus-based empirical approach. Since some philosophers have suggested that it may be the argumentative, “logic-chopping,” and “paradox-mongering” nature of academic philosophy that explains the underrepresentation of women in the discipline, our research questions are the following: Do men and women philosophers make different types of arguments in their published works? If so, which ones and with what frequency? Using data mining and text analysis methods, we study a large corpus of philosophical texts mined from the JSTOR database in order to answer these questions empirically. Using indicator words to classify arguments by type, we search through our corpus to find patterns of argumentation. Overall, the results of our empirical study suggest that women philosophers make deductive, inductive, and abductive arguments in their published works just as much as men philosophers do, with no statistically significant differences in the proportions of those arguments relative to each philosopher’s body of work.
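The indicator-word method described above can be sketched as counting type-specific cue words in a text. The word lists below are illustrative assumptions, not the study's actual indicator lists, and a real pipeline would also normalize counts by corpus size before testing for significance.

```python
# Hypothetical indicator lists; the study's actual lists are not given here.
INDICATORS = {
    "deductive": ["therefore", "necessarily", "it follows that"],
    "inductive": ["probably", "usually", "likely"],
    "abductive": ["best explanation", "plausibly"],
}

def classify_arguments(text):
    """Count occurrences of each argument type's indicator words,
    giving a rough per-text profile of argumentation patterns."""
    text = text.lower()
    return {kind: sum(text.count(word) for word in words)
            for kind, words in INDICATORS.items()}
```

Aggregating such profiles per author, and then comparing the proportions across groups with a standard significance test, is the shape of the comparison the abstract describes.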
Word Order in Epigraphic Gǝ’ǝz
The paper offers the results of analysis of word order throughout the epigraphic corpus of Gǝʿǝz. This evidence is mostly in agreement with the data from Classical Gǝʿǝz and confirms that early Gǝʿǝz represents the classical Semitic type of a right-branching language: objects and prepositional phrases mostly follow the verbs, and relative clauses and genitive complements usually follow the head nouns. At the same time, some differences between the syntax of Classical Gǝʿǝz and Epigraphic Gǝʿǝz have been registered, notably in the behaviour of numerals.
How Censorship in China Allows Government Criticism but Silences Collective Expression
We offer the first large scale, multiple source analysis of the outcome of what may be the most extensive effort to selectively censor human expression ever implemented. To do this, we have devised a system to locate, download, and analyze the content of millions of social media posts originating from nearly 1,400 different social media services all over China before the Chinese government is able to find, evaluate, and censor (i.e., remove from the Internet) the large subset they deem objectionable. Using modern computer-assisted text analytic methods that we adapt to and validate in the Chinese language, we compare the substantive content of posts censored to those not censored over time in each of 85 topic areas. Contrary to previous understandings, posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored. Instead, we show that the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content. Censorship is oriented toward attempting to forestall collective activities that are occurring now or may occur in the future, and, as such, seems to clearly expose government intent.