Clustering by compression
We present a new method for clustering based on compression. The method
does not use subject-specific features or background knowledge, and works as
follows: First, we determine a universal similarity distance, the normalized
compression distance or NCD, computed from the lengths of compressed data files
(singly and in pairwise concatenation). Second, we apply a hierarchical
clustering method. The NCD is universal in that it is not restricted to a
specific application area, and works across application area boundaries. A
theoretical precursor, the normalized information distance, co-developed by one
of the authors, is provably optimal but uses the non-computable notion of
Kolmogorov complexity. We propose precise notions of a similarity metric and
a normal compressor, and show that the NCD based on a normal compressor is a
similarity metric that approximates universality. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (binary tree) by a new quartet
method and a fast heuristic to implement it. The method is implemented and
available as public software, and is robust under choice of different
compressors. To substantiate our claims of universality and robustness, we
report evidence of successful application in areas as diverse as genomics,
virology, languages, literature, music, handwritten digits, astronomy, and
combinations of objects from completely different domains, using statistical,
dictionary, and block sorting compressors. In genomics we presented new
evidence for major questions in Mammalian evolution, based on
whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta
hypothesis against the Theria hypothesis.

Comment: LaTeX, 27 pages, 20 figures
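The NCD computation the abstract describes, from the lengths of singly and pairwise-concatenated compressed files, can be sketched in a few lines. The following is a minimal illustration using Python's zlib as a stand-in statistical/dictionary compressor; it is not the authors' released software, and the sample byte strings are invented for the example.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a fixed compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar objects compress well when concatenated, so their NCD is smaller.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox leaps over the lazy dog " * 20
c = bytes((i * 89 + 41) % 256 for i in range(900))  # poorly compressible noise

print(f"NCD(a, b) = {ncd(a, b):.3f}")
print(f"NCD(a, c) = {ncd(a, c):.3f}")
```

Because real compressors are imperfect approximations of Kolmogorov complexity, values can slightly exceed 1; what matters for the hierarchical clustering step is the relative ordering of distances in the matrix.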
Information Overload: An Overview
For almost as long as there has been recorded information, there has been a perception that humanity has been overloaded by it. Concerns about 'too much to read' have been expressed for many centuries, and made more urgent since the arrival of ubiquitous digital information in the late twentieth century. The historical perspective is a necessary corrective to the often, and wrongly, held view that overload is associated solely with the modern digital information environment, and with social media in particular. However, as society fully experiences Floridi's Fourth Revolution and moves into hyper-history (with society dependent on, and defined by, information and communication technologies) and the infosphere (an information environment distinguished by a seamless blend of online and offline information activity), individuals and societies are dependent on, and formed by, information in an unprecedented way, and information overload needs to be taken more seriously than ever. Overload has been claimed to be both the major issue of our time and a complete non-issue. It has been cited as an important factor in, inter alia, science, medicine, education, politics, governance, business and marketing, planning for smart cities, access to news, personal data tracking, home life, use of social media, and online shopping, and has even influenced literature. The information overload phenomenon has been known by many different names, including: information overabundance, infobesity, infoglut, data smog, information pollution, information fatigue, social media fatigue, social media overload, information anxiety, library anxiety, infostress, infoxication, reading overload, communication overload, cognitive overload, information violence, and information assault.
There is no single generally accepted definition, but information overload can best be understood as the situation which arises when there is so much relevant and potentially useful information available that it becomes a hindrance rather than a help. Its essential nature has not changed with changing technology, though its causes and proposed solutions have changed much. The best ways of avoiding overload, individually and socially, appear to lie in a variety of coping strategies, such as filtering, withdrawing, queuing, and 'satisficing'. Better design of information systems, effective personal information management, and the promotion of digital and media literacies also have a part to play. Overload may perhaps best be overcome by seeking a mindful balance in consuming information, and in finding understanding.
The Effects of the ADD Label on Teachers\u27 Attitudes and Expectations
Attention Deficit Disorder (ADD) is rapidly becoming an important educational issue. Although much research has been conducted into the effects of labelling and teachers' attitudes and expectations on children's academic and social behaviour, little research has been conducted into the relationship between the label 'ADD' and teachers' attitudes and expectations. The main purpose of this study was to determine the effects of the ADD label on teachers' attitudes and expectations for children with ADD. In addition, the effects of teachers' personal characteristics on their attitudes and expectations for children with ADD, and teachers' perceptions of issues surrounding ADD, were investigated. The study was conducted utilising self-report data collected from instruments consisting of one of two vignettes describing the typical ADD behaviours of a hypothetical child, and a Likert-type rating scale. Primary school teachers exposed to the vignette containing the ADD label formed the experimental group, while those who completed the vignette without the ADD label formed the control group. The results revealed the ADD label and teachers' personal characteristics had no effect on their attitudes and expectations regarding children with ADD. The results also showed teachers feel they need more resources (e.g., information, teaching strategies, support) in order to meet the needs of children with learning and behaviour disorders such as ADD.
Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a transcription cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
Citations versus expert opinions: Citation analysis of Featured Reviews of the American Mathematical Society
Peer review and citation metrics are two means of gauging the value of
scientific research, but the lack of publicly available peer review data makes
the comparison of these methods difficult. Mathematics can serve as a useful
laboratory for considering these questions because as an exact science, there
is a narrow range of reasons for citations. In mathematics, virtually all
published articles are post-publication reviewed by mathematicians in
Mathematical Reviews (MathSciNet) and so the data set was essentially the Web
of Science mathematics publications from 1993 to 2004. For a decade, especially
important articles were singled out in Mathematical Reviews for featured
reviews. In this study, we analyze the bibliometrics of elite articles selected
by peer review and by citation count. We conclude that the two notions of
significance described by being a featured review article and being highly
cited are distinct. This indicates that peer review and citation counts give
largely independent determinations of highly distinguished articles. We also
consider whether hiring patterns of subfields and mathematicians' interest in
subfields reflect subfields of featured review or highly cited articles. We
reexamine data from two earlier studies in light of our methods, for their
implications on the peer review/citation count relationship across a diversity
of disciplines.

Comment: 21 pages, 3 figures, 4 tables
The construction of a Business English curriculum, relevant to the workplace, and making use of word processing in place of handwriting
Since the Thailand economic crisis in 1997 there has been a sense of urgency expressed in many areas of society that businesses must modernize their practices and focus more on international trade and communication. Two important components of the changes required are better use of Information and Communication Technologies (ICT) and better use of the English language for business communication. In the education arena this has translated into the need to provide graduates with better skills in the use of English and computers. These two skill areas come together naturally in the study of Business English. In Thailand, Rajabhat Institutes have a major responsibility for the training of business professionals and for the improvement of local communities. Therefore, research is required to determine how Thai Rajabhat Institutes may best improve the provision of Business English to better serve the needs of employing organizations and the local community. This study set out to conduct research to address this area of concern.
The elephant in the record: On the multiplicity of data recording work
This article focuses on the production side of clinical data work, or data recording work, and in particular on its multiplicity in terms of data variability. We report the findings from two case studies aimed at assessing the multiplicity that can be observed when the same medical phenomenon is recorded by multiple competent experts, yet the recorded data enable the knowledgeable management of illness trajectories. Although this variability is often framed in terms of the latent unreliability of medical data, and then treated as a problem to solve, we argue that practitioners in the health informatics field must gain a greater awareness of the natural variability of data inscribing work, assess it, and design solutions that allow actors on both sides of clinical data work (production and care, as well as the primary and secondary uses of data) to aptly inform each other's practices.

Cabitza, F.; Locoro, A.; Alderighi, C.; Rasoini, R.; Compagnone, D.; Berjano, P.
Feasibility of Melville Marginalia Authorship Differentiation
We examine the feasibility of using image processing techniques to determine differentiation in authorship of historical pencil marks. Pencil marks with unattributed and attributed authorship are segmented from digital images of historical books. Analysis is performed on five features that are extracted from the vertical pencil marks, with those features used as a basis for authorship of marks. These marks consist of single-stroke marks that are interspersed in the same document. We describe the challenges of the digital format that we were given and the steps taken in using autonomous segmentation to save the pixel locations of marks. Five mark features are chosen and extracted: Average Intensity, Stroke Width, Blurriness, Stroke Curvature, and Stroke Angle. Features are then analyzed with the use of different histograms and 2D scatter plots of the feature space, comparing and contrasting the two groups of marks. C-means clustering is performed on the feature spaces of both groups. Semi-supervised clustering is used to test whether we can predict the clustering. We then use two forms of cluster validity, the Davies-Bouldin Index and the Silhouette, in order to produce a confidence value on the number of clusters and their membership. Then we look at the histograms and 2D scatter plots with the Melville's Marginalia Online attributed and unattributed labels applied. Extracting features shows patterns and trends within the marks that could be used to group marks. In particular, Stroke Curvature became a dominant feature that showed promise for differentiating marks created by different authors. Feature extraction has the potential to be used with high confidence in separating marks by author.
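The cluster-validity step named in this abstract can be illustrated with a minimal, from-scratch sketch of the Silhouette coefficient (one of the two indices mentioned; Davies-Bouldin works analogously from within- and between-cluster scatter). The feature vectors and labels below are invented toy data, not the marginalia features from the study.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.
    For each point: a = mean distance to its own cluster,
    b = mean distance to the nearest other cluster,
    s = (b - a) / max(a, b); values near 1 indicate tight, well-separated clusters."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same)
        other = {}  # distances grouped by foreign cluster label
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(math.dist(p, q))
        b = min(sum(d) / len(d) for d in other.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters in a 2D feature space.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])  # matches the true grouping
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])   # scrambled labels
print(f"good labeling: {good:.3f}, scrambled labeling: {bad:.3f}")
```

Comparing the score across candidate numbers of clusters (and against scrambled labelings) is what turns such an index into a confidence value for cluster count and membership.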