27,526 research outputs found

    Authorship Identification in Bengali Literature: a Comparative Analysis

    Full text link
    Stylometry is the study of the unique linguistic styles and writing behaviors of individuals. It belongs to the core task of text categorization like authorship identification, plagiarism detection etc. Though reasonable number of studies have been conducted in English language, no major work has been done so far in Bengali. In this work, We will present a demonstration of authorship identification of the documents written in Bengali. We adopt a set of fine-grained stylistic features for the analysis of the text and use them to develop two different models: statistical similarity model consisting of three measures and their combination, and machine learning model with Decision Tree, Neural Network and SVM. Experimental results show that SVM outperforms other state-of-the-art methods after 10-fold cross validations. We also validate the relative importance of each stylistic feature to show that some of them remain consistently significant in every model used in this experiment.Comment: 9 pages, 5 tables, 4 picture

    Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles

    Get PDF
    Purpose. This paper introduces the Research Articles in Simplified HTML (or RASH), which is a Web-first format for writing HTML-based scholarly papers; it is accompanied by the RASH Framework, a set of tools for interacting with RASH-based articles. The paper also presents an evaluation that involved authors and reviewers of RASH articles submitted to the SAVE-SD 2015 and SAVE-SD 2016 workshops. Design. RASH has been developed aiming to: be easy to learn and use; share scholarly documents (and embedded semantic annotations) through the Web; support its adoption within the existing publishing workflow. Findings. The evaluation study confirmed that RASH is ready to be adopted in workshops, conferences, and journals and can be quickly learnt by researchers who are familiar with HTML. Research Limitations. The evaluation study also highlighted some issues in the adoption of RASH, and in general of HTML formats, especially by less technically savvy users. Moreover, additional tools are needed, e.g., for enabling additional conversions from/to existing formats such as OpenXML. Practical Implications. RASH (and its Framework) is another step towards enabling the definition of formal representations of the meaning of the content of an article, facilitating its automatic discovery, enabling its linking to semantically related articles, providing access to data within the article in actionable form, and allowing integration of data between papers. Social Implications. RASH addresses the intrinsic needs related to the various users of a scholarly article: researchers (focussing on its content), readers (experiencing new ways for browsing it), citizen scientists (reusing available data formally defined within it through semantic annotations), publishers (using the advantages of new technologies as envisioned by the Semantic Publishing movement). Value. RASH helps authors to focus on the organisation of their texts, supports them in the task of semantically enriching the content of articles, and leaves all the issues about validation, visualisation, conversion, and semantic data extraction to the various tools developed within its Framework

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    The Construction and Formation of Chinese Character

    Get PDF
    Chinese characters are very unique. As one of the most widely used scripts in the world, its complex forms and large volume of characters make it always daunting for beginners, which is related to the long history of Chinese characters. This paper is divided into five parts, based on a definition of the nature of ancient Chinese characters. In the first part, it is argued that Chinese characters are made up of smaller elements. A single Chinese character, as a morpheme, is the smallest ideographic unit of the Chinese language, but a single character may also be formed by combining smaller component affixes. The second part mainly introduces 11 basic patterns by which Chinese characters can be made up of single or multiple elements. The third part lists different types of the formation of Chinese characters. The fourth part introduces the variants of one character that exist in the same time period. Finally, the evolution of a Chinese character over time is discussed

    The Construction And Formation Of Chinese Character

    Get PDF
    Chinese characters are very unique. As one of the most widely used scripts in the world, its complex forms and large volume of characters make it always daunting for beginners, which is related to the long history of Chinese characters. This paper is divided into five parts, based on a definition of the nature of ancient Chinese characters. In the first part, it is argued that Chinese characters are made up of smaller elements. A single Chinese character, as a morpheme, is the smallest ideographic unit of the Chinese language, but a single character may also be formed by combining smaller component affixes. The second part mainly introduces 11 basic patterns by which Chinese characters can be made up of single or multiple elements. The third part lists different types of the formation of Chinese characters. The fourth part introduces the variants of one character that exist in the same time period. Finally, the evolution of a Chinese character over time is discussed

    Deep Adaptive Learning for Writer Identification based on Single Handwritten Word Images

    Get PDF
    There are two types of information in each handwritten word image: explicit information which can be easily read or derived directly, such as lexical content or word length, and implicit attributes such as the author's identity. Whether features learned by a neural network for one task can be used for another task remains an open question. In this paper, we present a deep adaptive learning method for writer identification based on single-word images using multi-task learning. An auxiliary task is added to the training process to enforce the emergence of reusable features. Our proposed method transfers the benefits of the learned features of a convolutional neural network from an auxiliary task such as explicit content recognition to the main task of writer identification in a single procedure. Specifically, we propose a new adaptive convolutional layer to exploit the learned deep features. A multi-task neural network with one or several adaptive convolutional layers is trained end-to-end, to exploit robust generic features for a specific main task, i.e., writer identification. Three auxiliary tasks, corresponding to three explicit attributes of handwritten word images (lexical content, word length and character attributes), are evaluated. Experimental results on two benchmark datasets show that the proposed deep adaptive learning method can improve the performance of writer identification based on single-word images, compared to non-adaptive and simple linear-adaptive approaches.Comment: Under view of Pattern Recognitio
    • …
    corecore