37 research outputs found

    Integrating State-of-the-art NLP Tools into Existing Methods to Address Current Challenges in Plagiarism Detection

    Get PDF
    Paraphrase plagiarism occurs when text is deliberately obfuscated to evade detection, deliberate alteration increases the complexity of plagiarism and the difficulty in detecting paraphrase plagiarism. In paraphrase plagiarism, copied texts often contain little or no matching words, and conventional plagiarism detectors, most of which are designed to detect matching stings are ineffective under such condition. The problem of plagiarism detection has been widely researched in recent years with significant progress made particularly in the platform of Pan@Clef competition on plagiarism detection. However further research is required specifically in the area of paraphrase and translation (obfuscation) plagiarism detection as studies show that the state-of-the-art is unsatisfactory. A rational solution to the problem is to apply models that detect plagiarism using semantic features in texts, rather than matching strings. Deep contextualised learning models (DCLMs) have the ability to learn deep textual features that can be used to compare text for semantic similarity. They have been remarkably effective in many natural language processing (NLP) tasks, but have not yet been tested in paraphrase plagiarism detection. The second problem facing conventional plagiarism detection is translation plagiarism, which occurs when copied text is translated to a different language and sometimes paraphrased and used without acknowledging the original sources. The most common method used for detecting cross-lingual plagiarism (CLP) require internet translation services, which is limiting to the detection process in many ways. A rational solution to the problem is to use detection models that do not utilise internet translation services. In this thesis we addressed these ongoing challenges facing conventional plagiarism detection by applying some of the most advanced methods in NLP, which includes contextualised and non-contextualised deep learning models. To address the problem of paraphrased plagiarism, we proposed a novel paraphrase plagiarism detector that integrates deep contextualised learning (DCL) into a generic plagiarism detection framework. Evaluation results revealed that our proposed paraphrase detector outperformed a state-of-art model, and a number of standard baselines in the task of paraphrase plagiarism detection. With respect to CLP detection, we propose a novel multilingual translation model (MTM) based on the Word2Vec (word embedding) model that can effectively translate text across a number of languages, it is independent of the internet and performs comparably, and in many cases better than a common cross-lingual plagiarism detection model that rely on online machine translator. The MTM does not require parallel or comparable corpora, it is therefore designed to resolve the problem of CLPD in low resource languages. The solutions provided in this research advance the state-of-the-art and contribute to the existing body of knowledge in plagiarism detection, and would also have a positive impact on academic integrity that has been under threat for a while by plagiarism

    Natural Language Processing for Technology Foresight Summarization and Simplification: the case of patents

    Get PDF
    Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts.Technology foresight aims to anticipate possible developments, understand trends, and identify technologies of high impact. To this end, monitoring emerging technologies is crucial. Patents -- the legal documents that protect novel inventions -- can be a valuable source for technology monitoring. Millions of patent applications are filed yearly, with 3.4 million applications in 2021 only. Patent documents are primarily textual documents and disclose innovative and potentially valuable inventions. However, their processing is currently underresearched. This is due to several reasons, including the high document complexity: patents are very lengthy and are written in an extremely hard-to-read language, which is a mix of technical and legal jargon. This thesis explores how Natural Language Processing -- the discipline that enables machines to process human language automatically -- can aid patent processing. Specifically, we focus on two tasks: patent summarization (i.e., we try to reduce the document length while preserving its core content) and patent simplification (i.e., we try to reduce the document's linguistic complexity while preserving its original core meaning). We found that older patent summarization approaches were not compared on shared benchmarks (making thus it hard to draw conclusions), and even the most recent abstractive dataset presents important issues that might make comparisons meaningless. We try to fill both gaps: we first document the issues related to the BigPatent dataset and then benchmark extractive, abstraction, and hybrid approaches in the patent domain. We also explore transferring summarization methods from the scientific paper domain with limited success. For the automatic text simplification task, we noticed a lack of simplified text and parallel corpora. We fill this gap by defining a method to generate a silver standard for patent simplification automatically. Lay human judges evaluated the simplified sentences in the corpus as grammatical, adequate, and simpler, and we show that it can be used to train a state-of-the-art simplification model. This thesis describes the first steps toward Natural Language Processing-aided patent summarization and simplification. We hope it will encourage more research on the topic, opening doors for a productive dialog between NLP researchers and domain experts

    Sixth Goddard Conference on Mass Storage Systems and Technologies Held in Cooperation with the Fifteenth IEEE Symposium on Mass Storage Systems

    Get PDF
    This document contains copies of those technical papers received in time for publication prior to the Sixth Goddard Conference on Mass Storage Systems and Technologies which is being held in cooperation with the Fifteenth IEEE Symposium on Mass Storage Systems at the University of Maryland-University College Inn and Conference Center March 23-26, 1998. As one of an ongoing series, this Conference continues to provide a forum for discussion of issues relevant to the management of large volumes of data. The Conference encourages all interested organizations to discuss long term mass storage requirements and experiences in fielding solutions. Emphasis is on current and future practical solutions addressing issues in data management, storage systems and media, data acquisition, long term retention of data, and data distribution. This year's discussion topics include architecture, tape optimization, new technology, performance, standards, site reports, vendor solutions. Tutorials will be available on shared file systems, file system backups, data mining, and the dynamics of obsolescence

    Essentials of Business Analytics

    Get PDF

    Space station data system analysis/architecture study. Task 2: Options development DR-5. Volume 1: Technology options

    Get PDF
    The second task in the Space Station Data System (SSDS) Analysis/Architecture Study is the development of an information base that will support the conduct of trade studies and provide sufficient data to make key design/programmatic decisions. This volume identifies the preferred options in the technology category and characterizes these options with respect to performance attributes, constraints, cost, and risk. The technology category includes advanced materials, processes, and techniques that can be used to enhance the implementation of SSDS design structures. The specific areas discussed are mass storage, including space and round on-line storage and off-line storage; man/machine interface; data processing hardware, including flight computers and advanced/fault tolerant computer architectures; and software, including data compression algorithms, on-board high level languages, and software tools. Also discussed are artificial intelligence applications and hard-wire communications

    XVIII. Magyar Számítógépes Nyelvészeti Konferencia

    Get PDF

    An investigation into computer and network curricula

    Get PDF
    This thesis consists of a series of internationally published, peer reviewed, journal and conference research papers that analyse the educational and training needs of undergraduate Information Technology (IT) students within the area of Computer and Network Technology (CNT) Education. Research by Maj et al has found that accredited computing science curricula can fail to meet the expectations of employers in the field of CNT: “It was found that none of these students could perform first line maintenance on a Personal Computer (PC) to a professional standard with due regard to safety, both to themselves and the equipment. Neither could they install communication cards, cables and network operating system or manage a population of networked PCs to an acceptable commercial standard without further extensive training. It is noteworthy that none of the students interviewed had ever opened a PC. It is significant that all those interviewed for this study had successfully completed all the units on computer architecture and communication engineering (Maj, Robbins, Shaw, & Duley, 1998). The students\u27 curricula at that time lacked units in which they gained hands-on experience in modern PC hardware or networking skills. This was despite the fact that their computing science course was level one accredited, the highest accreditation level offered by the Australian Computer Society (ACS). The results of the initial survey in Western Australia led to the introduction of two new units within the Computing Science Degree at Edith Cowan University (ECU), Computer Installation & Maintenance (CIM) and Network Installation & Maintenance (NIM) (Maj, Fetherston, Charlesworth, & Robbins, 1998). Uniquely within an Australian university context these new syllabi require students to work on real equipment. Such experience excludes digital circuit investigation, which is still a recommended approach by the Association for Computing Machinery (ACM) for computer architecture units (ACM, 2001, p.97). Instead, the CIM unit employs a top-down approach based initially upon students\u27 everyday experiences, which is more in accordance with constructivist educational theory and practice. These papers propose an alternate model of IT education that helps to accommodate the educational and vocational needs of IT students in the context of continual rapid changes and developments in technology. The ACM have recognised the need for variation noting that: There are many effective ways to organize a curriculum even for a particular set of goals and objectives (Tucker et al., 1991, p.70). A possible major contribution to new knowledge of these papers relates to how high level abstract bandwidth (B-Node) models may contribute to the understanding of why and how computer and networking technology systems have developed over time. Because these models are de-coupled from the underlying technology, which is subject to rapid change, these models may help to future-proof student knowledge and understanding of the ongoing and future development of computer and networking systems. The de-coupling is achieved through abstraction based upon bandwidth or throughput rather than the specific implementation of the underlying technologies. One of the underlying problems is that computing systems tend to change faster than the ability of most educational institutions to respond. Abstraction and the use of B-Node models could help educational models to more quickly respond to changes in the field, and can also help to introduce an element of future-proofing in the education of IT students. The importance of abstraction has been noted by the ACM who state that: Levels of Abstraction: the nature and use of abstraction in computing; the use of abstraction in managing complexity, structuring systems, hiding details, and capturing recurring patterns; the ability to represent an entity or system by abstractions having different levels of detail and specificity (ACM, 1991b). Bloom et al note the importance of abstraction, listing under a heading of: “Knowledge of the universals and abstractions in a field” the objective: Knowledge of the major schemes and patterns by which phenomena and ideas arc organized. These are large structures, theories, and generalizations which dominate a subject or field or problems. These are the highest levels of abstraction and complexity\u27\u27 (Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956, p. 203). Abstractions can be applied to computer and networking technology to help provide students with common fundamental concepts regardless of the particular underlying technological implementation to help avoid the rapid redundancy of a detailed knowledge of modem computer and networking technology implementation and hands-on skills acquisition. Again the ACM note that: “Enduring computing concepts include ideas that transcend any specific vendor, package or skill set... While skills are fleeting, fundamental concepts are enduring and provide long lasting benefits to students, critically important in a rapidly changing discipline (ACM, 2001, p.70) These abstractions can also be reinforced by experiential learning to commercial practices. In this context, the other possibly major contribution of new knowledge provided by this thesis is an efficient, scalable and flexible model for assessing hands-on skills and understanding of IT students. This is a form of Competency-Based Assessment (CBA), which has been successfully tested as part of this research and subsequently implemented at ECU. This is the first time within this field that this specific type of research has been undertaken within the university sector within Australia. Hands-on experience and understanding can become outdated hence the need for future proofing provided via B-Nodes models. The three major research questions of this study are: •Is it possible to develop a new, high level abstraction model for use in CNT education? •Is it possible to have CNT curricula that are more directly relevant to both student and employer expectations without suffering from rapid obsolescence? •Can WI effective, efficient and meaningful assessment be undertaken to test students\u27 hands-on skills and understandings? The ACM Special Interest Group on Data Communication (SJGCOMM) workshop report on Computer Networking, Curriculum Designs and Educational Challenges, note a list of teaching approaches: ... the more \u27hands-on\u27 laboratory approach versus the more traditional in-class lecture-based approach; the bottom-up approach towards subject matter verus the top-down approach (Kurose, Leibeherr, Ostermann, & Ott-Boisseau, 2002, para 1). Bandwidth considerations are approached from the PC hardware level and at each of the seven layers of the International Standards Organisation (ISO) Open Systems Interconnection (OSI) reference model. It is believed that this research is of significance to computing education. However, further research is needed
    corecore