267 research outputs found
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen
A method of determining culture related users using computation of correlation
The provision of security on most of computer networks is based on the obtaining and exchange of a unique key between the communicating parties. It is, however, difficult to come up with a truly unique and random secret between two parties with the help from physical randomness. In this work, we focus on the problem of unique random number generation or derivation between users in online social networks. As a result of rapid development of Internet, online social networks provide a vast set of different user comments on different products and services. Such comments can inherently reflect the mindset or cultural background of those people who wrote them and it is possible to derive some unique randomness from such texts with some maneuver. We select movie reviews as the sandbox for our investigation. To manage textual content and search for certain hidden relations, the methodologies of text matching are studied. By looking the similarities of movie reviews from different users, we can refer insights into the cultural background and even predict future preferences from past comments. We present all of our findings here to aspire further investigation. We have investigated the correlation of movie reviews and studied the values of different weight assignments to the sentence and word relation. According to our results, synonym relations are the dominant positive association that impacts correlation value. We calculate correlation between review sets containing multiple reviews to avoid randomness. These correlations have then been used to evaluate and derive a unique random number. We target at a single review, and put it together with other reviews to obtain correlation values from different pairs of reviewers. Then the correlation value is binning to a 1-bit binary number. Through such a simplified extraction, a unique random number can be generated by repeating the process of binning. Such unique random number is able to facilitate to secure information exchanges between the users. In our future work, we will explore such correlations to generate a practically usable unique secret for secret keys
Plagiarism detection for Indonesian texts
As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need of using automatic plagiarism checker is becoming more real. However, researches on Plagiarism Detection Systems (PDS) in Indonesian documents have not been well developed, since most of them deal with detecting duplicate or near-duplicate documents, have not addressed the problem of retrieving source documents, or show tendency to measure document similarity globally. Therefore, systems resulted from these researches are incapable of referring to exact locations of ``similar passage'' pairs. Besides, there has been no public and standard corpora available to evaluate PDS in Indonesian texts.
To address the weaknesses of former researches, this thesis develops a plagiarism detection system which executes various methods of plagiarism detection stages in a workflow system. In retrieval stage, a novel document feature coined as phraseword is introduced and executed along with word unigram and character n-grams to address the problem of retrieving source documents, whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed to address the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers, and by algorithmic random generation.
Using this corpus, the performance evaluation of the proposed methods was performed in three scenarios. On the first scenario which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimum recall rate 1. On the second scenario which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated in 4 levels of measures: character, passage, document, and cases. The experiment results showed that methods resulted from using token as seeds have higher scores than Alvi's algorithm in all 4 levels of measures both in artificial and simulated plagiarism cases. In case detection, our systems outperform Alvi's algorithm in recognizing copied, shaked, and paraphrased passages. However, Alvi's recognition rate on summarized passage is insignificantly higher than our system. The same tendency of experiment results were demonstrated on the third experiment scenario, only the precision rates of Alvi's algorithm in character and paragraph levels are higher than our system. The higher Plagdet scores produced by some methods in our system than Alvi's scores show that this study has fulfilled its objective in implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts.
Being run at our test document corpus, Alvi's highest scores of recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on PAN'14 corpus. Thus, this study has contributed in creating a standard evaluation corpus for assessing PDS for Indonesian documents. Besides, this study contributes in a source retrieval algorithm which introduces phrasewords as document features, and a paragraph-based text alignment algorithm which relies on two different strategies. One of them is to apply local-word weighting used in text summarization field to select seeds for both discriminating paragraph pair candidates and matching process. The proposed detection algorithm results in almost no multiple detection. This contributes to the strength of this algorithm
Recommended from our members
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied on a collection of millions of documents to improve access to public knowledge
Intelligent Plagiarism Detection for Electronic Documents
Plagiarism detection is the process of finding similarities on electronic based documents. Recently, this process is highly required because of the large number of available documents on the internet and the ability to copy and paste the text of relevant documents with simply Control+C and Control+V commands.
The proposed solution is to investigate and develop an easy, fast, and multi-language support plagiarism detector with the easy of one click to detect the document plagiarism. This process will be done with the support of intelligent system that can learn, change and adapt to the input document and make a cross-fast search for the content on the local repository and the online repository and link the content of the file with the matching content everywhere found.
Furthermore, the supported document type that we will use is word, text and in some cases, the pdf files –where is the text can be extracting from them- and this made possible by using the DLL file from Word application that Microsoft provided on OS. The using of DLL will let us to not constrain on how to get the text from files; and will help us to apply the file on our Delphi project and walk throw our methodology and read the file word by word to grantee the best working scenarios for the calculation.
In the result, this process will help in the uprising the documents quality and enhance the writer experience related to his work and will save the copyrights for the official writer of the documents by providing a new alternative tool for plagiarism detection problem for easy and fast use to the concerned Institutions for free
Design of a Controlled Language for Critical Infrastructures Protection
We describe a project for the construction of controlled language for critical infrastructures protection (CIP). This project originates
from the need to coordinate and categorize the communications on CIP at the European level. These communications can be physically
represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of
traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an
analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus.JRC.G.6-Security technology assessmen
COUNTER - COrpus of Urdu News TExt Reuse
Text reuse is the act of borrowing text from existing documents to create new texts. Freely available and easily accessible large online repositories are not only making reuse of text more common in society but also harder to detect. A major hindrance in the development and evaluation of existing/new mono-lingual text reuse detection methods, especially for South Asian languages, is the unavailability of standardized benchmark corpora. Amongst other things, a gold standard corpus enables researchers to directly compare existing state-of-the-art methods. In our study, we address this gap by developing a benchmark corpus for one of the widely spoken but under resourced languages i.e. Urdu. The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 1,200 documents with real examples of text reuse from the field of journalism. It has been manually annotated at document level with three levels of reuse: wholly derived, partially derived and non derived. We also apply a number of similarity estimation methods on our corpus to show how it can be used for the development, evaluation and comparison of text reuse detection systems for the Urdu language. The corpus is a vital resource for the development and evaluation of text reuse detection systems in general and specifically for Urdu language
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism
Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci
- …