Software Plagiarism Detection Using N-grams
Plagiarism is an act of copying in which one does not properly credit the original source. The motivations behind plagiarism can vary from completing academic courses to gaining economic advantage. Plagiarism exists in various domains where people seek credit for work that is not their own. These areas include, for example, literature, art and software, all of which carry a notion of authorship.

In this thesis we conduct a systematic literature review on source code plagiarism detection methods, then, based on the literature, propose a new detection approach that combines similarity detection with authorship identification, introduce our tokenization method for source code, and lastly evaluate the model on real-life datasets. The goal of our model is to point out possible plagiarism in a collection of documents, which in this thesis means a collection of source code files written by various authors. The data used by our statistical methods consists of three datasets: (1) a collection of documents from the University of Helsinki's first programming course, (2) a collection of documents from the University of Helsinki's advanced programming course and (3) submissions to a source code re-use competition. The statistical methods in this thesis are inspired by the theory of search engines, drawing on data mining when detecting similarity between documents and on machine learning when classifying a document with its most likely author in authorship identification.

Results show that our similarity detection model can be used successfully to retrieve documents for further plagiarism inspection, but false positives appear quickly even with a high threshold controlling the minimum allowed level of similarity between documents. We were unable to use the results of authorship identification in our study, as the accuracy of our machine learning model was not high enough to be used sensibly. This was possibly caused by the high similarity between documents, which stems from the restricted tasks and a course setting that teaches a specific programming style over the span of the course.
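The similarity-detection side of such a model can be sketched with token n-grams and set overlap. This is a minimal illustration only, not the thesis's actual tokenizer or scoring; the function names, the use of Jaccard similarity and the 0.8 threshold are assumptions.

```python
def ngrams(tokens, n=3):
    """Return the set of n-grams over a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_similarity(tokens_a, tokens_b, n=3):
    """Jaccard similarity of the two documents' n-gram sets."""
    a, b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_pairs(documents, threshold=0.8, n=3):
    """Yield pairs of document names whose similarity meets the threshold."""
    names = sorted(documents)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            sim = jaccard_similarity(documents[x], documents[y], n)
            if sim >= threshold:
                yield x, y, sim
```

Candidate pairs flagged this way would still go to a human for the final plagiarism judgment, matching the retrieval-then-inspection workflow described above.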
Multimodal Interactive Transcription of Handwritten Text Images
This thesis presents a new interactive and multimodal framework for the transcription of handwritten documents. Rather than producing the complete transcription automatically, this approach aims to assist the expert in the hard task of transcribing.

To date, the available handwritten text recognition systems do not produce transcriptions acceptable to users, and human intervention is generally required to correct the obtained transcriptions. These systems have proven genuinely useful in restricted applications with limited vocabularies (such as the recognition of postal addresses or of numerical amounts on bank cheques), achieving acceptable results in tasks of this kind. However, when working with unconstrained handwritten documents (such as historical manuscripts or spontaneous text), current technology achieves only unacceptable results.

The interactive scenario studied in this thesis allows a more effective solution. In this scenario, the recognition system and the user cooperate to generate the final transcription of the text image. The system uses the text image and a previously validated part of the transcription (the prefix) to propose a possible continuation. The user then finds and corrects the next error produced by the system, thereby generating a new, longer prefix. This new prefix is used by the system to suggest a new hypothesis. The underlying technology is based on hidden Markov models and n-grams, used here in the same way as in automatic speech recognition. Some modifications to the conventional definition of n-grams have been necessary to take user feedback into account in this system.

Romero Gómez, V. (2010). Multimodal Interactive Transcription of Handwritten Text Images [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8541
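The prefix-conditioned suggestion step can be illustrated with a plain bigram language model: given the user-validated prefix, the system greedily proposes the most likely next words. This is a toy sketch under assumed names and data; the thesis combines such n-grams with HMM-based optical models and modifies the n-gram definition for the interactive setting.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count word-to-next-word transitions over tokenized sentences."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            counts[prev][word] += 1
    return counts

def suggest_continuation(counts, prefix, max_words=3):
    """Greedily extend the validated prefix with the most likely next words."""
    out = list(prefix)
    for _ in range(max_words):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out.append(nxt.most_common(1)[0][0])
    return out
```

In the interactive loop, each user correction lengthens the prefix, and the model is re-queried from the corrected word onward.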
Automatic Correction of Arabic Dyslexic Text
This paper proposes an automatic correction system that detects and corrects dyslexic errors in Arabic text. The system uses a language model based on the Prediction by Partial Matching (PPM) text compression scheme that generates possible alternatives for each misspelled word. Furthermore, the generated candidate list is based on edit operations (insertion, deletion, substitution and transposition), and the correct alternative for each misspelled word is chosen on the basis of the compression codelength of the trigram. The system was compared with widely-used Arabic word-processing software and the Farasa tool. It provided good results compared with the other tools, with a recall of 43%, a precision of 89%, an F1 score of 58% and an accuracy of 81%.
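The candidate-generation step described above, all strings one edit operation away from a misspelled word, can be sketched as follows. The Latin alphabet here is purely illustrative (the paper works on Arabic text), and the PPM codelength ranking that selects among candidates is not reproduced.

```python
def edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one insertion, deletion, substitution or
    transposition away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return deletes | transposes | substitutes | inserts
```

Each candidate would then be scored by how few bits the trigram context costs under the trained PPM model, with the cheapest encoding chosen as the correction.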
Document analysis at DFKI. - Part 1: Image analysis and text recognition
Document analysis is responsible for essential progress in office automation. This paper is part of an overview of the combined research efforts in document analysis at the DFKI. Common to all document analysis projects is the global goal of providing a high-level electronic representation of documents in terms of iconic, structural, textual, and semantic information. These symbolic document descriptions enable "intelligent" access to a document database. Currently there are three ongoing document analysis projects at DFKI: INCA, OMEGA, and PASCAL2000/PASCAL+. Though the projects pursue different goals in different application domains, they all share the same problems, which have to be resolved with similar techniques. For that reason the activities in these projects are bundled to avoid redundant work.

At DFKI we have divided the problem of document analysis into two main tasks, text recognition and text analysis, which themselves are divided into a set of subtasks. In a series of three research reports the work of the document analysis and office automation department at DFKI is presented. The first report discusses the problem of text recognition, the second that of text analysis. In a third report we describe our concept for a specialized document analysis knowledge representation language. The report in hand describes the activities dealing with the text recognition task. Text recognition covers the phase starting with capturing a document image up to identifying the written words. This comprises the following subtasks: preprocessing the pictorial information; segmenting into blocks, lines, words, and characters; classifying characters; and identifying the input words. For each subtask several competing solution algorithms, called specialists or knowledge sources, may exist. To efficiently control and organize these specialists an intelligent situation-based planning component is necessary, which is also described in this report. It should be mentioned that the planning component is also responsible for controlling the overall document analysis system, not only the text recognition phase.
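The idea of competing "specialists" chosen by a situation-based planner can be sketched very simply: each specialist declares the situations it handles, and the planner picks the best match for the current document state. The names, the precondition representation and the scoring rule are illustrative assumptions, not DFKI's actual planner.

```python
def plan(specialists, situation):
    """Pick the specialist whose declared preconditions best match
    the current document-analysis situation."""
    def score(spec):
        return sum(1 for k, v in spec["preconditions"].items()
                   if situation.get(k) == v)
    return max(specialists, key=score)
```

A real planner would also sequence specialists across subtasks and re-plan when one fails, which this one-shot selection omits.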
Handwritten Text Line Detection and Classification based on HMMs
In this paper we present an approach for text line analysis and detection in handwritten documents based on Hidden Markov Models, a technique widely used in other handwritten text and speech recognition tasks. It is shown that text line analysis and detection can be solved with a more formal methodology, in contrast to most of the heuristic approaches found in the literature. Our approach not only provides the best position coordinates for each of the vertical page regions but also labels them, in this manner surpassing the traditional heuristic methods. In our experiments we demonstrate the performance of the approach (both in line analysis and detection) and study the impact of increasingly constrained "vertical layout language models" and morphological models on text line detection and classification accuracy. Through this experimentation we also show the improvement in quality of the baselines yielded by our approach in comparison with a state-of-the-art heuristic method based on vertical projection profiles.

Bosch Campos, V. (2012). Handwritten Text Line Detection and Classification based on HMMs. http://hdl.handle.net/10251/17964
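The core idea, labeling vertical page regions with an HMM instead of heuristic peak-picking, can be illustrated with a two-state toy model decoded by Viterbi over per-row ink densities. All parameters and the crude emission model are assumptions for illustration; the paper's system uses much richer morphological and vertical layout language models.

```python
import math

STATES = ("gap", "line")
# Illustrative parameters; assumptions, not values from the paper.
START = {"gap": 0.5, "line": 0.5}
TRANS = {"gap": {"gap": 0.8, "line": 0.2},
         "line": {"gap": 0.2, "line": 0.8}}

def emission(state, density):
    """Crude P(row ink density | state): dark rows favor "line"."""
    return density if state == "line" else 1.0 - density

def viterbi(densities):
    """Most likely state label for each horizontal row of the page."""
    prob = {s: math.log(START[s]) + math.log(emission(s, densities[0]) + 1e-9)
            for s in STATES}
    path = {s: [s] for s in STATES}
    for d in densities[1:]:
        new_prob, new_path = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prob[p] + math.log(TRANS[p][s]))
            new_prob[s] = (prob[best] + math.log(TRANS[best][s])
                           + math.log(emission(s, d) + 1e-9))
            new_path[s] = path[best] + [s]
        prob, path = new_prob, new_path
    return path[max(STATES, key=lambda s: prob[s])]
```

The self-transition probabilities act as a smoothness prior, which is what lets the HMM outperform per-row thresholding on noisy projection profiles.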
Parametric classification in domains of characters, numerals, punctuation, typefaces and image qualities
This thesis contributes to the Optical Font Recognition (OFR) problem by developing a classifier system to differentiate ten typefaces using a single English character, 'e'. First, the features used in the classifier system are carefully selected after a thorough typographical study of global font features and previous related experiments. These features are modeled by multivariate normal laws in order to use parameter estimation in learning. Then, the classifier system is built on six independent schemes, each performing typeface classification with a different method. The results have shown remarkable performance in the field of font recognition. Finally, the classifiers have been applied to lowercase characters, uppercase characters, digits, punctuation, and also to degraded images.
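Parametric classification with class-conditional normal densities, the modeling choice named above, can be sketched as follows. The features, class labels and diagonal-covariance simplification are illustrative assumptions; the thesis fits full multivariate normal laws per typeface.

```python
import math
from collections import defaultdict

def fit(samples):
    """samples: list of (label, feature_vector).
    Estimate a per-class diagonal Gaussian (mean, variance)."""
    by_class = defaultdict(list)
    for label, x in samples:
        by_class[label].append(x)
    params = {}
    for label, xs in by_class.items():
        d = len(xs[0])
        mean = [sum(x[i] for x in xs) / len(xs) for i in range(d)]
        var = [sum((x[i] - mean[i]) ** 2 for x in xs) / len(xs) + 1e-6
               for i in range(d)]
        params[label] = (mean, var)
    return params

def classify(params, x):
    """Maximum-likelihood class under the fitted normal models."""
    def log_lik(mean, var):
        return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                   for xi, m, v in zip(x, mean, var))
    return max(params, key=lambda c: log_lik(*params[c]))
```

The estimated parameters stand in for the learning step, and the per-class log-likelihood comparison stands in for one of the six classification schemes.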
Data-driven Job Search Engine Using Skills and Company Attribute Filters
According to a report online, more than 200 million unique users search for
jobs online every month. This incredibly large and fast growing demand has
enticed software giants such as Google and Facebook to enter this space, which
was previously dominated by companies such as LinkedIn, Indeed and
CareerBuilder. Recently, Google released their "AI-powered Jobs Search Engine",
"Google For Jobs" while Facebook released "Facebook Jobs" within their
platform. These current job search engines and platforms allow users to search
for jobs based on generic filters such as job title, date posted,
experience level, company and salary. However, they have severely limited
filters relating to skill sets such as C++, Python, and Java and company
related attributes such as employee size, revenue, technographics and
micro-industries. These specialized filters can help applicants and companies
connect at a very personalized, relevant and deeper level. In this paper we
present a framework that provides an end-to-end "Data-driven Jobs Search
Engine". In addition, users can also receive potential contacts of recruiters
and senior positions for connection and networking opportunities. The high
level implementation of the framework is described as follows: 1) Collect job
postings data in the United States, 2) Extract meaningful tokens from the
postings data using ETL pipelines, 3) Normalize the data set to link company
names to their specific company websites, 4) Extract and rank the skill
sets, 5) Link the company names and websites to their respective company level
attributes with the EVERSTRING Company API, 6) Run user-specific search queries
on the database to identify relevant job postings and 7) Rank the job search
results. This framework offers a highly customizable and highly targeted search
experience for end users.

Comment: 8 pages, 10 figures, ICDM 201
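Steps 2 and 6 of the pipeline above, extracting meaningful tokens from posting text and running a skill-filtered query, can be sketched as follows. The skill vocabulary and posting records are illustrative assumptions; the actual pipeline also normalizes company names and joins company attributes via the EVERSTRING Company API.

```python
import re

SKILLS = {"c++", "python", "java", "sql"}  # assumed skill vocabulary

def extract_skills(posting_text):
    """Tokenize posting text and keep tokens found in the skill set."""
    tokens = re.findall(r"[a-zA-Z+#]+", posting_text.lower())
    return {t for t in tokens if t in SKILLS}

def search(postings, required_skills):
    """Return postings containing every required skill,
    richest skill match first."""
    hits = [(p, extract_skills(p["text"])) for p in postings]
    hits = [(p, s) for p, s in hits if required_skills <= s]
    return [p for p, s in sorted(hits, key=lambda h: -len(h[1]))]
```

A production version would index the extracted skills rather than rescanning text per query, but the filter-then-rank shape matches the framework's steps 6 and 7.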