6 research outputs found

    Research and Development of Feature Extraction from Myanmar Palm Leaf Manuscripts for the Myanmar Character Recognition System

    This paper proposes a handwriting OCR system for Myanmar palm-leaf manuscripts. Each text area in a manuscript is segmented, and the segmented character images must then be recognized as Myanmar handwritten characters, which carry precious and invaluable historical information about Myanmar. The paper covers two essential steps: preprocessing and feature extraction. Preprocessing automatically extracts the relevant palm-leaf manuscript region from images taken by a camera and supplies enhanced images to the subsequent stages of Myanmar character recognition. A one-dimensional segmentation approach crops the leaf area from the high-resolution image. Line count analysis is also performed to extract a region that contains enough lines. Line segmentation is then carried out using an Object Frequency Histogram along the horizontal lines, which finds the best split points between the lines. The same technique, applied vertically, isolates each character or smallest group of characters. In total, 18 features are extracted to recognize the Myanmar palm-leaf manuscript characters. Although the experimental results are good, some difficulties related to connected components still need to be addressed.
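
The histogram-based splitting described in this abstract can be illustrated with a small sketch. The abstract does not define the Object Frequency Histogram, so the code below assumes it is a simple count of foreground pixels per row (for lines) or per column (for characters), and it splits wherever the count drops to zero. The function names `row_profile`, `split_runs`, `segment_lines` and `segment_characters` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink (foreground), 0 = background


def row_profile(image: Image) -> List[int]:
    """Count foreground pixels in each row (assumed stand-in for the histogram)."""
    return [sum(row) for row in image]


def split_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of runs where profile > threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends at an empty row/column
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split the page into text lines using the row-wise profile."""
    return split_runs(row_profile(image))


def segment_characters(image: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Split one text line into characters using the column-wise profile."""
    top, bottom = line
    column_profile = [sum(col) for col in zip(*image[top:bottom])]
    return split_runs(column_profile)


# Toy 6x6 page with two text lines (made-up data for illustration only).
page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)
chars = [segment_characters(page, line) for line in lines]
print(lines)  # [(1, 3), (4, 5)]
print(chars)  # [[(0, 2), (4, 6)], [(1, 3)]]
```

A zero threshold only works when lines and characters are cleanly separated. Touching or overlapping components, the difficulty the abstract itself mentions, would need a higher threshold or a different splitting rule.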

    Text Line Segmentation of Historical Documents: a Survey

    There is a huge number of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest. Comment: 25 pages, submitted version. To appear in International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3

    Chinese calligraphy: character style recognition based on full-page document

    Calligraphy plays a very important role in the history of China, and its beauty has been passed down from ancient times to the present. Different styles and structures have made calligraphy an embodiment of beauty in the field of writing. However, the recognition of calligraphy styles and fonts has long been a gap in the computer field, and the structural complexity of different calligraphy styles poses many challenges for recognition technology. This research discusses the main recognition techniques and some popular machine learning algorithms used in this field over more than 20 years, with the aim of finding a new method of Chinese calligraphy style recognition and exploring its feasibility. We surveyed research papers from the past 20 years; most of the results concern content recognition of modern Chinese characters. We first analyze the development of Chinese characters and basic Chinese character theory. In reviewing current computer recognition of Chinese characters (including online and offline handwriting), we analyze the various algorithms and their results, how the experimental data are used, and how the test data sets were constructed. Research on image processing methods for Chinese calligraphy works is very limited, as are the data collections available for calligraphy tests, and the test data sets used by different recognition technologies differ widely. Nevertheless, such research has far-reaching significance for inheriting and carrying forward traditional Chinese culture, and it is necessary to develop and promote the recognition of Chinese characters by means of computer techniques.
In current applications, font recognition of Chinese calligraphy can help library administrators classify copybooks, so that calligraphy fonts no longer have to be identified manually from subjective experience alone, which is difficult. Over the past 10 years, some techniques for recognizing the fonts of single Chinese calligraphy characters have been proposed. Most consist of pre-processing the calligraphy characters, extracting stroke primitives, extracting style features, and a final machine learning classification that gives the classification probability of the calligraphy work. Such techniques are very demanding for complex Chinese characters: the splitting and recognition effort is large, and many complex characters are difficult to divide accurately. As a result, the recognition rate is low, or the recognition accuracy for a specific character is high but the overall font recognition accuracy is low. Chinese calligraphy therefore has clear research value. In the recognition field, many research papers analyzing Chinese calligraphy are based on the study of individual characters and strokes. We instead propose a new method for font recognition that is based on the whole page of the document. It is studied in three steps: first, we apply the Fourier transform to some Chinese calligraphy images and analyze the results; second, we apply a CNN to different data sets and obtain results; finally, we make some improvements to the CNN structure. The experimental results of the thesis show that the proposed full-page document recognition method achieves high accuracy with the support of CNN technology and can effectively distinguish five styles of Chinese calligraphy.
Compared with traditional analysis methods, our experimental results show that the method based on the full-page document is feasible and avoids the cumbersome font segmentation problem, making it more efficient and more accurate.

    Bayesian networks for behavior analysis applied to cellular telephony (Redes bayesianas para análise de comportamento aplicadas a telefonia celular)

    Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico. Programa de Pós-Graduação em Ciência da Computação. Automatic learning in Bayesian networks makes use of Bayes' theorem, which is of great importance for calculating probabilities. Probability theory involves belief propagation methods and methods for learning these networks, focusing mainly on learning through logical inference, since this can be understood as the basis for analyzing a set of available information and reaching an objective conclusion, expressed numerically.
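
As a minimal illustration of the Bayes' theorem calculation this abstract refers to, the sketch below computes a posterior probability for a binary hypothesis. The function name `posterior` and all numeric values are hypothetical, chosen only for the example; they do not come from the dissertation.

```python
def posterior(prior: float, likelihood: float, false_alarm: float) -> float:
    """Return P(H | E) for a binary hypothesis H given evidence E.

    prior       = P(H)
    likelihood  = P(E | H)
    false_alarm = P(E | not H)
    """
    # Total probability of the evidence over both cases of H.
    evidence = likelihood * prior + false_alarm * (1.0 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 1% of subscribers show some behavior H; an indicator E
# fires for 90% of them and for 5% of everyone else.
p = posterior(prior=0.01, likelihood=0.9, false_alarm=0.05)
print(round(p, 4))  # 0.1538
```

Even with a strong indicator, the low prior keeps the posterior near 15%, which is the kind of numerically expressed conclusion the abstract describes.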

    Language-independent text line segmentation in ancient handwritten documents (Segmentación de líneas de texto en documentos manuscritos antiguos independiente del lenguaje)

    To date, not all of the knowledge contained in ancient manuscripts has been used, because handwritten text recognition still lacks robust methods for this task. The first problem with handwritten text recognition methods is that they require the text to be divided into lines. Current methods for segmenting handwritten text lines have not been optimized to work with ancient manuscripts. The first stage of handwritten Text Line Segmentation (SLT, from the Spanish acronym) is Text Line Localization (LLT). For SLT, methods have been proposed that search for the local maxima in a histogram. The problem with these methods is that there are too many local maxima, which prevents the lines from being located. The second stage of SLT in ancient manuscripts is the search for a path that separates the text lines; the problem with current methods is that some perform a local search for the path, while others search for a path that avoids crossing as many characters as possible. This work presents a system composed of two new methods for handwritten LLT and another method for the Search for a Path to Segment Text Lines in handwritten documents (BRSLT), which outperforms the state-of-the-art methods analyzed in both stages. The first proposed method extracts an energy map that increases the differences between the local maxima and minima in a histogram. The second proposed method searches for the best path for segmenting lines of ancient handwritten text using a genetic algorithm. To evaluate the accuracy of the proposed methods, experiments were carried out with two document collections. The two proposed methods were evaluated independently.
The document collections include the following languages: Spanish, Chinese, Arabic, English, and Arabic-Spanish, in modern and ancient script. The experimental results demonstrate that LLT can be improved by implementing an energy map that increases the differences between local maxima and minima. The experiments in the second part demonstrate that a global optimization of the path is necessary to segment text lines.
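
The "too many local maxima" problem named in this abstract can be seen on a toy row histogram. The sketch below counts the local maxima of a made-up profile before and after a plain moving average. The moving average is only a simple stand-in chosen for illustration; it is not the energy map proposed in the thesis, whose construction the abstract does not give. The names `local_maxima` and `moving_average` are hypothetical.

```python
from typing import List, Sequence


def local_maxima(profile: Sequence[float]) -> List[int]:
    """Indices whose value rises from the left neighbor and does not rise to the right."""
    return [
        i
        for i in range(1, len(profile) - 1)
        if profile[i - 1] < profile[i] >= profile[i + 1]
    ]


def moving_average(profile: Sequence[float], window: int = 3) -> List[float]:
    """Smooth the profile with a centered window, shrinking the window at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(profile)):
        lo = max(0, i - half)
        hi = min(len(profile), i + half + 1)
        smoothed.append(sum(profile[lo:hi]) / (hi - lo))
    return smoothed


# Made-up ink counts per row for a page fragment holding two text lines.
profile = [0, 5, 3, 6, 4, 0, 0, 4, 7, 5, 6, 0]

print(local_maxima(profile))                  # [1, 3, 8, 10]
print(local_maxima(moving_average(profile)))  # [2, 9]
```

The raw profile yields four candidate line positions for two actual lines, while the smoothed profile yields two. Any preprocessing that widens the gap between true peaks and valleys, as the abstract says the energy map does, makes localization by local maxima more reliable.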

    An intelligent framework for pre-processing ancient Thai manuscripts on palm leaves

    In Thailand’s early history, prior to the availability of paper and printing technologies, palm leaves were used to record information written by hand. These ancient documents contain invaluable knowledge. By digitising the manuscripts, the content can be preserved and made widely available to the interested community via electronic media. However, the content is difficult to access or retrieve. In order to extract relevant information from the document images efficiently, each step of the process requires reduction of irrelevant data such as noise or interference on the images. The pre-processing techniques serve the purpose of extracting regions of interest, reducing noise from the image and degrading the irrelevant background. The image can then be directly and efficiently processed for feature selection and extraction prior to the subsequent phase of character recognition. It is therefore the main objective of this study to develop an efficient and intelligent image preprocessing system that could be used to extract components from ancient manuscripts for information extraction and retrieval purposes. The main contributions of this thesis are the provision and enhancement of the region of interest by using an intelligent approach for the pre-processing of ancient Thai manuscripts on palm leaves and a detailed examination of the preprocessing techniques for palm leaf manuscripts. As noise reduction and binarisation are involved in the first step of pre-processing to eliminate noise and background from image documents, it is necessary for this step to provide a good quality output; otherwise, the accuracy of the subsequent stages will be affected. In this work, an intelligent approach to eliminate background was proposed and carried out by a selection of appropriate binarisation techniques using SVM. 
As there could be multiple binarisation techniques of choice, another approach was proposed to eliminate the background in this study in order to generate an optimal binarised image. The proposal is an ensemble architecture based on the majority vote scheme utilising local neighbouring information around a pixel of interest. To extract text from that binarised image, line segmentation was then applied based on the partial projection method as this method provides good results with slant texts and connected components. To improve the quality of the partial projection method, an Adaptive Partial Projection (APP) method was proposed. This technique adjusts the size of a character strip automatically by adapting the width of the strip to separate the connected component of consecutive lines through divide and conquer, and analysing the upper vowels and lower vowels of the text line. Finally, character segmentation was proposed using a hierarchical segmentation technique based on a contour-tracing algorithm. Touching components identified from the previous step were then separated by a trace of the background skeletons, and a combined method of segmentation. The key datasets used in this study are images provided by the Project for Palm Leaf Preservation, Northeastern Thailand Division, and benchmark datasets from the Document Image Binarisation Contest (DIBCO) series are used to compare the results of this work against other binarisation techniques. The experimental results have shown that the proposed methods in this study provide superior performance and will be used to support subsequent processing of the Thai ancient palm leaf documents. It is expected that the contributions from this study will also benefit research work on ancient manuscripts in other languages
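
The majority-vote idea mentioned in this abstract can be sketched as follows, under a simplifying assumption: each binarisation technique has already produced a binary mask, and the ensemble keeps a pixel as foreground when more than half of the masks agree. The thesis also uses local neighbouring information around each pixel, which this sketch omits. The function name `majority_vote` and the sample masks are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # 1 = text (foreground), 0 = background


def majority_vote(masks: List[Mask]) -> Mask:
    """Combine equally sized binary masks by a strict per-pixel majority."""
    count = len(masks)
    rows = len(masks[0])
    cols = len(masks[0][0])
    return [
        [
            # Foreground only if more than half of the masks mark this pixel.
            1 if 2 * sum(mask[r][c] for mask in masks) > count else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Made-up outputs of three different binarisation techniques on a 2x2 patch.
mask_a = [[1, 0], [1, 1]]
mask_b = [[1, 1], [0, 1]]
mask_c = [[0, 0], [0, 1]]

combined = majority_vote([mask_a, mask_b, mask_c])
print(combined)  # [[1, 0], [0, 1]]
```

Using an odd number of masks avoids ties. With an even number, the strict majority rule above resolves ties to background.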