
14 research outputs found

Research on coded character set standards and classification

Authors: 吴健, 芮建武, 谢谦
Publication date: 01/01/2006

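Coded character set standards are the foundation for processing text on computers. This paper proposes a triple abstraction of coded character sets, briefly reviews and summarizes existing coded character set standards, analyzes in depth the highly influential ISO 2022 standard and its derived standards, discusses the limitations of the ISO 2022 encoding mechanism in multilingual environments, and explains and analyzes why the Universal Coded Character Set (UCS) is necessary. It also discusses problems with existing encoding classification methods, introduces a new method for classifying coded character sets and their implementation methods, and uses it to categorize existing standards. Finally, it analyzes and reviews the national standards related to Chinese character sets.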

Institute Of Software, Chinese Academy Of Sciences

Finite state machine description of ISO 2022

Authors: 吴健, 芮建武, 谢谦
Publication date: 01/01/2006

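The ISO 2022 encoding framework has strongly influenced the drafting of national character set standards, but its clauses are ambiguous and sometimes hard to understand. This paper introduces a finite state machine (FSM) model to formally characterize ISO 2022. For the FSM five-tuple, it details the composition of the state space, proposes an equivalence classification of the input alphabet, gives the initial state and the set of final states, and analyzes the size of the state transition function. It then uses the FSM description to analyze standards such as ISO-2022-CN, EUC-CN, and Compound Text, revealing their intrinsic relationship to ISO 2022. This work helps with ISO 2022 conformance testing, the drafting of extension standards, and the assessment of system implementation complexity. Because formal description methods have not been widely applied to coded character set standards, this work brings a new approach to such research. 中国中文信息学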

Institute Of Software, Chinese Academy Of Sciences
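
The state-machine view of ISO 2022 described in the abstract above can be illustrated with a minimal sketch. It is not the paper's FSM: the function name `decode_iso2022_cn_subset` is hypothetical, and the sketch assumes only a small ISO-2022-CN subset (the `ESC $ ) A` designation of GB 2312 to G1, plus the SO and SI shift controls). Other designations, single shifts, and per-line state resets are deliberately left out.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F  # control bytes that drive state transitions


def decode_iso2022_cn_subset(data: bytes) -> str:
    """Decode a small ISO-2022-CN subset with an explicit two-part state.

    State = (g1, shifted): which set is designated to G1, and whether
    G1 is currently invoked (after SO) or ASCII is (after SI).
    """
    g1 = None        # no set designated to G1 yet
    shifted = False  # initial state: ASCII invoked
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            # Only one designation is recognised in this sketch.
            if data[i:i + 4] == b"\x1b$)A":
                g1 = "gb2312"
                i += 4
                continue
            raise ValueError(f"unsupported escape sequence at offset {i}")
        if b == SO:
            if g1 is None:
                raise ValueError("SO seen before any G1 designation")
            shifted = True
            i += 1
            continue
        if b == SI:
            shifted = False
            i += 1
            continue
        if b >= 0x80:
            raise ValueError("8-bit byte in a 7-bit encoding")
        if shifted:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            # Setting the high bits yields the EUC-CN form of the same
            # GB 2312 code, which Python's gb2312 codec can decode.
            pair = bytes([data[i] | 0x80, data[i + 1] | 0x80])
            out.append(pair.decode(g1))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)
```

Even this subset needs a designation variable and a shift flag, which hints at why the full state space and transition function are worth analysing formally.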

Design of Mongolian operating system within the framework of internationalization

Authors: 吴健, 孙玉芳, 芮建武
Publication date: 01/01/2006

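Implementing a Mongolian operating system is complex for two reasons: (1) traditional Mongolian is written vertically from top to bottom, with columns arranged from left to right; (2) Mongolian characters take presentation glyphs that vary in quite complex ways depending on the text context. Based on the operating system internationalization architecture, the paper describes the difficulties and technical solutions in implementing a traditional Mongolian operating system from several aspects: the Mongolian character set, contextual shaping of Mongolian characters, vertical display of Mongolian text, and the distinctive Mongolian graphical user interface. It briefly introduces an implementation based on the Qt/KDE desktop system, and finally identifies the problems that remain to be solved. 中国中文信息学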

Institute Of Software, Chinese Academy Of Sciences

Study on implementing Tibetan operating system based on ISO/IEC 10646

Authors: 吴健, 孙玉芳, 芮建武
Publication date: 01/01/2005

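For a long time there has been no complete Tibetan operating system, because the characteristics of the Tibetan script require specific text processing. Based on the Tibetan character set standard of ISO/IEC 10646 and the requirements of Tibetan orthography, this paper analyzes in detail the key issues in implementing a Tibetan operating system: (1) comparison of Tibetan character set schemes and Tibetan storage; (2) Tibetan input; (3) Tibetan rendering. Tibetan rendering is widely recognized as the bottleneck. To address it, the paper proposes rendering Tibetan "stacked" characters through syllable segmentation, OpenType fonts, and a corresponding text engine. Experiments applying this scheme to the Qt library, together with related tests, show that implementing a Tibetan operating system based on ISO/IEC 10646 is a reasonable approach. 中国中文信息学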

Institute Of Software, Chinese Academy Of Sciences
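
The syllable segmentation step mentioned in the abstract above might look roughly like the following sketch. It is an illustration only, not the paper's algorithm: the helper names `syllables` and `stacks` are hypothetical, syllables are assumed to be delimited by the tsheg mark (U+0F0B), and a stack is approximated as a base character followed by combining marks (subjoined letters and vowel signs). Actual glyph stacking would still be done by an OpenType font and a text engine.

```python
import unicodedata

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG


def syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables at the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


def stacks(syllable: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    groups: list[str] = []
    for ch in syllable:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if is_mark and groups:
            groups[-1] += ch   # attach subjoined letter or vowel sign
        else:
            groups.append(ch)  # start a new stack
    return groups
```

For example, `stacks("\u0f66\u0f92\u0fb2\u0f74\u0f56")` groups the first four code points (a base letter, two subjoined letters, and a vowel sign) into one stack and leaves the final letter as its own stack.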

Survey on international text processing

Authors: 吴健, 孙玉芳, 芮建武
Publication date: 01/01/2006

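Interaction between computers and different users usually has to be realized through the input and output of text in multiple scripts, so the degree to which an operating system supports multiple scripts is one measure of its functionality. The great differences among scripts make text processing in modern operating systems very complex to implement. This paper summarizes the scope and content of operating system text processing, including text input and storage, text processing, and user interaction; outlines general text processing models and the possible technical approaches with their advantages and disadvantages; analyzes the text processing implementations of commonly used operating systems; and finally looks ahead to the challenges that text processing still faces. 中国中文信息学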

Institute Of Software, Chinese Academy Of Sciences

Accelerating ASIFT Based on CPU/GPU Synergetic Parallel Computing

Authors: 何婷婷, 温腊, 芮建武
Publication date: 01/01/2014

ASIFT (Affine-SIFT) is a local image feature extraction algorithm that is invariant to affine transformations and scale. It gives good results in image matching, but its high computational complexity makes it hard to apply to real-time processing. Based on an analysis of how ASIFT's running time is distributed, SIFT was first optimized for the GPU, using shared memory and coalesced memory access to improve data access efficiency. The other parts of the ASIFT computation were then optimized for the GPU, yielding GASIFT. Throughout the GASIFT computation a GPU memory pool is used to reduce the allocation and release of GPU memory. Finally, two modes of CPU/GPU cooperation were tried. Experiments show that the mode in which the CPU handles logical computation and the GPU handles parallel computation is the most suitable for GASIFT. In this mode GASIFT achieves a good speedup, especially for large and medium images. For a large 2048×1536 image, GASIFT reaches a speedup of up to 16 times over standard ASIFT and up to 7 times over OpenMP-optimized ASIFT, greatly improving the feasibility of using ASIFT in real-time computation.

Institute Of Software, Chinese Academy Of Sciences

An approach for storing and accessing small files on Hadoop

Authors: 何婷婷, 张春明, 芮建武
Publication date: 01/01/2012

Thanks to its high fault tolerance, scalability, and low-cost storage, HDFS (Hadoop Distributed File System) is widely used in current cloud computing scenarios. However, HDFS was designed primarily for streaming access to very large files. With massive numbers of small files, its storage and access performance suffers because of problems such as the memory overhead of the NameNode. This paper proposes an approach based on merging small files, called HIFM (Hierarchy Index File Merging). It takes into account both the correlations between small files and the directory structure of the data to help merge small files into large ones and to generate a hierarchical index. Index files are managed through a combination of centralized and distributed storage, and index file preloading is implemented. In addition, HIFM adopts a data prefetching mechanism to improve the efficiency of sequential access to small files. Experimental results show that HIFM effectively improves the efficiency of storing and accessing small files and significantly reduces the memory overhead of the NameNode and DataNode. It is suited to applications that store massive numbers of small files with some directory structure. Funding: Major Science and Technology Project of Press and Publication (0610-1041BJNF2328/23) | National Key Technology R&D Program (2011BAH14B02) | Knowledge Innovation Program of the Chinese Academy of Sciences, directional project (KGCX2-YW-174).

Institute Of Software, Chinese Academy Of Sciences
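
The idea of merging small files behind a directory-based index, as described in the abstract above, can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not HIFM itself: the function names `merge_small_files` and `read_file` are hypothetical, the two-level index (directory, then file name, then offset and length) is an assumed layout, and no HDFS API, correlation analysis, preloading, or prefetching is modelled.

```python
import posixpath
from collections import defaultdict


def merge_small_files(files: dict[str, bytes]):
    """Concatenate small files into one blob and build a two-level index.

    The index maps directory -> file name -> (offset, length), so files
    from the same directory end up adjacent in the merged blob.
    """
    blob = bytearray()
    index: dict[str, dict[str, tuple[int, int]]] = defaultdict(dict)
    for path in sorted(files):  # sorting keeps each directory contiguous
        data = files[path]
        directory, name = posixpath.split(path)
        index[directory][name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), dict(index)


def read_file(blob: bytes, index, path: str) -> bytes:
    """Look up one small file in the merged blob via the index."""
    directory, name = posixpath.split(path)
    offset, length = index[directory][name]
    return blob[offset:offset + length]
```

Storing one large blob plus a compact index is what reduces per-file metadata. In the scenario the abstract describes, that metadata is the NameNode memory spent on each small file.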

Design and implementation of Linux file system supporting multilingualism

Authors: 吴健, 孙玉芳, 芮建武, 谢谦
Publication date: 01/01/2005

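Multilingual support in operating systems is an inevitable outcome of software development in a networked environment. Because the internationalization architecture of the POSIX standard has limitations in supporting multilingual and distributed applications, the POSIX-conformant Linux file subsystem may lose data when handling multilingual text. This paper examines the Linux file subsystem from a multilingual perspective, rebuilds a logical file system, EXT2U, that supports Unicode encoding, improves the file subsystem, and provides a system call interface based on Unicode encoding. The new file system and system call interface provide a better foundation for multilingual processing in the operating system. 中国中文信息学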

Institute Of Software, Chinese Academy Of Sciences

Accelerating hierarchical distributed latent Dirichlet allocation algorithm by parallel GPU

Authors: 何婷婷, 温腊, 芮建武, 郭亮
Publication date: 01/01/2013

Hierarchical Distributed Latent Dirichlet Allocation (HD-LDA) is an improved Latent Dirichlet Allocation (LDA) algorithm, a topic modeling technique for exploring document collections. Unlike LDA, which runs only on a single machine, it can run in a distributed environment. Mahout implements HD-LDA on the Hadoop framework. However, the computation on each node is heavy: although a large document collection is spread over multiple nodes for iterative inference, the documents on a single node are still processed sequentially, so the running time remains very long for large collections. To address this, a method is proposed that combines Hadoop with graphics processing units (GPUs), moving the inference over each node's documents from the CPU to the GPU so that many documents on a node are processed in parallel. Application results show that this method greatly reduces the execution time of HD-LDA on large document collections under the distributed framework, achieving a sevenfold speedup.

Institute Of Software, Chinese Academy Of Sciences

De-noising approach for online handwriting character recognition based on mathematical morphology

Authors: 刘瀚猛, 吴健, 孙嫣, 芮建武
Publication date: 01/01/2009

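Pen-tip jitter and other factors introduce a large amount of noise during handwriting input, and removing it effectively is a prerequisite for, and key to, handwriting recognition. Based on the morphological characteristics of handwritten characters in online handwriting recognition, the paper analyzes the noise produced by various causes during writing and, using basic operations of mathematical morphology such as dilation, erosion, and thinning, proposes a method that applies mathematical morphology to the preprocessing stage of online handwriting recognition. The method effectively eliminates a large amount of redundant information. Test results show that the proposed method is feasible and robust, and that it can be combined with other schemes in a variety of online handwritten character recognition systems.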

Institute Of Software, Chinese Academy Of Sciences
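
The erosion and dilation operations named in the abstract above can be sketched on a small binary image. This is a generic morphological opening, not the paper's preprocessing method: it assumes the pen trajectory has already been rasterized into a 0/1 grid, uses a fixed 3×3 structuring element, and the function names are hypothetical. Thinning is not shown.

```python
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 3x3 element


def erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is set."""
    h, w = len(img), len(img[0])
    return [[int(all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(w)] for y in range(h)]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    h, w = len(img), len(img[0])
    return [[int(any(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                     for dy, dx in OFFSETS))
             for x in range(w)] for y in range(h)]


def opening(img):
    """Erosion followed by dilation: removes specks smaller than the element."""
    return dilate(erode(img))
```

An opening with a 3×3 element also erases strokes thinner than three pixels, so a real preprocessing step would have to choose the element size and the order of operations to match the stroke width.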