3 research outputs found
Low Cost, Cross-language and Cross-platform Information Retrieval and Documentation Tools
In this paper we focus on the design and implementation of low-cost, cross-language and cross-platform Information Retrieval and Documentation tools capable of collecting, organizing and administering unstructured and semi-structured information imported from various sources. A modular Computer Assisted Information Resources Navigation (CAIRN) software architecture is proposed and the requirements of each module are presented. The discussion of the implementation is based on experimentation with a prototype of such a software tool. The technologies incorporated into modern operating systems, and the opportunities they offer for implementing the modules of the CAIRN architecture, are also examined and evaluated. Some of these technologies are common to, or independent of, the operating systems, while others are specific to a single one. In the latter case we face barriers to a straightforward implementation of CAIRN software systems across the whole range of desktop operating systems (e.g. Windows, Mac OS, Linux, Solaris). Some alternative technologies are presented to avoid this serious constraint. The evaluation of the implementation effort is also discussed, and finally some conclusions and plans for further improvement of the CAIRN architecture are given.
Text Extraction and Web Searching in a Non-Latin Language
Recent studies of queries submitted to Internet Search Engines have shown that
non-English queries and unclassifiable queries have nearly tripled during the
last decade. Most search engines were originally engineered for English. They
do not take full account of inflectional semantics, nor, for example, of diacritics or
the use of capitals, which are common features in languages other than English.
The literature concludes that searching with non-English and non-Latin-based
queries results in lower success and requires additional user effort to achieve
acceptable precision.
The primary aim of this research study is to develop an evaluation methodology
for identifying the shortcomings and measuring the effectiveness of
search engines with non-English queries. It also proposes a number of solutions
for the existing situation. A Greek query log is analyzed considering the morphological
features of the Greek language. In addition, a text extraction experiment
revealed some problems related to the encoding and the morphological and
grammatical differences among semantically equivalent Greek terms. A first
stopword list for Greek based on a domain-independent collection has been
produced and its application in Web searching has been studied. The effect of
lemmatization of query terms and the factors influencing text-based image retrieval
in Greek are also studied. Finally, an instructional strategy is presented
for teaching non-English students how to effectively utilize search engines.
The evaluation of the capabilities of the search engines showed that international
and nationwide search engines ignore most of the linguistic idiosyncrasies
of Greek and other complex European languages. There is a lack of
freely available non-English resources to work with (test corpora, linguistic resources,
etc.). The research showed that the application of standard IR techniques,
such as stopword removal, stemming, lemmatization and query expansion,
in Greek Web searching increases precision.
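
As a rough illustration of the kind of preprocessing the abstract refers to, the sketch below folds case and strips diacritics from Greek query terms, then removes stopwords. It is a minimal sketch under stated assumptions, not the method used in the study: the names `normalize_term`, `prepare_query` and `STOPWORDS` are hypothetical, and the five-word stopword set is purely illustrative and is not the list the research produced.

```python
import unicodedata

# Hypothetical, illustrative stopwords only (already in normalized form);
# this is NOT the stopword list produced in the study.
STOPWORDS = {"και", "το", "η", "ο", "σε"}


def normalize_term(term: str) -> str:
    """Strip diacritics and fold case so spelling variants compare equal."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks (Unicode category Mn), e.g. the Greek tonos.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # casefold() lowercases and also maps final sigma (ς) to sigma (σ).
    return unicodedata.normalize("NFC", stripped).casefold()


def prepare_query(query: str) -> list[str]:
    """Normalize each whitespace-separated term and drop stopwords."""
    terms = (normalize_term(t) for t in query.split())
    return [t for t in terms if t not in STOPWORDS]
```

With this sketch, `normalize_term("Οδός")` and `normalize_term("ΟΔΟΣ")` yield the same string, so an accented mixed-case query term and its unaccented capitalized form would match. Stripping diacritics can also merge distinct words (for example the article "η" and the conjunction "ή"), which is one reason stemming or lemmatization would normally be layered on top rather than replaced by it.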