Automatic Cross-Linguistic Information Retrieval using Latent Semantic Indexing

Abstract

this document as a bag of freely intermingled French and English words. A set of training documents like this is analyzed using LSI, and the result is a reduced-dimension semantic space in which related terms are near each other. Because the training documents contain both French and English terms, the LSI space contains terms from both languages; this is what makes it possible for the CL-LSI method to avoid query translation. Words that are consistently paired in translation (e.g., Libya and Libye) will be given identical representations in the LSI space, whereas words that are frequently associated with one another (e.g., not and pas) will be given similar representations. The next step in the CL-LSI method is to add (or "fold in") documents written in only French or only English. As described above, this is done by locating a new document at the weighted vector sum of its constituent terms. The result of this process is that each document in the database has a language-independent representation as a numerical vector. Users can now pose queries in either French or English and get back the most similar documents regardless of language.

3.2 Experimental Test
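The CL-LSI steps described above (train on combined French-English documents, reduce the space, fold in monolingual documents, then query in either language) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the five toy training documents, the raw term counts used as weights, the choice of k = 3 dimensions, and the helper names `bag_of_words`, `fold_in`, and `cosine`. Folding in uses the standard LSI projection of a term-count vector onto the truncated left singular vectors, scaled by the inverse singular values.

```python
import numpy as np

# Hypothetical toy training set: each entry is an English text merged with its
# French translation into one bag of words (accents dropped for simplicity).
training_docs = [
    "libya oil libye petrole",
    "libya government libye gouvernement",
    "oil price petrole prix",
    "government vote gouvernement scrutin",
    "price market prix marche",
]

vocab = sorted({w for doc in training_docs for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}


def bag_of_words(text):
    """Raw term-count vector over the training vocabulary (unknown words ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


# Term-by-document matrix built from the dual-language training documents.
A = np.column_stack([bag_of_words(d) for d in training_docs])

# Truncated SVD: keep the k largest singular values and their term vectors.
k = 3  # assumed dimensionality for this toy example
U, s, _ = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]


def fold_in(text):
    """Place a new document (or query) at the weighted sum of its term vectors."""
    return (bag_of_words(text) @ U_k) / s_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Fold in monolingual documents: one English, two French.
monolingual_docs = ["libya oil", "gouvernement scrutin", "prix marche"]
doc_vectors = [fold_in(d) for d in monolingual_docs]

# A French query is folded in the same way; no translation step is involved.
query = fold_in("libye petrole")
sims = [cosine(query, d) for d in doc_vectors]
best = int(np.argmax(sims))
print(monolingual_docs[best], sims)
```

In this toy data, libya and libye occur in exactly the same training documents, so their rows of the term-document matrix, and therefore their term vectors, coincide. That is why the French query lands on the English document about the same topic.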
