We present work carried out within the Scriptorium project, developed at the Research & Development division of Electricité de France (EDF), the French electricity company. We explore issues related to knowledge acquisition from very large, heterogeneous corpora and to the semantic annotation of these corpora, with the aim of facilitating browsing and navigation. Semantic access to heterogeneous, evolving text collections has become a crucial issue in online information: the increasing availability of electronic text enables the construction (and dispersion) of heterogeneous text collections. Current navigation tools such as thesauri, glossaries, and indexes, which are based on pre-defined semantic categories or taxonomies, are inadequate for describing or browsing this kind of dynamic, loosely structured text collection. We have therefore adopted an inductive, data-driven approach aimed at extracting semantic classes from a corpus through the statistical analysis of textual data. We create different views or 'slices' of the document collection by extracting sub-corpora of manageable size, which we submit to the statistical software. We then build a navigable topic map of our document collection using the Topic Map Standard (ISO/IEC 13250), which provides a semantic interface to the document collection and enables navigation through the inductively acquired viewpoints and classes. Navigation is aided by a 3D geometric representation of the semantic space of the corpus... The aim of this project is to identify prominent and emerging topics through the automatic analysis of the discourse of EDF's different social agents (managers, trade unions, employees, etc.) using textual data analysis methods.
The corpus under study in this project contains eight million words and is very heterogeneous (it includes book extracts, corporate press, union press, summaries of corporate meetings, transcriptions of taped trade union messages, etc.). This diversity makes the corpus prototypical of the electronic documents available nowadays in a given domain. All documents are SGML-tagged following the TEI (Text Encoding Initiative) recommendations... We are exploring issues related to semantic acquisition from large, heterogeneous corpora and content-based access to these corpora on the basis of inductively acquired categories. We feel that data-driven, inductive approaches to building semantic interfaces to text collections will become increasingly necessary for efficiently managing the unrestricted, dynamic online information available today.