7 research outputs found

    Automatic Extraction of Semantically-Meaningful Information from the Web

    Get PDF
    The semantic Web will bring meaning to the Internet, making it possible for web agents to understand the information it contains. However, current trends suggest that the semantic web is not likely to be adopted in the forthcoming years. Meaningful information extraction from the web therefore remains an obstacle for web agents. In this article, we present a framework for automatic extraction of semantically-meaningful information from the current web. Separating the extraction process from the business logic of an agent enhances modularity, adaptability, and maintainability. Our approach is novel in that it combines different technologies to extract information, surf the web, and automatically adapt to web changes.
    Comisión Interministerial de Ciencia y Tecnología TIC2000-1106-C02-0
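
    A minimal sketch, in Python, of the separation the article argues for: extraction logic isolated behind an interface so that only the extractor changes when a site's layout changes. All class and method names here are hypothetical illustrations, not the framework's actual API:

```python
# Sketch: decoupling web extraction from agent business logic.
# Names (Fact, InformationExtractor, Agent) are invented for illustration.
from dataclasses import dataclass
import re

@dataclass
class Fact:
    """A semantically-typed piece of information extracted from a page."""
    concept: str
    value: str

class InformationExtractor:
    """Turns raw HTML into typed facts; the only component that must be
    adapted when the target site's layout changes. A real system would use
    robust, self-adapting wrappers rather than a fixed regular expression."""
    def extract(self, html: str) -> list[Fact]:
        # Toy rule: capture anything marked up as <span class="price">...</span>.
        return [Fact("price", m)
                for m in re.findall(r'<span class="price">([^<]+)</span>', html)]

class Agent:
    """Business logic: consumes typed facts, never touches HTML directly."""
    def __init__(self, extractor: InformationExtractor):
        self.extractor = extractor

    def cheapest(self, html: str) -> float:
        prices = [float(f.value) for f in self.extractor.extract(html)
                  if f.concept == "price"]
        return min(prices)

agent = Agent(InformationExtractor())
print(agent.cheapest('<span class="price">9.99</span><span class="price">7.50</span>'))
```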

    Flexible and scalable digital library search

    Get PDF
    In this report the development of a specialised search engine for a digital library is described. The proposed system architecture consists of three levels: the conceptual, the logical, and the physical level. By exposing a domain-specific schema, the conceptual level enables semantically rich conceptual search. The logical level provides a description language that gives a high degree of flexibility for multimedia retrieval. The physical level takes care of scalable and efficient persistent data storage. The role played by each level changes during the various stages of a search engine's lifecycle: (1) modeling the index, (2) populating and maintaining the index, and (3) querying the index. The integration of all this functionality allows the combination of conceptual and content-based querying in the query stage. A search engine for the Australian Open tennis tournament website is used as a running example, which shows the power of the complete architecture and its various components.
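
    A rough sketch of the three-level split and the three lifecycle stages, with invented class names standing in for the report's actual interfaces; the tennis example follows the report's running example:

```python
# Sketch: three-level search engine architecture. All names are illustrative.

class ConceptualLevel:
    """Domain-specific schema: concepts and their attributes."""
    def __init__(self, schema: dict[str, list[str]]):
        self.schema = schema  # e.g. {"Player": ["name", "ranking"]}

class LogicalLevel:
    """Description language mapping concepts onto index structures."""
    def describe(self, concept: str, attrs: list[str]) -> str:
        return f"INDEX {concept}({', '.join(attrs)})"

class PhysicalLevel:
    """Scalable persistent storage; an in-memory dict stands in for it here."""
    def __init__(self):
        self.store: dict[str, list[dict]] = {}
    def insert(self, concept: str, record: dict):
        self.store.setdefault(concept, []).append(record)
    def query(self, concept: str, **filters):
        return [r for r in self.store.get(concept, [])
                if all(r.get(k) == v for k, v in filters.items())]

# Lifecycle: (1) model the index, (2) populate it, (3) query it.
conceptual = ConceptualLevel({"Player": ["name", "ranking"]})
print(LogicalLevel().describe("Player", conceptual.schema["Player"]))
physical = PhysicalLevel()
physical.insert("Player", {"name": "Agassi", "ranking": 1})
print(physical.query("Player", name="Agassi"))
```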

    Web Information Systems: Usage, Content, and Functionality Modelling

    Get PDF
    The design of large-scale data-intensive web information systems (WIS) requires a clear picture of the intended users and their behaviour in using the system, support for various access channels and the technology used with them, and an integration of traditional methods for the design of data-intensive information systems with new methods that address the challenges arising from web presentation and open access. This paper presents the conceptual modelling parts of a methodology for the design of WISs that is based on an abstraction layer model (ALM). It concentrates on the two most important layers in this model: a business layer and a conceptual layer. The major activities on the business layer deal with user profiling and storyboarding, which addresses the design of an underlying application story. The core of such a story can be expressed by a directed multi-graph, in which the vertices represent scenes and the edges represent actions by the users, including navigation. This leads to story algebras, which can then be used to personalise the WIS to the needs of a user with a particular profile. The major activities on the conceptual layer address the support of scenes by modelling media types, which combine links to databases via extended views with the generation of navigation structures, operations supporting the activities in the storyboard, hierarchical presentations, and adaptivity to users, end-devices, and channels. Adding presentation style options, this can be used to generate the web pages that are presented to the WIS users.
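
    The storyboard idea lends itself to a compact data structure. Below is a minimal sketch of a directed multi-graph of scenes and actions; the scene and action names are illustrative, and the paper's story algebra is considerably richer than this:

```python
# Sketch: storyboard as a directed multi-graph; vertices are scenes,
# labelled edges are user actions. Names are illustrative only.
from collections import defaultdict

class Storyboard:
    def __init__(self):
        # Multi-graph: several labelled edges may connect the same two scenes.
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_action(self, source: str, action: str, target: str):
        self.edges[source].append((action, target))

    def actions_from(self, scene: str) -> list[tuple[str, str]]:
        return self.edges[scene]

story = Storyboard()
story.add_action("home", "search", "results")
story.add_action("home", "browse", "results")   # parallel edge: a multi-graph
story.add_action("results", "select", "detail")
print(story.actions_from("home"))
```

    Personalisation can then be viewed as pruning or reordering the outgoing edges of each scene according to a user profile.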

    Metadata-based and personalized web querying

    Get PDF
    The advent of the Web has raised new searching and querying problems. Keyword-matching-based querying techniques, widely used by search engines, return thousands of Web documents for a single query, and most of these documents are generally unrelated to the users' information needs. Towards the goal of improving the information search of Web users, a recent promising approach is to index the Web using metadata and annotations. In this thesis, we model and query Web-based information resources using metadata for improved Web searching capabilities. Employing metadata for querying the Web increases the precision of the query outputs by returning semantically more meaningful results. Our Web data model, named the "Web information space model", consists of Web-based information resources (HTML/XML documents on the Web), expert advice repositories (domain-expert-specified metadata for information resources), and personalized information about users (captured as user profiles that indicate users' preferences about experts as well as users' knowledge about topics). Expert advice is specified using topics and relationships among topics (i.e., metalinks), along the lines of the recently proposed topic maps standard. Topics and metalinks constitute metadata that describe the contents of the underlying Web information resources. Experts assign scores to topics, metalinks, and information resources to represent their "importance". User profiles store users' preferences and navigational history information about the information resources that the user visits. User preferences, knowledge level on topics, and history information are used for personalizing the Web search and improving the precision of the results returned to the user. We store expert advice and user profiles in an object-relational database management system, and extend SQL for efficient querying of Web-based information resources through the Web information space model. The SQL extensions include clauses for propagating input importance scores to output tuples, a clause that specifies a query stopping condition, and new operators (i.e., text-similarity-based selection, text-similarity-based join, and topic closure). Importance score propagation and the query stopping condition allow ranking of query outputs and limiting the output size. The text-similarity-based operators and the topic closure operator support sophisticated querying facilities. We develop a new algebra called the Sideway Value generating Algebra (SVA) to process these SQL extensions. We also propose evaluation algorithms for the text-similarity-based SVA directional join operator, and report experimental results on the performance of the operator. We demonstrate experimentally the effectiveness of metadata-based personalized Web search through SQL extensions over the Web information space model against keyword-matching-based Web search techniques.
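
    As one illustration of the flavour of these extensions, the sketch below computes a topic closure over weighted metalinks, with multiplicative score propagation and a score threshold as the stopping condition. The propagation rule and all names are assumptions made for illustration, not the thesis's SVA definitions:

```python
# Sketch: topic closure with importance-score propagation and a stopping
# condition. The multiplicative rule and the data are illustrative only.

def topic_closure(seed: str,
                  metalinks: dict[str, list[tuple[str, float]]],
                  threshold: float) -> dict[str, float]:
    """Transitively follow metalinks from `seed`, propagating scores
    multiplicatively and pruning paths whose score drops below `threshold`."""
    scores = {seed: 1.0}
    frontier = [seed]
    while frontier:
        topic = frontier.pop()
        for related, weight in metalinks.get(topic, []):
            score = scores[topic] * weight
            # Keep a topic only if it clears the threshold and improves
            # on any previously found score for it.
            if score >= threshold and score > scores.get(related, 0.0):
                scores[related] = score
                frontier.append(related)
    return scores

metalinks = {"databases": [("sql", 0.9), ("indexing", 0.7)],
             "sql": [("query optimization", 0.8)]}
# Reaches "query optimization" transitively with score 0.9 * 0.8.
print(topic_closure("databases", metalinks, threshold=0.5))
```

    The threshold plays the role of the stopping condition: lowering it enlarges the closure, raising it limits the output size, which is how ranked, bounded results emerge from score propagation.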

    A scientific-research activities information system

    No full text
    Purpose - The purpose of this research is model development, software prototype implementation, and verification of a system for the identification of methodology mentions in scientific publications, in the subdomain of automatic terminology extraction. In order to provide scientists with a better insight into the methodologies in their fields, extracted methodologies should be connected with the metadata associated with the publication from which they are extracted. For this reason, the purpose of this research was also the development of a system for the automatic extraction of metadata from scientific publications. Design/methodology/approach - Methodology mentions are categorized into four semantic categories: Task, Method, Resource/Feature and Implementation. The system comprises two major layers: the first layer is the automatic identification of methodological sentences; the second layer highlights methodological phrases (segments). Extraction and classification of the segments was formalized as a sequence tagging problem, and four separate phrase-based Conditional Random Fields were used to accomplish the task. The system has been evaluated on a manually annotated corpus comprising 45 full-text articles. The system for the automatic extraction of metadata from scientific publications is based on classification. The metadata are classified into eight pre-defined categories: Title, Authors, Affiliation, Address, Email, Abstract, Keywords and Publication Note. Experiments were performed with standard classification models: Decision Tree, Naive Bayes, K-nearest Neighbours and Support Vector Machines. Findings - The results of the system for methodology extraction show an F-measure of 53% for the identification of both Task and Method mentions (with 70% precision), whereas the F-measures for Resource/Feature and Implementation identification were 60% (with 67% precision) and 75% (with 85% precision) respectively. For the system for the automatic extraction of metadata, Support Vector Machines provided the best performance: the F-measure was over 85% for almost all of the categories and over 90% for most of them. Research limitations/implications - Both the system for the extraction of methodologies and the system for the extraction of metadata are only applicable to scientific papers in English. Practical implications - The proposed models can be used to gain insight into the development of a scientific discipline and to create semantically rich research activity information systems. Originality/Value - The main original contributions are: a novel model for the extraction and semantic categorization of methodology mentions from scientific publications; an analysis of the impact of various types of features on the extraction of methodological phrases; and a fully automated system for the extraction of metadata for research activity information systems.
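
    For the metadata classification stage, a minimal sketch of an SVM line classifier follows, using scikit-learn's LinearSVC as a stand-in for whatever SVM implementation the thesis used; the tiny training set, the character n-gram features, and the expected prediction are illustrative only:

```python
# Sketch: classifying lines of a paper's front matter into metadata
# categories with an SVM. Training data here is a toy illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

lines = ["Methodology Extraction from Scientific Articles",
         "John Smith, Jane Doe",
         "jsmith@example.org",
         "Abstract. We present a system for ..."]
labels = ["Title", "Authors", "Email", "Abstract"]

# Character n-grams pick up surface cues such as "@" and ".org" for emails.
model = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      LinearSVC())
model.fit(lines, labels)
print(model.predict(["mdoe@example.org"]))  # expected: ['Email']
```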

    Araneus in the Era of XML

    No full text
    A large body of research has recently been motivated by the attempt to extend database manipulation techniques to data on the Web. Most of these research efforts -- which range from the definition of Web query languages and the related optimizations, to systems for Web site development and management, and to integration techniques -- started before XML was introduced, and therefore strived for a long time to handle the highly heterogeneous nature of HTML pages. In the meanwhile, Web data sources have evolved from small, home-made collections of HTML pages into complex platforms for distributed data access and application development, and XML promises to impose itself as a more appropriate format for this new breed of Web sites. XML brings data on the Web closer to databases since, unlike HTML, it is based on a clean distinction between the way the data, its logical structure (the DTD), and the chosen presentation (the stylesheet) are specified. By virtue of this, most of the early research proposals for data management on the Web are now being reconsidered in this new perspective. In this paper, we discuss the impact of XML on the research work conducted in the last few years by our group in the framework of the Araneus project. Araneus started as an attempt to investigate the chances of re-applying traditional database concepts and abstractions, such as data model and query language, to data on the Web. In this spirit, we have developed several tools and techniques to handle both structured and semistructured data, in the Web style, as follows: (i) a data model called ADM for modeling Web documents and hypertexts; (ii) languages for wrapping and querying Web sites; (iii) tools and techniques for Web site design and implementation.
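
    In the spirit of ADM's page schemes and wrappers, a minimal sketch follows: a typed page scheme filled by a wrapper and then queried like a relation. The class name and the regular-expression wrapper are illustrative stand-ins, not Araneus syntax:

```python
# Sketch: a page scheme in the ADM spirit. A wrapper maps a concrete HTML
# page onto typed attributes, which can then be queried like a relation.
# AuthorPage and the regex rules are invented for illustration.
from dataclasses import dataclass
import re

@dataclass
class AuthorPage:
    """A page scheme: typed attributes instead of raw HTML."""
    name: str
    paper_links: list[str]

def wrap(html: str) -> AuthorPage:
    """Wrapper: extracts the scheme's attributes from one HTML page."""
    name = re.search(r"<h1>([^<]+)</h1>", html).group(1)
    links = re.findall(r'<a href="([^"]+\.pdf)"', html)
    return AuthorPage(name, links)

page = wrap('<h1>A. Author</h1><a href="p1.pdf">P1</a><a href="p2.pdf">P2</a>')
# "Query" the wrapped page as if it were a relation:
print(page.name, [l for l in page.paper_links if l.endswith(".pdf")])
```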
