10 research outputs found

    Integrating information to bootstrap information extraction from web sites

    In this paper we propose a methodology to learn to extract domain-specific information from large repositories (e.g. the Web) with minimum user intervention. Learning is seeded by integrating information from structured sources (e.g. databases and digital libraries). Retrieved information is then used to bootstrap learning for simple Information Extraction (IE) methodologies, which in turn produce more annotation to train more complex IE engines. All the corpora for training the IE engines are produced automatically by integrating information from different sources, such as available corpora and services (e.g. databases or digital libraries). User intervention is limited to providing an initial URL and to adding information missed by the different modules when the computation has finished. The information added or deleted by the user can then be reused for further training, yielding higher recall and/or precision. We are currently applying this methodology to mining web sites of Computer Science departments.
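
    The bootstrapping loop described above can be pictured with a minimal sketch. The following Python code is illustrative only: the toy gazetteer, the trigger-word "simple IE" step and all function names are assumptions made for the example, not the published system.

```python
import re

def seed_annotations(pages, gazetteer):
    """Project entries from structured sources (a gazetteer of known
    entities) onto page text to obtain initial annotations."""
    annotations = []
    for url, text in pages.items():
        for entity, label in gazetteer:
            for match in re.finditer(re.escape(entity), text):
                annotations.append((url, match.start(), match.end(), label))
    return annotations

def learn_simple_patterns(pages, annotations):
    """Crude 'simple IE' stand-in: remember the token immediately preceding
    each annotated span and reuse it as a trigger for new candidates."""
    triggers = {}
    for url, start, _end, label in annotations:
        preceding = pages[url][:start].split()[-1:]
        if preceding:
            triggers[preceding[0].lower()] = label
    return triggers

def apply_patterns(pages, triggers):
    """Propose a new annotation wherever a learned trigger word occurs."""
    proposals = []
    for url, text in pages.items():
        tokens = text.split()
        for i, token in enumerate(tokens[:-1]):
            label = triggers.get(token.lower())
            if label:
                proposals.append((url, tokens[i + 1], label))
    return proposals

# Seed from a toy gazetteer, then run one bootstrapping round.
pages = {"http://cs.example.edu/staff":
         "Dr. Ada Lovelace teaches Compilers. Dr. Alan Turing teaches Logic."}
gazetteer = [("Ada Lovelace", "lecturer")]
seeds = seed_annotations(pages, gazetteer)
triggers = learn_simple_patterns(pages, seeds)   # {'dr.': 'lecturer'}
print(apply_patterns(pages, triggers))           # proposes 'Ada' and 'Alan' as lecturers
```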

    Web Search using Improved Concept Based Query Refinement

    The information extracted from Web pages can be used for effective query expansion. One aspect needed to improve the accuracy of web search engines is the inclusion of metadata, not only to analyze Web content but also to interpret it. With today's Web being unstructured and semantically heterogeneous, keyword-based queries are likely to miss important results. Using data mining methods, our system derives dependency rules and applies them to concept-based queries. This paper presents a novel approach to query expansion that applies dependency rules mined from the Web at large, combining several existing techniques for data extraction and mining, and integrates the system into COMPACT, our prototype implementation of a concept-based search engine.
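
    A minimal sketch of how mined dependency rules might drive the expansion step. The rule table, confidence threshold and function names below are illustrative assumptions; they do not reproduce COMPACT's actual rule format or mining procedure.

```python
# A dependency rule "A -> B" with a confidence score means that documents
# about concept A tend to also involve concept B (illustrative values only).
DEPENDENCY_RULES = {
    "ontology": [("semantic web", 0.8), ("knowledge representation", 0.6)],
    "annotation": [("metadata", 0.7)],
}

def expand_query(concepts, rules, min_confidence=0.65):
    """Add every concept implied by a sufficiently confident rule."""
    expanded = list(concepts)
    for concept in concepts:
        for implied, confidence in rules.get(concept, []):
            if confidence >= min_confidence and implied not in expanded:
                expanded.append(implied)
    return expanded

print(expand_query(["ontology", "annotation"], DEPENDENCY_RULES))
# ['ontology', 'annotation', 'semantic web', 'metadata']
```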

    Multi-strategy definition of annotation services in Melita

    The definition of methodologies for automatic ontology-based document annotation is a fundamental step in the Semantic Web vision. In the near future, semantic annotation services could become as important as search engines are today. Tools for the easy and effective development of such services are therefore needed. In this paper, we present Melita, a tool for the definition and development of ontology-based annotation services. Melita goes beyond the rule-learning vs. rule-writing dichotomy of classic annotation systems, as it allows different strategies to be adopted, from annotating examples in a corpus to train a learner, to writing rules by hand, and even a mixture of the two. It also supports users in defining and maintaining an ontology for annotation and in delivering the annotation service. The result is a tool that is easy to use and flexible enough to suit different user needs.
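
    The multi-strategy idea of combining hand-written rules, a trained learner and user-confirmed examples can be illustrated with a small sketch. The rule, the toy learner and all names below are hypothetical, not Melita's actual API.

```python
import re

def rule_annotator(text):
    """Hand-written rule: email addresses are 'contact' annotations."""
    return [(m.group(), "contact") for m in re.finditer(r"\S+@\S+\.\w+", text)]

class LearnedAnnotator:
    """Toy learner: remembers spans the user has confirmed before."""
    def __init__(self):
        self.known = {}
    def train(self, span, label):
        self.known[span] = label
    def annotate(self, text):
        return [(s, l) for s, l in self.known.items() if s in text]

learner = LearnedAnnotator()
learner.train("University of Sheffield", "organisation")   # user-confirmed example
text = "Contact jane@dcs.shef.ac.uk at the University of Sheffield."
# Both strategies contribute annotations over the same document.
print(rule_annotator(text) + learner.annotate(text))
```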

    Features for Killer Apps from a Semantic Web Perspective

    There are certain features that distinguish killer apps from ordinary applications. This chapter examines those features in the context of the semantic web, in the hope that a better understanding of the characteristics of killer apps might encourage their consideration when developing semantic web applications. Killer apps are highly transformative technologies that create new e-commerce venues and widespread patterns of behaviour. Information technology generally, and the Web in particular, have benefited from killer apps that create new networks of users and increase their value. The semantic web community, on the other hand, is still awaiting a killer app that proves the superiority of its technologies. The authors hope that this chapter will help to highlight some of the common ingredients of killer apps in e-commerce and to discuss how such applications might emerge in the semantic web.

    Gimme The Context: Context-driven automatic semantic annotation with C-PANKOW

    Cimiano P, Ladwig G, Staab S. Gimme The Context: Context-driven automatic semantic annotation with C-PANKOW. In: Ellis A, Hagino T, eds. Proceedings of the 14th international conference on World Wide Web, WWW 2005. ACM Press; 2005: 332-341

    Modelling Web Service Composition for Deductive Web Mining

    Composition of simpler web services into custom applications is seen as a promising technique for answering information requests in a heterogeneous and changing environment. This is also relevant for applications characterised as deductive web mining (DWM). We suggest using problem-solving methods (PSMs) as templates for composed services. We developed a multi-dimensional, ontology-based framework and a collection of PSMs, which make it possible to characterise DWM applications at an abstract level; we describe several existing applications in this framework. We show that the heterogeneity and unboundedness of the web demand some modifications of the PSM paradigm as used in traditional artificial intelligence. Finally, as a simple proof of concept, we simulate automated DWM service composition on a small collection of services, PSM-based templates, data objects and ontological knowledge, all implemented in Prolog.
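
    The composition step can be approximated in a short sketch (in Python rather than the Prolog used in the paper): a PSM template is a sequence of abstract tasks, and composition fills each task with a concrete service whose input type is already available. The services, the template and the matching criterion are invented for illustration.

```python
SERVICES = [
    {"name": "crawl_site",    "task": "retrieval",      "in": "url",        "out": "pages"},
    {"name": "classify_page", "task": "classification", "in": "pages",      "out": "page_types"},
    {"name": "extract_staff", "task": "extraction",     "in": "page_types", "out": "records"},
]

# A problem-solving-method template for a simple deductive web mining task.
PSM_TEMPLATE = ["retrieval", "classification", "extraction"]

def compose(template, services, available_input):
    """Greedily pick one service per abstract task, checking that each
    service's input type has been produced by an earlier step."""
    plan, produced = [], {available_input}
    for task in template:
        candidates = [s for s in services
                      if s["task"] == task and s["in"] in produced]
        if not candidates:
            raise ValueError(f"no service fits task {task!r}")
        chosen = candidates[0]
        plan.append(chosen["name"])
        produced.add(chosen["out"])
    return plan

print(compose(PSM_TEMPLATE, SERVICES, "url"))
# ['crawl_site', 'classify_page', 'extract_staff']
```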

    Advanced Knowledge Technologies at the Midterm: Tools and Methods for the Semantic Web

    In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962: "Our private senses are not closed systems but are endlessly translated into each other in that experience which we call consciousness. Our extended senses, tools, technologies, through the ages, have been closed systems incapable of interplay or collective awareness. Now, in the electric age, the very instantaneous nature of co-existence among our technological instruments has created a crisis quite new in human history. Our extended faculties and senses now constitute a single field of experience which demands that they become collectively conscious. Our technologies, like our private senses, now demand an interplay and ratio that makes rational co-existence possible. As long as our technologies were as slow as the wheel or the alphabet or money, the fact that they were separate, closed systems was socially and psychically supportable. This is not true now when sight and sound and movement are simultaneous and global in extent." (McLuhan 1962, p.5, emphasis in original)

    Over forty years later, the seamless interplay that McLuhan demanded between our technologies is still barely visible. McLuhan's predictions of the spread, and increased importance, of electronic media have of course been borne out, and the worlds of business, science and knowledge storage and transfer have been revolutionised. Yet the integration of electronic systems as open systems remains in its infancy.

    Advanced Knowledge Technologies (AKT) aims to address this problem, to create a view of knowledge and its management across its lifecycle, and to research and create the services and technologies that such unification will require. Halfway through its six-year span, the results are beginning to come through, and this paper will explore some of the services, technologies and methodologies that have been developed. We hope to give a sense in this paper of the potential for the next three years, to discuss the insights and lessons learnt in the first phase of the project, and to articulate the challenges and issues that remain.

    The WWW provided the original context that made the AKT approach to knowledge management (KM) possible. When AKT was initially proposed in 1999, it brought together an interdisciplinary consortium with the technological breadth and complementarity to create the conditions for a unified approach to knowledge across its lifecycle. The combination of this expertise, and the time and space afforded the consortium by the IRC structure, suggested the opportunity for a concerted effort to develop an approach to advanced knowledge technologies based on the WWW as a basic infrastructure. The technological context of AKT altered for the better in the short period between the development of the proposal and the beginning of the project itself, with the development of the semantic web (SW), which foresaw much more intelligent manipulation and querying of knowledge. The opportunities that the SW provided, such as more intelligent retrieval, put AKT at the centre of information technology innovation and knowledge management services; the AKT skill set would clearly be central to the exploitation of those opportunities.

    The SW, as an extension of the WWW, provides an interesting set of constraints on the knowledge management services AKT tries to provide. As a medium for the semantically informed coordination of information, it has suggested a number of ways in which the objectives of AKT can be achieved, most obviously through the provision of knowledge management services delivered over the web, as opposed to the creation and provision of technologies to manage knowledge. AKT is working on the assumption that many web services will be developed and provided for users. The KM problem in the near future will be one of deciding which services are needed and of coordinating them. Many of these services will be largely or entirely legacies of the WWW, and so the capabilities of the services will vary. As well as providing useful KM services in their own right, AKT will be aiming to exploit this opportunity by reasoning over services, brokering between them, and providing essential meta-services for SW knowledge service management.

    Ontologies will be a crucial tool for the SW. The AKT consortium brings together a great deal of expertise on ontologies, and ontologies were always going to be a key part of the strategy. All kinds of knowledge sharing and transfer activities will be mediated by ontologies, and ontology management will be an important enabling task. Different applications will need to cope with inconsistent ontologies, or with the problems that will follow the automatic creation of ontologies (e.g. merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies. Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

    The emerging picture of the SW is one of great opportunity, but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play in bringing much of this context together, and AKT is playing a significant role in these efforts. But standards take time to emerge, they take political power to enforce, and they have been known to stifle innovation (in the short term). AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough: complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.
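
    One of the tasks mentioned above, ontology mapping, can be illustrated with a generic baseline: aligning classes from two ontologies by simple label similarity. The toy ontologies and the threshold are assumptions for the example, not a specific AKT component.

```python
from difflib import SequenceMatcher

ONTOLOGY_A = ["Person", "Academic Staff", "Publication"]
ONTOLOGY_B = ["Individual", "AcademicStaffMember", "Paper", "Person"]

def map_classes(source, target, threshold=0.75):
    """Propose, for each source class, the most similar target label."""
    mappings = []
    for a in source:
        best = max(target,
                   key=lambda b: SequenceMatcher(None, a.lower(), b.lower()).ratio())
        score = SequenceMatcher(None, a.lower(), best.lower()).ratio()
        if score >= threshold:
            mappings.append((a, best, round(score, 2)))
    return mappings

print(map_classes(ONTOLOGY_A, ONTOLOGY_B))
```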

    Integrating Information to Bootstrap Information Extraction from Web Sites

    In this paper we propose a methodology to learn to extract domain-specific information from large repositories (e.g. the Web) with minimum user intervention.

    Recherche d'information sémantique et extraction automatique d'ontologie du domaine

    It can prove difficult, even for a small organization, to find information among hundreds, even thousands, of electronic documents. Most often, the methods employed by search engines on the Internet are used by companies wanting to improve information retrieval on their intranet. These techniques rest on statistical methods and do not make it possible to evaluate the semantics contained either in the user's query or in the documents. Certain methods were developed to extract this semantics and thus improve the answers given to queries. However, most of these techniques were conceived to be applied to the entire World Wide Web and not to a particular field of knowledge, such as corporate data. It could be interesting to use domain-specific ontologies to link a specific query to related documents and thus be able to better answer these queries. This thesis presents our approach, which proposes the use of the Text-To-Onto software to automatically create an ontology describing a particular field. Thereafter, this ontology is used by the Sesei software, which is a semantic filter for conventional search engines. This method makes it possible to improve the relevance of the documents returned to the user.
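
    The semantic-filter idea can be sketched as reranking the results of a conventional keyword engine by how many ontology concepts related to the query each document mentions. The tiny ontology, the scoring function and the names below are illustrative assumptions; they do not reproduce Text-To-Onto or Sesei.

```python
# Domain ontology as a concept graph: concept -> related concepts.
ONTOLOGY = {
    "printer": {"toner", "driver", "paper jam"},
    "network": {"router", "firewall", "vpn"},
}

def related_concepts(query_terms, ontology):
    """Concepts named in the query plus their ontological neighbours."""
    concepts = set()
    for term in query_terms:
        if term in ontology:
            concepts.add(term)
            concepts |= ontology[term]
    return concepts

def semantic_rerank(query_terms, documents, ontology):
    """Order documents by how many query-related concepts they contain."""
    concepts = related_concepts(query_terms, ontology)
    def score(doc):
        return sum(1 for c in concepts if c in doc.lower())
    return sorted(documents, key=score, reverse=True)

docs = ["Replacing the toner cartridge and clearing a paper jam",
        "Quarterly sales figures for the printing division"]
print(semantic_rerank(["printer"], docs, ONTOLOGY))
```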

    Moving towards the semantic web: enabling new technologies through the semantic annotation of social contents.

    Social Web technologies have caused an exponential growth of the documents available through the Web, making enormous amounts of textual electronic resources available. Users may be overwhelmed by such an amount of content and, therefore, the automatic analysis and exploitation of all this information is of interest to the data mining community. Data mining algorithms exploit features of the entities in order to characterise, group or classify them according to their resemblance. Data by itself does not carry any meaning; it needs to be interpreted to convey information. Classical data analysis methods did not aim to "understand" the content: the data were treated as meaningless numbers, and statistics were calculated on them to build models that were interpreted manually by human domain experts. Nowadays, motivated by the Semantic Web, many researchers have proposed semantically grounded data classification and clustering methods that are able to exploit textual data at a conceptual level. However, they usually rely on pre-annotated inputs to be able to semantically interpret textual data such as the content of Web pages. The usability of all these methods is tied to the linkage between data and its meaning. This work focuses on the development of a general methodology able to detect the most relevant features of a textual resource, to find out their semantics (associating them with concepts modelled in ontologies) and to detect its main topics. The proposed methods are unsupervised (avoiding the manual annotation bottleneck), domain-independent (applicable to any area of knowledge) and flexible (able to deal with heterogeneous resources: raw text documents, semi-structured user-generated documents such as Wikipedia articles, or short and noisy tweets). The methods have been evaluated in different fields (tourism, oncology). This work is a first step towards the automatic semantic annotation of documents, needed to pave the way towards the Semantic Web vision.
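
    The unsupervised linkage step can be sketched as mapping words in a raw text to the ontology concepts they lexicalise, then taking the most frequently evoked concepts as the main topics. The mini ontology and the frequency-based topic choice are illustrative assumptions, not the thesis' actual pipeline.

```python
from collections import Counter
import re

# Concept -> lexicalisations (label plus synonyms), illustrative only.
CONCEPTS = {
    "Accommodation": {"hotel", "hostel", "apartment"},
    "Gastronomy":    {"restaurant", "tapas", "cuisine"},
}

def annotate(text, concepts):
    """Map each known word in the text to the concept it lexicalises."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(w, c) for w in words
            for c, lexicon in concepts.items() if w in lexicon]

def main_topics(hits, k=1):
    """The most frequently evoked concepts are taken as the main topics."""
    return [c for c, _ in Counter(c for _, c in hits).most_common(k)]

text = "Lovely boutique hotel near the old town, amazing tapas restaurant downstairs."
hits = annotate(text, CONCEPTS)
print(hits)                # [('hotel', 'Accommodation'), ('tapas', 'Gastronomy'), ('restaurant', 'Gastronomy')]
print(main_topics(hits))   # ['Gastronomy']
```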