    The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

    Collections of Web documents about specific topics are needed for many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs). These are also used to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires a lot of expertise. In this demonstration we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and Social Media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard even non-expert users can create semantic specifications for focused crawlers interactively and efficiently. Comment: Published in the Proceedings of the European Conference on Information Retrieval (ECIR) 201

    Analysing entity context in multilingual Wikipedia to support entity-centric retrieval applications

    Representation of influential entities, such as famous people and multinational corporations, on the Web can vary across languages, reflecting language-specific entity aspects as well as divergent views on these entities in different communities. A systematic analysis of language-specific entity contexts can provide a better overview of the existing aspects and support entity-centric retrieval applications over multilingual Web data. An important source of cross-lingual information about influential entities is Wikipedia, an online community-created encyclopaedia containing more than 280 language editions. In this paper we focus on the extraction and analysis of the language-specific entity contexts from different Wikipedia language editions over multilingual data. We discuss alternative ways in which such contexts can be built, including graph-based and article-based contexts. Furthermore, we analyse the similarities and the differences in these contexts in a case study including 80 entities and five Wikipedia language editions.

    A First Step Towards Keyword-Based Searching for Recommendation Systems

    Due to the high availability of data, users are frequently overloaded with a huge number of alternatives when they need to choose a particular item. This has motivated an increased interest in research on recommendation systems, which filter the options and provide users with suggestions about specific elements (e.g., movies, restaurants, hotels, news, etc.) that are estimated to be potentially relevant for the user. Recommendation systems are still an active area of research, and particularly in recent years the concept of context-aware recommendation systems has started to become popular, due to the interest in considering the context of the user in the recommendation process. In this paper, we describe our work-in-progress concerning pull-based recommendations (i.e., recommendations about certain types of items that are explicitly requested by the user). In particular, we focus on the problem of detecting the type of item the user is interested in. Due to its popularity, we consider a keyword-based user interface: the user types a few keywords and the system must determine what the user is searching for. Whereas there is extensive work in the field of keyword-based search, which is still a very active research area, keyword searching has not been applied so far in most recommendation contexts.

    Topic detection in multichannel Italian newspapers

    Nowadays, people, companies and public institutions use different channels to share private or public information with other people (friends, customers, relatives, etc.) or institutions. This has changed journalism: major newspapers now report news not just on their own web sites, but also on several social media such as Twitter or YouTube. The use of multiple communication media stimulates the need for integration and analysis of the content published globally and not just at the level of a single medium. An analysis is needed to achieve a comprehensive overview of the information that reaches end users and of how they consume it. This analysis should identify the main topics in the news flow and reveal the mechanisms of publication of news on different media (e.g. news timeline). Currently, most of the work in this area is still focused on a single medium, so an analysis across different media (channels) should improve the result of topic detection. This paper shows the application of a graph analytical approach, called Keygraph, to a set of very heterogeneous documents such as the news published on various media. A preliminary evaluation on the news published in a 5-day period was able to identify the main topics within the publications of a single newspaper, and also within the publications of 20 newspapers on several on-line channels.

    No users no dataspaces! Query-driven dataspace orchestration

    Data analysis in rich spaces of heterogeneous data sources is an increasingly common activity. Examples include querying the web of linked data and personal information management. Such analytics on dataspaces is often iterative and dynamic, in an open-ended interaction between discovery and data orchestration. The current state of the art in integration and orchestration in dataspaces is primarily geared towards close-ended analysis, targeting the discovery of stable data mappings or one-time, pay-as-you-go ad hoc data mappings. The perspective here is dataspace-centric. In this paper, we propose a shift to a user-centric perspective on dataspace orchestration. We outline basic conceptual and technical challenges in supporting data analytics which is open-ended and always evolving, as users respond to new discoveries and connections

    TRAFAIR: Understanding Traffic Flow to Improve Air Quality

    Environmental impacts of traffic are of major concern throughout many European metropolitan areas. Air pollution causes 400 000 deaths per year, making it the first environmental cause of premature death in Europe. The main sources of air pollution in Europe include road traffic, domestic heating, and industrial combustion. The TRAFAIR project brings together 9 partners from two European countries (Italy and Spain) to develop innovative and sustainable services combining air quality, weather conditions, and traffic flow data to produce new information for the benefit of citizens and government decision-makers. The project started in November 2018 and lasts two years. It is motivated by the large number of deaths caused by air pollution. Nowadays, the situation is particularly critical in some member states of Europe. In February 2017, the European Commission warned five countries, including Spain and Italy, of continued air pollution breaches. In this context, public administrations and citizens suffer from the lack of comprehensive and fast tools to estimate the level of pollution on an urban scale resulting from varying traffic flow conditions; such tools would allow optimizing control strategies and increasing air quality awareness. The goals of the project are twofold: monitoring urban air quality by using sensors in 6 European cities and making urban air quality predictions using simulation models. The project is co-financed by the European Commission under the CEF TELECOM call on Open Data.

    A software processing chain for evaluating thesaurus quality

    Thesauri are knowledge models commonly used for information classification and retrieval whose structure is defined by standards that describe the main features the concepts and relations must have. However, following these standards requires a deep knowledge of the field the thesaurus is going to cover and experience in thesaurus creation. To help in this task, this paper describes a software processing chain that provides different validation components that evaluate the quality of the main thesaurus features.

    Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

    Long-term Web archives comprise Web documents gathered over longer time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, the annotation of the entire archive content on this scale is often infeasible. The most efficient way to access the documents within Web archives is provided through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and can offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of the named entity extraction from the URLs in the Popular German Web dataset and analyse the proportion of the archived URLs from 1,444 popular domains in the time interval from 2000 to 2012 to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provide a good starting point to efficiently annotate large scale collections of Web documents

    From a research project to an Information System course: a professional approach

    [EN] Nowadays, new business models are arising thanks to the development of ICT. In this context, the law is constantly being adapted to guarantee the rights of individuals. Studying topics related to legislation without considering its relation to a particular project is unattractive and generally does not motivate computer science students. However, according to reports by the Instituto Nacional de Tecnologías de la Comunicación (INTECO), a high percentage of Small and Medium Enterprises (SMEs) do not consider current legislation on issues related to ICT. For these reasons, we developed a series of guides defining behaviour protocols, based on an active computer research project oriented to SMEs. At the same time, in an Information Systems course, we tried to make computer science students aware of the need to respect the regulations in the development of any software project (part of their future career), making clear the relation between their tasks in any project of this kind and the laws and norms that should be respected during this process through the practical use of these laws. This last part is the work we present here. Lozano Albalate, MT.; Trillo Lado, R. (2015). From a research project to an Information System course: a professional approach. In 1ST INTERNATIONAL CONFERENCE ON HIGHER EDUCATION ADVANCES (HEAD' 15). Editorial Universitat Politècnica de València. 83-89. https://doi.org/10.4995/HEAd15.2015.420

    Consolidation of a professional approach experience on motivating Computer Engineering students to the application of legal issues

    [EN] In previous courses, professors of the degree of Computer Science and Software Engineering of the University of Zaragoza realised that students did not like studying subjects related to Legislation and Information Systems. However, these topics are key when computer scientists and software engineers have to analyse, design, implement, and maintain Information Systems in different environments such as an enterprise, a public entity, etc., because the rights of the users/clients of these systems must be guaranteed. So, a more appealing way to teach those topics was designed to motivate the students to take them into account. This paper describes the methodology and the main activities designed in the 2014/2015 and 2015/2016 courses in order to draw the attention of the students to topics related to the current Spanish legislation and Information Systems. Moreover, some indicators of the performance of the students and their opinions about this new methodology are also described and analysed. Lozano, M.; Trillo-Lado, R. (2016). Consolidation of a professional approach experience on motivating Computer Engineering students to the application of legal issues. In 2nd. International conference on higher education advances (HEAD'16). Editorial Universitat Politècnica de València. 295-301. https://doi.org/10.4995/HEAD16.2015.2713