102 research outputs found

    Menetelmiä mielenkiintoisten solmujen löytämiseen verkostoista (Methods for Finding Interesting Nodes in Networks)

    With the increasing amount of graph-structured data available, finding interesting objects, i.e., nodes in graphs, becomes more and more important. In this thesis we focus on finding interesting nodes and sets of nodes in graphs or networks. We propose several definitions of node interestingness as well as different methods to find such nodes. Specifically, we propose to consider nodes as interesting based on their relevance and non-redundancy or representativeness w.r.t. the graph topology, as well as based on their characterisation for a class, such as a given node attribute value. Identifying nodes that are relevant, but non-redundant to each other is motivated by the need to get an overview of different pieces of information related to a set of given nodes. Finding representative nodes is of interest, e.g. when the user needs or wants to select a few nodes that abstract the large set of nodes. Discovering nodes characteristic for a class helps to understand the causes behind that class. Next, four methods are proposed to find a representative set of interesting nodes. The first one incrementally picks one interesting node after another. The second iteratively changes the set of nodes to improve its overall interestingness. The third method clusters nodes and picks a medoid node as a representative for each cluster. Finally, the fourth method contrasts diverse sets of nodes in order to select nodes characteristic for their class, even if the classes are not identical across the selected nodes. The first three methods are relatively simple and are based on the graph topology and a similarity or distance function for nodes. For the second and third, the user needs to specify one parameter, either an initial set of k nodes or k, the size of the set. 
The fourth method assumes attributes and class attributes for each node, a class-related interestingness measure, and possible sets of nodes which the user wants to contrast, such as sets of nodes that represent different time points. All four methods are flexible and generic. They can, in principle, be applied on any weighted graph or network regardless of what nodes, edges, weights, or attributes represent. Application areas for the methods developed in this thesis include word co-occurrence networks, biological networks, social networks, data traffic networks, and the World Wide Web. As an illustrating example, consider a word co-occurrence network. There, finding terms (nodes in the graph) that are relevant to some given nodes, e.g. branch and root, may help to identify different, shared contexts such as botany, mathematics, and linguistics. A real-life application lies in biology where finding nodes (biological entities, e.g. biological processes or pathways) that are relevant to other, given nodes (e.g. some genes or proteins) may help in identifying biological mechanisms that are possibly shared by both the genes and proteins. The thesis deals with methods for network mining. Its goal is to find interesting information in weighted graphs. Text corpora, biological data, connections between people, or the Internet, for example, can be viewed as weighted graphs. In such graphs, nodes represent concepts (e.g. words, genes, people, or web pages) and edges represent the relationships between them (e.g. two words occur in the same sentence, a gene encodes a protein, friendships between people, or a web page links to another web page). Edge weights may correspond, for example, to the strength or reliability of a connection. The thesis presents various definitions of node interestingness, based on the structure of the graph or on node attributes, as well as several methods for finding interesting nodes. 
Interestingness can be defined, for example, as relevance with respect to some given nodes and, on the other hand, as the mutual dissimilarity of the interesting nodes. For example, a so-called greedy method can be used to find mutually dissimilar nodes one at a time. The results of the thesis can be applied, for instance, to a word network obtained by processing text data, in which the connection between two words is stronger the more often they tend to occur together in the same sentences. Different contexts of use of words, and even different meanings, can then be found automatically. If the target word is, say, "juuri" (Finnish for "root"), then words that are related to it but unrelated to one another include "puu" ("tree"; biological sense: the root of a plant), "yhtälö" ("equation"; mathematical sense: the solution, i.e. root, of an equation), and "indoeurooppalainen" ("Indo-European"; linguistic sense: the stem, i.e. root, of a word). Such methods can be applied, for example, in a search engine: the results for the query "juuri" are drawn from contexts of use that are as different as possible, so that the meaning intended by the user is more likely to be covered by the results. An important application area for the methods of the thesis is biological networks, in which nodes represent biological concepts (e.g. genes, proteins, or diseases) and edges represent the relationships between them (e.g. a gene encodes a protein, or a protein is active in a certain disease). The methods can be used, for example, to search for biological mechanisms that affect diseases by locating a representative set of other nodes that are connected in the network to the disease and to genes possibly related to it. These can help biologists understand possible links between genes and a disease and thus focus their further research on the most promising genes, proteins, and so on. 
The definitions of node interestingness presented in the thesis, and the methods proposed for finding such nodes, are general-purpose and can in principle be applied to any graph regardless of what the nodes, edges, or weights represent. Experiments with various graphs show that they find interesting nodes
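
The incremental (greedy) selection described in this abstract can be illustrated with a minimal sketch. The scoring rule below (relevance to the given nodes multiplied by dissimilarity to the nodes already picked) and the names `greedy_select` and `toy_similarity` are assumptions made for illustration only; the abstract does not state the thesis's actual objective function, and the similarity values are invented toy numbers.

```python
from typing import Callable, Hashable, List, Sequence

Node = Hashable


def greedy_select(
    candidates: Sequence[Node],
    query: Sequence[Node],
    sim: Callable[[Node, Node], float],
    k: int,
) -> List[Node]:
    """Pick up to k nodes one at a time: relevant to the query nodes,
    but dissimilar to the nodes picked so far (hypothetical scoring rule)."""
    selected: List[Node] = []
    remaining = [c for c in candidates if c not in query]
    while remaining and len(selected) < k:

        def score(c: Node) -> float:
            # Relevance: similarity to the least similar query node.
            relevance = min(sim(c, q) for q in query)
            # Redundancy: similarity to the closest node already selected.
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return relevance * (1.0 - redundancy)

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy symmetric similarities in [0, 1] for a word co-occurrence network.
_SIM = {
    frozenset(("root", "tree")): 0.9,
    frozenset(("root", "plant")): 0.8,
    frozenset(("root", "equation")): 0.7,
    frozenset(("root", "Indo-European")): 0.6,
    frozenset(("tree", "plant")): 0.9,
}


def toy_similarity(a: Node, b: Node) -> float:
    # Unlisted pairs are treated as weakly related.
    return 1.0 if a == b else _SIM.get(frozenset((a, b)), 0.1)


words = ["tree", "plant", "equation", "Indo-European"]
print(greedy_select(words, ["root"], toy_similarity, k=3))
# → ['tree', 'equation', 'Indo-European']
```

In this toy run "plant" is skipped because it is nearly redundant with "tree", so the three picks cover the botanical, mathematical, and linguistic contexts of "root".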

    Content Based Image Retrieval (CBIR) in Remote Clinical Diagnosis and Healthcare

    Content-Based Image Retrieval (CBIR) locates, retrieves and displays images similar to one given as a query, using a set of features. It demands accessible data in medical archives and from medical equipment, to infer meaning after some processing. A retrieved case that is similar in some sense to the target image can aid clinicians. CBIR complements text-based retrieval and improves evidence-based diagnosis, administration, teaching, and research in healthcare. It facilitates visual/automatic diagnosis and decision-making in real-time remote consultation/screening, store-and-forward tests, home care assistance and overall patient surveillance. Metrics help compare visual data and improve diagnosis. Specially designed architectures can benefit from the application scenario. CBIR use calls for file storage standardization, querying procedures, efficient image transmission, realistic databases, global availability, access simplicity, and Internet-based structures. This chapter recommends important and complex aspects required to handle visual content in healthcare. Comment: 28 pages, 6 figures, book chapter from "Encyclopedia of E-Health and Telemedicine"
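
The core retrieval step, ranking archived images by feature similarity to a query image, might be sketched as follows. This is an assumption-laden toy: images are flat, non-empty lists of 8-bit grayscale values, the only feature is a coarse intensity histogram, and the similarity metric is histogram intersection. The chapter abstract does not specify features or metrics, and the names `histogram`, `intersection`, and `retrieve` are hypothetical.

```python
from typing import Dict, List, Sequence


def histogram(pixels: Sequence[int], bins: int = 4) -> List[float]:
    """Normalised intensity histogram of a non-empty list of 0..255 values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def retrieve(query: Sequence[int], archive: Dict[str, Sequence[int]], top_k: int = 2) -> List[str]:
    """Return the names of the top_k archived images most similar to the query."""
    q = histogram(query)
    ranked = sorted(
        archive,
        key=lambda name: intersection(q, histogram(archive[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Toy "images": four pixels each.
archive = {
    "scan_a": [12, 25, 200, 220],
    "scan_b": [100, 110, 120, 130],
    "scan_c": [5, 100, 205, 250],
}
print(retrieve([10, 20, 200, 210], archive))
# → ['scan_a', 'scan_c']
```

A realistic system would use richer features (texture, shape, learned embeddings) and an index rather than a linear scan; the sketch only shows the query-by-example pattern.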

    Novelty and Diversity in Retrieval Evaluation

    Queries submitted to search engines rarely provide a complete and precise description of a user's information need. Most queries are ambiguous to some extent, having multiple interpretations. For example, the seemingly unambiguous query "tennis lessons" might be submitted by a user interested in attending classes in her neighborhood, seeking lessons for her child, looking for online video lessons, or planning to start a business teaching tennis. Search engines face the challenging task of satisfying different groups of users having diverse information needs associated with a given query. One solution is to optimize ranking functions to satisfy diverse sets of information needs. Unfortunately, existing evaluation frameworks do not support such optimization. Instead, ranking functions are rewarded for satisfying the most likely intent associated with a given query. In this thesis, we propose a framework and associated evaluation metrics that are capable of optimizing ranking functions to satisfy diverse information needs. Our proposed measures explicitly reward those ranking functions capable of presenting the user with information that is novel with respect to previously viewed documents. Our measures reflect the quality of a ranking function by taking into account its ability to satisfy diverse users submitting a query. Moreover, the task of identifying and establishing test frameworks to compare ranking functions at web scale can be tedious. One reason for this problem is the dynamic nature of the web, where documents are constantly added and updated, making it necessary for search engine developers to seek additional human assessments. Along with issues of novelty and diversity, we explore one approximate approach to compare different ranking functions by overcoming the problem of lacking complete human assessments. 
We demonstrate that our approach is capable of accurately sorting ranking functions based on their capability of satisfying diverse users, even in the face of incomplete human assessments
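
To make the idea of rewarding novelty with respect to previously viewed documents concrete, here is a minimal sketch of a novelty-aware cumulative gain. It is not taken from the thesis: the geometric discount `alpha` for repeated intents, the logarithmic rank discount, and the name `novelty_dcg` are all assumptions chosen for illustration, and each document is represented simply by the set of intents it satisfies.

```python
import math
from typing import Dict, List, Set


def novelty_dcg(ranking: List[Set[str]], alpha: float = 0.5) -> float:
    """Discounted cumulative gain in which an intent that has already been
    covered k times earlier in the ranking only contributes (1 - alpha) ** k."""
    times_seen: Dict[str, int] = {}
    total = 0.0
    for rank, intents in enumerate(ranking, start=1):
        gain = sum((1.0 - alpha) ** times_seen.get(i, 0) for i in intents)
        total += gain / math.log2(rank + 1)
        for i in intents:
            times_seen[i] = times_seen.get(i, 0) + 1
    return total


# Two rankings of the same three documents for the query "tennis lessons".
redundant_first = [{"local classes"}, {"local classes"}, {"online videos"}]
diverse_first = [{"local classes"}, {"online videos"}, {"local classes"}]

print(novelty_dcg(redundant_first) < novelty_dcg(diverse_first))
# → True
```

Under this toy measure the ranking that surfaces a second intent early scores higher, whereas a measure that only counts relevant documents would score the two rankings identically.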

    An Investigation of Digital Reference Interviews: A Dialogue Act Approach

    The rapid increase of computer-mediated communications (CMCs) in various forms such as micro-blogging (e.g. Twitter), online chatting (e.g. digital reference) and community-based question-answering services (e.g. Yahoo! Answers) characterizes a recent trend in web technologies, often referred to as the social web. This trend highlights the importance of supporting linguistic interactions in people's online information-seeking activities in daily life, something that web search engines still lack because of the complexity of this human behavior. The presented research consists of an investigation of the information-seeking behavior of digital reference services through analysis of discourse semantics, called dialogue acts, and experiments in the automatic identification of dialogue acts using machine-learning techniques. The data was an online chat reference transaction archive, provided by the Online Computer Library Center (OCLC). Findings of the discourse analysis include supporting evidence for some of the existing theories of information-seeking behavior. They also suggest a new way of analyzing the progress of information-seeking interactions using dialogue act analysis. The machine learning experiments produced promising results and demonstrated the possibility of practical applications of the DA analysis for further research across disciplines

    NASA RECON: Course Development, Administration, and Evaluation

    The R and D activities addressing the development, administration, and evaluation of a set of transportable, college-level courses to educate science and engineering students in the effective use of automated scientific and technical information storage and retrieval systems, and, in particular, in the use of the NASA RECON system, are discussed. The long-range scope and objectives of these contracted activities are overviewed and the progress which has been made toward these objectives during FY 1983-1984 is highlighted. In addition, the results of a survey of 237 colleges and universities addressing course needs are presented

    Explicit web search result diversification

    Queries submitted to a web search engine are typically short and often ambiguous. With the enormous size of the Web, a misunderstanding of the information need underlying an ambiguous query can misguide the search engine, ultimately leading the user to abandon the originally submitted query. In order to overcome this problem, a sensible approach is to diversify the documents retrieved for the user's query. As a result, the likelihood that at least one of these documents will satisfy the user's actual information need is increased. In this thesis, we argue that an ambiguous query should be seen as representing not one, but multiple information needs. Based upon this premise, we propose xQuAD (Explicit Query Aspect Diversification), a novel probabilistic framework for search result diversification. In particular, the xQuAD framework naturally models several dimensions of the search result diversification problem in a principled yet practical manner. To this end, the framework represents the possible information needs underlying a query as a set of keyword-based sub-queries. Moreover, xQuAD accounts for the overall coverage of each retrieved document with respect to the identified sub-queries, so as to rank highly diverse documents first. In addition, it accounts for how well each sub-query is covered by the other retrieved documents, so as to promote novelty, and hence penalise redundancy, in the ranking. The framework also models the importance of each of the identified sub-queries, so as to appropriately cater for the interests of the user population when diversifying the retrieved documents. Finally, since not all queries are equally ambiguous, the xQuAD framework caters for the ambiguity level of different queries, so as to appropriately trade off relevance for diversity on a per-query basis. The xQuAD framework is general and can be used to instantiate several diversification models, including the most prominent models described in the literature. 
In particular, within xQuAD, each of the aforementioned dimensions of the search result diversification problem can be tackled in a variety of ways. In this thesis, as additional contributions besides the xQuAD framework, we introduce novel machine learning approaches for addressing each of these dimensions. These include a learning to rank approach for identifying effective sub-queries as query suggestions mined from a query log, an intent-aware approach for choosing the ranking models most likely to be effective for estimating the coverage and novelty of multiple documents with respect to a sub-query, and a selective approach for automatically predicting how much to diversify the documents retrieved for each individual query. In addition, we perform the first empirical analysis of the role of novelty as a diversification strategy for web search. As demonstrated throughout this thesis, the principles underlying the xQuAD framework are general, sound, and effective. In particular, to validate the contributions of this thesis, we thoroughly assess the effectiveness of xQuAD under the standard experimentation paradigm provided by the diversity task of the TREC 2009, 2010, and 2011 Web tracks. The results of this investigation demonstrate the effectiveness of our proposed framework. Indeed, xQuAD attains consistent and significant improvements in comparison to the most effective diversification approaches in the literature, and across a range of experimental conditions, comprising multiple input rankings, multiple sub-query generation and coverage estimation mechanisms, as well as queries with multiple levels of ambiguity. Altogether, these results corroborate the state-of-the-art diversification performance of xQuAD
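
The four dimensions named in this abstract (sub-query importance, document coverage, novelty, and a per-query relevance/diversity trade-off) can be combined in a greedy re-ranker along the following lines. This is a sketch under stated assumptions, not the thesis's implementation: the abstract gives no formula, so the linear mixture controlled by `lam`, the function name `diversify`, and all probability values are illustrative choices.

```python
from typing import Dict, List


def diversify(
    relevance: Dict[str, float],            # estimated relevance of each document to the query
    importance: Dict[str, float],           # estimated importance of each sub-query
    coverage: Dict[str, Dict[str, float]],  # coverage[s][d]: how well document d covers sub-query s
    lam: float,                             # 0 = relevance only, 1 = diversity only
    k: int,
) -> List[str]:
    """Greedily build a ranking that mixes relevance with coverage of
    sub-queries not yet satisfied by the documents already selected."""
    selected: List[str] = []
    remaining = set(relevance)
    # not_covered[s]: estimated probability that no selected document satisfies s.
    not_covered = {s: 1.0 for s in importance}
    while remaining and len(selected) < k:

        def score(d: str) -> float:
            diversity = sum(
                importance[s] * coverage[s].get(d, 0.0) * not_covered[s]
                for s in importance
            )
            return (1.0 - lam) * relevance[d] + lam * diversity

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
        for s in importance:
            not_covered[s] *= 1.0 - coverage[s].get(best, 0.0)
    return selected


relevance = {"d1": 0.9, "d2": 0.7, "d3": 0.5}
importance = {"s1": 0.6, "s2": 0.4}
coverage = {"s1": {"d1": 0.9, "d2": 0.9}, "s2": {"d3": 0.9}}

print(diversify(relevance, importance, coverage, lam=0.0, k=3))  # → ['d1', 'd2', 'd3']
print(diversify(relevance, importance, coverage, lam=0.5, k=3))  # → ['d1', 'd3', 'd2']
```

With `lam=0.5`, `d3` overtakes `d2` once `d1` has been selected, because `d2` mostly repeats the sub-query that `d1` already covers while `d3` covers the other one.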

    Query engine of novelty in video streams

    Prior research on novelty detection has primarily focused on algorithms to detect novelty for a given application domain. Effective storage, indexing and retrieval of novel events (beyond detection) are largely ignored as a problem in itself. In light of the recent advances in counter-terrorism efforts and link discovery initiatives, the need for effective data management of novel events assumes apparent importance. Automatically detecting novel events in video data streams is an extremely challenging task. The aim of this thesis is to provide evidence that the notion of novelty in video as perceived by a human is extremely subjective and therefore algorithmically ill-defined. Though it comes as no surprise that current machine-based parametric learning systems that try to accurately mimic human novelty perception are far from perfect, such systems have recently been very successful in exhaustively capturing novelty in video once the novelty function is well defined by a human expert. So, how truly effective are these machine-based novelty detection systems as compared to human novelty detection? In this paper we outline an experimental evaluation of human versus machine-based novelty systems in terms of qualitative performance. We then quantify this evaluation using a variety of metrics based on location of novel events, number of novel events found in the video, etc. We begin by describing a machine-based system for detecting novel events in video data streams. We then discuss the issues of designing an indexing strategy, or Manga (a comic-book representation is termed manga in Japanese), to effectively determine the most representative novel frames for a video sequence. We then evaluate the performance of the machine-based novelty detection system against human novelty detection and present the results. 
The distance metrics we suggest for novelty comparison may eventually aid a variety of end-users to effectively drive the indexing, retrieval and analysis of large video databases. It should also be noted that the techniques we describe in this paper are based on low-level features extracted from video such as color, intensity and focus of attention. The video processing component does not include any semantic processing such as object detection in video for this framework. We conjecture that such advances, though beyond the scope of this particular paper, would undoubtedly benefit the machine-based novelty detection systems and experimentally validate this. We believe that developing a novelty detection system that works in conjunction with the human expert will lead to a more user-centered data mining approach for such domains. JPEG 2000 is a new method of compressing images better than other image formats such as JPEG, GIF, PNG, etc. The main reason this format merits investigation is that it allows metadata to be embedded within the image itself. The types of data can essentially be anything such as text, audio, video, images, etc. Currently image annotations are stored and collected side by side. Even though this method is very common, it brings up a lot of risks and flaws. Imagine if medical images were annotated by doctors to describe a tumor within the brain, then suddenly some of the annotations are lost. Without these annotations, the images themselves would be useless. Embedding these annotations within the image will guarantee that the description and the image will never be separated. The metadata embedded within the image has no influence on the image itself. In this thesis we initially develop a metric to index novelty by comparing it to traditional indexing techniques and to human perception. 
In the second phase of this thesis, we will investigate the new emerging technology of JPEG 2000 and show that novelty stored in this format will outperform traditional image structures. One of the contributions this thesis is making is to develop metrics to measure the performance and quality between the query results of JPEG 2000 and traditional image formats. Since JPEG 2000 is a new technology, there are no existing metrics to measure this type of performance with traditional images

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information, whether provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although substantial progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss the recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction