26,006 research outputs found

    TERM WEIGHTING BASED ON INDEX OF GENRE FOR WEB PAGE GENRE CLASSIFICATION

    Automating the identification of the genre of web pages has become an important area in web page classification, as it can be used to improve the quality of web search results and to reduce search time. To index the terms used in classification, the weighting generally selected is the document-based TF-IDF. However, this method does not consider genre, even though web page documents have a type of categorization called genre. Given the existence of genre, a term appearing often in one genre should be more significant in document indexing than a term appearing frequently in many genres, despite the latter's high TF-IDF value. We propose a new weighting method for web page document indexing called inverse genre frequency (IGF). This method is based on genre, a manual categorization done semantically in previous research. Experimental results show that term weighting based on index of genre (TF-IGF) performed better than term weighting based on index of document (TF-IDF). The highest values of accuracy, precision, recall, and F-measure were 78%, 80.2%, 78%, and 77.4% respectively when the genre-specific keywords were excluded, and 78.9%, 78.7%, 78.9%, and 78.1% respectively when they were included
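The abstract does not give the IGF formula. As a minimal sketch, assume it mirrors IDF with genres in place of documents, that is, IGF(t) = log(G / gf(t)), where G is the number of genres and gf(t) is the number of genres in which term t occurs. The function name `tf_igf` and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tf_igf(docs, genres):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    genres -- parallel list of genre labels (one per document)

    Assumption: IGF(t) = log(G / gf(t)), by analogy with IDF.
    """
    # For each term, collect the set of genres in which it occurs at least once.
    genres_with_term = {}
    for tokens, genre in zip(docs, genres):
        for term in set(tokens):
            genres_with_term.setdefault(term, set()).add(genre)

    n_genres = len(set(genres))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({
            term: count * math.log(n_genres / len(genres_with_term[term]))
            for term, count in tf.items()
        })
    return weights


# Toy corpus: two "shop" pages, one "news" page, one "blog" page.
docs = [
    ["buy", "cart", "price", "home"],
    ["price", "cart", "home"],
    ["news", "report", "home"],
    ["blog", "post", "home"],
]
genres = ["shop", "shop", "news", "blog"]
weights = tf_igf(docs, genres)
```

In this toy corpus, "home" occurs in every genre and so receives weight 0, while "cart" occurs in only one of three genres and receives log 3. Under document-based IDF, "cart" occurs in two of four documents and would receive the smaller factor log 2, which illustrates how a term concentrated in one genre gains weight under the assumed genre-based scheme.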

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    SLIS Student Research Journal, Vol.7, Iss.1


    Betwixt and Between Past and Present: Cultural and Generic Hybridity in the Fiction of Mary Yukari Waters

    The cosmopolitan make-up of American society has yielded culturally hybrid offspring, and this cultural hybridity features strongly in contemporary American fiction. Amy Tan and Mary Yukari Waters are both Asian-Americans who portray such hybridity in their short stories, which depict the shifting identities of the self. But do the internal categories of gender, race and ethnicity help in the coherence or do they add to the fragmentation of diverse identities? This study traces the dynamics of this critique of multiple identification and hybrid cultures, and how it is reflected in narrative responses to such conditions of the examination of the self and, on a broader scale, community. The fiction of both writers usually ends up with a metaphysical human aspiration that retains the past, holds on to the present and looks forward to the hidden joys of the future. Betwixt and between was a term coined by Victor Turner to describe those who are culturally both and neither, that is, they stand at a liminal (border) stage that should be a temporary state but, in certain cases, has become a permanent one. Waters and Tan fulfill Will Kymlicka's exemplary mode of multiculturalism. Kymlicka does not want ethnic Americans to separate their conflicting identities in order to fit in. They should not bring their lifestyles to conform to various codes; instead, they should have the freedom of multiple identification in whichever place and with whichever group, be it a minority or mainstream. Multiple identification has been a blessing for both writers, as both treat it as two alternate worlds, resorting to one when they are fed up with the other. Two short stories were chosen for each writer: Rationing and Aftermath for Mary Yukari Waters; The Moon Lady and A Pair of Tickets for Amy Tan. 
Cultural hybridity is clear in their appreciation of their ancestors' stoicism, wisdom and guidance on the one hand, and in their willingness to take in American cultural traits on the other. Generic hybridity is exemplified in the interpenetration of the historic, the mythic and the symbolic. The history of China and Japan during World War II is constantly conjured up and the present and the past are intermingled through the workings of memory. The mythic has a powerful presence in the texts of both writers given the influence of the myth in their Eastern spiritual cultures. Names and actions acquire a symbolic significance which adds richness in meaning to the texts. The two writers, moreover, add a touch of folklore to stress their Asian origin and to prove the fact that (multicultural) society and (hybrid) culture must have their influence apparent in all literary texts

    What is the influence of genre during the perception of structured text for retrieval and search?

    This thesis presents an investigation into the high value of structured text (or form) in the context of genre within Information Retrieval. In particular, how are these structured texts perceived and why are they not more heavily used within Information Retrieval & Search communities? The main motivation is to show the features in which people can exploit genre within Information Search & Retrieval, in particular, categorisation and search tasks. To do this, it was vital to record and analyse how and why this was done during typical tasks. The literature review highlighted two previous studies (Toms & Campbell 1999a; Watt 2009) which have reported pilot studies consisting of genre categorisation and information searching. Both studies and other findings within the literature review inspired the work contained within this thesis. Genre is notoriously hard to define, but a very useful framework of Purpose and Form, developed by Yates & Orlikowski (1992), was utilised to design two user studies for the research reported within the thesis. The two studies consisted of, first, a categorisation task (e-mails), and second, a set of six simulated situations in Wikipedia, both of which collected quantitative data from eye tracking experiments as well as qualitative user data. The results of both studies showed the extent to which the participants utilised the form features of the stimuli presented, in particular, how these were used, which ocular behaviours (skimming or scanning) and actual features were used, and which were the most important. 
The main contributions to research made by this thesis were, first of all, that the task-based user evaluations employing simulated search scenarios revealed how and why users make decisions while interacting with the textual features of structure and layout within a discourse community, and, secondly, an extensive evaluation of the quantitative data revealed the features that were used by the participants in the user studies and the effects of the interpretation of genre in the search and categorisation process as well as the perceptual processes used in the various communities. This will be of benefit for the re-development of information systems. As far as is known, this is the first detailed and systematic investigation into the types of features, value of form, perception of features, and layout of genre using eye tracking in online communities, such as Wikipedia

    CC-interop : COPAC/Clumps Continuing Technical Cooperation. Final Project Report

    As far as is known, CC-interop was the first project of its kind anywhere in the world and still is. Its basic aim was to test the feasibility of cross-searching between physical and virtual union catalogues, using COPAC and the three functioning "clumps" or virtual union catalogues (CAIRNS, InforM25, and RIDING), all funded or part-funded by JISC in recent years. The key issues investigated were technical interoperability of catalogues, use of collection level descriptions to search union catalogues dynamically, quality of standards in cataloguing and indexing practices, and usability of union catalogues for real users. The conclusions of the project were expected to, and indeed do, contribute to the development of the JISC Information Environment and to the ongoing debate as to the feasibility and desirability of creating a national UK catalogue. They also inhabit the territory of collection level descriptions (CLDs) and the wider services of JISC's Information Environment Services Registry (IESR). The results of this project will also have applicability for the common information environment, particularly through the landscaping work done via SCONE/CAIRNS. This work is relevant not just to HE and not just to digital materials, but encompasses other sectors and domains and caters for print resources as well. Key findings are thematically grouped as follows: System performance when inter-linking COPAC and the Z39.50 clumps. The various individual Z39.50 configurations permit technical interoperability relatively easily but only limited semantic interoperability is possible. Disparate cataloguing and indexing practices are an impairment to semantic interoperability, not just for catalogues but also for CLDs and descriptions of services (like those constituting JISC's IESR). 
Creating dynamic landscaping through CLDs: routines can be written to allow collection description databases to be output in formats that other UK users of CLDs, including developers of the JISC information environment, can use. Searching a distributed (virtual) catalogue or clump via Z39.50: use of Z39.50 to Z39.50 middleware permits a distributed catalogue to be searched via Z39.50 from such disparate user services as another virtual union catalogue or clump, a physical union catalogue like COPAC, an individual Z client and other IE services. The breakthrough in this Z39.50 to Z39.50 conundrum came with the discovery that the JISC-funded JAFER software (a result of the 5/99 programme) meets many of the requirements and can be used by the current clumps services. It is technically possible for the user to select all or a sub-set of available end destination Z39.50 servers (we call this "landscaping") within this middleware. Comparing results processing between COPAC and clumps. Most distributed services (clumps) do not bring back complete results sets from associated Z servers (in order to save time for users). COPAC on-the-fly routines could feasibly be applied to the clumps services. An automated search set up to repeat its query of 17 catalogues in a clump (InforM25) hourly over nearly 3 months returned surprisingly good results; for example, over 90% of responses were received in less than one second, and no servers showed slower response times in periods of traditionally heavy OPAC use (mid-morning to early evening). User behaviour when cross-searching catalogues: the importance to users of a number of on-screen features, including the ability to refine a search and clear indication that a search is processing. The importance to users of information about the availability of an item as well as the holdings data. 
The impact of search tools such as Google and Amazon on user behaviour and the expectations of more information than is normally available from a library catalogue. The distrust of some librarians interviewed of the data sources in virtual union catalogues, thinking that there was not true interoperability

    All the World's Computer Games

    Video games are a well-established and expanding medium. There is a notable lack of a classification system that is accurate and quantified; not much research has been done towards the creation of a classification system. This work has addressed this problem, and is progressing towards creating video game classification systems that are precise, quantified, and data-driven