    Web Page Segmentation for Non Visual Skimming

    Get PDF

    Web page segmentation aims to break a page into smaller blocks in which content with coherent semantics is kept together. Examples of tasks targeted by such a technique are advertisement detection and main content extraction. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we consider web page segmentation as a clustering problem over visual elements, where (1) all visual elements must be clustered, (2) a fixed number of clusters must be discovered, and (3) the elements of a cluster should be visually connected. We therefore study three algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. Evaluation shows that Guided Expansion achieves statistically significant results in terms of compactness and separateness, and satisfies more logical constraints than the other strategies.
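
    The abstract frames segmentation as constrained clustering. Below is a minimal sketch of the simplest of the three strategies, assuming each visual element has already been rendered to a bounding box; plain K-means covers constraints (1) and (2) but not the visual-connectivity constraint (3) that motivates Guided Expansion.

```python
# A minimal sketch (not the paper's implementation) of casting web page
# segmentation as clustering of visual elements, assuming each element is
# reduced to the centre of its rendered bounding box.
import numpy as np
from sklearn.cluster import KMeans

def segment_elements(bounding_boxes, n_blocks):
    """Cluster visual elements into a fixed number of page blocks.

    bounding_boxes: sequence of (x, y, width, height) for each element.
    n_blocks: the fixed number of clusters the skimming interface needs.
    """
    boxes = np.asarray(bounding_boxes, dtype=float)
    # Represent every element by its centre point so all elements are clustered.
    centres = boxes[:, :2] + boxes[:, 2:] / 2.0
    # Plain K-means satisfies constraints (1) and (2) but not (3): nothing
    # forces the members of a cluster to be visually connected, which is
    # what the Guided Expansion strategy addresses.
    labels = KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(centres)
    return labels

# Example: four elements grouped into two blocks (top of page vs. bottom).
print(segment_elements([(0, 0, 100, 20), (0, 30, 100, 20),
                        (0, 600, 100, 20), (0, 630, 100, 20)], n_blocks=2))
```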

    Verkkosivuelementtien luokittelu koneoppimisen avulla (Classification of Web Page Elements Using Machine Learning)

    Get PDF
    Basic image segmentation is a fairly simple task for human beings, and even young children can accomplish it naturally, but for machines it can be a burdensome and difficult task. Segmenting large numbers of documents manually is a rather labour-intensive exercise, and much productivity could be gained if machines were automated to perform the routine segmentation and classification tasks. The website hosting company Suomen Hostingpalvelu Oy is transitioning from its old website builder software to a new in-house developed site builder, and faced the problem of how to let users effortlessly move their old websites from the old site builder to the new one. This thesis explores a solution to this problem based on the fact that the new site builder software uses semantic building blocks to construct a website. By identifying the semantic parts present on a given website through machine learning, we can provide the corresponding building blocks for site transitioning in the new site builder. In this thesis, a novel way of segmenting web pages into their semantic parts is presented. This is accomplished by building a prototype which parses a given website, gathers the relevant features of the site's web elements, and captures images of each web element. The gathered data is used to create training and testing data sets with which a machine learning model is trained to classify website segments. Three different machine learning algorithms, random forests, gradient boosting machines, and neural networks, are examined and tested. After cross-validation, the highest classification accuracy achieved by the trained model was a competent 81%, allowing the prototype to be used in production at Hostingpalvelu. Finally, we explore ideas for future research and for the improvement of the prototype.
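
    The training and evaluation step described above can be illustrated with a short sketch. This is not the thesis code; it assumes per-element features (position, size, tag encodings, text length, and so on) have already been extracted into a numeric matrix, and the feature and label arrays below are hypothetical stand-ins.

```python
# A minimal sketch of cross-validated classification of web page elements,
# assuming features have already been extracted; data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 12))            # stand-in for extracted element features
y = rng.integers(0, 4, size=200)     # stand-in for semantic labels (header, nav, ...)

model = RandomForestClassifier(n_estimators=300, random_state=0)
# Cross-validated accuracy, mirroring the evaluation protocol in the thesis.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.2f}")
```

    Gradient boosting machines and a neural network would be evaluated the same way, swapping in the corresponding estimator.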

    Hoodsquare: Modeling and Recommending Neighborhoods in Location-based Social Networks

    Full text link
    Information garnered from activity on location-based social networks can be harnessed to characterize urban spaces and organize them into neighborhoods. In this work, we adopt a data-driven approach to the identification and modeling of urban neighborhoods using location-based social networks. We represent geographic points in the city using spatio-temporal information about Foursquare user check-ins and semantic information about places, with the goal of developing features to input into a novel neighborhood detection algorithm. The algorithm first employs a similarity metric that assesses the homogeneity of a geographic area, and then, with a simple mechanism of geographic navigation, it detects the boundaries of a city's neighborhoods. The models and algorithms devised are subsequently integrated into a publicly available, map-based tool named Hoodsquare that allows users to explore activities and neighborhoods in cities around the world. Finally, we evaluate Hoodsquare in the context of a recommendation application where user profiles are matched to urban neighborhoods. By comparing with a number of baselines, we demonstrate how Hoodsquare can be used to accurately predict the home neighborhood of Twitter users. We also show that we are able to suggest neighborhoods geographically constrained in size, a desirable property in mobile recommendation scenarios for which geographical precision is key.
    Comment: ASE/IEEE SocialCom 201
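
    The neighborhood detection described above starts from a similarity metric over geographic areas. The sketch below is not Hoodsquare's actual metric; it shows one plausible stand-in, comparing the Foursquare place-category profiles of two adjacent grid cells with cosine similarity.

```python
# A minimal stand-in for an area-homogeneity metric: two cells are similar
# when their distributions of place categories are similar.
import numpy as np

def category_histogram(checkins, categories):
    """Normalised histogram of check-in categories for one grid cell."""
    counts = np.array([checkins.count(c) for c in categories], dtype=float)
    total = counts.sum()
    return counts / total if total else counts

def area_similarity(cell_a, cell_b, categories):
    """Cosine similarity between two cells' category profiles (0..1)."""
    a = category_histogram(cell_a, categories)
    b = category_histogram(cell_b, categories)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

cats = ["coffee", "bar", "office", "park"]
print(area_similarity(["coffee", "bar", "bar"], ["bar", "coffee", "coffee"], cats))
```

    A boundary-detection pass would then grow a neighborhood outward from a seed cell while this similarity stays above a threshold.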

    Concurrent Speech Synthesis to Improve Document First Glance for the Blind

    Get PDF
    Skimming and scanning are two well-known reading processes, which are combined to access document content as quickly and efficiently as possible. While both are available in visual reading mode, they are difficult to use in non-visual environments because they mainly rely on typographical and layout properties. In this article, we introduce the concept of a tag thunder as a way (1) to achieve an oral transposition of the Web 2.0 concept of the tag cloud and (2) to produce an innovative interactive stimulus for observing the emergence of self-adapted strategies for non-visual skimming of written texts. We first present our general and theoretical approach to the problem of fast, global, non-visual access to web content while browsing; we then detail the progress of development and evaluation of the various components that make up our software architecture. We start from the hypothesis that the semantics of the visual architecture of web pages can be transposed into new sensory modalities through three main steps (web page segmentation, keyword extraction, and sound spatialization). We note the difficulty of simultaneously (1) evaluating a modular system as a whole at the end of the processing chain and (2) identifying, at the level of each software component, the exact origin of its limits; despite this issue, the results of the first evaluation campaign seem promising.
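
    The pipeline above names three steps, and the middle one lends itself to a short illustration. The sketch below assumes the page has already been segmented into text blocks and uses plain TF-IDF as a stand-in for the paper's own keyword extractor.

```python
# A minimal sketch of per-block keyword extraction: TF-IDF picks the most
# salient terms of each page block, which could then be spatialized as audio.
from sklearn.feature_extraction.text import TfidfVectorizer

def block_keywords(block_texts, top_k=3):
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(block_texts)          # one row per page block
    terms = vec.get_feature_names_out()
    keywords = []
    for row in tfidf.toarray():
        top = row.argsort()[::-1][:top_k]
        keywords.append([terms[i] for i in top if row[i] > 0])
    return keywords

blocks = ["breaking news headlines from the politics desk",
          "navigation home contact subscribe",
          "weather forecast rain temperature wind"]
print(block_keywords(blocks))
```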

    Learning Object Categories From Internet Image Searches

    Get PDF
    In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, as required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search results; and second, to recognize objects in other image data sets.
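
    As a concrete illustration of the technique the abstract builds on, here is a minimal pLSA sketch in NumPy (not the authors' code), treating images as bags of visual words; the inferred per-image topic mixture is the quantity the reranking application would sort by. The count matrix below is synthetic.

```python
# A minimal pLSA fitted by EM: P(w|d) = sum_z P(w|z) P(z|d).
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts: (n_images, n_words) visual-word histogram matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics));  p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each (image, word) pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, Z, W)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        weighted = counts[:, None, :] * joint                 # n(d,w) * P(z|d,w)
        # M-step: re-estimate P(w|z) and P(z|d) from the responsibilities.
        p_w_z = weighted.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Reranking: images whose dominant topic matches the query topic come first.
counts = np.random.default_rng(1).integers(0, 5, size=(30, 100))
_, p_z_d = plsa(counts, n_topics=2)
order = np.argsort(-p_z_d[:, 0])    # ranking with topic 0 as the "query" topic
print(order[:5])
```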

    Preprocessing for Images Captured by Cameras

    Get PDF
