229 research outputs found

    Fine Grained Approach for Domain Specific Seed URL Extraction

    Domain-specific search engines are expected to provide relevant search results. The availability of an enormous number of URLs across subdomains improves the relevance of domain-specific search engines. Current methods for selecting seed URLs could be more systematic in ensuring that subdomains are represented. We propose a fine-grained approach for automatically extracting seed URLs at the subdomain level, using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for the 'Security - Information and Cyber' domain and identified 34,007 seed URLs and 400,726 URLs across subdomains. The measured Diversity Index value of 2.10 confirms that all subdomains are represented; hence, a relevant 'Security Search Engine' can be built. Our approach also extracted more URLs (seed and child) than existing approaches for URL extraction.

    How people find videos

    At present very little is known about how people locate and view videos 'in the wild'. This study draws a rich picture of everyday video seeking strategies and video information needs, based on an ethnographic study of New Zealand university students. These insights into the participants' activities and motivations suggest potentially useful facilities for a video digital library

    Finding video on the web

    At present very little is known about how people locate and view videos. This study draws a rich picture of everyday video seeking strategies and video information needs, based on an ethnographic study of New Zealand university students. These insights into the participants’ activities and motivations suggest potentially useful facilities for a video digital library

    MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

    With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in Memento aggregators. A memento is a past version of a web page and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator’s URI requests. Additionally, we use full text search (when available) or sample URI lookups to build an understanding of an archive’s holdings. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes. For evaluation we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive’s Wayback Machine, UK Web Archive, Arquivo.pt, LANL’s Memento Proxy, and ODU’s MemGator Server. In addition, we utilized a historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of the URIs with less than 10% relative cost, without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile. 
We created MementoMap, a framework that allows web archives and third parties to express holdings and/or voids of an archive of any size with varying levels of detail to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined, for each policy, the maximum depth of host and path segments of URIs that are used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to roll up URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access log and found that we could have avoided more than 8% of the false positives without introducing any false negatives. We defined a routing score that can be used for Memento routing. Using a cut-off threshold technique on our routing score, we achieved over 96% accuracy if we accept about 89% recall; for a recall of 99% we managed to get about 68% accuracy, which translates to about 72% saving in wasted lookup requests in our Memento aggregator. 
Moreover, when using top-k archives based on our routing score for routing and choosing only the topmost archive, we missed only about 8% of the sample URIs that are present in at least one archive, but when we selected top-2 archives, we missed less than 2% of these URIs. We also evaluated a machine learning-based routing approach, which resulted in an overall better accuracy, but poorer recall due to low prevalence of the sample lookup URI dataset in different web archives. We contributed various algorithms, such as a space- and time-efficient approach to ingest large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), InterPlanetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file format specification draft called Unified Key Value Store (UKVS) that we use for serialization and dissemination of MementoMaps. It is a flexible and extensible file format that allows easy interactions with Unix text processing tools. UKVS can be used in many applications beyond MementoMaps.
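
The static profiling idea in this abstract (URI keys truncated at a fixed host and path depth, with a wildcard marking truncation) can be illustrated with a minimal Python sketch. The key layout used here (reversed host labels, `)` before the path, `*` as the wildcard) and the names `uri_key`, `build_profile` and `may_hold` are assumptions made for illustration; they are not taken from the MementoMap specification.

```python
from urllib.parse import urlsplit


def uri_key(uri, max_host=3, max_path=1):
    """Return a truncated lookup key for a URI (hypothetical key layout)."""
    parts = urlsplit(uri)
    # Reverse the host labels so related hosts share a common prefix.
    labels = (parts.hostname or "").lower().split(".")[::-1]
    segments = [s for s in parts.path.split("/") if s]
    if len(labels) > max_host:
        # Host truncated: keep the leading labels and mark the cut with '*'.
        return ",".join(labels[:max_host]) + ",*"
    key = ",".join(labels) + ")"
    path = "".join("/" + s for s in segments[:max_path])
    if len(segments) > max_path:
        # Path truncated: mark the cut with '*'.
        return key + path + "/*"
    return key + (path or "/")


def build_profile(archived_uris, max_host=3, max_path=1):
    """Summarize an archive's holdings as a set of truncated keys."""
    return {uri_key(u, max_host, max_path) for u in archived_uris}


def may_hold(profile, uri, max_host=3, max_path=1):
    """True if the archive may hold the URI; False means it is not held."""
    return uri_key(uri, max_host, max_path) in profile


profile = build_profile([
    "https://example.com/news/2019/a.html",
    "https://example.com/about",
])
print(may_hold(profile, "https://example.com/news/2020/b.html"))  # True
print(may_hold(profile, "https://example.com/sports/x/y"))        # False
```

Because every archived URI maps to a key that is in the profile, a lookup under the same policy cannot produce a false negative. The first lookup above is a false positive: the truncated key matches although that exact URI was never archived, which is the cost of a smaller profile.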

    Topic modelling of Finnish Internet discussion forums as a tool for trend identification and marketing applications

    The increasing availability of public discussion text data on the Internet motivates the study of methods to identify current themes and trends. Being able to extract and summarize relevant information from public data in real time gives rise to competitive advantage and to applications in a company's marketing actions. This thesis presents a method of topic modelling and trend identification to extract information from Finnish Internet discussion forums. The development of text analytics, and especially of topic modelling techniques, is reviewed and suitable methods are identified from the literature. The Latent Dirichlet Allocation topic model and the Dynamic Topic Model are applied to find underlying topics in the Internet discussion forum data. The collection of discussion data with web scraping and the text data preprocessing methods are presented. Trends are identified with a method derived from outlier detection. Real-world events, such as the news about the Finnish army's vegetarian meal day and the Helsinki summit of presidents Trump and Putin, were identified in an unsupervised manner. Applications for marketing are considered, e.g. automatic search engine advert keyword generation and website content recommendation. Future prospects for further improving the developed topical trend identification method are proposed. These include the use of more complex topic models, an extensive framework for tuning trend identification parameters, and studying the use of more domain-specific text data sources such as blogs, social media feeds or customer feedback.
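
The abstract does not say which outlier detection method the trend identification is derived from. As one possible reading, the Python sketch below flags a time step as trending when a topic's share of the discussion rises well above its recent history, using a rolling z-score. The function name `trending_points`, the window length and the threshold are all assumptions.

```python
from statistics import mean, stdev


def trending_points(series, window=4, threshold=3.0):
    """Return indices where a value is an upward outlier vs. the preceding window."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        mu, sigma = mean(history), stdev(history)
        # Skip flat histories, where the z-score is undefined.
        if sigma > 0 and (series[t] - mu) / sigma > threshold:
            flagged.append(t)
    return flagged


# Hypothetical weekly share of forum posts assigned to one topic.
topic_share = [0.10, 0.11, 0.09, 0.10, 0.30, 0.11]
print(trending_points(topic_share))  # [4]
```

Only upward deviations are flagged, since a trend here means a topic suddenly gaining attention. A spike also inflates the spread of the windows that follow it, so the points right after a trend are less likely to be flagged.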

    Study of the long tail formation within an eWOM community. The case of Ciao UK

    Continuous communication among people and ubiquitous online access are fundamental characteristics of online eWOM communities, which facilitate the distribution of a broad range of products and services. eWOM communities have emerged to influence customers directly and to create interest with efficacy and flexibility regardless of geographic boundaries (Duan, Gu and Whinston 2008). They provide rich and objective product information that influences customers’ decision making (Gu, Tang and Whinston 2013, Kim and Gupta 2009, Zacharia, Moukas and Maes 2000), owing to the credibility, empathy and relevance they offer to customers compared with the information provided by marketer-designed websites (Bickart and Schindler 2001). Through eWOM, users can freely post reviews of any product or service and share those reviews with other users in order to better understand a product (Hennig-Thurau, et al. 2004). Thus, through eWOM communities, a large audience of users can learn from reviews about products and services that are less popular with the majority. In that respect, the distribution of product sales is changing because more product information is available to consumers (Brynjolfsson, Hu, & Smith, 2010), which facilitates the long tail phenomenon (Anderson 2004). Many authors have explained the main idea behind the long tail in sales distributions in product markets such as Amazon (Brynjolfsson, Hu, & Smith, 2003; Brynjolfsson, Hu, & Smith, 2010). This Thesis goes further: it applies new methodologies (the elbow criterion) and extends others (the power-law distribution of Clauset, Shalizi and Newman (2009)) to measure the long tail mathematically in other environments, such as the eWOM community Ciao. 
Whereas most eWOM studies focus only on the potential of eWOM to facilitate the long tail effect in finding rare or niche products (Hennig-Thurau, Gwinner, Walsh, & Gremler, 2004; Khammash & Griffiths, 2011) and on how eWOM enables zero-cost dissemination of information about products (Odić, Tkalčič, Tasič, & Košir, 2013), few have noticed that the impact may differ for each product type in the tail of the sales distribution. In this regard, the results of this Thesis might indicate that vendors could adopt alternative product strategies depending on which niche product type (search or experience good) forms the tail of the sales distribution. More specifically, this Thesis proposes an approach for detecting whether there is a long tail for each product type, so that cases in which niche products represent a significant portion of overall product sales can be differentiated. Likewise, given the volume of user-generated content on the web and its speed of change, this Thesis presents two highlights. First, the implementation of an effective web crawler that can gather and identify large amounts of user-generated content. Second, the stages followed in this crawling process, which are the identification and collection of important data and the maintenance of the gathered data. Consequently, social science needs to develop adequate methodologies for dealing with huge amounts of data, such as the one outlined in this Thesis, and to bridge the distance between technology and the social sciences. The methodology chosen in this Thesis triangulates the power-law distribution of the gathered data with another method, the elbow criterion, in order to identify the long tail. That is, to compare all the product types in the eWOM community Ciao UK, the power-law probability distribution function was used as a tool to measure the long tail. 
In addition, to further validate this method, the elbow criterion was used to locate the optimal cut-off point that distinguishes the products in the long tail. Furthermore, this Thesis outlines an architectural framework and methodology for gathering user-generated data from the eWOM community Ciao UK. To that end, a new methodology describes the implementation of a web crawler from another disciplinary perspective: computing science. The present Thesis aims to contribute to the study of the long tail phenomenon in an eWOM community and of the product types it contains. To this end, the following three hypotheses were tested: H1: The experience products from the distribution of product categories within an eWOM are more likely to exhibit a long tail. H2: The search products from the distribution of product categories within an eWOM are less likely to exhibit a long tail. H3: The distribution of product categories within an eWOM that have high frequency events or super-hits in the short head are not particularly associated with search or experience products. The results supported all three hypotheses. In this sense, this Thesis presents important new findings. First, the evidence shows that products with a long tail are those with subjective evaluation standards, which are classified as experience products. Second, it corroborates that search products, which have a high level of objective attributes in the total product assessment, do not encourage the long tail phenomenon. Third, the super-hits in the short head of the distribution are a mix of products. Thus, they are not particularly associated with search or experience products, since they involve either objective or subjective evaluation standards. 
Finally, it is worth highlighting that not all the categories that fit a power-law distribution are characterized by a long tail and, conversely, some of those that have a long tail do not fit a power law. The findings also suggest the potential of eWOM, which in general might generate a long tail effect in which a large number of small-volume vendors coexist with a few high-volume ones. Furthermore, this Thesis has contributed to both theory and practice in three ways: (1) with a methodology for collecting online user-generated data in the context of the social sciences; (2) with the development of two more accurate methods to identify niche products within an eWOM community, providing a deeper understanding of the long tail phenomenon and the types of products involved; and (3) with publications in refereed journals (indexed in JCR/JSCR) as well as conference papers related to the main topic of this Thesis. Premio Extraordinario de Doctorado U
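
The abstract does not spell out how the elbow criterion is computed. A common formulation, sketched below in Python as an assumption rather than as the thesis's own procedure, sorts the frequencies in descending order, rescales both axes to [0, 1], and takes the point farthest below the straight line joining the first and last points as the cut-off between the short head and the long tail. The function name `elbow_index` and the sample counts are hypothetical.

```python
def elbow_index(counts):
    """Return the rank (0-based) of the elbow in a descending frequency curve."""
    ys = sorted(counts, reverse=True)
    n = len(ys)
    top, bottom = ys[0], ys[-1]
    if n < 3 or top == bottom:
        raise ValueError("need at least three values that are not all equal")

    def gap(i):
        # Rescale rank and frequency to [0, 1]; the chord from the first
        # point (0, 1) to the last point (1, 0) is then x + y = 1.
        x = i / (n - 1)
        y = (ys[i] - bottom) / (top - bottom)
        # Proportional to the distance of the point below the chord.
        return 1.0 - x - y

    return max(range(n), key=gap)


# Hypothetical review counts per product in one category.
reviews = [100, 50, 20, 10, 5, 4, 3, 2, 1, 1]
cut = elbow_index(reviews)
print(cut)  # 2
```

Under this reading, products ranked after `cut` would form the long tail. Whether a category also fits a power law is a separate question, which is why the abstract treats the two methods as complementary checks.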

    A Biased Topic Modeling Approach for Case Control Study from Health Related Social Media Postings

    Online social networks are the hubs of social activity in cyberspace, and using them to exchange knowledge, experiences, and opinions is common. In this work, an advanced topic modeling framework is designed to analyse complex longitudinal health information from social media with minimal human annotation, and Adverse Drug Events and Reaction (ADR) information is extracted and automatically processed using a biased topic modeling method. This framework improves and extends existing topic modeling algorithms that incorporate background knowledge. Using this approach, background knowledge such as ADR terms and other biomedical knowledge can be incorporated during the text mining process, and scores that indicate the presence of ADR are generated. A case control study was performed on a data set of Twitter timelines of women who announced their pregnancy; the goal of the study is to compare the ADR risk of medication usage for each medication category during pregnancy. In addition, to evaluate the predictive power of this approach, another important aspect of personalized medicine was addressed: the prediction of medication usage through the identification of risk groups. During the prediction process, the health information from Twitter timelines, such as diseases, symptoms, treatments, and effects, is summarized by the topic modeling processes, and the summarization results are used for prediction. Dimension reduction and topic similarity measurement are integrated into this framework for timeline classification and prediction. This work could be applied to provide guidelines for FDA drug risk categories. Currently, this process is based on laboratory results and reported cases. Finally, a multi-dimensional text data warehouse (MTD) to manage the output of the topic modeling is proposed. Some attempts have also been made to incorporate topic structure (ontology) and the MTD hierarchy. 
Results demonstrate that the proposed methods show promise and that this system represents a low-cost approach for drug safety early warning. Dissertation/Thesis. Doctoral Dissertation, Computer Science 201

    Letters from the War of Ecosystems – An Analysis of Independent Software Vendors in Mobile Application Marketplaces

    The recent emergence of a new generation of mobile application marketplaces has changed the business in the mobile ecosystems. The marketplaces have gathered over a million applications by hundreds of thousands of application developers and publishers. Thus, software ecosystems (consisting of developers, consumers and the orchestrator) have emerged as a part of the mobile ecosystem. This dissertation addresses the new challenges faced by mobile application developers in the new ecosystems through empirical methods. Using the theories of two-sided markets and business ecosystems as the basis, the thesis assesses monetization and value creation in the market as well as the impact of electronic Word-of-Mouth (eWOM) and developer multi-homing (i.e. contributing to more than one platform) in the ecosystems. The data for the study was collected with web crawling from the three biggest marketplaces: Apple App Store, Google Play and Windows Phone Store. The dissertation consists of six individual articles. The results of the studies show a gap in monetization among the studied applications, while a majority of applications are produced by small or micro-enterprises. The study finds only weak support for the impact of eWOM on the sales of an application in the studied ecosystem. Finally, the study reveals a clear difference in the multi-homing rates between the top application developers and the rest. This has, as discussed in the thesis, an impact on future market analyses: it seems that the smart device market can sustain several parallel application marketplaces.
The new-generation mobile application marketplaces launched a few years ago have changed the business dynamics of mobile ecosystems. These new marketplaces have already managed to attract over a million applications from hundreds of thousands of software developers. These developers, together with the marketplace orchestrator and the end users, have formed a software ecosystem as part of the wider mobile ecosystem. This dissertation examines, through empirical research methods, the challenges that mobile application developers face in the new kind of marketplaces. It assesses the monetization and value creation of applications, as well as the effects in the ecosystem of online customer reviews (electronic Word-of-Mouth, eWOM) and of developer multi-homing, in which a developer commits to more than one ecosystem. The theoretical background of the work builds on two-sided markets and business ecosystems. The research data was collected from the three largest mobile application marketplaces: Apple App Store, Google Play and Windows Phone Store. This article-based dissertation consists of six independent research manuscripts. The results of the articles show shortcomings in monetization among the studied applications. A significant share of the examined applications are published by small companies or individual developers. The study found only weak support for a positive effect of eWOM on application sales volumes. The work also shows a significant difference in multi-homing behaviour between the most successful application developers and other developers. This finding matters for future market analyses, and its implications are discussed in the work. For example, the results suggest that the market could sustain several competing marketplaces. Transferred from Doria