14 research outputs found
Crawler for Estonian Social Media RSS Feeds
The aim of the thesis was to develop two crawlers for Estonian-language social media. The thesis describes the algorithms used in the crawlers, gives an overview of an experiment carried out with them, and evaluates their efficiency on the basis of that experiment. The crawlers will be used by the University of Tartu Estonian language research group
Methoden für Trendanalysen im Web zur Unterstützung des Customer Relationship Management (Methods for Trend Analysis on the Web to Support Customer Relationship Management)
With the arrival of Web 2.0 in everyday life, individuals can publish their opinions and feelings in the form of blogs. Analysing the trends in this blogosphere can contribute substantially to supporting customer win-back in a CRM system. This research work examines existing approaches to trend detection in general and then assesses their suitability for application to weblogs. To this end, building on existing scientific work, a trend analysis system is implemented as a prototype and the analysis results are subsequently evaluated
Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns. The Development and Evaluation of New Web Mining Methods that enhance Information Retrieval and improve the Understanding of User's Web Behavior in Websites and Social Blogs.
The rapid growth of the World Wide Web in the last decade has made it the largest publicly accessible data source in the world and one of the most significant and influential information revolutions of modern times. The Web has affected almost every aspect of human life and activity, causing paradigm shifts and transformational changes in business, governance, and education. Moreover, the rapid evolution of Web 2.0 and the Social Web in the past few years, such as social blogs and friendship networking sites, has dramatically transformed the Web from a raw environment for information consumption to a dynamic and rich platform for information production and sharing worldwide. However, this growth and transformation has resulted in an uncontrollable explosion of textual content, making it a serious challenge for any user to find and retrieve the relevant information he or she seeks on the Web. Finding a relevant Web page in a website easily and efficiently has become very difficult. This challenges researchers to develop new mining techniques that improve the user experience on the Web, and challenges organizations to understand the true informational interests and needs of their customers so that they can improve their targeted services by providing the products, services and information that match the requirements of every online customer.
With these challenges in mind, Web mining aims to extract hidden patterns and discover useful knowledge from Web page contents, Web hyperlinks, and Web usage logs. Based on the primary kinds of Web data used in the mining process, Web mining tasks can be categorized into three main types: Web content mining, which extracts knowledge from Web page contents using text mining techniques; Web structure mining, which extracts patterns from the hyperlinks that represent the structure of the website; and Web usage mining, which mines users' Web navigation patterns from Web server logs that record the page accesses made by every user, representing the interactions between the users and the Web pages in a website. The main goal of this thesis is to contribute toward addressing the challenges that have resulted from the information explosion and overload on the Web, by proposing and developing novel Web mining-based approaches. Toward achieving this goal, the thesis presents, analyzes, and evaluates three major contributions. First, the development of an integrated Web structure and usage mining approach that recommends a collection of hyperlinks for the surfers of a website to be placed at the homepage of that website. Second, the development of an integrated Web content and usage mining approach to improve the understanding of the user's Web behavior and discover the user group interests in a website. Third, the development of a supervised classification model based on recent Social Web concepts, such as Tag Clouds, in order to improve the retrieval of relevant articles and posts from Web social blogs
It's all a bit upmessing - non-standard verb-particle combinations in blogs
This article will explore how verb-particle combinations, for a long time one of the most productive segments of English word-formation, have changed with the advent of online real-time short communication forms such as blogs or their more sophisticated social networking or microblogging varieties like Twitter and Facebook. Following up on earlier research (Diemer 2008), evidence will be presented that the long and seemingly unstoppable trend towards verb-adverb combinations and the decline of the prefixes has been partly reversed by these new forms of communication. Selected examples with the prefixes in and on will be discussed. It will be argued that the main reasons for this change are facilitation of syntax, need for innovation in specialized and peer group communication, analogy formation and the influence of other languages on English
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we take inventory of the impact and legal consequences of these technical advances and point out future directions of research
Importance of social media in the information sourcing phase during the decision-making process of the South African traveller
Includes bibliographical references. The Internet and the emergence of social media have a significant effect on the tourism industry worldwide. Tourists can search for advice online from strangers and friends who have visited the destination in the past. Research indicates that this information source is perceived as more credible than traditional marketing material such as Web sites, brochures or other forms of advertisements. More specifically, information sources on social media assist the tourist in evaluating alternatives in order to make an informed purchasing decision. Destination marketing organisations and tourism enterprises need to understand the role that social media plays in the decision-making process in order to create effective marketing strategies online. This research paper focuses on the South African traveller and on which online sources s/he uses to search for travel information before going on holiday. Social media sources in particular will be under investigation. There has been a dearth of research conducted in this area on emerging markets such as South Africa, and this paper will fill an important gap in the academic literature. The database for this research was acquired from Travelstart, a leading digital travel agency in South Africa
The voting model for people search
The thesis investigates how persons in an enterprise organisation can be ranked in response to a query, so that those persons with relevant expertise to the query topic are ranked first. The expertise areas of the persons are represented by documentary evidence of expertise, known as candidate profiles. The statement of this research work is that the expert search task in an enterprise setting can be successfully and effectively modelled using a voting paradigm. In the so-called Voting Model, when a document is retrieved for a query, this document represents a vote for every expert associated with the document to have relevant expertise to the query topic. This voting paradigm is manifested by the proposition of various voting techniques that aggregate the votes from documents to candidate experts. Moreover, the research work demonstrates that these voting techniques can be modelled in terms of a Bayesian belief network, providing probabilistic semantics for the proposed voting paradigm.
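The voting idea described above can be illustrated with a minimal Python sketch. The function name `rank_candidates`, the toy data, and the two aggregation rules shown (a plain vote count and a sum of retrieval scores) are assumptions chosen for illustration; the abstract does not specify which voting techniques the thesis defines.

```python
from collections import defaultdict

def rank_candidates(ranked_docs, doc_to_candidates, use_scores=False):
    """Aggregate document votes into a ranking of candidate experts.

    ranked_docs: list of (doc_id, score) pairs retrieved for a query.
    doc_to_candidates: dict mapping doc_id -> candidates associated with it.
    use_scores: if False, each retrieved document casts one vote;
                if True, each vote is weighted by the document's score.
    (All names here are hypothetical, not taken from the thesis.)
    """
    totals = defaultdict(float)
    for doc_id, score in ranked_docs:
        # A retrieved document votes for every candidate linked to it.
        for candidate in doc_to_candidates.get(doc_id, ()):
            totals[candidate] += score if use_scores else 1.0
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy document ranking and candidate profiles (invented for the example).
ranked_docs = [("d1", 3.0), ("d2", 2.0), ("d3", 0.5)]
profiles = {"d1": ["alice"], "d2": ["bob"], "d3": ["bob", "carol"]}

by_count = rank_candidates(ranked_docs, profiles)
by_score = rank_candidates(ranked_docs, profiles, use_scores=True)
print(by_count)
print(by_score)
```

In this toy data the two rules disagree on the top candidate: counting votes favours the candidate with the most retrieved documents, while summing scores favours the candidate linked to the highest-scoring document.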
The proposed voting techniques are thoroughly evaluated on three standard expert search test collections, deriving conclusions concerning each component of the Voting Model, namely the method used to identify the documents that represent each candidate's expertise areas, the weighting models that are used to rank the documents, and the voting techniques which are used to convert the ranking of documents into the ranking of experts. Effective settings are identified and insights about the behaviour of each voting technique are derived. Moreover, the practical aspects of deploying an expert search engine such as its efficiency and how it should be trained are also discussed.
This thesis includes an investigation of the relationship between the quality of the underlying ranking of documents and the resulting effectiveness of the voting techniques. The thesis shows that various effective document retrieval approaches have a positive impact on the performance of the voting techniques. Interestingly, it also shows that a 'perfect' ranking of documents does not necessarily translate into an equally perfect ranking of candidates. Insights are provided into the reasons for this, which relate to the complexity of evaluating tasks based on ranking aggregates of documents.
Furthermore, it is shown how query expansion can be adapted and integrated into the expert search process, such that the query expansion successfully acts on a pseudo-relevant set containing only a list of names of persons. Five ways of performing query expansion in the expert search task are proposed, which vary in the extent to which they tackle expert search-specific problems, in particular, the occurrence of topic drift within the expertise evidence for each candidate.
Not all documentary evidence of expertise for a given person is equally useful, nor may there be sufficient expertise evidence for a relevant person within an enterprise. This thesis investigates various approaches to identifying the high-quality evidence for each person, and shows how the World Wide Web can be mined as a resource to find additional expertise evidence.
This thesis also demonstrates how the proposed model can be applied to other people search tasks, such as ranking blog(ger)s in the blogosphere setting and suggesting reviewers for papers submitted to an academic conference.
The central contributions of this thesis are the introduction of the Voting Model, and the definition of a number of voting techniques within the model. The thesis draws insights from an extremely large and exhaustive set of experiments, involving many experimental parameters, and using different test collections for several people search tasks. This illustrates the effectiveness and the generality of the Voting Model at tackling various people search tasks and, indeed, the retrieval of aggregates of documents in general
An Online Analytical System for Multi-Tagged Document Collections
The New York Times Annotated Corpus and the ACM Digital Library are two prototypical examples of document collections in which each document is tagged with keywords and significant phrases. Such collections can be viewed as high-dimensional document cubes against which browsers and search systems can be applied in a manner similar to online analytical processing against data cubes. The tagging patterns in these collections are examined and a generative tagging model is developed that can mimic the tag assignments observed in those collections. When a user browses the collection by means of a Boolean query over tags, the result is a subset of documents that can be summarized by a centroid derived from their document term vectors. A partial materialization strategy is developed to provide efficient storage and access to centroids for such document subsets. A customized local term vocabulary storage approach is incorporated into the partial materialization to ensure that rich and relevant term vocabulary is available for representing centroids while maintaining a low storage footprint. By adopting this strategy, summary measures dependent on centroids (including bursty terms, or larger sets of indicative documents) can be efficiently and accurately computed for important subsets of documents. The proposed design is evaluated on the two collections along with PubMed (a held-back document collection) and several synthetic collections to validate that it outperforms alternative storage strategies.
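As a rough illustration of summarizing a Boolean tag query by a centroid, the following Python sketch averages the term vectors of the matching documents. The document layout (`tags`, `terms`), the function name `centroid`, and the sample data are assumptions made for the example; the sketch computes the centroid on the fly and does not attempt the partial materialization strategy described above.

```python
from collections import Counter

def centroid(docs, predicate):
    """Mean term vector of the documents whose tag set satisfies predicate.

    docs: list of dicts with a 'tags' set and a 'terms' dict (term -> weight).
    predicate: function taking a tag set and returning True or False,
               standing in for a Boolean query over tags.
    (Hypothetical data layout, not the system's actual storage format.)
    """
    selected = [d for d in docs if predicate(d["tags"])]
    if not selected:
        return {}
    total = Counter()
    for d in selected:
        total.update(d["terms"])  # adds the weights term by term
    return {term: weight / len(selected) for term, weight in total.items()}

# Invented sample collection.
docs = [
    {"tags": {"politics", "economy"}, "terms": {"tax": 2, "vote": 1}},
    {"tags": {"politics"}, "terms": {"vote": 3}},
    {"tags": {"sports"}, "terms": {"goal": 4}},
]

# Boolean query: politics AND NOT sports.
c = centroid(docs, lambda tags: "politics" in tags and "sports" not in tags)
print(c)
```

A term missing from a document contributes zero to the average, which is why `tax` is halved in the result even though only one matching document contains it.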
Finally, an enhanced faceted browsing system is developed to support users' exploration of large multi-tagged document collections. It provides summary measures of document result sets at each step of navigation through a set of indicative terms and a diverse set of documents, as well as information scent that helps to guide users' exploration. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus