
    What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries

    We analyze the question queries submitted to a large commercial web search engine to gain insight into what people ask, and to better tailor the search results to users' needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers' querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they make up only 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses labeled questions from a large community question answering (CQA) platform as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to be different from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.
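    The transfer idea described above can be sketched in a few lines: train a simple text classifier on category-labeled CQA questions, then apply it to web search question queries. The following is an illustrative sketch using a minimal Naive Bayes classifier and invented toy data and categories; it is not the paper's actual models, taxonomy, or training corpus.

```python
# Sketch: transfer a classifier trained on labeled CQA questions
# to unlabeled web search question queries. Toy data throughout.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(labeled_questions):
    """labeled_questions: list of (question, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    class_counts = Counter()             # category -> number of questions
    vocab = set()
    for text, cat in labeled_questions:
        class_counts[cat] += 1
        for w in tokenize(text):
            word_counts[cat][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(question, model):
    """Return the most probable category under multinomial Naive Bayes."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for cat in class_counts:
        lp = math.log(class_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in tokenize(question):
            lp += math.log((word_counts[cat][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = cat, lp
    return best

# Train on CQA-style labeled questions (invented examples)...
cqa = [
    ("how to cook rice", "food"),
    ("best recipe for pancakes", "food"),
    ("how to fix a flat tire", "auto"),
    ("why does my car engine overheat", "auto"),
]
model = train_nb(cqa)
# ...then transfer the classifier to web search question queries.
predicted = classify("how to cook pasta", model)  # → "food" on this toy data
```

    The real setting differs mainly in scale (a billion queries) and in the feature engineering needed to bridge the style gap between CQA questions and search queries, but the train-on-CQA, apply-to-search structure is the same.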

    Experimental Evaluation of Representation Models for Content Recommendation in Microblogging Services

    Micro-blogging services constitute a popular means of real-time communication and information sharing. Twitter is the most popular of these services, currently with 300 million monthly active user accounts and 500 million tweets posted on a daily basis. Consequently, Twitter users suffer from an information deluge, and a large number of recommendation methods have been proposed to re-rank the tweets in a user's timeline according to her interests. We focus on techniques that build a textual model for every individual user to capture her tastes and then rank the tweets she receives according to their similarity with that model. In the literature, there is as yet no comprehensive evaluation of these user modeling strategies. To cover this gap, in this thesis we systematically examine, on a real Twitter dataset, 9 state-of-the-art methods for modeling a user's preferences using exclusively textual information. Our goal is to identify the best performing user model with respect to several criteria: (i) the source of tweet information available for modeling, (ii) the user type, as determined by the relation between the tweeting frequency of a user and the frequency of her received tweets, (iii) the characteristics of its functionality, as derived from a novel taxonomy, and (iv) its robustness with respect to its internal configurations, as deduced by assessing a wide range of plausible values for internal parameters. Our results can be used for fine-tuning and interpreting textual user models in a recommendation scenario in microblogging services, and could serve as a starting point for further enhancing the most effective user model with additional contextual information.
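    One plausible instance of such a textual user model can be sketched as follows: build a TF-IDF profile from the tweets a user has posted, then rank candidate tweets by cosine similarity to that profile. The tweets, weighting scheme, and helper names below are illustrative assumptions, not the thesis's exact formulations.

```python
# Sketch: a TF-IDF user profile and similarity-based tweet ranking.
# All data is invented for illustration.
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}  # smoothed idf
    return [{w: c * idf[w] for w, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

user_tweets = [
    "training deep neural networks on gpus",
    "new transformer model for text classification",
]
candidates = [
    "great pasta recipe for dinner tonight",
    "scaling neural network training with mixed precision",
    "my cat chased a laser pointer all day",
]
docs = [t.split() for t in user_tweets + candidates]
vecs = tfidf(docs)

# The user profile aggregates the vectors of her own tweets.
profile = Counter()
for v in vecs[:len(user_tweets)]:
    profile.update(v)

# Rank candidates by similarity to the profile (most similar first).
ranked = sorted(range(len(candidates)),
                key=lambda i: cosine(profile, vecs[len(user_tweets) + i]),
                reverse=True)
```

    In this toy example the machine-learning tweet ranks first and the off-topic one last. The methods evaluated in the thesis vary exactly the choices made arbitrarily here: which tweets feed the profile, how terms are weighted, and how similarity is computed.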

    Retrieval Enhancements for Task-Based Web Search

    The task-based view of web search implies that retrieval should take the user perspective into account. Going beyond merely retrieving the most relevant result set for the current query, the retrieval system should aim to surface results that are actually useful to the task that motivated the query. This dissertation explores how retrieval systems can better understand and support their users' tasks from three main angles: First, we study and quantify search engine user behavior during complex writing tasks, and how task success and behavior are associated in such settings. Second, we investigate search engine queries formulated as questions, and explore patterns in a query log of nearly one billion such questions that may help search engines to better support this increasingly prevalent interaction pattern. Third, we propose a novel approach to reranking the search result lists produced by web search engines, taking into account retrieval axioms that formally specify properties of a good ranking.
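    The axiomatic reranking idea can be sketched as a pairwise preference aggregation: each axiom votes on ordered pairs of documents, and documents are reordered by their aggregate preference score. The two axioms below (prefer more query-term occurrences; prefer shorter documents otherwise) are simplified illustrations, not the dissertation's actual axiom set or aggregation method.

```python
# Sketch: rerank a result list by aggregating pairwise axiom preferences.
# Axioms and documents are invented illustrations.
def tf_axiom(query, a, b):
    """Prefer the document containing more query-term occurrences."""
    ca = sum(a.count(t) for t in query)
    cb = sum(b.count(t) for t in query)
    return (ca > cb) - (ca < cb)   # +1 prefers a, -1 prefers b, 0 no preference

def length_axiom(query, a, b):
    """All else being equal, prefer the shorter document."""
    return (len(a) < len(b)) - (len(a) > len(b))

AXIOMS = [tf_axiom, length_axiom]

def axiomatic_rerank(query, docs):
    """docs: list of token lists; returns document indices, best first."""
    scores = [0] * len(docs)
    for i in range(len(docs)):
        for j in range(len(docs)):
            if i == j:
                continue
            for axiom in AXIOMS:
                scores[i] += axiom(query, docs[i], docs[j])
    # A stable sort keeps the original (search engine) order on ties.
    return sorted(range(len(docs)), key=lambda i: -scores[i])

query = ["web", "search"]
docs = [
    "search engines index the web pages of the web".split(),
    "a long article about cooking".split(),
    "web search basics".split(),
]
order = axiomatic_rerank(query, docs)  # → [2, 0, 1] for this toy input
```

    Here the short, on-topic document wins because both axioms favor it, while the off-topic document sinks; real axiomatic approaches weigh many such formal constraints against the original ranking.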

    Identifying wildlife observations on twitter

    Despite the potential of social media for environmental monitoring, concerns remain about the quality and reliability of the information automatically extracted. Notably, there are many observations of wildlife on Twitter, but their automated detection is a challenge due to the frequent use of wildlife-related words in messages that have no connection with wildlife observation. We investigate whether, and what type of, supervised machine learning methods can be used to create a fully automated text classification model to identify genuine wildlife observations on Twitter, irrespective of species type or whether Tweets are geo-tagged. We perform experiments with various techniques for building feature vectors that serve as input to the classifiers, and consider how they affect classification performance. We compare three classification approaches and perform an analysis of the types of features that are indicative of genuine wildlife observations on Twitter. In particular, we compare some classical machine learning algorithms, widely used in ecology studies, with state-of-the-art neural network models. Results showed that the neural network-based model Bidirectional Encoder Representations from Transformers (BERT) outperformed the classical methods. Notably, this was the case for a relatively small training corpus, consisting of fewer than 3000 instances. This reflects the fact that the BERT classifier uses a transfer learning approach that benefits from prior learning on a very much larger collection of generic text. BERT performed particularly well even for Tweets that employed specialised language relating to wildlife observations. The analysis of possible indicative features for wildlife Tweets revealed interesting trends in the usage of hashtags that are unrelated to official citizen science campaigns. The findings from this study facilitate more accurate identification of wildlife-related data on social media, which can in turn be used to enrich citizen science data collections.
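    An indicative-feature analysis of the kind mentioned above can be sketched by ranking terms (including hashtags) with a smoothed log-odds ratio between genuine-observation tweets and other tweets containing wildlife words. The tweets, labels, and helper names below are invented illustrations, not the study's data or feature pipeline.

```python
# Sketch: score terms by how strongly they indicate the positive class.
# All tweets and labels are invented for illustration.
import math
from collections import Counter

def log_odds(pos_docs, neg_docs):
    """Smoothed log-odds of each term: positive scores indicate the pos class."""
    pos = Counter(w for d in pos_docs for w in d.split())
    neg = Counter(w for d in neg_docs for w in d.split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {}
    for w in vocab:
        p = (pos[w] + 1) / (n_pos + len(vocab))   # Laplace smoothing
        q = (neg[w] + 1) / (n_neg + len(vocab))
        scores[w] = math.log(p / q)
    return scores

observations = [
    "spotted a badger near the river this morning #wildlifewatch",
    "juvenile otter spotted by the bridge today #wildlifewatch",
]
non_observations = [
    "the badger is my favourite football mascot",
    "otter themed mugs now in stock at our shop",
]
scores = log_odds(observations, non_observations)
# Terms like "spotted" and "#wildlifewatch" score positive; species names
# like "otter" that occur in both classes score near or below zero.
```

    This mirrors the study's observation that hashtags and observation verbs, rather than species names alone, separate genuine sightings from incidental wildlife mentions.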

    Understanding people through the aggregation of their digital footprints

    Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 160-172). Every day, millions of people encounter strangers online. We read their medical advice, buy their products, and ask them out on dates. Yet our views of them are very limited; we see individual communication acts rather than the person(s) as a whole. This thesis contends that socially-focused machine learning and visualization of archived digital footprints can improve the capacity of social media to help form impressions of online strangers. Four original designs are presented that each examine the social fabric of a different existing online world. The designs address unique perspectives on the problems of, and opportunities offered by, online impression formation. The first work, Is Britney Spears Spam?, examines a way of prototyping strangers on first contact by modeling their past behaviors across a social network. Landscape of Words identifies cultural and topical trends in large online publics. Personas is a data portrait that characterizes individuals by collating heterogeneous textual artifacts. The final design, Defuse, navigates and visualizes virtual crowds using metrics grounded in sociology. A reflection on these experimental endeavors is also presented, including a formalization of the problem and considerations for future research. A meta-critique by a panel of domain experts completes the discussion. by Aaron Robert Zinman. Ph.D.

    Hybrid intelligence for data mining

    Today, enormous amounts of data are being recorded in all kinds of activities. This sheer volume provides an excellent opportunity for data scientists to retrieve valuable information using data mining techniques. Due to the complexity of data in many neoteric problems, one-size-fits-all solutions are seldom able to provide satisfactory answers. Although research on data mining has been active, hybrid techniques are rarely scrutinized in detail. Currently, not many techniques can handle time-varying properties while performing their core functions, nor do they retrieve and combine information from heterogeneous dimensions, e.g., textual and numerical horizons. This thesis summarizes our investigations on hybrid methods that provide data mining solutions to problems involving non-trivial datasets, such as trajectories, microblogs, and financial data. First, time-varying dynamic Bayesian networks are extended to consider both causal and dynamic regularization requirements. Combined with density-based clustering, the enhancements overcome the difficulties in modeling spatial-temporal data where heterogeneous patterns, data sparseness, and distribution skewness are common. Secondly, topic-based methods are proposed for emerging-outbreak and virality prediction on microblogs. Complicated models that consider structural details are popular, while others adopt overly simplified assumptions, sacrificing accuracy for efficiency. Our proposed virality prediction solution delivers the benefits of both worlds: it considers the important characteristics of a structure yet avoids the burden of fine details, reducing complexity. Thirdly, the proposed topic-based approach for microblog mining is extended to sentiment prediction problems in finance. Sentiment-of-topic models are learned from both commentaries and prices for better risk management. Moreover, a previously proposed supervised topic model provides an avenue to associate market volatility with financial news, yet it displays poor resolution in extreme regions. To overcome this problem, an extreme topic model is proposed to predict volatility in financial markets using supervised learning. By mapping extreme events onto Poisson point processes, volatile regions are magnified to reveal their hidden volatility-topic relationships. Lastly, some of the proposed hybrid methods are applied to service computing to verify that they are sufficiently generic for wider applications.

    An EBD-enabled design knowledge acquisition framework

    Having enough knowledge and keeping it up to date enables designers to execute design assignments effectively and gives them a competitive advantage in the design profession. Knowledge elicitation or acquisition is a crucial component of system design, particularly for tasks requiring transdisciplinary or multidisciplinary cooperation. In system design, extracting domain-specific information is exceedingly difficult for designers. This thesis presents three works that attempt to bridge the gap between designers and domain expertise. First, a systematic literature review on data-driven demand elicitation is given using the Environment-Based Design (EBD) approach. This review addresses two research objectives: (i) to investigate the present state of computer-aided requirement knowledge elicitation in the domains of engineering; (ii) to integrate the EBD methodology into the conventional literature review framework by providing a well-structured research question generation methodology. The second work describes a data-driven interview transcript analysis strategy that employs EBD environment analysis, unsupervised machine learning, and a range of natural language processing (NLP) approaches to assist designers and qualitative researchers in extracting needs when domain expertise is lacking. It also proposes a transfer-learning-based qualitative text analysis framework that aids researchers in extracting valuable knowledge from interview data for healthcare promotion decision-making. The third work is an EBD-enabled design lexical knowledge acquisition framework that automatically constructs a semantic network, RomNet, from an extensive collection of abstracts from engineering publications. Applying RomNet can improve design information retrieval quality and communication between the parties involved in a design project. To conclude, this thesis integrates artificial intelligence techniques, such as NLP methods, machine learning techniques, and rule-based systems, to build a knowledge acquisition framework that supports manual, semi-automatic, and automatic extraction of design knowledge from different types of textual data sources.
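    The general idea of building a semantic network from publication abstracts can be sketched as a term co-occurrence graph: nodes are terms, and weighted edges count how often two terms appear in the same sentence. This is a generic illustration of the concept behind a network such as RomNet, with invented abstracts and a naive stopword list; it is not RomNet's actual construction method.

```python
# Sketch: build a weighted term co-occurrence network from abstracts.
# Abstracts and the stopword list are invented for illustration.
from collections import Counter
from itertools import combinations

def build_network(abstracts, stopwords=frozenset({"a", "the", "of", "in", "are"})):
    """Return a Counter mapping sorted term pairs to co-occurrence counts."""
    edges = Counter()
    for abstract in abstracts:
        for sentence in abstract.lower().split("."):
            terms = sorted(set(sentence.split()) - stopwords)
            for pair in combinations(terms, 2):
                edges[pair] += 1
    return edges

abstracts = [
    "Fatigue analysis of the wing spar. The spar carries bending loads.",
    "Bending loads drive spar sizing. Fatigue limits are checked.",
]
net = build_network(abstracts)
# Strong edges connect terms that repeatedly co-occur, e.g.
# ("bending", "loads") and ("loads", "spar") each appear twice here.
```

    On a large corpus of engineering abstracts, the heaviest edges of such a graph approximate domain relationships between design concepts, which is what makes the network useful for design information retrieval.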