387 research outputs found

    Event detection in interaction network

    We study the problem of detecting top-k events from digital interaction records (e.g., emails, tweets). We first introduce the interaction meta-graph, which connects associated interactions. Then, we define an event to be a subset of interactions that (i) are topically and temporally close and (ii) correspond to a tree capturing information flow. Finding the best event leads to a variant of the prize-collecting Steiner-tree problem, for which three methods are proposed. Finding the top-k events maps to the maximum k-coverage problem. Evaluation on real datasets shows that our methods detect meaningful events.
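
    The abstract maps top-k event selection to the maximum k-coverage problem, for which the standard greedy algorithm gives a (1 - 1/e)-approximation. Below is a minimal sketch of that greedy step, assuming candidate events have already been produced (e.g., by the prize-collecting Steiner-tree step) as sets of interaction IDs; it is an illustration, not the paper's implementation.

```python
# Greedy maximum k-coverage over candidate events (sets of interaction IDs).
# This is a generic sketch, not the paper's implementation; candidate
# generation (the prize-collecting Steiner-tree step) is assumed to be done.

def greedy_k_coverage(candidates, k):
    """candidates: list of sets of interaction IDs; returns (chosen indices, covered IDs)."""
    covered, chosen = set(), []
    remaining = set(range(len(candidates)))
    for _ in range(min(k, len(candidates))):
        # Pick the candidate event that adds the most not-yet-covered interactions.
        best = max(remaining, key=lambda i: len(candidates[i] - covered))
        if not candidates[best] - covered:
            break  # nothing new can be covered
        chosen.append(best)
        covered |= candidates[best]
        remaining.discard(best)
    return chosen, covered

# Toy example: three candidate events over interaction IDs 1..6, pick top-2.
events = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
print(greedy_k_coverage(events, 2))  # -> ([0, 2], {1, 2, 3, 4, 5, 6})
```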

    Detecting New, Informative Propositions in Social Media

    The ever-growing quantity of online text produced makes it increasingly challenging to find new important or useful information. This is especially so when topics of potential interest are not known a priori, as in “breaking news stories”. This thesis examines techniques for detecting the emergence of new, interesting information in Social Media. It sets the investigation in the context of a hypothetical knowledge discovery and acquisition system, and addresses two objectives. The first objective is the detection of new topics. The second is the filtering of non-informative text from Social Media. A rolling time-slicing approach is proposed for discovery, in which daily frequencies of nouns, named entities, and multiword expressions are compared to their expected daily frequencies, as estimated from previous days using a Poisson model. Trending features in Social Media, those showing a significant surge in use, are potentially interesting. Features that have not shown a similar recent surge in News are selected as indicative of new information. It is demonstrated that surges in nouns and named entities can be detected that predict corresponding surges in mainstream news. Co-occurring trending features are used to create clusters of potentially topic-related documents. Those formed from co-occurrences of named entities are shown to be the most topically coherent. Machine-learning-based filtering models are proposed for finding informative text in Social Media. News/Non-News and Dialogue Act models are explored using the News-annotated Redites corpus of Twitter messages. A simple 5-act Dialogue scheme, used to annotate a small sample thereof, is presented. For both News/Non-News and Informative/Non-Informative classification tasks, using non-lexical message features produces more discriminative and robust classification models than using message terms alone. The combination of all investigated features yields the most accurate models.
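
    A minimal sketch of the kind of Poisson surge test described above: a feature's expected daily frequency is estimated from the mean of previous days, and a day is flagged as trending when the observed count is improbably high under that rate. The window length and significance threshold here are illustrative assumptions, not values from the thesis.

```python
# Sketch of Poisson-based surge detection (not the thesis code): a feature
# trends on a day if its count is improbably high under a Poisson model
# whose rate is the mean daily count over the preceding window.
from scipy.stats import poisson

def is_trending(history, today_count, alpha=0.001):
    """history: counts from previous days; flags a statistically significant surge."""
    rate = max(sum(history) / len(history), 1e-9)  # expected daily count
    p_value = poisson.sf(today_count - 1, rate)    # P(X >= today_count)
    return p_value < alpha

print(is_trending([2, 3, 1, 2, 2], 15))  # True: 15 mentions vs. ~2 per day
print(is_trending([2, 3, 1, 2, 2], 3))   # False: within normal variation
```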

    Local News And Event Detection In Twitter

    Twitter, one of the most popular micro-blogging services, allows users to publish short messages, called tweets, on a wide variety of subjects such as news, events, stories, ideas, and opinions. The popularity of Twitter arises, to some extent, from its capability of letting users promptly and conveniently contribute tweets to convey diverse information. Specifically, with people discussing what is happening outside in the real world by posting tweets, Twitter captures invaluable information about real-world news and events, spanning a wide scale from large national or international stories, such as a presidential election, to small local stories, such as a local farmers market. Detecting and extracting small news and events for a local place is a challenging problem and is the focus of this thesis. In particular, we explore several directions for extracting and detecting local news and events using tweets in Twitter: a) how to identify locally influential people on Twitter as potential news seeders; b) how to recognize unusualness in tweet volume as a signal of potential local events; c) how to overcome the data sparsity of local tweets to detect more, and smaller, ongoing local news and events. Additionally, we also try to uncover implicit correlations between location, time, and text in tweets by learning embeddings for them under a universal representation in the same semantic space. In the first part, we investigate how to measure the spatial influence of Twitter users by their interactions and thereby identify locally influential users, who we found are usually good news and event seeders in practice. To do this, we build a large-scale directed interaction graph of Twitter users. Such a graph allows us to exploit PageRank-based ranking procedures to select top locally influential people after incorporating geographical distance into the transition matrix used for the random walk. In the second part, we study how to recognize unusualness in tweet volume at a local place as a signal of potential ongoing local events. The intuition is that if there is suddenly an abnormal change in the number of tweets at a location (e.g., a significant increase), it may imply a potential local event. We therefore present DeLLe, a methodology for automatically Detecting Latest Local Events from geotagged tweet streams (i.e., tweets that contain GPS points). With the help of novel spatiotemporal tweet count prediction models, DeLLe first finds unusual locations which have aggregated an unexpected number of tweets in the latest time period and then calculates, for each such unusual location, a ranking score to identify the ones most likely to have ongoing local events, by addressing temporal burstiness, spatial burstiness, and topical coherence. In the third part, we explore how to overcome the data sparsity of local tweets when trying to discover more, and smaller, local news or events. Local tweets are those whose locations fall inside a local place. They are very sparse in Twitter, which hinders the detection of small local news or events that have only a handful of tweets. A system, called Firefly, is proposed to enhance the local live tweet stream by tracking the tweets of a large body of local people, and to further perform locality-aware, keyword-based clustering for event detection. The intuition is that local tweets are published by local people, and tracking their tweets naturally yields a source of local tweets.
    However, in practice, only 20% of Twitter users provide information about where they come from. Thus, a social-network-based geotagging procedure is subsequently proposed to estimate locations for Twitter users whose locations are missing. Finally, in order to discover correlations between location, time and text in geotagged tweets, e.g., “find which locations are most related to given topics” and “find which locations are similar to a given location”, we present LeGo, a methodology for Learning embeddings of Geotagged tweets with respect to entities such as locations, time units (hour-of-day and day-of-week) and textual words in tweets. The resulting compact vector representations of these entities make it easy to measure the relatedness between locations, time and words in tweets. LeGo comprises two working modes, crossmodal search (LeGo-CM) and location-similarity search (LeGo-LS), to answer these two types of queries accordingly. In LeGo-CM, we first build a graph of entities extracted from tweets in which each edge carries the weight of co-occurrences between the two entities it connects. The embeddings of graph nodes are then learned in the same latent space under the guidance of approximating the stationary residing probabilities between nodes, which are computed using personalized random walk procedures. In comparison, in LeGo-LS we supplement edges between locations to capture their underlying spatial proximity and topical similarity, supporting location-similarity search queries.
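
    The first part of the thesis incorporates geographical distance into the PageRank transition matrix to rank locally influential users. The sketch below illustrates one way such a distance-aware PageRank could look, using a simple inverse-distance edge weighting; the exact weighting scheme and the helper names are assumptions, not the thesis's formulation.

```python
# Sketch of PageRank over a directed user-interaction graph whose edge
# weights are down-weighted by geographic distance, so nearby interactions
# count more. The inverse-distance weighting is an assumption made for
# illustration, not necessarily the thesis's exact formulation.
import networkx as nx
from math import dist

def local_influence(interactions, coords, alpha=0.85):
    """interactions: (follower, followee, count) triples; coords: user -> (x, y)."""
    g = nx.DiGraph()
    for u, v, count in interactions:
        w = count / (1.0 + dist(coords[u], coords[v]))  # closer => heavier edge
        g.add_edge(u, v, weight=w)
    return nx.pagerank(g, alpha=alpha, weight="weight")

coords = {"a": (0, 0), "b": (0, 1), "c": (10, 10)}
edges = [("a", "b", 5), ("b", "a", 3), ("a", "c", 5), ("c", "b", 1)]
print(local_influence(edges, coords))  # ranks "a" and "b" above the far-away "c"
```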

    Timeout Reached, Session Ends?

    The identification of sessions as a means of understanding user behaviour is a common research area of web usage mining. Different definitions and concepts have been discussed for over 20 years. Research shows that session identification is not an arbitrary task, yet there is a questionable tendency towards simplistic mechanical sessions instead of more complex logical segmentations. This dissertation aims to prove how the nature of differing session-identification approaches leads to diverging results and interpretations. The overarching research question asks: will different session-identification approaches impact analysis and machine-learning tasks? A comprehensive methodological framework for implementing, comparing and evaluating sessions is given. The dissertation provides implementation guidelines for 135 session-identification approaches, utilizing a complete year (2018) of traffic data from a German price-comparison e-commerce platform. The implementation includes mechanical concepts, logical constructs and the combination of multiple methods. It shows how logical sessions were constructed from user sequences by employing embedding algorithms on interaction logs, taking a novel approach to logical session identification that uses the topical proximity of interactions instead of search queries alone. All approaches are compared and quantitatively described. Their application in three machine-learning tasks (such as recommendation) is intended to show that using different sessions as input data has a marked impact on the outcome.
    The main contribution of this dissertation is to provide a comprehensive comparison of session-identification algorithms. The research provides a methodology to implement, analyse and compare a wide variety of mechanics, allowing us to better understand user behaviour and the effects of session modelling. The main results show that differently structured input data may drastically change the results of algorithms or analyses.
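
    For context, the simplest mechanical session concept discussed above cuts a user's interaction sequence whenever the gap between consecutive events exceeds a timeout. A minimal sketch follows, assuming the conventional 30-minute threshold; it stands in for just one of the many configurations studied and is not taken from the dissertation.

```python
# Sketch of a mechanical (timeout-based) session split. The 30-minute
# threshold is the conventional default, used here only to illustrate one
# of many possible session-identification mechanics.
from datetime import datetime, timedelta

def split_sessions(events, timeout=timedelta(minutes=30)):
    """events: list of (timestamp, action) tuples, sorted by time, for one user."""
    sessions, current = [], []
    for ts, action in events:
        if current and ts - current[-1][0] > timeout:
            sessions.append(current)  # gap too large: close the session
            current = []
        current.append((ts, action))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2018, 1, 1, 12, 0)
clicks = [(t0, "search"), (t0 + timedelta(minutes=5), "compare"),
          (t0 + timedelta(hours=2), "search")]
print(len(split_sessions(clicks)))  # -> 2 sessions
```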

    Rhetorical memory, synaptic mapping, and ethical grounding

    This research applies neuroscience to classical accounts of rhetorical memory, and argues that the physical operations of memory via synaptic activity support causal theories of language and account for individual agency in systematically considering, creating, and revising our stances toward rhetorical situations. The dissertation explores ways that rhetorical memory grounds the work of the other canons of rhetoric in specific contexts, thereby expanding memory's classical function as "custodian" of the canons. In this approach, rhetorical memory actively orients the canons as interdependent phases of discursive communicative acts, and grounds them in an ethical baseline from which we enter discourse. Finally, the work applies its re-conception of rhetorical memory to various aspects of composition and Living Learning Community educational models via practical and deliberate interpretation and arrangement of our synaptic "maps."

    Extracting Insights from Differences: Analyzing Node-aligned Social Graphs

    Social media and network research often focus on the agreement between different entities to infer connections, recommend actions and subscriptions and even improve algorithms via ensemble methods. However, studying differences instead of similarities can yield useful insights in all these cases. We can infer and understand inter-community interactions (including ideological and user-based community conflicts, hierarchical community relations) and improve community detection algorithms via insights gained from differences among entities such as communities, users and algorithms. When the entities are communities or user groups, we often study the difference via node-aligned networks, which are networks with the same set of nodes but different sets of edges. The edges define implicit connections which we can infer via similarities or differences between two nodes. We perform a set of studies to identify and understand differences among user groups using Reddit, where the subreddit structure provides us with pre-defined user groups. Studying the difference between author overlap and textual similarity among different subreddits, we find misaligned edges and networks which expose subreddits at ideological 'war', community fragmentation, asymmetry of interactions involving subreddits based on marginalized social groups and more. Differences in perceived user behavior across different subreddits allow us to identify subreddit conflicts and features which can implicate communal misbehavior. We show that these features can be used to identify some subreddits banned by Reddit. Applying the idea of differences in community detection algorithms helps us identify problematic community assignments where we can ask for human help in categorizing a node in a specific community. It also gives us an idea of the overall performance of a particular community detection algorithm on a particular network input. In general, these improve ensemble community detection techniques. We demonstrate this via CommunityDiff (a community detection and visualization tool), which compares and contrasts different algorithms and incorporates user knowledge in community detection output. We believe the idea of gaining insights from differences can be applied to several other problems and help us understand and improve social media interactions and research.
    PHD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/149801/1/srayand_1.pd
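
    A small sketch of the node-aligned comparison described above: two networks over the same subreddit nodes, one built from author overlap and one from textual similarity, are compared edge-wise to surface the misaligned edges. The data structures and toy data are illustrative assumptions, not the dissertation's code.

```python
# Sketch of comparing two node-aligned subreddit networks: same node set,
# different edge sets (e.g., author overlap vs. textual similarity).
# Edges present in one network but not the other are the "misaligned"
# edges discussed above; the toy data below is purely illustrative.

def misaligned_edges(overlap_edges, similarity_edges):
    """Both arguments are sets of frozenset({u, v}) edges over the same nodes."""
    only_overlap = overlap_edges - similarity_edges  # shared authors, dissimilar text
    only_similar = similarity_edges - overlap_edges  # similar text, disjoint authors
    return only_overlap, only_similar

authors = {frozenset({"r/politics", "r/news"}), frozenset({"r/cats", "r/aww"})}
text = {frozenset({"r/politics", "r/news"}), frozenset({"r/politics", "r/conservative"})}
print(misaligned_edges(authors, text))
```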

    Visual object category discovery in images and videos

    The current trend in visual recognition research is to place a strict division between the supervised and unsupervised learning paradigms, which is problematic for two main reasons. On the one hand, supervised methods require training data for each and every category that the system learns; training data may not always be available and is expensive to obtain. On the other hand, unsupervised methods must determine the optimal visual cues and distance metrics that distinguish one category from another to group images into semantically meaningful categories; however, for unlabeled data, these are unknown a priori. I propose a visual category discovery framework that transcends the two paradigms and learns accurate models with few labeled exemplars. The main insight is to automatically focus on the prevalent objects in images and videos, and learn models from them for category grouping, segmentation, and summarization. To implement this idea, I first present a context-aware category discovery framework that discovers novel categories by leveraging context from previously learned categories. I devise a novel object-graph descriptor to model the interaction between a set of known categories and the unknown to-be-discovered categories, and group regions that have similar appearance and similar object-graphs. I then present a collective segmentation framework that simultaneously discovers the segmentations and groupings of objects by leveraging the shared patterns in the unlabeled image collection. It discovers an ensemble of representative instances for each unknown category, and builds top-down models from them to refine the segmentation of the remaining instances. Finally, building on these techniques, I show how to produce compact visual summaries for first-person egocentric videos that focus on the important people and objects. The system leverages novel egocentric and high-level saliency features to predict important regions in the video, and produces a concise visual summary that is driven by those regions. I compare against existing state-of-the-art methods for category discovery and segmentation on several challenging benchmark datasets. I demonstrate that we can discover visual concepts more accurately by focusing on the prevalent objects in images and videos, and show clear advantages of departing from the status quo division between the supervised and unsupervised learning paradigms. The main impact of my thesis is that it lays the groundwork for building large-scale visual discovery systems that can automatically discover visual concepts with minimal human supervision.
    Electrical and Computer Engineering
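
    One way to read the context-aware grouping step above is as clustering regions by a joint affinity that mixes appearance similarity with object-graph (context) similarity. The sketch below illustrates that idea with an equal-weight combination and off-the-shelf spectral clustering; both choices are assumptions for illustration, not the thesis's own method.

```python
# Sketch of grouping regions by a joint affinity that combines appearance
# similarity with object-graph (context) similarity. The equal weighting
# and the use of off-the-shelf spectral clustering are assumptions for
# illustration, not the thesis's own grouping procedure.
import numpy as np
from sklearn.cluster import SpectralClustering

def discover_groups(appearance_affinity, object_graph_affinity, n_groups, w=0.5):
    """Both affinities are symmetric (n x n) matrices with values in [0, 1]."""
    joint = w * appearance_affinity + (1 - w) * object_graph_affinity
    model = SpectralClustering(n_clusters=n_groups, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(joint)

rng = np.random.default_rng(0)
a = rng.random((6, 6)); a = (a + a.T) / 2  # toy symmetric appearance affinities
g = rng.random((6, 6)); g = (g + g.T) / 2  # toy symmetric object-graph affinities
print(discover_groups(a, g, n_groups=2))   # cluster label per region
```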

    Temporal Information Models for Real-Time Microblog Search

    Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single out the important ones from the rest of the social media chatter. This thesis proposes to leverage external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research where it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches: (1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate topic interest over time; (2) efficient methods for federated query expansion towards the improvement of query meaning; and (3) exploiting multiple sources towards the detection of temporal query intent. This work differs from past approaches in that it operates over real-time queries, leveraging live user-generated content. This contrasts with previous methods, which require an offline preprocessing step.
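
    A hedged sketch of temporal re-ranking with an external signal, in the spirit of approach (1): each result's text-relevance score is blended with a burst score computed from recent page-view counts for its associated entity. The linear combination, weights, and data layout are illustrative assumptions, not the thesis's model.

```python
# Sketch of temporal re-ranking with an external signal: each result's
# text-relevance score is blended with a burst score computed from recent
# page-view counts for its associated entity (e.g., a Wikipedia article).
# The weighting and data layout are illustrative assumptions.

def burst_score(page_views):
    """Ratio of the latest count to the average of the preceding counts."""
    history, latest = page_views[:-1], page_views[-1]
    baseline = max(sum(history) / len(history), 1.0)
    return latest / baseline

def rerank(results, page_views, beta=0.3):
    """results: (doc_id, relevance, entity); page_views: entity -> hourly counts."""
    scored = []
    for doc_id, rel, entity in results:
        burst = burst_score(page_views.get(entity, [1.0, 1.0]))
        score = (1 - beta) * rel + beta * min(burst, 10.0) / 10.0
        scored.append((doc_id, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)

views = {"earthquake": [40, 35, 50, 400], "election": [300, 320, 310, 305]}
docs = [("t1", 0.70, "election"), ("t2", 0.65, "earthquake")]
print(rerank(docs, views))  # the bursty "earthquake" tweet moves to the top
```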

    Mining Time-aware Actor-level Evolution Similarity for Link Prediction in Dynamic Network

    Topological evolution over time in a dynamic network triggers both the addition and deletion of actors and of the links among them. A dynamic network can be represented as a time series of network snapshots, where each snapshot represents the state of the network over an interval of time (for example, a minute, an hour or a day). The duration of each snapshot defines the temporal scale, or sliding window, of the dynamic network, and all the links within the duration of the window are aggregated together irrespective of their order in time. The inherent trade-off in selecting the timescale when analysing dynamic networks is that choosing a short temporal window may lead to chaotic changes in network topology and measures (for example, the actors' centrality measures and the average path length), whereas choosing a long window may compromise the study and investigation of network dynamics. Therefore, to facilitate the analysis and to understand different patterns of actor-oriented evolutionary aspects, it is necessary to define an optimal window length (temporal duration) with which to sample a dynamic network. In addition to determining the optimal temporal duration, another key task for understanding the dynamics of evolving networks is predicting the likelihood of future links among pairs of actors given the existing link structure at the present time. This is known as the link prediction problem in network science. Instead of considering a static state of a network whose topology does not change, dynamic link prediction attempts to predict emerging links by considering different types of historical/temporal information, for example the different types of temporal evolution experienced by the actors in a dynamic network due to topological evolution over time, known as actor dynamicities. Although there has been some success in developing various methodologies and metrics for dynamic link prediction, mining actor-oriented evolutions to address this problem has received little attention from the research community. In addition, the existing methodologies were developed without considering the sampling window size of the dynamic network, even though the sampling duration has a large impact on mining the network dynamics of an evolutionary network. Therefore, although the principal focus of this thesis is link prediction in dynamic networks, the determination of the optimal sampling window is also considered.
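
    A minimal sketch of the snapshot construction described above: timestamped links are aggregated into windows of a fixed length, discarding their order within each window. The one-hour window is purely illustrative; choosing this length is exactly the sampling trade-off the abstract discusses.

```python
# Sketch of slicing a timestamped edge stream into network snapshots of a
# fixed window length; links inside a window are aggregated and their order
# in time is discarded. The one-hour window is purely illustrative.
from collections import defaultdict

def build_snapshots(edges, window_seconds=3600):
    """edges: iterable of (timestamp, u, v); returns {window index: set of edges}."""
    snapshots = defaultdict(set)
    for ts, u, v in edges:
        idx = int(ts // window_seconds)        # which window this link falls into
        snapshots[idx].add(frozenset((u, v)))  # undirected, order-free aggregation
    return dict(snapshots)

stream = [(10, "a", "b"), (1200, "b", "c"), (4000, "a", "c"), (4100, "a", "b")]
print(build_snapshots(stream))  # two snapshots: window 0 and window 1
```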