212 research outputs found

    A Method to Discover Digital Collaborative Conversations in Business Collaborations

    Full text link
    Many companies have a suite of digital tools, such as Enterprise Social Networks, conferencing and document-sharing software, and email, to facilitate collaboration among employees. During, or at the end of, a collaboration, documents are often produced. People who were not involved in the initial collaboration often have difficulty understanding parts of their content because they lack the overall context. We argue that these tools (their content and their use) contain valuable contextual and collaborative knowledge that can be used to understand the document. Our goal is to rebuild the conversations that took place over a messaging service, and their links with a digital conferencing tool, during document production. The novelty of our approach is to combine several conversation-threading methods to identify interesting links between distinct conversations. Specifically, we combine header-field information with social, temporal, and semantic proximities. Our findings suggest the messaging service and the conferencing tool are used in a complementary way. The initial results confirm that combining different conversation-threading approaches is effective for detecting and constructing conversation threads from distinct digital conversations concerning the same document.
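
    The abstract does not spell out how the proximity signals are combined, so here is a minimal sketch, assuming simple message dictionaries and hand-picked weights, of how header-field, social, temporal, and semantic signals could be merged into one link score; the field names, the weights, and the use of SequenceMatcher as a crude stand-in for semantic proximity are illustrative assumptions, not the authors' method.

        from difflib import SequenceMatcher

        def link_score(msg_a, msg_b, weights=(0.4, 0.2, 0.2, 0.2)):
            """Combine threading signals into one same-conversation score.
            Message fields and weights are hypothetical."""
            w_hdr, w_soc, w_tmp, w_sem = weights

            # Header signal: explicit reply chain (In-Reply-To style).
            header = 1.0 if msg_b["in_reply_to"] == msg_a["id"] else 0.0

            # Social signal: Jaccard overlap of the participant sets.
            pa, pb = set(msg_a["participants"]), set(msg_b["participants"])
            social = len(pa & pb) / len(pa | pb) if pa | pb else 0.0

            # Temporal signal: decays with the gap between timestamps (hours).
            gap_hours = abs(msg_a["time"] - msg_b["time"]) / 3600.0
            temporal = 1.0 / (1.0 + gap_hours)

            # Semantic signal: crude text similarity as a proxy.
            semantic = SequenceMatcher(None, msg_a["text"], msg_b["text"]).ratio()

            return w_hdr * header + w_soc * social + w_tmp * temporal + w_sem * semantic

    A pair whose score exceeds some tuned threshold would then be linked into the same thread, which is how distinct conversations about the same document could be stitched together.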

    Why Forwarded Email Threads are Hard to Read: The Email Format as an Antecedent of Email Overload

    Get PDF
    Research has shown that excessive email use leads to feelings of being overwhelmed and stressed. Existing coping solutions that mitigate email overload address the number of emails and, in consequence, the time spent on email. These approaches are congruent with existing research on antecedents of email overload. Further coping solutions include addressing email threads; however, we lack a theoretical grounding for treating email threads as an antecedent of email overload. I suggest cognitive load theory as a means of investigating the format of forwarded email threads in an experiment. I found support for effects on reading time and on performance in terms of correct answers per second, findings that confirm that forwarded email threads are an antecedent of email overload and that we need a new conceptualization of email overload.

    A New Email Retrieval Ranking Approach

    Full text link
    The email retrieval task has recently attracted much attention, as it helps users retrieve the emails related to a submitted query. To the best of our knowledge, existing email retrieval ranking approaches sort the retrieved emails based on heuristic rules, which are either search clues or predefined user criteria rooted in email fields. Unfortunately, the user usually does not know which rule yields the best ranking for his or her query. This paper presents a new email retrieval ranking approach to tackle this problem. It ranks the retrieved emails based on a scoring function that depends on crucial email fields, namely subject, content, and sender. The paper also proposes an architecture that allows every user in a network or group of users, where permitted, to discover the most important senders in the network who are interested in the words of his or her submitted query. The experimental evaluation on the Enron corpus shows that our approach outperforms known email retrieval ranking approaches.
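
    The paper's actual scoring function is not reproduced in the abstract; the sketch below assumes a simple weighted term-overlap score over the three fields the abstract names (subject, content, sender), with the weights and the overlap measure being illustrative placeholders rather than the paper's formula.

        from collections import Counter

        def field_score(query_terms, field_text):
            # Fraction of query terms that appear in the field.
            tokens = Counter(field_text.lower().split())
            return (sum(1 for t in query_terms if tokens[t] > 0) / len(query_terms)
                    if query_terms else 0.0)

        def rank_emails(query, emails, w_subject=0.5, w_content=0.3, w_sender=0.2):
            # Rank emails by a weighted score over subject, content, and sender.
            terms = query.lower().split()
            scored = [(w_subject * field_score(terms, e["subject"])
                       + w_content * field_score(terms, e["content"])
                       + w_sender * field_score(terms, e["sender"]), e)
                      for e in emails]
            return [e for score, e in sorted(scored, key=lambda p: p[0], reverse=True)]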

    IDENTITY RESOLUTION IN EMAIL COLLECTIONS

    Get PDF
    Access to historically significant email collections poses challenges that arise less often in personal collections. Most notably, people exploring a large collection of emails that they did not send or receive may not be very familiar with the discussions in that collection. They would not only need to focus on understanding the topical content of those discussions, but would also find it useful to understand who the people sending, receiving, or mentioned in these discussions were. In this dissertation, the problem of resolving personal identity in the context of large email collections is tackled. In such collections, a common name (e.g., John) might easily refer to any one of several hundred people; when one of these people is mentioned in an email, the question then arises: "who is that John?" To "resolve the identity" of people in an email collection, two problems need to be solved: (1) modeling the identity of the participants in that collection, and (2) resolving name-mentions (appearing in the body of messages) to these identities. To tackle the first problem, a simple computational model of identity is presented, built on extracting unambiguous references (e.g., full names from headers, or nicknames from free-text signatures) to people from the whole collection. To tackle the second problem, a generative probabilistic approach that leverages the identity model to resolve mentions is presented. The approach is motivated by intuitions about the way people refer to others in email; it expands the context surrounding a mention in four directions: the message where the mention was observed, the thread that includes that message, topically related messages, and messages sent or received by the original communicating parties. It relies on less ambiguous references (e.g., email addresses or full names) observed in some context of a given mention to rank potential referents of that mention. In order to jointly resolve all mentions in the collection, a parallel implementation is presented using the MapReduce distributed-programming framework. The implementation decomposes the resolution process into subcomponents that fit the MapReduce task model well. At the heart of that implementation, a parallel algorithm for efficient computation of pairwise document similarity in large collections is proposed as a general solution that can be used for scalable context expansion of all mentions, as well as for other applications. The resolution approach compares favorably with previously reported techniques on the small test collections (sets of mention-queries that were manually resolved beforehand) used to evaluate the task in the literature. However, the mention-queries in those collections, besides being relatively few in number, are limited in that they all refer to people for whom a substantial amount of evidence would be expected to be available in the collection, thus omitting the "long tail" of the identity distribution for which less evidence is available. This motivated the development of a new test collection that is now the largest and best-balanced test collection available for the task. To build this collection, a user study was conducted that also provided some insight into the difficulty of the task, how time-consuming it is when humans perform it, and the reliability of human task performance.
The study revealed that at least 80% of the 584 annotated mentions were resolvable to people who had sent or received email within the same collection. The new test collection was used to experimentally evaluate the resolution system. The results highlight the importance of the social context (which includes messages sent or received by the original communicating parties) when resolving mentions in email. Moreover, the results show that combining evidence from multiple types of context yields better resolution than can be achieved using any individual context. The one-best selection is correct 74% of the time when tested on the full set of mention-queries, and 51% of the time when tested on the mention-queries labeled as "hard" by the annotators. Experiments with iterative reformulation of the resolution algorithm yielded modest gains only in the second iteration of the social context expansion.
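
The pairwise document similarity algorithm is the part that generalizes beyond this task; below is a single-process sketch of its postings-based decomposition, assuming raw term frequencies as weights. Each term's postings list contributes w_a * w_b to every document pair it contains, and summing those contributions yields dot products; a MapReduce job would simply distribute these two loops across machines.

        from collections import defaultdict
        from itertools import combinations

        def pairwise_similarity(docs):
            # "Map" phase: build postings of (doc_id, term_weight) per term.
            postings = defaultdict(list)
            for doc_id, text in docs.items():
                weights = defaultdict(int)
                for term in text.lower().split():
                    weights[term] += 1          # raw term frequency as weight
                for term, w in weights.items():
                    postings[term].append((doc_id, w))

            # "Reduce" phase: per-term contributions summed per document pair.
            sims = defaultdict(float)
            for plist in postings.values():
                for (a, wa), (b, wb) in combinations(plist, 2):
                    sims[tuple(sorted((a, b)))] += wa * wb
            return dict(sims)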

    Crowdsource Annotation and Automatic Reconstruction of Online Discussion Threads

    Get PDF
    Modern communication relies on electronic messages organized in the form of discussion threads. Emails, IMs, SMS, website comments, and forums are all composed of threads, which consist of individual user messages connected by metadata and discourse coherence to messages from other users. Threads are used to display user messages effectively in a GUI such as an email client, providing a background context for understanding a single message. Many messages are meaningless without the context provided by their thread. However, a number of factors may result in missing thread structure, ranging from user mistakes (replying to the wrong message), to missing metadata (some email clients do not produce or save headers that fully encapsulate thread structure, and conversion of archived threads from one repository to another may also result in lost metadata), to covert use (users may avoid metadata to render discussions difficult for third parties to understand). In the field of security, law enforcement agencies may obtain vast collections of discussion turns that require automatic thread reconstruction to be understood. For example, the Enron Email Corpus, obtained by the Federal Energy Regulatory Commission during its investigation of the Enron Corporation, has no inherent thread structure. In this thesis, we use natural language processing approaches to reconstruct threads from message content. Reconstruction based on message content sidesteps the problem of missing metadata, permitting post hoc reorganization and discussion understanding. We investigate corpora of email threads and Wikipedia discussions. However, there is a scarcity of annotated corpora for this task, so we also investigate issues faced when creating crowdsourced datasets and learning statistical models of them. Several of our findings are applicable to other natural language machine classification tasks beyond thread reconstruction. We divide our investigation of discussion thread reconstruction into two parts. First, we explore techniques needed to create a corpus for our thread reconstruction research. Like other NLP pairwise classification tasks, such as Wikipedia discussion turn/edit alignment and sentence-pair text similarity rating, email thread disentanglement is a heavily class-imbalanced problem, and although the advent of crowdsourcing has reduced annotation costs, the common practice of crowdsourcing redundancy is too expensive for class-imbalanced tasks. As the first contribution of this thesis, we evaluate alternative strategies for reducing crowdsourcing annotation redundancy for class-imbalanced NLP tasks. We also examine techniques to learn the best machine classifier from our crowdsourced labels. In order to reduce noise in training data, most natural language crowdsourcing annotation tasks gather redundant labels and aggregate them into an integrated label, which is provided to the classifier. However, aggregation discards potentially useful information from linguistically ambiguous instances. For the second contribution of this thesis, we show that, for four of five natural language tasks, filtering the training dataset based on crowdsource annotation item agreement improves task performance, while soft labeling based on crowdsource annotations does not. Second, we investigate thread reconstruction as divided into the tasks of thread disentanglement and adjacency recognition.
We present the Enron Threads Corpus, a newly extracted corpus of 70,178 multi-email threads built from emails in the Enron Email Corpus, in which emails are not sorted by thread. To disentangle these threads, and as the third contribution of this thesis, we perform pairwise classification using text similarity measures on the non-quoted text in emails. We show that i) content text similarity metrics outperform style and structure text similarity metrics in both class-balanced and class-imbalanced settings, and ii) although feature performance depends on the semantic similarity of the corpus, content features remain effective even when controlling for semantic similarity. To reconstruct threads, it is also necessary to identify adjacency relations among pairs. For the forum of Wikipedia discussions, metadata is not available, and dialogue act typologies, helpful in other domains, are inapplicable. As our fourth contribution, we show through our experiments that adjacency pair recognition can be performed using lexical pair features, without a dialogue act typology or metadata, and that this is robust to controlling for topic bias of the discussions. Yet lexical pair features do not effectively model the lexical semantic relations between adjacency pairs. To model lexical semantic relations, and as our fifth contribution, we perform adjacency recognition using extracted keyphrases enhanced with semantically related terms. While this technique outperforms a most-frequent-class baseline, it fails to outperform lexical pair features or tf-idf weighted cosine similarity. Our investigation shows that this is the result of poor word sense disambiguation and poor keyphrase extraction causing spurious false-positive semantic connections. In concluding this thesis, we also reflect on open issues and unanswered questions remaining after our research contributions, discuss applications for thread reconstruction, and suggest directions for future work.
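
As a concrete illustration of the disentanglement step, the sketch below treats tf-idf cosine similarity of non-quoted email bodies as a one-feature pairwise classifier; the thesis combines many similarity metrics in a trained classifier, so the single metric and the fixed threshold here are simplifying assumptions.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def same_thread_pairs(bodies, threshold=0.35):
            # bodies: list of non-quoted email texts; threshold is illustrative.
            X = TfidfVectorizer(stop_words="english").fit_transform(bodies)
            sims = cosine_similarity(X)
            return [(i, j, float(sims[i, j]))
                    for i in range(len(bodies))
                    for j in range(i + 1, len(bodies))
                    if sims[i, j] >= threshold]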

    Crimean Rhetorical Sovereignty: Resisting A Deportation Of Identity

    Get PDF
    On a small contested part of the world, the peninsula of Crimea, once a part of the former Soviet Union, lives a people who have endured genocide and who have struggled to carve out an identity in a land once their own. They are the Crimean Tatar. Even their name, an exonym promoting the Crimeans’ “peripheral status” (Powell) and their ensuing “cultural schizophrenia” (Vizenor), bears witness to the otherization they have withstood throughout the centuries. However, despite attempts to relegate them to the history books, Crimeans are alive and well in the “motherland,” though not without difficulty. Having been forced to reframe their identities because of numerous imperialist, colonialist, and Soviet behaviors and policies, many have resisted, first and foremost through rhetorical sovereignty: the ability to reframe Crimean Tatar identity through Crimean Tatar rhetoric. This negotiation of identity through rhetoric has included a fierce defense of their language and culture in what Malea Powell calls a “war with homogeneity,” a struggle for identification based on resistance. This thesis seeks to understand the rhetorical function of naming practices as acts that inscribe material meaning and perform marginalization or resistance within the context of Crimea-L, a Yahoo! Groups listserv, as well as within immediate and remote Crimean history. To analyze the rhetoric of marginalization and resistance in naming practices, I use the Discourse Historical Approach (DHA) to Critical Discourse Analysis (CDA) on recently archived discourses. Ruth Wodak’s DHA strategies are reappropriated as Naming Practice Strategies, depicting efforts at otherization or rhetorical sovereignty.

    Detecting worm mutations using machine learning

    Get PDF
    Worms are malicious programs that spread over the Internet without human intervention. Since worms generally spread faster than humans can respond, the only viable defence is to automate their detection. Network intrusion detection systems typically detect worms by examining packet or flow logs for known signatures. Not only does this approach mean that new worms cannot be detected until the corresponding signatures are created, but also that mutations of known worms will remain undetected, because each mutation usually has a different signature. The intuitive and seemingly most effective solution is to write more generic signatures, but this has been found to increase false alarm rates and is thus impractical. This dissertation investigates the feasibility of using machine learning to automatically detect mutations of known worms. First, it investigates whether Support Vector Machines can detect mutations of known worms. Support Vector Machines have been shown to be well suited to pattern recognition tasks such as text categorisation and hand-written digit recognition. Since detecting worms is effectively a pattern recognition problem, this work investigates how well Support Vector Machines perform at this task. The second part of this dissertation compares Support Vector Machines to other machine learning techniques in detecting worm mutations. Gaussian Processes, unlike Support Vector Machines, automatically return confidence values as part of their result. Since confidence values can be used to reduce false alarm rates, this dissertation determines how Gaussian Processes compare to Support Vector Machines in terms of detection accuracy. For further comparison, this work also compares Support Vector Machines to K-nearest neighbours, known for its simplicity and solid results in other domains. The third part of this dissertation investigates the automatic generation of training data. Classifier accuracy depends on good-quality training data -- the wider the training data spectrum, the higher the classifier's accuracy. This dissertation describes the design and implementation of a worm mutation generator whose output is fed to the machine learning techniques as training data, and then evaluates whether this training data can be used to train classifiers of sufficiently high quality to detect worm mutations. The findings of this work demonstrate that Support Vector Machines can be used to detect worm mutations, and that the optimal configuration is a linear kernel with unnormalised bi-gram frequency counts. Moreover, the results show that Gaussian Processes and Support Vector Machines exhibit similar average accuracy in detecting worm mutations, while K-nearest neighbours consistently produces lower-quality predictions. The generated worm mutations are shown to be of sufficiently high quality to serve as training data. Combined, the results demonstrate that machine learning is capable of accurately detecting mutations of known worms.
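
    The reported optimal configuration (a linear kernel over unnormalised bi-gram frequency counts) maps directly onto standard tooling; a minimal sketch follows, assuming worm payloads are available as strings and using character bi-grams as the feature unit, which is an assumption on my part rather than a detail given in the abstract.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.svm import LinearSVC

        def train_detector(payloads, labels):
            # Unnormalised bi-gram frequency counts, linear-kernel SVM.
            vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
            X = vectorizer.fit_transform(payloads)
            clf = LinearSVC().fit(X, labels)
            return vectorizer, clf

        def predict(vectorizer, clf, payloads):
            # Label unseen traffic; 1 = worm (mutation), 0 = benign.
            return clf.predict(vectorizer.transform(payloads))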

    Towards Real-time Remote Processing of Laparoscopic Video

    Get PDF
    Laparoscopic surgery is a minimally invasive technique in which surgeons insert a small video camera into the patient's body to visualize internal organs and use small tools to perform procedures. However, the benefit of small incisions comes with the disadvantage of limited visualization of subsurface tissues. Image-guided surgery (IGS) uses pre-operative and intra-operative images to map subsurface structures and can reduce the limitations of laparoscopic surgery. One particular laparoscopic system is the da Vinci Si robotic surgical vision system, whose video streams generate approximately 360 megabytes of data per second, demonstrating a trend toward increased data sizes in medicine, primarily due to higher-resolution video cameras and imaging equipment. Processing this large stream of data in real time on a bedside PC, in a single- or dual-node setup, may be challenging, and a high-performance computing (HPC) environment is not typically available at the point of care. To process this data on remote HPC clusters at the typical rate of 30 frames per second (fps), each 11.9 MB (1080p) video frame must be processed by a server and returned within the time the frame is displayed, or 1/30th of a second. The ability to acquire, process, and visualize data in real time is essential for the performance of complex tasks as well as for minimizing risk to the patient. We have implemented compression, segmentation, and registration algorithms on Clemson's Palmetto supercomputer, using dual Nvidia graphics processing units (GPUs) per node and the Compute Unified Device Architecture (CUDA) programming model, and compared their performance. We developed three separate applications that run simultaneously: video acquisition, image processing, and video display. The image processing application allows several algorithms to run simultaneously on different cluster nodes and transfers images through the Message Passing Interface (MPI). Our segmentation and registration algorithms achieved acceleration factors of roughly 2 and 8 times, respectively. To achieve a higher frame rate, we also resized images to reduce the overall processing time. As a result, using a high-speed network to access GPU-equipped computing clusters and running these algorithms in parallel will improve surgical procedures by providing real-time processing of medical images and laparoscopic data.
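
    The abstract describes distributing frames from an acquisition application to processing nodes over MPI; the sketch below, using mpi4py, shows a simplified round-robin version of that idea, with stub functions in place of the real frame grabber and GPU kernels, and without the pipelining a 30 fps deadline would actually require.

        import numpy as np
        from mpi4py import MPI

        def acquire_frame():
            # Stand-in for the real 1080p laparoscopic frame grabber.
            return np.zeros((1080, 1920, 3), dtype=np.uint8)

        def process_frame(frame):
            # Stand-in for the GPU segmentation/registration kernel.
            return frame

        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        assert size >= 2, "run with at least 2 ranks, e.g. mpiexec -n 4"

        if rank == 0:
            # Acquisition/display node: round-robin frames across workers.
            for frame_id in range(30):
                worker = 1 + frame_id % (size - 1)
                comm.send(acquire_frame(), dest=worker, tag=frame_id)
                processed = comm.recv(source=worker, tag=frame_id)
                # ...hand `processed` to the display application here...
            for worker in range(1, size):
                comm.send(None, dest=worker, tag=0)   # shutdown sentinel
        else:
            # Worker node: receive a frame, process it, send it back.
            status = MPI.Status()
            while True:
                frame = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
                if frame is None:
                    break
                comm.send(process_frame(frame), dest=0, tag=status.Get_tag())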