97 research outputs found

    People-search : searching for people sharing similar interests from the web

    On the Web, there are limited ways of finding people who share similar interests or background with a given person. Current methods, such as using regular search engines, are either ineffective or time-consuming. In this work, a new approach for searching for people sharing similar interests on the Web, called People-Search, is presented. Given a person, finding similar people on the Web raises two major research issues: person representation and person matching. In this study, a person representation method is proposed that uses a person's website to represent that person's interests and background. The design of the matching process takes person representation into consideration, so that the same representation can be used when composing the query, which is also a personal website. Based on this person representation method, the main proposed algorithm integrates the textual content and hyperlink information of all the pages belonging to a personal website to represent a person and to match persons. Other algorithms, based on different combinations of content, inlink, and outlink information of an entire personal website or of only the main page, are also explored and compared to the main proposed algorithm. Two kinds of evaluations were conducted. In the automatic evaluation, precision, recall, F and Kruskal-Goodman F measures were used to compare these algorithms. In the human evaluation, the effectiveness of the main proposed algorithm and two other important ones was evaluated by human subjects. Results from both evaluations show that the People-Search algorithm integrating content and link information of all pages belonging to a personal website outperformed all other algorithms in finding similar people on the Web.
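    As a rough illustration of the precision, recall and F measures named in the abstract above, here is a minimal sketch. The function name, the sample data, and the choice of the balanced F1 variant are assumptions made for illustration; the abstract does not say how the study computed these measures.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    `retrieved` is the collection of people an algorithm returned and
    `relevant` is the collection judged truly similar (both hypothetical).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # correctly retrieved people
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    # Balanced F measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical example: 4 people returned, 3 truly similar, 2 in common.
    print(precision_recall_f(["a", "b", "c", "d"], ["a", "c", "e"]))
```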

    Data Sets: Word Embeddings Learned from Tweets and General Data

    A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification tasks.

    Rumor Detection on Social Media: Datasets, Methods and Opportunities

    Social media platforms have been used for information and news gathering, and they are very valuable in many applications. However, they also lead to the spreading of rumors and fake news. Many efforts have been made to detect and debunk rumors on social media by analyzing their content and social context using machine learning techniques. This paper gives an overview of the recent studies in the rumor detection field. It provides a comprehensive list of datasets used for rumor detection, and reviews the important studies based on what types of information they exploit and the approaches they take. More importantly, we also present several new directions for future research. Comment: 10 pages

    Event Detection from Social Media Stream: Methods, Datasets and Opportunities

    Social media streams contain a large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening in real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text and is a research area that has attracted much attention in recent years. In this paper, we survey a wide range of event detection methods for Twitter data streams, helping readers understand the recent developments in this area. We present the datasets available to the public. Furthermore, a few research opportunities … Comment: 8 pages

    Generating Better Concept Hierarchies Using Automatic Document Classification

    This paper presents a hybrid concept hierarchy development technique for web documents retrieved by a meta-search engine. The aim of the technique is to separate the initially retrieved documents into topically oriented categories, prior to the actual concept hierarchy generation. The topical categories correspond to different semantic aspects of the query. This is done using a 1-of-n automatic document classification on the initial set of returned documents. Then, an individual topical concept hierarchy is automatically generated inside each of the resulting categories. Both steps are executed on the fly at retrieval time. Due to the efficiency constraints imposed by the web retrieval context, the algorithm only uses document snippets (rather than full web pages) for both document classification and concept hierarchy generation. Experimental results show that the algorithm is able to improve the quality of the concept hierarchy presented to the searcher; at the same time, the efficiency parameters are kept within reasonable intervals.