
    Analyzing and Comparing On-Line News Sources via (Two-Layer) Incremental Clustering

    In this paper, we analyse the contents of the web sites of two Italian press agencies and of four of the most popular Italian newspapers, in order to answer questions such as: which news stories are the most relevant, what is the average lifetime of a news story, and how much the different sites differ from one another. To this end, we have developed a web-based application which hourly collects the articles in the main column of the six web sites, runs an incremental clustering algorithm that groups the articles into news stories, and finally allows the user to see the answers to the above questions. We have also designed and implemented a two-layer modification of the incremental clustering algorithm and carried out a preliminary experimental evaluation of this modification: it turns out that the two-layer clustering is extremely efficient in terms of running time, and achieves quite good precision and recall.
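    The abstract does not spell out the clustering procedure, but a single-pass incremental scheme of the kind it describes can be sketched as follows. The bag-of-words representation, cosine similarity, and the THRESHOLD value are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of single-pass incremental clustering: each new article
# joins the most similar existing cluster, or starts a new one.
from collections import Counter
import math

THRESHOLD = 0.5  # hypothetical similarity cutoff

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

clusters = []  # each cluster: {"centroid": Counter, "articles": [str, ...]}

def add_article(text: str) -> None:
    vec = Counter(text.lower().split())
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = cosine(vec, cluster["centroid"])
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= THRESHOLD:
        best["articles"].append(text)
        # Running sum of vectors; its direction matches the mean's, so
        # cosine similarity is unaffected by not dividing by the count.
        best["centroid"] += vec
    else:
        clusters.append({"centroid": vec, "articles": [text]})
```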

    Cluster-Based News Representative Generation with Automatic Incremental Clustering

    Nowadays, a large volume of news circulates on the Internet each day, amounting to more than two thousand articles. Many of these articles share the same topic and content, trapping readers among different sources that say similar things. This research proposes a new approach that automatically provides a representative news article through the Automatic Incremental Clustering method. The method begins with Data Acquisition, Keyword Extraction, and Metadata Aggregation, which together produce a news metadata matrix whose columns are word types and whose rows are news articles. The articles in the matrix are then grouped by the Automatic Incremental Clustering method according to the number of shared words, calculated with the Euclidean distance, automatically and in real time. For each cluster (topic), the article closest to the cluster midpoint/centroid is selected as the Representative News. The study used 101 news articles as experimental data and produced 87 news clusters with a precision of 85.14%.
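    As a small illustration of the representative-selection step described above (the article closest to the cluster centroid under Euclidean distance), here is a minimal sketch. The matrix layout follows the abstract (rows are news items, columns are word counts); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def representatives(matrix: np.ndarray, labels: np.ndarray) -> dict:
    """Return, per cluster label, the row index of the item nearest the centroid."""
    reps = {}
    for label in np.unique(labels):
        rows = np.where(labels == label)[0]
        centroid = matrix[rows].mean(axis=0)
        dists = np.linalg.norm(matrix[rows] - centroid, axis=1)  # Euclidean distance
        reps[label] = rows[np.argmin(dists)]
    return reps
```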

    Text Mining Infrastructure in R

    During the last decade, text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We give a survey of text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification, and string kernels.
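    The tm package itself is R code; here is a rough Python analogue, using scikit-learn, of the count-based analysis step the abstract mentions: turning a corpus into a document-term matrix. This is not the tm API, only a parallel illustration of the same task.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text mining has become a widely used discipline.",
        "The tm package provides a text mining framework for R."]
vectorizer = CountVectorizer(stop_words="english", lowercase=True)
dtm = vectorizer.fit_transform(docs)       # sparse document-term matrix
print(vectorizer.get_feature_names_out())  # the term vocabulary
print(dtm.toarray())                       # term counts per document
```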

    Web-scale provenance reconstruction of implicit information diffusion on social media

    Fast, massive, and viral data diffused on social media affects a large share of the online population, and thus the (prospective) information diffusion mechanisms behind it are of great interest to researchers. The (retrospective) provenance of such data is equally important because it contributes to the understanding of the relevance and trustworthiness of the information. Furthermore, computing provenance in a timely way is crucial for particular use cases and practitioners, such as online journalists who promptly need to assess specific pieces of information. Social media currently provide insufficient mechanisms for provenance tracking, publication, and generation, while state-of-the-art social media research focuses mainly on explicit diffusion mechanisms (like retweets on Twitter or reshares on Facebook). Implicit diffusion mechanisms remain understudied because they are difficult to capture and to understand properly. On the technical side, the state of the art in provenance reconstruction evaluates small datasets after the fact, sidestepping the scale and speed requirements of current social media data. In this paper, we investigate the mechanisms of implicit information diffusion by computing its fine-grained provenance. We prove that explicit mechanisms are insufficient to capture influence, and our analysis unravels a significant part of the implicit interactions and influence in social media. Our approach works incrementally and can be scaled up to a truly Web-scale scenario such as major events. We can process datasets consisting of up to several million messages on a single machine at rates that cover bursty behaviour, without compromising result quality. In doing so, we provide online journalists, and social media users in general, with fine-grained provenance reconstruction that sheds light on implicit interactions not captured by social media providers. These results are provided in an online fashion, which also allows for fast relevance and trustworthiness assessment.
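    The paper's actual pipeline is more elaborate, but the core step of incrementally linking each incoming message to a sufficiently similar earlier one, yielding a provenance edge for implicit diffusion, can be sketched as follows. Jaccard similarity over token sets, the sliding-window size, and the threshold are illustrative assumptions.

```python
from collections import deque

WINDOW = 10_000       # hypothetical number of recent messages kept as candidates
SIM_THRESHOLD = 0.6   # hypothetical similarity cutoff for emitting an edge

recent = deque(maxlen=WINDOW)  # (msg_id, token_set); oldest entries drop off
provenance = []                # reconstructed edges: (source_id, msg_id)

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

def ingest(msg_id: str, text: str) -> None:
    tokens = set(text.lower().split())
    best, best_sim = None, 0.0
    for src_id, src_tokens in recent:
        sim = jaccard(tokens, src_tokens)
        if sim > best_sim:
            best, best_sim = src_id, sim
    if best is not None and best_sim >= SIM_THRESHOLD:
        provenance.append((best, msg_id))  # likely implicit diffusion edge
    recent.append((msg_id, tokens))
```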

    Document Based Clustering For Detecting Events in Microblogging Websites

    Social media has a great influence on our daily lives. People share their opinions, stories, and news, and broadcast events using social media, which results in vast amounts of information on these platforms. Identifying and organizing the interesting events within such massive volumes of data is cumbersome: browsing, searching, and monitoring events becomes more and more challenging. A lot of work has been done in the area of topic detection and tracking (TDT). Most of these methods are based on either single-modality information (e.g., text or images) or multi-modality information. In single-modality analysis, many existing methods adopt visual information (e.g., images and videos) or textual information (e.g., names, time references, locations, titles, tags, and descriptions) in isolation to model event data for event detection and tracking. This problem can be addressed by a novel multi-modal social event tracking and evolution framework that not only effectively captures the events but also generates summaries of these events over time. We propose a novel method based on mmETM, which can effectively model social documents that include long text along with images. It learns the similarities between the textual and visual modalities to separate the visual and non-visual representative topics. To apply our method to social event tracking, we adopted an incremental learning technique for mmETM, which yields informative textual and visual topics of an event in social media over time. To validate our work, we used a sample data set and conducted various experiments on it. Both subjective and quantitative assessments show that the proposed mmETM technique performs favourably against several state-of-the-art techniques.
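    mmETM itself is a full multi-modal topic model; the toy sketch below only illustrates the underlying idea of scoring a document against an event with a weighted combination of textual and visual similarity. The weight ALPHA and all names are hypothetical, not taken from the paper.

```python
import numpy as np

ALPHA = 0.7  # hypothetical weight on the textual modality

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def event_score(doc_text: np.ndarray, doc_img: np.ndarray,
                event_text: np.ndarray, event_img: np.ndarray) -> float:
    """Fuse the two modalities into a single document-to-event score."""
    return ALPHA * cosine(doc_text, event_text) + (1 - ALPHA) * cosine(doc_img, event_img)
```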

    Machine Learning for Financial Prediction Under Regime Change Using Technical Analysis: A Systematic Review

    Recent crises, recessions, and bubbles have stressed the non-stationary nature of, and the presence of drastic structural changes in, the financial domain. The most recent literature suggests the use of conventional machine learning and statistical approaches in this context. Unfortunately, several of these techniques are unable or slow to adapt to changes in the price-generation process. This study aims to survey the relevant literature on machine learning for financial prediction under regime change, employing a systematic approach. It reviews key papers with a special emphasis on technical analysis. The study discusses the growing number of contributions that are bridging the gap between two separate communities, one focused on data stream learning and the other on economic research. However, it also makes apparent that we are still at an early stage. The range of machine learning algorithms that have been tested in this domain is very wide, but the results of the study do not suggest that any specific technique is currently clearly dominant.

    Web information search and sharing:

    Degree information: new system; report number: Kō 2735; type of degree: Doctor of Human Sciences; date conferred: 2009/3/15; Waseda University degree record number: Shin 493

    Analyzing Temporal Patterns in Phishing Email Topics

    In 2020, the Federal Bureau of Investigation (FBI) found phishing to be the most common cybercrime, with a record number of complaints from Americans reporting losses exceeding $4.1 billion. Various phishing prevention methods exist; however, these methods are usually reactionary in nature, as they activate only after a phishing campaign has been launched. Priming people ahead of time with knowledge of which phishing topic is more likely to occur could be an effective proactive phishing prevention strategy. It has been noted that the volume of phishing emails tends to increase around key calendar dates and during times of uncertainty. This thesis aimed to create a classifier to predict which phishing topics have an increased likelihood of occurring in relation to an external event. After distilling around 1.2 million phishing emails until only meaningful words remained, a Latent Dirichlet Allocation (LDA) topic model uncovered 90 latent phishing topics. On average, human evaluators agreed with the composition of a topic 74% of the time in one of the phishing topic evaluation tasks, showing an accordance of human judgment with the topics produced by the LDA model. Each topic was turned into a time series by creating a frequency count over the dataset's two-year timespan. This time series was then converted into an intensity count to highlight the days of increased phishing activity. All phishing topics were analyzed and reviewed for influencing events. After the review, ten topics were identified as having external events that could plausibly have influenced their respective intensities. After performing the intervention analysis, none of the selected topics were found to correlate with their identified external events. The analysis stopped here, and no predictive classifiers were pursued. With this dataset, temporal patterns coupled with external events were not able to predict the likelihood of a phishing attack.
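    As a small illustration of the first two steps described above (fitting an LDA topic model, then turning a topic's occurrences into a daily frequency series), here is a sketch on a toy corpus. The thesis used roughly 1.2 million emails and 90 topics; the placeholder emails, dates, and two-topic model below are only meant to run.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for the distilled phishing corpus and its timestamps.
emails = ["verify your account password now",
          "invoice attached payment overdue",
          "reset your account password today",
          "urgent invoice payment required"]
dates = pd.to_datetime(["2020-01-01", "2020-01-01",
                        "2020-01-02", "2020-01-03"])

dtm = CountVectorizer(stop_words="english").fit_transform(emails)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)  # 90 in the thesis

dominant = lda.transform(dtm).argmax(axis=1)  # dominant topic per email
topic0 = pd.Series((dominant == 0).astype(int), index=dates)
daily = topic0.resample("D").sum()            # per-day frequency count for topic 0
print(daily)
```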