8 research outputs found

    Research on Application of Distributed Clustering Algorithms Based on MapReduce in Social Networking Services

    In the age of information explosion, the era of big data, the ways people live, work, and think are gradually changing. For data analysis, traditional sampling methods run counter to the growth of data volume; using all the data instead of a random sample has become a demand of the times. Relying on Moore's Law alone to raise computing performance is far from enough to meet this goal, so elastic computing architectures such as cloud computing have attracted growing attention. Social networking services, an important and successful application in the history of the Internet, are also one of the major data sources of the big data era, a huge asset for SNS providers themselves, for their commercial partners, and even for social science research.

    Addressing the lack of automatic topic detection and classification on mainstream domestic microblogging sites, this dissertation builds on distributed clustering algorithms and information retrieval techniques, combined with a semantic similarity model, to implement an application that clusters microblog posts into topics by content and, on that basis, recommends posts on similar topics to users. The main work of the dissertation is as follows:

    First, it studies the basic principles of the MapReduce programming model, analyzes how the open-source Hadoop framework implements its workflow, fault tolerance, and task scheduling, and discusses the core ideas and basic workflow of MapReduce for big data processing.

    Second, it explains the principles of the two classical clustering algorithms k-Means and Canopy and how they are combined in practice, then studies the feasibility of and strategies for parallelizing them and implements them on the MapReduce model (see the sketch below).

    Finally, it summarizes the respective strengths and weaknesses of the vector space model and semantic methods for text similarity, proposes a text similarity measure combining TF-IDF with word semantics, describes its idea and computation in detail, and uses it as the distance metric for microblog text clustering.

    Experimental results show that the techniques and methods in the dissertation are practicable: they identify topics in microblogs fairly effectively and give users targeted recommendations and feedback, changing how users browse microblogs.

    Degree: Master of Engineering. Department and major: School of Software, Software Engineering. Student ID: 2432011115228
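    The abstract describes a parallel k-Means on MapReduce but includes no code. As a rough, hypothetical illustration of how one k-Means iteration decomposes into map and reduce steps, here is a minimal Python sketch; the function names and toy data are ours, not the dissertation's, and a real Hadoop job would also carry the Canopy seeding and the TF-IDF/semantic distance the abstract mentions.

    # One k-Means iteration phrased as map and reduce steps (illustration only).
    import math
    from collections import defaultdict

    def map_phase(points, centroids):
        # Map: emit (nearest-centroid-index, point) for every input point.
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            yield idx, p

    def reduce_phase(pairs, centroids):
        # Reduce: average the points grouped under each centroid index.
        k, dim = len(centroids), len(centroids[0])
        sums = defaultdict(lambda: [0.0] * dim)
        counts = defaultdict(int)
        for idx, p in pairs:
            for d in range(dim):
                sums[idx][d] += p[d]
            counts[idx] += 1
        # Keep the old centroid if no point was assigned to it.
        return [[s / counts[i] for s in sums[i]] if counts[i] else centroids[i]
                for i in range(k)]

    points = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (7.5, 9.5)]
    cents = [[0.0, 0.0], [10.0, 10.0]]
    for _ in range(5):  # fixed iteration count instead of a convergence test
        cents = reduce_phase(map_phase(points, cents), cents)
    print(cents)

    On Hadoop, map_phase and reduce_phase would become the Mapper and Reducer of a job that the driver reruns until the centroids stabilize.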

    Research and Implementation of an Image Retrieval System on the Hadoop Platform Using an Improved K-Means Clustering Algorithm

    Modern life has entered the mobile Internet era, and the popularization and widespread application of mobile Internet devices has greatly benefited people's daily life, study, and many other aspects. At the same time, large amounts of information from every walk of life are being digitized and accumulated as multimedia. Images, as one of the most basic kinds of multimedia information, are easy to understand and use, and the demand for image retrieval has evolved from retrieving images by textual description to retrieving similar images by image content.

    Image retrieval has long been a research hotspot in computing; by what is searched, it can be divided into text-based and content-based image retrieval. The main subject of this thesis is the research and implementation of content-based retrieval over massive image collections using big data techniques.

    From the data perspective, a content-based image retrieval system must address the storage and fast processing of massive image data, the two most important ...

    Degree: Master of Engineering. Department and major: School of Information Science and Technology, Computer Technology. Student ID: 2302011115305
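    The abstract is cut off before any method detail, so the following is only a generic sketch of the pattern it names: k-Means over image feature vectors, with retrieval confined to the nearest cluster. The 128-dimensional random features, cluster count, and helper names are our assumptions, not the thesis's design.

    # Cluster image features with k-Means, then answer a query by scanning
    # only the nearest cluster rather than the full corpus (sketch only).
    import numpy as np

    def kmeans(feats, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        cents = feats[rng.choice(len(feats), k, replace=False)]
        for _ in range(iters):
            assign = np.argmin(
                ((feats[:, None, :] - cents[None, :, :]) ** 2).sum(-1), axis=1)
            for i in range(k):
                if (assign == i).any():
                    cents[i] = feats[assign == i].mean(axis=0)
        return cents, assign

    def retrieve(query, feats, cents, assign, top=5):
        c = np.argmin(((cents - query) ** 2).sum(-1))   # nearest cluster
        idx = np.where(assign == c)[0]                  # its members only
        order = np.argsort(((feats[idx] - query) ** 2).sum(-1))
        return idx[order[:top]]

    feats = np.random.rand(1000, 128).astype(np.float32)  # stand-in features
    cents, assign = kmeans(feats, k=10)
    print(retrieve(feats[0], feats, cents, assign))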

    A Gradient Boosting Based Method for Predicting Bus Running Time

    At present, bus companies in China rely mainly on experienced staff to estimate when vehicles will return to the depot and schedule them accordingly. This practice lacks any supporting prediction method and often leads to large errors and wrong dispatching decisions. Starting from the practical needs of bus companies, this paper proposes R-GBDT, a prediction method based on dynamic feature selection. In R-GBDT, a feature-selection component and a parameter-tuning component supply the prediction component with feature combinations and parameters that match the characteristics of each route, and a fusion component merges the results of the other components, forming a framework for predicting the final time interval. The results show that, compared with other algorithms, the proposed method greatly improves the accuracy of bus running-time prediction.

    Supported by the National Natural Science Foundation of China (No.61672441, No.61872154), the Shenzhen Basic Research Program (No.JCYJ20170818141325209), and the Natural Science Foundation of Fujian Province (No.2018J01097).
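    The abstract names R-GBDT's components (feature selection, parameter tuning, prediction, fusion) without implementation detail. Purely as an analogue, the sketch below wires together off-the-shelf scikit-learn pieces in the same spirit: model-based feature selection feeding a grid-tuned gradient-boosting regressor. The synthetic trip features and parameter grid are ours; this is not the paper's R-GBDT.

    # Feature selection + parameter tuning around a GBDT regressor (analogue).
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(0)
    X = rng.random((500, 12))                                      # per-trip features
    y = 30 + 20 * X[:, 0] + 10 * X[:, 3] + rng.normal(0, 2, 500)   # minutes

    pipe = Pipeline([
        ("select", SelectFromModel(GradientBoostingRegressor(random_state=0))),
        ("gbdt", GradientBoostingRegressor(random_state=0)),
    ])
    grid = GridSearchCV(
        pipe,
        {"gbdt__n_estimators": [100, 300], "gbdt__max_depth": [2, 3]},
        cv=3, scoring="neg_mean_absolute_error",
    )
    grid.fit(X, y)
    print(grid.best_params_, -grid.best_score_)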

    Error-Tolerant Big Data Processing

    Real-world data contains various kinds of errors, so raw data usually needs to be processed before analysis. However, traditional data processing based on exact matching often misses much valid information. To obtain high-quality analysis results and fit the big data era, this thesis studies error-tolerant big data processing. As most real-world data can be represented as a sequence or a set, the thesis uses the widely adopted sequence-based and set-based similarity functions to tolerate errors, and studies the approximate entity extraction, similarity join, and similarity search problems. The main contributions are:

    1. A unified framework that supports approximate entity extraction with both sequence-based and set-based similarity functions simultaneously. Experiments show that it improves on the state-of-the-art methods by one to two orders of magnitude.

    2. Two methods, one each for sequence and set similarity joins. For the sequence similarity join, the thesis evenly partitions sequences into segments; it is guaranteed that two sequences can be similar only if one has a substring identical to a segment of the other (see the sketch below). For the set similarity join, the thesis partitions all sets into segments based on the universe. Both partition-based methods are further extended to the large-scale processing frameworks MapReduce and Spark. The partition-based method won the string similarity join competition held by EDBT, beating the second-place entry by a factor of ten.

    3. A pivotal prefix filter technique for the sequence similarity search problem, shown to have stronger pruning power and lower filtering cost than the state-of-the-art filters.

    Comment: PhD thesis, Tsinghua University, 201
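    The even-partition guarantee in contribution 2 follows from the pigeonhole principle: split a string s into tau + 1 segments; if ed(s, t) <= tau, each edit touches at most one segment, so at least one segment survives unedited and must occur in t as a substring. The sketch below implements only this candidate filter in the spirit of the thesis's partition-based join; names and toy data are ours, and surviving pairs still need an exact edit-distance verification.

    # Candidate filter behind a partition-based similarity join (sketch).
    import itertools

    def segments(s, tau):
        # Split s into tau + 1 contiguous segments of near-equal length.
        n, k = len(s), tau + 1
        base, extra = divmod(n, k)
        segs, pos = [], 0
        for i in range(k):
            length = base + (1 if i < extra else 0)
            segs.append(s[pos:pos + length])
            pos += length
        return segs

    def candidate(s, t, tau):
        # If ed(s, t) <= tau, some segment of s must appear verbatim in t.
        return any(seg and seg in t for seg in segments(s, tau))

    strings = ["hadoop", "hadop", "spark", "sprak"]
    tau = 1
    pairs = [(a, b) for a, b in itertools.combinations(strings, 2)
             if candidate(a, b, tau)]
    print(pairs)  # candidates only; verify each with a real edit distance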

    Survey of Automatic Labeling Methods for Topic Models

    Topic models are often used to model unstructured corpora and discrete data and to extract latent topics. Because topics are generally presented as word lists, it is usually difficult for users to understand their meanings, especially when they lack knowledge of the subject area. Manually labeling topics yields more explanatory and easily understood labels, but its cost is too high to be feasible, so research on automatically labeling discovered topics offers a solution.

    This survey first elaborates and analyzes the currently most popular technique, latent Dirichlet allocation (LDA). According to the three representations of topic labels, based on phrases, summaries, and pictures, labeling methods are classified into three types. Then, centered on improving the interpretability of topics, the relevant research of recent years is sorted out, analyzed, and summarized by the type of label generated, and the applicable scenarios and usability of the different labels are discussed. Methods are further categorized by their characteristics, with the focus placed on quantitative and qualitative analysis of summary-style labels generated by lexical-based, submodular-optimization, and graph-based methods; the differences between methods in learning type, technology used, and data source are then compared. Finally, the open problems and development trends of automatic topic labeling are discussed: building on deep learning, integrating sentiment analysis, and continually expanding the applicable scenarios of topic labeling will be the directions of future development.
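    As a concrete anchor for the survey's starting point, the sketch below fits LDA with scikit-learn and uses each topic's top words as a crude phrase-style label, the simplest of the labeling families the survey compares. The three toy documents and the choice of scikit-learn are ours, not the survey's.

    # Fit LDA, then label each topic with its top words (toy illustration).
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "bus route schedule traffic city transport",
        "image pixel retrieval feature color texture",
        "topic word document corpus model inference",
    ]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

    terms = vec.get_feature_names_out()
    for k, comp in enumerate(lda.components_):
        top = [terms[i] for i in comp.argsort()[-3:][::-1]]
        print(f"topic {k} label:", " / ".join(top))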

    Explainable Recommendation: Theory and Applications

    Although personalized recommendation has been investigated for decades, the wide adoption of Latent Factor Models (LFM) has made the explainability of recommendations a critical issue for both the research community and practical recommender systems. In many practical systems, the algorithm provides a personalized item recommendation list without any persuasive, personalized explanation of why one item is recommended and another is not. Unexplainable recommendations undermine the trustworthiness of recommender systems and thus the effectiveness of recommendation engines. In this work, we investigate explainable recommendation in the aspects of data explainability, model explainability, and result explainability. The main contributions are as follows:

    1. Data explainability: we propose the Localized Matrix Factorization (LMF) framework based on Bordered Block Diagonal Form (BBDF) matrices, and further apply this technique to parallelized matrix factorization.

    2. Model explainability: we propose Explicit Factor Models (EFM) based on phrase-level sentiment analysis, as well as dynamic user preference modeling based on time-series analysis. We extract product features and user opinions on those features from large-scale textual reviews using phrase-level sentiment analysis techniques, and introduce the EFM approach for explainable model learning and recommendation.

    3. Economic explainability: we propose the Total Surplus Maximization (TSM) framework for personalized recommendation, together with its model specification for different types of online applications. Based on basic economic concepts, we define utility, cost, and surplus in the application scenario of Web services, and propose a general framework for calculating and maximizing Web total surplus.

    Comment: 169 pages, in Chinese, 3 main research chapters
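    As background for the latent factor models whose opacity motivates the thesis, here is a minimal matrix-factorization sketch trained with SGD on observed ratings. It is the plain LFM baseline, not the thesis's LMF or EFM; the toy ratings, dimensions, and hyperparameters are ours.

    # Plain latent-factor model: SGD on observed (user, item, rating) triples.
    import numpy as np

    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
    n_users, n_items, dim = 3, 2, 4
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, dim))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, dim))   # item factors

    lr, reg = 0.05, 0.02
    for _ in range(200):
        for u, i, r in ratings:
            pu = P[u].copy()                 # use the pre-update user factors
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])

    print(P @ Q.T)  # reconstructed rating matrix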