A Comprehensive Review on Blog Mining under a Cross-Disciplinary Framework

Author: Conglei Yao
Shengliang Gao

Blogs have come to play an increasingly important role in daily life as the internet evolves. Academia has seen a burst of research on blogs since 2002, not only in areas of computer science such as IR and web mining, but also in the social sciences. We present a blog-centered framework to abstract and generalize current research on blog mining, with the intention of providing a clear picture of the blog-focused research area.

CiteSeerX

Towards a Global Schema for Web Entities

Author: Conglei Yao
Sicong Shou
Xiaoming Li
Yongjian Yu

Popular entities often have thousands of instances on the Web. In this paper, we focus on the case where they are presented in a table-like format, namely appearing with their attribute names. We observe that, on the one hand, different web pages often incorporate different attributes for the same entity; on the other hand, different web pages often use different attribute names (labels) for the same attribute. It is therefore difficult to produce a global attribute schema for all the web entities of a given entity type based on their web instances, although such a schema is highly desirable for web entity instance integration and web object extraction. To this end, we propose a novel framework for automatically learning a global attribute schema for all web entities of one specific entity type. Under this framework, an iterative instance extraction procedure is first employed to extract enough web entity instances to discover a sufficient set of attribute labels. Next, based on the labels, entity instances, and related web pages, a maximum entropy-based schema discovery approach is adopted to learn the global attribute schema for the target entity type. Experimental results on the Chinese Web achieve weighted average F-scores of 0.7122 and 0.7123 on two global attribute schemas for person-type and movie-type web entities, respectively. These results show that our framework is general, efficient and effective.

CiteSeerX

Peking University at the TREC-2005 Question and Answering Track

Author: Cheng Chen
Conglei Yao
Jing He
Ping Yin
Yongjun Bao

This paper describes the architecture and implementation of the Tianwang QA system, which supports the Main task and the Document Ranking task. The system is designed to extract candidate answer snippets from different pipelines, e.g. the results of high-quality search engines, the frequently asked question (FAQ) set, and well-structured web facts. The system therefore needs to process Web documents, the FAQ corpus and the knowledge base (KB) built from structured web pages, in addition to analyzing the query, retrieving TREC documents and merging answers. The external knowledge we used, i.e. the FAQ and KB, proved effective for our final results. We classify questions with SVM approaches, construct Boolean queries, retrieve and rank passages with a span model, and extract answers using named entity techniques.

CiteSeerX

Efficient Entity Relation Discovery on Web

Author: Conglei Yao
Jing He
Nan Di
Qichen Tu
Yuan Liu

With the popularization of the Web, there are billions of pages that contain abundant information about real-world entities and their relations. Much research therefore focuses on named entity extraction and entity relation discovery for constructing social networks that reflect real society. However, some earlier entity relation discovery approaches, which extract a small group of entities in a limited community or intranet, are not scalable and may fail when applied to a large group of entities on the Web. In this paper, we employ co-occurrence to identify the relations between entities. The contributions of the paper are: 1. empirically evaluating various frequently used measures for co-occurrence and finding that Cosine outperforms the others; 2. presenting two novel efficient algorithms for discovering relations between entities and comparing them.
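
As a rough illustration of the Cosine co-occurrence measure mentioned in this abstract, the sketch below scores each entity pair as the number of pages containing both entities divided by the square root of the product of their individual page counts. The function name `cosine_cooccurrence` and the input layout (one list of entity names per page) are assumptions made for this example; the abstract does not describe the paper's actual algorithms or data structures.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_cooccurrence(pages):
    """Score entity pairs by cosine co-occurrence over a set of pages.

    `pages` is an iterable of entity-name lists, one list per page
    (a hypothetical input format chosen for this sketch).
    """
    entity_count = Counter()  # number of pages mentioning each entity
    pair_count = Counter()    # number of pages mentioning both entities
    for entities in pages:
        unique = sorted(set(entities))  # count each entity once per page
        entity_count.update(unique)
        pair_count.update(combinations(unique, 2))
    return {
        (a, b): n / sqrt(entity_count[a] * entity_count[b])
        for (a, b), n in pair_count.items()
    }


if __name__ == "__main__":
    demo_pages = [["A", "B"], ["A", "B"], ["A", "C"], ["B"]]
    for pair, score in sorted(cosine_cooccurrence(demo_pages).items()):
        print(pair, round(score, 4))
```

Pairs that never share a page are simply absent from the result, which keeps the output sparse for large entity sets.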

CiteSeerX

CNDS Expert Finding System for TREC2005

Author: Bo Peng
Conglei Yao
Jing He
Zhifeng Yang

This paper describes our system developed for the Expert Finding task of the Enterprise Track at TREC2005. The system employs three methods, a traditional IR method, an email clustering method and an entry page finding method, to find experts related to a specific topic in the W3C corpus. Experiments indicate that the traditional IR method is useful for expert finding if the query is well generated, the email clustering method is helpful when the mailing list is relevant to a unique working group or committee, and the entry page finding method is valuable when the topic is the theme of a special group. We use linear synthesis as the result aggregation method to combine the results generated by the three methods. Of the 5 runs we submitted for the Expert Finding task, the best is the one generated by linear synthesis, with a MAP score of 0.2174 (Bpref of 0.4299 and p@10 of 0.3460). Keywords: Expert Finding, IR, Email Clustering, Entry Page Finding, Result Aggregation
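
The linear synthesis step mentioned in this abstract can be pictured as a weighted sum of per-method scores for each candidate expert. The sketch below is only one plausible reading: the function name `linear_synthesis`, the min-max normalization of each method's scores, and the example weights are all assumptions for illustration, since the abstract does not give the actual formula or weights.

```python
def linear_synthesis(runs, weights):
    """Combine per-method candidate scores with a weighted linear sum.

    `runs` maps a method name to {candidate: raw score};
    `weights` maps the same method names to their weights.
    Both layouts are hypothetical, chosen for this sketch.
    """
    combined = {}
    for method, scores in runs.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for candidate, raw in scores.items():
            # Min-max normalize so methods with different score scales
            # are comparable (an assumption, not taken from the paper).
            norm = (raw - lo) / span if span else 1.0
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] * norm
    # Highest combined score first; ties broken by candidate name.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    demo_runs = {
        "ir": {"alice": 2.0, "bob": 1.0},
        "email": {"bob": 10.0, "carol": 0.0},
    }
    demo_weights = {"ir": 0.6, "email": 0.4}
    for candidate, score in linear_synthesis(demo_runs, demo_weights):
        print(candidate, round(score, 3))
```

A candidate missing from one method's run simply receives no contribution from that method, which is one of several reasonable conventions.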

CiteSeerX