
    Baichuan 2: Open Large-scale Language Models

    Full text link
    Large language models (LLMs) have demonstrated remarkable performance on a variety of natural language tasks given just a few examples of natural language instructions, reducing the need for extensive feature engineering. However, most powerful LLMs are closed-source or limited in their capability for languages other than English. In this technical report, we present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens. Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks such as MMLU, CMMLU, GSM8K, and HumanEval. Furthermore, Baichuan 2 excels in vertical domains such as medicine and law. We will release all pre-training model checkpoints to help the research community better understand the training dynamics of Baichuan 2. Comment: Baichuan 2 technical report. GitHub: https://github.com/baichuan-inc/Baichuan
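    As a rough illustration of using such released checkpoints, the sketch below loads a Baichuan 2 model for inference with the Hugging Face transformers library. The hub model ID is an assumption based on the report's GitHub organization and is not stated in the abstract itself.

```python
# Minimal sketch: loading an open Baichuan 2 checkpoint for inference.
# The model ID "baichuan-inc/Baichuan2-7B-Base" is an assumption, not
# confirmed by the abstract; Baichuan models require trust_remote_code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "baichuan-inc/Baichuan2-7B-Base"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("The history of large language models begins",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```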

    R-SpamRank: A Spam Detection Algorithm Based on Link Analysis

    No full text
    Spam web pages aim to achieve higher-than-deserved rankings through various techniques. While human experts can easily identify spam web pages, manually evaluating a large number of pages remains time-consuming and costly. To assist manual evaluation, we propose an algorithm that assigns spam values to web pages and semi-automatically selects potential spam pages. We first manually select a small set of spam pages as seeds. Then, based on the link structure of the web, the initial R-SpamRank values assigned to the seed pages propagate through links and distribute over the whole page set. After sorting the pages by their R-SpamRank values, the pages with high values are selected. Our experiments and analyses show that the algorithm is highly successful in identifying spam pages, achieving a precision of 99.1% among the top 10,000 pages with the highest R-SpamRank values
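    As a rough sketch of this kind of seed-based propagation, consider the iteration below over a toy link graph. The damping factor and the direction of propagation (a page inherits spam value from the pages it links to) are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of seed-based spam-value propagation over a web link graph,
# in the spirit of R-SpamRank. Damping and normalization are assumptions.

def spam_rank(out_links, seeds, damping=0.85, iters=50):
    """out_links: {page: [pages it links to]}; seeds: known spam pages."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    seed_score = {p: (1.0 if p in seeds else 0.0) for p in pages}
    score = dict(seed_score)
    # in-link counts, used to split a page's spam value among its linkers
    in_deg = {p: 0 for p in pages}
    for p, qs in out_links.items():
        for q in qs:
            in_deg[q] += 1
    for _ in range(iters):
        new = {}
        for p in pages:
            # a page inherits spam value from the pages it links to
            inherited = sum(score[q] / max(in_deg[q], 1)
                            for q in out_links.get(p, []))
            new[p] = (1 - damping) * seed_score[p] + damping * inherited
        score = new
    return score

# Usage: sort pages by descending spam value and inspect the top ones.
graph = {"a": ["spam1"], "b": ["spam1", "c"], "c": [], "spam1": []}
scores = spam_rank(graph, seeds={"spam1"})
print(sorted(scores, key=scores.get, reverse=True))
```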

    User behavior oriented web spam detection

    No full text
    Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific, known types of Web spam and are ineffective and inefficient against newly appearing spam. By analyzing user behavior in Web access logs, we propose a spam page detection algorithm based on Bayes learning. Preliminary experiments on Web access data collected by a commercial Web site (containing over 2.74 billion user clicks over 2 months) show the effectiveness of the proposed detection framework and algorithm
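    A minimal sketch of the Bayes-learning step might look like the following. The user-behavior features used here (share of search-engine-referred visits, mean dwell time, share of single-click sessions) are illustrative assumptions, not the paper's actual feature set.

```python
# Sketch: a Bayes classifier over per-page user-behavior features derived
# from access logs. Feature choices and values are illustrative only.
from sklearn.naive_bayes import GaussianNB

# one row per page: [se_referred_ratio, mean_dwell_seconds, single_click_ratio]
X_train = [
    [0.95, 3.0, 0.90],   # spam-like: search-driven traffic, quick exits
    [0.30, 45.0, 0.20],  # normal page: mixed traffic, longer dwell
    [0.90, 5.0, 0.85],
    [0.25, 60.0, 0.15],
]
y_train = [1, 0, 1, 0]   # 1 = spam, 0 = normal

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict_proba([[0.88, 4.0, 0.80]]))  # [P(normal), P(spam)]
```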

    Automatic online news issue construction in web environment

    No full text
    In many cases, rather than running a keyword search, people want to see what is going on through the Internet. This calls for integrated, comprehensive information on news topics, which we call news issues, including the background, history, current progress, differing opinions, discussions, etc. Traditionally, news issues are generated manually by website editors. This is time-consuming work, so real-time updates are difficult to provide. In this paper, a three-step automatic online algorithm for news issue construction is proposed. The first step is a topic detection process, in which newly appearing stories are clustered into new topic candidates. The second step is a topic tracking process, in which those candidates are compared with previous topics and either merged into old topics or promoted to new ones. In the final step, news issues are constructed by combining related topics and are updated by inserting new topics. An automatic online news issue construction process under practical Web conditions was simulated for the news issue construction experiments. The F-measure of the best results is either above (topic detection) or close to (topic detection and tracking) 90%. Four news issue construction results are successfully generated at different time granularities: one meets needs like “what’s new”, and the other three answer questions like “what’s hot” or “what’s going on”. With the proposed algorithm, news issues can be constructed effectively and automatically with real-time updates, freeing much human effort from tedious manual work
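    A compact sketch of the three-step pipeline follows, assuming TF-IDF vectors, cosine similarity, and single-pass clustering; these are illustrative choices, since the abstract does not specify the paper's exact models or thresholds.

```python
# Sketch: detection -> tracking -> issue construction over toy stories.
# Thresholds and the single-pass clustering scheme are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

DETECT_T, TRACK_T, ISSUE_T = 0.4, 0.3, 0.2  # assumed similarity thresholds

def cluster(vectors, threshold):
    """Single-pass clustering: attach each vector to the closest existing
    centroid if similar enough, otherwise start a new cluster."""
    clusters = []  # each entry: [centroid, member_indices]
    for i, v in enumerate(vectors):
        sims = [cosine_similarity([v], [c])[0, 0] for c, _ in clusters]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            c, members = clusters[j]
            members.append(i)  # incremental centroid update below
            clusters[j] = [(c * (len(members) - 1) + v) / len(members), members]
        else:
            clusters.append([v, [i]])
    return clusters

# Step 1, topic detection: cluster newly arrived stories into candidates.
stories = ["quake hits coastal city", "earthquake damages coastal city",
           "home team wins cup final"]
vecs = TfidfVectorizer().fit_transform(stories).toarray()
candidates = cluster(vecs, DETECT_T)

# Step 2, topic tracking: merge each candidate into a similar old topic,
# or promote it to a new topic.
old_topics = []  # [centroid, story_indices] carried over from earlier batches
for c, members in candidates:
    sims = [cosine_similarity([c], [t])[0, 0] for t, _ in old_topics]
    if sims and max(sims) >= TRACK_T:
        old_topics[int(np.argmax(sims))][1].extend(members)
    else:
        old_topics.append([c, list(members)])

# Step 3, issue construction: combine related topics into news issues.
issues = cluster([t for t, _ in old_topics], ISSUE_T)
print(len(candidates), "candidates,", len(old_topics), "topics,",
      len(issues), "issues")
```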

    Data Cleansing for Web Information Retrieval using Query Independent Features

    No full text
    We report on a study undertaken to better understand which kinds of Web pages are the most useful for web search engine users, by exploiting query-independent features of retrieval target pages. To our knowledge, there has been little research on query-independent web page cleansing for web information retrieval. Based on more than 30 million web pages obtained both from TREC and from SOGOU (www.sogou.com), a widely used Chinese search engine, we analyze the differences between retrieval target pages and ordinary ones. We also propose a learning-based data cleansing algorithm for discarding Web pages that are unlikely to be useful for user requests. The results show that retrieval target pages can be separated from low-quality pages using query-independent features and cleansing algorithms. Our algorithm succeeds in discarding 95% of web pages with less than 8% loss in retrieval target pages. This makes it possible for web IR tools to meet over 92% of users’ needs with only 5% of the pages on the Web.
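    As a rough sketch of such learning-based cleansing, the following trains a small classifier on query-independent features. The features shown (document length, in-link count, URL depth) and the decision-tree learner are assumptions for illustration; the paper's exact feature set and learner are not given in the abstract.

```python
# Sketch: learning-based page cleansing with query-independent features.
# Feature choices, values, and the learner are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# one row per page: [doc_length_tokens, in_link_count, url_depth]
X_train = [
    [1200, 150, 1],  # likely retrieval target: substantial, well linked
    [40, 0, 6],      # likely removable: thin content, no in-links, deep URL
    [800, 60, 2],
    [25, 1, 5],
]
y_train = [1, 0, 1, 0]  # 1 = keep (potential retrieval target), 0 = discard

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.predict([[500, 20, 2]]))  # decide keep/discard for a new page
```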