Re-evaluating Retrosynthesis Algorithms with Syntheseus
The planning of how to synthesize molecules, also known as retrosynthesis,
has been a growing focus of the machine learning and chemistry communities in
recent years. Despite the appearance of steady progress, we argue that
imperfect benchmarks and inconsistent comparisons mask systematic shortcomings
of existing techniques. To remedy this, we present a benchmarking library
called syntheseus which promotes best practice by default, enabling consistent
meaningful evaluation of single-step and multi-step retrosynthesis algorithms.
We use syntheseus to re-evaluate a number of previous retrosynthesis
algorithms, and find that the ranking of state-of-the-art models changes when
evaluated carefully. We end with guidance for future work in this area.
Automatic Threshold Selections by exploration and exploitation of optimization algorithm in Record Deduplication
A deduplication process uses a similarity function and a threshold to decide whether two entries are duplicates. Setting this threshold is critical for accuracy and typically relies on human intervention. Swarm intelligence algorithms such as PSO and ABC have been used to detect the threshold automatically and find duplicate records. Although these algorithms perform well, their solution search equation, which generates new candidate solutions from the information of previous solutions, remains insufficient. The proposed work addresses two problems: it first finds the optimal search equation using a Genetic Algorithm (GA), and then adopts a modified Artificial Bee Colony (ABC) to obtain the optimal threshold, detecting duplicate records more accurately while reducing human intervention. The CORA dataset is used to analyze the proposed algorithm.
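As a rough illustration of the threshold-based setup this work optimizes, the sketch below scores record pairs with a string similarity function and picks the threshold that maximizes F1 on a small labeled sample via a naive random local search. The paper's GA-derived search equation and modified ABC are not reproduced here, and all field names and data are hypothetical.

```python
# Illustrative sketch only: threshold-based duplicate detection with a naive
# automatic threshold search standing in for the swarm-based optimizer.
from difflib import SequenceMatcher
import random

def similarity(rec_a: dict, rec_b: dict) -> float:
    """Average string similarity over the fields shared by two records."""
    fields = rec_a.keys() & rec_b.keys()
    scores = [SequenceMatcher(None, str(rec_a[f]), str(rec_b[f])).ratio() for f in fields]
    return sum(scores) / len(scores) if scores else 0.0

def f1_at_threshold(pairs, labels, threshold):
    """F1 of 'duplicate' predictions made by thresholding the similarity."""
    preds = [similarity(a, b) >= threshold for a, b in pairs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def search_threshold(pairs, labels, iters=200, seed=0):
    """Random local search for the threshold maximizing F1 on labeled pairs."""
    rng = random.Random(seed)
    best_t, best_f1 = 0.5, f1_at_threshold(pairs, labels, 0.5)
    for _ in range(iters):
        candidate = min(1.0, max(0.0, best_t + rng.gauss(0, 0.1)))
        f1 = f1_at_threshold(pairs, labels, candidate)
        if f1 > best_f1:
            best_t, best_f1 = candidate, f1
    return best_t, best_f1
```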
Towards Data Optimization in Storages and Networks
Thesis (Ph.D.), School of Computing and Engineering, University of Missouri--Kansas City, 2015. Dissertation advisors: Sejun Song and Baek-Young Choi.
We are encountering an explosion of data volume, as a study estimates that data
will amount to 40 zettabytes by the end of 2020. This data explosion poses a significant
burden not only on data storage space but also on access latency, manageability, and processing
and network bandwidth. However, large portions of the huge data volume contain
massive redundancies that are created by users, applications, systems, and communication
models. Deduplication is a technique to reduce data volume by removing redundancies.
Reliability can even be improved when data is replicated after deduplication.
Many deduplication studies such as storage data deduplication and network redundancy
elimination have been proposed to reduce storage consumption and network
bandwidth consumption. However, existing solutions are not efficient enough to optimize
the data delivery path from clients to servers through the network. Hence we propose a holistic
deduplication framework to optimize data along the entire path. Our deduplication framework
consists of three components including data sources or clients, networks, and servers. The
client component removes local redundancies in clients, the network component removes
redundant transfers coming from different clients, and the server component removes redundancies
coming from different networks.
We designed and developed components for the proposed deduplication framework.
For the server component, we developed the Hybrid Email Deduplication System
that balances space savings against overhead for email systems. For the client
component, we developed the Structure-Aware File and Email Deduplication for Cloud-based
Storage Systems, which is very fast and achieves good space savings by using
structure-based granularity. For the network component, we developed a system called
Software-defined Deduplication as a Network and Storage Service, which performs in-network
deduplication and chains storage data deduplication and network redundancy elimination
functions using Software-Defined Networking to achieve both storage space and network
bandwidth savings with low processing time and memory overhead. We also discuss mobile
deduplication for image and video files in mobile devices. Through system implementations
and experiments, we show that the proposed framework effectively and efficiently
optimizes data volume in a holistic manner encompassing the entire data path of clients,
networks, and storage servers.
Contents: Introduction -- Deduplication technology -- Existing deduplication approaches -- HEDS: Hybrid Email Deduplication System -- SAFE: Structure-aware File and Email Deduplication for cloud-based storage systems -- SoftDance: Software-defined Deduplication as a Network and Storage Service -- Mobile deduplication -- Conclusion
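The mechanism shared by the storage-side components above is hash-indexed chunk deduplication: data is split into chunks, fingerprinted, and only previously unseen chunks are stored or transferred. The sketch below illustrates that mechanism with fixed-size chunks; it is a generic illustration, not the thesis's HEDS, SAFE, or SoftDance designs, and all names are illustrative.

```python
# Minimal sketch of hash-indexed chunk deduplication with fixed-size chunks.
import hashlib

CHUNK_SIZE = 4096  # bytes

class ChunkStore:
    def __init__(self):
        self.chunks = {}        # fingerprint -> chunk bytes (stored once)
        self.bytes_in = 0
        self.bytes_stored = 0

    def put(self, data: bytes) -> list[str]:
        """Store a blob, returning its recipe (list of chunk fingerprints)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.bytes_in += len(chunk)
            if fp not in self.chunks:      # only new chunks consume space
                self.chunks[fp] = chunk
                self.bytes_stored += len(chunk)
            recipe.append(fp)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Reassemble a blob from its chunk recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)

store = ChunkStore()
r1 = store.put(b"header" + b"A" * 8000)
r2 = store.put(b"header" + b"A" * 8000)   # duplicate upload: no new chunks stored
print(store.bytes_stored, "of", store.bytes_in, "bytes kept")
```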
Generative Deduplication For Social Media Data Selection
Social media data is plagued by the redundancy problem caused by its noisy
nature, leading to increased training time and model bias. To address this
issue, we propose a novel approach called generative deduplication. It aims to
remove duplicate text from noisy social media data and mitigate model bias. By
doing so, it can improve social media language understanding performance and
save training time. Extensive experiments demonstrate that the proposed
generative deduplication can effectively reduce training samples while
improving performance. This evidence suggests the effectiveness of generative
deduplication and its importance in social media language understanding.
Comment: Work in Progress
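For contrast with the generative approach, the snippet below shows a conventional near-duplicate filter over social media posts using TF-IDF cosine similarity; it is not the paper's method, and the threshold and example posts are purely illustrative.

```python
# Baseline near-duplicate filter (not the paper's generative method): keep a
# post only if it is not too similar to an already-kept post.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(posts, threshold=0.9):
    vectors = TfidfVectorizer().fit_transform(posts)
    kept = []
    for i in range(len(posts)):
        sims = cosine_similarity(vectors[i], vectors[kept]) if kept else [[0.0]]
        if max(sims[0], default=0.0) < threshold:
            kept.append(i)
    return [posts[i] for i in kept]

posts = ["great game tonight!!", "great game tonight !", "traffic is terrible downtown"]
print(deduplicate(posts))  # the second, near-identical post is dropped
```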
Entity Matching for Digital World: A Modern Approach using Artificial Intelligence and Machine Learning
Entity matching is the field of research solving the problem of identifying similar records which refer to the same real-world entity. In today's digital world, business organizations deal with large amounts of data about customers, vendors, manufacturers, etc. Entities are spread across various data sources, and failure to correlate two records as one entity can lead to confusion: relationships and patterns would be missed, and aggregations and calculations won't make any sense. It is a significant data integration effort that often arises when data originate from different sources. In such scenarios, we understand the situation by linking records and then tracking entities, from a person to a product, etc. There is appreciable value in integrating the data silos across various industries.
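A common ML formulation of entity matching scores candidate record pairs with per-field similarity features and a learned classifier, typically after blocking on a cheap key to limit the number of pairs. The sketch below illustrates that generic pipeline; the records, fields, and labels are hypothetical, and it does not reproduce any specific system from this work.

```python
# Hedged sketch of ML-based entity matching: per-field similarity features
# classified with logistic regression. All records and labels are made up.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(r1: dict, r2: dict) -> list[float]:
    # One similarity score per compared field (blocking on a cheap key such as
    # a name prefix would normally precede this step).
    return [sim(r1["name"], r2["name"]), sim(r1["city"], r2["city"])]

# Tiny labeled sample: 1 = same real-world entity, 0 = different.
pairs = [
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "ACME Corporation", "city": "Boston"}, 1),
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "Apex Ltd", "city": "Austin"}, 0),
    ({"name": "Globex Inc", "city": "Berlin"}, {"name": "Globex", "city": "Berlin"}, 1),
    ({"name": "Globex Inc", "city": "Berlin"}, {"name": "Initech", "city": "Dallas"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)

candidate = ({"name": "Acme Co", "city": "Boston"}, {"name": "Acme Corp", "city": "Boston"})
print(clf.predict_proba([features(*candidate)])[0][1])  # predicted match probability
```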
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
The rapid development of large language models has revolutionized code
intelligence in software development. However, the predominance of
closed-source models has restricted extensive research and development. To
address this, we introduce the DeepSeek-Coder series, a range of open-source
code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion
tokens. These models are pre-trained on a high-quality project-level code
corpus and employ a fill-in-the-blank task with a 16K window to enhance code
generation and infilling. Our extensive evaluations demonstrate that
DeepSeek-Coder not only achieves state-of-the-art performance among open-source
code models across multiple benchmarks but also surpasses existing
closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models
are under a permissive license that allows for both research and unrestricted
commercial use.
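As a minimal usage sketch, an open checkpoint from the series can be loaded through the standard Hugging Face transformers API for left-to-right code completion. The model id below is assumed to be the 1.3B base checkpoint, and the prompt is illustrative.

```python
# Minimal completion sketch with the standard transformers API; the model id
# is an assumption (swap in another size from the series if needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def quicksort(items):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```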