Re-evaluating Retrosynthesis Algorithms with Syntheseus
The planning of how to synthesize molecules, also known as retrosynthesis,
has been a growing focus of the machine learning and chemistry communities in
recent years. Despite the appearance of steady progress, we argue that
imperfect benchmarks and inconsistent comparisons mask systematic shortcomings
of existing techniques. To remedy this, we present a benchmarking library
called syntheseus which promotes best practice by default, enabling consistent
meaningful evaluation of single-step and multi-step retrosynthesis algorithms.
We use syntheseus to re-evaluate a number of previous retrosynthesis
algorithms, and find that the ranking of state-of-the-art models changes when
evaluated carefully. We end with guidance for future work in this area.
Automatic Threshold Selections by exploration and exploitation of optimization algorithm in Record Deduplication
A deduplication process uses a similarity function and a threshold to decide whether two entries are duplicates. Setting this threshold is critical for accuracy and typically relies on human intervention. Swarm intelligence algorithms such as PSO and ABC have been used to detect the threshold automatically and find duplicate records. Although these algorithms perform well, their solution search equation, which generates new candidate solutions from the information of previous solutions, remains insufficient. The proposed work addresses two problems: it first finds the optimal search equation using a Genetic Algorithm (GA), and then adopts a modified Artificial Bee Colony (ABC) to obtain the optimal threshold, detecting duplicate records more accurately while reducing human intervention. The CORA dataset is used to analyze the proposed algorithm.
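As a rough illustration of the threshold-based setup this work optimizes, the sketch below scores record pairs with a string similarity function and picks the threshold that maximizes F1 on a small labeled sample via a naive random local search. The paper's GA-derived search equation and modified ABC are not reproduced here, and all field names and data are hypothetical.

```python
# Illustrative sketch only: threshold-based duplicate detection with a naive
# automatic threshold search standing in for the swarm-based optimizer.
from difflib import SequenceMatcher
import random

def similarity(rec_a: dict, rec_b: dict) -> float:
    """Average string similarity over the fields shared by two records."""
    fields = rec_a.keys() & rec_b.keys()
    scores = [SequenceMatcher(None, str(rec_a[f]), str(rec_b[f])).ratio() for f in fields]
    return sum(scores) / len(scores) if scores else 0.0

def f1_at_threshold(pairs, labels, threshold):
    """F1 of 'duplicate' predictions made by thresholding the similarity."""
    preds = [similarity(a, b) >= threshold for a, b in pairs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def search_threshold(pairs, labels, iters=200, seed=0):
    """Random local search for the threshold maximizing F1 on labeled pairs."""
    rng = random.Random(seed)
    best_t, best_f1 = 0.5, f1_at_threshold(pairs, labels, 0.5)
    for _ in range(iters):
        candidate = min(1.0, max(0.0, best_t + rng.gauss(0, 0.1)))
        f1 = f1_at_threshold(pairs, labels, candidate)
        if f1 > best_f1:
            best_t, best_f1 = candidate, f1
    return best_t, best_f1
```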
Towards Data Optimization in Storages and Networks
Thesis (Ph.D.), School of Computing and Engineering, University of Missouri--Kansas City, 2015. Dissertation advisors: Sejun Song and Baek-Young Choi.
We are encountering an explosion of data volume, as a study estimates that data
will amount to 40 zettabytes by the end of 2020. This data explosion poses a significant
burden not only on data storage space but also on access latency, manageability, and processing
and network bandwidth. However, large portions of the huge data volume contain
massive redundancies that are created by users, applications, systems, and communication
models. Deduplication is a technique to reduce data volume by removing redundancies.
Reliability can even be improved when data is replicated after deduplication.
Many deduplication studies such as storage data deduplication and network redundancy
elimination have been proposed to reduce storage consumption and network
bandwidth consumption. However, existing solutions are not efficient enough to optimize
the data delivery path from clients to servers through the network. Hence we propose a holistic
deduplication framework to optimize data along the entire path. Our deduplication framework
consists of three components including data sources or clients, networks, and servers. The
client component removes local redundancies in clients, the network component removes
redundant transfers coming from different clients, and the server component removes redundancies
coming from different networks.
We designed and developed components for the proposed deduplication framework.
For the server component, we developed the Hybrid Email Deduplication System
that balances space savings against overhead for email systems. For the client
component, we developed the Structure-Aware File and Email Deduplication for Cloud-based
Storage Systems, which is very fast and achieves good space savings by using
structure-based granularity. For the network component, we developed a system called
Software-defined Deduplication as a Network and Storage Service, which performs in-network
deduplication and chains storage data deduplication and network redundancy elimination
functions using Software-Defined Networking to achieve both storage space and network
bandwidth savings with low processing time and memory overhead. We also discuss mobile
deduplication for image and video files in mobile devices. Through system implementations
and experiments, we show that the proposed framework effectively and efficiently
optimizes data volume in a holistic manner encompassing the entire data path of clients,
networks, and storage servers.
Contents: Introduction -- Deduplication technology -- Existing deduplication approaches -- HEDS: Hybrid Email Deduplication System -- SAFE: Structure-aware File and Email Deduplication for cloud-based storage systems -- SoftDance: Software-defined Deduplication as a Network and Storage Service -- Mobile deduplication -- Conclusion
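The mechanism shared by the storage-side components above is hash-indexed chunk deduplication: data is split into chunks, fingerprinted, and only previously unseen chunks are stored or transferred. The sketch below illustrates that mechanism with fixed-size chunks; it is a generic illustration, not the thesis's HEDS, SAFE, or SoftDance designs, and all names are illustrative.

```python
# Minimal sketch of hash-indexed chunk deduplication with fixed-size chunks.
import hashlib

CHUNK_SIZE = 4096  # bytes

class ChunkStore:
    def __init__(self):
        self.chunks = {}        # fingerprint -> chunk bytes (stored once)
        self.bytes_in = 0
        self.bytes_stored = 0

    def put(self, data: bytes) -> list[str]:
        """Store a blob, returning its recipe (list of chunk fingerprints)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.bytes_in += len(chunk)
            if fp not in self.chunks:      # only new chunks consume space
                self.chunks[fp] = chunk
                self.bytes_stored += len(chunk)
            recipe.append(fp)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Reassemble a blob from its chunk recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)

store = ChunkStore()
r1 = store.put(b"header" + b"A" * 8000)
r2 = store.put(b"header" + b"A" * 8000)   # duplicate upload: no new chunks stored
print(store.bytes_stored, "of", store.bytes_in, "bytes kept")
```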
Generative Deduplication For Social Media Data Selection
Social media data is plagued by the redundancy problem caused by its noisy
nature, leading to increased training time and model bias. To address this
issue, we propose a novel approach called generative deduplication. It aims to
remove duplicate text from noisy social media data and mitigate model bias. By
doing so, it can improve social media language understanding performance and
save training time. Extensive experiments demonstrate that the proposed
generative deduplication can effectively reduce training samples while
improving performance. This evidence suggests the effectiveness of generative
deduplication and its importance in social media language understanding.
Comment: Work in Progress
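For contrast with the generative approach, the snippet below shows a conventional near-duplicate filter over social media posts using TF-IDF cosine similarity; it is not the paper's method, and the threshold and example posts are purely illustrative.

```python
# Baseline near-duplicate filter (not the paper's generative method): keep a
# post only if it is not too similar to an already-kept post.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(posts, threshold=0.9):
    vectors = TfidfVectorizer().fit_transform(posts)
    kept = []
    for i in range(len(posts)):
        sims = cosine_similarity(vectors[i], vectors[kept]) if kept else [[0.0]]
        if max(sims[0], default=0.0) < threshold:
            kept.append(i)
    return [posts[i] for i in kept]

posts = ["great game tonight!!", "great game tonight !", "traffic is terrible downtown"]
print(deduplicate(posts))  # the second, near-identical post is dropped
```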
Entity Matching for Digital World: A Modern Approach using Artificial Intelligence and Machine Learning
Entity matching is the field of research solving the problem of identifying similar records which refer to the same real-world entity. In today's digital world, business organizations deal with large amounts of data about customers, vendors, manufacturers, etc. Entities are spread across various data sources, and failure to correlate two records as one entity can lead to confusion: relationships and patterns would be missed, and aggregations and calculations won't make any sense. It is a significant data integration effort that often arises when data originate from different sources. In such scenarios, we understand the situation by linking records and then tracking entities, from a person to a product, etc. There is appreciable value in integrating the data silos across various industries.
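A common ML formulation of entity matching scores candidate record pairs with per-field similarity features and a learned classifier, typically after blocking on a cheap key to limit the number of pairs. The sketch below illustrates that generic pipeline; the records, fields, and labels are hypothetical, and it does not reproduce any specific system from this work.

```python
# Hedged sketch of ML-based entity matching: per-field similarity features
# classified with logistic regression. All records and labels are made up.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(r1: dict, r2: dict) -> list[float]:
    # One similarity score per compared field (blocking on a cheap key such as
    # a name prefix would normally precede this step).
    return [sim(r1["name"], r2["name"]), sim(r1["city"], r2["city"])]

# Tiny labeled sample: 1 = same real-world entity, 0 = different.
pairs = [
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "ACME Corporation", "city": "Boston"}, 1),
    ({"name": "Acme Corp", "city": "Boston"}, {"name": "Apex Ltd", "city": "Austin"}, 0),
    ({"name": "Globex Inc", "city": "Berlin"}, {"name": "Globex", "city": "Berlin"}, 1),
    ({"name": "Globex Inc", "city": "Berlin"}, {"name": "Initech", "city": "Dallas"}, 0),
]
X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)

candidate = ({"name": "Acme Co", "city": "Boston"}, {"name": "Acme Corp", "city": "Boston"})
print(clf.predict_proba([features(*candidate)])[0][1])  # predicted match probability
```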
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
The rapid development of large language models has revolutionized code
intelligence in software development. However, the predominance of
closed-source models has restricted extensive research and development. To
address this, we introduce the DeepSeek-Coder series, a range of open-source
code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion
tokens. These models are pre-trained on a high-quality project-level code
corpus and employ a fill-in-the-blank task with a 16K window to enhance code
generation and infilling. Our extensive evaluations demonstrate that
DeepSeek-Coder not only achieves state-of-the-art performance among open-source
code models across multiple benchmarks but also surpasses existing
closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models
are under a permissive license that allows for both research and unrestricted
commercial use.
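As a minimal usage sketch, an open checkpoint from the series can be loaded through the standard Hugging Face transformers API for left-to-right code completion. The model id below is assumed to be the 1.3B base checkpoint, and the prompt is illustrative.

```python
# Minimal completion sketch with the standard transformers API; the model id
# is an assumption (swap in another size from the series if needed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def quicksort(items):\n    "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```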