isclouder.com - 香港服务器

最新动态:When DuckDB Embraces MySQL in the AI Era

Written by

in

行业动态更新:When DuckDB Embraces MySQL in the AI Era

By Simian and Yuanqi Following the previous article, when MySQL meets the DuckDB columnar storage engine in the AI era RDS for MySQL embeds DuckDB into the MySQL kernel. This uses the performance advantages of DuckDB in analytical query scenarios and resolves the shortcomings of MySQL in analytical processing (AP) capabilities. For product positioning, we defined a policy of “MySQL first, DuckDB second” from the start. We use DuckDB to enhance the analytical capabilities of MySQL. We focus on the user ecosystem of MySQL. We remain as compatible as possible with existing customer habits. These include syntax, protocols, and third-party tools. For technical implementation, we integrate DuckDB into MySQL as a built-in storage engine. This gives it the capabilities of the MySQL Server layer, such as data dictionary management and binary log replication. It can also independently handle all operation requests from the MySQL Server layer, similar to native storage engines such as InnoDB. We import the DuckDB computing engine into the query SQL execution path. This ensures the query efficiency of DuckDB nodes and high compatibility with MySQL syntax. Based on the basic capabilities of the DuckDB engine, we launched DuckDB read-only instances. A DuckDB node acts as a data replica. It replays binary logs from the primary database. It can independently provide read request services. Combined with the automatic request distribution among row store and column store nodes at the agent layer, the MySQL primary node and the DuckDB node provide a complete hybrid transactional and analytical processing (HTAP) solution. In addition to DuckDB-based analytical read-only instances, many customers provided feedback that they want to deploy DuckDB nodes as independent database instances. This lets users directly ingest data using SQL. It can also serve as an aggregation instance for multiple source instances. Therefore, we launched the DuckDB-based analytical primary instance. Two product forms of MySQL DuckDB The primary instance form greatly expands the scenarios of DuckDB. In theory, you can replace MySQL with DuckDB primary instances in all scenarios. The primary instance form also has higher requirements for the MySQL DuckDB engine. These include data security guarantees, data ingestion performance, and storage management for larger storage spaces. The following sections introduce the core capabilities enhanced by the DuckDB engine in the primary instance form. These focus on four aspects: binary logging adaptation, high availability and data security, data ingestion capabilities, and compatibility. The analytical performance of DuckDB is fully verified in multiple benchmark tests, such as TPC-H and ClickBench. DuckDB primary instances have the same query capabilities as read-only instances. They also have the same compression efficiency. This topic does not describe these basic capabilities. When DuckDB-based read-only instances were released, we turned off the binary logging toggle for read-only instances by default. First, read-only instances typically do not need to generate additional binary logs in user scenarios. Second, the DuckDB engine was not fully adapted to the binary logging enabled scenarios at that time. However, binary logging is an essential part of the DuckDB primary instance form. During the development of primary instances, we fully adapted binary logging. We provide the following technical capabilities: High availability (HA) means that a database can continuously provide services. The DuckDB-based analytical primary instance uses a primary-secondary cluster architecture. The primary database performs one-way data synchronization to secondary databases through Binlog. If the primary database becomes abnormal, the system automatically triggers an HA switch to ensure continuous service availability. DuckDB primary instance deployment diagram Data security specifically refers to the Durability attribute in the atomicity, consistency, isolation, and durability (ACID) attributes of a database. This means that after a transaction commit, the Result requires permanent retention. The Result is not lost even if the system crashes. For a DuckDB-based analytical read-only instance, the persistence requirements can be appropriately lowered because it acts as a data replica of the primary instance. Even if some data lacks persistence when the instance crashes, you can recover the data. Simply pull the Binlog from the primary instance and replay it after a restart. Therefore, the read-only instance uses an idempotent Binlog replay mechanism. This ensures data consistency between the primary database and DuckDB-based read-only nodes. However, persistence is a rigid requirement for the DuckDB-based analytical primary instance. In this state, data is stored only in a columnar format. You cannot rely on the primary node to rebuild data. Therefore, the DuckDB engine must guarantee data persistence in any failure scenario. The data security of the DuckDB primary instance is guaranteed mainly in the following ways: 1. Backup and restore capabilities: The backup and restore capabilities of the DuckDB-based analytical primary instance are identical to those of a standard MySQL instance. The primary instance supports regular full backups and real-time incremental backups based on Binlog. It also provides a point-in-time recovery (PITR) feature accurate to the second. You can use this feature to recover the instance to any historical Time point. 2. Kernel persistence guarantee: The DuckDB engine ensures the persistence of transactional data through the write-ahead logging (WAL) mechanism. We also modified the transaction commit path of the DuckDB engine at the MySQL layer. This ensures the atomicity of Binlog writes and DuckDB local transaction commits. During the instance crash recovery phase, the system can correctly replay Submitted transactions. This ensures the consistency among DuckDB engine data, binlog content, and binlog GTID offsets. This provides data security guarantees that align with MySQL standards. In summary, the DuckDB-based analytical primary instance is fully comparable to MySQL high availability instances in terms of high availability and data security. It effectively guarantees the continuous availability of customer core systems and the security of key data assets. Unlike a read-only instance that only supports primary database Binlog log replay, the DuckDB-based analytical primary instance supports multiple data Ingestion methods. These methods include Binlog log replay, direct SQL writing, and Data Transmission Service (DTS) full and incremental data synchronization. The Binlog replay method is identical to that of a read-only instance. The log replay performance can reach about 300,000 rows per second. The DuckDB-based analytical primary instance also supports more flexible replication topologies, such as multi-source replication and cascading replication. In scenarios involving direct business writes or data synchronization through DTS, the DuckDB-based analytical primary instance maintains high ingestion performance. This relies mainly on the following kernel optimization capabilities: We conducted benchmark testing on the write performance of the DuckDB-based analytical primary instance as a synchronization Target based on the Sysbench dataset. • When full data is imported using LOAD DATA, the peak throughput can reach 2,000,000 rows per second. • The peak throughput of DTS incremental synchronization can reach 300,000 rows per second. Note that the actual synchronization speed depends on multiple factors. These include the query efficiency of the source instance, table schema complexity, and Data Transmission Service (DTS) instance type. The preceding values are the maximum end-to-end replication performance in an ideal staging environment. RDS DuckDB-based analytical instances are deeply integrated with Alibaba Cloud DTS. When you create a DTS job, simply set the target database type to DuckDB. This automatically delivers high ingestion performance. No complex parameter settings or performance tuning are required. We continuously optimize compatibility. Recently, we made the following enhancements. To improve SQL syntax compatibility, we reverse-parse the abstract syntax tree obtained in the MySQL Prepare phase. This achieves SQL regularization. It resolves over 90% of incompatible SQL syntax and significantly improves compatibility. For functions, we use rapid iterations to support nearly 20 frequently used Json Functions and regular functions. These cover almost all common functions for online users. We are also gradually adding support for less frequently used functions. Ultimately, our functions will be fully compatible with MySQL. Recently, we added support for character sets such as ascii and latin1 to the DuckDB engine. We also added support for generated columns. We improved the execution efficiency of COPY DDL operations. For more information about these kernel updates, see our release notes. The customer needs to perform global queries on recent business orders. This requires Real-time Sync of data from 1024 sharded tables across 32 source MySQL instances to a single convergence instance for unified queries. The convergence instance must retain data for one month. The total data volume is about 10 TB. The peak synchronization Transaction Per Second (TPS) reaches 300,000. All traffic consists of order-related transactional DML operations. These include high-frequency INSERT, UPDATE, and DELETE operations. Data convergence scenarios usually require joint analysis of multiple sharded tables. This requires data synchronization to converge data from multiple instances to a single target instance. These jobs are typically handled by AnalyticDB or distributed databases. However, we used a single DuckDB primary instance to successfully meet the customer requirements. This mainly depends on the fol

随着IDC行业的快速发展,可持续发展将成为未来竞争的关键

如果您正在寻找优质的香港GPU服务器,欢迎访问 www.isclouder.com 了解更多