CDC data integration

Change Data Capture (CDC) is a technique for real-time data ingestion based on detecting changes in a source database and delivering them to a target. In this context it is heterogeneous replication: the source and the target are different database systems. With CDC, your BI tools and reports get the most current data almost immediately, instead of displaying "yesterday's data" that is refreshed only on a schedule (often nightly).
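
As a rough illustration of the idea, the sketch below applies a stream of change events to a target table held in memory. The `ChangeEvent` shape (`op`, `key`, `row`) and the `apply_change` helper are assumptions made for this example only; real CDC tools emit richer events and write to an actual target database.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """Hypothetical, simplified change event (not any tool's real format)."""
    op: str                               # "insert", "update" or "delete"
    key: int                              # primary key of the changed row
    row: Optional[Dict[str, Any]] = None  # new row values; None for deletes


def apply_change(target: Dict[int, Dict[str, Any]], event: ChangeEvent) -> None:
    """Apply one change event to the target table (a dict keyed by primary key)."""
    if event.op in ("insert", "update"):
        if event.row is None:
            raise ValueError("insert/update events must carry row values")
        # Upsert: applying the same event twice leaves the target unchanged.
        target[event.key] = dict(event.row)
    elif event.op == "delete":
        # Deleting a missing key is a no-op, so replays are harmless.
        target.pop(event.key, None)
    else:
        raise ValueError(f"unknown operation: {event.op}")


# Changes captured from the source, in commit order.
events: List[ChangeEvent] = [
    ChangeEvent("insert", 1, {"id": 1, "status": "new"}),
    ChangeEvent("insert", 2, {"id": 2, "status": "new"}),
    ChangeEvent("update", 1, {"id": 1, "status": "paid"}),
    ChangeEvent("delete", 2),
]

target_table: Dict[int, Dict[str, Any]] = {}
for event in events:
    apply_change(target_table, event)

print(target_table)
```

Only the rows that changed travel to the target, which is why the target can stay current without a full nightly reload.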

Real-time data integration is critically important for embedded BI and customer-facing analytics: customers expect the reports, dashboards, and metrics embedded within a SaaS app to reflect their actions immediately. If a customer completes a transaction, they expect their usage report to update right away, not hours later.

How to choose CDC data ingestion technology

  • Connectors: ensure that data sources and your DW target are supported.
  • Deployment options: cloud, self-hosted or hybrid.
  • Commercial or free/open source: a simple choice, since all free tools are self-hosted and based on Debezium.
  • Data volume and budget: well-known enterprise-grade data platforms may be too expensive for SMBs, mid-size companies, and SaaS vendors. This comparison doesn't include enterprise solutions with quote-based pricing, such as Talend or IBM InfoSphere.
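
To make the data-volume criterion concrete, it can help to turn "rows changed per day" into a sustained throughput figure. The sketch below is back-of-the-envelope arithmetic only; the function name, the `peak_factor` parameter, and the example numbers are all assumptions for illustration, not measurements of any tool.

```python
SECONDS_PER_DAY = 86_400


def change_throughput_mb_per_s(
    rows_changed_per_day: int,
    avg_row_bytes: int,
    peak_factor: float = 1.0,
) -> float:
    """Estimate CDC throughput in MB/s (1 MB = 1,000,000 bytes).

    peak_factor scales the daily average up to an assumed busy-hour rate.
    """
    bytes_per_day = rows_changed_per_day * avg_row_bytes
    return bytes_per_day / SECONDS_PER_DAY / 1_000_000 * peak_factor


# Hypothetical workload: 50 million changed rows per day, about 500 bytes each.
average = change_throughput_mb_per_s(50_000_000, 500)
# Assume the busiest period runs at ten times the daily average.
peak = change_throughput_mb_per_s(50_000_000, 500, peak_factor=10.0)

print(f"average: {average:.2f} MB/s, assumed peak: {peak:.2f} MB/s")
```

Comparing the peak figure, rather than the daily average, against a tool's stated throughput gives a more cautious sizing.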

Real-time data integration tools comparison

Debezium + Kafka + Airflow + dbt (optional)
  • Price: Free/OSS.
  • Sources: SQL Server, MySQL, PostgreSQL, Oracle, MongoDB (single-node).
  • Destinations: SQL Server, MySQL, PostgreSQL, Oracle, DB2, MongoDB, ClickHouse and many others.
  • Pros: zero-cost, reliable and time-proven CDC solution.
  • Cons: this is not a ready-to-use product: multiple open-source products need to be configured and connected specifically for the real-time data integration task (complex setup and configuration; infrastructure management overhead). Inefficiency limits this solution to low/moderate CDC throughput (< 10-25 MB/s).

Airbyte
  • Price: self-hosted: free (Core). Cloud: volume-based, capacity-based and custom pricing.
  • Sources: 600+ connectors (anything).
  • Destinations: all popular DBs/DWs.
  • Pros: easy-to-deploy free self-hosted version; huge number of connectors.
  • Cons: the free (Core) version is suitable only for low throughput (its CDC is based on Debezium); batch-first design (not suitable for sub-minute syncs); known issues with data type mappings (for example, decimals in SQL Server → PostgreSQL); connectors can crash frequently. Unpredictable/high cost in the cloud version.

Supermetal
  • Price: free self-hosted trial. No pricing yet.
  • Sources: SQL Server, MySQL, PostgreSQL, Oracle.
  • Destinations: PostgreSQL, ClickHouse, Databricks, Snowflake.
  • Pros: simple single-binary tool, high efficiency. Suitable for low-latency syncs.
  • Cons: limited number of connectors, unknown pricing.

Fivetran
  • Price: cloud: free plan (500k MAR); starts from $500/mo (1M+ MAR). No on-prem version (hybrid for the enterprise plan).
  • Sources: 500+ sources (anything).
  • Destinations: 200+ destinations.
  • Pros: ease of use: the initial setup is remarkably fast and no-code.
  • Cons: high and unpredictable costs; the pricing model is based on MAR (Monthly Active Rows), which is difficult to forecast. No self-hosted version.

Azure Data Factory CDC
  • Price: Azure: pay-as-you-go.
  • Sources and destinations: cloud-native CDC for Azure sources; SQL Server, Azure SQL, SAP CDC, Azure Cosmos DB, Snowflake.
  • Pros: Azure-native CDC service.
  • Cons: preview/limited features; tables unavailable without native CDC enabled; unable to stop running resources; debug cluster delays; fails with identity columns; complex troubleshooting.

Qlik Replicate
  • Price: starts at $1000/mo. Enterprise quote-based pricing.
  • Sources: Oracle, SQL Server, MySQL, PostgreSQL, DB2, SAP HANA, MongoDB, Teradata and many others.
  • Destinations: all popular DBs, Teradata, SingleStore, SAP HANA.
  • Pros: enterprise data replication and CDC with an agentless log-based approach.
  • Cons: very expensive for small businesses (~$45k+); complex setup and configuration.
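
The MAR (Monthly Active Rows) pricing model mentioned for Fivetran is hard to forecast partly because it depends on how many distinct rows change in a month, not on how many changes occur. Here is a minimal sketch, assuming a simplified definition of MAR (each distinct primary key per table counted once per calendar month) and a hypothetical change-log format; the vendor's actual metering rules may differ.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple

# Hypothetical change-log record: (change date, table name, primary key).
Change = Tuple[date, str, int]


def monthly_active_rows(changes: Iterable[Change]) -> Dict[str, int]:
    """Count distinct (table, primary key) pairs touched in each month."""
    active: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for changed_on, table, pk in changes:
        month = changed_on.strftime("%Y-%m")
        # A set ignores repeated changes to the same row within a month.
        active[month].add((table, pk))
    return {month: len(rows) for month, rows in sorted(active.items())}


changes = [
    (date(2024, 1, 3), "orders", 1),
    (date(2024, 1, 9), "orders", 1),      # same row again: not counted twice
    (date(2024, 1, 9), "orders", 2),
    (date(2024, 1, 20), "customers", 1),  # same key, different table
    (date(2024, 2, 1), "orders", 1),      # new month: counted again
]

mar_by_month = monthly_active_rows(changes)
print(mar_by_month)
```

Running a count like this over a sample of your own change history gives a first estimate before committing to a MAR-based plan.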

Know an affordable CDC data replication tool that is not on the list? Feel free to contact us and ask us to add it.