CDC data integration
Change Data Capture (CDC) enables real-time data ingestion by detecting changes in a source database and delivering them to a target. In this context it is heterogeneous replication: the source and the target are different database systems. With CDC, your BI tools and reports get the most current data immediately, instead of displaying "yesterday's data" that is only refreshed on a schedule (often nightly).
Real-time data integration is critically important for embedded BI and customer-facing analytics: customers expect the reports, dashboards, and metrics embedded within a SaaS app to reflect their actions immediately. If a customer completes a transaction, they expect their usage report to update right away, not hours later.
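To make the "detect and deliver changes" idea concrete, here is a minimal sketch in Python. It assumes a simplified, hypothetical event shape loosely modeled on the Debezium envelope (`op`, `before`, `after`), and it uses an in-memory dictionary as a stand-in for the target table. The function names `apply_change` and `replicate` are illustrative only and do not come from any of the tools discussed here.

```python
from typing import Any, Dict, List

# Hypothetical change-event shape, loosely modeled on the Debezium envelope.
# Real events carry more metadata (source, timestamps, schema, and so on).
ChangeEvent = Dict[str, Any]
Table = Dict[int, Dict[str, Any]]


def apply_change(target: Table, event: ChangeEvent, key: str = "id") -> None:
    """Apply one insert/update/delete event to an in-memory 'target table'."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, or update: upsert the new row image
        row = event["after"]
        target[row[key]] = row
    elif op == "d":
        # delete: remove the row identified by the old row image, if present
        target.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")


def replicate(events: List[ChangeEvent]) -> Table:
    """Replay an ordered stream of change events into an empty target."""
    target: Table = {}
    for event in events:
        apply_change(target, event)
    return target


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
result = replicate(events)
print(result)  # {1: {'id': 1, 'status': 'paid'}}
```

The target only ever receives the rows that changed, in order, which is why a report built on it can reflect a completed transaction without waiting for a full nightly reload.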
How to choose CDC data ingestion technology
- Connectors: ensure that data sources and your DW target are supported.
- Deployment options: cloud, self-hosted or hybrid.
- Commercial or free/open source: this is a simple choice, because all free tools in this comparison are self-hosted and based on Debezium.
- Data volume and budget: well-known enterprise-grade data platforms may be too expensive for SMB/mid-size companies and SaaS vendors. This comparison doesn't include enterprise solutions with quote-based pricing, such as Talend or IBM InfoSphere.
Real-time data integration tools comparison
| Product | Price | Sources | Destinations | Pros | Cons |
|---|---|---|---|---|---|
| Debezium + Kafka + Airflow + dbt (optional) | Free/OSS | SQL Server, MySQL, PostgreSQL, Oracle, MongoDB (single-node) | SQL Server, MySQL, PostgreSQL, Oracle, DB2, MongoDB, ClickHouse and many others. | Zero-cost, reliable and time-proven CDC solution. | Not a ready-to-use product: multiple open-source products need to be configured and connected specifically for the real-time data integration task (complex setup and configuration; infrastructure management overhead). Inefficiency limits this solution to low/moderate CDC throughput (< 10-25 MB/s). |
| Airbyte | Self-hosted: free (Core). Cloud: volume-based, capacity-based and custom pricing. | 600+ connectors (anything). | All popular DBs/DWs. | Easy-to-deploy free self-hosted version. Huge number of connectors. | Free (Core) version is suitable only for low throughput (CDC based on Debezium); batch-first design (not suitable for sub-minute syncs); known issues with data type mappings (for example, decimals in SQL Server → PostgreSQL); connectors can crash frequently. Unpredictable/high cost in the cloud version. |
| Supermetal | Free self-hosted trial. No pricing yet. | SQL Server, MySQL, PostgreSQL, Oracle | PostgreSQL, ClickHouse, Databricks, Snowflake | Simple single-binary tool, high efficiency. Suitable for low-latency syncs. | Limited number of connectors, unknown pricing. |
| Fivetran | Cloud: free plan (500k MAR); starts from $500/mo (1M+ MAR). No on-prem version (hybrid for enterprise plan). | 500+ sources (anything) | 200+ destinations | Ease of use: the initial setup is remarkably fast and no-code. | High and unpredictable costs: the pricing model is based on MAR (Monthly Active Rows), which is difficult to forecast. No self-hosted version. |
| Azure Data Factory CDC | Azure: pay-as-you-go | Cloud-native CDC for Azure sources | SQL Server, Azure SQL, SAP CDC, Azure Cosmos DB, Snowflake | Azure-native CDC service | Preview/limited features; tables unavailable without native CDC enabled; unable to stop running resources; debug cluster delays; fails with identity columns; complex troubleshooting. |
| Qlik Replicate | Starts at $1000/mo. Enterprise quote-based pricing. | Oracle, SQL Server, MySQL, PostgreSQL, DB2, SAP HANA, MongoDB, Teradata and many others | All popular DBs, Teradata, SingleStore, SAP HANA | Enterprise data replication and CDC with an agentless log-based approach. | Very expensive for small businesses (~$45k+); complex setup and configuration. |
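
Since MAR-based pricing is called out above as difficult to forecast, a rough estimate from your own change history can help. The sketch below is only an approximation: it assumes MAR counts each distinct row (table plus primary key) once per calendar month, no matter how many times that row changes, and it uses a hypothetical change-log tuple of `(table, primary_key, timestamp)`. Check the vendor's current definition before relying on the numbers.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Set, Tuple

# Hypothetical change-log record: (table name, primary key, change timestamp)
Change = Tuple[str, int, datetime]


def monthly_active_rows(changes: Iterable[Change]) -> Dict[str, int]:
    """Count distinct (table, primary key) pairs touched in each month.

    Assumption: a row that changes many times in one month counts once.
    """
    active: Dict[str, Set[Tuple[str, int]]] = defaultdict(set)
    for table, pk, ts in changes:
        active[ts.strftime("%Y-%m")].add((table, pk))
    return {month: len(rows) for month, rows in sorted(active.items())}


changes = [
    ("orders", 1, datetime(2024, 1, 3)),
    ("orders", 1, datetime(2024, 1, 9)),    # same row again: still one active row
    ("orders", 2, datetime(2024, 1, 15)),
    ("orders", 1, datetime(2024, 2, 1)),    # new month: counted again
]
mar = monthly_active_rows(changes)
print(mar)  # {'2024-01': 2, '2024-02': 1}
```

Under this assumption, cost tracks how many distinct rows change each month, not how often they change, so a wide backfill or bulk update can move the bill far more than a busy but small set of hot rows.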