Monday, September 30, 2019

                               Streaming with Kafka Connect



Apache Kafka is a high-throughput distributed messaging system that has been adopted by hundreds of companies to manage their real-time data.
Companies use Kafka for many applications (real-time stream processing, data synchronisation, messaging, and more), but one of the most popular
is the ETL pipeline. Kafka is well suited to building data pipelines: it is reliable, scalable, and efficient.

Until recently, building pipelines with Kafka required significant effort: each system you wanted to connect to Kafka needed either custom code or
a separate tool, and each new tool came with its own configuration, its own assumptions about data formats, and its own approach to management
and monitoring. Data pipelines built from this hodgepodge of tools are brittle and difficult to manage.


Where does Kafka Connect fit?
Four common data-flow patterns around Kafka, with the hand-written API on the left and the higher-level option on the right:

  Source → Kafka : Producer API              or Kafka Connect Source
  Kafka  → Kafka : Consumer + Producer APIs  or Kafka Streams API
  Kafka  → Sink  : Consumer API              or Kafka Connect Sink
  Kafka  → App   : Consumer API inside the application
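
For comparison, here is the hand-written route for Source → Kafka. This is a minimal sketch assuming a broker at localhost:9092 and a topic named events (both placeholders); a Kafka Connect Source replaces code like this with pure configuration.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PlainProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Each call publishes one record to the "events" topic.
                producer.send(new ProducerRecord<>("events", "key-1", "value-1"));
            }
        }
    }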


Why Kafka Connect?

  • Programmers kept writing the same code to import data from the same kinds of sources into Kafka
  • Likewise, they kept writing the same code to export data from Kafka to the same kinds of sinks
  • Needed to achieve exactly-once semantics, fault tolerance, distribution, and ordering
  • Many ready-made connectors are available; you customise only their configuration (see the example below)
  • Forms the extract and load stages of an ETL pipeline
  • Scales easily, from small pipelines to company-wide pipelines
  • Configuration is submitted via a REST API
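
For example, the FileStreamSource connector that ships with Kafka needs only a small configuration; everything here except the connector class is a placeholder:

    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/tmp/input.txt
    topic=connect-test

In standalone mode this is passed as a properties file on the worker's command line; in distributed mode the same settings are submitted as JSON through the REST API (shown at the end of this post).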


Everything as Events in the Streaming world

In the streaming world, every piece of data is modelled as an event: a database row change, a log line, a click, a sensor reading. Each event is appended to a Kafka topic as an immutable record, and downstream systems react to the stream of events rather than polling for state.

How does Kafka Connect help in ETL?

Kafka Connect covers the two ends of the pipeline: source connectors perform the Extract step by pulling data from external systems into Kafka topics, and sink connectors perform the Load step by pushing data from topics into target systems. The Transform step happens in between, either through Connect's lightweight single-message transforms or through a stream-processing layer such as Kafka Streams.

What does a Kafka Connector do?

  • Kafka Connect loads multiple reusable connectors; a connector can be a source or a sink
  • Breaks a job into Tasks (see the sketch below)
  • Supplies configuration to those Tasks
  • Monitors and reconfigures Tasks when required
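
A minimal sketch of the source side of this contract, using the org.apache.kafka.connect API. The class names MySourceConnector and MySourceTask and the task.id key are hypothetical; the overridden methods are the real Connector interface.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.common.config.ConfigDef;
    import org.apache.kafka.connect.connector.Task;
    import org.apache.kafka.connect.source.SourceConnector;

    public class MySourceConnector extends SourceConnector {
        private Map<String, String> config;

        @Override
        public void start(Map<String, String> props) {
            this.config = props;            // connector-level configuration
        }

        @Override
        public Class<? extends Task> taskClass() {
            return MySourceTask.class;      // the Task that actually copies data
        }

        // The connector's main job: break the work into task configurations.
        @Override
        public List<Map<String, String>> taskConfigs(int maxTasks) {
            List<Map<String, String>> configs = new ArrayList<>();
            for (int i = 0; i < maxTasks; i++) {
                Map<String, String> taskConfig = new HashMap<>(config);
                taskConfig.put("task.id", String.valueOf(i)); // hypothetical work split
                configs.add(taskConfig);
            }
            return configs;
        }

        @Override
        public void stop() { }

        @Override
        public ConfigDef config() {
            return new ConfigDef();         // declare expected config keys here
        }

        @Override
        public String version() {
            return "0.1.0";
        }
    }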


What do Tasks do?
  • A Task is responsible for actually copying data to or from the target system
  • Tasks are executed by Kafka Connect Workers; a worker is a single Java process and can run in standalone or distributed (cluster) mode
  • Tasks can be reconfigured through the REST API
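
Continuing the sketch above, the hypothetical MySourceTask shows the task contract: the worker calls poll() in a loop and writes whatever records the task returns to Kafka.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.connect.data.Schema;
    import org.apache.kafka.connect.source.SourceRecord;
    import org.apache.kafka.connect.source.SourceTask;

    public class MySourceTask extends SourceTask {
        private String topic;

        @Override
        public void start(Map<String, String> props) {
            topic = props.get("topic");     // supplied by the connector's taskConfigs()
        }

        // Called repeatedly by the worker; each returned record is written to Kafka.
        @Override
        public List<SourceRecord> poll() throws InterruptedException {
            Thread.sleep(1000);             // stand-in for reading the external system
            Map<String, ?> partition = Collections.singletonMap("source", "demo");
            Map<String, ?> offset = Collections.singletonMap("position", 0L);
            return Collections.singletonList(
                new SourceRecord(partition, offset, topic,
                                 Schema.STRING_SCHEMA, "hello from a task"));
        }

        @Override
        public void stop() { }

        @Override
        public String version() {
            return "0.1.0";
        }
    }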


Kafka Connect REST API

A Connect worker in distributed mode exposes a REST API, by default on port 8083, for managing connectors: POST /connectors creates a connector, GET /connectors lists the running ones, GET /connectors/{name}/status reports health, PUT /connectors/{name}/config updates configuration, and DELETE /connectors/{name} removes a connector.

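A minimal sketch (Java 11+ HttpClient) that registers the FileStreamSource configuration from earlier against a worker on localhost:8083; the file path and topic name remain placeholders.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterConnector {
        public static void main(String[] args) throws Exception {
            // JSON body: the connector name plus the same settings as the
            // properties file shown earlier.
            String body = "{"
                + "\"name\": \"local-file-source\","
                + "\"config\": {"
                +   "\"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                +   "\"tasks.max\": \"1\","
                +   "\"file\": \"/tmp/input.txt\","
                +   "\"topic\": \"connect-test\""
                + "}}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            // Expect 201 Created on success; 409 if a rebalance is in progress.
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }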


