Streaming data from Kafka Topic to Spark using Spark Structured Streaming in a Docker environment - Part 0

This project is a simple Kafka and Spark streaming architecture. A docker-compose file initializes a Kafka cluster and a Spark cluster with all their dependencies. Producers send text messages to a Kafka topic named "test-topic". You can either consume the messages with consumer scripts written in NodeJS and Python, or stream the data with Spark Structured Streaming, which simply prints all received data to the console.
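For example, once the environment is running you can peek at the raw messages on "test-topic" with Kafka's console consumer. This is a minimal sketch that assumes the broker container is simply named kafka and ships the standard Kafka CLI scripts on its PATH (as the Bitnami image does), using the in-network broker address listed in the services table further down:

    # Read everything that has been produced to the topic so far (Ctrl+C to stop)
    docker exec -it kafka \
      kafka-console-consumer.sh \
        --bootstrap-server 172.18.0.9:9092 \
        --topic test-topic \
        --from-beginning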

Workflow

This version is optimized to be more lightweight. Everything runs from one simple command, and the only dependencies you need on your computer are docker and docker-compose.
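You can quickly confirm that both are installed:

    # Both commands should print a version number
    docker --version && docker-compose --version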

Warning: The Spark streaming job only works in Scala for now; the Python version is a work in progress.

Everything is automatic in this project.

For now, all you have to do is run a simple script that will trigger everything.

You can then dive deeper into the code and play around with it to get your hands dirty 😊

GitHub repository of the source code

In my next articles, I'll go through the process of creating each and every service: the Python and NodeJS producers and consumers, the Kafka broker, the Spark streaming code, and the Docker containers. Stay tuned!

Requirements

Please make sure:

  • you can run commands with root privileges on your computer
  • port 8080 is not already in use (see the quick check below)
  • the subnet 172.18.0.0/24 is not already in use on your computer (see the quick check below)
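A quick way to check the last two points on a Linux host (a sketch that assumes the ss and docker command-line tools are available):

    # Print the "free" message if no process is listening on port 8080
    ss -tln | grep -w 8080 || echo "port 8080 is free"

    # Print the name and subnet of any docker network already using 172.18.0.x
    docker network inspect --format '{{.Name}} {{range .IPAM.Config}}{{.Subnet}}{{end}}' \
      $(docker network ls -q) | grep 172.18.0 || echo "subnet 172.18.0.0/24 is free"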

Project folder structure

.
├── architecture.png........ # Architecture of the project
├── clean-env.sh............ # Cleans the environment
├── docker-compose.yml...... # Creates the kafka and spark clusters
├── nodejs-consumer......... # Consumes messages from kafka
│   ├── consumer.js
│   ├── Dockerfile
│   ├── package.json
│   └── package-lock.json
├── nodejs-producer......... # Produces messages to kafka
│   ├── Dockerfile
│   ├── package.json
│   └── producer.js
├── python-consumer......... # Consumes messages from kafka
│   ├── consumer.py
│   └── Dockerfile
├── python-producer......... # Produces messages to kafka
│   ├── Dockerfile
│   └── producer.py
├── README.md
└── spark-streaming......... # Consumes streaming data from kafka and sinks it to the console
    ├── python.............. # Streaming with python (Work In Progress)
    └── scala............... # Streaming with scala

Running services

service name               address[:port]
zookeeper                  172.18.0.8:2181
kafka (from host)          172.18.0.9:9093
kafka (inside container)   172.18.0.9:9092
spark master               172.18.0.10:7077
spark UI                   172.18.0.10:8080
spark worker 1             172.18.0.11
spark worker 2             172.18.0.12
spark-streaming-kafka      172.18.0.13
nodejs-producer            172.18.0.14
nodejs-consumer            172.18.0.15
python-producer            172.18.0.16
python-consumer            172.18.0.17

The project creates a docker network named "kafka-spark" on the address range 172.18.0.0/24.
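Once the stack is up, you can cross-check the table above against the addresses docker actually assigned (a sketch; note that on some setups docker-compose prefixes the network name with the project name):

    # Print each container attached to the network with its IPv4 address
    docker network inspect --format \
      '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}' \
      kafka-spark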

Getting Started

Note: You can go through the docker-compose.yml or the run.sh files to better understand how things work.

1. Clone the repo and cd into the folder

Note: Be sure you are inside the cloned directory

    git clone https://github.com/MDiakhate12/kafka-spark-streaming-docker.git
    cd kafka-spark-streaming-docker/

2. Run docker-compose.yml

Important: Don't close the terminal after you have executed docker-compose

    docker-compose up

Note: Wait until all services are up (about 1 to 2 minutes; the console will be fairly quiet during that time)
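To check where the services stand without scrolling through the combined output, you can open a second terminal and run:

    docker-compose ps

All services should eventually show an Up state.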

3. Submit the spark streaming job

Note: Be sure to have root privileges

In a new terminal, run the command:

    sudo chmod 777 jars_dir && \
    docker exec -it spark \
    spark-submit \
    --packages "org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0" \
    --master "spark://172.18.0.10:7077" \
    --class Streaming \
    --conf spark.jars.ivy=/opt/bitnami/spark/ivy \
    ivy/spark-streaming-with-kafka_2.12-1.0.jar

After everything is set, your output should look like this:

(Screenshot: console output once the streaming job is running.)

That's it 🎉🎉 Congratulations!

Look at the result

Note: The spark UI is available at http://172.18.0.10:8080

In a new terminal, you can see the logs of each service by running:

    docker-compose logs -f [SERVICE_NAME]

Available services are:

  1. zookeeper
  2. kafka
  3. spark
  4. spark-worker-1
  5. spark-worker-2
  6. spark-streaming-kafka
  7. nodejs-producer
  8. nodejs-consumer
  9. python-producer
  10. python-consumer
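For example, to watch the messages arriving at the python consumer:

    docker-compose logs -f python-consumer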