
During my time as a DevOps Engineer at Vance, we were running around 80 ETL pipelines on AWS Glue, but as our workloads scaled, so did our costs—h...
While I find this a good read, the title doesn't seem convincing.
Glue and Airflow serve different purposes. Airflow is not a replacement for Glue, and it's not designed to process data; it's a high-end orchestration tool. The details of why you were using Glue to run up a $10K bill are missing. Maybe you were doing some ETL (using Spark)? That's what it is designed for.
Also, MWAA would solve those config issues with one click, and it's not expensive. It takes care of scaling (depending on the number of DAGs) and all maintenance activities. $0.50/hour? 50 DAGs?
Some 350 bucks per month!! (roughly $0.50/hour × 730 hours)
Finally, it's super easy to keep DAGs in an S3 bucket and let Airflow monitor that location for new DAGs, instead of keeping them inside a Docker image. We can easily create a simple pipeline that deploys to this S3 bucket whenever a DAG changes, rather than handling it via Docker.
Thanks for your feedback! You’ve raised some great points, and I’d love to clarify a few things.
On the surface, Glue and Airflow serve different purposes: Glue is primarily an ETL service, whereas Airflow is an orchestration tool. Our use case involved heavy ETL processing with AWS Glue using Spark, which led to unexpectedly high costs. The $10K bill wasn't just from orchestration but from running Glue jobs at scale. The motivation behind moving to Airflow was to gain more control over execution and cost efficiency using Airflow workers, and to move away from cloud vendor lock-in.
Regarding MWAA, it's a great managed solution, but for our scale, self-hosted Airflow provided better flexibility and cost savings. MWAA's pricing (around $0.50/hour plus metadata storage, logging, and network costs) can add up quickly, especially when managing hundreds of DAGs. In some cases, a self-managed setup gave us more control over instance types, autoscaling, and optimizations that MWAA abstracts away anyway.
For DAG deployment, I did explore the S3 approach, but we faced a lot of issues setting it up. Maybe you can point me to the right documentation for this; perhaps we were doing something wrong. Anyway, pushing the DAGs to S3 directly or writing a simple CI pipeline to do it for you is just a matter of choice.
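For anyone curious, such a CI step really is tiny. Here is a rough sketch of what the deploy-to-S3 part could look like with boto3 (purely illustrative; the bucket name and prefix are placeholders, not our actual setup):

```python
# Illustrative only: upload local DAG files to the S3 bucket that Airflow
# (or a sync sidecar) watches for new DAGs. Bucket/prefix are placeholders.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
DAG_BUCKET = "my-airflow-dags-bucket"  # placeholder
DAG_PREFIX = "dags/"

for dag_file in Path("dags").glob("*.py"):
    s3.upload_file(str(dag_file), DAG_BUCKET, DAG_PREFIX + dag_file.name)
    print(f"uploaded {dag_file} -> s3://{DAG_BUCKET}/{DAG_PREFIX}{dag_file.name}")
```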
You still haven't provided the details of your Glue jobs or the Airflow DAGs replacing them. And Airflow ain't a replacement for Glue.
@wahid_m_1da6cebefa1750714 I can't share the exact details of the Glue jobs due to security concerns ofc, but they were on the lighter side of transformations.
And yes, we were able to completely replace the Glue jobs with our Airflow setup using Airflow workers; kindly refer to the blog for the details \oo/
I don't understand, how does gaining more control over how you schedule your Glue Jobs reduce your costs unless you change something about the Glue Jobs?
Airflow is a great orchestrator and using MWAA seems like a much more painless setup, especially when you take into account future debugging / maintenance / operations expenses.
What are your costs for the self-managed Airflow + the Glue Jobs it's triggering?
Answering @mauricebrg: the updated cost is literally just the compute for running the Airflow ECS services, approximately $400-$500 per month. Using MWAA is a more painless setup, yes, but there isn't much maintenance needed for our self-managed Airflow once we have set it up.
Again, our Airflow is not triggering any Glue jobs; rather, we have written DAGs that mimic our Glue jobs and run them on Airflow workers using Celery. Kindly read the blog for further details!
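To give a rough idea of the shape, each Glue job became an ordinary Airflow DAG whose tasks run the transformation in Python on the Celery workers. A hypothetical, stripped-down example (names, schedule, and the transform itself are placeholders, not one of our actual pipelines):

```python
# Hypothetical sketch of a light Glue-style ETL rewritten as an Airflow DAG
# that runs entirely on Celery workers. Names and logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_transform_load():
    rows = [{"id": 1, "amount": "42.50"}]                          # pretend extract
    cleaned = [{**r, "amount": float(r["amount"])} for r in rows]  # pretend transform
    print(f"loaded {len(cleaned)} rows into the warehouse")        # pretend load


with DAG(
    dag_id="example_etl_pipeline",     # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/30 * * * *",  # most pipelines ran every 30 min or hourly
    catchup=False,
):
    PythonOperator(task_id="run_etl", python_callable=extract_transform_load)
```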
That means your savings are coming from you changing how you do ETL. IMO the more interesting story is how you replaced Glue Jobs that run some Spark-stuff with DAGs on Containers.
Interesting read.
There's a discussion I found here about some people having pro/cons for Aws managed airflow vs. running your own like you have
reddit.com/r/aws/comments/15i3qzt/...
Haha yes I had read this, MWAA be expensive fr!
Hi Akash,
You have moved your pipelines from AWS Glue to Airflow. But what if a pipeline takes too much time and pushes a huge dataset into the warehouse? If you run it on Airflow, it could go down because you are moving a huge dataset into the warehouse.
That is a case we are aware of; we have set up autoscaling of the workers to handle the load accordingly!
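Since the workers run as an ECS service, one way to express that kind of scaling is a target-tracking policy via Application Auto Scaling. A hedged boto3 sketch follows; the cluster/service names, capacities, and the CPU-based metric are assumptions for illustration, not necessarily our exact setup:

```python
# Hedged sketch: target-tracking autoscaling for an ECS service running the
# Celery workers. Resource names and thresholds are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/airflow-cluster/airflow-worker"  # placeholder cluster/service

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="airflow-worker-cpu-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep average worker CPU around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```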
Which is the best approach for a pipeline moving data from a database to a warehouse?
Please suggest your thoughts.
Hi Akash,
It is really a good read.
But I still need some clarity. After moving workflow management from Glue to Apache Airflow, where do you run the actual Spark cluster to process the ETL? Is it still using AWS Glue, or EMR? If it is still using Glue, then how has the cost been reduced by 96% just by moving the workflow to Apache Airflow?
I also have a setup of multiple ETL jobs which uses S3, Glue 4.0, Glue Workflow, and CloudWatch for logging and monitoring, plus email for job notifications. We have four stages: RAW stored on S3 as flat files, Bronze in Parquet on S3, Silver in Parquet on S3, and Gold in Parquet on S3. Our ETL jobs run daily for 6 hours to ingest the new delta data/files from the RAW stage into the other stages. We run three Glue jobs in parallel. Each Glue job uses 12 Spark worker nodes (each node 8 vCPU, 16 GB RAM, and roughly 96 GB SSD). So for the three Glue jobs running in parallel we use 12 * 3 = 36 nodes. But all these nodes spawn only when needed for the Spark jobs, because it is AWS Elastic MapReduce (EMR).
I hope we are doing it right. Please suggest.
But my queries still remain:
1.) After moving to Apache Airflow, where are the ETL Spark jobs running in your case?
2.) How many Spark worker nodes are you using?
3.) Where is your Spark worker cluster?
Can we connect on WhatsApp quickly: +63 9628674764. I am in Manila, PH right now.
Regards
Raj Gupta
Hi Raj, I have answered your queries and follow-up questions on LinkedIn; do let me know if any more clarifications are needed!
Great article, but a key thing to note is your statement that Airflow is an alternative to AWS Glue.
I created this exact self-managed Airflow deployment 3 years back using Ansible: github.com/netxillon/Airflow-gurmukh
Alternative? Yes but with its tradeoffs and circus haha!
Great Read!! 💯
Thank you bhaiya!
Did you check out using AWS MWAA?
Yes, but then the whole cost optimisation purpose is defeated
Awesome blog. Just curious whether you have tested and documented the cost difference between MWAA and your current setup. Surely this number would help others figure out which they should go for; even a rough estimate would make sense. Thanks
Yes, we did actually. I can't share the exact numbers, but while we achieved a 96% reduction using self-hosted Airflow, MWAA would only give us a max of 30%-35% reduction in cost. This estimate is subject to variation, though.
Nice stuff.
Why didn't you use Fargate instead of EC2 for the ECS cluster? Just curious.
Another classic case of a cost-optimisation decision, that's all.
Wonderfully written. How often we forget the alternatives to cloud providers' otherwise serverless offerings, alternatives that can reduce costs drastically.
Exactly! A classic example of Vendor Lock-In
Interesting!!!
glad you found it so
Crazy work bhaiya! 🔥
Thank you!
Holy moly, that's crazy work ! 🔥
Thank you!
Awesome read!
Thank you!
Helpful 🤝
Thank You!
My question: earlier you were using Glue to process data, and under the hood Spark is used. But now, in Airflow, how are you processing the data? Are you using Airflow workers for ETL?!
Yes! We are using Airflow workers, and I have written up the steps to set them up with Celery workers and Redis as well!
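For context, the executor wiring itself comes down to a handful of Airflow settings. Here is the gist, shown as the AIRFLOW__SECTION__KEY environment variables Airflow reads (hostnames and credentials are placeholders):

```python
# Core settings for CeleryExecutor with a Redis broker, expressed as the
# environment variables Airflow reads at startup. Hosts/credentials are placeholders.
import os

os.environ.update({
    "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
    "AIRFLOW__CELERY__BROKER_URL": "redis://redis-host:6379/0",
    "AIRFLOW__CELERY__RESULT_BACKEND": "db+postgresql://airflow:password@db-host:5432/airflow",
})
```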
Unless the Glue jobs were badly written and were not the right use case for Glue in the first place, that's not possible.
Haha, could you elaborate on why this is not possible? As I said, we had 80+ ETL pipelines running hourly, or every 30 minutes for most pipelines xD
Crazy work bhaiya!!🔥
Thanks Bro!
How much time did it take you to achieve this??
This took us around 2 weeks from start to sanity, all in all!