DEV Community

How I reduced $10000 Monthly AWS Glue Bill to $400 using Airflow

Akash Singh on February 15, 2025

During my time as a Devops Engineer at Vance, we were running around 80 ETL pipelines on AWS Glue, but as our workloads scaled, so did our costs—h...
 
Santhosh

While I find this a good read, the title doesn't seem convincing.
Glue and Airflow serve different purposes. Airflow is not a replacement for Glue, and it's not designed to process data. It's a high-end orchestration tool. The details of how you were using Glue to run up a 10k bill are missing. Maybe you were doing some ETL (using Spark)? (That's what it is designed for.)
Also, MWAA would solve those config issues with one click, and it's not expensive. It takes care of scaling (depending on the number of DAGs) and all maintenance activities. $0.50/hour? 50 DAGs?
Some 350 bucks per month!
Finally, it's super easy to keep DAGs in an S3 bucket and let Airflow monitor that location for new DAGs, rather than keeping them inside a Docker image. We can easily create a simple pipeline that deploys to this S3 bucket whenever a DAG changes, instead of handling it via Docker.
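
As a rough check on the MWAA estimate in this comment, here is a minimal Python sketch. It assumes a 30-day month and treats the roughly $0.50/hour figure quoted in this thread as the only charge. `monthly_base_fee` is a made-up helper name, and real MWAA pricing varies by region and environment size and adds worker, storage, and logging costs on top.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, always-on environment


def monthly_base_fee(hourly_rate, hours=HOURS_PER_MONTH):
    """Return the monthly cost of an environment billed at a flat hourly rate."""
    return hourly_rate * hours


if __name__ == "__main__":
    # 0.50 is the hourly figure quoted in this thread, not an official price.
    print(f"${monthly_base_fee(0.50):.2f} per month")
```

That comes to $360 for the base environment alone, in the same ballpark as the "some 350 bucks" above.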

Akash Singh

Thanks for your feedback! You’ve raised some great points, and I’d love to clarify a few things.

On the surface, Glue and Airflow serve different purposes: Glue is primarily an ETL service, whereas Airflow is an orchestration tool. Our use case involved heavy ETL processing on AWS Glue using Spark, which led to unexpectedly high costs. The $10K bill came not from orchestration but from running Glue jobs at scale. The motivation for moving to Airflow was to gain more control over execution and cost by running the work on Airflow workers, and to move away from cloud vendor lock-in.

Regarding MWAA, it's a great managed solution, but at our scale, self-hosted Airflow provided better flexibility and cost savings. MWAA's pricing (around $0.50/hour plus metadata storage, logging, and network costs) can add up quickly, especially when managing hundreds of DAGs. A self-managed setup also gave us more control over instance types, autoscaling, and optimizations that MWAA abstracts away.

For DAG deployment, I did explore the S3 approach, but we faced a lot of issues setting it up. Perhaps we were doing something wrong, so maybe you can point us to the right documentation. Anyway, whether you push the DAGs to S3 or write a simple CI pipeline to ship them for you is just a matter of choice.
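
For readers curious about the S3 route discussed here, a CI step for it could be quite small. The sketch below is only an illustration under assumptions: the `dags` key prefix, the bucket name, and both function names are hypothetical, and it relies on the `boto3` client's `upload_file` call plus AWS credentials being available in the CI environment. How the scheduler then picks the files up (MWAA reads DAGs from an S3 folder; a self-hosted setup would need its own sync step) is outside the sketch.

```python
from pathlib import Path


def plan_dag_uploads(dags_dir, prefix="dags"):
    """Map every .py file under dags_dir to the S3 key it would be uploaded to."""
    root = Path(dags_dir)
    plan = []
    for path in sorted(root.rglob("*.py")):
        # Keep the folder layout below dags_dir inside the key prefix.
        key = f"{prefix}/{path.relative_to(root).as_posix()}"
        plan.append((str(path), key))
    return plan


def upload_dags(plan, bucket):
    """Upload each planned file; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so planning works without AWS access

    s3 = boto3.client("s3")
    for local_path, key in plan:
        s3.upload_file(local_path, bucket, key)


if __name__ == "__main__":
    # "my-airflow-dags" is a placeholder bucket name.
    upload_dags(plan_dag_uploads("dags"), "my-airflow-dags")
```

Splitting planning from uploading keeps the part that decides what goes where testable without touching AWS.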

Wahid M

You still haven’t provided the details of your Glue jobs or of the Airflow DAGs replacing them. And Airflow ain’t a replacement for Glue.

Akash Singh

@wahid_m_1da6cebefa1750714 I can't share the exact details of the Glue jobs due to security concerns ofc, but they were on the lighter side as transformations go.

And yes, we were able to completely replace the Glue jobs with our Airflow setup using Airflow workers. Kindly refer to the blog for the details \oo/

Maurice Borgmeier

I don't understand, how does gaining more control over how you schedule your Glue Jobs reduce your costs unless you change something about the Glue Jobs?

  • Are you running them less frequently?
  • Did you change the logic to be more efficient?
  • Did you supply fewer DPUs?

Airflow is a great orchestrator and using MWAA seems like a much more painless setup, especially when you take into account future debugging / maintenance / operations expenses.

What are your costs for the self-managed Airflow + the Glue Jobs it's triggering?

Akash Singh

Answering @mauricebrg, the updated cost is literally just the compute for running the Airflow ECS services, approximately $400-$500 per month. Using MWAA is a more painless setup, yes, but our self-managed Airflow does not need much maintenance once it is set up.

Again, our Airflow is not triggering any Glue jobs. Rather, we have written DAGs for Airflow that mimic our Glue jobs, and we run them on Airflow workers using Celery. Kindly read the blog for further details!
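
The thread does not show any of the actual DAGs, so the following is only an illustration of what a "lighter" transformation could look like as plain Python that an Airflow worker runs inside a task. The column names `currency` and `amount` and both function names are invented for the example. In Airflow, `run_etl` would be the callable handed to a Python task, with the extract and load steps pointed at real sources instead of an in-memory string.

```python
import csv
import io
from collections import defaultdict


def transform(rows):
    """Sum the amount per currency, skipping rows whose amount is not numeric."""
    totals = defaultdict(float)
    for row in rows:
        try:
            currency = row["currency"]
            amount = float(row["amount"])
        except (KeyError, ValueError, TypeError):
            continue  # malformed or incomplete row: leave it out of the totals
        totals[currency] += amount
    return dict(totals)


def run_etl(source_text):
    """Extract rows from CSV text, transform them, and return the result."""
    rows = csv.DictReader(io.StringIO(source_text))
    return transform(rows)


if __name__ == "__main__":
    sample = "currency,amount\nUSD,10\nUSD,2.5\nINR,abc\nINR,100\n"
    print(run_etl(sample))
```

Work of this size fits in a worker's memory, which is the kind of case where a Spark cluster is not strictly required.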

Maurice Borgmeier

That means your savings are coming from you changing how you do ETL. IMO the more interesting story is how you replaced Glue Jobs that run some Spark-stuff with DAGs on Containers.

ImTheDeveloper

Interesting read.

I found a discussion where people weigh the pros and cons of AWS-managed Airflow vs. running your own, like you have:

reddit.com/r/aws/comments/15i3qzt/...

Akash Singh

Haha yes I had read this, MWAA be expensive fr!

Narayan Prajapat

Hi Akash,
You have moved your pipelines from AWS Glue to Airflow. But what about a pipeline that takes a long time and pushes a huge dataset into the warehouse? If you run that in Airflow, it may go down because you are moving such a huge dataset to the warehouse.

Akash Singh

That is a case we are aware of; we have set up autoscaling of the workers to handle the load accordingly!

Narayan Prajapat

Which is the best approach for a pipeline that moves data from a database to a warehouse?

  1. The entire pipeline runs on Airflow (with worker autoscaling set up to handle the load)
  2. Reads and writes run on AWS EMR, and the pipelines are managed in Airflow

Please share your thoughts.

Rajan Gupta (Raj)

Hi Akash,
It is really a good read.
But I still need some clarity. After moving from Glue workflow management to Apache Airflow, where do you run the actual Spark cluster that processes the ETL? Is it still Glue, or EMR on AWS? If it is still Glue, how has the cost dropped by 96% just by moving the workflow to Apache Airflow?
I also have a setup of multiple ETL jobs that uses S3, Glue 4.0, Glue workflows, and CloudWatch for logging and monitoring, plus email job notifications. We have four stages: RAW stored on S3 as flat files, and Bronze, Silver, and Gold, each stored on S3 as Parquet. Our ETL jobs run daily for 6 hours to ingest the new (delta) data files from the RAW stage into the other stages. We run three Glue jobs in parallel. Each Glue job uses 12 Spark worker nodes (each node has 8 vCPUs, 16 GB RAM, and roughly 96 GB SSD), so the three parallel jobs use 12 * 3 = 36 nodes. All these nodes are spawned only when the Spark jobs need them, because it is AWS Elastic MapReduce (EMR).
I hope we are doing it right. Please advise.
But my questions remain:
1.) After moving to Apache Airflow, where do the ETL Spark jobs run in your case?
2.) How many Spark worker nodes are you using?
3.) Where is your Spark worker cluster?
Can we connect on whatsapp quickly: +63 9628674764. I am in Manila, PH right now.
Regards
Raj Gupta
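
To put the cluster described in this comment into numbers, here is a small Python sketch of the sizing arithmetic. It assumes all three jobs hold all of their nodes for the full six-hour window, which is an upper bound since the comment says nodes are spawned only when needed. `NodeSpec` and `cluster_totals` are names made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    vcpus: int
    ram_gb: int


def cluster_totals(jobs, nodes_per_job, node, hours_per_day):
    """Aggregate capacity and daily node-hours for identical parallel jobs."""
    nodes = jobs * nodes_per_job
    return {
        "nodes": nodes,
        "vcpus": nodes * node.vcpus,
        "ram_gb": nodes * node.ram_gb,
        "node_hours_per_day": nodes * hours_per_day,
    }


if __name__ == "__main__":
    # Figures from the comment: 3 jobs, 12 nodes each, 8 vCPU / 16 GB per node.
    print(cluster_totals(3, 12, NodeSpec(vcpus=8, ram_gb=16), hours_per_day=6))
```

Under that assumption the setup peaks at 36 nodes, 288 vCPUs, and 576 GB of RAM, or up to 216 node-hours per day, which is the quantity any cost comparison would start from.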

Akash Singh

Hi Raj, I have answered your queries and follow-up questions on LinkedIn. Do let me know if any more clarifications are needed!

Aman

Great article, but the key thing to note is your statement that Airflow is an alternative to AWS Glue.
I created this exact self-managed Airflow deployment 3 years back using Ansible: github.com/netxillon/Airflow-gurmukh

Akash Singh

Alternative? Yes but with its tradeoffs and circus haha!

Pratik Singh

Great Read!! 💯

Akash Singh

Thank you bhaiya!

Bibhu Prasad Pala

Did you check out using AWS MWAA?

Akash Singh

Yes, but then the whole cost optimisation purpose is defeated

Bibhu Prasad Pala

Awesome blog. Just curious whether you have tested and documented the cost difference between MWAA and your current setup. That number would surely help others figure out which one to go for. Even a rough estimate would be useful. Thanks

Akash Singh

Yes, we did actually. I can't share the exact numbers, but while we achieved a 96% reduction using self-hosted Airflow, MWAA would only have given us a 30%-35% reduction in cost at most. This estimate is subject to vary tho
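
For a sense of scale, the percentages in this reply can be applied to the headline figures from the article title ($10,000 before, $400 after). The Python sketch below is just that arithmetic. The helper names are made up, and the MWAA range is the rough estimate given in this reply, not a measured result.

```python
def percent_reduction(before, after):
    """Percentage by which a cost dropped from `before` to `after`."""
    return (before - after) * 100 / before


def cost_after_reduction(before, percent):
    """Remaining cost once `before` is reduced by `percent` percent."""
    return before * (100 - percent) / 100


if __name__ == "__main__":
    print(percent_reduction(10_000, 400))    # self-hosted Airflow
    print(cost_after_reduction(10_000, 30))  # MWAA, low end of the estimate
    print(cost_after_reduction(10_000, 35))  # MWAA, high end of the estimate
```

On those inputs the self-hosted setup is a 96% reduction, while a 30%-35% reduction would leave a bill of roughly $6,500 to $7,000 per month.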

assaad salameh

Nice stuff.
Why didn't you use Fargate instead of EC2 for the ECS cluster? Just curious

Akash Singh

Another classic case of a cost optimisation decision, that's all

Venkatesh Soma

Wonderfully written. How often we forget that there are alternatives to the serverless offerings from cloud providers that can reduce costs drastically.

Akash Singh

Exactly! A classic example of Vendor Lock-In

Alfiya Fatima

Interesting!!!

Akash Singh

glad you found it so

Nandkumar Vanamali

Crazy work bhaiya! 🔥

Akash Singh

Thank you!

Dhruv

Holy moly, that's crazy work ! 🔥

Akash Singh

Thank you!

shreyas karanam

Awesome read!

Akash Singh

Thank you!

Kui Shen

Helpful 🤝

Akash Singh

Thank You!

Sourabh Singh Rana

My question:
Earlier you were using Glue to process data, which uses Spark under the hood.

But now, with Airflow, how are you processing the data?
Are you using Airflow workers for ETL?!

Akash Singh

Yes! We are using Airflow workers, and I have also written up the steps to set them up using Celery workers and Redis!

Rishi Arora

Unless the Glue jobs were badly written, or Glue was not the right fit for those jobs, this is not possible.

Akash Singh

Haha, could you elaborate on why this is not possible? As I said, we had 80+ ETL pipelines, most of them running every hour or every 30 minutes xD

Tanmay Srivastava

Crazy work bhaiya!!🔥

Akash Singh

Thanks Bro!

Pradeep Kumar

How much time did it take you to achieve this?

Akash Singh

This took us around 2 weeks from start to sanity all in all