
During my time as a DevOps Engineer at Vance, we were running around 80 ETL pipelines on AWS Glue, but as our workloads scaled, so did our costs—h...
While I find this a good read, the title doesn't seem convincing.
Glue and Airflow serve different purposes. Airflow is not a replacement for Glue, and it's not designed to process data; it's a high-end orchestration tool. The details of why you were using Glue to run up a $10K bill are missing. Maybe you were doing some ETL (using Spark)? That's what it is designed for.
Also, MWAA would solve those config issues with one click, and it's not expensive. It takes care of scaling (depending on the number of DAGs) and all maintenance activities. $0.50/hour? 50 DAGs?
Some 350 bucks per month!! (roughly $0.50/hour × 730 hours)
Finally, it's super easy to keep DAGs in an S3 bucket and let Airflow monitor that location for new DAGs, instead of keeping them inside a Docker image. We can easily create a simple pipeline that deploys to this S3 bucket whenever a DAG changes, rather than handling it via Docker.
Thanks for your feedback! You’ve raised some great points, and I’d love to clarify a few things.
On the surface, Glue and Airflow serve different purposes: Glue is primarily an ETL service, whereas Airflow is an orchestration tool. Our use case involved heavy ETL processing with AWS Glue using Spark, which led to unexpectedly high costs. The $10K bill wasn't just from orchestration but from running Glue jobs at scale. The motivation behind moving to Airflow was to gain more control over execution and cost efficiency using Airflow workers, and to move away from cloud vendor lock-in.
Regarding MWAA, it's a great managed solution, but for our scale, self-hosted Airflow provided better flexibility and cost savings. MWAA's pricing (around $0.50/hour plus metadata storage, logging, and network costs) can add up quickly, especially when managing hundreds of DAGs. In some cases, a self-managed setup gave us more control over instance types, autoscaling, and optimizations that MWAA abstracts away anyway.
For DAG deployment, I did explore the S3 approach, but we faced a lot of issues setting it up. Maybe you can point me to the right documentation for this; perhaps we were doing something wrong. Anyway, pushing the DAGs to S3 directly or writing a simple CI pipeline to do it for you is just a matter of choice.
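For anyone curious, such a CI step really is tiny. Here is a rough sketch of what the deploy-to-S3 part could look like with boto3 (purely illustrative; the bucket name and prefix are placeholders, not our actual setup):

```python
# Illustrative only: upload local DAG files to the S3 bucket that Airflow
# (or a sync sidecar) watches for new DAGs. Bucket/prefix are placeholders.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
DAG_BUCKET = "my-airflow-dags-bucket"  # placeholder
DAG_PREFIX = "dags/"

for dag_file in Path("dags").glob("*.py"):
    s3.upload_file(str(dag_file), DAG_BUCKET, DAG_PREFIX + dag_file.name)
    print(f"uploaded {dag_file} -> s3://{DAG_BUCKET}/{DAG_PREFIX}{dag_file.name}")
```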
You still haven't provided the details of your Glue jobs or the Airflow DAGs replacing them. And Airflow ain't a replacement for Glue.
@wahid_m_1da6cebefa1750714 I can't share the exact details of the Glue jobs due to security concerns ofc, but they were on the lighter side of transformations.
And yes, we were able to completely replace the Glue jobs with our Airflow setup using Airflow workers; kindly refer to the blog for the details \oo/
I don't understand, how does gaining more control over how you schedule your Glue Jobs reduce your costs unless you change something about the Glue Jobs?
Airflow is a great orchestrator and using MWAA seems like a much more painless setup, especially when you take into account future debugging / maintenance / operations expenses.
What are your costs for the self-managed Airflow + the Glue Jobs it's triggering?
Answering @mauricebrg: the updated cost is literally just the compute for running the Airflow ECS services, approximately $400-$500 per month. Using MWAA is a more painless setup, yes, but there isn't much maintenance needed for our self-managed Airflow once we have set it up.
Again, our Airflow is not triggering any Glue jobs; rather, we have written DAGs that mimic our Glue jobs and run them on Airflow workers using Celery. Kindly read the blog for further details!
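To give a rough idea of the shape, each Glue job became an ordinary Airflow DAG whose tasks run the transformation in Python on the Celery workers. A hypothetical, stripped-down example (names, schedule, and the transform itself are placeholders, not one of our actual pipelines):

```python
# Hypothetical sketch of a light Glue-style ETL rewritten as an Airflow DAG
# that runs entirely on Celery workers. Names and logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_transform_load():
    rows = [{"id": 1, "amount": "42.50"}]                          # pretend extract
    cleaned = [{**r, "amount": float(r["amount"])} for r in rows]  # pretend transform
    print(f"loaded {len(cleaned)} rows into the warehouse")        # pretend load


with DAG(
    dag_id="example_etl_pipeline",     # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/30 * * * *",  # most pipelines ran every 30 min or hourly
    catchup=False,
):
    PythonOperator(task_id="run_etl", python_callable=extract_transform_load)
```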
That means your savings are coming from you changing how you do ETL. IMO the more interesting story is how you replaced Glue Jobs that run some Spark-stuff with DAGs on Containers.
Interesting read.
There's a discussion I found here about some people having pro/cons for Aws managed airflow vs. running your own like you have
reddit.com/r/aws/comments/15i3qzt/...
Haha yes I had read this, MWAA be expensive fr!
Hi Akash,
You have moved your pipelines from AWS Glue to Airflow. But what if a pipeline takes too much time and pushes a huge dataset into the warehouse? If you run it on Airflow, it could go down because you are moving a huge dataset into the warehouse.
That is a case we are aware of; we have set up autoscaling of the workers to handle the load accordingly!
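Since the workers run as an ECS service, one way to express that kind of scaling is a target-tracking policy via Application Auto Scaling. A hedged boto3 sketch follows; the cluster/service names, capacities, and the CPU-based metric are assumptions for illustration, not necessarily our exact setup:

```python
# Hedged sketch: target-tracking autoscaling for an ECS service running the
# Celery workers. Resource names and thresholds are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/airflow-cluster/airflow-worker"  # placeholder cluster/service

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="airflow-worker-cpu-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep average worker CPU around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```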
Which is the best approach for a pipeline moving data from a database to a warehouse?
Please suggest your thoughts.
Hi Akash,
It is really a good read.
But I still need some clarity. After moving workflow management from Glue to Apache Airflow, where do you run the actual Spark cluster to process the ETL? Is it still using AWS Glue, or EMR? If it is still using Glue, then how has the cost been reduced by 96% just by moving the workflow to Apache Airflow?
I also have a setup of multiple ETL jobs which uses S3, Glue 4.0, Glue Workflow, and CloudWatch for logging and monitoring, plus email for job notifications. We have four stages: RAW stored on S3 as flat files, Bronze in Parquet on S3, Silver in Parquet on S3, and Gold in Parquet on S3. Our ETL jobs run daily for 6 hours to ingest the new delta data/files from the RAW stage into the other stages. We run three Glue jobs in parallel. Each Glue job uses 12 Spark worker nodes (each node 8 vCPU, 16 GB RAM, and roughly 96 GB SSD). So for the three Glue jobs running in parallel we use 12 * 3 = 36 nodes. But all these nodes spawn only when needed for the Spark jobs, because it is AWS Elastic MapReduce (EMR).
I hope we are doing it right. Please suggest.
But my queries still remain:
1.) After moving to Apache Airflow, where are the ETL Spark jobs running in your case?
2.) How many Spark worker nodes are you using?
3.) Where is your Spark worker cluster?
Can we connect on WhatsApp quickly: +63 9628674764. I am in Manila, PH right now.
Regards
Raj Gupta
Hi Raj, I have answered your queries and follow-up questions on LinkedIn; do let me know if any more clarifications are needed!
Great article, but a key thing to note is your statement that Airflow is an alternative to AWS Glue.
I created this exact self-managed Airflow deployment 3 years back using Ansible: github.com/netxillon/Airflow-gurmukh
Alternative? Yes but with its tradeoffs and circus haha!
Great Read!! 💯
Thank you bhaiya!
Did you check out using AWS MWAA?
Yes, but then the whole cost optimisation purpose is defeated
Awesome blog. Just curious whether you have tested and documented the cost difference between MWAA and your current setup. Surely this number would help others figure out which they should go for; even a rough estimate would make sense. Thanks
Yes, we did actually. I can't share the exact numbers, but while we achieved a 96% reduction using self-hosted Airflow, MWAA would only give us a max of 30%-35% reduction in cost. This estimate is subject to variation, though.
Nice stuff.
Why didn't you use Fargate instead of EC2 for the ECS cluster? Just curious.
Another classic case of a cost-optimisation decision, that's all.
Wonderfully written. How often we forget the alternatives to cloud providers' otherwise serverless offerings, alternatives that can reduce costs drastically.
Exactly! A classic example of Vendor Lock-In
Interesting!!!
glad you found it so
Crazy work bhaiya! 🔥
Thank you!
Holy moly, that's crazy work ! 🔥
Thank you!
Awesome read!
Thank you!
Helpful 🤝
Thank You!
My question: earlier you were using Glue to process data, and under the hood Spark is used. But now, in Airflow, how are you processing the data? Are you using Airflow workers for ETL?!
Yes! We are using Airflow workers, and I have written up the steps to set them up with Celery workers and Redis as well!
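For context, the executor wiring itself comes down to a handful of Airflow settings. Here is the gist, shown as the AIRFLOW__SECTION__KEY environment variables Airflow reads (hostnames and credentials are placeholders):

```python
# Core settings for CeleryExecutor with a Redis broker, expressed as the
# environment variables Airflow reads at startup. Hosts/credentials are placeholders.
import os

os.environ.update({
    "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
    "AIRFLOW__CELERY__BROKER_URL": "redis://redis-host:6379/0",
    "AIRFLOW__CELERY__RESULT_BACKEND": "db+postgresql://airflow:password@db-host:5432/airflow",
})
```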
Unless the Glue jobs were badly written and were not the right use case for Glue in the first place, that's not possible.
Haha, could you elaborate on why this is not possible? As I said, we had 80+ ETL pipelines running hourly, or every 30 minutes for most pipelines xD
Crazy work bhaiya!!🔥
Thanks Bro!
How much time did it take you to achieve this??
This took us around 2 weeks from start to sanity, all in all!