Being a data engineer is fun in its own way. Thinking of a different, simpler and more optimised way to solve a problem is 90% of the effort. A few years back, when both I and AWS Glue were young, I built a whole ETL pipeline where a single trigger (manual or scheduled) could fetch the data, curate it and have it ready to use in just a few minutes.
Let's start with the architecture (minimal code sketches for the main steps follow the list):
- The invocation can be manual or on an automatic schedule, and it executes a Glue job named Driver. Its main responsibility is to check whether any other run is already in progress, fetch the requirements and configuration, and pass them to another Glue job called Controller.
- Controller is responsible for the whole execution: it tracks whether the pipeline ends successfully or fails and whether a retry is needed. It also triggers the next worker Glue job when the previous one finishes.
- Amazon RDS keeps a record of every step; in short, it is our logging database. If a retry is needed, the latest state is fetched from RDS so we know where the job should start again.
- The 1st worker job, Fetch CSV, fetches the data in CSV format from the source, which can be RDS, S3, Data Streams or anything else, and stores it in S3.
- The 2nd worker job, 'Convert to Parquet', is triggered when the 1st completes. It fetches the CSV files from S3 and converts them to Parquet, which is lighter, easier to curate and smaller in size.
- The 3rd worker, 'Curate Data', runs after the 2nd worker job finishes. It reads the Parquet data from S3, curates it with a Spark job and stores the result in the final S3 bucket.
- Meanwhile, Glue Crawlers run on the S3 buckets to extract the metadata (schema and partitions) for Athena.
- Lastly, Athena is used to query the data from S3.
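
Here is a minimal sketch of what the Driver job could look like: it uses boto3 to check whether any Controller run is still in flight, loads the configuration, and starts the Controller. The job name `etl-controller`, the S3 config location and the argument names are placeholders for illustration, not the exact ones from my pipeline.

```python
import json
import sys

import boto3

glue = boto3.client("glue")
s3 = boto3.client("s3")

# Hypothetical names used for illustration only.
CONTROLLER_JOB = "etl-controller"
CONFIG_BUCKET = "my-etl-config-bucket"
CONFIG_KEY = "pipeline/config.json"


def controller_is_running() -> bool:
    """Return True if any Controller run is still starting, running or stopping."""
    runs = glue.get_job_runs(JobName=CONTROLLER_JOB, MaxResults=20)
    return any(
        r["JobRunState"] in ("STARTING", "RUNNING", "STOPPING")
        for r in runs["JobRuns"]
    )


def load_config() -> dict:
    """Fetch the pipeline requirements/configuration from S3."""
    obj = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)
    return json.loads(obj["Body"].read())


if __name__ == "__main__":
    if controller_is_running():
        print("Another pipeline run is already in progress, exiting.")
        sys.exit(0)

    config = load_config()
    # Hand the configuration to the Controller as job arguments.
    glue.start_job_run(
        JobName=CONTROLLER_JOB,
        Arguments={"--pipeline_config": json.dumps(config)},
    )
```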
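The Controller can then chain the worker jobs, log each step and retry the one that failed. This is only a sketch under assumptions: the worker job names, the `log_step` stub and the retry count are illustrative; in the real pipeline the logging call writes to RDS (see the next sketch).

```python
import time

import boto3

glue = boto3.client("glue")

# Hypothetical worker job names, in execution order.
WORKER_JOBS = ["fetch-csv", "convert-to-parquet", "curate-data"]
MAX_RETRIES = 2


def log_step(job_name: str, state: str) -> None:
    """Stub: in the real pipeline this writes a row to the RDS logging table."""
    print(f"[log] {job_name}: {state}")


def run_and_wait(job_name: str) -> str:
    """Start a worker Glue job and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    log_step(job_name, "STARTED")
    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            log_step(job_name, state)
            return state
        time.sleep(30)


def main() -> None:
    # Run the workers one after another; retry a failed step a limited number of times.
    for job_name in WORKER_JOBS:
        attempts = 0
        while run_and_wait(job_name) != "SUCCEEDED":
            attempts += 1
            if attempts > MAX_RETRIES:
                raise RuntimeError(f"{job_name} failed after {MAX_RETRIES} retries")
            log_step(job_name, f"RETRY {attempts}")


if __name__ == "__main__":
    main()
```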
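For the RDS logging piece, each step's state can be written to a small table so a retry knows where to resume. The sketch below assumes a MySQL-flavoured RDS instance reached via pymysql (packaged with the Glue job) and a `pipeline_steps` table with `job_name`, `state` and `logged_at` columns; all of these are assumptions, not the original schema.

```python
from datetime import datetime, timezone

import pymysql  # assumes the driver is packaged with the Glue job

# Hypothetical connection details and table schema.
conn = pymysql.connect(
    host="my-etl-logging.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    user="etl_user",
    password="change-me",
    database="etl_logs",
)


def log_step(job_name: str, state: str) -> None:
    """Insert one logging row per step so a retry can resume from the last state."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO pipeline_steps (job_name, state, logged_at) VALUES (%s, %s, %s)",
            (job_name, state, datetime.now(timezone.utc)),
        )
    conn.commit()


def last_successful_step():
    """Return the most recently succeeded job name, i.e. where a retry should resume."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT job_name FROM pipeline_steps WHERE state = 'SUCCEEDED' "
            "ORDER BY logged_at DESC LIMIT 1"
        )
        row = cur.fetchone()
    return row[0] if row else None
```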
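The 'Convert to Parquet' and 'Curate Data' workers are ordinary Glue Spark jobs. A minimal PySpark sketch of both steps is below; the bucket paths, column names and the dedup/cast rules are made-up examples, not the actual curation logic.

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Hypothetical bucket layout used for illustration.
RAW_CSV_PATH = "s3://my-etl-raw/csv/"
PARQUET_PATH = "s3://my-etl-staging/parquet/"
CURATED_PATH = "s3://my-etl-curated/orders/"

# Step 2: read the raw CSV files and rewrite them as Parquet.
csv_df = spark.read.option("header", "true").csv(RAW_CSV_PATH)
csv_df.write.mode("overwrite").parquet(PARQUET_PATH)

# Step 3: read the Parquet back and apply some example curation rules:
# drop duplicates, cast types and keep only valid rows.
parquet_df = spark.read.parquet(PARQUET_PATH)
curated_df = (
    parquet_df.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount").isNotNull())
)
curated_df.write.mode("overwrite").partitionBy("order_date").parquet(CURATED_PATH)
```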
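Finally, the crawler and Athena sides can also be driven from code. This sketch starts a crawler over the curated bucket and fires an Athena query against the table it catalogues; the crawler name, database, table and output location are placeholder values, and in practice you would wait for the crawler to finish before querying.

```python
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Hypothetical resource names.
CRAWLER_NAME = "curated-orders-crawler"
ATHENA_DATABASE = "etl_curated"
ATHENA_OUTPUT = "s3://my-etl-athena-results/"

# Refresh the Data Catalog metadata for the curated S3 data.
glue.start_crawler(Name=CRAWLER_NAME)

# Query the curated data through Athena once the crawler has catalogued it.
athena.start_query_execution(
    QueryString="SELECT order_date, COUNT(*) AS orders FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
)
```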
This is easy to implement and maintain, and it can work on structured as well as semi-structured data.