
How to use Glue crawler to add tables automatically

This document covers how to use an AWS Glue crawler to crawl raw data in S3, automatically add tables to a Glue database, and run queries on them from Dremio or Athena.

Setup Diagram

Steps to follow

  • Create an S3 bucket and upload the raw data, e.g. CSV or JSON files.

  • Go to the AWS Glue console and create a Glue database (a scripted alternative for these first two steps is sketched after this list)

  • Go to the Tables page and select Add tables using crawler in the top-right corner
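If you prefer to script the bucket and database setup, a minimal boto3 sketch might look like this. All names here (bucket, file, database, region) are placeholder assumptions, not values from this walkthrough.

```python
import boto3

# Placeholder names -- replace with your own.
BUCKET = "my-raw-data-bucket"
REGION = "us-east-1"
DATABASE = "my_glue_db"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket (outside us-east-1 you must also pass a
# CreateBucketConfiguration with a LocationConstraint).
s3.create_bucket(Bucket=BUCKET)

# Upload a raw CSV file under a prefix the crawler will scan.
s3.upload_file("sales.csv", BUCKET, "raw/sales.csv")

# Create the Glue database the crawler will add tables to.
glue = boto3.client("glue", region_name=REGION)
glue.create_database(DatabaseInput={"Name": DATABASE})
```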

Add Tables using Crawler

This should take you to the AWS Glue crawler setup page

Follow the steps below to fill in the details:

  • Name - Enter the Crawler name
  • Add data source

    • Data source - Select S3
    • Location of S3 data - Select In this account (if that’s the case)
    • S3 path - Browse to the S3 bucket that contains the data, and don't forget to add a forward slash at the end
    • Subsequent crawler runs - Select Crawl all sub-folders
  • Click Add an S3 data source

  • Click Next → Configure security settings

  • Click Create new IAM role and give the role a name. This creates the new IAM role the Glue crawler needs to read the data in the S3 bucket

  • Next, Set output and scheduling

    • Select the Target Database - you can choose default or create a new one
    • Crawler schedule - On Demand
  • Next → Review and Create → Create Crawler

  • The crawler has now been created and you can run it (the same setup can also be done in code; see the sketch below)
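For reference, here is a minimal boto3 sketch of the same crawler configuration, assuming the IAM role from the security step already exists. The crawler, role, database, and bucket names are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="my-raw-data-crawler",
    Role="AWSGlueServiceRole-my-crawler",  # must be assumable by Glue
    DatabaseName="my_glue_db",             # the target database
    Targets={
        # Trailing slash on the path, as noted above.
        "S3Targets": [{"Path": "s3://my-raw-data-bucket/raw/"}]
    },
    # No Schedule argument means the crawler runs on demand,
    # matching the console choice above.
)
```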

Run the crawler

It will take a few minutes to crawl the data in the S3 bucket; once it is done, you should see the state as Ready
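If you run the crawler from code instead, you can poll for the same Ready state. A small sketch, reusing the placeholder crawler name from the earlier sketch:

```python
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.start_crawler(Name="my-raw-data-crawler")

# Poll until the crawler returns to the READY state.
while True:
    state = glue.get_crawler(Name="my-raw-data-crawler")["Crawler"]["State"]
    print("Crawler state:", state)
    if state == "READY":
        break
    time.sleep(30)
```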

Now you should see a table added to the Glue database
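You can verify this from code as well; a quick sketch using the placeholder database name:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# List the tables the crawler added to the database.
for table in glue.get_tables(DatabaseName="my_glue_db")["TableList"]:
    print(table["Name"])
```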

  1. Go to Dremio → Add the Glue catalog as a source
  2. Name - Enter a name for the Glue catalog source
  3. Region - Select the AWS region
  4. Authentication - AWS access key

Click Save, and you can run queries on the Glue database from Dremio or Athena!
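On the Athena side, queries can also be run with boto3. A sketch under assumptions: the table name `sales` is hypothetical (the crawler names tables after your folders and files), and Athena needs an S3 output location for query results.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; database, table, and output location are placeholders.
query = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10",
    QueryExecutionContext={"Database": "my_glue_db"},
    ResultConfiguration={"OutputLocation": "s3://my-raw-data-bucket/athena-results/"},
)
query_id = query["QueryExecutionId"]

# Wait for the query to finish.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print the result rows (the first row holds the column headers).
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```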
