What is AWS Glue
AWS Glue is a fully managed extract, transform, load (ETL) service that plays a crucial role in modern data architectures. With AWS Glue you can build simple, cost-effective solutions to clean and process the data flowing through your various systems. At its core is a central metadata repository called the AWS Glue Data Catalog, where you can manage the metadata of all your tables.
It also offers UI capabilities such as a visual pipeline editor with drag-and-drop support for designing and developing data pipelines and ETL code. AWS Glue is serverless, which means you don't need to maintain separate compute clusters (as you would with EMR) that add to your operational overhead.
AWS Glue Architecture and Main Components
The main components of the AWS Glue architecture are:
- AWS Glue Data Catalog
- Glue Crawlers, Classifiers, and Connections
- Glue Jobs
Let's look at each architectural component.
AWS Glue Data Catalog
The Glue Data Catalog is a persistent metadata store for source and target systems. It is highly scalable and stores metadata as tables under database objects. The Data Catalog acts as a centralized repository where several systems can store and share metadata in order to transfer and move data.
For example, using the Data Catalog, we can query large files in Athena or Redshift Spectrum.
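To make the database-table-location hierarchy concrete, here is a minimal local sketch of how catalog metadata is shaped. All database, table, and bucket names are hypothetical; in practice you would query the real catalog via boto3 rather than a local dictionary.

```python
# Illustrative sketch of how the Glue Data Catalog organizes metadata:
# databases contain tables, and each table records a schema plus the
# physical location of the data (e.g. an S3 prefix). Names are made up.
# In practice you would call boto3, e.g.:
#   glue = boto3.client("glue")
#   glue.get_table(DatabaseName="sales_db", Name="orders")
catalog = {
    "sales_db": {
        "orders": {
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://my-datalake/sales/orders/",
        }
    }
}

def table_location(database, table):
    """Return the storage location recorded for a catalog table."""
    return catalog[database][table]["Location"]

print(table_location("sales_db", "orders"))
```

Engines such as Athena and Redshift Spectrum use exactly this kind of lookup: the catalog tells them where the files live and what schema to apply when reading them.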
Glue Crawlers, Classifiers and Connections
What are Glue Crawlers
Glue crawlers scan the data in the source system and store its metadata in the form of tables. A crawler recognizes the format of the data and generates schemas using classifiers.
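The core idea behind a crawler's schema inference can be simulated locally. The sketch below is not how Glue is implemented; it only mimics, on a tiny in-memory CSV sample, what a built-in CSV classifier does at scale: read a sample, take the header row as column names, and guess a rough type for each column.

```python
import csv
import io

# A tiny CSV sample standing in for a file a crawler would scan in S3.
sample = "order_id,amount,country\n1,9.99,US\n2,14.50,DE\n"

def infer_schema(text):
    """Guess column names and types from a CSV sample (toy classifier)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, first = rows[0], rows[1]

    def guess(value):
        # Crude type guessing: integer, then float, then fall back to string.
        try:
            int(value)
            return "bigint"
        except ValueError:
            pass
        try:
            float(value)
            return "double"
        except ValueError:
            return "string"

    return [{"Name": n, "Type": guess(v)} for n, v in zip(header, first)]

print(infer_schema(sample))
```

A real crawler writes the resulting schema into the Glue Data Catalog as a table, so downstream jobs and query engines can use it without re-deriving it.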
What is a Glue Classifier
A classifier is an object within a crawler that identifies the format of the data. Glue provides a set of built-in classifiers, and you can also create custom classifiers.
What is a Connection
A connection stores information about a data store, such as a JDBC URI or virtual private cloud (VPC) details, that Glue jobs can use to connect to that data source.
Glue Jobs
Each Glue job represents a specific set of instructions to extract data from data sources, transform that data, and load it into a target system (typically a data lake). Glue jobs leverage the Data Catalog, crawlers, classifiers, and connections. Glue jobs can be written in Python or Scala.
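The extract-transform-load structure of a Glue job can be sketched in plain Python. This is only a shape, not a runnable Glue script: in a real job the extract and load steps would go through `GlueContext` and DynamicFrames (noted in the comments), and all record fields here are illustrative.

```python
# Hedged sketch of the three stages a Glue job encodes.

def extract():
    # In a real Glue job:
    #   glueContext.create_dynamic_frame.from_catalog(database=..., table_name=...)
    return [{"order_id": 1, "amount": "9.99"},
            {"order_id": 2, "amount": "14.50"}]

def transform(records):
    # Typical cleanup step: cast string amounts to numeric types.
    return [{"order_id": r["order_id"], "amount": float(r["amount"])}
            for r in records]

def load(records):
    # In a real Glue job, this would write to the target, e.g.:
    #   glueContext.write_dynamic_frame.from_options(..., connection_type="s3", ...)
    return len(records)

loaded = load(transform(extract()))
print(loaded)
```

However the job is authored (visual editor, Spark script, Python shell, or notebook), it ultimately boils down to this extract/transform/load pipeline.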
How Do I Create a Glue Job
As we can see in the AWS console, there are four main ways to set up a Glue job:
- Visual Canvas
- Spark script
- Python script
- Jupyter notebook
Setting up Glue Job using Visual Editor
The visual editor is a canvas board where we can easily drag and drop the source and target systems we want to pull data from and store data into. It's a great way to get started quickly with Glue jobs, especially for less complex jobs where Python or Spark programming skills are not readily available. That said, the visual editor has limited capabilities: for example, it is dependent on the Glue Data Catalog, and you cannot connect to on-premises data sources using the visual editor.
Setting up Glue Job using Spark Editor
Selecting the Spark editor option to set up the job opens what is essentially a Python or Scala script editor. Developers can use this editor to write scripts for their extract, transform, load (ETL) process. The editor provides boilerplate code to improve developer efficiency. The Spark editor currently supports Spark 3.1, Scala 2, and Python 3.
Setting up Glue Job using Python Shell
Similar to the Spark editor, selecting the Python editor to set up a Glue job opens a code editor. The Python editor supports only Python, and developers don't get access to a Spark cluster. This is a handy way to set up less memory-intensive Glue jobs. Glue jobs set up using the Python shell are naturally cheaper to run than those set up using the Spark editor, since we don't have to pay for a Spark cluster.
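A typical Python shell task is small glue logic around other services. The sketch below computes yesterday's date-partitioned S3 prefix, the kind of value such a job would then pass to a boto3 call (for example `athena.start_query_execution`). The bucket and path layout are hypothetical.

```python
from datetime import date, timedelta

def yesterday_prefix(base, day=None):
    """Build a Hive-style partition prefix for the day before `day`."""
    d = (day or date.today()) - timedelta(days=1)
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

# Example with a fixed date so the result is deterministic.
print(yesterday_prefix("s3://my-datalake/orders", date(2023, 5, 2)))
```

Lightweight helpers like this are exactly the "small generic tasks" the Python shell option targets, since they need no Spark cluster at all.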
Setting up Glue Job using Jupyter Notebook
Glue Studio's Jupyter Notebook is an interactive notebook for developing Glue jobs. To set up a Glue job using this method, you'll need an IAM role with sufficient permissions on the data sources and S3 targets, plus the ability to assume the Glue service role.
Job configuration attributes such as the Glue version and worker node capacity can be changed using the appropriate %magic commands in the notebook.
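For instance, a notebook cell like the following configures the session before any code runs. The specific values shown are illustrative; these magics are part of Glue interactive sessions.

```
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5
%idle_timeout 30
```

Each magic takes effect for the session that the notebook starts, so they should be run before the first code cell.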
Summary of different ways of setting up Glue Jobs
| Glue Studio Option | Description | Best to use when | Advantages | Trade-offs |
| --- | --- | --- | --- | --- |
| Visual Editor | Canvas board that allows drag and drop of sources, targets, and common transformations to set up Glue jobs | Less complex jobs involving AWS services as sources and targets, with little or no transformation | Programming knowledge not required; get started quickly | Heavily dependent on the Glue Data Catalog; supports only certain AWS native services |
| Spark Editor | Set up Spark ETL jobs using Python or Scala; supports all major Glue versions | Memory-intensive Spark jobs | Spark cluster is auto-scalable; unlike EMR, there is no infrastructure to manage with a Glue Studio Spark job | Requires Spark and Python or Scala programming knowledge |
| Python Shell | Python code editor without any Spark cluster support | Small generic tasks that are part of larger ETL workflows, such as submitting SQL queries to Athena or loading data into RDS | Prebuilt libraries like scipy, numpy, and boto3 allow quick setup of small tasks; cheaper to run since no clusters are provisioned | Not all Python versions are supported; importing Python libraries not supported by the platform is comparatively tedious; not ideal for complex or intensive jobs |
| Jupyter Notebook | Allows interactive development and testing of ETL jobs using the familiar Jupyter Notebook interface | Complex ETL jobs | Rapidly build, test, and run data sourcing, transformation, and load processes; ideal for complex ETL workflows | |
Typical Use Cases for AWS Glue
1. Power near real-time analytics
AWS Glue is very effective at powering near real-time analytics for streaming data sources such as Kinesis, Amazon MSK, or Kafka. In such use cases, Glue can be used to load and process data from the streaming source into a data lake such as S3. The data lake can then be leveraged by analytics platforms such as QuickSight or Athena.
2. Set up event-driven ETL pipelines
AWS Glue's native integrations make it an ideal platform for triggering ETL pipelines off of EventBridge events, or events directly from services such as S3.
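As one possible shape for such a trigger, an EventBridge event pattern can match S3 object-created events for a specific bucket and route them to a Glue workflow. The bucket name below is hypothetical, and the bucket must have EventBridge notifications enabled for these events to be emitted.

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["my-raw-data-bucket"]
    }
  }
}
```

A rule with this pattern, pointed at a Glue workflow or job as its target, turns every new object landing in the bucket into an ETL run.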
3. AWS Glue Data Catalog as a metastore
AWS Glue Data Catalog can be configured as the Hive metastore in EMR 5.8 and later. This is especially useful in complex enterprise data architectures where you want a unified metastore (metadata repository) across multiple AWS accounts, EMR clusters, Redshift Spectrum, or other AWS services.
Benefits of using AWS Glue
Some of the key benefits of using AWS Glue are –
- Managed serverless infrastructure – AWS Glue takes care of the undifferentiated work of managing the infrastructure required to run your ETL jobs
- Data Catalog – The Glue Data Catalog and crawlers enable no- or low-code source data extraction and transformation, allowing you to focus on the business logic of your ETL processes
- Native integration with other AWS services – Native integration with AWS data and eventing services such as Kinesis, S3, Redshift, MSK, etc. makes it trivial to extract data from, or load data into, these services. This boosts developer productivity and reduces the boilerplate and plumbing code required to integrate with other AWS services
- Pricing model – Unlike most alternatives to AWS Glue, Glue follows AWS's pay-per-use pricing model. This provides a cost-effective way to get started with Glue
This article introduced you to AWS Glue and the four different pathways to set up your first Glue job. We also looked at typical use cases where AWS Glue is best suited, as well as the benefits of using AWS Glue in your data flow architecture.