25+ AWS Glue Interview Questions and Answers (for 2023)

What is AWS Glue?

AWS Glue is a fully managed data integration service for ingesting and transforming data. You can build simple and cost-effective solutions to clean and process the data flowing through your various systems using AWS Glue. You can think of AWS Glue as a modern, serverless alternative to traditional ETL tools.

Explain why and when you would use AWS Glue compared to other options to set up data pipelines

AWS Glue makes it easy to move data between data stores and as such, can be used in a variety of data integration scenarios, including:

  1. Data lake build & consolidation: Glue can extract data from multiple sources and load the data into a central data lake powered by something like Amazon S3.
  2. Data migration: For large migration and modernization initiatives, Glue can help move data from a legacy data store to a modern data lake or data warehouse.
  3. Data transformation: Glue provides a visual workflow to transform data using a comprehensive built-in transformation library or custom transformations using PySpark.
  4. Data cataloging: Glue can assist data governance initiatives since it supports automatic metadata cataloging across your data sources and targets, making it easy to discover and understand data relationships.

When compared to other options for setting up data pipelines, such as Apache NiFi or Apache Airflow, AWS Glue is typically a good choice if:

  1. You want a fully managed solution: With Glue, you don’t have to worry about setting up, patching, or maintaining any infrastructure.
  2. Your data sources are primarily in AWS: Glue integrates natively with many AWS services, such as S3, Redshift, and RDS.
  3. You are constrained by programming skills availability: Glue’s visual workflow makes it easy to create data pipelines in a no-code or low-code way.
  4. You need flexibility and scalability: Glue can scale automatically to meet demand and can handle petabyte-scale data.

What is the AWS Glue Architecture?

The main components of the AWS Glue architecture are:

  • AWS Glue Data Catalog
  • Glue Crawlers, Classifiers, and Connections
  • Glue Jobs

For an overview of each component, read this introduction to AWS Glue.

What are the primary benefits of using AWS Glue DataBrew?

AWS Glue DataBrew is a visual data preparation service that simplifies the process of data cleansing and transformation. The primary benefits of using DataBrew are:

  1. Visual interface: DataBrew provides an intuitive visual interface for configuring data preparation workflows, making the service easy to use even for users with limited technical skills.
  2. Automated data preparation: DataBrew can automatically detect patterns in your source data and suggest actions to cleanse it, significantly reducing the data preparation effort.
  3. Increased efficiency: The visual interface, pattern detection, and suggested cleansing actions together significantly reduce the time spent on data preparation.
  4. Integration with other AWS services: DataBrew integrates natively with many other AWS services, including Amazon S3, RDS, and Redshift, making it easy to source and prepare data from those data stores for analysis or use in other applications.
  5. Flexible, pay-per-use pricing model: As with most AWS services, with DataBrew you only pay for what you use, making it a cost-effective data preparation solution that can scale with your needs.

Describe the four ways to create AWS Glue jobs

Four ways to create Glue jobs are:

  1. Visual Canvas: The Visual Canvas is an intuitive, drag-and-drop interface that makes it easy to create Glue jobs without writing any code.
  2. Spark script: The Spark script option allows you to create Glue jobs using Spark code in Scala or PySpark, providing access to the full Spark ecosystem for complex data transformations (a minimal PySpark example follows this list).
  3. Python script: The Python script option lets you create AWS Glue jobs using Python code, useful in scenarios that require the most flexibility and versatility.
  4. Jupyter Notebook: By allowing you to create AWS Glue jobs from a Jupyter Notebook, Glue makes it easy to build and run interactive data transformations and explorations collaboratively and then turn them into Glue jobs.
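To make the Spark script option concrete, here is a minimal sketch of a PySpark-based Glue job script, similar to the boilerplate Glue generates for new jobs. The catalog database, table name, and S3 path are hypothetical:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve arguments and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from the Data Catalog (database and table names are hypothetical)
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# A simple transformation: drop an unwanted field, then write to S3 as Parquet
cleaned = source.drop_fields(["_corrupt_record"])
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/clean-orders/"},
    format="parquet",
)

job.commit()
```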

How does AWS Glue support the creation of no-code ETL jobs?

AWS Glue supports the creation of no-code ETL jobs through its Visual Canvas – a drag-and-drop interface for creating AWS Glue jobs without writing any code. The Visual Canvas allows users to visually define sources, targets, and the data transformations that connect them.

The Visual Canvas comes with a library of pre-built transformations, making it possible to create and deploy Glue jobs quickly and easily, even for users with limited technical skills. Additionally, the Visual Canvas integrates natively with other AWS services, such as S3, RDS, and Redshift, making it easy to move data between these purpose-built data stores (again, using the Visual Canvas).

What is the difference between AWS Glue and AWS EMR?

Some of the differences between AWS Glue and EMR are:

  • AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy for customers to prepare and load their data for analytics. AWS EMR, on the other hand, is a service that makes it easy to process large amounts of data quickly and efficiently.
  • AWS Glue and EMR are both used for data processing, but they differ in how they process data and in their typical use cases.
  • AWS Glue can easily be used to process both structured and unstructured data, while AWS EMR is typically suited for processing structured or semi-structured data.
  • AWS Glue can automatically discover and categorize the data. AWS EMR does not have that capability.
  • AWS Glue can be used to process streaming data or data in near-real-time, while AWS EMR is typically used for scheduled batch processing.
  • Usage of AWS Glue is charged per DPU hour while EMR is charged per underlying EC2 instance hour.
  • AWS Glue is easier to get started with than EMR, as Glue does not require developers to have prior knowledge of MapReduce or Hadoop.

Here is an article that dives deep into AWS Glue vs EMR.

What are some ways to orchestrate Glue jobs as part of a larger ETL flow?

Glue Workflows and AWS Step Functions are two ways to orchestrate Glue jobs as part of larger ETL flows. Glue Workflows chain jobs, crawlers, and triggers natively within Glue, while Step Functions can coordinate Glue jobs alongside other AWS services in a broader state machine.
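As a hedged sketch of the Glue Workflows approach, the boto3 calls below create a workflow and wire two hypothetical jobs together with a scheduled start trigger and a conditional trigger (all names are made up for illustration):

```python
import boto3

glue = boto3.client("glue")

# Create the workflow container (name is hypothetical)
glue.create_workflow(Name="nightly-etl")

# A scheduled trigger starts the first job every night at 02:00 UTC
glue.create_trigger(
    Name="start-extract",
    WorkflowName="nightly-etl",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "extract-job"}],
    StartOnCreation=True,
)

# A conditional trigger runs the transform job once the extract job succeeds
glue.create_trigger(
    Name="chain-transform",
    WorkflowName="nightly-etl",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "extract-job",
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "transform-job"}],
    StartOnCreation=True,
)
```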

What is a connection in AWS Glue?

A connection in AWS Glue is a construct that stores the information required to connect to a data source such as Redshift, RDS, DynamoDB, or S3.

Connections are used by Glue crawlers and jobs to access data stores, helping move data from source to target.

In addition to supporting many AWS-native data stores, Glue connections also support external data sources, as long as those sources can be connected to using a JDBC driver.
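As an illustration, here is a hedged boto3 sketch of creating a JDBC connection to an external database. The connection name, URL, network settings, and credentials are placeholders (and, per the next question, credentials are better kept in a secrets store):

```python
import boto3

glue = boto3.client("glue")

# Create a JDBC connection to an external database (all values are placeholders)
glue.create_connection(
    ConnectionInput={
        "Name": "external-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.com:5432/analytics",
            "USERNAME": "glue_user",
            "PASSWORD": "use-a-secrets-store-instead",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
        },
    }
)
```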

What is the best practice for managing the credentials required by a Glue connection?

The best practice is for the credentials to be stored and accessed securely by leveraging AWS Systems Manager (SSM) Parameter Store, AWS Secrets Manager, or AWS Key Management Service (KMS).
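For example, a Glue job can fetch credentials at runtime from AWS Secrets Manager instead of hardcoding them. A minimal sketch, assuming a hypothetical secret name holding a JSON username/password pair:

```python
import json
import boto3

# Fetch database credentials at runtime (secret name is hypothetical)
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="prod/glue/postgres-credentials")
credentials = json.loads(response["SecretString"])

username = credentials["username"]
password = credentials["password"]
# Pass these to your JDBC connection options rather than storing them in the job script
```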

Can Glue crawlers be configured to run on a regular schedule? If yes, how?

Yes, Glue crawlers can be configured to run on a regular schedule. Glue supports a cron-based schedule expression that can be specified during the creation of the crawler. For ETL workflows orchestrated by Step Functions, event-based triggers in Step Functions can be used to run crawlers on a specific schedule.
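For instance, a crawler can be given a cron-based schedule at creation time via boto3; the crawler name, IAM role, database, and S3 path below are hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that runs every day at 03:00 UTC (all names are hypothetical)
glue.create_crawler(
    Name="daily-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw-orders/"}]},
    Schedule="cron(0 3 * * ? *)",
)
```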

What streaming sources does AWS Glue support?

AWS Glue supports Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

See Using a streaming data source for details on how to configure properties for each of these streaming sources in AWS Glue.
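As a hedged sketch, a Glue streaming job can read from a Kinesis stream roughly as shown below. The stream ARN and S3 paths are placeholders, and the exact options can vary by Glue version:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read from a Kinesis stream (stream ARN is a placeholder)
kinesis_frame = glueContext.create_data_frame.from_options(
    connection_type="kinesis",
    connection_options={
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
        "startingPosition": "TRIM_HORIZON",
        "inferSchema": "true",
        "classification": "json",
    },
)

def process_batch(data_frame, batch_id):
    # Transform and write each micro-batch (logic is illustrative)
    data_frame.write.mode("append").parquet("s3://my-bucket/clickstream-parquet/")

# Process the stream in micro-batches with checkpointing
glueContext.forEachBatch(
    frame=kinesis_frame,
    batch_function=process_batch,
    options={
        "windowSize": "60 seconds",
        "checkpointLocation": "s3://my-bucket/checkpoints/clickstream/",
    },
)
```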

Related Article: Top Kafka Interview Questions

Is AWS Glue suitable for converting log files into structured data?

Yes, AWS Glue is suitable for converting log files into structured data. Using the AWS Glue Visual Canvas or a custom Glue job, we can define data transformations that convert raw log lines into structured records.

Glue makes it possible to aggregate logs from various sources into a common data lake, making these logs easy to access and maintain.
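For illustration, here is a minimal PySpark sketch of the kind of transformation a custom Glue job might apply to Apache-style access logs; the bucket paths and the regex pattern are assumptions for this example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.appName("log-parsing").getOrCreate()

# Read raw log lines as text (path is hypothetical)
logs = spark.read.text("s3://my-bucket/raw-logs/")

# Regex for common-log-format lines (an assumption about the log layout)
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'

# Extract structured columns from each raw line
structured = logs.select(
    regexp_extract("value", pattern, 1).alias("client_ip"),
    regexp_extract("value", pattern, 2).alias("timestamp"),
    regexp_extract("value", pattern, 3).alias("method"),
    regexp_extract("value", pattern, 4).alias("path"),
    regexp_extract("value", pattern, 5).alias("status"),
    regexp_extract("value", pattern, 6).alias("bytes"),
)

# Persist the structured output as Parquet (path is hypothetical)
structured.write.mode("overwrite").parquet("s3://my-bucket/structured-logs/")
```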

What is an interactive session in AWS Glue and what are its benefits?

Interactive sessions in AWS Glue are essentially on-demand, serverless Spark runtime environments that allow you to rapidly build and test data preparation and analytics applications. Interactive sessions can be used via the visual interface, the AWS CLI, or the API.

Using interactive sessions, you can author and test your scripts as Jupyter notebooks. Glue supports a comprehensive set of Jupyter magics, allowing developers to build rich data preparation and transformation scripts.
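A hedged sketch of what a Glue interactive session notebook might look like: the magics configure the session before any Spark code runs, and the catalog database and table names are hypothetical:

```python
# Session-configuration magics, run in the first cell before the session starts
%idle_timeout 30        # stop the session after 30 idle minutes to control cost
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Explore a catalog table interactively (names are hypothetical)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
orders.printSchema()
orders.toDF().show(5)
```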

What are the two types of workflow views in AWS Glue?

The two types of workflow views are the static view and the dynamic view. The static view can be considered the design view of the workflow, whereas the dynamic view is the runtime view, which includes logs, status, and error details for the latest run of the workflow.

The static view is used mainly while defining the workflow, whereas the dynamic view is used when operating it.

What are start triggers in AWS Glue?

Start triggers are special Data Catalog objects that can be used to start Glue jobs and crawlers. Start triggers in AWS Glue can be one of three types: Scheduled, Conditional, or On-demand.
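As a small hedged example, the Type field in a boto3 create_trigger call selects among these types; below, a scheduled trigger starts a hypothetical job every hour (a fuller chaining example appears in the orchestration question above):

```python
import boto3

glue = boto3.client("glue")

# A scheduled start trigger; Type could also be "CONDITIONAL" or "ON_DEMAND"
glue.create_trigger(
    Name="hourly-start",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "hourly-report-job"}],  # job name is hypothetical
    StartOnCreation=True,
)
```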

How can you start an AWS Glue workflow run using AWS CLI?

A Glue workflow can be started using the start-workflow-run command of the AWS CLI, passing the workflow name as a parameter. The command accepts several optional parameters, which are listed in the AWS CLI documentation.
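For example, `aws glue start-workflow-run --name nightly-etl` starts a workflow named nightly-etl (a hypothetical name). The equivalent call with boto3, for completeness:

```python
import boto3

glue = boto3.client("glue")

# Start the workflow and capture the run ID (workflow name is hypothetical)
response = glue.start_workflow_run(Name="nightly-etl")
print(response["RunId"])
```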

How can you pull data from an external API in your AWS Glue job?

AWS Glue does not have native support for connecting to external APIs. To allow AWS Glue to access data from an external API, we can build a custom connector in Amazon AppFlow that connects to the external API, retrieves the necessary data, and makes it available to AWS Glue. This solution is illustrated in the architecture diagram below:

Figure: AWS Glue leveraging AppFlow to pull data from an external API

Amazon AppFlow is a good fit for this use case since it is designed to automate data flows at scale between AWS services and external systems such as SaaS applications and APIs, without having to provision or manage resources.

Our company’s spend on AWS Glue is increasing rapidly. How can we optimize our AWS Glue spend?

Cost optimization is a critical aspect of running workloads in the cloud and leveraging cloud services, including AWS Glue. Ongoing cost optimization ensures we are making the most of our cloud investments while reducing waste. When optimizing AWS Glue spend, the following factors should be considered:

  1. Use Glue Development Endpoints sparingly, as these can get costly quickly.
  2. Choose the right DPU allocation based on job complexity and requirements.
  3. Optimize job concurrency.
  4. Use Glue job bookmarks to track processed data, allowing Glue to skip previously processed records during incremental runs and reducing the cost of recurring jobs (see the sketch after this list).
  5. Consider additional factors such as leveraging the Glue Data Catalog effectively and minimizing costly transformations.
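To illustrate item 4, here is a hedged boto3 sketch of creating a job with bookmarks enabled via its default arguments; the job name, IAM role, and script location are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create a job with bookmarks enabled so incremental runs skip processed data
# (job name, role, and script location are placeholders)
glue.create_job(
    Name="incremental-orders-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/orders_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,  # right-size capacity per item 2 above
)
```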

Our article on the best practices for AWS Glue cost optimization covers this topic in more detail.
