Top AWS Glue Interview Questions
What is AWS Glue?
AWS Glue is a fully managed, serverless data integration and transformation service. You can use AWS Glue to build simple, cost-effective solutions that clean and process the data flowing through your various systems. You can think of AWS Glue as a modern alternative to traditional ETL tools.
Describe the AWS Glue Architecture
The main components of the AWS Glue architecture are:
- AWS Glue Data Catalog
- Glue Crawlers, Classifiers, and Connections
- Glue Jobs
For an overview of each component, read this introduction to AWS Glue.
What are the primary benefits of using AWS Glue DataBrew?
AWS Glue DataBrew is a visual data preparation service that simplifies the process of data cleansing and transformation. The primary benefits of using DataBrew are:
- Visual interface: DataBrew provides an intuitive visual interface for configuring data preparation workflows, making the service easy to use even for users with limited technical skills.
- Automated data preparation: DataBrew can automatically detect patterns in your source data and suggest actions to cleanse it, significantly reducing data preparation effort.
- Increased efficiency: The visual interface, pattern detection, and suggested cleansing actions together significantly reduce the time spent on data preparation, improving efficiency.
- Integration with other AWS services: DataBrew integrates natively with many other AWS services, including Amazon S3, Amazon RDS, and Amazon Redshift, making it easy to source and prepare data from those data stores for analysis or for use in other applications.
- Flexible, pay-per-use pricing model: As with most AWS services, with DataBrew you only pay for what you use, making it a cost-effective data preparation solution that scales with your needs.
Describe the four ways to create AWS Glue jobs
Four ways to create Glue jobs are:
- Visual Canvas: The Visual Canvas is an intuitive, drag-and-drop interface that lets you create Glue jobs without writing any code.
- Spark script: The Spark script option allows you to create Glue jobs using Spark code in Scala or PySpark, providing access to the full Spark ecosystem for complex data transformations (see the sketch after this list).
- Python script: Like Spark, AWS Glue supports Python scripts. This is particularly useful for scenarios that necessitate a high degree of versatility and custom logic.
- Jupyter Notebook: Jupyter notebook provides an interactive environment to create and run data transformations and then convert them into Glue jobs. This is best suited for collaborative work and for scenarios where exploration and iterative development of data transformations is required.
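For context, a Glue Spark/Python script is essentially a PySpark program that uses the Glue libraries. Below is a minimal sketch of such a script, assuming a hypothetical catalog database `sales_db`, table `orders`, and S3 output path; your sources, mappings, and targets would differ.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job name that the Glue job runner passes in
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler has already cataloged (hypothetical names)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# Rename and retype columns with a built-in transform
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "string", "amount", "double"),
    ],
)

# Write the result to S3 as Parquet (hypothetical bucket)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```

The same script could also be authored interactively in a Jupyter notebook and then saved as a Glue job.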
How does AWS Glue support the creation of no-code ETL jobs?
AWS Glue supports the creation of no-code ETL jobs through its Visual Canvas – a drag-and-drop interface to create AWS Glue jobs without writing any code. Visual Canvas allows users to visually define sources, targets, and data transformations by connecting sources to targets.
Visual Canvas comes with a library of pre-built transformations, making it possible to create and deploy Glue jobs quickly and easily, even for users with limited technical skills. Additionally, Visual Canvas integrates natively with other AWS services, such as S3, RDS, and Redshift, making it easy to move data between these purpose-built data stores directly from the canvas.
Related Reading: Efficient AWS Glue ETL
What is a connection in AWS Glue?
A connection in AWS Glue is a Data Catalog object that stores the information required to connect to a data source such as Redshift, RDS, DynamoDB, or S3.
Connections are used by Glue crawlers and jobs to access data stores and move data from source to target.
In addition to supporting many AWS-native data stores, Glue connections also support external data sources, as long as those sources can be reached through a JDBC driver.
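As a rough illustration, a JDBC connection can also be created programmatically with boto3; the endpoint, credentials, and network details below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create a JDBC connection; all values below are hypothetical placeholders
glue.create_connection(
    ConnectionInput={
        "Name": "orders-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://db.example.internal:5432/orders",
            "USERNAME": "glue_user",
            "PASSWORD": "prefer-secrets-manager",  # see the next question
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```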
What is the best practice for managing the credentials a Glue connection requires?
The best practice is for the credentials to be stored and accessed securely by leveraging AWS Systems Manager Parameter Store (SSM), AWS Secrets Manager, or AWS Key Management Service (KMS).
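For example, a Glue script can fetch database credentials from AWS Secrets Manager at run time instead of embedding them in the script or connection; the secret name below is a placeholder, and the job's IAM role must be allowed to read it.

```python
import json

import boto3

# Fetch credentials at run time; "prod/orders-db" is a hypothetical secret name
secrets = boto3.client("secretsmanager")
secret = secrets.get_secret_value(SecretId="prod/orders-db")
creds = json.loads(secret["SecretString"])

jdbc_user = creds["username"]
jdbc_password = creds["password"]
# Pass these values to your JDBC read/write options rather than hard-coding them
```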
What streaming sources does AWS Glue support?
AWS Glue supports Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
See Using a streaming data source to learn how to configure properties for each of these streaming sources in AWS Glue.
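As a rough sketch (all names are placeholders), a streaming Glue job can read a Kinesis-backed catalog table as a streaming Spark DataFrame:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read from a catalog table that points at a Kinesis data stream (hypothetical names)
stream_df = glueContext.create_data_frame.from_catalog(
    database="streaming_db",
    table_name="clickstream_events",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"},
)

# stream_df is a streaming DataFrame; process it in micro-batches
# (for example with glueContext.forEachBatch) and write the results to your sink.
```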
Related Article: Top Kafka Interview Questions
What is an interactive session in AWS Glue, and what are its benefits?
Interactive sessions in AWS Glue are essentially on-demand, serverless Spark runtime environments that let you rapidly build and test data preparation and analytics applications. Interactive sessions can be used via the visual interface, the AWS CLI, or the API.
You can author and test your scripts in interactive sessions as Jupyter notebooks. Glue supports a comprehensive set of Jupyter magics that let developers configure sessions and build rich data preparation and transformation scripts; a few common ones are shown below.
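For illustration, the first cell of a Glue interactive-session notebook typically sets a few session magics before any code runs; the values here are only examples.

```python
# Typical first cell of a Glue interactive-session notebook (illustrative values)
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%idle_timeout 30
```

Later cells then contain ordinary PySpark or Glue code, and Spark capacity is provisioned on demand when the first code cell runs.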
What are the two types of workflow views in AWS Glue?
The two types of workflow views are static views and dynamic views. The static view can be considered as the design view of the workflow, whereas the dynamic view is the runtime view of the workflow that includes logs, status, and error details for the latest run of the workflow.
A static view is used mainly when defining the workflow, whereas a dynamic view is used when operating the workflow.
What are start triggers in AWS Glue?
Start triggers are Data Catalog objects that can be used to start Glue jobs. In AWS Glue, a start trigger can be one of three types: scheduled, conditional, or on-demand. A scheduled example is sketched below.
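For instance, a scheduled trigger that starts a hypothetical job every night at 02:00 UTC could be created with boto3 roughly like this:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run "nightly-orders-etl" (hypothetical job name) daily at 02:00 UTC
glue.create_trigger(
    Name="nightly-orders-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "nightly-orders-etl"}],
    StartOnCreation=True,
)
```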
How can you start running an AWS Glue workflow using AWS CLI?
A Glue workflow can be started using the start-workflow-run command of the AWS CLI, passing the workflow name as a parameter. The command accepts various optional parameters, which are listed in the AWS CLI documentation.
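On the command line this looks like `aws glue start-workflow-run --name <workflow-name>`; the equivalent call through boto3 is sketched below with a placeholder workflow name.

```python
import boto3

glue = boto3.client("glue")

# Start a workflow run; "daily-sales-pipeline" is a hypothetical workflow name
response = glue.start_workflow_run(Name="daily-sales-pipeline")
print(response["RunId"])  # run identifier you can use to poll the workflow's status
```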
What role does Apache Spark play in AWS Glue?
AWS Glue and Apache Spark are closely intertwined. At its core, AWS Glue uses Apache Spark as its underlying distributed data processing engine: Glue scripts are compiled into code that runs on Apache Spark, which gives Glue robust, large-scale data processing capabilities. AWS Glue, in turn, hides some of Spark's complexity, making it accessible to a broader audience. For example, SparkContext, a key component in Spark, is initialized implicitly in AWS Glue, so you don't have to worry about initializing it yourself.
AWS Glue Interview Questions for Experienced
Explain why and when you would use AWS Glue compared to other options to set up data pipelines
AWS Glue makes it easy to move data between data stores and as such, can be used in a variety of data integration scenarios, including:
- Data lake build & consolidation: Glue can extract data from multiple sources and load the data into a central data lake powered by something like Amazon S3. (Related Reading: Building Data Lakes on AWS using S3 and Glue)
- Data migration: For large migration and modernization initiatives, Glue can help move data from a legacy data store to a modern data lake or data warehouse.
- Data transformation: Glue provides a visual workflow to transform data using a comprehensive built-in transformation library, or custom transformations written in PySpark (see the sketch after this list).
- Data cataloging: Glue can assist data governance initiatives since it supports automatic metadata cataloging across your data sources and targets, making it easy to discover and understand data relationships.
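To illustrate the custom-transformation point above, here is a small sketch that applies a per-record Python function to a DynamicFrame with the built-in Map transform; the database, table, and field names are hypothetical.

```python
from awsglue.context import GlueContext
from awsglue.transforms import Map
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read a cataloged table (hypothetical names)
raw_users = glueContext.create_dynamic_frame.from_catalog(
    database="crm_db", table_name="users"
)

def add_full_name(record):
    # Each record behaves like a dict of the row's fields (hypothetical fields)
    record["full_name"] = record["first_name"] + " " + record["last_name"]
    return record

# Apply the function to every record in the DynamicFrame
clean_users = Map.apply(frame=raw_users, f=add_full_name)
```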
When compared to other options for setting up data pipelines, such as Apache NiFi or Apache Airflow, AWS Glue is typically a good choice if:
- You want a fully managed solution: With Glue, you don’t have to worry about setting up, patching, or maintaining any infrastructure.
- Your data sources are primarily in AWS: Glue integrates natively with many AWS services, such as S3, Redshift, and RDS.
- You are constrained by programming skills availability: Glue’s visual workflow makes it easy to create data pipelines in a no-code or low-code way.
- You need flexibility and scalability: Glue can scale automatically to meet demand and can handle petabyte-scale data.
Related Reading: AWS Glue vs Lambda: Choosing the Right Tool for Your Data Pipeline
Can you highlight the role of AWS Glue in big data environments?
AWS Glue plays a pivotal role in big data environments because it can handle, process, and transform large data sets in distributed, parallel environments. Engineered for large-scale data processing, it scales horizontally and can process petabytes of data efficiently and quickly. Its serverless architecture and integration with other AWS services make it especially well suited to big data workloads.
What is the difference between AWS Glue and AWS EMR?
Some of the differences between AWS Glue and EMR are:
- AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it easy for customers to prepare and load their data for analytics. AWS EMR, on the other hand, is a service that makes it easy to process large amounts of data quickly and efficiently.
- AWS Glue and EMR are both used for data processing, but they differ in how they process data and in their typical use cases.
- AWS Glue can be easily used to process both structured as well as unstructured data while AWS EMR is typically suited for processing structured or semi-structured data.
- AWS Glue can automatically discover and categorize the data. AWS EMR does not have that capability.
- AWS Glue can be used to process streaming data or data in near-real-time, while AWS EMR is typically used for scheduled batch processing.
- Usage of AWS Glue is charged per DPU hour while EMR is charged per underlying EC2 instance hour.
- AWS Glue is easier to get started than EMR as Glue does not require developers to have prior knowledge of MapReduce or Hadoop.
Here is an article that dives deep into AWS Glue vs EMR
What are some ways to orchestrate Glue jobs as part of a larger ETL flow?
Glue Workflows and AWS Step Functions are two ways to orchestrate Glue jobs as part of larger ETL flows. A workflow-based example is sketched below.
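As a sketch of the Glue Workflows approach, a conditional trigger can chain a second job to run only after the first succeeds; the workflow and job names below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Within an existing workflow, run "load-warehouse" only after "clean-raw-data" succeeds
glue.create_trigger(
    Name="run-load-after-clean",
    WorkflowName="daily-sales-pipeline",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "clean-raw-data",
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "load-warehouse"}],
)
```

With Step Functions, the same chaining is expressed as a state machine whose states start the Glue jobs and wait for them to complete.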
Can Glue crawlers be configured to run on a regular schedule? If yes, how?
Yes, Glue crawlers can be configured to run on a regular schedule. A cron-based schedule expression can be specified when the crawler is created (or updated later). For ETL workflows orchestrated by Step Functions, event-based triggers in Step Functions can also be used to run crawlers on a specific schedule.
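A minimal boto3 sketch, assuming a hypothetical S3 path, IAM role, and target database:

```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix every day at 03:00 UTC (names, role, and path are hypothetical)
glue.create_crawler(
    Name="daily-raw-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_logs_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/logs/"}]},
    Schedule="cron(0 3 * * ? *)",
)
```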
Is AWS Glue suitable for converting log files into structured data?
Yes, AWS Glue is suitable for converting log files into structured data. Using the AWS Glue Visual Canvas or by defining a custom Glue job, we can define custom data transformations to structure log file data.
Glue makes it possible to aggregate logs from various sources into a common data lake, which makes these logs easy to access and maintain.
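For example, a custom grok classifier can teach a crawler how to parse Apache-style access logs into columns before a Glue job converts them to a columnar format; the classifier name and classification label below are placeholders.

```python
import boto3

glue = boto3.client("glue")

# Custom grok classifier so a crawler can derive a schema from raw access logs
glue.create_classifier(
    GrokClassifier={
        "Name": "apache-access-logs",            # hypothetical classifier name
        "Classification": "apache_access_log",   # label applied to matched tables
        "GrokPattern": "%{COMBINEDAPACHELOG}",   # standard grok pattern for access logs
    }
)
```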
How can you pull data from an external API in your AWS Glue job?
AWS Glue does not have native support for connecting to external APIs. To allow AWS Glue to access data from an external API, we can build a custom connector in Amazon AppFlow that connects to the external API, retrieves the necessary data, and makes it available to AWS Glue.
Amazon AppFlow is a good fit for this use case since it is designed to automate data flows at scale between AWS services and external systems such as SaaS applications and APIs, without having to provision or manage resources.
Our company’s spend on AWS Glue is increasing rapidly. How can we optimize our AWS Glue spend?
Cost optimization is a critical aspect of running workloads in the cloud and leveraging cloud services, including AWS Glue. Ongoing cost optimization ensures we are making the most of our cloud investments while reducing waste. When optimizing AWS Glue spend, the following factors should be considered:
- Use Glue Development Endpoints sparingly as these can get costly quickly.
- Choose the right DPU allocation based on job complexity and requirements.
- Optimize job concurrency
- Use Glue job bookmarks to track processed data, allowing Glue to skip previously processed records during incremental runs and reducing the cost of recurring jobs (see the sketch after this list).
- Consider additional factors such as leveraging the Glue Data Catalog and minimizing costly transformations.
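As a rough sketch of the sizing and bookmark points, a job definition created with boto3 might pin a small worker fleet and enable bookmarks; the names, role, script path, and sizing below are illustrative only.

```python
import boto3

glue = boto3.client("glue")

# Right-size capacity and enable job bookmarks (all values are illustrative)
glue.create_job(
    Name="incremental-orders-etl",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    WorkerType="G.1X",      # smallest standard Spark worker; increase only if profiling shows the need
    NumberOfWorkers=2,      # start small and scale after measuring the job
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/scripts/incremental_orders_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",  # skip already-processed data on reruns
    },
)
```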
Our article on the best practices for AWS Glue Cost Optimization covers this topic in more detail.
What is the difference between Glue Data Catalog and Collibra Data Catalog?
AWS Glue Data Catalog is a centralized metadata repository primarily focused on seamless integration with AWS services, while Collibra Data Catalog emphasizes comprehensive data governance, collaboration, and data quality management.
AWS Glue Data Catalog suits organizations heavily invested in the AWS ecosystem, whereas Collibra Data Catalog is ideal for those prioritizing advanced governance features and flexibility in connecting with various data sources. Our article AWS Glue Data Catalog versus Collibra Data Catalog covers this topic in-depth.
How does AWS Glue Schema Registry work?
AWS Glue Schema Registry is a serverless feature that makes it easy to discover, control, and evolve data stream schemas. It allows you to validate and control the evolution of streaming data using registered Apache Avro schemas, thus ensuring data produced and consumed by different applications is compatible and can be parsed reliably.
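A rough sketch of registering an Avro schema with boto3; the registry name, schema name, and schema definition are placeholders.

```python
import json

import boto3

glue = boto3.client("glue")

# Hypothetical Avro schema for a clickstream event
avro_schema = {
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},
    ],
}

glue.create_schema(
    RegistryId={"RegistryName": "streaming-schemas"},  # hypothetical registry, must already exist
    SchemaName="click-event",
    DataFormat="AVRO",
    Compatibility="BACKWARD",  # newer versions must remain readable by existing consumers
    SchemaDefinition=json.dumps(avro_schema),
)
```

New versions are then registered under the same schema name, and the registry rejects changes that violate the chosen compatibility mode.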
How does AWS Glue integrate with Amazon SageMaker?
AWS Glue can prepare and load data into data stores for analytics. One such destination can be Amazon SageMaker, where we can train machine learning models on the prepared data. AWS Glue can pull data from various sources, clean and transform it, and Amazon SageMaker can then use this data for machine learning purposes.
AWS Glue Scenario-Based Interview Questions
Scenario: You are working on a project where you need to clean and prepare large amounts of raw data for analysis. The data is stored in various formats and in different AWS services like Amazon S3, Amazon RDS, and Amazon Redshift. How would you use AWS Glue in this scenario to automate the process of data preparation?
Answer: AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analysis. I would use AWS Glue crawlers to discover the data and store the associated metadata (e.g., table definitions and schemas) in the AWS Glue Data Catalog. Once cataloged, the data is immediately searchable, queryable, and available for ETL. AWS Glue generates Python or Scala code for the transformations, which I can further customize if needed.
Scenario: Your company has a large amount of data stored in a non-relational database on AWS, and you need to move this data to a relational database for a specific analysis. The data needs to be transformed during this process. How would you use AWS Glue for this data migration and transformation?
Answer: AWS Glue can connect to on-premises and cloud-based data sources, including non-relational databases. I would use AWS Glue to extract the data from the non-relational database, transform the data to match the schema of the relational database, and then load the transformed data into the relational database. The transformation could include actions like converting data formats, mapping one data set to another, and cleaning data.
Scenario: You are tasked with setting up a data catalog for your organization. The data is stored in various AWS services and in different formats. How would you use AWS Glue to create a centralized metadata repository?
Answer: In this scenario, I would use AWS Glue’s data crawlers to automatically discover and catalog metadata from various data sources in AWS. The cataloged metadata, stored in the AWS Glue Data Catalog, includes data format, data type, and other characteristics. This makes the data easily searchable and queryable across the organization.
The Data Catalog integrates with other AWS services like Amazon Athena and Amazon Redshift Spectrum, allowing direct querying of the data without moving it. Additionally, it stores metadata related to ETL jobs, aiding in automating data preparation for analysis. This approach creates a unified view of all data, irrespective of its location or format.