The need to store and access large amounts of data is growing exponentially. According to a study by the International Data Corporation (IDC), global data creation and storage is expected to reach 175 zettabytes by 2025, a staggering increase over 2018 levels. Businesses are responding to this need by adopting data lakes, which allow organizations to store, manage, and analyze vast amounts of data in a cost-efficient manner.
That said, the data in a data lake is only as good as the value it generates through downstream applications. It is therefore necessary to understand the access patterns that allow applications to retrieve the data they need from the lake.
Key considerations when designing access patterns for data lakes include access modality (synchronous or asynchronous), access method (API, event streaming, data transfer), security, and performance.
With these considerations in mind, below are the top data lake access patterns.
1. Interactive Queries
Querying is probably one of the most obvious patterns to access data from a data lake. Queries allow users to quickly retrieve the exact information they need from a large dataset.
In this pattern, the user submits a query to the data lake using a query language such as SQL, HiveQL, or Pig Latin. The query is sent to a distributed query processor, which evaluates it and returns the results, usually within seconds or minutes. If the data lake is implemented on AWS using Amazon S3, Amazon Athena can be used for interactive queries.
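As a rough sketch of what this looks like with Athena, the snippet below submits a query with boto3, waits for it to finish, and prints the rows. The database, table, and result-bucket names are placeholders, not values from any real environment:

```python
import time
import boto3

athena = boto3.client("athena")

# Submit the query; database, table, and output location are placeholders.
response = athena.start_query_execution(
    QueryString="SELECT customer_id, total FROM orders WHERE order_date = DATE '2023-01-01'",
    QueryExecutionContext={"Database": "sales_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes (Athena executes queries asynchronously).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Fetch the results if the query succeeded; the first row holds the column headers.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```

In practice, a production consumer would paginate the results and push heavier workloads to a scheduled or batch pattern instead of polling interactively.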
Advantages of this pattern:
- Allows for interactive access to data in a familiar language such as SQL
- Provides direct access to the required data
- Allows for use cases such as ad-hoc analysis, machine learning model testing, etc.
Disadvantages:
- Getting to the right data may require complex query operations that may not be suitable for all users
- May result in performance issues if queries are too large or overly complicated.
- Requires advanced data security and access-control implementations, including row- and cell-level controls, to ensure queries do not compromise data privacy by leaking PII or other sensitive data
2. Change Data Capture (CDC) Subscriptions
Change Data Capture (CDC) subscriptions allow applications to subscribe to a specific dataset in the data lake and get notified when its records are updated. This pattern is useful for applications or systems that need to trigger processing based on updates to specific fields or need their local data stores kept in sync with the latest data from the lake.
On AWS, this pattern can be implemented using Amazon Kinesis Data Streams, which delivers change records for specific datasets in S3 to subscribed applications, or Amazon EventBridge, which allows applications to subscribe to events generated by S3.
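As an illustration, a Lambda consumer attached to such a change stream might look like the sketch below. The payload layout (a JSON document with `table`, `keys`, and `changed_fields`) is an assumed convention for this example, not a standard CDC format:

```python
import base64
import json


def handler(event, context):
    """Sketch of a Lambda consumer attached to a Kinesis change stream."""
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        table = payload.get("table")            # assumed field: source dataset/table name
        keys = payload.get("keys", {})           # assumed field: primary key of the changed record
        changes = payload.get("changed_fields", {})  # assumed field: updated columns and values

        # Apply the change to the application's local store or trigger downstream
        # processing. Failed records should be retried or routed to a dead-letter queue.
        print(f"Change on {table} {keys}: {changes}")
```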
Advantages:
- Downstream applications receive near real-time record level updates
- Can be designed to scale automatically as the datasets grow using serverless technologies
Disadvantages:
- CDC updates are triggered only once for each data update. Consuming applications must be able to keep up with the data velocity and implement retry logic and dead-letter queues (DLQs) to handle processing errors
- Can be complex to implement and scale, especially since only the updates are streamed to the consumer applications
3. Streaming
Unlike CDC subscriptions, which require applications to subscribe to specific datasets, the streaming pattern allows applications to continuously consume domain events from the lake. Downstream applications subscribe to the events they ‘care’ about and process them. The streaming access pattern is ideal for use cases such as fraud detection within financial account opening, card processing and payments, etc.
Advantages:
- Applications can subscribe to events they are interested in
- Allows applications to scale with the rate of incoming data
- Provides near real-time access to data
- New consumers can be added without impacting the existing solution and flows
Disadvantages:
- Data streams are typically unbounded, so the consumer needs to build mechanisms for back-pressure caused by spikes in events
- Consumer logic can quickly become complex and error-prone, as events may arrive out of order or the logic may require data from multiple event types
On AWS, this pattern can be implemented using services such as Amazon EventBridge, Amazon Managed Streaming for Apache Kafka (MSK), Kinesis Data Streams, and Kinesis Data Firehose, combined with Lambda functions.
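As a rough sketch of the EventBridge variant, a consumer could register interest in a single domain event type as shown below. The bus name, event source, detail type, and target ARN are all placeholder values:

```python
import json
import boto3

events = boto3.client("events")

# Subscribe to one domain event type on a (hypothetical) data lake event bus.
events.put_rule(
    Name="fraud-detection-card-payments",
    EventBusName="data-lake-events",            # placeholder bus name
    EventPattern=json.dumps({
        "source": ["lake.payments"],             # assumed event source
        "detail-type": ["CardPaymentProcessed"]  # assumed domain event type
    }),
    State="ENABLED",
)

# Route matching events to the consumer, e.g. a fraud-detection Lambda function.
# (The Lambda also needs a resource-based policy allowing EventBridge to invoke it.)
events.put_targets(
    Rule="fraud-detection-card-payments",
    EventBusName="data-lake-events",
    Targets=[{
        "Id": "fraud-detector",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fraud-detector",
    }],
)
```

New consumers simply add their own rules and targets, which is why this pattern scales without touching existing flows.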
4. Batch Processing
The batch processing pattern is used to process data from the lake on a predetermined schedule. Batch processing typically deals with validating and transforming large amounts of data. Depending on the functional nature of the job, the batch processing schedule may vary.
A data lake may have dozens or hundreds of batch processes running on varying schedules such as hourly, daily, weekly, or monthly.
Data lake batch processing jobs process data from the lake and either store it back in its original location or move it to other locations for further processing, archival, or consumption. This pattern is ideal for applications such as ETL or BI workloads, where large amounts of data need to be processed on a periodic basis. Downstream applications can then access this data from its predetermined location or through BI interfaces.
Advantages:
- Allows for large-scale processing of data in the lake
- Can be triggered on a periodic basis to ensure timely updates of downstream applications
Disadvantages:
- The jobs must be tuned and optimized regularly to ensure performance as well as cost efficiency
On AWS, this pattern can be implemented using services such as AWS Glue, Glue DataBrew, Step Functions workflows, AWS Data Pipeline, and Amazon EMR, combined with Lambda functions.
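For instance, a minimal AWS Glue (PySpark) job that reads a raw dataset from the lake, applies a simple transformation, and writes a curated copy back could look like the sketch below. The catalog database, table name, filter condition, and output path are placeholders:

```python
import sys
from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset registered in the Glue Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_lake", table_name="raw_orders"
)

# Keep only completed orders -- a stand-in for real validation/transformation logic.
completed = Filter.apply(frame=orders, f=lambda row: row["status"] == "COMPLETED")

# Write the curated output back to the lake in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=completed,
    connection_type="s3",
    connection_options={"path": "s3://example-lake/curated/orders/"},
    format="parquet",
)

job.commit()
```

A job like this would typically be triggered on a schedule by a Glue trigger, EventBridge rule, or Step Functions workflow rather than run by hand.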
AWS partners such as Snowflake, Databricks, and others offer cloud-based ETL alternatives. These partner services provide a fully managed platform for the large-scale processing of data in the lake.
5. Synchronous API Access
One of the most straightforward and “natural” patterns for developers, the synchronous API pattern enables access to data lake resources through, well, synchronous APIs. Implementing this pattern involves creating RESTful or GraphQL APIs around resources in the data lake that downstream applications can call.
Advantages:
- Enables developers to create rich experiences such as search, filtering, sorting, etc.
- Applications can access data in the lake on demand
- API requests and responses are typically small, so it is an efficient way to transfer data between applications
Disadvantages:
- High availability needs to be built into the API layer, or downstream applications may face performance issues
On AWS, this pattern can be implemented using services such as Amazon API Gateway, AWS AppSync, and Lambda functions.
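A minimal sketch of such an API is a Lambda function behind an API Gateway route like `GET /customers/{customer_id}` that returns a curated record from the lake. The bucket, prefix, and route are assumptions made for this example:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix holding curated, per-customer records in the lake.
BUCKET = "example-lake-curated"
PREFIX = "customers/"


def handler(event, context):
    """Lambda behind API Gateway handling GET /customers/{customer_id}."""
    customer_id = event["pathParameters"]["customer_id"]

    try:
        obj = s3.get_object(Bucket=BUCKET, Key=f"{PREFIX}{customer_id}.json")
    except s3.exceptions.NoSuchKey:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": obj["Body"].read().decode("utf-8"),
    }
```

Serving pre-curated objects keeps responses small and fast; pointing such an API directly at raw lake data would reintroduce the performance concerns noted above.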