Customizing the Azure Landing Zone for a Cloud Data Platform

Working with sensitive data or in a highly regulated environment requires a safe and secure cloud infrastructure for data processing. The cloud can seem like an open environment exposed to the internet, which raises security concerns. When you start your journey with Azure without experience in resource configuration, it is easy to make design and implementation mistakes that affect the security and flexibility of your new data platform. In this post, I will describe the key aspects of designing a cloud adoption framework for a data platform in Azure.

What is an Azure landing zone?

An Azure landing zone is the foundation for deploying resources to the public cloud. It contains essential elements for a robust platform. These elements include networking, identity and access management, security, governance, and compliance. By implementing a landing zone, organizations can streamline the process of configuring their infrastructure and ensure the use of best practices and guidelines.

An Azure landing zone is an environment that follows key design principles to enable migration, modernization, and development of applications. In Azure, subscriptions are used to isolate and scale application and platform resources. They are categorized as follows:

  • Application landing zones: Subscriptions specifically designed for hosting application-specific resources.
  • Platform landing zones: Subscriptions that contain shared services, such as identity, connectivity, and management resources, which are made available to application landing zones.

These design principles help organizations operate successfully in a cloud environment and scale a platform.

Deploying a Data Platform in Azure

A data platform implementation in Azure involves a high-level architectural design that selects resources for data ingestion, transformation, operation, and exploration. The first step may require a landing zone design. If you need a secure platform that follows best practices, it is crucial to start with a landing zone. It helps you organize resources within subscriptions and resource groups, define network topology, and ensure connectivity to on-premises environments via VPN, while also adhering to naming conventions and standards.

Architectural design

Tailoring a data platform architecture requires careful resource selection. Azure provides native resources for data platforms such as Azure Synapse Analytics, Azure Databricks, Azure Data Factory, and Microsoft Fabric. The available services provide a variety of ways to achieve similar goals, allowing flexibility in your architecture selection.

For example:

  • Data ingestion: Azure Data Factory or Synapse pipelines.
  • Data processing: Azure Databricks or Apache Spark in Synapse.
  • Data analysis: Power BI or Databricks dashboards.

We can use Apache Spark and Python or low-code drag-and-drop tools. Different combinations of these tools can help us create the most suitable architecture, depending on our skills, use cases and capabilities.

High-level architecture (Image by the author)

With Azure, you can also use other components, such as Snowflake, or build your own stack from open-source software on Virtual Machines (VMs) or Azure Kubernetes Service (AKS). VMs or AKS can host services for data processing, exploration, orchestration, AI, or ML.

Typical data platform structure

A typical data platform in Azure consists of several main components:

1. Tools to ingest data from sources into an Azure Storage Account: Azure provides services such as Azure Data Factory, Azure Synapse Pipelines, or Microsoft Fabric.

2. Data Warehouse, Data Lake, or Data Lakehouse: Depending on your architectural preferences, we can select different services for storing the data and its business model.

  • For Data Lake or Data Lakehouse we can use Databricks or Fabric.
  • For Data Warehouse we can select Azure Synapse, Snowflake or MS Fabric Warehouse.

3. To orchestrate the data processing in Azure, we have Azure Data Factory, Azure Synapse Pipelines, Airflow or Databricks Workflows.

4. Data transformation in Azure can be handled by various services.

  • For Apache Spark: Databricks, Azure Synapse Spark Pool, and MS Fabric Notebooks.
  • For SQL-based transformation: Spark SQL in Databricks, Azure Synapse, or MS Fabric, or T-SQL in SQL Server, MS Fabric, or a Synapse Dedicated Pool. Snowflake also provides full SQL capabilities.
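
The components listed above can be captured as a small lookup table, which is handy when comparing candidate stacks. This is only an illustrative sketch: `PLATFORM_OPTIONS` and `validate_stack` are hypothetical names, and the service lists simply mirror the ones in this section rather than an exhaustive catalog.

```python
# Hypothetical lookup table mirroring the component list in this section.
PLATFORM_OPTIONS = {
    "ingestion": {"Azure Data Factory", "Azure Synapse Pipelines", "Microsoft Fabric"},
    "storage": {"Databricks", "Microsoft Fabric", "Azure Synapse", "Snowflake"},
    "orchestration": {
        "Azure Data Factory",
        "Azure Synapse Pipelines",
        "Airflow",
        "Databricks Workflows",
    },
    "transformation": {
        "Databricks",
        "Azure Synapse",
        "Microsoft Fabric",
        "SQL Server",
        "Snowflake",
    },
}


def validate_stack(stack):
    """Return a list of problems found in a proposed stack (empty if none)."""
    problems = []
    for layer, options in PLATFORM_OPTIONS.items():
        choice = stack.get(layer)
        if choice is None:
            problems.append(f"no service chosen for {layer}")
        elif choice not in options:
            problems.append(f"{choice!r} is not a listed option for {layer}")
    return problems


# Example: a Databricks-centric stack with Data Factory for ingestion.
candidate = {
    "ingestion": "Azure Data Factory",
    "storage": "Databricks",
    "orchestration": "Databricks Workflows",
    "transformation": "Databricks",
}
print(validate_stack(candidate))
```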

Subscriptions

An important aspect of platform design is planning how to segment subscriptions and resource groups by business unit and by stage of the software development lifecycle. You can use separate subscriptions for production and non-production environments. This separation gives you a more flexible security model, lets you apply different policies to production and test environments, and helps you avoid running into subscription-level quota limits.
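
As a rough illustration of this segmentation, the sketch below plans one production and one non-production subscription per business unit, with a resource group per workload and environment. The `Subscription` class, the `plan_subscriptions` function, and the `sub-`/`rg-` name patterns are assumptions made for the example, not an Azure API.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    name: str
    production: bool
    resource_groups: list = field(default_factory=list)


def plan_subscriptions(business_unit, workloads):
    """Plan one production and one non-production subscription per business unit."""
    prod = Subscription(f"sub-{business_unit}-prod", production=True)
    nonprod = Subscription(f"sub-{business_unit}-nonprod", production=False)
    for workload in workloads:
        prod.resource_groups.append(f"rg-{workload}-prod")
        # Development and QA share the non-production subscription.
        for env in ("dev", "qa"):
            nonprod.resource_groups.append(f"rg-{workload}-{env}")
    return [prod, nonprod]


for sub in plan_subscriptions("finance", ["ingest", "lakehouse"]):
    print(sub.name, sub.resource_groups)
```

Keeping production in its own subscription means policies and role assignments can be set at the subscription scope without affecting development and QA.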

Subscription organization (image by the author)

Networks

A virtual network is similar to a traditional network that operates in your datacenter. Azure Virtual Network (VNet) provides a fundamental layer of security for your platform. Disabling public endpoints for resources significantly reduces the risk of data leakage if keys or passwords are lost. Without public endpoints, data stored in Azure Storage Accounts is accessible only from your VNet or from networks connected to it.

Connecting Azure to your on-premises network enables direct communication between Azure resources and on-premises data sources. Depending on the connection type, traffic travels either through an encrypted tunnel over the internet (a VPN) or over a private connection.

To enhance security within a virtual network, you can use Network Security Groups (NSGs) and firewalls to manage inbound and outbound traffic rules. These rules let you filter traffic based on IP addresses, ports, and protocols. Additionally, Azure enables routing of traffic between subnets, virtual and on-premises networks, and the Internet. Custom route tables let you control where traffic is routed.
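
To make the filtering behavior concrete, here is a simplified model of how NSG-style rules are evaluated: rules are checked in priority order (lower number first), the first match decides, and traffic that matches nothing is denied. The `Rule` class and `is_allowed` function are hypothetical; real NSGs also match on destination, port ranges, direction, and service tags, which this sketch leaves out.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    priority: int         # lower number is evaluated first
    source: str           # CIDR range, e.g. "10.1.0.0/16"
    port: Optional[int]   # None matches any port
    protocol: str         # "Tcp", "Udp", or "*" for any
    allow: bool


def is_allowed(rules, src_ip, port, protocol):
    """Apply the first matching rule in priority order; deny if nothing matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip_address(src_ip) not in ip_network(rule.source):
            continue
        if rule.port is not None and rule.port != port:
            continue
        if rule.protocol not in ("*", protocol):
            continue
        return rule.allow
    return False


# Assumed example: allow HTTPS from an on-premises range, deny everything else.
rules = [
    Rule(200, "0.0.0.0/0", None, "*", allow=False),
    Rule(100, "10.1.0.0/16", 443, "Tcp", allow=True),
]

print(is_allowed(rules, "10.1.4.7", 443, "Tcp"))     # True
print(is_allowed(rules, "203.0.113.9", 443, "Tcp"))  # False
print(is_allowed(rules, "10.1.4.7", 22, "Tcp"))      # False
```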

Network configuration (image by the author)

Naming convention

A naming convention standardizes the names of platform resources, making them more self-descriptive and easier to manage. This standardization helps when navigating and filtering resources in the Azure portal. A well-defined naming convention lets you quickly identify a resource’s type, purpose, environment, and Azure region. This consistency is also useful in CI/CD processes, because predictable names are easier to parameterize.

When designing a naming convention, decide what information you want to capture. The standard should be easy to follow, consistent, and practical. It’s worth including elements such as the organization, business unit or project, resource type, environment, region, and instance number. You should also consider the scope of each resource to ensure that names are unique within their context. For certain resources, such as storage accounts, names must be globally unique.

For example, a Databricks workspace might be named as follows:

Naming convention (image by the author)

Examples of abbreviations (image by the author)

A comprehensive naming convention typically includes the following elements:

  • Resource type: An abbreviation indicating the type of resource.
  • Project name: A unique identifier for your project.
  • Environment: The environment that the resource supports (e.g., development, QA, production).
  • Region: The geographic region or cloud provider where the resource is deployed.
  • Instance: A number to distinguish between multiple instances of the same resource.
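
A convention like this is easy to enforce with a small helper. The sketch below is one possible implementation, not the scheme from the figures above: the abbreviation tables, the element order, and the `resource_name` function are assumptions for illustration. The only Azure-specific rule it encodes is that storage account names may contain just lowercase letters and digits and must be 3 to 24 characters long, so hyphens are dropped for that resource type.

```python
import re

# Hypothetical abbreviations; substitute your organization's own list.
RESOURCE_ABBREVIATIONS = {
    "databricks_workspace": "dbw",
    "storage_account": "st",
    "resource_group": "rg",
}
REGION_ABBREVIATIONS = {"westeurope": "weu", "northeurope": "neu"}


def resource_name(resource_type, project, environment, region, instance=1):
    """Build a name from resource type, project, environment, region, and instance."""
    parts = [
        RESOURCE_ABBREVIATIONS[resource_type],
        project.lower(),
        environment.lower(),
        REGION_ABBREVIATIONS[region],
        f"{instance:02d}",
    ]
    if resource_type == "storage_account":
        # Storage accounts: lowercase letters and digits only, 3-24 characters.
        name = "".join(parts)
        if not re.fullmatch(r"[a-z0-9]{3,24}", name):
            raise ValueError(f"invalid storage account name: {name!r}")
        return name
    return "-".join(parts)


print(resource_name("databricks_workspace", "sales", "dev", "westeurope"))
# dbw-sales-dev-weu-01
print(resource_name("storage_account", "sales", "prod", "westeurope", 2))
# stsalesprodweu02
```

Because the name is derived from parameters, the same function can feed a CI/CD pipeline for every environment.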

Infrastructure implementation

Deploying infrastructure through the Azure portal may seem simple, but it often involves numerous detailed steps for each resource. Highly secured infrastructure requires resource configuration, networking, private endpoints, DNS zones, etc. Resources like Azure Synapse or Databricks require additional internal configuration, such as setting up Unity Catalog, managing secret scopes, and configuring security settings (users, groups, etc.).

Once you are done with the test environment, you need to replicate the same configuration in the QA and production environments. This is where it is easy to make mistakes. To minimize mistakes that can affect development quality, it is recommended to use an Infrastructure as Code (IaC) approach. IaC lets you define cloud infrastructure as code in Terraform or Bicep, so you can deploy multiple environments with consistent configurations.
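
The idea behind consistent environments is a single definition plus a small set of per-environment values. Terraform and Bicep support this natively through variables and parameter files; the Python sketch below only illustrates the principle. The setting names, the `BASE` and `OVERRIDES` dictionaries, and the `render_parameters` function are hypothetical.

```python
import json

# Hypothetical settings shared by every environment.
BASE = {
    "project": "sales",
    "region": "westeurope",
    "public_network_access": False,
    "storage_sku": "Standard_LRS",
}

# Only the values that legitimately differ between environments.
OVERRIDES = {
    "test": {},
    "qa": {},
    "prod": {"storage_sku": "Standard_ZRS"},
}


def render_parameters(environment):
    """Merge the shared settings with one environment's overrides."""
    return {**BASE, **OVERRIDES[environment], "environment": environment}


for env in OVERRIDES:
    print(json.dumps(render_parameters(env), indent=2, sort_keys=True))
```

Because every environment starts from the same base, a setting such as disabled public network access cannot silently drift between test and production.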

In my cloud projects I use accelerators to quickly initiate new infrastructure setups. Microsoft also provides accelerators that can be used. Storing infrastructure as code in a repository provides additional benefits such as version control, change tracking, performing code reviews, and integrating with DevOps pipelines to manage and promote changes to environments.

Summary

If your data platform does not handle sensitive information and you do not need a highly secured setup, you can create a simpler platform with public internet access and without virtual networks (VNets), VPNs, etc. However, in a highly regulated environment, a completely different implementation plan is required. This plan involves collaboration with different teams within your organization, such as DevOps, platform, and network teams, or even external resources.

You first need to set up the network infrastructure, resources, and security configuration. Only when the infrastructure is ready can you start developing data processing.


Adapting the Azure Landing Zone for a Cloud Data Platform was originally published in Towards Data Science on Medium.