Azure Storage Account Data Lake

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

Made by: Massdriver

Tags: azure-storage-account-data-lake

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob Storage.

Use Cases

Hadoop-compatible access

Azure Data Lake Storage is primarily designed to work with Hadoop and all frameworks that use the Apache Hadoop Distributed File System (HDFS) as their data access layer. Hadoop distributions include the Azure Blob File System (ABFS) driver, which enables many applications and frameworks to access Azure Blob Storage data directly. The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.
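
As a rough illustration of how data is addressed through the ABFS driver, the sketch below builds an abfss:// URI and the matching REST endpoint from an account name, a container (file system) name, and a path. The helper names and the example account and container values are hypothetical; only the dfs.core.windows.net suffix comes from the text above.

```python
DFS_SUFFIX = "dfs.core.windows.net"  # endpoint suffix named in the text above


def dfs_endpoint(account: str) -> str:
    """Return the HTTPS endpoint that serves the Data Lake REST APIs."""
    return f"https://{account}.{DFS_SUFFIX}"


def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Return the TLS-secured ABFS URI for a path inside a container."""
    clean = path.lstrip("/")  # avoid a double slash after the host part
    return f"abfss://{container}@{account}.{DFS_SUFFIX}/{clean}"


if __name__ == "__main__":
    # "examplelake" and "raw" are made-up names for illustration only.
    print(dfs_endpoint("examplelake"))
    print(abfss_uri("examplelake", "raw", "/events/2024/data.parquet"))
```

A URI of this shape is what Hadoop-compatible frameworks such as Spark typically accept as an input or output location.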

Hierarchical directory structure

The hierarchical namespace is a key feature that enables Azure Data Lake Storage Gen2 to provide high-performance data access at object storage scale and price. You can use this feature to organize all the objects and files within your storage account into a hierarchy of directories and nested subdirectories. In other words, your Azure Data Lake Storage data is organized in much the same way that files are organized on your computer.
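
To make the benefit concrete, here is a toy in-memory model (not Azure code, and not how the service is implemented) that contrasts a flat namespace, where a directory is only a shared key prefix, with a hierarchical one, where a directory is a real node. Both classes and their method names are hypothetical. Renaming a directory touches every object in the flat model but only one entry in the hierarchical model.

```python
class FlatStore:
    """Flat namespace: every object is keyed by its full path string."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def rename_prefix(self, old, new):
        """Rename a 'directory' by rewriting every key under it.

        Returns the number of objects that had to be touched.
        """
        moved = [k for k in self.objects if k.startswith(old + "/")]
        for k in moved:
            self.objects[new + k[len(old):]] = self.objects.pop(k)
        return len(moved)


class HierarchicalStore:
    """Hierarchical namespace: directories are nested dictionaries."""

    def __init__(self):
        self.root = {}

    def put(self, path, data):
        *dirs, name = path.split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = data

    def rename_dir(self, parent_path, old, new):
        """Rename a directory by re-linking a single entry in its parent.

        Returns the number of entries that had to be touched (always 1).
        """
        node = self.root
        for d in filter(None, parent_path.split("/")):
            node = node[d]
        node[new] = node.pop(old)
        return 1


if __name__ == "__main__":
    flat, tree = FlatStore(), HierarchicalStore()
    for i in range(1000):
        flat.put(f"logs/2023/file{i}.txt", b"")
        tree.put(f"logs/2023/file{i}.txt", b"")
    print(flat.rename_prefix("logs/2023", "logs/archive"))
    print(tree.rename_dir("logs", "2023", "archive"))
```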

Optimized cost and performance

Azure Data Lake Storage is priced at Azure Blob Storage levels. It builds on Azure Blob Storage capabilities such as automated lifecycle policy management and object level tiering to manage big data storage costs.
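
The sketch below shows what a lifecycle rule for tiering and deletion might look like when built as a Python dictionary and serialized to JSON. The field names are intended to follow the JSON shape of Azure Storage lifecycle management policies, but treat them as an assumption and verify them against the current Azure documentation. The function name, rule name, prefix, and day thresholds are made up for illustration, and nothing here reflects how this bundle configures lifecycle policies.

```python
import json


def lifecycle_policy(prefix: str, cool_after_days: int, delete_after_days: int) -> dict:
    """Build a one-rule policy: move blobs to cool, then delete them later."""
    if not 0 <= cool_after_days < delete_after_days:
        raise ValueError("cool_after_days must be >= 0 and below delete_after_days")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tier-then-delete",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    # Only block blobs whose name starts with the prefix.
                    "filters": {"blobTypes": ["blockBlob"], "prefixMatch": [prefix]},
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                            "delete": {"daysAfterModificationGreaterThan": delete_after_days},
                        }
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(lifecycle_policy("raw/logs", 30, 365), indent=2))
```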

Performance is optimized because you don’t need to copy or transform data as a prerequisite for analysis. The hierarchical namespace capability of Azure Data Lake Storage allows for efficient access and navigation. This architecture means that data processing requires fewer computational resources, reducing both the time and cost of accessing data.

Massive scalability

Azure Data Lake Storage offers massive storage and accepts numerous data types for analytics. It doesn’t impose any limits on account sizes, file sizes, or the amount of data that can be stored in the data lake. Individual files can have sizes that range from a few kilobytes (KBs) to a few petabytes (PBs). Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

Design

Our bundle includes the following design choices to help simplify your deployment:

Redundancy

Azure Storage always stores multiple copies of your data so that it’s protected from planned and unplanned events, including transient hardware failures, network or power outages, and massive natural disasters. Redundancy ensures that your storage account meets its availability and durability targets even in the face of failures.

  • Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the primary region. LRS is the least expensive replication option, but isn’t recommended for applications requiring high availability or durability.
  • Zone-redundant storage (ZRS) copies your data synchronously across three Azure availability zones in the primary region. For applications requiring high availability, Microsoft recommends using ZRS in the primary region, and also replicating to a secondary region.
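
The two options above can be pictured as a choice of storage SKU. The following sketch maps a performance tier and a zone-redundancy flag to an Azure SKU name such as Standard_LRS or Premium_ZRS. The function is hypothetical, and the text does not say how the bundle performs this mapping internally, so read it as one plausible interpretation.

```python
VALID_TIERS = ("Standard", "Premium")  # assumed spelling of the tier values


def sku_name(tier: str, zone_redundancy: bool) -> str:
    """Return the SKU name for a tier and a zone-redundancy choice."""
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {VALID_TIERS}, got {tier!r}")
    # ZRS spreads copies across availability zones; LRS keeps them in one location.
    replication = "ZRS" if zone_redundancy else "LRS"
    return f"{tier}_{replication}"


if __name__ == "__main__":
    print(sku_name("Standard", zone_redundancy=True))
    print(sku_name("Premium", zone_redundancy=False))
```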

Best practices

TLS 1.2

Enforcement of TLS 1.2 on public HTTPS endpoints is standard best practice.
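
The bundle enforces this on the service side. As a client-side counterpart, a caller can also refuse older protocol versions. The minimal sketch below uses Python's standard ssl module; the function name is hypothetical, and whether your HTTP client accepts a custom context depends on the library you use.

```python
import ssl


def tls12_client_context() -> ssl.SSLContext:
    """Create a client context that rejects anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # certificate and hostname checks stay on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    context = tls12_client_context()
    print(context.minimum_version)
```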

Data retention policy

A time-based retention policy stores blob data in a Write-Once, Read-Many (WORM) format for a specified interval. When a time-based retention policy is set, clients can create and read blobs, but can’t modify or delete them. After the retention interval has expired, blobs can be deleted but not overwritten.
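
The rules in the paragraph above can be expressed as a small decision function. This is a toy model of the described semantics only, not a call into any Azure API; the function name and the shape of its return value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def allowed_operations(created_at: datetime, retention: timedelta, now: datetime) -> dict:
    """Return which operations a WORM-protected blob permits at time `now`."""
    expired = now >= created_at + retention
    return {
        "read": True,         # reads are always allowed
        "overwrite": False,   # never allowed, even after the interval expires
        "delete": expired,    # allowed only once the interval has expired
    }


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    week = timedelta(days=7)
    print(allowed_operations(created, week, created + timedelta(days=3)))
    print(allowed_operations(created, week, created + timedelta(days=8)))
```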

Security

To improve security, the bundle implements the following safeguards.

Data encrypted in transit

By default, all data in transit is encrypted with Transport Layer Security (TLS).

Data encrypted at rest

Azure Storage uses service-side encryption (SSE) to automatically encrypt your data when it is persisted to the cloud. Azure Storage encryption protects your data and helps you meet your organizational security and compliance commitments.

Trade-offs

  • Customer-managed keys (CMKs) are not supported in this bundle.

Variable | Type | Description
account.access_tier | string | How frequently will the data be accessed? Hot data is accessed frequently, while cool data is accessed less frequently. Hot data is cheaper to write to, but costs more to store. Cool data is more expensive to write to, but costs less to store.
account.region | string | The region where the storage account will be created. Cannot be changed after deployment.
account.tier | string | The performance tier of the storage account. Premium storage accounts do not support geo-replication. Cannot be changed after deployment.
monitoring.mode | string | Enable and customize storage account metric alarms.
redundancy.data_protection | integer | Set the number of days to allow data recovery if data is deleted (minimum 1, maximum 365).
redundancy.replication_type | string | No description
redundancy.zone_redundancy | boolean | Enable zone redundancy for the storage account. Cannot be changed after deployment.
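
As a sketch of how the variables above might be checked before a deployment, the function below validates a nested configuration dictionary against the types and the 1 to 365 day range listed in the table. The function name, the nested dictionary layout, and the example values are assumptions; the bundle's real schema may enforce more (for example, allowed values for each string).

```python
def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    account = cfg.get("account", {})
    redundancy = cfg.get("redundancy", {})

    # String-typed variables from the table.
    for key in ("access_tier", "region", "tier"):
        if not isinstance(account.get(key), str):
            errors.append(f"account.{key} must be a string")

    # Integer with documented bounds (bool is excluded because it subclasses int).
    days = redundancy.get("data_protection")
    if not isinstance(days, int) or isinstance(days, bool) or not 1 <= days <= 365:
        errors.append("redundancy.data_protection must be an integer from 1 to 365")

    if not isinstance(redundancy.get("zone_redundancy"), bool):
        errors.append("redundancy.zone_redundancy must be a boolean")
    return errors


if __name__ == "__main__":
    example = {  # made-up values for illustration
        "account": {"access_tier": "Hot", "region": "eastus", "tier": "Standard"},
        "redundancy": {"data_protection": 400, "zone_redundancy": True},
    }
    print(validate_config(example))
```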