Snowflake — History, Architecture & Features
In this guide to Snowflake, we will explore its background, how it came into existence, its architecture, and its features. I’ll be posting more in this series to dig further into Snowflake (SF) features.
History!!!
Before diving deep into Snowflake, I thought I’d share some of its history.
Snowflake was founded in 2012 in San Mateo, California. The founders chose the name because of their love for snow sports. After two years in stealth mode, it was publicly launched in October 2014.
Snowflake
Snowflake is a cloud-based data warehousing solution. It’s built from scratch (cloud-native) and optimized for the cloud (AWS/GCP/Azure).
Snowflake is a SaaS (Software as a Service) product, or DWaaS (Data Warehouse as a Service), with ANSI SQL support. There’s no software, infrastructure or upgrades for you to manage; Snowflake takes care of all of this and makes the service available to you.
With the pay-as-you-go model and decoupled storage and compute, you are charged for compute and storage separately, and each can scale as needed independently of the other.
If this doesn’t make sense yet, don’t worry and stick with me; we will discuss each of these concepts ahead.
Architecture
Before discussing the architecture, let’s first look at its layers:
- Storage Layer: Uses the underlying cloud storage to store the data.
- Query Processing Layer: Also called the virtual warehouse layer; it provides the compute.
- Cloud Service Layer: Responsible for multiple tasks such as controlling account access, generating query plans, and others.
Let's dive deep into each of the layers.
- Storage Layer
As the name says, this layer is responsible for storing the data. Snowflake reorganizes the data into its internal, optimized, columnar format, which is immutable. Under the hood, Snowflake uses cloud storage (AWS S3, Azure Blob Storage, Google Cloud Storage), and the user cannot access the data except through SQL. Each table is divided into micro-partitions, each of which contains a portion of the data.
- Compute / Query Processing Layer
Query processing happens at this layer by using a “virtual warehouse”, which is a compute resource that provides the required resources, such as memory and CPU, to perform the operations.
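As a rough sketch of what this looks like in practice, the statements below create a small virtual warehouse and later resize it without touching the stored data. The warehouse name demo_wh and the chosen sizes are assumptions made for illustration, not recommendations.

```sql
-- Hypothetical warehouse name; the sizes are illustrative only.
CREATE WAREHOUSE IF NOT EXISTS demo_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60            -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE           -- resume automatically when a query arrives
  INITIALLY_SUSPENDED = TRUE;  -- do not start consuming credits at creation

-- Scale the compute up; the storage layer is not affected.
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Point the current session at this warehouse.
USE WAREHOUSE demo_wh;
```

Because compute is billed while a warehouse is running, auto-suspend and auto-resume are usually worth setting on warehouses that are not in constant use.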
You can also run multiple virtual warehouses at the same time; they are independent of each other, so the workload on one warehouse does not affect the performance of another. A single warehouse can additionally be configured as a multi-cluster warehouse, which scales out by adding clusters under load.
- Cloud Service Layer
This layer is responsible for multiple activities that Snowflake carries out under the hood, such as user authentication and access control, infrastructure management, metadata management, query parsing and optimization, and security.
Snowflake Features
1- Near-Zero Management:
Snowflake is a cloud-native warehousing platform that offers near-zero management by eliminating most of the administrative overhead. Much of the management work is handled by Snowflake internally, which allows the organization to focus more on the data than on managing the platform.
2- Time-travel:
Time travel is a native feature that ensures the continued availability of data that has been changed or deleted. Historical data in permanent databases and database objects can be queried, cloned or restored for a maximum of 90 days (a retention period longer than 1 day requires Enterprise Edition or higher), whereas data in transient and temporary objects can be accessed for a maximum of 24 hours.
The time-travel retention period is enabled by default for every Snowflake account, and the default is 24 hours; however, it can be altered as per the requirements:
ALTER DATABASE DEMODB SET DATA_RETENTION_TIME_IN_DAYS = 90;
There are three ways to go back in time with time-travel:
1- A specific time in the past
2- Going back a certain amount of time from now
3- A query ID (SF assigns a unique query ID each time you execute a query).
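The three options above map onto the AT and BEFORE clauses of a query. The following is a minimal sketch; the table name orders, the timestamp and the query ID placeholder are all assumptions for illustration.

```sql
-- 1- Specific time in the past (the timestamp is illustrative)
SELECT * FROM orders AT(TIMESTAMP => '2024-01-15 08:00:00'::TIMESTAMP_LTZ);

-- 2- A certain amount of time back from now (offset in seconds, here 5 minutes)
SELECT * FROM orders AT(OFFSET => -60*5);

-- 3- The state just before a given statement ran
--    (replace the placeholder with a real query ID from your query history)
SELECT * FROM orders BEFORE(STATEMENT => '<query_id>');

-- A dropped table can be restored while it is still inside its retention period.
UNDROP TABLE orders;
```

All of these only work while the requested point in time is still within the retention period of the object.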
3- Fail-safe:
Once the time-travel retention period ends, the data moves into the fail-safe state, from which it can be recovered only by Snowflake employees, for up to seven days.
Note: The time-travel period for temporary and transient objects is at most 24 hours, and there’s no fail-safe state for them. The difference between the two is that a temporary object is dropped once the session ends, whereas a transient object persists until it is explicitly dropped. By default, an idle Snowflake session times out after 4 hours.
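To make the distinction in the note concrete, here is a small sketch that creates one table of each kind. The table names and columns are hypothetical.

```sql
-- Transient table: persists across sessions, at most 1 day of time travel, no fail-safe.
CREATE TRANSIENT TABLE staging_events (
  id      NUMBER,
  payload VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 1;

-- Temporary table: visible only to the session that created it
-- and dropped automatically when that session ends.
CREATE TEMPORARY TABLE session_scratch (
  id   NUMBER,
  note STRING
);
```

Both kinds trade recoverability for lower storage cost, so they tend to suit staging or scratch data that can be rebuilt.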
4- Replication and Failover:
Snowflake supports replication across regions and across cloud platforms, which enables an organization to replicate databases between Snowflake accounts. Objects are replicated physically, which incurs additional storage cost for any cloned object, along with data transfer and compute charges.
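As a hedged sketch of database replication between two accounts, the statements below enable replication on a primary database and create and refresh a replica. The organization and account names (myorg, account1, account2) and the database name demodb are assumptions; Snowflake also offers replication and failover groups, which are not shown here.

```sql
-- Run in the source account: allow demodb to be replicated to another account.
ALTER DATABASE demodb ENABLE REPLICATION TO ACCOUNTS myorg.account2;

-- Run in the target account: create a secondary (replica) database.
CREATE DATABASE demodb AS REPLICA OF myorg.account1.demodb;

-- Run in the target account: pull the latest changes from the primary.
ALTER DATABASE demodb REFRESH;
```

Each refresh transfers data and uses compute, which is where the charges mentioned above come from.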
5- Zero-copy cloning:
This feature lets you duplicate data without actually creating a physical copy, which avoids the storage cost. Changes made to the clone are local to the clone and do not affect the source. Similarly, changes made to the source after cloning do not affect the clone.
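A minimal sketch of cloning at the table and database level follows. The object names (orders, orders_dev, demodb, demodb_dev, orders_restored) are hypothetical.

```sql
-- Clone a single table; no data is physically copied at this point.
CREATE TABLE orders_dev CLONE orders;

-- Clone an entire database, including its schemas and tables.
CREATE DATABASE demodb_dev CLONE demodb;

-- Cloning can be combined with time travel:
-- clone the table as it was one hour ago (offset in seconds).
CREATE TABLE orders_restored CLONE orders AT(OFFSET => -3600);
```

The clone starts to consume its own storage only for data that is changed or added after the clone is created.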
6- Snowflake tasks:
Tasks are Snowflake’s built-in scheduler for running your queries or stored procedures. You can specify the schedule as an interval in minutes or, alternatively, in CRON format.
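Both scheduling styles can be sketched as follows. The task names, the warehouse demo_wh, the tables sales and daily_sales, and the stored procedure cleanup_proc are all assumptions made for illustration.

```sql
-- CRON schedule: run every day at 02:00 UTC.
CREATE OR REPLACE TASK refresh_daily_sales
  WAREHOUSE = demo_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  INSERT INTO daily_sales (sale_date, total_amount)
  SELECT sale_date, SUM(amount)
  FROM sales
  WHERE sale_date = DATEADD(day, -1, CURRENT_DATE())
  GROUP BY sale_date;

-- Interval schedule: run every 60 minutes and call a (hypothetical) stored procedure.
CREATE OR REPLACE TASK hourly_cleanup
  WAREHOUSE = demo_wh
  SCHEDULE = '60 MINUTE'
AS
  CALL cleanup_proc();

-- Tasks are created in a suspended state and must be resumed before they run.
ALTER TASK refresh_daily_sales RESUME;
ALTER TASK hourly_cleanup RESUME;
```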
7- Data masking:
Snowflake provides a feature to mask data dynamically, statically or conditionally. Data masking is column-level security, which lets you hide sensitive information in a table from users who are not authorized to view it.
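As an illustration of dynamic masking, the sketch below defines a masking policy and attaches it to a column. The policy name email_mask, the role ANALYST_FULL and the customers table are hypothetical, and masking policies are, as far as I know, an Enterprise Edition feature.

```sql
-- Show the real value only to an authorized role; everyone else sees a mask.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('ANALYST_FULL') THEN val
    ELSE '*********'
  END;

-- Attach the policy to the sensitive column.
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;
```

The data in the table is not changed; the policy is evaluated at query time, so the same query returns different results depending on the role that runs it.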
8- Row-level security:
Row-level security lets you show each user only a limited set of rows, based on criteria such as region or role. Suppose that, in a sales table, employees should see only the sales from their own region and not those from other regions; in that case we can apply row-level security.
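The sales example above could be sketched with a row access policy. The mapping table region_access (which lists the regions each role may see), the role SALES_ADMIN and the sales table are assumptions for illustration.

```sql
-- Hypothetical mapping of roles to the regions they are allowed to see.
CREATE TABLE IF NOT EXISTS region_access (
  role_name VARCHAR,
  region    VARCHAR
);

-- A row is visible if the current role is the admin role
-- or is mapped to the row's region.
CREATE OR REPLACE ROW ACCESS POLICY sales_region_policy
  AS (sales_region VARCHAR) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'SALES_ADMIN'
    OR EXISTS (
      SELECT 1
      FROM region_access m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region = sales_region
    );

-- Attach the policy to the region column of the sales table.
ALTER TABLE sales ADD ROW ACCESS POLICY sales_region_policy ON (region);
```

Keeping the role-to-region mapping in a table means access can be changed by updating rows rather than by editing the policy.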
There are plenty of other features that Snowflake offers, such as performance, scalability, availability, sharing and collaboration, along with those mentioned above.
If you want to learn more about Snowflake, there are great resources out there. I’ll mention a few of them below:
1. Snowflake Documentation
2. Data Engineering Simplified (YouTube channel)
3. Snowflake: The Definitive Guide from O’Reilly