Snowflake has a native integration with GitHub, allowing you to stage code and data. Using GitHub for source control adds another important governance function to Snowflake. We have created a step-by-step Snowflake GitHub integration setup tutorial. In this article, we will explore loading data and metadata, and executing code directly from GitHub.
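For quick context, the setup from that tutorial boils down to two Snowflake objects: an API integration and a Git repository. The statements below are a minimal sketch with placeholder names, not the exact objects from the tutorial; private repositories also require a GIT_CREDENTIALS secret.
-- Placeholder names; adjust the allowed prefix and origin to your own GitHub organization and repository.
CREATE OR REPLACE API INTEGRATION my_github_api_integration
API_PROVIDER = git_https_api
API_ALLOWED_PREFIXES = ('https://github.com/my-org/')
ENABLED = TRUE;
CREATE OR REPLACE GIT REPOSITORY my_repository
API_INTEGRATION = my_github_api_integration
ORIGIN = 'https://github.com/my-org/my-repo.git';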
Loading data from Snowflake GitHub Integration
When you set up your GitHub repository in Snowflake, you are creating a repository stage that you can SELECT from using the syntax I have highlighted below. The example I provided is a CSV file that holds metadata for 40 S3 buckets that we want to recursively load into Snowflake. Because the file is a CSV, reading it takes two steps.
1. Create a file format – This is a standard Snowflake command to define the CSV file format.
CREATE OR REPLACE FILE FORMAT basic_csv
TYPE = 'CSV'
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
SKIP_HEADER = 1;
2. Select the contents of the CSV.
The results from this query are the metadata that I can feed into a recursive process that moves data from Amazon S3 into Snowflake; a sketch of that step follows the query.
SELECT
t.$1 AS folder,
t.$2 AS folder_name,
t.$3 AS filename,
t.$4 AS alias,
t.$5 AS stage_file_format,
t.$6 AS stage_url,
t.$7 AS materialized,
t.$8 AS table_name,
t.$9 AS status
FROM @<your_repository_name>/branches/main/<your_file>.csv
(FILE_FORMAT => 'basic_csv') t WHERE STATUS = 'active';
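As a rough illustration of the recursive load mentioned above, the anonymous Snowflake Scripting block below loops over the active metadata rows and builds a COPY INTO statement for each one. It is a minimal sketch, not our production code: it assumes each row's alias points at an existing external stage and each table_name at an existing target table, and it omits the logging and error handling you would want in practice.
-- Run as an anonymous Snowflake Scripting block (wrap it in EXECUTE IMMEDIATE $$ ... $$ if your client requires it).
DECLARE
copy_sql STRING;
c1 CURSOR FOR
SELECT t.$4 AS alias, t.$5 AS stage_file_format, t.$8 AS table_name
FROM @<your_repository_name>/branches/main/<your_file>.csv
(FILE_FORMAT => 'basic_csv') t
WHERE t.$9 = 'active';
BEGIN
FOR rec IN c1 DO
-- Assumes rec.alias is the name of an existing external stage over the S3 folder.
copy_sql := 'COPY INTO ' || rec.table_name ||
' FROM @' || rec.alias ||
' FILE_FORMAT = (FORMAT_NAME = ''' || rec.stage_file_format || ''')';
EXECUTE IMMEDIATE copy_sql;
END FOR;
RETURN 'done';
END;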
Why load data from GitHub into Snowflake
There are multiple scenarios where version-controlled data in GitHub makes sense:
Seeding Data – Seeding data is the process of pre-loading tables with essential, foundational data that serves as a starting point for applications or analysis (a seeding sketch follows these examples). Examples include:
- Configuration Data: Data that dictates business logic or application behavior, such as pricing tiers, user roles, or workflow settings.
- Reference Tables: Tables containing lookup values, such as country codes, product categories, or status mappings.
Metadata provides context and structure for data pipelines, ensuring consistent management and governance. Examples include:
- Pipeline Configuration: Files defining extraction, transformation, and load (ETL) jobs, including schedule and dependency management. In our example, we use each record in our CSV to recursively process and load the corresponding S3 data into Snowflake.
- Data Model Definitions: Schema files specifying tables, columns, and relationships, ensuring data consistency across environments.
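Picking up the reference-table example above, seeding a lookup table from a version-controlled CSV can be as simple as a SELECT against the repository stage. This is a minimal sketch; the table definition and the seed/country_codes.csv path are hypothetical and assume a two-column file readable with the basic_csv file format defined earlier.
-- Hypothetical lookup table and file path; reuses the basic_csv file format from earlier.
CREATE TABLE IF NOT EXISTS country_codes (
iso_code STRING,
country_name STRING
);
INSERT INTO country_codes (iso_code, country_name)
SELECT t.$1, t.$2
FROM @<your_repository_name>/branches/main/seed/country_codes.csv
(FILE_FORMAT => 'basic_csv') t;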
As the scale of and dependencies on your GitHub repository grow, there is an inflection point where a more sophisticated process and technology like dbt makes sense, not only to facilitate your source control and promotion management process, but also to enhance your metadata management, documentation, semantics, and other vitals related to building your data infrastructure.
Execute SQL from Snowflake GitHub Integration
From a GitHub repository stage, you can execute code directly in Snowflake using the following command:
EXECUTE IMMEDIATE FROM @MYGITHUBSTAGE.PUBLIC.PIPELINE/branches/main/MYSQLFILE.sql;
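One detail worth noting: a repository stage reflects the files as of the last fetch from GitHub, so if the file may have changed since then, refresh the repository first. A minimal sketch, reusing the repository name from the example above:
-- Refresh the repository stage before executing, so the latest committed version of the file runs.
ALTER GIT REPOSITORY MYGITHUBSTAGE.PUBLIC.PIPELINE FETCH;
EXECUTE IMMEDIATE FROM @MYGITHUBSTAGE.PUBLIC.PIPELINE/branches/main/MYSQLFILE.sql;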
We use GitHub to manage all project code, ensuring that all code used for data pipelines and business-consumable data has centralized version control.
Scheduling Source Controlled SQL
While you can use Snowflake tasks or implement popular open-source tools like Airflow, our team loves integrating Snowflake with Azure Data Factory and Fabric to give us metadata-driven pipelines.
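If you want to stay entirely inside Snowflake, a task can run the source-controlled file on a schedule. The sketch below assumes a warehouse named MY_WH and reuses the repository path from the earlier example; it omits the fetch step and the error handling you would add in production.
-- Hypothetical task and warehouse names; runs the source-controlled SQL daily at 06:00 UTC.
CREATE OR REPLACE TASK run_pipeline_sql
WAREHOUSE = MY_WH
SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
EXECUTE IMMEDIATE FROM @MYGITHUBSTAGE.PUBLIC.PIPELINE/branches/main/MYSQLFILE.sql;
ALTER TASK run_pipeline_sql RESUME;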
Real-world use case – Metadata-driven pipelines
We love things that are fast, cheap, and easy, so Azure Data Factory has long checked those boxes. In the example below, we refresh 50 tables originating from AWS S3 every day and then materialize tables by simply executing SQL that has been validated and promoted to production within GitHub. This highlights a very simple and straightforward use case, but it does feature the scheduling, logging, and alerting needed to manage this production pipeline.