Skip to main content

Using Snowflake Cortex LLM to Blend Data

While at Snowflake annual conference in June, data experts congregated to share ideas, techniques and successes. One of the new innovations from Snowflake is Cortex.

Creating an Embedding with Snowflake Cortex

In this code example, I am taking our Salesforce leads table and chunking the data into a vector structure using SNOWFLAKE.CORTEX.EMBED_TEXT_768. More details on Snowflake Cortex are avaialble on the Snowflake website.


  CREATE OR REPLACE TABLE lead_location_embeddings AS
SELECT 
    "Id",
    "ConvertedAccountId",
    "Name",
 CONCAT("FirstName",' ',"LastName") "FullName",
 "Company",
    "City",
    "State",
    "PostalCode",
    SOUNDEX("FullName") "SoundexName",
   CONCAT("Street",', ',"City",' , ',"State",' , ',"PostalCode") "FullAddress",
    'fullname: ' || COALESCE("FullName", '') || ', ' || 
        'company: ' || COALESCE("Company", '') || ', ' || 
        'address: ' || COALESCE("FullAddress", '') AS address_concat,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 
        'fullname: ' || COALESCE("FullName", '') || ', ' || 
        'company: ' || COALESCE("Company", '') || ', ' || 
        'address: ' || COALESCE("FullAddress", '')) AS address_embedding
FROM 
    SFDC.STAGE.LEAD
WHERE "Do_Not_Mail__c" = TRUE;

When we acquire new data and need to match 3rd party data to our Salesforce CRM data, you can use the plethora of text matching functions like DISTANCE(). However, text matching functions while useful, lack context.

An embedding involves transforming high-dimensional data, like unstructured text, into a lower-dimensional representation, such as a vector. Using Snowflake Cortex LLM, I convert text, including full addresses, names, and emails, into vectors. These embeddings maintain semantic relationships, capturing similarities and differences in the geometric arrangement of the vectors they produce.

Blending Vector Data with Snowflake

By using vector embeddings and cosine similarity, this process effectively matches addresses from different datasets, identifying the closest and most similar pairs. This method ensures accurate matching based on the semantic meaning of the addresses, not just exact text matches.

//OUTPUT FINAL RESULTS 
SELECT 
    inner_query.RETURNID,
    inner_query.LEADID,
    inner_query.RETURN_ADDRESS,
    inner_query.LEAD_ADDRESS,
    inner_query.embed_similarity


//RUN VECTOR COSIGN SIMILARITY COMPARISON
FROM (
    SELECT 
        mail_returns."ROW" AS RETURNID,
        mail_returns.address_concat AS RETURN_ADDRESS,
        mail_returns.address_embedding,
        sfdc_leads."Id" AS LEADID,
        sfdc_leads.address_concat AS LEAD_ADDRESS,
        sfdc_leads.address_embedding,
        VECTOR_COSINE_SIMILARITY(mail_returns.address_embedding, sfdc_leads.address_embedding) AS embed_similarity,
        ROW_NUMBER() OVER (PARTITION BY mail_returns."ROW" ORDER BY embed_similarity DESC) AS similarity_rank
    FROM 
        mail_returns
    CROSS JOIN 
        sfdc_leads
) AS inner_query
WHERE inner_query.similarity_rank = 1 ;

Tips for Performance

When performing a cross join we typically use functions like SOUNDEX() on first/ last name to improve performance.