While at Snowflake annual conference in June, data experts congregated to share ideas, techniques and successes. One of the new innovations from Snowflake is Cortex.
Creating an Embedding with Snowflake Cortex
In this code example, I am taking our Salesforce leads table and chunking the data into a vector structure using SNOWFLAKE.CORTEX.EMBED_TEXT_768. More details on Snowflake Cortex are avaialble on the Snowflake website.
CREATE OR REPLACE TABLE lead_location_embeddings AS
SELECT
"Id",
"ConvertedAccountId",
"Name",
CONCAT("FirstName",' ',"LastName") "FullName",
"Company",
"City",
"State",
"PostalCode",
SOUNDEX("FullName") "SoundexName",
CONCAT("Street",', ',"City",' , ',"State",' , ',"PostalCode") "FullAddress",
'fullname: ' || COALESCE("FullName", '') || ', ' ||
'company: ' || COALESCE("Company", '') || ', ' ||
'address: ' || COALESCE("FullAddress", '') AS address_concat,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2',
'fullname: ' || COALESCE("FullName", '') || ', ' ||
'company: ' || COALESCE("Company", '') || ', ' ||
'address: ' || COALESCE("FullAddress", '')) AS address_embedding
FROM
SFDC.STAGE.LEAD
WHERE "Do_Not_Mail__c" = TRUE;
When we acquire new data and need to match 3rd party data to our Salesforce CRM data, you can use the plethora of text matching functions like DISTANCE(). However, text matching functions while useful, lack context.
An embedding involves transforming high-dimensional data, like unstructured text, into a lower-dimensional representation, such as a vector. Using Snowflake Cortex LLM, I convert text, including full addresses, names, and emails, into vectors. These embeddings maintain semantic relationships, capturing similarities and differences in the geometric arrangement of the vectors they produce.
Blending Vector Data with Snowflake
By using vector embeddings and cosine similarity, this process effectively matches addresses from different datasets, identifying the closest and most similar pairs. This method ensures accurate matching based on the semantic meaning of the addresses, not just exact text matches.
//OUTPUT FINAL RESULTS
SELECT
inner_query.RETURNID,
inner_query.LEADID,
inner_query.RETURN_ADDRESS,
inner_query.LEAD_ADDRESS,
inner_query.embed_similarity
//RUN VECTOR COSIGN SIMILARITY COMPARISON
FROM (
SELECT
mail_returns."ROW" AS RETURNID,
mail_returns.address_concat AS RETURN_ADDRESS,
mail_returns.address_embedding,
sfdc_leads."Id" AS LEADID,
sfdc_leads.address_concat AS LEAD_ADDRESS,
sfdc_leads.address_embedding,
VECTOR_COSINE_SIMILARITY(mail_returns.address_embedding, sfdc_leads.address_embedding) AS embed_similarity,
ROW_NUMBER() OVER (PARTITION BY mail_returns."ROW" ORDER BY embed_similarity DESC) AS similarity_rank
FROM
mail_returns
CROSS JOIN
sfdc_leads
) AS inner_query
WHERE inner_query.similarity_rank = 1 ;
Tips for Performance
When performing a cross join we typically use functions like SOUNDEX() on first/ last name to improve performance.