Skip to main content

5.3 Profiling & Analyzing Joined Data

Now that you have joined two tables together, there is some profiling, analysis, and critical thinking required to ensure your result will accurately reflect your desired output. There are numerous issues that can arise where your results will mis-represent your desired results. In theory, you wanted to understand Leads that are converted. An inner join will ensure only records with L.”ConvertedAccountId” and A.”Id” that match are returned.

What about leads without ConvertedAccountId record?

Leads that are missing a field ConvertedAccountId are not returned.

What about joining 2 tables containing fields with NULL values?
Snowflake (or any SQL database) will not create a Cartesian product due to NULL values in the join conditions. Instead, it will ignore rows in LEAD where ConvertedAccountId is NULL, because those NULL values won’t find any matching Id in the ACCOUNT table (since NULL does not equal anything, not even another NULL). The result set will only include leads that have a non-null ConvertedAccountId that matches an Id in the ACCOUNT table.

How can you determine how many Leads are missing from Accounts?

The simple way to make this determination is to check for the existence of a foreign key:

SELECT COUNT(*) AS " Leads Count" FROM LEAD WHERE "ConvertedAccountId" IS NULL;

This works for simple examples. Over time as you learn and find better ways to solve problems you will likely accumulate your own library of SQL scripts to help you quickly solve problems. When I profile and determine record counts, I use SQL like this:

How can you determine how many Accounts are missing from the Leads Table?

5.5 Combining two data sources with UNION

In SQL, a UNION is used to combine and stack the result sets of two or more SELECT queries. When you create a union, each SELECT query must have the same number of columns in the result sets with similar data types. This operation allows you to stack and consolidate data from different queries into a single result, which is particularly useful when dealing with disparate data sources or different activities that need to be analyzed together.

In Salesforce there isn’t an equivalent of a union that you would normally work with. However, in the world of reporting and analytics, Unions are extremely powerful for connecting and understanding activity data. For example, if you wanted to understand and report total leads created, total opportunities created, and deals closed this week you would need to run 3 different reports. Each report would require a different date/time field and filter criteria. To view these statistics together you could display them in a dashboard together, but they are effectively disconnected. A union can easily combine these 3 data sets into a single structure that can be loaded, pivoted and analyzed.

SQL Union Syntax

With this conceptual understanding, the union SQL syntax follows the same pattern where each query is created and simply combined using the UNION keyword. Here is a basic union syntax to combine two SELECT statements is as follows:

SELECT column1, column2, ...
FROM table1
UNION
SELECT column1, column2, ...
FROM table2;

5.6 Enhancing UNION Results

Labeling each query results with a Constant

When you use a union to combine two query results, you lose context where each query originated. Additionally if you want to use this data result to pivot and analyze data this context is very important. Let’s enhance each query with a “SEGMENT” constant. We learned about adding constants in our first lab so this should be a refresher. 

Add a constant to query 1 called “Segment” with the value ‘New Leads’

Add a constant to query 2 called “Segment” with the value ‘Closed Won Deals’

Your new SQL query should look like this:

SELECT ID,CREATEDDATE,STATUS,'New Leads' AS SEGMENT FROM LEAD
WHERE CREATEDDATE >= '2021-04-01' AND CREATEDDATE <= '2021-04-30'

UNION ALL

SELECT ID,CLOSEDATE,STAGENAME,'Closed Won Deals'as SEGMENT FROM OPPORTUNITY
WHERE CLOSEDATE >= '2021-04-01' AND CLOSEDATE <= '2021-04-30' AND STAGENAME = 'Closed Won';

Assigning Consistent Column Names
Snowflake will utilize the column names from your first query. The ordering of the columns and NOT the column names dictates how data is combined within your union. 

It’s a best practice to use aliases to rename your columns for consistency. It ensures you are thoughtfully merging data while retaining the original table column name as context.

SELECT ID,CREATEDDATE as "ACTIVITY DATE",STATUS,'New Leads' AS SEGMENT FROM LEAD
WHERE CREATEDDATE >= '2021-04-01' AND CREATEDDATE <= '2021-04-30'

UNION ALL

SELECT ID,CLOSEDATE as "ACTIVITY DATE",STAGENAME,'Closed Won Deals'as SEGMENT FROM OPPORTUNITY
WHERE CLOSEDATE >= '2021-04-01' AND CLOSEDATE <= '2021-04-30' AND STAGENAME = 'Closed Won';

Column Data Types

It is important that you align the column data types between tables. For example a VARCHAR (text) column shouldn’t be unioned to an INTEGER (number).

Column Count Mismatches

If one query has a mismatch in columns you could get an error like the following. 

UNION VS UNION ALL

A UNION will automatically remove exact duplicates from your result. A UNION ALL will leave all records in your result. Our lab pulled data from two different tables with distinct stage and status fields so a duplicate would have been impossible.


To further understand UNION vs UNIONALL, let’s say we had these two tables that we wanted to union together.

TableA

IDName
1Alice
2Bob

TableB

IDName
2Bob
3Charlie

Our data has a duplicate record for 2 | Bob. A UNION will remove the duplicate record. A UNION ALL will return both instances for 2 | Bob.  If you start using Union to create data from the same table, the UNION vs UNION ALL distinction will be more important for duplicate removal.

Combined Result Set (with UNION)

IDName
1Alice
2Bob
3Charlie

5.7 Extra Credit – Putting Union Data to Work

To this point, we have been writing code and viewing results. To explain the value of the data set you just created, I have exported the data and pulled into Excel to pivot and aggregate the data. As you can see, in 1 data set we can view two independent activities together in a funnel. When you look at Salesforce pipeline data, this activity based view is extremely powerful to understand in-month activities.

From what you have learned so far, we can quickly get to this same summarized count by month and segment by wrapping our UNION query in a SELECT statement with GROUP BY. Here is an example that you can test and experiment with:

-- Adding the to_char function changes the detailed date format into a month/year text field.

SELECT  to_char("ACTIVITY DATE", 'yyyy-mm') "ACTIVITY DATE" ,SEGMENT,COUNT(ID) "COUNT" FROM

--Wrap the previous lab Union Query in opening and closing parentheses which creates a derived table for you to select from.

(
SELECT ID,CREATEDDATE as "ACTIVITY DATE",STATUS,'New Leads' AS SEGMENT FROM LEAD
WHERE CREATEDDATE >= '2021-04-01' AND CREATEDDATE <= '2021-04-30'

UNION ALL

SELECT ID,CLOSEDATE as "ACTIVITY DATE",STAGENAME,'Closed Won Deals'as SEGMENT FROM OPPORTUNITY
WHERE CLOSEDATE >= '2021-04-01' AND CLOSEDATE <= '2021-04-30' AND STAGENAME = 'Closed Won'
)

-- Adding the to_char changes the detailed date format into a month/year grouped value.

GROUP BY to_char("ACTIVITY DATE", 'yyyy-mm'),SEGMENT

5.4 Normalized vs Denormalized Data

Understanding Normalized vs Denormalized Data

While working in Salesforce, you are working within a relational data model that is “normalized”. When you join these tables together you could be denormalizing your data which helps for data analysis but without care can change the meaning of your data by introducing what would be perceived as “duplicate” records.

Additional Profiling Tips

When you join two data sources together in Salesforce, you have to consider one to many relationships between objects and the resulting information / data that you are delivering.

A quick way to review and check the presence of duplicate data as a result of a join is to count and compare Record counts by ID from distinct record counts.

For example, you can use the previous RIGHT join and perform an aggregate calculation of COUNT and COUNT DISTINCT to give you visibility to any introduce duplicate records in your data, thus denormalizing your data.

SELECT

  COUNT(*) AS "Total Joined Records",

  COUNT(DISTINCT A."Id") AS "Distinct Account IDs"

FROM LEAD AS L

RIGHT JOIN ACCOUNT AS A ON L."ConvertedAccountId" = A."Id";

Guidance on working with JOINs

Understanding SQL joins is crucial for combining data from multiple tables in a database is a critical building block for working with Salesforce data because the analysis and analytics output often relies on multiple table of data and lookups in Salesforce. Whether you’re working with Salesforce reports or querying a database directly, knowing how to join tables enables you to retrieve comprehensive datasets that provide valuable insights.

By grasping the concepts of inner joins, left joins, right joins, and full outer joins, you gain control over how data is merged, allowing you to tailor your queries to specific requirements. Moreover, mastering join conditions ensures that the combined data is relevant and accurately reflects the relationships between the tables.

As you continue your SQL journey, remember to consider the implications of different join types on your query results and how they align with your data analysis goals. With practice and proficiency in SQL joins, you’ll be equipped to navigate complex datasets and extract meaningful information to drive informed decision-making.

There are intermediate to advanced concepts for dealing with junction tables, one to many relationships, and deduping records that we don’t cover in our introductory training but you should be aware of. I recommend that you spend some time practicing joining data extensively because it will become a staple for your SQL skills advancement.

5.2 JOIN Labs

Lab Joining 2 Data Sources with a Left Join

What if we wanted to return all leads with and without matched accounts? The following query with a LEFT JOIN will return all Leads that have and have not been converted to an Account.

SELECT L."Name" as "LeadName", A."Name" as "AccountName" 

FROM LEAD AS L 

LEFT JOIN ACCOUNT AS A 

ON L."ConvertedAccountId" = A."Id"

Lab Joining Two Data Sources with Full Outer Join

The following query with a LEFT JOIN will return all Leads that have and have not been converted to an Account and all Accounts that may not be associated with a Lead.

SELECT L."Name" as "LeadName", A."Name" as "AccountName" 

FROM LEAD AS L 

LEFT JOIN ACCOUNT AS A 

ON L."ConvertedAccountId" = A."Id"

After you perform a full outer join between leads and accounts, you may ask yourself three important questions:

  1. How many matches exist?
  2. How many leads exist without a converted account?
  3. How many accounts exist that were not converted from a lead?

We are not going to dig into these questions quite yet as we continue to learn and build up our working knowledge of expressions. We will refer back to this in our final lesson when we dive into a practical application of a full outer join.

Lab Joining Two Data Sources with Right Join

The following query with a RIGHT JOIN will return all Accounts that were converted from leads or exist without a lead, and only Leads that were converted to an Account. 

SELECT L."Name" as "LeadName", A."Name" as "AccountName" 

FROM LEAD AS L 

RIGHT JOIN ACCOUNT AS A 

ON L."ConvertedAccountId" = A."Id"

5.1 JOIN ON Condition

To understand a join, let’s first consider 2 queries:

Query 1 – Lead Query

SELECT "Name" as "LeadName"

FROM LEAD

Query 2 – Account Query

SELECT "Name" as "AccountName" 

FROM ACCOUNT

Table Alias

When working with tables like Leads and Accounts, where both have columns named “Id” and “Name”, aliasing becomes particularly useful to avoid ambiguity. For instance, in a SQL query involving a join between these two tables, you might encounter confusion if you just refer to “Id” or “Name”, since it’s unclear whether you’re referring to the “Id” and “Name” from Leads or Accounts.

To resolve this, you can alias the tables, like L for Leads and A for Accounts. With the table alias defined, you can now call “Id” or “Name” fields, prefixed with the alias to specify the table. If you review the revised queries below, it’s now clear which table’s “Name” you are referencing.

NOTE: Technically you do not need the keyword “AS” when you create an alias for a SQL table or column. It does however help for readability while starting writing SQL. 

Query 1 – Lead Query

SELECT L."Name" as "LeadName"

FROM LEAD AS L

Query 2 – Account Query

SELECT A."Name" as "AccountName" 

FROM ACCOUNT AS A

JON ON Condition

The ON condition in a SQL join specifies the criteria that determine how rows from two tables are matched together. It’s essentially a rule that says, “only link rows where this condition is true.” For example, in the condition ON Lead.ConvertedAccountId = Account.Id, it means that a row from the Lead table will only be joined with a row from the Account table if the ConvertedAccountId field in the Lead table matches the Id field in the Account table. This ensures that the data combined in the join result is relevant and correctly associated, linking each lead to its corresponding account based on these matching IDs.

SELECT L."Name" as "LeadName", A."Name" as "AccountName" 

FROM LEAD AS L 

JOIN ACCOUNT AS A 

ON L."ConvertedAccountId" = A."Id"