Tuesday, May 12, 2026

how to join customer (scd2 ) with transaction fact table to get location of the customer on date of customer

 To join a Transaction (Fact) table with an SCD Type 2 Customer (Dimension) table, you must match the transaction date to the specific period when a customer's record was "active."

In your specific table structure, since you don't have a valid_to column, you have to "close" the records yourself using a window function before joining.

Step 1: Prepare the Dimension (The "Timeline" View)

First, use a Common Table Expression (CTE) to define the start and end dates for every version of the customer record.

SQL
WITH Customer_Timeline AS (
    SELECT 
        custo_id,
        addr,
        updated_date AS start_date,
        -- Create an end_date by looking at the next record's update date
        LEAD(updated_date) OVER (PARTITION BY custo_id ORDER BY surrogate) AS end_date
    FROM Customers
)

Step 2: The Point-in-Time Join

Now, join the transaction table to this timeline. The join condition must ensure the transaction date falls on or after the start_date and strictly before the end_date.

SQL
SELECT 
    t.trx_id,
    t.date AS trx_date,
    t.customer,
    c.addr AS location_at_trx_time
FROM Transactions t
JOIN Customer_Timeline c 
  ON t.customer = c.custo_id
  AND t.date >= c.start_date 
  AND (t.date < c.end_date OR c.end_date IS NULL);

Why this logic is essential

If you simply join on cust_id, SQL will return every version of that customer for every transaction, causing a "fan-out" (duplicate data).

Using the date range ensures a 1:1 match:

  • Transaction on 01-01-2000: Joins with the record where addr = 'usa'.

  • Transaction on 01-03-2000: Joins with the record where addr = 'india'.

Key Considerations for Interviews

  1. The "NULL" Problem: Always mention the OR end_date IS NULL part. This handles the current "active" record, which hasn't been superseded yet.

  2. SCD Type 2 Identification: Mention that this approach is specifically for Slowly Changing Dimensions (Type 2) where we preserve history.

  3. Performance: Range joins (>= and <) can be computationally expensive on large datasets. In a production environment, it is better to have the ETL process pre-calculate a valid_to_date column rather than calculating it on the fly with LEAD().

No comments:

Post a Comment

12may26 - challenges in data modeling

 Transitioning from technical design to the actual execution of a project brings a different set of headaches. In a manufacturing environmen...