In your specific table structure, since you don't have a valid_to column, you have to "close" the records yourself using a window function before joining.
Step 1: Prepare the Dimension (The "Timeline" View)
First, use a Common Table Expression (CTE) to define the start and end dates for every version of the customer record.
WITH Customer_Timeline AS (
SELECT
custo_id,
addr,
updated_date AS start_date,
-- Create an end_date by looking at the next record's update date
LEAD(updated_date) OVER (PARTITION BY custo_id ORDER BY surrogate) AS end_date
FROM Customers
)
Step 2: The Point-in-Time Join
Now, join the transaction table to this timeline. The join condition must ensure the transaction date falls on or after the start_date and strictly before the end_date.
SELECT
t.trx_id,
t.date AS trx_date,
t.customer,
c.addr AS location_at_trx_time
FROM Transactions t
JOIN Customer_Timeline c
ON t.customer = c.custo_id
AND t.date >= c.start_date
AND (t.date < c.end_date OR c.end_date IS NULL);
Why this logic is essential
If you simply join on cust_id, SQL will return every version of that customer for every transaction, causing a "fan-out" (duplicate data).
Using the date range ensures a 1:1 match:
Transaction on 01-01-2000: Joins with the record where addr = 'usa'.
Transaction on 01-03-2000: Joins with the record where addr = 'india'.
Key Considerations for Interviews
The "NULL" Problem: Always mention the
OR end_date IS NULLpart. This handles the current "active" record, which hasn't been superseded yet.SCD Type 2 Identification: Mention that this approach is specifically for Slowly Changing Dimensions (Type 2) where we preserve history.
Performance: Range joins (
>=and<) can be computationally expensive on large datasets. In a production environment, it is better to have the ETL process pre-calculate avalid_to_datecolumn rather than calculating it on the fly withLEAD().
No comments:
Post a Comment