Thursday, January 29, 2026

29jan26 - Physical model without a Logical one?

Can you create a Physical model without a Logical one?

The short answer: Yes. Technically, you can jump straight into a Physical Data Model (PDM) and start writing SQL code to create tables.

The long answer: You probably shouldn't.

While "skipping to the end" might seem faster, it often leads to significant issues:

Feature	Physical Model Only (Skipping LDM)	Using Both LDM and PDM
Flexibility	Hard to change; tied to a specific database (e.g., PostgreSQL).	Easy to migrate or adapt to new tech. > in future if you want to change the database software , since the LDM is in place its easy to create a PDM and tables
Communication	Difficult for non-tech stakeholders to read.	Clear and understandable for business users.
Data Integrity	Prone to "missing" business rules or logic.	Ensures business rules are baked into the design.
Scalability	Often results in "patchwork" fixes later.	Built on a solid, future-proof foundation.

Sunday, January 25, 2026

25jan26 - how to do profilng in which tool and how

Here’s a clean, interview-ready rewrite focused on one 3rd-party tool (Informatica Data Quality), with step-by-step profiling and reporting. You can say this almost verbatim.

Tool used for data profiling

I use Informatica Data Quality (IDQ) to profile data files and generate data quality reports before loading them into the database.

Step-by-step: How I do data profiling in Informatica Data Quality

Step 1: Ingest the data file

Connect the source file (CSV / Excel / flat file / S3 / database)
Define the source metadata in IDQ
Validate column names and data types

📌 Purpose: Ensure the file is readable and structurally correct.

Step 2: Run Column Profiling

Use Column Profile in Informatica Analyst
Analyze each column for:
- Data type distribution
- Null and blank percentage
- Min / Max values
- Distinct count
- Value frequency

📌 Purpose: Understand the content and detect obvious issues.

Step 3: Run Data Domain & Pattern Analysis

Apply data domains (date, email, phone, numeric)
Use pattern analysis to detect invalid formats

Examples:

Invalid email formats
Date columns stored as strings
Mixed data types in one column

📌 Purpose: Validate format and consistency.

Step 4: Identify Duplicates

Use duplicate analysis on key fields
Identify exact and fuzzy duplicates (if needed)

Example:

Same order_id appearing multiple times
Same customer with slight name variations

📌 Purpose: Prevent double counting and incorrect metrics.

Step 5: Apply Business Rules

Create business rule transformations
Examples:
- Order amount > 0
- Delivery time between 0 and 180 minutes
- Order status ∈ allowed values

📌 Purpose: Ensure data follows business logic, not just technical rules.

Step 6: Generate Data Quality Score

Assign weights to rules (critical vs non-critical)
Calculate overall data quality score
Categorize issues:
- Critical
- Warning
- Informational

📌 Purpose: Measure readiness of the file.

Step 7: Create Data Quality Report

Generate profiling reports from Informatica Analyst
Report includes:
- Column statistics
- Failed rules
- Duplicate counts
- Data quality score

📌 Purpose: Provide transparency to stakeholders.

Step 8: Go / No-Go decision

If critical rules fail → Reject the file
Notify source system / upstream team
Reload only after correction

📌 Purpose: Prevent bad data entering the warehouse.

How I ensure the file is in good shape

A file is considered ready when schema validation passes, nulls and duplicates are within thresholds, business rules are satisfied, and the overall data quality score meets acceptance criteria.

Strong closing line (interview gold)

“Using Informatica Data Quality, I profile the data structurally, statistically, and against business rules, generate a data quality report, and enforce Go/No-Go criteria before loading the data.”

Just say 👍

Friday, January 23, 2026

23 jan 26 - Data Modeling - ALL STEPS - With KPIS and req gathering

23 jan 26 -

Data Model (Step-by-Step)

Supermarket Sales Analytics – Requirement to Data Model (Step-by-Step)

STEP 1: Understand the Business Context

Business Overview

A supermarket chain operates:

Multiple stores across regions
Thousands of products
Sales staff earning commissions
Loyalty customers

Business Goals

Increase sales & profitability
Identify high-demand products
Improve regional performance
Track employee commission
Understand customer behavior

STEP 2: Requirement Gathering – How to Ask Questions

🔹 Core Requirement Questions (Very Important)

A. Business Objective

Ask:

What decisions will this report support?
Who will use this report (CEO, Manager, Ops)?
Is this for monitoring or deep analysis?

B. Sales-Related Questions

Ask:

Do you want daily, monthly, or yearly sales?
Should discounts be included in revenue?
Do you want gross margin or only sales?

C. Time & Comparison

Ask:

Do you need YoY / MoM comparison?
How many years of history?

D. Hierarchy & Drill-down

Ask:

Region → Store → City?
Category → Subcategory → Product?

E. Data Availability

Ask:

Is cost available for margin?
Is employee commission stored or calculated?

STEP 3: Identify Required Reports (5 Reports)

✅ Final Agreed Reports

Sales Performance Report
Customer Analysis Report
Regional Performance Report
Employee Commission Report
Product in Demand Report

STEP 4: Define KPIs for Each Report

📊 Report 1: Sales Performance

Business Questions

How much are we selling?
Are sales increasing?

KPIs

Total Sales Amount
Quantity Sold
Discount Amount
Gross Margin
Sales Growth %

👥 Report 2: Customer Analysis

Business Questions

Who are our top customers?
Are customers returning?

KPIs

Total Spend per Customer
Purchase Frequency
Average Basket Value
Customer Lifetime Value (basic)

🌍 Report 3: Regional Performance

Business Questions

Which regions perform best?
Where are we underperforming?

KPIs

Sales by Region
Sales Growth %
Contribution %
Store Count

🧑‍💼 Report 4: Employee Commission

Business Questions

How much commission is earned?
Who are top performers?

KPIs

Total Sales by Employee
Commission Amount
Commission %
Rank by Sales

📦 Report 5: Product in Demand

Business Questions

What products sell the most?
Which products should be stocked more?

KPIs

Quantity Sold
Sales Amount
Sales Frequency
Stock Turnover (if inventory exists)

STEP 5: Define Grain, Granularity & Granules

🔹 Fact Table Grain (Very Important)

One row per Product per Store per Day per Employee

This supports all 5 reports.

🔹 Granularity (Analysis Levels)

Dimension	Granularity Levels
Time	Day → Month → Year
Product	Product → SubCategory → Category
Geography	Store → City → Region
Customer	Customer
Employee	Employee

🔹 Granules (Lowest Identifiers)

Granules are the keys that define the grain:


Date_Key
Product_Key
Store_Key
Employee_Key
Customer_Key

Each unique combination = 1 row

STEP 6: Choose the Data Model

✅ Recommended Model: Star Schema (with selective Snowflake)

Why?

BI friendly
Easy reporting
Snowflake only where hierarchy is complex (Product, Region)

STEP 7: Identify Fact Tables

🔹 Primary Fact Table

FACT_SALES

Measures:

Sales_Amount
Quantity
Discount
Cost
Commission_Amount

STEP 8: Identify Dimension Tables

🔹 Dimensions Used Across Reports

Dimension	Purpose
DIM_DATE	Time analysis
DIM_PRODUCT	Product details
DIM_CATEGORY (Snowflake)	Product hierarchy
DIM_STORE	Store info
DIM_REGION (Snowflake)	Geography hierarchy
DIM_CUSTOMER	Customer analysis
DIM_EMPLOYEE	Commission tracking

STEP 9: Data Model per Report (Mapping)

📊 Sales Performance


FACT_SALES
→ DIM_DATE
→ DIM_PRODUCT
→ DIM_STORE

👥 Customer Analysis


FACT_SALES
→ DIM_CUSTOMER
→ DIM_DATE

🌍 Regional Performance


FACT_SALES
→ DIM_STORE → DIM_REGION
→ DIM_DATE

🧑‍💼 Employee Commission


FACT_SALES
→ DIM_EMPLOYEE
→ DIM_DATE

📦 Product in Demand


FACT_SALES
→ DIM_PRODUCT → DIM_CATEGORY
→ DIM_DATE

STEP 10: Data Modeling Steps (Execution)

Finalize grain
List KPIs & measures
Identify dimensions
Define hierarchies
Design star/snowflake schema
Assign surrogate keys
Validate against reports
Review with business
Freeze model
Build ETL

🧠 Final Reality Check (Very Important)

Reports drive KPIs → KPIs drive grain → Grain drives data model

Never design tables first.

🎤 Interview-Ready Closing Line

“I start with business questions, derive KPIs, define the grain, and then design a star or snowflake schema that supports all required reports with minimal redundancy.”

Wednesday, January 21, 2026

22jan26 - when to use snowflake model

Scenario for Snowflake Model

🔍 Comparison Snapshot (TOP)

⭐ Star Schema

FACT → DIM_PRODUCT (flat)

❄️ Snowflake Schema

FACT → Product → SubCategory → Category → Department

🎯 What the Company Wants to View in Reports

Business Context

A large retail enterprise wants enterprise-level analytical reporting, not just simple dashboards.

They want reports that:

Work across millions of products
Follow official category hierarchies
Support frequent hierarchy changes
Match master data definitions

📊 Report Requirement 1

“Sales by Department → Category → SubCategory”

Business Expectation

View sales at Department level
Drill down to Category
Drill further to SubCategory
Restructure categories without reloading all product data

🔹 Grain, Granularity & KPIs (Explained)

Grain (Fact Table Grain)

One row per Product per Store per Day

This defines the lowest level at which data is stored in the fact table.

Granularity (Reporting Levels)

Using the snowflaked hierarchy, the business can analyze data at multiple levels:

Department level – Executive overview
Category level – Business unit performance
SubCategory level – Merchandising analysis
Product level – SKU-level analysis

This is possible without changing the fact table, only by navigating dimension hierarchies.

Key KPIs Required

Total Sales Amount
Total Quantity Sold
Gross Margin
Average Selling Price (ASP)
Discount Amount
Sales Growth %

These KPIs aggregate cleanly across all hierarchy levels.

❌ Why This Is Weak in Star Schema

Flat Product Dimension

DIM_PRODUCT
Product_Key | Product_Name | SubCategory | Category | Department

Problems

Category and Department values are repeated for every product
If a Category moves to a new Department →
❗ millions of product rows must be updated
High risk of inconsistent category names
Hard to audit hierarchy changes over time

Result

Maintenance-heavy
Error-prone
Poor governance

✅ Why Snowflake Works

FACT → Product → SubCategory → Category → Department

Benefits

Category and Department stored once
Reclassification requires updating only hierarchy tables
Clean drill-down and roll-up paths
Enterprise-controlled hierarchy

📊 Report Requirement 3

“Impact Analysis When Categories Change”

Business Expectation

When a category is moved or renamed, reports should:
- Update instantly
- Not require massive data reloads

❌ Star Schema Issue

Category change = product dimension reload
Long ETL windows
High operational risk

✅ Snowflake Solution

Update one hierarchy table
Fact data remains untouched
Faster, safer changes

🧠 Core Reason Star Is Not Enough

Star schema collapses hierarchy into flat dimensions, which breaks scalability, governance, and change management when hierarchies are large and dynamic.

🧾 One-Line Business Justification

“Snowflake schema enables governed, hierarchical reporting that a flat star model cannot support efficiently at enterprise scale.”

If you want, I can next:

Add Date / Geography grain examples
Convert this into Northwind scenario
Make it interview-ready Q&A
Create a diagram with keys

Just tell me 👍

22jan26 - Surrogate keys - when to use and not use

What is a Surrogate Key?

A surrogate key is an artificial, system-generated unique identifier for a row in a table.
Usually integer or bigint, auto-incremented or generated by the system.
Not derived from business data (unlike natural keys).

Example:


customer_key INT IDENTITY(1,1)  -- surrogate key

2️⃣ When to Use Surrogate Keys

Scenario	Reason / Benefit	Example
Data Warehouse / Fact-Dimension Modeling	Allows consistent foreign key joins, supports slowly changing dimensions (SCD), avoids issues when natural keys change.	`Dim_Customer` with `customer_key` used in `Fact_Order_Sales`
Slowly Changing Dimensions (Type 2)	A surrogate key lets you track history without altering historical fact data.	Customer changes address → new surrogate key row in `Dim_Customer`
Composite / Complex Natural Keys	Avoid long, multi-column foreign keys.	Order detail identified by `order_id + line_number` → simpler to use `order_line_key`
Integration from Multiple Systems	Different source systems may use different IDs → surrogate key standardizes references.	Merging `customer_id` from two CRMs
Performance / Indexing	Integer surrogate keys are more efficient for joins than long string keys.	Joining `Dim_Product.product_key` to fact table instead of `product_sku`

3️⃣ When NOT to Use Surrogate Keys

Scenario	Reason / Alternative
Transactional OLTP Tables where natural key is stable and simple	Surrogate key adds no value; natural key can uniquely identify the row.
When key has real-world meaning and is immutable	Using a surrogate key may cause unnecessary joins and duplication.
Small reference tables with static values	Surrogate key adds complexity without benefit.
Temporary / staging tables	Surrogate keys are unnecessary; use natural key or row hash.
Where natural key is already numeric, short, and stable	Surrogate key is redundant

4️⃣ Key Guidelines

Always use surrogate keys in data warehouse dimensions and fact table foreign keys.
Avoid surrogate keys in OLTP tables with stable, meaningful natural keys.
Never mix natural keys and surrogate keys for joins in the same context—it confuses lineage.
Surrogate keys are particularly helpful for history tracking (SCD2) and integration of multiple sources

22jan26 - Food Delivery Business – Data Modeling from OLAP - V1

Below is a complete, end-to-end explanation for a Food Delivery Business, written exactly the way it is expected in projects, design documents, interviews, and architecture reviews.

🍔 Food Delivery Business – Data Modeling from OLTP to Star Schema

1️⃣ Transaction Model (Enterprise / OLTP Model)

🎯 Purpose

Support day-to-day operations
Optimized for fast inserts & updates
Highly normalized (3NF)

📋 Core Transaction Tables

Customers

customer_id (PK)
name
phone
email
address_id
created_date

Restaurants

restaurant_id (PK)
restaurant_name
cuisine_type
city
rating

Menu_Items

menu_item_id (PK)
restaurant_id (FK)
item_name
category
price
is_available

Orders

order_id (PK)
customer_id (FK)
restaurant_id (FK)
order_datetime
order_status
total_amount

Order_Items

order_item_id (PK)
order_id (FK)
menu_item_id (FK)
quantity
item_price

Delivery_Partners

delivery_partner_id (PK)
name
vehicle_type
join_date

Payments

payment_id (PK)
order_id (FK)
payment_mode
payment_status
paid_amount

2️⃣ Need for Reporting & Analytics

📊 Business Questions

Daily / monthly revenue?
Top restaurants by sales?
Most ordered food items?
Average delivery time?
Customer repeat rate?
City-wise performance?
Partner productivity?

3️⃣ Problems Reporting Directly from Transaction Model ❌

Problem	Explanation
Too many joins	8–10 joins per report
Poor performance	Tables optimized for writes
Complex queries	Hard for BI tools
Mixed grain	Orders vs order items
No historical tracking	Overwritten data
Business logic scattered	Revenue logic repeated

📌 Conclusion:

Transaction models are NOT designed for analytics.

4️⃣ Why We Need a Data Warehouse ✅

Benefits

Optimized for read & aggregation
Simple query structure
Historical data support
Business-friendly schema
Faster dashboards

5️⃣ Define Grain & Granularity (MOST IMPORTANT)

🎯 Reporting Focus

Analyze food orders and revenue

✅ Fact Table Grain

ONE ROW per menu item per order per day

Formally:

Order × Menu Item × Date

📌 Granularity

Lowest level: Order Item
Time level: Day
Allows roll-up to:
- Order
- Restaurant
- City
- Month / Year

6️⃣ Map Business Requirements to Fact & Dimension Tables

Business Requirement	Table Type	Table Name
Revenue analysis	Fact	Fact_Order_Sales
Time-based analysis	Dimension	Dim_Date
Customer analysis	Dimension	Dim_Customer
Restaurant performance	Dimension	Dim_Restaurant
Menu analysis	Dimension	Dim_Menu_Item
Location analysis	Dimension	Dim_Location
Delivery partner analysis	Dimension	Dim_Delivery_Partner
Payment analysis	Dimension	Dim_Payment

7️⃣ Fact & Dimension Tables List

⭐ Fact Table

Fact_Order_Sales

Grain: One row per menu item per order per day

Measures

quantity_ordered
item_sales_amount
discount_amount
delivery_fee
total_sales_amount
delivery_time_minutes

Foreign Keys

date_key
customer_key
restaurant_key
menu_item_key
delivery_partner_key
location_key
payment_key

⭐ Dimension Tables

Dim_Date

date_key (PK)
full_date
day
month
quarter
year
is_weekend

Dim_Customer

customer_key (PK)
customer_id
name
gender
age_group
city
signup_date
is_current (SCD2)

Dim_Restaurant

restaurant_key (PK)
restaurant_id
restaurant_name
cuisine_type
rating
is_active

Dim_Menu_Item

menu_item_key (PK)
item_name
category
veg_flag
price

Dim_Location

location_key (PK)
city
state
country
zone

Dim_Delivery_Partner

delivery_partner_key (PK)
partner_name
vehicle_type
experience_level

Dim_Payment

payment_key (PK)
payment_mode
payment_provider

8️⃣ Star Schema Structure ⭐

                 Dim_Date
                    |
Dim_Customer — Fact_Order_Sales — Dim_Restaurant
                    |
              Dim_Menu_Item
                    |
              Dim_Location
                    |
          Dim_Delivery_Partner
                    |
                Dim_Payment

9️⃣ Why STAR Schema is Preferred over Snowflake ⭐⭐⭐

Star Schema	Snowflake Schema
Fewer joins	More joins
Faster queries	Slower
Simple for BI	Complex
Business friendly	Technical
Easy maintenance	Harder

📌 Rule:

Use Snowflake only when dimensions are extremely large or reused.

🔟 Example Business Queries

1️⃣ Daily Revenue

SELECT d.full_date, SUM(f.total_sales_amount) revenue
FROM Fact_Order_Sales f
JOIN Dim_Date d ON f.date_key = d.date_key
GROUP BY d.full_date;

2️⃣ Top 5 Restaurants by Revenue

SELECT r.restaurant_name, SUM(f.total_sales_amount) revenue
FROM Fact_Order_Sales f
JOIN Dim_Restaurant r ON f.restaurant_key = r.restaurant_key
GROUP BY r.restaurant_name
ORDER BY revenue DESC
LIMIT 5;

3️⃣ Most Ordered Menu Items

SELECT m.item_name, SUM(f.quantity_ordered) qty
FROM Fact_Order_Sales f
JOIN Dim_Menu_Item m ON f.menu_item_key = m.menu_item_key
GROUP BY m.item_name
ORDER BY qty DESC;

1️⃣1️⃣ Common Mistakes ❌

Mixed grain in fact table
Storing customer name in fact
No date dimension
Snowflaking without reason
Missing surrogate keys
Overloading fact tables
No documentation of grain

🎯 Final Summary

OLTP → operations
Data Warehouse → analytics
Grain defines correctness
Star schema = performance + simplicity

Here’s a comprehensive SQL DDL for all six fact tables in the Food Delivery Data Warehouse, following best practices (surrogate keys, proper data types, foreign keys, and measures only in fact tables).

1️⃣ Fact_Order_Sales

CREATE TABLE Fact_Order_Sales (
    sales_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    customer_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    menu_item_key INT NOT NULL,
    delivery_partner_key INT,
    location_key INT,
    payment_key INT,
    order_id VARCHAR(50) NOT NULL,
    order_line_no INT NOT NULL,
    quantity_sold INT NOT NULL,
    item_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2),
    delivery_fee DECIMAL(10,2),
    total_sales_amount DECIMAL(12,2) NOT NULL,
    cost_amount DECIMAL(10,2),
    profit_amount AS (total_sales_amount - cost_amount) PERSISTED,
    CONSTRAINT fk_order_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES Dim_Customer(customer_key),
    CONSTRAINT fk_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_menu_item FOREIGN KEY (menu_item_key) REFERENCES Dim_Menu_Item(menu_item_key),
    CONSTRAINT fk_delivery_partner FOREIGN KEY (delivery_partner_key) REFERENCES Dim_Delivery_Partner(delivery_partner_key),
    CONSTRAINT fk_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key),
    CONSTRAINT fk_payment FOREIGN KEY (payment_key) REFERENCES Dim_Payment(payment_key)
);

2️⃣ Fact_Customer_Activity

CREATE TABLE Fact_Customer_Activity (
    activity_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    customer_key INT NOT NULL,
    orders_count INT NOT NULL,
    total_spend DECIMAL(12,2),
    avg_order_value DECIMAL(10,2),
    last_order_days INT,
    is_repeat_customer BIT,
    CONSTRAINT fk_cust_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES Dim_Customer(customer_key)
);

3️⃣ Fact_Order_Forecast

CREATE TABLE Fact_Order_Forecast (
    forecast_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    forecast_date_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    location_key INT NOT NULL,
    forecast_orders INT NOT NULL,
    forecast_revenue DECIMAL(12,2),
    model_version VARCHAR(50),
    CONSTRAINT fk_forecast_date FOREIGN KEY (forecast_date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_forecast_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_forecast_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

4️⃣ Fact_Menu_Item_Performance

CREATE TABLE Fact_Menu_Item_Performance (
    menu_item_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    menu_item_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    quantity_sold INT NOT NULL,
    item_revenue DECIMAL(12,2),
    item_cost DECIMAL(12,2),
    item_profit AS (item_revenue - item_cost) PERSISTED,
    popularity_rank INT,
    CONSTRAINT fk_mip_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_mip_menu_item FOREIGN KEY (menu_item_key) REFERENCES Dim_Menu_Item(menu_item_key),
    CONSTRAINT fk_mip_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key)
);

5️⃣ Fact_Season_Sales

CREATE TABLE Fact_Season_Sales (
    season_sales_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    season_key INT NOT NULL,
    location_key INT NOT NULL,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2),
    avg_delivery_time DECIMAL(6,2),
    CONSTRAINT fk_ss_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_ss_season FOREIGN KEY (season_key) REFERENCES Dim_Season(season_key),
    CONSTRAINT fk_ss_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

6️⃣ Fact_Delivery_Performance

CREATE TABLE Fact_Delivery_Performance (
    delivery_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    order_id VARCHAR(50) NOT NULL,
    delivery_partner_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    location_key INT,
    promised_time_minutes INT,
    actual_time_minutes INT,
    delay_minutes AS (actual_time_minutes - promised_time_minutes) PERSISTED,
    sla_met_flag BIT,
    CONSTRAINT fk_dp_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_dp_partner FOREIGN KEY (delivery_partner_key) REFERENCES Dim_Delivery_Partner(delivery_partner_key),
    CONSTRAINT fk_dp_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_dp_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

7️⃣ Fact_Store_Performance

CREATE TABLE Fact_Store_Performance (
    store_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2),
    avg_order_value DECIMAL(10,2),
    avg_rating DECIMAL(3,2),
    cancellation_rate DECIMAL(5,2),
    CONSTRAINT fk_sp_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_sp_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key)
);

✅ Notes on DDL Design

Surrogate Keys for all fact tables (BIGINT IDENTITY)
Foreign keys reference dimension tables
Derived measures calculated in the fact table (profit_amount, delay_minutes)
Granularity enforced by primary key + business rules
Numeric measures only (no descriptive attributes)

If you want, I can also write the SQL DDL for all dimension tables to make this a fully deployable DW schema.

Do you want me to do that next?

22JAN26 - Best practices for Data modeling 1

---

## Data Modeling Best Practices – Headings

1. Start with Business Requirements (Not Tables)

2. Define Grain First (Critical Rule)

3. Choose the Right Model Type

4. Use Clear Naming Conventions

5. Design Clean Dimension Tables

6. Keep Fact Tables Lean

7. Enforce Data Integrity

8. Handle Dates Properly

9. Optimize for Query Performance

10. Plan for Data Growth

11. Document Everything

12. Validate with Real Data

13. Version Control Your Models

14. Avoid Over-Normalization in Warehouses

15. Design for BI & Reporting Tools

---------------------------------------------------------------------------------------------------------------------------

## 1. Start with Business Requirements (Not Tables)

**Best Practice**

* Understand **what questions the business wants answered**, not just source schemas.

**Why**

* Prevents over-engineering and unused columns/tables.

**How**

* Identify KPIs, reports, dashboards

* Define **facts**, **dimensions**, and **grain**

* Write requirements in plain language

*Example: “Daily sales by product and region”*

---------------------------------------------------------------------------------------------------------------------------

## 2. Define Grain First (Critical Rule)

**Best Practice**

* Clearly define the **lowest level of detail** for each fact table.

**Why**

* Prevents aggregation errors and duplicate data.

**Example**

* ❌ “Sales table” (too vague)

* ✅ “One row per product, per order, per day”

---

---------------------------------------------------------------------------------------------------------------------------

## 3. Choose the Right Model Type

### a) OLTP (Transactional Systems)

* Use **3NF (Third Normal Form)**

* Minimize redundancy

* Optimize for **writes**

### b) OLAP / Data Warehouse

* Use **Dimensional Modeling**

* Star schema preferred

* Optimize for **reads & analytics**

**Rule of Thumb**

> Applications → normalized

> Analytics → dimensional

---

---------------------------------------------------------------------------------------------------------------------------

## 4. Use Clear Naming Conventions

**Best Practice**

* Use **business-friendly, consistent names**

**Examples**

* `customer_id` (not `custid`)

* `order_date` (not `dt1`)

* Tables: singular or plural—choose one and stick to it

**Why**

* Improves usability for analysts and BI tools.

---------------------------------------------------------------------------------------------------------------------------

## 5. Design Clean Dimension Tables

**Best Practices**

* Use **surrogate keys** (not natural keys)

* Keep dimensions **wide, not deep**

* Handle Slowly Changing Dimensions (SCD)

**SCD Types**

* Type 1 – overwrite (no history)

* Type 2 – add new row (full history) ⭐ most common

* Type 3 – limited history

---------------------------------------------------------------------------------------------------------------------------

## 6. Keep Fact Tables Lean

**Best Practices**

* Store only:

* Foreign keys

* Numeric measures

* Avoid text fields in facts

**Example**

* ✔ `sales_amount`, `quantity`

* ❌ `customer_name`, `product_description`

---

---------------------------------------------------------------------------------------------------------------------------

## 7. Enforce Data Integrity

**Best Practice**

* Use constraints where possible

**Examples**

* Primary keys

* Foreign keys

* NOT NULL for required fields

**Why**

* Prevents silent data corruption.

---

---------------------------------------------------------------------------------------------------------------------------

## 8. Handle Dates Properly

**Best Practice**

* Use a **Date Dimension**, not raw dates everywhere

**Benefits**

* Easy filtering by:

* Year

* Month

* Quarter

* Fiscal periods

* Holidays

---

---------------------------------------------------------------------------------------------------------------------------

## 9. Optimize for Query Performance

**Best Practices**

* Index:

* Foreign keys

* Frequently filtered columns

* Use partitioning for large fact tables

* Avoid unnecessary joins

**Cloud Tip (AWS)**

* Redshift / Athena → sort keys & distribution keys

* RDS → proper indexing and query plans

---

---------------------------------------------------------------------------------------------------------------------------

## 10. Plan for Data Growth

**Best Practice**

* Design for **scale from day one**

**How**

* Avoid hard-coding values

* Use surrogate keys

* Anticipate new dimensions & measures

---

---------------------------------------------------------------------------------------------------------------------------

## 11. Document Everything

**Best Practice**

* Maintain data model documentation

**Include**

* Table purpose

* Column definitions

* Grain

* Business rules

**Why**

* Reduces dependency on individuals.

---

---------------------------------------------------------------------------------------------------------------------------

## 12. Validate with Real Data

**Best Practice**

* Test models using **actual data volumes**

**Checks**

* Row counts

* Duplicates

* Aggregation accuracy

* Edge cases

---

---------------------------------------------------------------------------------------------------------------------------

## 13. Version Control Your Models

**Best Practice**

* Treat data models like code

**How**

* Store DDL in Git

* Track schema changes

* Use migration scripts

---

---------------------------------------------------------------------------------------------------------------------------

## 14. Avoid Over-Normalization in Warehouses

**Best Practice**

* Prefer **simplicity over theoretical purity**

**Why**

* BI tools and users prefer fewer joins.

---

---------------------------------------------------------------------------------------------------------------------------

## 15. Design for BI & Reporting Tools

**Best Practice**

* Models should be **self-service friendly**

**How**

* Meaningful column names

* Pre-calculated metrics where useful

* Consistent dimensions across facts (conformed dimensions)

---

---------------------------------------------------------------------------------------------------------------------------

## Golden Rule Summary

> **A good data model is easy to understand, hard to misuse, and flexible to change.**

------------------------------------------------------------------------------------------------------------------------------

Monday, January 19, 2026

20JAN26 - Enterprise Data Model (EDM) Steps

Enterprise Data Model (EDM) Steps

Enterprise Data Model (EDM) steps to create a data warehouse for northwind database with 5 different reports with granule and granularity

Data Warehouse Design for Northwind

(With Granule & Granularity)

1. Define Business Objective

Goal
Provide consistent enterprise-wide reporting for:

Sales
Products
Customers
Employees
Logistics

EDM Focus
Single version of truth across all reports.

2. Identify Business Processes

(Kimball + EDM aligned)

Process	Description
Order Processing	Customer orders and order lines
Sales Analysis	Revenue & quantity
Product Analysis	Product & category trends
Customer Analysis	Buying behavior
Shipping Analysis	Delivery performance

3. Define Enterprise Data Domains (EDM)

Domain	Entities
Sales	Orders, Order Details
Customer	Customers
Product	Products, Categories, Suppliers
Employee	Employees
Time	Date
Logistics	Shippers

4. Conceptual Enterprise Data Model (High-Level)

Core Entities

Customer
Order
Order Line
Product
Employee
Shipper
Date

Key Relationships

Customer → Order (1:M)
Order → Order Line (1:M)
Product → Order Line (1:M)

At EDM level, no granularity yet—only business meaning.

5. Identify the Fact & Define the Granule (MOST IMPORTANT STEP)

Fact Table: FactSales

Granule (Grain)

One record per product per order per day

This means:

Each row = 1 product line item in an order
Lowest level of detail available from source

Granularity

Daily
Order Line level
Transactional

Supports all summary and detailed reports
Prevents double counting

6. Logical EDM → Dimensional Model

Fact Table

FactSales

Attribute	Description
OrderID	Degenerate dimension
DateKey	Order date
CustomerKey	FK
ProductKey	FK
EmployeeKey	FK
ShipperKey	FK
Quantity	Units sold
UnitPrice	Price
Discount	Discount
SalesAmount	Calculated

Dimension Tables (Granularity Alignment)

Dimension	Grain
DimCustomer	One row per customer
DimProduct	One row per product
DimEmployee	One row per employee
DimDate	One row per day
DimShipper	One row per shipper

7. Granule vs Granularity (Clear Distinction)

Term	Meaning	Example
Granule	What one row represents	One product in one order
Granularity	Level of detail	Daily, transactional

8. Reports with Granule & Granularity Mapping

Report 1: Executive Sales Summary

Required Granularity

Monthly / Yearly

Derived From Granule

Aggregated from order-line daily grain

KPIs

Total Revenue
Total Orders
Average Order Value

Safe aggregation because base grain is atomic

Report 2: Product Performance Report

Granularity

Product × Month

KPIs

Revenue per Product
Units Sold
Category Contribution %

Uses

DimProduct
FactSales

Report 3: Customer Analysis Report

Granularity

Customer × Quarter

KPIs

Customer Lifetime Value
Avg Purchase Frequency
Repeat Purchase Rate

Grain Dependency

Needs order-line detail to avoid duplication

Report 4: Employee Sales Performance

Granularity

Employee × Month

KPIs

Revenue per Employee
Orders Handled
Avg Discount %

Grain Rule

Employee linked at order level, rolled up from line items.

Report 5: Shipping & Logistics Report

Granularity

Shipper × Month

KPIs

Avg Shipping Days
On-Time Delivery %
Orders Shipped

Grain

Calculated using order-line facts + order dates

9. Why Lowest Granularity Wins (EDM Rule)

Benefit	Explanation
Flexibility	Supports future reports
Accuracy	No aggregation loss
Performance	Can pre-aggregate later
Governance	Single business definition

10. EDM Traceability Matrix

EDM Entity	DW Object	Grain
Order	FactSales	Order Line
Product	DimProduct	Product
Customer	DimCustomer	Customer
Employee	DimEmployee	Employee
Shipper	DimShipper	Shipper

11. Final EDM-Driven Architecture

Northwind OLTP
     ↓
 Staging (Raw Data)
     ↓
 Enterprise Data Warehouse
     ↓
 Star Schema (FactSales)
     ↓
 Aggregates / Data Marts
     ↓
 Reports & KPIs

Key Exam / Interview Takeaway

Always define the grain before building the fact table.
Granule = what one row means
Granularity = level of detail

Dimensional Data Modeling Steps

Dimensional Data Modeling Steps – Northwind Data Warehouse

1. Understand Business Requirements

Identify business users (Sales, Finance, Operations).
Define key business questions (sales performance, customer behavior, product demand).
Finalize reporting needs and KPIs.

2. Identify Business Processes

Typical Northwind business processes:

Sales Orders
Product Sales
Shipping & Logistics
Customer Transactions
Supplier Procurement

3. Declare the Grain (Most Important Step)

Decide exactly what a single row in each fact table represents
Example:
- One row per Order Line per Product per Day

This decision controls:

Level of detail
Storage size
Reporting flexibility

4. Identify Dimensions

Common dimensions from Northwind:

Date
Customer
Product
Employee
Supplier
Shipper
Category
Geography (Country, City, Region)

5. Identify Facts (Measures)

Numeric, additive values:

Sales Amount
Quantity Sold
Discount Amount
Freight Cost
Order Count

6. Design Star Schema

Fact table in the center
Dimension tables around it
Avoid snowflaking (keep dimensions denormalized)

7. Define Surrogate Keys

Create surrogate keys for all dimensions
Maintain Slowly Changing Dimensions (SCD Type 1 / Type 2 where required)

8. Design ETL Process

Extract data from Northwind OLTP
Clean, transform, and conform data
Load dimensions first, then facts

9. Validate with Sample Reports

Ensure the model supports all required reports
Check aggregations, filters, and drill-downs

10. Deploy & Optimize

Index fact tables
Partition by Date
Enable BI tools (Power BI / Tableau / QuickSight)

B. Core Star Schema (Northwind)

Fact Table

Fact_Sales

Column	Description
Date_Key
Product_Key
Customer_Key
Employee_Key
Shipper_Key
Quantity
Unit_Price
Discount
Sales_Amount
Freight

Dimension Tables

Dim_Date
Dim_Product
Dim_Customer
Dim_Employee
Dim_Shipper
Dim_Supplier
Dim_Category

C. Five Business Reports with Grain & Granularity

1. Sales Performance Report

Business Question:
How much revenue is generated over time?

Fact Table: Fact_Sales
Dimensions: Date, Product, Category

Grain (Granule):
One row per Product per Day

Granularity:

Daily → Monthly → Yearly

Measures:

Total Sales Amount
Total Quantity Sold

2. Customer Purchase Behavior Report

Business Question:
Which customers generate the most revenue?

Fact Table: Fact_Sales
Dimensions: Customer, Date, Geography

Grain:
One row per Customer per Order per Day

Granularity:

Order level → Monthly → Yearly

Measures:

Sales Amount
Order Count

3. Product Performance Report

Business Question:
Which products are best and worst sellers?

Fact Table: Fact_Sales
Dimensions: Product, Category, Supplier

Grain:
One row per Product per Order Line

Granularity:

Order Line → Product → Category

Measures:

Quantity Sold
Revenue
Discount Amount

4. Employee Sales Contribution Report

Business Question:
How well are employees performing in sales?

Fact Table: Fact_Sales
Dimensions: Employee, Date, Region

Grain:
One row per Employee per Order

Granularity:

Order → Month → Year

Measures:

Total Sales
Orders Handled

5. Shipping & Logistics Cost Report

Business Question:
What is the shipping cost trend by shipper?

Fact Table: Fact_Sales
Dimensions: Shipper, Date, Geography

Grain:
One row per Order per Shipper per Day

Granularity:

Daily → Monthly → Yearly

Measures:

Freight Cost
Number of Shipments

D. Summary Table

Report	Grain	Granularity
Sales Performance	Product per Day	Day → Month → Year
Customer Behavior	Customer per Order	Order → Month
Product Performance	Product per Order Line	Line → Product
Employee Performance	Employee per Order	Order → Year
Shipping Cost	Order per Shipper	Day → Year

PAGES

Thursday, January 29, 2026

Can you create a Physical model without a Logical one?

Sunday, January 25, 2026

Tool used for data profiling

Step-by-step: How I do data profiling in Informatica Data Quality

Step 1: Ingest the data file

Step 2: Run Column Profiling

Step 3: Run Data Domain & Pattern Analysis

Step 4: Identify Duplicates

Step 5: Apply Business Rules

Step 6: Generate Data Quality Score

Step 7: Create Data Quality Report

Step 8: Go / No-Go decision

How I ensure the file is in good shape

Strong closing line (interview gold)

Friday, January 23, 2026

Data Model (Step-by-Step)

Supermarket Sales Analytics – Requirement to Data Model (Step-by-Step)

STEP 1: Understand the Business Context

Business Overview

Business Goals

STEP 2: Requirement Gathering – How to Ask Questions

🔹 Core Requirement Questions (Very Important)

A. Business Objective

B. Sales-Related Questions

C. Time & Comparison

D. Hierarchy & Drill-down

E. Data Availability

STEP 3: Identify Required Reports (5 Reports)

✅ Final Agreed Reports

STEP 4: Define KPIs for Each Report

📊 Report 1: Sales Performance

👥 Report 2: Customer Analysis

🌍 Report 3: Regional Performance

🧑‍💼 Report 4: Employee Commission

📦 Report 5: Product in Demand

STEP 5: Define Grain, Granularity & Granules

🔹 Fact Table Grain (Very Important)

🔹 Granularity (Analysis Levels)

🔹 Granules (Lowest Identifiers)

STEP 6: Choose the Data Model

✅ Recommended Model: Star Schema (with selective Snowflake)

STEP 7: Identify Fact Tables

🔹 Primary Fact Table

STEP 8: Identify Dimension Tables

🔹 Dimensions Used Across Reports

STEP 9: Data Model per Report (Mapping)

📊 Sales Performance

👥 Customer Analysis

🌍 Regional Performance

🧑‍💼 Employee Commission

📦 Product in Demand

STEP 10: Data Modeling Steps (Execution)

🧠 Final Reality Check (Very Important)

🎤 Interview-Ready Closing Line

Wednesday, January 21, 2026

Scenario for Snowflake Model

🔍 Comparison Snapshot (TOP)

⭐ Star Schema

❄️ Snowflake Schema

🎯 What the Company Wants to View in Reports

Business Context

📊 Report Requirement 1

“Sales by Department → Category → SubCategory”

🔹 Grain, Granularity & KPIs (Explained)

Grain (Fact Table Grain)

Granularity (Reporting Levels)

Key KPIs Required

❌ Why This Is Weak in Star Schema

✅ Why Snowflake Works

📊 Report Requirement 3

“Impact Analysis When Categories Change”

❌ Star Schema Issue

✅ Snowflake Solution

🧠 Core Reason Star Is Not Enough

🧾 One-Line Business Justification

What is a Surrogate Key?

2️⃣ When to Use Surrogate Keys

3️⃣ When NOT to Use Surrogate Keys