Thursday, January 29, 2026

29jan26 - Physical model without a Logical one?

 

Can you create a Physical model without a Logical one?

The short answer: Yes. Technically, you can jump straight into a Physical Data Model (PDM) and start writing SQL code to create tables.

The long answer: You probably shouldn't.

While "skipping to the end" might seem faster, it often leads to significant issues:

| Feature | Physical Model Only (Skipping LDM) | Using Both LDM and PDM |
| --- | --- | --- |
| Flexibility | Hard to change; tied to a specific database (e.g., PostgreSQL). | Easy to migrate or adapt to new tech. |
| Communication | Difficult for non-tech stakeholders to read. | Clear and understandable for business users. |
| Data Integrity | Prone to "missing" business rules or logic. | Ensures business rules are baked into the design. |
| Scalability | Often results in "patchwork" fixes later. | Built on a solid, future-proof foundation. |

> If you want to change the database software in the future, the LDM is already in place, so it is easy to generate a new PDM and create the tables.
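
To make the LDM → PDM relationship concrete, here is a minimal sketch of how one logical entity might be implemented physically in PostgreSQL. The entity, attribute names, and types are illustrative assumptions, not a prescribed design:

-- Logical entity (database-agnostic): Customer (CustomerID, FullName, Email, SignupDate)
-- Physical implementation sketch for PostgreSQL (hypothetical names and types)
CREATE TABLE customer (
    customer_id  BIGSERIAL PRIMARY KEY,            -- PostgreSQL-specific auto-increment
    full_name    VARCHAR(200) NOT NULL,
    email        VARCHAR(320) UNIQUE,
    signup_date  DATE NOT NULL DEFAULT CURRENT_DATE
);

If the platform changes later, only this physical layer is rewritten; the logical entity definition stays the same.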

Sunday, January 25, 2026

25jan26 - how to do profiling: which tool and how

An interview-ready walkthrough focused on one third-party tool, Informatica Data Quality, with step-by-step profiling and reporting. You can say this almost verbatim in an interview.


Tool used for data profiling

I use Informatica Data Quality (IDQ) to profile data files and generate data quality reports before loading them into the database.


Step-by-step: How I do data profiling in Informatica Data Quality

Step 1: Ingest the data file

  • Connect the source file (CSV / Excel / flat file / S3 / database)

  • Define the source metadata in IDQ

  • Validate column names and data types

📌 Purpose: Ensure the file is readable and structurally correct.


Step 2: Run Column Profiling

  • Use Column Profile in Informatica Analyst

  • Analyze each column for:

    • Data type distribution

    • Null and blank percentage

    • Min / Max values

    • Distinct count

    • Value frequency

📌 Purpose: Understand the content and detect obvious issues.
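
IDQ generates these statistics automatically in the Analyst profile. As a rough SQL equivalent (not the IDQ implementation), the sketch below computes the same statistics for one column of a hypothetical staging table stg_orders:

-- Approximate column profile for order_amount on a hypothetical stg_orders table
SELECT
    COUNT(*)                                                                 AS row_count,
    SUM(CASE WHEN order_amount IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS null_pct,
    COUNT(DISTINCT order_amount)                                             AS distinct_count,
    MIN(order_amount)                                                        AS min_value,
    MAX(order_amount)                                                        AS max_value
FROM stg_orders;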


Step 3: Run Data Domain & Pattern Analysis

  • Apply data domains (date, email, phone, numeric)

  • Use pattern analysis to detect invalid formats

Examples:

  • Invalid email formats

  • Date columns stored as strings

  • Mixed data types in one column

📌 Purpose: Validate format and consistency.


Step 4: Identify Duplicates

  • Use duplicate analysis on key fields

  • Identify exact and fuzzy duplicates (if needed)

Example:

  • Same order_id appearing multiple times

  • Same customer with slight name variations

📌 Purpose: Prevent double counting and incorrect metrics.
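
Outside IDQ, the exact-duplicate case can be sketched in plain SQL against the same hypothetical staging table (fuzzy matching on name variations is where IDQ's match transformations add real value):

-- Exact duplicates on the business key order_id (hypothetical table and column)
SELECT order_id, COUNT(*) AS occurrences
FROM stg_orders
GROUP BY order_id
HAVING COUNT(*) > 1;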


Step 5: Apply Business Rules

  • Create business rule transformations

  • Examples:

    • Order amount > 0

    • Delivery time between 0 and 180 minutes

    • Order status ∈ allowed values

📌 Purpose: Ensure data follows business logic, not just technical rules.
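
In IDQ these are rule transformations; as a plain-SQL sketch (hypothetical column names and allowed status values), the same rules look like:

-- Rows violating each business rule, counted in one pass over stg_orders
SELECT
    SUM(CASE WHEN order_amount <= 0 THEN 1 ELSE 0 END)                                        AS bad_amount,
    SUM(CASE WHEN delivery_time_minutes NOT BETWEEN 0 AND 180 THEN 1 ELSE 0 END)              AS bad_delivery_time,
    SUM(CASE WHEN order_status NOT IN ('PLACED', 'DELIVERED', 'CANCELLED') THEN 1 ELSE 0 END) AS bad_status
FROM stg_orders;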


Step 6: Generate Data Quality Score

  • Assign weights to rules (critical vs non-critical)

  • Calculate overall data quality score

  • Categorize issues:

    • Critical

    • Warning

    • Informational

📌 Purpose: Measure readiness of the file.
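
The weighted score itself is simple arithmetic. Assuming each rule's pass rate and weight have already been collected into a hypothetical summary table rule_results(rule_name, pass_rate, weight):

-- Overall data quality score = weighted average of rule pass rates
SELECT SUM(pass_rate * weight) / SUM(weight) AS data_quality_score
FROM rule_results;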


Step 7: Create Data Quality Report

  • Generate profiling reports from Informatica Analyst

  • Report includes:

    • Column statistics

    • Failed rules

    • Duplicate counts

    • Data quality score

📌 Purpose: Provide transparency to stakeholders.


Step 8: Go / No-Go decision

  • If critical rules fail → Reject the file

  • Notify source system / upstream team

  • Reload only after correction

📌 Purpose: Prevent bad data entering the warehouse.


How I ensure the file is in good shape

A file is considered ready when schema validation passes, nulls and duplicates are within thresholds, business rules are satisfied, and the overall data quality score meets acceptance criteria.


Strong closing line (interview gold)

“Using Informatica Data Quality, I profile the data structurally, statistically, and against business rules, generate a data quality report, and enforce Go/No-Go criteria before loading the data.”


 


Friday, January 23, 2026

23jan26 - Data Modeling - All Steps - With KPIs and Requirement Gathering


Data Model (Step-by-Step)


 Supermarket Sales Analytics – Requirement to Data Model (Step-by-Step)


STEP 1: Understand the Business Context

Business Overview

A supermarket chain operates:

  • Multiple stores across regions

  • Thousands of products

  • Sales staff earning commissions

  • Loyalty customers

Business Goals

  • Increase sales & profitability

  • Identify high-demand products

  • Improve regional performance

  • Track employee commission

  • Understand customer behavior


STEP 2: Requirement Gathering – How to Ask Questions

🔹 Core Requirement Questions (Very Important)

A. Business Objective

Ask:

  • What decisions will this report support?

  • Who will use this report (CEO, Manager, Ops)?

  • Is this for monitoring or deep analysis?


B. Sales-Related Questions

Ask:

  • Do you want daily, monthly, or yearly sales?

  • Should discounts be included in revenue?

  • Do you want gross margin or only sales?


C. Time & Comparison

Ask:

  • Do you need YoY / MoM comparison?

  • How many years of history?


D. Hierarchy & Drill-down

Ask:

  • Region → Store → City?

  • Category → Subcategory → Product?


E. Data Availability

Ask:

  • Is cost available for margin?

  • Is employee commission stored or calculated?


STEP 3: Identify Required Reports (5 Reports)

✅ Final Agreed Reports

  1. Sales Performance Report

  2. Customer Analysis Report

  3. Regional Performance Report

  4. Employee Commission Report

  5. Product in Demand Report


STEP 4: Define KPIs for Each Report


📊 Report 1: Sales Performance

Business Questions

  • How much are we selling?

  • Are sales increasing?

KPIs

  • Total Sales Amount

  • Quantity Sold

  • Discount Amount

  • Gross Margin

  • Sales Growth %


👥 Report 2: Customer Analysis

Business Questions

  • Who are our top customers?

  • Are customers returning?

KPIs

  • Total Spend per Customer

  • Purchase Frequency

  • Average Basket Value

  • Customer Lifetime Value (basic)


🌍 Report 3: Regional Performance

Business Questions

  • Which regions perform best?

  • Where are we underperforming?

KPIs

  • Sales by Region

  • Sales Growth %

  • Contribution %

  • Store Count


🧑‍💼 Report 4: Employee Commission

Business Questions

  • How much commission is earned?

  • Who are top performers?

KPIs

  • Total Sales by Employee

  • Commission Amount

  • Commission %

  • Rank by Sales


📦 Report 5: Product in Demand

Business Questions

  • What products sell the most?

  • Which products should be stocked more?

KPIs

  • Quantity Sold

  • Sales Amount

  • Sales Frequency

  • Stock Turnover (if inventory exists)


STEP 5: Define Grain, Granularity & Granules


🔹 Fact Table Grain (Very Important)

One row per Product per Store per Day per Employee

This supports all 5 reports.


🔹 Granularity (Analysis Levels)

| Dimension | Granularity Levels |
| --- | --- |
| Time | Day → Month → Year |
| Product | Product → SubCategory → Category |
| Geography | Store → City → Region |
| Customer | Customer |
| Employee | Employee |

🔹 Granules (Lowest Identifiers)

Granules are the keys that define the grain:

Date_Key Product_Key Store_Key Employee_Key Customer_Key

Each unique combination = 1 row


STEP 6: Choose the Data Model

✅ Recommended Model: Star Schema (with selective Snowflake)

Why?

  • BI friendly

  • Easy reporting

  • Snowflake only where hierarchy is complex (Product, Region)


STEP 7: Identify Fact Tables

🔹 Primary Fact Table

FACT_SALES

Measures:

  • Sales_Amount

  • Quantity

  • Discount

  • Cost

  • Commission_Amount


STEP 8: Identify Dimension Tables

🔹 Dimensions Used Across Reports

| Dimension | Purpose |
| --- | --- |
| DIM_DATE | Time analysis |
| DIM_PRODUCT | Product details |
| DIM_CATEGORY (Snowflake) | Product hierarchy |
| DIM_STORE | Store info |
| DIM_REGION (Snowflake) | Geography hierarchy |
| DIM_CUSTOMER | Customer analysis |
| DIM_EMPLOYEE | Commission tracking |

STEP 9: Data Model per Report (Mapping)


📊 Sales Performance

FACT_SALES → DIM_DATE → DIM_PRODUCT → DIM_STORE

👥 Customer Analysis

FACT_SALES → DIM_CUSTOMER → DIM_DATE

🌍 Regional Performance

FACT_SALES → DIM_STORE → DIM_REGION → DIM_DATE

🧑‍💼 Employee Commission

FACT_SALES → DIM_EMPLOYEE → DIM_DATE

📦 Product in Demand

FACT_SALES → DIM_PRODUCT → DIM_CATEGORY → DIM_DATE
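
As a quick validation that the mappings support the KPIs, here is a sketch query for the Sales Performance report (column names such as Year, Month, and Store_Name are assumptions):

-- Monthly sales, quantity, and gross margin by store
SELECT d.Year, d.Month, s.Store_Name,
       SUM(f.Sales_Amount)          AS Total_Sales,
       SUM(f.Quantity)              AS Quantity_Sold,
       SUM(f.Sales_Amount - f.Cost) AS Gross_Margin
FROM FACT_SALES f
JOIN DIM_DATE  d ON f.Date_Key  = d.Date_Key
JOIN DIM_STORE s ON f.Store_Key = s.Store_Key
GROUP BY d.Year, d.Month, s.Store_Name;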

STEP 10: Data Modeling Steps (Execution)

  1. Finalize grain

  2. List KPIs & measures

  3. Identify dimensions

  4. Define hierarchies

  5. Design star/snowflake schema

  6. Assign surrogate keys

  7. Validate against reports

  8. Review with business

  9. Freeze model

  10. Build ETL


🧠 Final Reality Check (Very Important)

Reports drive KPIs → KPIs drive grain → Grain drives data model

Never design tables first.


🎤 Interview-Ready Closing Line

“I start with business questions, derive KPIs, define the grain, and then design a star or snowflake schema that supports all required reports with minimal redundancy.”

Wednesday, January 21, 2026

22jan26 - when to use snowflake model

 

Scenario for Snowflake Model


🔍 Comparison Snapshot (TOP)

⭐ Star Schema

FACT → DIM_PRODUCT (flat)

❄️ Snowflake Schema

FACT → Product → SubCategory → Category → Department

🎯 What the Company Wants to View in Reports

Business Context

A large retail enterprise wants enterprise-level analytical reporting, not just simple dashboards.

They want reports that:

  • Work across millions of products

  • Follow official category hierarchies

  • Support frequent hierarchy changes

  • Match master data definitions


📊 Report Requirement 1

“Sales by Department → Category → SubCategory”

Business Expectation

  • View sales at Department level

  • Drill down to Category

  • Drill further to SubCategory

  • Restructure categories without reloading all product data


🔹 Grain, Granularity & KPIs (Explained)

Grain (Fact Table Grain)

One row per Product per Store per Day

This defines the lowest level at which data is stored in the fact table.


Granularity (Reporting Levels)

Using the snowflaked hierarchy, the business can analyze data at multiple levels:

  • Department level – Executive overview

  • Category level – Business unit performance

  • SubCategory level – Merchandising analysis

  • Product level – SKU-level analysis

This is possible without changing the fact table, only by navigating dimension hierarchies.


Key KPIs Required

  • Total Sales Amount

  • Total Quantity Sold

  • Gross Margin

  • Average Selling Price (ASP)

  • Discount Amount

  • Sales Growth %

These KPIs aggregate cleanly across all hierarchy levels.


❌ Why This Is Weak in Star Schema

Flat Product Dimension

DIM_PRODUCT
Product_Key | Product_Name | SubCategory | Category | Department

Problems

  • Category and Department values are repeated for every product

  • If a Category moves to a new Department →
    ❗ millions of product rows must be updated

  • High risk of inconsistent category names

  • Hard to audit hierarchy changes over time

Result

  • Maintenance-heavy

  • Error-prone

  • Poor governance


✅ Why Snowflake Works

FACT → Product → SubCategory → Category → Department

Benefits

  • Category and Department stored once

  • Reclassification requires updating only hierarchy tables

  • Clean drill-down and roll-up paths

  • Enterprise-controlled hierarchy


📊 Report Requirement 3

“Impact Analysis When Categories Change”

Business Expectation

  • When a category is moved or renamed, reports should:

    • Update instantly

    • Not require massive data reloads


❌ Star Schema Issue

  • Category change = product dimension reload

  • Long ETL windows

  • High operational risk


✅ Snowflake Solution

  • Update one hierarchy table

  • Fact data remains untouched

  • Faster, safer changes


🧠 Core Reason Star Is Not Enough

Star schema collapses hierarchy into flat dimensions, which breaks scalability, governance, and change management when hierarchies are large and dynamic.


🧾 One-Line Business Justification

“Snowflake schema enables governed, hierarchical reporting that a flat star model cannot support efficiently at enterprise scale.”


22jan26 - Surrogate keys - when to use and not use

 

1️⃣ What is a Surrogate Key?

  • A surrogate key is an artificial, system-generated unique identifier for a row in a table.

  • Usually integer or bigint, auto-incremented or generated by the system.

  • Not derived from business data (unlike natural keys).

Example:

customer_key INT IDENTITY(1,1) -- surrogate key

2️⃣ When to Use Surrogate Keys

| Scenario | Reason / Benefit | Example |
| --- | --- | --- |
| Data Warehouse / Fact-Dimension Modeling | Allows consistent foreign key joins, supports slowly changing dimensions (SCD), avoids issues when natural keys change. | Dim_Customer with customer_key used in Fact_Order_Sales |
| Slowly Changing Dimensions (Type 2) | A surrogate key lets you track history without altering historical fact data. | Customer changes address → new surrogate key row in Dim_Customer |
| Composite / Complex Natural Keys | Avoid long, multi-column foreign keys. | Order detail identified by order_id + line_number → simpler to use order_line_key |
| Integration from Multiple Systems | Different source systems may use different IDs → surrogate key standardizes references. | Merging customer_id from two CRMs |
| Performance / Indexing | Integer surrogate keys are more efficient for joins than long string keys. | Joining Dim_Product.product_key to the fact table instead of product_sku |

3️⃣ When NOT to Use Surrogate Keys

| Scenario | Reason / Alternative |
| --- | --- |
| Transactional OLTP tables where the natural key is stable and simple | Surrogate key adds no value; the natural key can uniquely identify the row. |
| When the key has real-world meaning and is immutable | Using a surrogate key may cause unnecessary joins and duplication. |
| Small reference tables with static values | Surrogate key adds complexity without benefit. |
| Temporary / staging tables | Surrogate keys are unnecessary; use the natural key or a row hash. |
| Where the natural key is already numeric, short, and stable | Surrogate key is redundant. |

4️⃣ Key Guidelines

  1. Always use surrogate keys in data warehouse dimensions and fact table foreign keys.

  2. Avoid surrogate keys in OLTP tables with stable, meaningful natural keys.

  3. Never mix natural keys and surrogate keys for joins in the same context—it confuses lineage.

  4. Surrogate keys are particularly helpful for history tracking (SCD2) and integration of multiple sources
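
A small sketch of guideline 4 in SQL Server-style DDL (the customer_id 'C1001' and the column names are hypothetical): the surrogate key lets two versions of the same customer coexist, so old facts keep pointing at the old row.

-- SCD Type 2 dimension: customer_key is the surrogate key, customer_id the natural key
CREATE TABLE Dim_Customer (
    customer_key   INT IDENTITY(1,1) PRIMARY KEY,
    customer_id    VARCHAR(20) NOT NULL,
    customer_name  VARCHAR(100),
    address        VARCHAR(200),
    effective_date DATE NOT NULL,
    end_date       DATE NULL,
    is_current     BIT NOT NULL
);

-- Customer changes address: expire the current row, insert a new version
UPDATE Dim_Customer
SET end_date = '2026-01-22', is_current = 0
WHERE customer_id = 'C1001' AND is_current = 1;

INSERT INTO Dim_Customer (customer_id, customer_name, address, effective_date, end_date, is_current)
VALUES ('C1001', 'Asha Rao', 'New address, Pune', '2026-01-22', NULL, 1);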

22jan26 - Food Delivery Business - Data Modeling from OLTP to Star Schema - V1

 Below is a complete, end-to-end explanation for a Food Delivery Business, written exactly the way it is expected in projects, design documents, interviews, and architecture reviews.


🍔 Food Delivery Business – Data Modeling from OLTP to Star Schema


1️⃣ Transaction Model (Enterprise / OLTP Model)

🎯 Purpose

  • Support day-to-day operations

  • Optimized for fast inserts & updates

  • Highly normalized (3NF)


📋 Core Transaction Tables

Customers

  • customer_id (PK)

  • name

  • phone

  • email

  • address_id

  • created_date

Restaurants

  • restaurant_id (PK)

  • restaurant_name

  • cuisine_type

  • city

  • rating

Menu_Items

  • menu_item_id (PK)

  • restaurant_id (FK)

  • item_name

  • category

  • price

  • is_available

Orders

  • order_id (PK)

  • customer_id (FK)

  • restaurant_id (FK)

  • order_datetime

  • order_status

  • total_amount

Order_Items

  • order_item_id (PK)

  • order_id (FK)

  • menu_item_id (FK)

  • quantity

  • item_price

Delivery_Partners

  • delivery_partner_id (PK)

  • name

  • vehicle_type

  • join_date

Payments

  • payment_id (PK)

  • order_id (FK)

  • payment_mode

  • payment_status

  • paid_amount
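
A minimal DDL sketch for two of these tables (SQL Server-style types, assuming Customers, Restaurants, and Menu_Items already exist with integer primary keys), showing the normalized Orders → Order_Items parent-child relationship:

CREATE TABLE Orders (
    order_id       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_id    INT NOT NULL REFERENCES Customers(customer_id),
    restaurant_id  INT NOT NULL REFERENCES Restaurants(restaurant_id),
    order_datetime DATETIME NOT NULL,
    order_status   VARCHAR(20) NOT NULL,
    total_amount   DECIMAL(12,2) NOT NULL
);

CREATE TABLE Order_Items (
    order_item_id  BIGINT IDENTITY(1,1) PRIMARY KEY,
    order_id       BIGINT NOT NULL REFERENCES Orders(order_id),
    menu_item_id   INT NOT NULL REFERENCES Menu_Items(menu_item_id),
    quantity       INT NOT NULL,
    item_price     DECIMAL(10,2) NOT NULL
);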


2️⃣ Need for Reporting & Analytics

📊 Business Questions

  • Daily / monthly revenue?

  • Top restaurants by sales?

  • Most ordered food items?

  • Average delivery time?

  • Customer repeat rate?

  • City-wise performance?

  • Partner productivity?


3️⃣ Problems Reporting Directly from Transaction Model ❌

| Problem | Explanation |
| --- | --- |
| Too many joins | 8–10 joins per report |
| Poor performance | Tables optimized for writes |
| Complex queries | Hard for BI tools |
| Mixed grain | Orders vs order items |
| No historical tracking | Overwritten data |
| Business logic scattered | Revenue logic repeated |

📌 Conclusion:

Transaction models are NOT designed for analytics.


4️⃣ Why We Need a Data Warehouse ✅

Benefits

  • Optimized for read & aggregation

  • Simple query structure

  • Historical data support

  • Business-friendly schema

  • Faster dashboards


5️⃣ Define Grain & Granularity (MOST IMPORTANT)

🎯 Reporting Focus

Analyze food orders and revenue


✅ Fact Table Grain

ONE ROW per menu item per order per day

Formally:

Order × Menu Item × Date

📌 Granularity

  • Lowest level: Order Item

  • Time level: Day

  • Allows roll-up to:

    • Order

    • Restaurant

    • City

    • Month / Year


6️⃣ Map Business Requirements to Fact & Dimension Tables

| Business Requirement | Table Type | Table Name |
| --- | --- | --- |
| Revenue analysis | Fact | Fact_Order_Sales |
| Time-based analysis | Dimension | Dim_Date |
| Customer analysis | Dimension | Dim_Customer |
| Restaurant performance | Dimension | Dim_Restaurant |
| Menu analysis | Dimension | Dim_Menu_Item |
| Location analysis | Dimension | Dim_Location |
| Delivery partner analysis | Dimension | Dim_Delivery_Partner |
| Payment analysis | Dimension | Dim_Payment |

7️⃣ Fact & Dimension Tables List

Fact Table

Fact_Order_Sales

Grain: One row per menu item per order per day

Measures

  • quantity_ordered

  • item_sales_amount

  • discount_amount

  • delivery_fee

  • total_sales_amount

  • delivery_time_minutes

Foreign Keys

  • date_key

  • customer_key

  • restaurant_key

  • menu_item_key

  • delivery_partner_key

  • location_key

  • payment_key


Dimension Tables

Dim_Date

  • date_key (PK)

  • full_date

  • day

  • month

  • quarter

  • year

  • is_weekend


Dim_Customer

  • customer_key (PK)

  • customer_id

  • name

  • gender

  • age_group

  • city

  • signup_date

  • is_current (SCD2)


Dim_Restaurant

  • restaurant_key (PK)

  • restaurant_id

  • restaurant_name

  • cuisine_type

  • rating

  • is_active


Dim_Menu_Item

  • menu_item_key (PK)

  • item_name

  • category

  • veg_flag

  • price


Dim_Location

  • location_key (PK)

  • city

  • state

  • country

  • zone


Dim_Delivery_Partner

  • delivery_partner_key (PK)

  • partner_name

  • vehicle_type

  • experience_level


Dim_Payment

  • payment_key (PK)

  • payment_mode

  • payment_provider


8️⃣ Star Schema Structure ⭐

    Dim_Date       Dim_Customer      Dim_Restaurant
         \               |                /
          +------- Fact_Order_Sales ------+
         /          /           \          \
 Dim_Menu_Item  Dim_Location  Dim_Delivery_Partner  Dim_Payment

9️⃣ Why STAR Schema is Preferred over Snowflake ⭐⭐⭐

| Star Schema | Snowflake Schema |
| --- | --- |
| Fewer joins | More joins |
| Faster queries | Slower |
| Simple for BI | Complex |
| Business friendly | Technical |
| Easy maintenance | Harder |

📌 Rule:

Use Snowflake only when dimensions are extremely large or reused.


🔟 Example Business Queries

1️⃣ Daily Revenue

SELECT d.full_date, SUM(f.total_sales_amount) revenue
FROM Fact_Order_Sales f
JOIN Dim_Date d ON f.date_key = d.date_key
GROUP BY d.full_date;

2️⃣ Top 5 Restaurants by Revenue

SELECT TOP 5 r.restaurant_name, SUM(f.total_sales_amount) revenue
FROM Fact_Order_Sales f
JOIN Dim_Restaurant r ON f.restaurant_key = r.restaurant_key
GROUP BY r.restaurant_name
ORDER BY revenue DESC;

3️⃣ Most Ordered Menu Items

SELECT m.item_name, SUM(f.quantity_ordered) qty
FROM Fact_Order_Sales f
JOIN Dim_Menu_Item m ON f.menu_item_key = m.menu_item_key
GROUP BY m.item_name
ORDER BY qty DESC;

1️⃣1️⃣ Common Mistakes ❌

  1. Mixed grain in fact table

  2. Storing customer name in fact

  3. No date dimension

  4. Snowflaking without reason

  5. Missing surrogate keys

  6. Overloading fact tables

  7. No documentation of grain


🎯 Final Summary

  • OLTP → operations

  • Data Warehouse → analytics

  • Grain defines correctness

  • Star schema = performance + simplicity


Here’s a comprehensive SQL DDL for all seven fact tables in the Food Delivery Data Warehouse, following best practices (surrogate keys, proper data types, foreign keys, and measures only in fact tables).


1️⃣ Fact_Order_Sales

CREATE TABLE Fact_Order_Sales (
    sales_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    customer_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    menu_item_key INT NOT NULL,
    delivery_partner_key INT,
    location_key INT,
    payment_key INT,
    order_id VARCHAR(50) NOT NULL,
    order_line_no INT NOT NULL,
    quantity_ordered INT NOT NULL,
    item_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2),
    delivery_fee DECIMAL(10,2),
    total_sales_amount DECIMAL(12,2) NOT NULL,
    cost_amount DECIMAL(10,2),
    profit_amount AS (total_sales_amount - cost_amount) PERSISTED,
    CONSTRAINT fk_order_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES Dim_Customer(customer_key),
    CONSTRAINT fk_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_menu_item FOREIGN KEY (menu_item_key) REFERENCES Dim_Menu_Item(menu_item_key),
    CONSTRAINT fk_delivery_partner FOREIGN KEY (delivery_partner_key) REFERENCES Dim_Delivery_Partner(delivery_partner_key),
    CONSTRAINT fk_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key),
    CONSTRAINT fk_payment FOREIGN KEY (payment_key) REFERENCES Dim_Payment(payment_key)
);

2️⃣ Fact_Customer_Activity

CREATE TABLE Fact_Customer_Activity (
    activity_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    customer_key INT NOT NULL,
    orders_count INT NOT NULL,
    total_spend DECIMAL(12,2),
    avg_order_value DECIMAL(10,2),
    last_order_days INT,
    is_repeat_customer BIT,
    CONSTRAINT fk_cust_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_customer FOREIGN KEY (customer_key) REFERENCES Dim_Customer(customer_key)
);

3️⃣ Fact_Order_Forecast

CREATE TABLE Fact_Order_Forecast (
    forecast_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    forecast_date_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    location_key INT NOT NULL,
    forecast_orders INT NOT NULL,
    forecast_revenue DECIMAL(12,2),
    model_version VARCHAR(50),
    CONSTRAINT fk_forecast_date FOREIGN KEY (forecast_date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_forecast_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_forecast_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

4️⃣ Fact_Menu_Item_Performance

CREATE TABLE Fact_Menu_Item_Performance (
    menu_item_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    menu_item_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    quantity_sold INT NOT NULL,
    item_revenue DECIMAL(12,2),
    item_cost DECIMAL(12,2),
    item_profit AS (item_revenue - item_cost) PERSISTED,
    popularity_rank INT,
    CONSTRAINT fk_mip_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_mip_menu_item FOREIGN KEY (menu_item_key) REFERENCES Dim_Menu_Item(menu_item_key),
    CONSTRAINT fk_mip_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key)
);

5️⃣ Fact_Season_Sales

CREATE TABLE Fact_Season_Sales (
    season_sales_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    season_key INT NOT NULL,
    location_key INT NOT NULL,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2),
    avg_delivery_time DECIMAL(6,2),
    CONSTRAINT fk_ss_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_ss_season FOREIGN KEY (season_key) REFERENCES Dim_Season(season_key),
    CONSTRAINT fk_ss_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

6️⃣ Fact_Delivery_Performance

CREATE TABLE Fact_Delivery_Performance (
    delivery_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    order_id VARCHAR(50) NOT NULL,
    delivery_partner_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    location_key INT,
    promised_time_minutes INT,
    actual_time_minutes INT,
    delay_minutes AS (actual_time_minutes - promised_time_minutes) PERSISTED,
    sla_met_flag BIT,
    CONSTRAINT fk_dp_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_dp_partner FOREIGN KEY (delivery_partner_key) REFERENCES Dim_Delivery_Partner(delivery_partner_key),
    CONSTRAINT fk_dp_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key),
    CONSTRAINT fk_dp_location FOREIGN KEY (location_key) REFERENCES Dim_Location(location_key)
);

7️⃣ Fact_Store_Performance

CREATE TABLE Fact_Store_Performance (
    store_perf_key BIGINT IDENTITY(1,1) PRIMARY KEY,
    date_key INT NOT NULL,
    restaurant_key INT NOT NULL,
    total_orders INT NOT NULL,
    total_revenue DECIMAL(12,2),
    avg_order_value DECIMAL(10,2),
    avg_rating DECIMAL(3,2),
    cancellation_rate DECIMAL(5,2),
    CONSTRAINT fk_sp_date FOREIGN KEY (date_key) REFERENCES Dim_Date(date_key),
    CONSTRAINT fk_sp_restaurant FOREIGN KEY (restaurant_key) REFERENCES Dim_Restaurant(restaurant_key)
);

✅ Notes on DDL Design

  • Surrogate Keys for all fact tables (BIGINT IDENTITY)

  • Foreign keys reference dimension tables

  • Derived measures calculated in the fact table (profit_amount, delay_minutes)

  • Granularity enforced by primary key + business rules

  • Numeric measures only (no descriptive attributes)


The SQL DDL for the dimension tables can be written the same way to make this a fully deployable DW schema.



22jan26 - Best Practices for Data Modeling 1

  

---


## Data Modeling Best Practices – Headings


1. Start with Business Requirements (Not Tables)

2. Define Grain First (Critical Rule)

3. Choose the Right Model Type

4. Use Clear Naming Conventions

5. Design Clean Dimension Tables

6. Keep Fact Tables Lean

7. Enforce Data Integrity

8. Handle Dates Properly

9. Optimize for Query Performance

10. Plan for Data Growth

11. Document Everything

12. Validate with Real Data

13. Version Control Your Models

14. Avoid Over-Normalization in Warehouses

15. Design for BI & Reporting Tools


 

---------------------------------------------------------------------------------------------------------------------------


## 1. Start with Business Requirements (Not Tables)


**Best Practice**


* Understand **what questions the business wants answered**, not just source schemas.


**Why**


* Prevents over-engineering and unused columns/tables.


**How**


* Identify KPIs, reports, dashboards

* Define **facts**, **dimensions**, and **grain**

* Write requirements in plain language

  *Example: “Daily sales by product and region”*



---------------------------------------------------------------------------------------------------------------------------


## 2. Define Grain First (Critical Rule)


**Best Practice**


* Clearly define the **lowest level of detail** for each fact table.


**Why**


* Prevents aggregation errors and duplicate data.


**Example**


* ❌ “Sales table” (too vague)

* ✅ “One row per product, per order, per day”


---

---------------------------------------------------------------------------------------------------------------------------


## 3. Choose the Right Model Type


### a) OLTP (Transactional Systems)


* Use **3NF (Third Normal Form)**

* Minimize redundancy

* Optimize for **writes**


### b) OLAP / Data Warehouse


* Use **Dimensional Modeling**

* Star schema preferred

* Optimize for **reads & analytics**


**Rule of Thumb**


> Applications → normalized

> Analytics → dimensional


---

---------------------------------------------------------------------------------------------------------------------------


## 4. Use Clear Naming Conventions


**Best Practice**


* Use **business-friendly, consistent names**


**Examples**


* `customer_id` (not `custid`)

* `order_date` (not `dt1`)

* Tables: singular or plural—choose one and stick to it


**Why**


* Improves usability for analysts and BI tools.


---------------------------------------------------------------------------------------------------------------------------


## 5. Design Clean Dimension Tables


**Best Practices**


* Use **surrogate keys** (not natural keys)

* Keep dimensions **wide, not deep**

* Handle Slowly Changing Dimensions (SCD)


**SCD Types**


* Type 1 – overwrite (no history)

* Type 2 – add new row (full history) ⭐ most common

* Type 3 – limited history


---------------------------------------------------------------------------------------------------------------------------


## 6. Keep Fact Tables Lean


**Best Practices**


* Store only:


  * Foreign keys

  * Numeric measures

* Avoid text fields in facts


**Example**


* ✔ `sales_amount`, `quantity`

* ❌ `customer_name`, `product_description`


---

---------------------------------------------------------------------------------------------------------------------------


## 7. Enforce Data Integrity


**Best Practice**


* Use constraints where possible


**Examples**


* Primary keys

* Foreign keys

* NOT NULL for required fields


**Why**


* Prevents silent data corruption.


---

---------------------------------------------------------------------------------------------------------------------------


## 8. Handle Dates Properly


**Best Practice**


* Use a **Date Dimension**, not raw dates everywhere


**Benefits**


* Easy filtering by:


  * Year

  * Month

  * Quarter

  * Fiscal periods

  * Holidays


---

---------------------------------------------------------------------------------------------------------------------------


## 9. Optimize for Query Performance


**Best Practices**


* Index:


  * Foreign keys

  * Frequently filtered columns

* Use partitioning for large fact tables

* Avoid unnecessary joins


**Cloud Tip (AWS)**


* Redshift / Athena → sort keys & distribution keys

* RDS → proper indexing and query plans
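
A small sketch of the indexing advice applied to the Fact_Order_Sales table from the earlier example (generic SQL; Redshift/Athena would use sort/distribution keys or partitions instead of these indexes):

-- Index the foreign keys that drive most joins and filters
CREATE INDEX ix_fact_sales_date       ON Fact_Order_Sales (date_key);
CREATE INDEX ix_fact_sales_customer   ON Fact_Order_Sales (customer_key);
CREATE INDEX ix_fact_sales_restaurant ON Fact_Order_Sales (restaurant_key);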


---

---------------------------------------------------------------------------------------------------------------------------


## 10. Plan for Data Growth


**Best Practice**


* Design for **scale from day one**


**How**


* Avoid hard-coding values

* Use surrogate keys

* Anticipate new dimensions & measures


---

---------------------------------------------------------------------------------------------------------------------------


## 11. Document Everything


**Best Practice**


* Maintain data model documentation


**Include**


* Table purpose

* Column definitions

* Grain

* Business rules


**Why**


* Reduces dependency on individuals.


---

---------------------------------------------------------------------------------------------------------------------------


## 12. Validate with Real Data


**Best Practice**


* Test models using **actual data volumes**


**Checks**


* Row counts

* Duplicates

* Aggregation accuracy

* Edge cases


---

---------------------------------------------------------------------------------------------------------------------------


## 13. Version Control Your Models


**Best Practice**


* Treat data models like code


**How**


* Store DDL in Git

* Track schema changes

* Use migration scripts


---

---------------------------------------------------------------------------------------------------------------------------


## 14. Avoid Over-Normalization in Warehouses


**Best Practice**


* Prefer **simplicity over theoretical purity**


**Why**


* BI tools and users prefer fewer joins.


---

---------------------------------------------------------------------------------------------------------------------------


## 15. Design for BI & Reporting Tools


**Best Practice**


* Models should be **self-service friendly**


**How**


* Meaningful column names

* Pre-calculated metrics where useful

* Consistent dimensions across facts (conformed dimensions)


---

---------------------------------------------------------------------------------------------------------------------------


## Golden Rule Summary


> **A good data model is easy to understand, hard to misuse, and flexible to change.**


------------------------------------------------------------------------------------------------------------------------------


 

