Tuesday, May 12, 2026

12may26 - challenges in data modeling

 Transitioning from technical design to the actual execution of a project brings a different set of headaches. In a manufacturing environment—where the physical world moves faster than the IT world—these three challenges are almost universal.


1. The "Shadow IT" Spreadsheet Trap

In our project, we found that the shop floor managers weren't actually using the centralized database. Instead, they were keeping critical machine maintenance logs in personal Excel files on their desktops.

  • The Challenge: The data model we built was technically perfect, but it was empty. The real "truth" lived in fragmented spreadsheets that didn't follow any naming conventions or validation rules, making it impossible to get a factory-wide view of downtime.

  • The Consequence: Reports generated for the CEO showed 99% uptime, while the physical reality was closer to 70%. Decisions were being made based on hallucinated data because the "Shadow IT" records weren't synced with our model.

  • How we overcame it: Instead of forcing them to stop using Excel immediately, we built an Ingestion Layer with a standardized template. We created a simple "Upload" button that mapped their spreadsheet columns directly into our normalized database schema. This secured the data first; only then did we gradually replace the spreadsheets with mobile-friendly web forms.
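
A minimal sketch of that ingestion layer, with hypothetical table and column names: a staging table shaped like the standardized spreadsheet template, and an INSERT ... SELECT that maps it into the normalized schema behind the "Upload" button.

SQL
-- Staging table mirrors the standardized spreadsheet template (illustrative columns)
CREATE TABLE stg_maintenance_log (
    machine_code  VARCHAR(50),
    event_date    DATE,
    downtime_mins DECIMAL(10,2),
    reason_text   VARCHAR(200)
);

-- The "Upload" button bulk-loads the file into staging, then this mapping runs
INSERT INTO maintenance_event (machine_id, event_date, downtime_minutes, reason)
SELECT m.machine_id,
       s.event_date,
       s.downtime_mins,
       s.reason_text
FROM stg_maintenance_log s
JOIN machine m ON m.machine_code = s.machine_code;  -- resolve natural key to surrogate key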


2. The "Moving Target" Schema (Scope Creep)

Halfway through the manufacturing project, the engineering team decided to add a new line of "Smart Sensors", each providing 15 new data points (vibration, humidity, etc.) that our rigid relational model wasn't designed to hold.

  • The Challenge: Every time a new machine was added, we had to alter the table schema, which required taking the system offline and rewriting dozens of ETL (Extract, Transform, Load) scripts.

  • The Consequence: The project fell three months behind schedule. The developers spent all their time running "Alter Table" scripts and fixing broken queries instead of building the actual analytics dashboards the business requested.

  • How we overcame it: We moved to a Hybrid Modeling approach. We kept core business data (Machine ID, Location) in structured relational tables, but added a Metadata column using a JSONB (Document) data type. This allowed us to store any new sensor data on the fly without changing the database structure, giving us the flexibility of NoSQL within a SQL environment.
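
A minimal sketch of the hybrid model (PostgreSQL syntax, illustrative names): the stable attributes stay relational, while the variable sensor payload sits in a JSONB column that new sensors can extend without an ALTER TABLE.

SQL
-- Core business data stays relational; the sensor payload is schemaless
CREATE TABLE machine_reading (
    reading_id BIGSERIAL PRIMARY KEY,
    machine_id INT       NOT NULL,
    reading_ts TIMESTAMP NOT NULL,
    metadata   JSONB               -- e.g. {"vibration": 0.42, "humidity": 55}
);

-- New sensor fields are just new JSON keys; no schema change, no downtime
SELECT machine_id,
       reading_ts,
       (metadata ->> 'vibration')::NUMERIC AS vibration
FROM machine_reading
WHERE metadata ? 'vibration';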


3. The "Granularity" Conflict

The Production Team wanted to see data by the second to troubleshoot machine "micro-stops," but the Finance Team only wanted to see data by the month for cost-per-unit analysis.

  • The Challenge: If we modeled the data at a high level (Monthly), the engineers couldn't use it. If we modeled it at a granular level (Seconds), the Finance reports became unbearably slow because they had to sum up millions of rows just to see a single month's total.

  • The Consequence: We faced a user revolt. Finance claimed the system was "broken" because it was too slow, while Engineering claimed it was "useless" because it wasn't detailed enough.

  • How we overcame it: We implemented Data Aggregation Tables (summary layers in the spirit of a Star Schema or Medallion architecture). We stored the "Atomic" data for the engineers in a cold-storage layer and created "Summary Tables" (Daily and Monthly) for the Finance team. A scheduled job refreshed these summaries every night, so Finance got their reports in seconds without touching the raw sensor data.
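
In outline, the nightly refresh looked roughly like this (assuming an atomic sensor_reading table; the scheduler itself lives outside the SQL): raw rows are never touched, and Finance queries only the pre-aggregated table.

SQL
-- Daily summary table fed from the atomic layer
CREATE TABLE daily_machine_summary (
    machine_id     INT  NOT NULL,
    summary_date   DATE NOT NULL,
    total_readings BIGINT,
    avg_output     DECIMAL(12,2),
    PRIMARY KEY (machine_id, summary_date)
);

-- Nightly job: rebuild yesterday's slice from the raw (atomic) data
DELETE FROM daily_machine_summary
WHERE summary_date = CURRENT_DATE - 1;

INSERT INTO daily_machine_summary (machine_id, summary_date, total_readings, avg_output)
SELECT machine_id,
       CAST(reading_ts AS DATE),
       COUNT(*),
       AVG(output_units)
FROM sensor_reading
WHERE CAST(reading_ts AS DATE) = CURRENT_DATE - 1
GROUP BY machine_id, CAST(reading_ts AS DATE);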

12may26 - Mistakes in data modeling

 

1. The "Hard-Coded" Unit of Measure

In our manufacturing project, we built a schema for Raw Materials but ignored the detail that different global plants might use different measurement systems.

  • The Mistake: We modeled the Quantity column as a simple decimal and assumed everyone used Metric (kilograms) without including a UnitID or UOM column.

  • The Consequence: When the company acquired a US-based plant using Imperial units, the system began mixing pounds and kilograms in the same column. Production batches failed because the formulas were chemically incorrect, leading to thousands of dollars in wasted physical inventory.

  • How we handled it: We performed a massive data migration to add a UnitOfMeasure reference table. We then implemented a calculation layer in the database that automatically converted all inputs into a "Base Unit" while storing the original local unit for the plant’s UI.
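
A rough sketch of the reference table and the calculation layer (table and column names are made up for illustration): every unit carries a conversion factor to the base unit, and a view exposes both the local and the normalized quantity.

SQL
-- Reference table: every unit carries a conversion factor to the base unit (kg)
CREATE TABLE unit_of_measure (
    uom_id         INT PRIMARY KEY,
    uom_code       VARCHAR(10)   NOT NULL,  -- 'KG', 'LB', ...
    to_base_factor DECIMAL(18,8) NOT NULL   -- e.g. LB -> 0.45359237
);

ALTER TABLE raw_material_batch ADD COLUMN uom_id INT REFERENCES unit_of_measure (uom_id);

-- Calculation layer: report in the base unit, keep the local unit for the plant UI
CREATE VIEW raw_material_batch_base AS
SELECT b.batch_id,
       b.quantity                    AS local_quantity,
       u.uom_code                    AS local_uom,
       b.quantity * u.to_base_factor AS quantity_kg
FROM raw_material_batch b
JOIN unit_of_measure u ON u.uom_id = b.uom_id;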


2. Overlooking "Time-Travel" (Temporal Data)

We ignored the detail that Product Costs are not static; they fluctuate based on the market price of raw materials like steel or plastic.

  • The Mistake: We simply overwrote the UnitCost in the Parts table whenever the price changed, rather than creating a historical price table.

  • The Consequence: When the finance team tried to run a "Year-to-Date" profitability report, the system calculated January’s sales using May’s high inflation prices. All historical financial reports were wrong, making it look like the company had lost money all year when it hadn't.

  • How we handled it: We moved the cost attribute out of the main Parts table and into a PartPriceHistory table. We used EffectiveDate and EndDate columns so the system could "look back in time" and join the correct price to the date the order was actually placed.
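
A condensed sketch of the history table and the "look back in time" join (the order_line table is assumed): each order line picks up the cost that was effective on its order date.

SQL
CREATE TABLE part_price_history (
    part_id        INT           NOT NULL,
    unit_cost      DECIMAL(12,4) NOT NULL,
    effective_date DATE          NOT NULL,
    end_date       DATE,                    -- NULL = still current
    PRIMARY KEY (part_id, effective_date)
);

-- Join each order line to the price that was valid on the order date
SELECT o.order_id,
       o.order_date,
       o.part_id,
       p.unit_cost AS cost_at_order_time
FROM order_line o
JOIN part_price_history p
  ON p.part_id = o.part_id
 AND o.order_date >= p.effective_date
 AND (o.order_date < p.end_date OR p.end_date IS NULL);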


3. The "Flat" Equipment Hierarchy

In our asset management module, we ignored the detail that a "Machine" is actually a collection of smaller, replaceable components with their own maintenance schedules.

  • The Mistake: We modeled the Equipment table as a flat list, treating a "CNC Lathe" as a single entity rather than a parent with child components (like motors or sensors).

  • The Consequence: Maintenance teams couldn't track the specific life cycle of a $10,000 motor inside the machine. When the motor was moved to a different lathe, the service history was lost, leading to redundant repairs and a critical "unplanned downtime" event because a sensor's age was hidden.

  • How we handled it: We redesigned the model to support a Self-Referencing Relationship (a ParentEquipmentID column within the table). This allowed us to build a "Tree" structure where every sub-part could have its own history while still belonging to the main machine.
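
A minimal sketch of the self-referencing design (illustrative names), plus a recursive query that rebuilds the equipment "Tree" under a given machine.

SQL
-- Every row can point at its parent; a NULL parent means a top-level machine
CREATE TABLE equipment (
    equipment_id        INT PRIMARY KEY,
    equipment_name      VARCHAR(100) NOT NULL,
    parent_equipment_id INT REFERENCES equipment (equipment_id)
);

-- Walk the tree under one machine (equipment_id = 1 is just an example)
WITH RECURSIVE equipment_tree AS (
    SELECT equipment_id, equipment_name, parent_equipment_id, 1 AS depth
    FROM equipment
    WHERE equipment_id = 1
    UNION ALL
    SELECT e.equipment_id, e.equipment_name, e.parent_equipment_id, t.depth + 1
    FROM equipment e
    JOIN equipment_tree t ON e.parent_equipment_id = t.equipment_id
)
SELECT * FROM equipment_tree;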

How to join a Customer dimension (SCD Type 2) with a Transaction fact table to get the customer's location on the date of the transaction

 To join a Transaction (Fact) table with an SCD Type 2 Customer (Dimension) table, you must match the transaction date to the specific period when a customer's record was "active."

In your specific table structure, since you don't have a valid_to column, you have to "close" the records yourself using a window function before joining.

Step 1: Prepare the Dimension (The "Timeline" View)

First, use a Common Table Expression (CTE) to define the start and end dates for every version of the customer record.

SQL
WITH Customer_Timeline AS (
    SELECT 
        custo_id,
        addr,
        updated_date AS start_date,
        -- Create an end_date by looking at the next record's update date
        LEAD(updated_date) OVER (PARTITION BY custo_id ORDER BY surrogate) AS end_date
    FROM Customers
)

Step 2: The Point-in-Time Join

Now, join the transaction table to this timeline. The join condition must ensure the transaction date falls on or after the start_date and strictly before the end_date.

SQL
SELECT 
    t.trx_id,
    t.date AS trx_date,
    t.customer,
    c.addr AS location_at_trx_time
FROM Transactions t
JOIN Customer_Timeline c 
  ON t.customer = c.custo_id
  AND t.date >= c.start_date 
  AND (t.date < c.end_date OR c.end_date IS NULL);

Why this logic is essential

If you simply join on custo_id, SQL will return every version of that customer for every transaction, causing a "fan-out" (duplicated rows).

Using the date range ensures a 1:1 match:

  • Transaction on 01-01-2000: Joins with the record where addr = 'usa'.

  • Transaction on 01-03-2000: Joins with the record where addr = 'india'.

Key Considerations for Interviews

  1. The "NULL" Problem: Always mention the OR end_date IS NULL part. This handles the current "active" record, which hasn't been superseded yet.

  2. SCD Type 2 Identification: Mention that this approach is specifically for Slowly Changing Dimensions (Type 2) where we preserve history.

  3. Performance: Range joins (>= and <) can be computationally expensive on large datasets. In a production environment, it is better to have the ETL process pre-calculate a valid_to_date column rather than calculating it on the fly with LEAD().
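
As a sketch of that ETL-side fix (the target table name is made up), the timeline can be materialized once during the load so the expensive LEAD() never runs at query time:

SQL
-- Pre-compute the end date during ETL instead of at query time
CREATE TABLE customer_timeline AS
SELECT custo_id,
       addr,
       updated_date AS valid_from,
       LEAD(updated_date) OVER (PARTITION BY custo_id ORDER BY updated_date) AS valid_to
FROM Customers;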
