Contents
6. > star and snowflake when to
use
7. > how scd works on downsteams
8. How to handle late arriving
dimensions ?
Tell me your knowledge on the 3 layers of ETL
## Data Modeling Best Practices – Headings
what are 3 levels of Data modeling
Types of Dimension tables (30, 46,48)
Steps in Datawarehouse life cycle steps
what is ODS( operational data store)
OLTP & OLAP- differences (pg 17) 53
identify connect the data source
select storage mode identify the
query performance issues
resolve inconsistencies and
unexpected values null values data quality issues
apply value replacements identify
the keys for joints
identify and transform column data
types
apply naming conventions two
columns
configure table and column
properties
flatten the parent child hierarchy
define relationships and
cardinality
resolve many to many relationships
define the data level granularity
ELT (Extract, Load, Transform) advantages
68. When to use star schema and when to go for snowflake
?
A Compact List of Snowflake Features
Erwin - Data modeler workspace
Erwin - types of relationships
Erwin - which erwin features did you use
Erwin - which feature used to merge the subtype tables
to supertype
Erwin - Forward and reverse engineering steps
Erwin - Complete compare feature
non identifying relationships differences...........
Two types of model versions exist:
1. > unique key and primary
key diff
2. query to find number of subjects
of each student
3. Peoplefy - How to design facts and dimensions
4. Peoplefy - sql query to find sum
5. query to find length of each
column and subtract total commas from it
Teksystems - query : find and delete duplicate rows Two methods to find and delete duplicate rows
what is inner
join and outer join count of below
53 . How to write the INNER JOIN ,OUTER JOIN QUERIES
54 . How to find cumulative check - if employee got hike
or not
55. How to find if a single user purchased same product
on different dates
SQL - show highest salary in each department
SQL - count(*) , count(column) , count(1)
-- BOTH COUNT(*) AND COUNT(1) ARE SAME BUT COUNT(1) IS
FASTER
sql query : TO get highest ordered products in each
state
VARCHAR AND NVARCHAR difference
what is most likely reason that sql query may run slowly
- sql query : find and delete
duplicate rows
SQL : query to show Total
salary of each department
SQL : query to sum of salary by dept and
How do you do Data model optimization ?
what is most likely reason that sql query may run slowly
Query optimization best practices
best practices for SQL query optimization summarized in
one-liners:
best practices for data modeling summarized in
one-liners:
1-- . Hospital project data model
2 . The Patient Visit
Lifecycle
3 . How we identified the
entities
4 . Tables list of OLTP and
OLAP
4.1. ) OLTP tables list : Operational Database (approx.
15–20 tables)
4.2 ) OLAP tables list:
Data Warehouse ( facts, dims )
4.2.1) Fact Tables (The focus of your analytics):
\\ 2 . Insurance
project and how you build data mode
1 . Insurance Project Data Model Specification
1.1 About the Insurance Project
2. The Patient-Insurance Lifecycle
3. Insurance Data Model Entities
3.1 OLTP Tables List (Operational Database)
3.2 OLAP Tables List (Data Warehouse)
5. Data Traceability & Integrity
\\ 3 . Bank
project data model
2. The Customer-Account Lifecycle
3.1 OLTP Tables List (Operational Database)
3.2 OLAP Tables List (Data Warehouse)
\\ 4 . Food
delivery project data mode
1. About the Food Delivery
Project
2. The Order Fulfillment
Lifecycle
3.1 OLTP Tables List
(Operational Database)
3.2 OLAP Tables List (Data
Warehouse)
\\ 5. Casella waste management project Data mode
Waste Management Project Data Model Specification
1. About the Waste Management Project
2. The Waste Collection Lifecycle
3.1 OLTP Tables List (Operational Database)
3.2 OLAP Tables List (Data Warehouse)
2. Failure to Normalize (Redundancy)
4. Naming and Integrity Issues
1. Challenge: Inconsistent definition
of "Patient"
3. Challenge: Regulatory compliance
gaps for prescription history
4. Challenge: Supplier lead-time
volatility
5. Challenge: Product classification
ambiguity
6. Challenge: Partial payment and
refund handling
1. The "Infinite Customization" Modifier
Problem
2. Bi-Temporal Menu Versioning (The "Price
Lock" Paradox)
3. High-Concurrency Geofence Matching
4. Multi-Party Split Settlement (The
"Double-Entry" Ledger)
1. Real-Time Security & Compliance
2. Customer 360-Degree Integration
3. Financial Precision & Multi-Currency
4. Amortization & Payment Integrity
Issue: Data inconsistency in customer profiles
(duplicates).
Issue: Conflicts between Policy status and Payment
history.
Issue: Late-arriving fact records (claims ingested after
the reporting window).
Issue: Referential integrity failure (orphaned payments
due to out-of-order ingestion).
Issue: Historical schema drift (change in data structure
for Underwriting_Rules).
2. Failure to Normalize (Redundancy)
4. Naming and Integrity Issues
1. Real-Time Security & Compliance
2. Customer 360-Degree Integration
3. Financial Precision & Multi-Currency
4. Amortization & Payment Integrity
No index entries found.
.
Data modeling
1.
Types of Facts
·
3
types
o
Additive - ( sum , avg )
o
Semi additive - (
cumulative like bank balance statement )
o
Non additive - (
prices )
2.
Types of Fact tables
·
Accumulative fact tables
·
Transaction fact tables
·
Snapshot fact tables
·
Factless fact tables
3.
Types of Dimension tables
4.
> WHY 3 models needed
5.
> Why use surrogate key
6.
> star and snowflake when to use
7.
> how scd works on downsteams
8.
How to handle late arriving dimensions ?
To handle late-arriving dimensions, you can use strategies like
creating "unknown" or placeholder dimension records, holding fact
records in a temporary table until the dimension arrives, or using a
reconciliation pattern to retry failed joins, ensuring data integrity and
completeness.
Here's a more detailed breakdown of common approaches:
1. "Unknown" or
Placeholder Dimensions:
Concept:
When a dimension record is
missing, create a placeholder or "unknown" record in the dimension
table with a default value for all attributes.
Benefits:
● Prevents fact data from being lost or
incomplete.
● Allows fact data to be loaded and reported
on even when dimensions are missing.
● Simplifies the process of updating dimension
records when they arrive.
Example:
If a product dimension is
missing, create a "product_id_unknown" record with default values for
other attributes.
2. Hold Fact Records in a
Temporary Table:
Concept:
When a dimension record is
missing, store the fact record in a temporary table until the dimension is
available.
Benefits:
● Ensures that all fact data is processed
eventually.
● Simplifies the process of updating fact
records when the dimension arrives.
Example:
Store fact records with missing
product dimensions in a temporary table, and then re-process them when the
product dimension is loaded.
3. Reconciliation Pattern:
Concept:
Use a process to identify and
retry failed joins when dimensions are missing.
Benefits:
● Ensures that all fact data is processed
eventually.
● Can be used with both streaming and batch
ETL processes.
Steps:
● Write unjoined records to a
"streaming\_pipeline\_errors" table.
● Consolidate multiple failed retries for the
same event into a single fact row.
● Union Kafka and retryable events.
● Transform/join this unioned data with the
dimension table.
● Control how many retries are allowed by
filtering out reconciliation records that exceed the configured number of
retries.
●
4. Other Considerations:
● Identify Slowly Changing Dimensions: Determine how to handle changes to
dimension attributes over time.
●
● Use Surrogate Keys: Employ surrogate keys for dimension tables
to simplify joins and manage slowly changing dimensions.
●
● Data Deduplication: Ensure that you are not duplicating fact
data when retrying joins.
●
● Schema Evolution: Plan for changes to the schema of your
dimension tables.
●
● Inferred Flag: Use a flag to indicate whether a dimension
record is inferred
How snowflake is different
from other databases or data warehouses
1. Separate
"Storage" and "Brain" Power
Snowflake keeps storage and computing (the
"brain") separate, so you can store data cheaply and only pay for
computing power when you are actually using it. Unlike traditional databases
that force you to pay for extra "brain" power whenever you need more
storage, Snowflake keeps these costs independent.
2. No "Upkeep"
●
Snowflake: It is "set it and forget it." Snowflake handles all the
technical chores, software updates, and performance tuning for you
automatically.
●
Others: Traditional databases require a technical expert to constantly
manage the hardware, install updates, and manually tune the system to keep it
fast.
3. Handles "Messy" Data Easily
●
Snowflake: It can read both neat, organized tables and "messy" data
(like information from websites or apps) equally well, meaning you don’t have
to spend time cleaning it before use.
●
Others: Most traditional databases only work with neat, structured tables,
so you must spend extra time and effort cleaning the data before it can be
stored.
4. Only Pay for What You Use
●
Snowflake: Because you can turn off the "brain" when you aren't
using it, you stop paying for that power immediately.
●
Others: You usually pay for a fixed amount of hardware capacity; even if
the system is sitting idle, you are still paying for that unused power.
Tell me your knowledge on
the 3 layers of ETL
●
Extract: Gathering raw data from various sources without disrupting the
original systems.
●
Transform: Cleaning, formatting, and organizing the raw data so it is consistent
and useful.
●
Load:
Saving the prepared, high-quality data into a final warehouse where it is ready
for business analysis.
------------------------------------------------------------------------------
1. Start with Business Requirements
(Not Tables)
2. Define Grain First (Critical
Rule)
3. Choose the Right Model Type
4. Use Clear Naming Conventions
5. Design Clean Dimension Tables
6. Keep Fact Tables Lean
7. Enforce Data Integrity
8. Handle Dates Properly
9. Optimize for Query Performance
10. Plan for Data Growth
11. Document Everything
12. Validate with Real Data
13. Version Control Your Models
14. Avoid Over-Normalization in
Warehouses
15. Design for BI & Reporting Tools
Types of Fact tables
|
Types of Fact tables |
1. Transactional Fact Tables Transactional fact tables are designed to capture
individual business events or transactions. 2. Periodic Snapshot Fact Tables Periodic snapshot fact tables provide a summarized view
of metrics over regular time intervals. 3. Accumulating Snapshot Fact Tables Accumulating snapshot fact tables are designed to track
the stages of a business process or workflow. 4. Factless facts |
types of fact tables 51
|
1. Additive ( can be added
like total qty , sales ) 2. Semi additive ( cannot
be added but updated like account balance )
3. Non Additive ( numerical
but cannot be added like today price
of scotk ) |
Granularity and Granule
|
Granularity in data refers to the level
of detail >
high level of granularity would have a large number of individual
pieces (columns) of information, such as individual records or measurements. >Data that has a low level of
granularity would have a small number of individual pieces of information,
such as summary data or aggregated data. Data granularity can affect how it
is used and analyzed, and can impact the accuracy and usefulness of the
results. |
what are 3 levels of Data modeling
|
1. Conceptual model list of entities and relationships , no attributes
diagram only 2. logical 3. physical model |
Types of Dimension tables (30, 46,48)
|
1 Roleplay dimension Role playing dimensions are dimensions
that are used more than once in a fact table, each time with a different
meaning or role.
when we want two attributes from the source entity to be connected to
1 entity as destination, that is not possible. In such cases, we can connect
both the columns of one table to 2nd table, but with a new role name that is
called as role play dimension something like connecting address and customer
table with 2 columns bill to address and ship to address 2 Conformed dimension Conformed
dimensions are dimensions that are shared by multiple stars 3 Degenerate dimension
A column of the key column in the fact table that does not have the
associated dimension table but used for reporting and analysis, such
column is called degenerate dimension For ex, we have a fact table with customer_id,
product_id, branch_id, employee_id, bill_no 4 Junk dimension This combined dimension table covers all
possible set of values which might occur in fact table and is known as
junk dimension.5 Shrunken dimension 5 Static dimensions Static dimension items do not change
with time; they are fixed components that are
always returned in a freeform table. The time , date , country code , status codes etc 6. Dynamic dim which change dynamically based on values and formula 7. Deferred dims Sometimes the
fact table have to be loaded with some data, like customer id. But the
dimension of customer is not loaded with that customer id. In such cases, we
delay inserting such rows into fact table And later on, once the the
dimension of customer is loaded with this customerId, then we add this rows
into fact table. This type of delaying the fact table is called deferred dimensions 8 Inferred Dimensions When the fact data is available to be loaded in the
source but related dimension table does not have that eg:
like the customerId is in sales
orders table but not in customer table .. in such case we add an sample entry in customer table and enable
'inferred dimension' flag and later point of time we make a original entry
into customer table and disable the inferred dimension 9 A slowly changing
dimension (SCD) scd is a dimension which contains relatively static data which can change
slowly but unpredictably, rather than according to a regular schedule SCD Type 0 Ignore
any changes and audit the changes. Type 1 Overwrite
the changes Type 2 History
will be added as a new row. Type 3 History
will be added as a new column. Type 4 A
new dimension will be added Type 6 Combination
of Type 2 and Type 3 10 Rapidly changing dimensions (RCD) Rapidly changing dimension (RCD) is a dimension which has
attributes where values will be getting changed often. 11) sparse dim Some of the dimension tables are not loaded with full
values 12) dense dim Sometimes most of the columns of the dimension tables
are loaded, fully loaded. So such dimensions 13) shrunken dim Sometimes we create a replica of the original dimension,
but it is with lesser, fewer number of columns than the original table |
||||||||||||||||||
What is factless fact table
|
A factless Most of the columns are empty. Such
dimensions are called sparse dimensions fact table is a table that contains
only foreign keys to dimensions, but no numeric facts. Factless fact tables
are used to capture events or situations that have no measurable outcome |
||||||||||||||||||
What is Data Mart
|
Data mart is a subset of datawarehouse purpose-built to support the analytical
needs of a particular department, or line of business, or geographic region. |
||||||||||||||||||
Steps in Datawarehouse life cycle steps
|
Requirements gathering Requirements analysis High level design (HLD) Low level design (LLD) Development – Automating scripts Testing User Acceptance Testing (UAT) Project release or deployment Warranty support |
what is ODS( operational data store)
|
An operational data store usually stores and processes data in real time. An ODS is connected to multiple data sources and pulls data into a
central location. ODS is a temportary data store from all
sources for volatile data like resumes of interview, quotation from suppliers
,etc. which need not be stored permanently |
OLTP & OLAP- differences (pg 17) 53
|
https://easy-data-modeling.blogspot.com/2024/03/olap-oltp-diff.html |
Data model testing how
Data model can
tested using the following:
Unit testing: This
involves testing each component of the data model individually to ensure that
it performs as expected.
Integration
testing: This involves testing how different components of the data model work
together
using test data
Data modeler roles
the data from data
source
identify connect the data source
pick a share data set
select storage mode identify the query performance
issues
use or create a data flow
profile the data
identify data anomalies
examine data structures
clean transform load data
resolve inconsistencies and unexpected values null
values data quality issues
apply value replacements identify the keys for joints
identify and transform column data types
apply naming conventions two columns
perform data loading
resolve data input errors
model the data
designer data model
define the tables
configure table and column properties
flatten the parent child hierarchy
role playing dimensions
define relationships and cardinality
design the data model
resolve many to many relationships
design a common date table
define the data level granularity
Layer of Data lake
- landing
layer
2. curated layer
3. application data layer ( summary data)
4. Sand box layer
5. Target layer
Views
and materialized views
Views and materialized views are both database objects that
provide access to data, but they differ in several ways:
Storage
Views
are virtual and don't store data
,
while materialized views store precomputed data as a physical table.
Performance
Views
are more efficient because they don't require storage or updates,
but
materialized views can be faster for complex queries and aggregations.
Data governance
Data governance is the process of
managing the
availability, usability, integrity and
security
of the data in enterprise systems,
based on internal standards and policies
that also control data usage.
Effective data governance ensures that
data is consistent and trustworthy
and doesn't get misused.
Data governance is increasingly critical as organizations are
expanding data privacy regulations and
rely more and more on data analytics
to help optimize operations and
drive business decision-making.
https://www.thoughtspot.com/data-trends/data-governance/how-to-implement-data-governance
https://data-modeling-made-easy.blogspot.com/2024/08/how-to-implement-data-governance.html
ELT (Extract, Load, Transform)
advantages
ELT is better suited for larger data sets, situations
where timeliness is crucial, or when you must keep raw data available for
future analysis. For instance, ELT is faster than ETL because it loads data
directly into the destination system and can perform transformations in
similarity.
What is a Data Lake?
A data lake is a centralized repository that allows you to
store all your structured and unstructured data at any scale. You can store
your data as-is,
A data lakehouse
is a data architecture that combines a data lake and a data warehouse to
provide a flexible solution for business intelligence and data science. It
allows organizations to use low-cost, flexible storage for all types of data,
while also benefiting from the data management, schema, and governance of a
warehouse.
Data lakes
and data warehouses are both storage systems for big data used by business
analysts, data engineers, and data scientists. They have different
architectures and serve different business needs:
Data lakes
Store raw, structured, and semi-structured data from internal and external
sources. They are good for data analytics, machine learning, experimenting with
data, and training AI models. Data lakes prioritize storage volume and cost
over performance, and can store any type or volume of data in full fidelity.
They also allow users to ingest data from any system at any speed, and process
data in real time or batch mode.
Data warehouses
Store structured data from multiple sources. They are good for business
intelligence (BI), data visualizations, reporting, and accessing reliable,
quality data. Data warehouses are designed for the fastest query performance,
and are optimized to analyze relational data coming from transactional systems
and line of business application
68. When to use star
schema and when to go for snowflake ?
If you have a few dimensions and low cardinality (less unique values in
dims ) , but you require fast query execution, a star schema is
the right choice.
However, if you have several dimensions and
high cardinality, a snowflake schema will be a better scheme
features of snowflake db
Snowflake can seamlessly integrate with popular cloud providers such as AWS, Azure,
and Google Cloud Platform. It can be scaled up or down automatically, ensuring efficient data
loading, integration, and analysis. This allows multiple users to run numerous
workloads simultaneously without concerns about resources
https://docs.snowflake.com/en/user-guide/intro-supported-features
https://staragile.com/blog/snowflake-features
A Compact List of Snowflake Features
● Decoupling of storage and compute in Snowflake
● Auto-Resume, Auto-Suspend, Auto-Scale
● Workload Separation and Concurrency
● Semi-structured Data Storage
● Cloning
● Snowpark
Data vault qsns
https://data-modeling-made-easy.blogspot.com/p/data-vault-faq.html
|
ERWIN qsns |
|
Erwin - Data modeler workspace
|
is the space where the Tables are created and connected |
|
Erwin - Domains |
A domain is like a custom datatype an
independent model object that you can use to quickly assign properties to an
attribute or column |
Erwin - types of relationships
|
Identifying relationship. Non-identifying relationship. Subtype relationship. Many-to-many relationship. Recursive relationship. |
Erwin - which erwin features did you use
|
1. Name hardening 2. Domains for Data types 3. Naming standards( abbrevations) 4. Fwd & reverse engineering 5. Merge data models 6. complete compare 7. sub type and super type |
Erwin - which feature used to merge the subtype tables to
supertype
|
|
Erwin - Forward and reverse engineering steps
|
Forward engr steps : Actions > schema > Generate scripts Reverse engnr steps : Actions > reverse engnr > logical / physical > Database (or) script file > connect > ok |
Erwin - Complete compare feature
|
Complete Compare is a powerful tool
that lets you view and resolve the differences between two models, or a model
and a database or script file Actions > complete compare wizared
> source target models select > compare > final >
resolve differenceces |
Erwin - Naming standards
|
Use the Naming Standard Editor to
create naming standards and develop a glossary that contains glossary words,
their abbreviations, and alternate abbreviations. Access the Model Explore> open
Naming Standard Editor > Define naming standards for logical and physical
models> Create abbreviations |
Erwin - Name hardening
|
Name hardening is a mechanism in Erwin Data Modeler that
prevents physical model names from changing when logical names change |
Erwin - querying tool
|
|
Erwin - subject areas
|
Click Model, Subject Areas. The Subject
Area Editor opens. Click New. A new subject area |
Erwin - identifying and
non identifying relationships differences...........
|
Identifying : Represented by a solid line, the primary
key of the parent migrates to the primary key area of the child entity or
table eg : Book & chapter tables , Country and state tables Non-identifying : The child entity is not dependent on
the parent entity for its identity and can exist without it .. eg : supplier
& product list , |
ERWIN - how maintain
versions
Two types of model versions exist:
Delta Version: A delta version is created when you initially save a new model or
when you save changes to an existing model.
If you do not want to
maintain multiple versions, you can clear the Maintain multiple versions check
box. If you clear the checkbox, then erwin� Data Modeler does not create a
delta version for any incremental save. erwin� Data Modeler updates the current
version every time you save the model.
Named Version:
A named version is
created from a delta version, or from another named
version. A named version represents a milestone in the development of the
model, or at least a version that you want to keep indefinitely. When you save
the changes in a named version, the changes are saved in the same version.
The Based upon field
in the Edit Catalog pane displays the version
number from which the Named version is created.
=========================================================================
Create a Delta Version
You
create a delta version when you save a model initially or when you
incrementally save an existing model.
Follow
these steps:
- Click File, Mart,
Open >
> The Open Model dialog opens. > Select a model and click OK.
> The model opens.
>Make necessary changes to the model.
>Click File, Mart, Save.
- A delta version of the model is created with the
incremental changes.
Create a Named Version
A
named version of a model represents a milestone in the development of the
model. You create a named version to keep that model version indefinitely.
Follow
these steps:
- Click File,
Mart, Catalog Manager .
> The Catalog Manager opens. > Select a model version,
>right-click and click Mark Version. - A named version is created with a default name.
> Edit the name of the named version (new name ) and press enter. - A new named version is created in the catalog.
=========================================================================
Compare Model Versions
Compare
two model versions of a data model to view the differences between them.
Follow
these steps:
- Click File,
Mart, Catalog Manager. >
The Catalog Manager dialog opens.
Hold the Ctrl key and select two model versions of a model and
> click Version Compare. >Click Compare. - The Complete Compare wizard opens. >.
- The Resolve Difference dialog opens.
Review the differences and use the tools on the toolbar to work with and
manage the changes. > Finish
> Click Close.
-
2 . Sql questions
1. > unique key and primary
key diff
2.
query to find number of subjects of each student
a. ========================================================
b. (name ) , (marks )
c. student1 20,30,40,50
d. student2 30,40
e. student3 30,40,55
3.
Peoplefy - How to design facts
and dimensions
4.
Peoplefy - sql query to find sum
a. sql query to find sum of all sales on
1-1-2024 with id 101 who have
b.
c. orderid qty date
d. 101 150 1-1-2024
e. 101 15 1-1-2024
f. 102 22 2-1-2024
5. query to find length of each column and subtract total commas from
it
5.
SELECT name,
CASE
WHEN marks = '' OR marks IS NULL THEN 0
ELSE (LENGTH(marks) -
LENGTH(REPLACE(marks, ',', '')) + 1)
END AS subject_count
FROM students;
select name , (REPLACE(marks, ',', '') )from students;
select name , marks
from students;
Query to find third highest ?
To find the fourth highest salary by
each department, you can use a combination of SQL queries with ROW_NUMBER or DENSE_RANK
functions. Here's an example:
Using ROW_NUMBER (all
sequential ranks ):
select employee_id,salary from (
select
e.*,row_number() over (order by salary desc)
as row_num
from employee e ) a
where row_num = 3 ;
Using RANK (for duplicate ranks , next rank = prev
rank + total dups + 1):
select
* from
(select e.*,rank() over (order by salary desc)
as rank
from emp
e )
where rank
= 3;
Using DENSE_RANK (for
duplicate ranks , next rank = prev rank
+ 1):
select
* from
(select e.*,dense_rank() over (order by
salary desc) as rank
from emp
e )
where rank
= 3;
6.
Data
model constraints ?
Data model constraints refer to rules or
limitations designed to maintain the integrity and validity of data within a
database or system. Here are some common types of constraints:
1.
Entity
Constraints:
●
Primary
Key: Ensures that each row in a
table is unique and identifiable (e.g., no duplicate or NULL values in the
primary key column).
●
Unique
Constraint: Ensures all values in a
column are unique, but unlike the primary key, it allows one NULL value.
2.
Domain
Constraints:
●
Restrict
the type, format, or range of values for a column (e.g., specifying data types
like INTEGER, VARCHAR, or a range for a numeric field).
3.
Referential
Constraints:
● Foreign Key: Ensures a column’s value matches the
primary key of another table, maintaining referential integrity between tables.
4.
Check
Constraints:
● Specifies a condition that each row must
satisfy (e.g., salary > 0 or age >= 18).
5.
Default
Constraints:
●
Assigns
a default value to a column if no value is provided during insertion.
6.
Not
Null Constraint:
● Ensures that a column cannot have NULL
values (e.g., name column must always have a
value).
7.
Functional
Dependencies:
●
Enforces
relationships between attributes in a table, often used for normalization.
8.
Key
Constraints:
●
Besides
primary and unique keys, these may involve composite keys (combination of
columns acting as a primary key).
9.
Business
Rules:
●
Application-specific
rules tailored to the business logic (e.g., an employee can only belong to one
department).
10. Performance Constraints:
●
Limitations
imposed to enhance query performance, such as partitioning or indexing
strategies.
These constraints ensure data
reliability, prevent inconsistencies, and enforce rules for both operational
and analytical use cases.
----------
=============================
28aug25 - Teksystems
=============================================================
Teksystems - in data ware house how do
you update the history table with scd2 if the last updated data was corrupt and
to update the data with correct values
When a corrupt record needs to be updated
in a Type 2 Slowly Changing Dimension (SCD2) history table, you don't simply
"update" the existing record. Instead, you follow a specific process
that maintains a complete history of changes.
The
Process for Correcting a Corrupt Record
The standard SCD2 method for handling
changes is to mark the old record as inactive and insert a new, active record.
The same principle applies here to correct a corrupt entry. Here's a
step-by-step breakdown:
- Identify the Corrupt
Record:
Find the record in the SCD2 table that contains the incorrect or corrupt
data. This record will have its end_date
set to a future or '9999-12-31' value, and its is_current
flag (or similar) set to true.
- Expire the Corrupt Record: The first step is to "end"
the active life of the corrupt record. You do this by performing an UPDATE operation on the row. You will:
●
Set the
end_date of the corrupt record to the
current date, or the date the correction is being applied.
●
Set the
is_current or current_flag to false.
- Insert the Correct Record: After the corrupt record is marked as
inactive, you must INSERT a new
row into the history table. This new record will contain the correct
values for all the attributes. This new row will have:
●
The
same primary key as the original record (often a business key).
●
The
new, correct data for all relevant columns.
●
A start_date set to the current date (or the
date of correction).
●
An end_date set to a future or '9999-12-31'
value.
● An is_current or current_flag set to true.
Teksystems - query : find
and delete duplicate rows Two
methods to find and delete duplicate rows
-=================================
1. Delete all rows other than min
id value ..
(used when id column is not duplicate but other
columns values are duplicate)
===========================================
Method 1. using id Column (if ID
column is unique, but title is duplicate)
================================
delete films
where film_id
not in (select min(film_id) from films group by
title, rel_date)
===========================================
Method 2. using CTE ( Two steps)
============================================
step 1 :
with CTE as
(select id ,name , email , row_number()
over (partition by name , email order by name ,
email ) rownum
from sales
step 2 :
Delete from CTE where rownum > 1
===========================================
Method 3. using ROWID
============================================
Delect from table1
where rowid not in (select
min(rowid) from table 1
group by col1,col2)
========================================================================
what is inner join and outer
join count of below
========================================================================
A
B
1
1
1
1
2
2
2
3
null
null Inner join = count of
matching rows
outer join = inner join + left side
unmatched
inner = 6 rows
left outer = 6 + 1 = 7 rows
=======================================================================
53 . How to write the
INNER JOIN ,OUTER JOIN QUERIES
=======================================================================
-- //
Create database //
use db1;
show tables;
create table t1 ( id int , name
varchar(44) ) ;
create table t2 ( id int , name
varchar(44) ) ;
--
// insert data into tables //
insert into t1 values( 1,
"x1");
insert into t1 values( 2,
"x2");
insert into t1 values( 3,
"x3");
insert into t1 values( 4,
"x4");
select * from t1;
insert into t2 values( 1,
"x1");
insert into t2 values( 2,
"x2");
insert into t2 values( 5,
"x5");
insert into t2 values( 6,
"x6");
select * from t2;
------------------
-- // inner join //
select * from t1 join t2 on t1.id = t2.id
-- // left join //
select * from t1 left join t2 on t1.id =
t2.id
-- // Right join //
select * from t1 right join t2 on t1.id =
t2.id
- // fuill outer join //
select * from t1 left join t2 on t1.id = t2.id
union
select * from t1 right join t2 on t1.id =
t2.id
================================================================
54 . How to find cumulative check - if employee got hike or not
================================================================
-- --// create employee table // --
create table employee (employee_id
int ,name varchar(44),
monthofsalary date,
salary int);
-- alter table employee rename column month to monthOfSalary;
-- --// insert data to employee
table // --
-- truncate table employee;
insert into employee
values(1,"john",'2004-01-01',13000);
insert into employee
values(1,"john",'2004-02-01',13000);
insert into employee
values(1,"john",'2004-03-01',13000);
insert into employee
values(1,"john",'2004-04-01',13000);
insert into employee
values(1,"john",'2004-05-01',13000);
insert into employee
values(1,"john",'2004-06-01',13000);
insert into employee
values(1,"john",'2004-07-01',14000);
insert into employee
values(1,"john",'2004-08-01',15000);
insert into employee
values(1,"john",'2004-09-01',15000);
-- --// show data in employee table
// --
select
employee_id,name,month(monthofsalary),salary from employee;
-- -------------// query to find
cumulativ // ------------------------------
WITH salarychanges
AS (SELECT employee_id,
name,
monthofsalary,
salary,
Lag(salary, 1)
over (
PARTITION BY employee_id
ORDER BY monthofsalary) AS previous_salary
FROM employee)
SELECT employee_id,
name,
monthofsalary,
salary,
CASE
WHEN salary > previous_salary THEN 'yes'
ELSE 'no'
END
AS salary_hike,
SUM(salary)
over (
PARTITION BY employee_id
ORDER BY monthofsalary ROWS BETWEEN unbounded preceding AND CURRENT
ROW) AS
cumulative_salary
FROM salarychanges;
55. How to find if a single
user purchased same product on different dates
create
table orders( userid int ,prodid int, orderdate date);
insert into orders values ( 1,101,'2024-1-1');
insert into orders values ( 1,101,'2024-1-2');
insert into orders values ( 1,101,'2024-1-1');
insert into orders values ( 2,102,'2024-1-1');
insert into orders values ( 2,102,'2024-1-1');
select * from orders;
--
--------------------------------------------------------------------------------
-- // count of
users who purchased same product on different dates //
--
--------------------------------------------------------------------------------
SELECT
userid, prodid, COUNT(DISTINCT orderdate) AS
Noofdays_purchased
FROM
orders GROUP BY userid,prodid;
================================================================
SQL - show highest salary in each department
================================================================
================================================================
SQL - count(*) , count(column) , count(1)
================================================================
select count(*),count(distinct deptid) from emp;
================================================================
-- BOTH COUNT(*) AND COUNT(1) ARE SAME BUT COUNT(1) IS FASTER
================================================================
===========================================================
==========================================================
================================================================
Query TASKS lists
================================================================
10.Display duplicate Records
11.Display Top 4th highest Salary
12.Find 2nd lowest salary
13.Display Top Two salaries in each
department wise
14.Find each department wise sum of
salaries
15.Display maximum salary department wise
along with employee details
16.to display 5 to 7 rows from a table
================================================================
SQL FUNCTIONS
================================================================
'
================================================================
SQL - CONSTRAINTS
================================================================
● Not NULL
● Unique
● Primary key
● foreign Key
● Check (check
columnname > 0 )
================================================================
sql query : TO get highest ordered products in each state
==============================================================
CREATE DATABASE retail_db;
USE retail_db;
--
----------------------------------------------
CREATE TABLE orders_table
(
order_id INT NOT NULL,
customer_id INT
NOT NULL,
product_id INT NOT NULL,
quantity INT NOT NULL,
price DECIMAL NOT NULL
);
CREATE TABLE customer_table
(
customer_id INT NOT NULL,
customer_name VARCHAR(50),
state_name VARCHAR(50)
);
--
----------------------------------------------
INSERT INTO orders_table VALUES
(
101,1001,202,4,1000
) ;
INSERT INTO orders_table VALUES
(
102,1002,203,5,2000
);
INSERT INTO orders_table VALUES
(
103,1003,204,6,3000
);
INSERT INTO orders_table VALUES
(
104,1004,205,7,4000
);
SELECT * FROM orders_table;
--
----------------------------------------------
INSERT INTO customer_table VALUES
(
1001,
'Liam',
' Alabama'
);INSERT INTO customer_table VALUES
(
1002,
'Noah',
' Alaska'
);INSERT INTO customer_table VALUES
(
1003,
'Oliver',
' Arizona'
);INSERT INTO customer_table VALUES
(
1004,
'James',
' Arkansas'
);SELECT *
FROM customer_table;
--
----------------------------------------------
TRUNCATE
customer_table ;
--
----------------------------------------------
SELECT c.state_name,
o.product_id,
Count(o.order_id) AS highest_order_count
FROM
orders_table o
JOIN
customer_table c
ON
o.customer_id = c.customer_id
GROUP BY c.state_name,
o.product_id
ORDER BY c.state_name,
highest_order_count DESC
--
=========================================================
-- ==
OR TRY BELOW === ---
WITH ordercounts
AS (SELECT c.state_name,
o.product_id,
Count(o.order_id) AS order_count,
Rank()
OVER (
partition BY c.state_name
ORDER BY Count(o.order_id) DESC) AS ranka
FROM orders_table o
JOIN customer_table c
ON o.customer_id = c.customer_id
GROUP BY c.state_name,
o.product_id)
SELECT state_name,
product_id,
order_count
FROM ordercounts
WHERE ranka = 1;
--
----------------------------------------------
VARCHAR AND NVARCHAR difference
In database contexts, VARCHAR and NVARCHAR are data types used to store variable-length text strings.
The key difference lies in their encoding:
VARCHAR uses single-byte character sets (like ASCII),
while NVARCHAR
uses Unicode, allowing for
a wider range of characters including those from multiple languages. NVARCHAR typically requires more storage space and has a shorter maximum
character length compared to VARCHAR
what is most likely reason that sql query may run slowly
|
Possible Reasons for Slow SQL Queries 1. Lack of Indexes: I 2. Complex Joins: 3. Large Data Volume:
4. Inefficient Query Structure: 5. Database Configuration: Configuration settings such as memory
allocation, cache size, and connection limits can affect performance. 6. Server Load: High server load due to concurrent queries or insufficient resources
(CPU, memory, I/O) can slow down execution times. 7. Network Latency: If the database is remote, network
latency can add to the time it takes to execute a query. 8. Statistics Outdated: Database optimizers rely on
statistics to create execution plans. 9.
Locking and Blocking: If a query is waiting for locks held by other transactions, it can
slow down execution. Types of Joins and Their Impact on
Execution Time 1. INNER JOIN: Generally efficient if both tables have indexes on the join columns.
2. LEFT JOIN (or RIGHT JOIN): These can be slower than INNER JOINS
because they need to return all records f 3. FULL OUTER JOIN: Typically the slowest join type, as
it combines the results o 4. CROSS JOIN: This join produces a Cartesian product, It's usually slower and should be used
with caution. 5.
SELF JOIN: While not
inherently slow, performance Optimization Tips ●
Indexing: ●
Analyze Execution Plans: Use tools provided by your database
(like EXPLAIN in PostgreSQL) to ●
Refactor Queries: Simplify complex queries, ●
●
Monitor Server Performance: |
types of indexes 53
|
1. NORMAL INDEX 2. BITMAP INDEX 3. UNIQUE & NON UNIQUE INDEX 4. CLUSTERED & NON CLUSTERED 5. PARTITIONED & NON PARTITIONED 6. EXPRESSION BASED 7. B- TREE |
-
sql query : find and delete duplicate rows
Delete all rows other than min id value ..(used when id
column is not duplicate but other columns values are duplicate)
===========================================
Method 1. using id Column (if ID column is unique, but
title is duplicate)
================================
delete films
where film_id
not in (select min(film_id) from films group by title,
rel_date)
===========================================
Method 2. using CTE ( Two steps)
============================================
step 1 :
with CTE as
(select id ,name , email , row_number()
over (partition by name , email order by name , email )
rownum
from sales
step 2 :
Delete from CTE where rownum > 1
===========================================
Method 3. using ROWID
============================================
Delect from table1
where rowid not in (select min(rowid) from table 1
group by col1,col2)
SQL : query to show Total salary of each department
SELECT
Id,
empname, deptid, Salary,
SUM(Salary) OVER(PARTITION BY deptid) AS
SUM_SAL
FROM
emp
ORDER BY id
===========================
# Id, empname, deptid, Salary, SUM_SAL
'1', 'john', '1', '20000', '50000'
'2', 'kem', '1', '30000', '50000'
'3', 'chan', '2', '10000', '30000'
'4', 'henry', '2', '20000', '30000'
'5', 'bill', '3', '30000', '30000'
'6', 'salt', '4', '20000', '20000'
SQL : query to sum of
salary by dept and
cumulative salary by each dept
SELECT deptid,salary ,
Sum(salary)
OVER(
partition BY deptid
ORDER BY id) AS CUMulative_SAL
FROM emp
ORDER BY id
===============================
# deptid, salary,
CUMulative_SAL
'1', '20000', '20000'
'1', '30000', '50000'
'2', '10000', '10000'
'2', '20000', '30000'
'3', '30000', '30000'
'4', '20000', '20000'
How do you do Data model optimization ?
Data model optimization involves refining
a data model to improve its performance, maintainability, and scalability. Here
are some essential techniques to consider:
1.
Normalization and Denormalization:
●
Normalization:
Organize data to eliminate redundancy and dependency by splitting tables and
defining relationships.
●
Denormalization:
Combine tables to improve query performance, especially in read-heavy systems,
such as reporting systems.
2.
Indexing:
●
Use indexes (e.g., clustered,
non-clustered, or composite) to accelerate data retrieval and improve query
performance.
●
Be cautious with excessive indexing, as
it can impact write operations.
3.
Partitioning:
●
Divide large tables into smaller, more
manageable chunks (e.g., by date or region).
●
Enhances performance for queries
targeting specific partitions.
4. Data
Caching:
●
Use caching mechanisms to store
frequently accessed data in memory for faster retrieval.
●
Tools like Redis or Memcached can reduce
database load.
5.
Schema Optimization:
●
Simplify the schema design, ensuring it
aligns with the specific requirements of the application.
●
Use schemas like star or snowflake for
analytical workloads.
6.
Materialized Views:
●
Precompute and store complex query
results to improve performance for frequently executed queries.
7.
Query Optimization:
●
Refactor queries to eliminate unnecessary
joins or subqueries.
●
Avoid selecting unnecessary columns (SELECT * should
be avoided).
8. Data
Types:
●
Choose appropriate data types for each
column to reduce storage size and improve performance.
9.
Archiving and Purging:
●
Move historical or less-used data to
archive storage, ensuring the active database operates efficiently.
10.
Concurrency Control:
●
Ensure your database can handle multiple
simultaneous transactions without bottlenecks or conflicts.
11.
Monitoring and Feedback:
●
Use database performance monitoring tools
to identify and address bottlenecks, such as long-running queries or
underutilized indexes.
what is most likely reason that sql query may run slowly
|
Possible Reasons for Slow SQL Queries 1. Lack of Indexes: I 2. Complex Joins: 3. Large Data Volume:
4. Inefficient Query Structure: 5. Database Configuration: Configuration settings such as memory
allocation, cache size, and connection limits can affect performance. 6. Server Load: High server load due to concurrent queries or insufficient resources
(CPU, memory, I/O) can slow down execution times. 7. Network Latency: If the database is remote, network
latency can add to the time it takes to execute a query. 8. Statistics Outdated: Database optimizers rely on
statistics to create execution plans. 9.
Locking and Blocking: If a query is waiting for locks held by other transactions, it can
slow down execution. Types of Joins and Their Impact on
Execution Time 1. INNER JOIN: Generally efficient if both tables have indexes on the join columns.
2. LEFT JOIN (or RIGHT JOIN): These can be slower than INNER JOINS
because they need to return all records f 3. FULL OUTER JOIN: Typically the slowest join type, as
it combines the results o 4. CROSS JOIN: This join produces a Cartesian product, It's usually slower and should be used
with caution. 5.
SELF JOIN: While not
inherently slow, performance Optimization Tips ●
Indexing: ●
Analyze Execution Plans: Use tools provided by your database
(like EXPLAIN in PostgreSQL) to ●
Refactor Queries: Simplify complex queries, ●
●
Monitor Server Performance: |
Query optimization best
practices
When optimizing queries, best
practices include:
+ using indexes effectively,
+ avoiding SELECT * from ... instead use limit 10 ,
+ filtering data early in the WHERE clause,
+ choosing appropriate JOIN types,
+ minimizing subqueries ,
+ utilizing stored procedures,
+ monitoring query performance time,
+ and selecting the correct data typeis instead of varchar and
conversion
+ always review and analyze execution plans to identify areas for
improvement.
Here
are the
best practices for SQL query
optimization summarized in one-liners:
●
Indexes: Apply indexes strategically to
columns used in filtering, joining, and sorting to accelerate data retrieval.
●
WHERE
vs. HAVING: Prioritize the WHERE clause
to filter rows before aggregation, as it is significantly more efficient than HAVING.
●
Avoid
Loops: Replace iterative row-by-row processing with set-based bulk operations
to minimize overhead.
●
Selective
SELECT: Retrieve only necessary columns by avoiding SELECT * to reduce network traffic and database I/O.
●
EXPLAIN
Plan: Use the EXPLAIN keyword to analyze and debug execution
paths to identify performance bottlenecks.
●
Wildcard
Placement: Place wildcards at the end of search strings (e.g., LIKE 'Value%') to allow the database to utilize indexes
effectively.
●
EXISTS
over COUNT: Utilize EXISTS() to
confirm record presence, as it terminates immediately upon finding the first
match.
●
Avoid
Cartesian Products: Always include specific JOIN conditions to prevent unintentional
cross-joins that generate massive, unnecessary datasets.
●
Denormalization:
Selectively introduce redundancy to reduce the number of costly JOIN operations for read-heavy workloads.
●
Optimize
JOINs: Ensure all join columns are properly indexed and select the most
efficient JOIN type for your data relationship.
best practices for data
modeling summarized in one-liners:
●
Grain: Define the lowest level of granularity
first to ensure the model can support both atomic details and summarized
insights.
●
●
Naming: Establish and strictly adhere to a
consistent naming convention, utilizing schemas to logically organize
namespaces by business unit or data source.
●
●
Materialization: Carefully balance the trade-off between
tables (precomputed, faster performance) and views (real-time, potentially
slower performance) based on your specific use case.
●
●
Permissions
and Governance:
Collaborate with security teams to ensure the model strictly complies with
industry-specific regulations and data privacy requirements.
Project
based questions
---
1-- .
Hospital project data model
1 . About Hospital project
The
Requirement:
"The
main goal was to build a system that helps a hospital handle things like
patient check-ins, doctor schedules, and medical records.
The
biggest challenges were making sure the data stays accurate across different
departments, managing tricky connections—like linking many patients to
many different medications—and keeping sensitive patient information safe
and private.”
The
Design Approach:
"To handle this, I broke the design process down into a few simple steps:
●
Mapping
it out: I started by identifying the
main parts of the system, like
■
Patients
■
Doctors,
■
Appointments,
■
and
Departments.
■
○
Diagram
■
I drew
a diagram to see how they all connect, which helped me track a patient’s
journey from the moment they arrive to when they leave.
●
Keeping
it organized:
■
I
organized the data with normalization to avoid duplication. For example, I kept
doctor details in one place and appointment records in another. This way, if a
doctor’s info changes, we only have to update it in one spot, which prevents
mistakes.
●
Handling
connections:
■
For complex parts, like connecting patients to
their medications, I used 'bridge' tables. This allowed me to keep track of the
relationship while also saving extra details like the dosage and how often the
medicine needs to be taken.
●
Setting
clear rules:
■
Finally, I added built-in rules to the
database. Things like making sure two appointments don't overlap for the same
doctor or ensuring a patient ID is always there. This keeps the data clean and
prevents the system from saving wrong or incomplete information.”
2 . The Patient Visit Lifecycle
To
design a system that truly works, I mapped out the end-to-end journey of a
patient. By understanding these specific stages, I could ensure the database
captured the right data at every touchpoint. Here is how I explained that
lifecycle during the project:
- Registration (Front Desk): The lifecycle begins when a patient
arrives. We capture demographics, insurance details, and the reason for
the visit. This creates a record in the Patients
table and initializes an Encounters
record to track the visit's start.
- Triage & Clinical Examination: The patient meets with nursing staff
for vitals (blood pressure, temperature, etc.) and then the doctor. The
doctor records observations, diagnoses, and orders. These are stored in Encounters, Medical_Records,
and Diagnosis tables.
- Advanced Procedures (Surgeries/Labs): If the doctor determines a surgery or
lab work is needed, a new record is created in Surgery_Schedules
or Lab_Orders.
○
Surgery: We track the operating room, the surgical
team, and anesthesia details.
○
Lab: Results are logged in Lab_Results once the test is completed, linking
back to the original Encounters ID.
- Pharmacy & Medication: If medication is prescribed, the
doctor creates a record in the Prescriptions
table. This links the specific drug, dosage, and duration to the patient.
The pharmacy team updates the status of these prescriptions to
"Fulfilled" or "Dispatched."
- Discharge & Final Billing: Once the treatment is complete, the Encounters status is updated to
"Closed." The system automatically aggregates all
charges—consultation fees, surgery costs, lab tests, and medications—into
the Billing_Invoices table.
- Payment Processing: The final stage is financial. The Payments table tracks the amount paid by the
patient and the amount covered by their insurance provider, effectively
closing out the financial lifecycle of that specific visit.
Data Traceability
By
following this lifecycle, I ensured that every Patient_ID
and Encounter_ID acted as the
"thread" that connects all these tables. This allowed us to generate
a complete history for any patient, showing exactly what happened from the
moment they walked in the front door to the moment they paid their bill.
3 . How
we identified the entities
The
Requirement:
"The main goal was to build a system that helps a hospital handle things
like patient check-ins, doctor schedules, and medical records. The biggest
challenges were making sure the data stays accurate across different
departments, managing tricky connections—like linking many patients to many
different medications—and keeping sensitive patient information safe and
private."
How the
Client Explained Business Processes:
"I started by sitting down with the hospital staff to map out their daily
workflow. They walked me through the lifecycle of a patient visit—from the
front desk registration, through the examination with a doctor, to the final
billing and pharmacy stage. By listening to their 'pain points'—like scheduling
conflicts and missing patient history—I was able to identify exactly which data
points needed to be captured at each step."
How We
Identified Entities:
"To turn those workflows into a model, I looked for the 'nouns' in their
descriptions. If they talked about a 'doctor,' 'patient,' 'room,' or
'medication,' those became my primary entities. I then identified the actions
connecting them (like 'schedules,' 'visits,' or 'prescribes') to determine the
relationships between those entities."
Designing
the Conceptual Model (Facts and Dims): "For the analytical side of the
project, I designed a star schema. I treated 'Appointments' or 'Medical
Procedures' as the Fact table, which contains the quantitative data
(like visit counts or total costs). Around this, I created Dimension tables
for 'Patients,' 'Doctors,' 'Time,' and 'Department.' This setup allows the
hospital to easily pull reports, like 'How many patients did Dr. Smith see in
the Pediatrics department last month?'"
Tables
in OLTP vs. OLAP:
"I separated the databases to ensure high performance:
●
OLTP
(Operational Database): I used
about 15 to 20 tables. This was designed for day-to-day operations—fast
inserts and updates for things like checking in a patient or updating a record.
●
OLAP
(Analytics/Data Warehouse): I
simplified this down to about 6 to 8 core tables. By using a star
schema, we reduced the number of joins, making it much faster for leadership to
run complex queries and generate business intelligence reports."
The
Design Approach:
"To handle this, I broke the design process down into a few simple steps:
Mapping
it out: I used a diagram to see how
all parts connect, tracking a patient’s journey from the moment they arrive to
when they leave.
Keeping
it organized: I
organized the data to avoid duplication, ensuring that if a doctor’s info
changes, we only update it in one spot.
Handling
connections: For
complex parts, like connecting patients to their medications, I used 'bridge'
tables to store extra details like dosage and frequency.
Setting
clear rules: I
added built-in rules, like preventing overlapping appointments, to keep the
data clean and error-free."
4 .
Tables list of OLTP and OLAP
4.1. ) OLTP tables list : Operational
Database (approx. 15–20 tables)
These
tables are normalized to ensure there is no data redundancy, making it easy for
staff to update records without errors.
- Patients (Personal info)
- Doctors (Staff credentials)
- Departments (Hospital wards/units)
- Appointments (Date, time, status)
- Medical_History (Chronic conditions)
- Encounters (Each specific visit)
- Staff_Roles (Job titles/permissions)
- Rooms (Bed/room availability)
- Medications (Drug catalog)
- Prescriptions (The link between patients and drugs)
- Lab_Tests (Available test types)
- Lab_Results (Individual patient test outcomes)
- Billing_Invoices (Financial records)
- Payments (Transaction logs)
- Insurance_Providers (Insurance company details)
- Patient_Insurance (Linking patients to policies)
- Schedules (Doctor shift patterns)
- Referrals (Records of outside consults)
4.2 ) OLAP tables list: Data Warehouse ( facts, dims )
(6–8 core tables)
For the
analytical model (Star Schema), we centralize the data. We use Fact tables
to hold the "numbers" (metrics) and Dimension tables to hold
the "descriptions" (filters).
4.2.1) Fact Tables (The focus of your
analytics):
- Fact_Appointments: (KPIS : Count of visits, wait times)
- Fact_Lab_Results: (
: Normal/abnormal count, test duration)
- Fact_Prescriptions: (
: Medication frequency, prescription volume)
- Fact_Billing: (
: Total revenue, insurance coverage amounts)
- Fact_Bed_Occupancy: (
: Duration of stay, room turnover)
5 .
KPIS and FACTS
●
Fact_Appointments
○
KPIs:
Count of Appointments, No-show Rate, Average Wait Time, Total Scheduled
Duration.
○
Dimensions: Dim_Date,
Dim_Patient, Dim_Provider,
Dim_Department, Dim_Location.
●
Fact_Lab_Results
○
KPIs:
Count of Tests Performed, Average Turnaround Time, Abnormal Result Rate.
○
○
Dimensions: Dim_Date,
Dim_Patient, Dim_Provider,
Dim_Test_Type
.
●
Fact_Prescriptions
○
KPIs: Count of Prescriptions, Total Medication
Cost, Average Dosage Frequency.
○
Dimensions: Dim_Date,
Dim_Patient, Dim_Provider,
Dim_Medication.
●
Fact_Billing
○
KPIs:
Total Billed Amount, Total Collected Amount, Outstanding Balance, Insurance
Reimbursement Rate.
○
Dimensions: Dim_Date,
Dim_Patient, Dim_Department,
Dim_Insurance.
●
Fact_Bed_Occupancy
○
KPIs:
Bed Occupancy Rate, Average Length of Stay (ALOS), Patient Turnover Rate.
○
Dimensions: Dim_Date,
Dim_Patient, Dim_Department,
Dim_Location.
6 . KPI Definitions
Average Wait Time: Measures the average time a patient spends
waiting.
No-show Rate: Tracks the percentage of missed
appointments.
Average Turnaround Time: Measures the time from
ordering a lab test to getting results.
Abnormal Result Rate: Tracks the percentage of lab results
outside of normal ranges.
Total Medication Cost: Calculates the total cost of all prescribed
medications.
Average Dosage Frequency: Tracks how often a medication
is administered to patients.
Total Billed Amount: Measures total revenue from all hospital
services.
Insurance Reimbursement Rate: Calculates the percentage of
bills covered by insurance.
Bed Occupancy Rate: Measures the percentage of available beds
currently in use.
Average Length of Stay (ALOS): Tracks the average number of
days a patient stays in the hospital.
---
\\ 2 . Insurance project and how you build data
model
1 . Insurance Project Data Model
Specification
1.1 About the Insurance Project
The core objective was to build a system
capable of managing the entire insurance lifecycle—from policy enrollment to
complex claims adjudication and provider reimbursement. The primary challenges
included managing longitudinal patient histories, enforcing complex business
rules for coverage verification, and ensuring strict compliance with healthcare
privacy regulations.
1.2Design Approach
●
Mapping the Lifecycle: We identified key touchpoints: Enrollment, Policy Issuance, Provider
Network Verification, Claims Submission, Adjudication, and Payment
Reconciliation.
●
Normalization (OLTP): Data is organized into 3rd Normal Form (3NF) to eliminate redundancy.
Each entity (Patient, Provider, Policy) exists in a single source of truth,
ensuring that updates to records (e.g., policy renewal) propagate correctly.
●
Handling Complex Relationships: We utilized "bridge" tables (e.g., Claim_Line_Items, Patient_Insurance) to handle many-to-many relationships, storing specific attributes
like coverage ratios or procedure-level costs.
●
Business Rules & Integrity: Constraints such as effective_date < expiry_date and
mandatory Provider_ID association ensure clean, audit-ready data.
2. The
Patient-Insurance Lifecycle
To ensure the database captures every
relevant touchpoint, we mapped the patient’s financial journey:
- Enrollment: Patient/Member demographic
data is captured and linked to an Insurance_Provider
record.
- Policy Activation: A Policy is generated, defining coverage limits and plan types (HMO/PPO).
- Claim Submission: During an Encounter, the provider submits a claim header with service details.
- Adjudication: The system checks the claim
against Plan rules (deductibles, out-of-pocket maximums).
- Billing & Reimbursement: Billing_Invoices are generated, tracking patient responsibility vs. insurance
payments.
- Payment Processing: The Payments table logs the final closure of the financial lifecycle for that
visit.
3. Insurance
Data Model Entities
3.1
OLTP Tables List (Operational Database)
Data model image
Designed for high-concurrency inserts and
updates (approx. 15–20 tables).
●
Members/Patients: Personal info, eligibility status.
●
Insurance_Providers: Company credentials, contact details.
●
Plans:
Master product definitions, deductible logic.
●
Policies: Contract records linking members to plans.
●
Providers: Practitioner NPI details, network status.
●
Claims: Header information for service requests.
●
Claim_Line_Items: Procedure-level details (CPT/HCPCS codes).
●
Adjudication_Logs: Status history (Submitted, Review, Paid, Denied).
●
Billing_Invoices: Aggregated financial charges.
●
Payments: Transaction logs (Patient vs. Insurance).
●
Authorizations: Records for pre-approved procedures.
●
Networks: Mapping of providers to specific plan networks.
3.2
OLAP Tables List (Data Warehouse)
Designed for fast complex analytical
queries (Star Schema).
Fact Tables (Metrics)
●
Fact_Claims: Claims processed, denial rates, total billed vs. allowed amounts.
●
Fact_Payments: Revenue collected, reimbursement turnaround times.
●
Fact_Enrollment: Membership growth, churn rates, policy renewal frequency.
Dimension Tables (Filters)
●
Dim_Date: Temporal analysis (Fiscal year, quarter, month).
●
Dim_Member: Demographic segmentation (Age, region, plan type).
●
Dim_Provider: Network status, specialty, geographical location.
●
Dim_Plan: Product tier, coverage types.
4. KPIs and
Analytics
|
Fact
Table |
Key
Performance Indicators (KPIs) |
|
Fact_Claims |
Denial Rate, Average Adjudication Time,
Total Billed Amount |
|
Fact_Payments |
Reimbursement Rate, Outstanding
Balances, Provider Payouts |
|
Fact_Enrollment |
Member Retention Rate, Policy Renewal
Frequency, Plan Utilization Rate |
5. Data
Traceability & Integrity
Every Member_ID and Policy_ID acts as a unique thread connecting operational data across systems.
This allows for full auditability, enabling the hospital/insurance entity to
generate a complete history for any claim—from the initial service order to the
final financial settlement.
6. KPIS and
FACTS
Fact_Appointments
●
KPIs:
Count of Appointments, No-show Rate, Average Wait Time, Total Scheduled
Duration.
●
Dimensions: Dim_Date, Dim_Patient, Dim_Provider, Dim_Department, Dim_Location.
Fact_Lab_Results
●
KPIs:
Count of Tests Performed, Average Turnaround Time, Abnormal Result Rate.
●
Dimensions: Dim_Date, Dim_Patient, Dim_Provider, Dim_Test_Type.
Fact_Prescriptions
●
KPIs:
Count of Prescriptions, Total Medication Cost, Average Dosage Frequency.
●
Dimensions: Dim_Date, Dim_Patient, Dim_Provider, Dim_Medication.
Fact_Billing
●
KPIs:
Total Billed Amount, Total Collected Amount, Outstanding Balance, Insurance
Reimbursement Rate.
●
Dimensions: Dim_Date, Dim_Patient, Dim_Department, Dim_Insurance.
Fact_Bed_Occupancy
●
KPIs:
Bed Occupancy Rate, Average Length of Stay (ALOS), Patient Turnover Rate.
●
Dimensions: Dim_Date, Dim_Patient, Dim_Department, Dim_Location.
6. KPI
Definitions
●
Average Wait Time: Measures the average time a patient spends waiting.
●
No-show Rate: Tracks the percentage of missed appointments.
●
Average Turnaround Time: Measures the time from ordering a lab test to getting results.
●
Abnormal Result Rate: Tracks the percentage of lab results outside of normal ranges.
●
Total Medication Cost: Calculates the total cost of all prescribed medications.
●
Average Dosage Frequency: Tracks how often a medication is administered to patients.
●
Total Billed Amount: Measures total revenue from all hospital services.
●
Insurance Reimbursement Rate: Calculates the percentage of bills covered by insurance.
●
Bed Occupancy Rate: Measures the percentage of available beds currently in use.
●
Average Length of Stay (ALOS): Tracks the average number of days a patient stays in the hospital.
---
\\ 3 . Bank project
data model
1. About the
Banking Project
The core objective was to build a system
capable of managing core banking operations—from customer account onboarding to
complex transaction processing and loan lifecycle management. The primary
challenges included ensuring high-frequency transaction consistency,
maintaining strict security for financial records, and enforcing regulatory
compliance (e.g., AML, KYC).
Design
Approach
●
Mapping the Lifecycle: We identified key touchpoints: Account Opening, Deposits/Withdrawals,
Fund Transfers, Loan Origination, and Interest Accrual.
●
Normalization (OLTP): Data is organized into 3rd Normal Form (3NF) to maintain a single
source of truth for customers, accounts, and ledger entries, ensuring atomic
transaction integrity.
●
Handling Complex Relationships: We utilized "bridge" tables (e.g., Account_Signatories, Transaction_Categories) to handle many-to-many relationships, storing detailed audit logs for
every movement of funds.
●
Business Rules & Integrity: Constraints such as minimum_balance_check and daily_limit_validation ensure financial safety and prevent unauthorized overdrafts.
2. The
Customer-Account Lifecycle
- Onboarding (KYC): Capture customer
demographic data and verify identification records.
- Account Creation: A Savings or Checking account is initialized with
unique routing and account numbers.
- Transaction Execution: Deposits,
withdrawals, or transfers are processed, triggering ledger updates in
real-time.
- Interest & Fee Calculation:
End-of-period processes calculate interest and service charges based on
account balance rules.
- Statement Generation: Periodic summary of
account activity is generated for the customer.
- Closure/Retention: Account status is
updated based on dormancy or customer request, finalizing the lifecycle.
3. Bank Data
Model Entities
3.1
OLTP Tables List (Operational Database)
●
Customers: KYC info, contact details, identity verification status.
●
Accounts: Account type (Savings/Checking), balance, status, open date.
●
Transactions: Ledger of all deposits, withdrawals, and transfers.
●
Loans:
Loan terms, interest rates, repayment schedules.
●
Repayments: Payment logs against outstanding loan balances.
●
Branch: Geographic and operational banking locations.
●
Employees: Staff credentials, role-based access permissions.
●
Audit_Logs: Immutable history of all record changes for security compliance.
3.2
OLAP Tables List (Data Warehouse)
Fact Tables (Metrics)
●
Fact_Transactions: Count of transfers, total transaction volume, fee revenue.
●
Fact_Deposits: Daily balance snapshots, deposit inflow/outflow.
●
Fact_Loans: Outstanding principal, interest earned, delinquency rates.
Dimension Tables (Filters)
●
Dim_Date: Time-based analysis (Date, Month, Quarter, Fiscal Year).
●
Dim_Customer: Demographic segmentation (Age group, income tier, risk profile).
●
Dim_Account_Type: Product classification (Retail, Corporate, Student).
●
Dim_Branch: Branch performance and regional clusters.
4. KPIs and
Analytics
|
Fact
Table |
Key
Performance Indicators (KPIs) |
|
Fact_Transactions |
Transaction Volume, Average Transaction
Value, Processing Error Rate |
|
Fact_Deposits |
Net Deposit Growth, Average Daily
Balance, Liquidity Ratio |
|
Fact_Loans |
Delinquency Rate, Loan Approval Rate,
Interest Income Ratio |
5. KPIS and
FACTS
Fact_Transactions
●
KPIs:
Count of Transactions, Total Transaction Volume, Transaction Error Rate,
Average Transaction Value.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Account_Type, Dim_Branch.
Fact_Savings_Accounts
●
KPIs:
Average Daily Balance, Interest Earned, Total Deposits, Total Withdrawals.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Account_Type, Dim_Branch.
Fact_Loan_Performance
●
KPIs:
Total Principal Outstanding, Interest Income, Default Rate, Loan Approval
Volume.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Loan_Type, Dim_Branch.
Fact_Customer_Activity
●
KPIs:
New Account Openings, Account Closure Rate, Total Customer Lifetime Value
(CLV).
●
Dimensions: Dim_Date, Dim_Customer, Dim_Customer_Segment, Dim_Branch.
6. KPI
Definitions
●
Transaction Volume: Total number of successful transactions within a period.
●
Transaction Error Rate: Percentage of failed or disputed transactions.
●
Average Daily Balance: The mean balance across a specified period.
●
Delinquency Rate: Percentage of loans with payments overdue by more than 30/60/90 days.
●
Net Deposit Growth: Difference between total deposits and withdrawals over time.
●
Loan Approval Rate: Percentage of loan applications approved against total applications.
●
Interest Income: Revenue generated from interest-bearing loan portfolios.
●
Customer Lifetime Value (CLV): Total predicted revenue attributed to a single customer relationship.
●
Account Closure Rate: Percentage of accounts closed relative to total active accounts.
●
Liquidity Ratio: Ratio of liquid assets available to cover immediate deposit
withdrawals.
---
\\ 4 . Food delivery project data model
Explain about your project
1. About the Food Delivery Project
The
core objective was to build a system capable of managing a multi-sided
marketplace—connecting customers, restaurants, and delivery drivers in
real-time. The primary challenges included managing high-concurrency order
placement, real-time location tracking for drivers, and ensuring fair
commission structures and timely payouts.
Design Approach
●
Mapping
the Lifecycle: We
identified key touchpoints: User Registration, Restaurant Menu Management,
Order Placement, Dispatch/Routing, Order Fulfillment, and Driver Payouts.
●
Normalization
(OLTP): Data is organized into 3rd
Normal Form (3NF) to ensure consistency between customer profiles, restaurant
menus, and order status tracking, preventing inventory and delivery errors.
●
Handling
Complex Relationships: We
utilized "bridge" tables (e.g., Order_Items, Driver_Earnings) to
manage many-to-many relationships, storing detailed attributes like item
modifiers, delivery distance, and variable service fees.
●
Business
Rules & Integrity:
Constraints such as order_status_progression
and delivery_window_validation ensure that orders move logically through
the lifecycle without status conflicts.
2. The Order Fulfillment Lifecycle
- User Onboarding: Customer registration with saved
addresses and payment methods.
- Menu Management: Restaurants publish menus with prices,
descriptions, and availability status.
- Order Placement: Customers browse and select items,
initiating an Order
record with a specific status ("Pending").
- Dispatch: The system assigns a nearby driver to
the order based on geographic proximity and availability.
- Fulfillment: Real-time tracking of the order from
restaurant pickup to customer doorstep.
- Financial Settlement: The Billing stage calculates the order total,
delivery fee, restaurant commission, and driver earnings.
3. Food Data Model
Entities
Data model image
3.1 OLTP Tables List (Operational Database)
Based on your requirements, here is the
list of database tables:
●
Customers
●
Customer_Addresses
●
Restaurants
●
Restaurant_Hours
●
Menu_Categories
●
Menu_Items
●
Orders
●
Order_Items
●
Item_Modifiers
●
Drivers
●
Deliveries
●
Payments
●
Commission_Splits
●
Driver_Payouts
●
3.2 OLAP Tables List (Data Warehouse)
Fact
Tables (Metrics)
●
Fact_Orders: Total volume, average order value,
cancellation rates.
●
Fact_Delivery_Performance: Delivery times, driver performance metrics,
fulfillment efficiency.
●
Fact_Restaurant_Revenue: Commission earnings, daily sales volume,
top-selling items.
Dimension
Tables (Filters)
●
Dim_Date: Temporal analysis (Date, Hour, Day of Week,
Month).
●
Dim_Customer: Demographic segmentation (Frequency of use,
location).
●
Dim_Restaurant: Cuisine type, geographical region.
●
Dim_Driver: Driver performance tier, vehicle type.
4. KPIs and Analytics
|
Fact Table |
Key Performance Indicators (KPIs) |
|
Fact_Orders |
Order
Volume, Average Order Value, Order Cancellation Rate |
|
Fact_Delivery_Performance |
Average
Delivery Time, On-time Arrival Rate, Driver Utilization Rate |
|
Fact_Restaurant_Revenue |
Total
Gross Merchandise Value (GMV), Commission Revenue |
5. KPIS and FACTS
Fact_Orders
●
KPIs: Count of Orders, Average Order Value, Order
Cancellation Rate, Gross Merchandise Value (GMV).
●
Dimensions: Dim_Date, Dim_Customer, Dim_Restaurant,
Dim_City.
Fact_Delivery_Performance
●
KPIs: Average Delivery Time, On-time Arrival
Rate, Driver Utilization Rate, Total Distance Traveled.
●
Dimensions: Dim_Date, Dim_Driver, Dim_Restaurant,
Dim_City.
Fact_Restaurant_Revenue
●
KPIs: Total Commission Earned, Total Sales
Volume, Top Performing Categories, Average Item Price.
●
Dimensions: Dim_Date, Dim_Restaurant, Dim_Category.
Fact_Customer_Activity
●
KPIs: Customer Retention Rate, New User Signups,
Average Orders per Customer, Lifetime Value (CLV).
●
Dimensions: Dim_Date, Dim_Customer, Dim_City.
6. KPI Definitions
●
Order
Volume: Total number of successful
orders placed.
●
Average
Order Value (AOV): Total
sales revenue divided by the number of orders.
●
Order
Cancellation Rate:
Percentage of orders cancelled by users or restaurants.
●
Average
Delivery Time: Mean
duration from order confirmation to delivery completion.
●
On-time
Arrival Rate:
Percentage of orders delivered within the estimated window.
●
Driver
Utilization Rate:
Percentage of time drivers spend on active delivery tasks.
●
Gross
Merchandise Value (GMV): The
total value of food sold through the platform.
●
Commission
Revenue: Total income generated through
restaurant commissions and service fees.
●
Customer
Retention Rate:
Percentage of customers who repeat orders within a period.
●
Customer
Lifetime Value (CLV): Total
predicted revenue attributed to a customer relationship.
--
\\ 5.
Casella waste management project Data model
Waste Management Project Data Model Specification
1. About the
Waste Management Project
The core objective was to build a system
capable of managing waste collection logistics—connecting
residential/commercial generators, fleet operations, and disposal facilities.
The primary challenges included optimizing collection routes, tracking container
fill levels via IoT sensors, and ensuring regulatory compliance for hazardous
and municipal waste disposal.
Design
Approach
●
Mapping the Lifecycle: We identified key touchpoints: Service Contract Initiation, Route
Planning, Collection/Pickup, Disposal/Recycling Processing, and Invoicing.
●
Normalization (OLTP): Data is organized into 3rd Normal Form (3NF) to ensure consistency
between customer accounts, service locations, and collection history,
preventing billing errors and missed pickups.
●
Handling Complex Relationships: We utilized "bridge" tables (e.g., Service_Contracts, Fleet_Assignments) to manage many-to-many relationships, storing detailed attributes
like waste types, container capacities, and pickup frequencies.
●
Business Rules & Integrity: Constraints such as collection_schedule_validation and disposal_compliance_check ensure that waste is processed according to environmental regulations
and contract terms.
2. The Waste
Collection Lifecycle
- Contract Management: Customers
(residential/commercial) sign up for service, defining waste types and
pickup frequency.
- Route Planning: The system uses
geographic data to assign specific bins/locations to fleet routes for
optimal fuel and time usage.
- Collection: Drivers follow assigned
routes, logging pickup confirmations and recording container status (e.g.,
damaged/full).
- Disposal/Processing: Waste is transported
to designated transfer stations, landfills, or recycling centers where
weight and type are verified.
- Fleet Maintenance: IoT and maintenance
logs track truck health, mileage, and driver behavior.
- Financial Settlement: The Billing stage aggregates pickups, tonnage fees, and service charges into
customer invoices.
3. Data Model
Entities
3.1
OLTP Tables List (Operational Database)
●
Customers: Service profiles, billing info, location coordinates.
●
Service_Contracts: Frequency, waste type (Recycling/Compost/Trash), container info.
●
Routes: Route segments, scheduled pickup days, assigned zones.
●
Fleet_Vehicles: Asset info, mileage, maintenance status, fuel logs.
●
Drivers: Profiles, shift schedules, certification levels.
●
Pickups: Real-time collection status, timestamps, geolocation.
●
Disposal_Logs: Destination site, tonnage/weight records, waste classification.
●
Billing_Invoices: Aggregated charges based on tonnage and contract terms.
3.2
OLAP Tables List (Data Warehouse)
Fact Tables (Metrics)
●
Fact_Pickups: Volume collected, service completion rates, missed pickup logs.
●
Fact_Fleet_Operations: Fuel consumption, route efficiency, truck utilization.
●
Fact_Disposal_Revenue: Revenue by material type, facility costs, service margins.
Dimension Tables (Filters)
●
Dim_Date: Temporal analysis (Date, Shift, Route Day).
●
Dim_Customer: Demographic/Business type, service tier.
●
Dim_Route: Geographic zone, route ID.
●
Dim_Vehicle: Vehicle type, age, maintenance history.
4. KPIs and
Analytics
|
Fact
Table |
Key
Performance Indicators (KPIs) |
|
Fact_Pickups |
Pickup Completion Rate, Missed Pickup
Rate, Total Weight Collected |
|
Fact_Fleet_Operations |
Cost per Route, Fuel Efficiency (MPG),
Vehicle Downtime |
|
Fact_Disposal_Revenue |
Revenue per Ton, Recycling Diversion
Rate, Outstanding Receivables |
5. KPIS and
FACTS
Fact_Pickups
●
KPIs:
Count of Pickups, Missed Pickup Rate, Total Tonnage Collected, Service Time.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Route, Dim_Location.
Fact_Fleet_Operations
●
KPIs:
Fuel Efficiency, Route Deviation Rate, Total Distance Traveled, Vehicle
Utilization.
●
Dimensions: Dim_Date, Dim_Vehicle, Dim_Driver, Dim_Route.
Fact_Disposal_Revenue
●
KPIs:
Total Revenue, Cost per Ton, Recycling Diversion Rate, Service Fee Margin.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Material_Type.
Fact_Customer_Activity
●
KPIs:
Contract Renewal Rate, Service Volume Growth, Customer Churn, Outstanding
Balance.
●
Dimensions: Dim_Date, Dim_Customer, Dim_Service_Type.
6. KPI
Definitions
●
Pickup Completion Rate: Percentage of scheduled stops successfully serviced.
●
Missed Pickup Rate: Percentage of stops skipped or reported as missed.
●
Total Tonnage Collected: Aggregate weight of waste collected across all services.
●
Fuel Efficiency: Average miles per gallon (MPG) for the collection fleet.
●
Route Deviation Rate: Frequency at which drivers deviate from optimized paths.
●
Recycling Diversion Rate: Percentage of waste diverted from landfills to recycling/compost.
●
Cost per Route: Operational cost (fuel, labor, maintenance) per scheduled route.
●
Contract Renewal Rate: Percentage of customers renewing their service contracts.
●
Service Volume Growth: Percentage increase in tonnage or new customers month-over-month.
●
Revenue per Ton: Total revenue divided by the total weight of material processe
Problems /challenges you faced in Data
modeling
Hospital project
1. Lack of Relational Mapping
A data
model is not just a list of nouns; it is a map of how data interacts. Your list
does not define relationships (e.g., One-to-Many or Many-to-Many).
●
Missing Links: You
haven't defined how a Patient relates to an Appointment, or how a Doctor
is assigned to a Department. Without these,
you cannot query "Which doctor saw this patient?"
●
Ambiguity: Does an Invoice belong to a Patient,
an Appointment, or a Prescription?
2. Failure to Normalize (Redundancy)
Without a
relational structure, you will face data redundancy issues.
●
Hospital Name: Storing Hospital_Name as a standalone entity is usually
unnecessary unless you are managing a chain of hospitals. If it's a single
hospital, this is redundant data that should be a global configuration setting.
●
Duplicate Data: If you
don't use foreign keys, you might end up typing a Patient_Name
or Doctor_ID multiple times across Appointment, Prescription,
and Invoice tables, leading to data
inconsistencies.
3. Missing Critical Entities
Your model
is missing "bridge" entities required to handle complex hospital
workflows:
●
Medical Record / Visit: A patient
may visit multiple times. You need a way to track specific episodes of care
rather than just linking a patient directly to a prescription.
●
Specialization: Doctors
belong to departments, but they also have specific qualifications.
●
Treatment / Procedure: Prescription covers medication, but Invoice often needs to account for medical
procedures, tests, or bed charges.
4. Naming and Integrity Issues
●
Typos: Involce
(should be Invoice).
●
Granularity: Staff is too broad. You need to distinguish
between administrative staff, nurses, and doctors, as they have different
roles, access levels, and salary structures.
●
Lack of Primary Keys: Each
entity needs a unique identifier (e.g., Patient_ID,
Doctor_ID) to ensure data integrity.
Recommended Entity-Relationship Structure
To improve
this model, you should transition to a relational design.
|
Entity |
Primary
Key |
Key
Attributes |
|
Department |
Dept_ID |
Name,
Location, Head_Doctor |
|
Doctor |
Doctor_ID |
Name,
Specialization, Dept_ID (FK) |
|
Patient |
Patient_ID |
Name,
DOB, Contact, Insurance_Info |
|
Appointment |
Appt_ID |
Patient_ID
(FK), Doctor_ID (FK), Date, Time |
|
Prescription |
Presc_ID |
Appt_ID
(FK), Medication_Details, Dosage |
Pharma challenges
1.
Challenge: Inconsistent definition of "Patient"
- (e.g., whether a patient is
identified by phone number, name, or insurance ID). We identified this in
the initial data modeling workshop when different business units
gave conflicting counts for total customer base.
- Resolution: We established a
"Golden Record" standard where Patient_ID is strictly mapped to a unique government-issued health
identification number or insurance primary key, with all other fields
(name, phone) treated as descriptive attributes.
3.
Challenge: Regulatory compliance gaps for prescription history
a.
(e.g., no clear requirement on how
long to store the Doctor authorization). We identified this
in the compliance review stage when auditing the Prescriptions table against local pharmacy laws. Resolution: We added a Retention_Expiry_Date attribute and a soft-delete/archive flag to the Prescriptions table, ensuring records are retained per mandatory audit cycles
without cluttering the active transactional view.
4.
Challenge: Supplier lead-time volatility
a.
(the Lead_Time_Days field in Suppliers is a static estimate, but reality varies). We identified this in the Inventory
forecasting session when comparing manual ledger records to the Inventory table performance.
b.
Resolution: We modified the model to
include a Supplier_Performance_Log table that captures historical delivery times for every replenishment,
replacing static estimates with a rolling average calculation for more accurate
inventory planning.
5.
Challenge: Product classification ambiguity
a.
(many drugs have brand names and chemical
names that appear as different entities). We identified this in the catalog
validation phase when populating the Products table
b.
. Resolution: We
implemented a mapping layer in the Products table where Chemical_Formula acts as the master-key (or generic identifier), while Name
handles the brand aliases, ensuring we can aggregate sales by drug efficacy
regardless of the brand name used.
6.
Challenge: Partial payment and refund handling
a.
(the Payments table was designed for a single
lump sum per order). We identified this in the finance workflow walkthrough
when discussing returns and insurance co-pay splits.
b.
Resolution: We refactored the Payments table from a 1:1 relationship with
Orders to a 1:N structure, allowing multiple entries per Order_ID
to track initial payments, partial insurance settlements, and subsequent
customer refunds independently.
Food delivery project
1. The "Infinite
Customization" Modifier Problem
Challenge:
Supporting unlimited combinations of menu modifiers (e.g., a pizza with
"extra cheese" on the left half, "no onions," and
"gluten-free crust") without creating thousands of redundant menu
rows.
Difficulty: In a
standard relational model, creating columns for every possible topping leads to
"Sparse Data" (mostly empty columns) and is impossible to maintain as
restaurants change their menus weekly. Hard-coding these options makes the
database rigid and slow.
Solution: The
Recursive Associative Bridge. We implemented a hierarchical modifier_group
table. Instead of linking a pizza to "Pepperoni," we link the pizza
to a "Toppings Group." That group contains many modifier_items. This
allows for "nesting"—where choosing a "Burger" triggers a
"Protein Group," and choosing "Steak" within that group
triggers a "Doneness Group." This metadata-driven approach allows
restaurants to update options in real-time without a single schema change.
2. Bi-Temporal Menu Versioning
(The "Price Lock" Paradox)
Challenge:
Ensuring that historical financial reports remain accurate even when a
restaurant changes its prices or deletes a menu item.
Difficulty: If you
simply update the price column in the menu_item table, every historical order
linked to that ID will suddenly show the new price in your reports, breaking
your accounting. However, creating a brand-new ID for every price change causes
"Data Fragmentation," making it impossible to track a dish's
popularity over a year.
Solution: Bi-Temporal
Effective Dating. We added valid_from_timestamp and valid_to_timestamp
columns to the pricing and menu tables. When a price change occurs, the system
doesn't "Update" the row; it "Closes" the current row by
setting the valid_to date and inserts a new row with the same business_key.
Queries then use a WHERE order_timestamp BETWEEN valid_from AND valid_to clause
to fetch the exact price that existed at the second the order was placed.
3. High-Concurrency Geofence
Matching
Challenge:
Assigning the "best" delivery partner to an order in real-time based
on moving GPS coordinates and complex delivery zones.
Difficulty:
Standard SQL "Distance Calculations" (Pythagorean theorem) are
mathematically expensive. Running these calculations for 500 orders against
2,000 moving drivers every few seconds will crash a standard database CPU.
Solution: Spatial
Partitioning with GiST Indexing. We moved the logistics data into a
dedicated schema using PostGIS. Instead of calculating raw distance, we
implemented Geofencing. Each driver is mapped to a current_zone_id (a
polygon). When an order is placed, the system first filters for drivers within
that specific polygon using a Generalized Search Tree (GiST) index. This
reduces the search pool from thousands to dozens instantly, allowing the final
distance math to run only on a tiny subset of candidates.
4. Multi-Party Split
Settlement (The "Double-Entry" Ledger)
Challenge:
Automatically splitting a single customer payment into three distinct payouts:
the Restaurant’s earnings, the Driver’s fee/tip, and the Platform’s commission.
Difficulty: A
single transaction can have multiple "Sub-Statuses." For example, the
customer is charged immediately, but the driver is only paid after delivery,
and the restaurant might be paid weekly. Tracking this in a single
"Orders" table creates a "Financial Reconciliation
Nightmare" where cents go missing during refunds or cancellations.
Solution: The
Atomic Settlement Ledger. We decoupled "Orders" from
"Money" by building a Double-Entry Bookkeeping Ledger. Every
order triggers a set of balanced entries: a "Debit" to the Customer
and "Credits" to three separate internal "Holding Accounts"
(Restaurant_Wallet, Driver_Wallet, Platform_Revenue). This ensures the sum of
all payouts always equals the total collected. If a refund occurs, the system
moves
1. Real-Time Security &
Compliance
●
Requirement: Automated tracking of all administrative overrides and suspicious
activities.
●
Problem: Manual logging is prone to gaps, leaving the Audit and Frauddetection
tables incomplete.
●
Solution: Implemented database-level triggers to capture immutable logs for
every state change automatically.
2. Customer 360-Degree
Integration
●
Requirement: Consolidating a single Customer identity across retail, loans,
and wealth management.
●
Problem: Data silos made it difficult to link an Account to an Investment
or Insurance policy.
●
Solution: Established a master CustomerID as a global Foreign Key across
all 15 tables for unified reporting.
3. Financial Precision &
Multi-Currency
●
Requirement: Handling global transactions via Forex while maintaining
sub-cent accuracy.
●
Problem: Standard float data types caused rounding errors in Transaction
and Creditscore calculations.
●
Solution: Enforced strict DECIMAL(15,2) for balances and DECIMAL(10,6) for
exchange rates to ensure 100% accuracy.
4. Amortization & Payment
Integrity
●
Requirement: Precise tracking of debt repayment schedules in the Loanpayment
table.
●
Problem: Difficulty accurately splitting payments between principal, interest,
and penalties over time.
●
Solution: Architected a relational link between Loan and Loanpayment
with calculated fields to automate balance updates.
Issue: Data inconsistency in
customer profiles (duplicates).
Stage Identified:
During the initial data profiling and join operations between Customer_Master
and Policy_Master. Resolution: We implemented an MDM (Master Data
Management) layer with fuzzy matching logic to link multiple records
representing the same individual into a single "Golden Customer ID."
Issue: Conflicts between
Policy status and Payment history.
Stage Identified:
During the validation phase of the "Active Policies" financial
reporting dashboard. Resolution: We created a "Derived Status"
field in the transformation layer that cross-references both tables to flag
policies as "Suspended" if premiums are overdue, regardless of the
system-level status.
Issue: Late-arriving fact
records (claims ingested after the reporting window).
Stage
Identified: During daily reconciliation when current-week metrics didn't
match the transactional logs. Resolution: We updated the ETL pipeline to
include a "Look-back Window," where the system automatically
re-processes and updates the last 7 days of data with every daily load.
Issue: Referential integrity
failure (orphaned payments due to out-of-order ingestion).
Stage
Identified: During the automated staging-to-warehouse load process. Resolution:
We built a "Holding Table" for orphaned records, which temporarily
stores transactions until the parent policy record arrives, at which point the
ETL triggers a retry to link them.
Issue: Historical schema drift
(change in data structure for Underwriting_Rules).
Stage Identified:
When the automated daily ETL pipeline began failing due to unexpected JSON
format errors. Resolution: We moved to an ELT (Extract, Load, Transform)
approach, where raw data is dumped into a flexible "Staging/Landing
Zone" first, allowing us to parse the complex JSON logic after the
raw data is safely ingested.
1. Lack of Relational Mapping
A data model is not just a list of nouns; it
is a map of how data interacts. Your list does not define relationships (e.g.,
One-to-Many or Many-to-Many).
●
Missing Links: You haven't defined how a Patient relates to an Appointment, or how a
Doctor is assigned to a Department. Without these, you cannot query "Which
doctor saw this patient?"
●
Ambiguity: Does an Invoice belong to a Patient, an Appointment, or a
Prescription?
2. Failure to Normalize
(Redundancy)
Without a relational structure, you will face
data redundancy issues.
●
Hospital Name: Storing Hospital_Name as a standalone entity is usually unnecessary
unless you are managing a chain of hospitals. If it's a single hospital, this
is redundant data that should be a global configuration setting.
●
Duplicate Data: If you don't use foreign keys, you might end up typing a Patient_Name
or Doctor_ID multiple times across Appointment, Prescription, and Invoice
tables, leading to data inconsistencies.
3. Missing Critical Entities
Your model is missing "bridge"
entities required to handle complex hospital workflows:
●
Medical Record / Visit: A patient may visit multiple times. You need a way to track specific
episodes of care rather than just linking a patient directly to a prescription.
●
Specialization: Doctors belong to departments, but they also have specific
qualifications.
●
Treatment / Procedure: Prescription covers medication, but Invoice often needs to account for
medical procedures, tests, or bed charges.
4. Naming and Integrity Issues
●
Typos:
Involce (should be Invoice).
●
Granularity: Staff is too broad. You need to distinguish between administrative
staff, nurses, and doctors, as they have different roles, access levels, and
salary structures.
●
Lack of Primary Keys: Each entity needs a unique identifier (e.g., Patient_ID, Doctor_ID) to
ensure data integrity.
Recommended Entity-Relationship Structure
To improve this model, you should transition
to a relational design.
|
Entity |
Primary Key |
Key Attributes |
|
Department |
Dept_ID |
Name, Location, Head_Doctor |
|
Doctor |
Doctor_ID |
Name, Specialization, Dept_ID (FK) |
|
Patient |
Patient_ID |
Name, DOB, Contact, Insurance_Info |
|
Appointment |
Appt_ID |
Patient_ID (FK), Doctor_ID (FK), Date, Time |
|
Prescription |
Presc_ID |
Appt_ID (FK), Medication_Details, Dosage |
Banking project
1.
Real-Time Security & Compliance
●
Requirement: Automated tracking of all administrative overrides and
suspicious activities.
●
Problem:
Manual logging is prone to gaps, leaving the Audit and Frauddetection
tables incomplete.
●
Solution:
Implemented database-level triggers to capture immutable logs for every state
change automatically.
2.
Customer 360-Degree Integration
●
Requirement: Consolidating a single Customer identity across retail,
loans, and wealth management.
●
Problem:
Data silos made it difficult to link an Account to an Investment
or Insurance policy.
●
Solution:
Established a master CustomerID as a global Foreign Key across all 15
tables for unified reporting.
3.
Financial Precision & Multi-Currency
●
Requirement: Handling global transactions via Forex while maintaining
sub-cent accuracy.
●
Problem:
Standard float data types caused rounding errors in Transaction and Creditscore
calculations.
●
Solution:
Enforced strict DECIMAL(15,2) for balances and DECIMAL(10,6) for exchange rates to ensure 100% accuracy.
4.
Amortization & Payment Integrity
●
Requirement: Precise tracking of debt repayment schedules in the Loanpayment
table.
●
Problem:
Difficulty accurately splitting payments between principal, interest, and
penalties over time.
●
Solution:
Architected a relational link between Loan and Loanpayment with
calculated fields to automate balance updates.
-----------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------
Ralph,Kimball : says
build the data marts first , then build the data warehouse by joining them into
single data warehouse , This is called bottom up approach
Inmon : says build one single data warehouse for
the entire enterprise , then divide the warehouse into Datamarts , This is
called Top down approach .
|
|
|
|
|
|
|
|
|
|
|
sno |
Question |
Answer |
Links |
|
1 |
Types of Fact tables |
1. Transactional Fact Tables Transactional fact tables are designed to capture
individual business events or transactions. 2. Periodic Snapshot Fact Tables Periodic snapshot fact tables provide a summarized view
of metrics over regular time intervals. 3. Accumulating Snapshot Fact Tables Accumulating snapshot fact tables are designed to track
the stages of a business process or workflow. 4. Factless facts |
https://easy-data-modeling.blogspot.com/2024/03/types-of-fact-tables.html |
|
45 |
types of fact tables 51 |
1. Additive ( can be added
like total qty , sales ) 2. Semi additive ( cannot
be added but updated like account balance )
3. Non Additive ( numerical
but cannot be added like today price
of scotk ) |
|
|
2 |
Granularity and Granule |
Granularity in data refers to the level
of detail >
high level of granularity would have a large number of individual
pieces (columns) of information, such as individual records or measurements. >Data that has a low level of
granularity would have a small number of individual pieces of information,
such as summary data or aggregated data. Data granularity can affect how it
is used and analyzed, and can impact the accuracy and usefulness of the
results. |
|
|
3 |
what are 3 levels of Data modeling |
1. Conceptual model list of entities and relationships , no attributes
diagram only 2. logical 3. physical model |
|
|
4 |
|
|
|
|
5 |
|
|
|
|
6 |
|
|
|
|
7 |
|
|
|
|
8 |
|
|
|
|
9 |
|
|
|
|
|
|
|
|
|
10 |
|
|
|
|
11 |
|
|
|
|
12 |
|
|
|
|
13 |
|
|
|
|
14 |
|
|
|
|
14.1 |
|
|
|
|
15 |
|
|
|
|
16 |
|
|
|
|
17 |
|
|
|
|
18 |
|
|
|
|
19 |
|
|
|
|
20 |
|
|
|
|
21 |
|
|
|
|
22 |
Fact and dimension table - differences |
. Fact tables contain numerical data, while dimension
tables provide context and background information. |
https://www.simplilearn.com/fact-table-vs-dimension-table-article#what_is_a_fact_table |
|
23 |
What is Datawarehouse |
A data warehouse is a centralized system that stores
data from various sources for reporting and analysis Data warehouses can: Consolidate data: Store data from multiple sources into
a single source of truth Store historical data: Store and analyze long-term
historical data Analyze data about a particular subject or functional
area: |
|
|
24 |
|
|
|
|
25 |
|
|
|
|
|
|
|
|
|
28 |
|
|
|
|
|
|
|
|
|
30 |
|
●
|
|
No comments:
Post a Comment