Data modeling excel



----------------------------------------------------------------------------

## Data Modeling Best Practices – Headings

------------------------------------------------------------------------------


1. Start with Business Requirements (Not Tables)

2. Define Grain First (Critical Rule)

3. Choose the Right Model Type

4. Use Clear Naming Conventions

5. Design Clean Dimension Tables

6. Keep Fact Tables Lean

7. Enforce Data Integrity

8. Handle Dates Properly

9. Optimize for Query Performance

10. Plan for Data Growth

11. Document Everything

12. Validate with Real Data

13. Version Control Your Models

14. Avoid Over-Normalization in Warehouses

15. Design for BI & Reporting Tools


-----------------------------------------------------------------------------------------------------------------------------
Common mistakes in data modeling include:

neglecting stakeholder input,
inconsistent naming conventions,
failing to adequately test and validate the model,
not considering data quality,
ignoring business requirements,
and overlooking future scalability needs.

 

For each mistake below: why it's a mistake, and how to avoid it.

1. Ignoring Stakeholder Input:

Without input from users and business analysts, the model may not accurately reflect real-world requirements or user needs, leading to a model that is difficult to use or maintain.

Involve stakeholders throughout the data modeling process, gathering feedback on data structures, relationships, and naming conventions. 

2. Inconsistent Naming Conventions:

Inconsistent naming makes the model confusing and difficult to understand, leading to errors and wasted time.

Establish clear naming conventions for entities, attributes, and relationships, and adhere to them consistently throughout the model. 

3. Failing to Test and Validate:

A model that isn't thoroughly tested and validated will likely contain errors that can lead to inaccurate reporting, poor decision-making, and other problems.

Implement a testing strategy that includes unit tests, integration tests, and user acceptance tests. 

4. Neglecting Data Quality:

Poor data quality can lead to inaccurate results, unreliable reports, and wasted resources.

Implement data quality checks to ensure data is accurate, complete, consistent, and up-to-date. 

5. Ignoring Business Requirements:

If the data model doesn't meet the business requirements, it will be difficult to use effectively.

Thoroughly understand the business requirements before starting the data modeling process. 

6. Overlooking Scalability:

A model that isn't designed with scalability in mind may not be able to handle future growth in data volume or user load.

Consider the future needs of the application when designing the data model. 

7. Inadequate Normalization:

Too much or too little normalization can lead to performance issues or data redundancy.

Achieve a balance between normalization and denormalization to optimize for both data integrity and query performance. 

8. Starting Too Late:

Delaying data modeling can lead to rework later in the project.

Start the data modeling process early in the project lifecycle.

--------------------------------------------------------------------------------------------------------------

VARCHAR and NVARCHAR difference

In database contexts, VARCHAR and NVARCHAR are data types used to store variable-length text strings. The key difference lies in their encoding: VARCHAR uses single-byte character sets (like ASCII), while NVARCHAR uses Unicode, allowing a wider range of characters, including those from multiple languages. NVARCHAR typically requires more storage space (roughly two bytes per character) and has a shorter maximum character length than VARCHAR.
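A minimal SQL Server-style sketch (the table and column names here are made up for illustration); the N'' prefix marks a Unicode literal:

CREATE TABLE demo_names (
    name_ascii    VARCHAR(50),   -- single-byte character set (database code page)
    name_unicode  NVARCHAR(50)   -- Unicode, roughly two bytes per character
);

-- Without the N prefix, non-Latin characters may be silently converted to '?'
-- when stored in a VARCHAR column; with N'...' they are stored as Unicode.
INSERT INTO demo_names (name_ascii, name_unicode)
VALUES ('Muller', N'Müller');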

---------------------------------------------------------------------------------------------------------------
Data governance – my notes (PDF)

Ralph Kimball: build the data marts first, then integrate them into a single data warehouse. This is called the bottom-up approach.

Inmon: build one single data warehouse for the entire enterprise, then divide the warehouse into data marts. This is called the top-down approach.



Q1. Types of fact tables

1. Transactional Fact Tables
Transactional fact tables are designed to capture individual business events or transactions.

2. Periodic Snapshot Fact Tables
Periodic snapshot fact tables provide a summarized view of metrics over regular time intervals.

3. Accumulating Snapshot Fact Tables
Accumulating snapshot fact tables are designed to track the stages of a business process or workflow.

4. Factless facts
https://easy-data-modeling.blogspot.com/2024/03/types-of-fact-tables.html
Q45. Types of facts (measures)

1. Additive (can be summed across all dimensions, e.g. total qty, sales)
2. Semi-additive (can be summed across some dimensions but not across time, e.g. account balance)
3. Non-additive (numeric but cannot be summed, e.g. today's stock price)
Q2. Granularity and grain

Granularity refers to the level of detail in the data.

> Data with a high level of granularity has a large number of individual pieces of information, such as individual records or measurements.

> Data with a low level of granularity has a small number of individual pieces of information, such as summary or aggregated data.

Granularity affects how the data can be used and analyzed, and impacts the accuracy and usefulness of the results.
Q3. What are the 3 levels of data modeling?

1. Conceptual model: entities and relationships only, no attributes (diagram only)
2. Logical model
3. Physical model

Q4. Logical and physical model – differences?

Q5. Star schema & snowflake schema – differences?

Q6. Types of dimension tables

1) SCD (slowly changing)
2) RCD (rapidly changing)
3) Junk
4) Inferred
5) Deferred
6) Conformed (shared by multiple fact tables)
7) Degenerate (e.g. invoice number)
8) Role-playing (same meaning, multiple names)
9) Static
10) Dynamic
11) Sparse
12) Dense
13) Shrunken

 


1. Role-playing dimension
Role-playing dimensions are dimensions that are used more than once in a fact table, each time with a different meaning or role.
When we want two attributes of the source entity to connect to the same destination entity, a single relationship is not enough. In such cases we connect both columns of one table to the second table, each under a new role name; this is called a role-playing dimension. For example, connecting the address and customer tables with two columns: bill-to address and ship-to address.

2. Conformed dimension
Conformed dimensions are dimensions that are shared by multiple stars (multiple fact tables).

3. Degenerate dimension
A key column that sits in the fact table without an associated dimension table, but is still used for reporting and analysis, is called a degenerate dimension.
For example, in a fact table with customer_id, product_id, branch_id, employee_id, bill_no, the bill_no (invoice number) is a degenerate dimension.

4. Junk dimension
A junk dimension combines several low-cardinality flags and indicators into one dimension table that covers all possible combinations of values that might occur in the fact table.


5. Static dimensions
Static dimension values do not change with time; they are fixed components such as time, date, country codes, status codes, etc.

6. Dynamic dimension
A dimension whose values change dynamically, based on values and formulas.

7. Deferred dimensions
Sometimes the fact table has to be loaded with data (such as a customer id) whose row has not yet been loaded into the customer dimension. In such cases we delay inserting those rows into the fact table; later, once the customer dimension has been loaded with that customer id, we add the rows to the fact table.
Delaying the fact load in this way is called a deferred dimension.

8. Inferred dimensions
When the fact data is ready to be loaded from the source but the related dimension table does not yet have the corresponding row
(e.g. the customer id is in the sales orders table but not in the customer table), we add a placeholder entry to the customer table and enable the 'inferred dimension' flag. Later, when we make the real entry in the customer table, we disable the flag.

9. Slowly changing dimension (SCD)
An SCD is a dimension that contains relatively static data which can change slowly but unpredictably, rather than on a regular schedule.
   
SCD types:
Type 0 – ignore any changes (retain the original) and audit the changes.
Type 1 – overwrite the changes (no history kept).
Type 2 – history is added as a new row (see the sketch below).
Type 3 – history is added as a new column.
Type 4 – history is kept in a separate (new) history dimension table.
Type 6 – hybrid of Types 1, 2 and 3.
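A minimal SCD Type 2 sketch, assuming a hypothetical dim_customer table with a business key (customer_id), effective dates and a current-row flag (date syntax varies by database):

-- 1. Close out the existing current row for the changed customer
UPDATE dim_customer
SET    eff_end_date = CURRENT_DATE,
       is_current   = 'N'
WHERE  customer_id  = 101
AND    is_current   = 'Y';

-- 2. Insert the changed attributes as a brand-new row, so history is preserved
INSERT INTO dim_customer (customer_id, city, eff_start_date, eff_end_date, is_current)
VALUES (101, 'Chicago', CURRENT_DATE, DATE '9999-12-31', 'Y');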


 
10. Rapidly changing dimension (RCD)
A rapidly changing dimension is a dimension with attributes whose values change very often.

11. Sparse dimension
Some dimension tables are not loaded with full values; most of the columns are empty. Such dimensions are called sparse dimensions.

12. Dense dimension
When most of the columns of a dimension table are fully loaded, it is called a dense dimension.

13. Shrunken dimension
A shrunken dimension is a replica of the original dimension, but with fewer columns than the original table.
Q7. What is a factless fact table?
A factless fact table contains only foreign keys to dimensions, and no numeric facts. Factless fact tables are used to capture events or situations that have no measurable outcome.
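A minimal sketch of a factless fact table (the student-attendance tables here are hypothetical): only foreign keys, no measures, and counting rows answers the business question.

CREATE TABLE fact_attendance (
    date_key    INT NOT NULL,   -- FK to dim_date
    student_key INT NOT NULL,   -- FK to dim_student
    class_key   INT NOT NULL,   -- FK to dim_class
    PRIMARY KEY (date_key, student_key, class_key)
);

-- "How many students attended each class?" is just a COUNT(*) over the event rows
SELECT class_key, COUNT(*) AS attendance_count
FROM   fact_attendance
GROUP BY class_key;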
Q8. What is a data mart?
A data mart is a subset of a data warehouse, purpose-built to support the analytical needs of a particular department, line of business, or geographic region.
Q9. Steps in the data warehouse life cycle
Requirements gathering
Requirements analysis
High level design (HLD)
Low level design (LLD)
Development – Automating scripts
Testing
User Acceptance Testing (UAT)
Project release or deployment
Warranty support
Q10. Erwin questions
Q11. Erwin – Data Modeler workspace
The workspace is the area where the tables are created and connected.
Q12. Erwin – Domains
A domain is like a custom data type: an independent model object that you can use to quickly assign properties to an attribute or column.
Q13. Erwin – types of relationships
Identifying relationship.
Non-identifying relationship.
Subtype relationship.
Many-to-many relationship.
Recursive relationship.
Q14. Erwin – which Erwin features did you use?
1. Name hardening
2. Domains for data types
3. Naming standards (abbreviations)
4. Fwd & reverse engineering
5. Merge data models
6. complete compare
7. sub type and super type
Q14.1. Erwin – which feature is used to merge the subtype tables into the supertype?
Q15. Erwin – forward and reverse engineering steps
Forward engineering steps:
Actions > Schema > Generate scripts

Reverse engineering steps:
Actions > Reverse Engineer > Logical/Physical >
Database (or) script file > Connect > OK
Q16. Erwin – Complete Compare feature
Complete Compare is a powerful tool that lets you view and resolve the differences between two models, or between a model and a database or script file.

Actions > Complete Compare wizard > select source and target models
> Compare > Finish > resolve differences
Q17. Erwin – Naming standards
Use the Naming Standard Editor to create naming standards and develop a glossary that contains glossary words, their abbreviations, and alternate abbreviations.

Access the Model Explorer > open the Naming Standard Editor > define naming standards for logical and physical models > create abbreviations
Q18. Erwin – Name hardening
Name hardening is a mechanism in Erwin Data Modeler that prevents physical model names from changing when logical names change.
Q19. Erwin – querying tool
Q20. Erwin – subject areas
Click Model > Subject Areas. The Subject Area Editor opens. Click New to create a new subject area.
Q21. Erwin – identifying and non-identifying relationships: differences
Identifying: represented by a solid line; the primary key of the parent migrates to the primary key area of the child entity or table. E.g. book & chapter tables, country and state tables.

Non-identifying: represented by a dashed line; the child entity is not dependent on the parent entity for its identity and can exist without it. E.g. supplier & product list.
Q22. Fact and dimension table – differences
Fact tables contain numerical data (measures), while dimension tables provide context and background information.
Q23. What is a data warehouse?
A data warehouse is a centralized system that stores data from various sources for reporting and analysis.

Data warehouses can:
Consolidate data: Store data from multiple sources into a single source of truth
Store historical data: Store and analyze long-term historical data
Analyze data about a particular subject or functional area.
Q24. What is an ODS (operational data store)?
An operational data store usually stores and processes data in real time. An ODS is connected to multiple data sources and pulls data into a central location.

An ODS is a temporary data store for volatile data from all sources (like resumes for interviews, quotations from suppliers, etc.) which need not be stored permanently.



Q25. OLTP & OLAP – differences
https://easy-data-modeling.blogspot.com/2024/03/olap-oltp-diff.html
 
 
Q30. What is the most likely reason a SQL query may run slowly?

Possible reasons for slow SQL queries:

  1. Lack of indexes
  2. Complex joins
  3. Large data volume
  4. Inefficient query structure
  5. Database configuration: settings such as memory allocation, cache size, and connection limits can affect performance.
  6. Server load: high server load due to concurrent queries or insufficient resources (CPU, memory, I/O) can slow down execution times.
  7. Network latency: if the database is remote, network latency adds to the time it takes to execute a query.
  8. Outdated statistics: database optimizers rely on statistics to create execution plans; stale statistics lead to poor plans.
  9. Locking and blocking: if a query is waiting for locks held by other transactions, execution slows down.

Types of joins and their impact on execution time

  1. INNER JOIN: generally efficient if both tables have indexes on the join columns.
  2. LEFT JOIN (or RIGHT JOIN): can be slower than an INNER JOIN because all records from the outer side must be returned, matched or not.
  3. FULL OUTER JOIN: typically the slowest join type, as it combines the results of both left and right outer joins.
  4. CROSS JOIN: produces a Cartesian product; it is usually slower and should be used with caution.
  5. SELF JOIN: not inherently slow, but performance depends on indexing and the volume of data being joined.

Optimization tips (see the sketch below)

  • Indexing: create indexes on columns used in WHERE, JOIN, and ORDER BY clauses.
  • Analyze execution plans: use the tools provided by your database (like EXPLAIN in PostgreSQL) to see how the query is executed.
  • Refactor queries: simplify complex queries and remove unnecessary work.
  • Monitor server performance: watch CPU, memory, and I/O usage.
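A small PostgreSQL-style illustration (orders is an assumed table name): check the plan first, then add an index on the filtered column if the plan shows a sequential scan.

EXPLAIN ANALYZE
SELECT order_id, order_total
FROM   orders
WHERE  customer_id = 42;

-- If the plan shows a full/sequential scan on orders, an index on the filter
-- column usually turns it into an index scan:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);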
 
 
Q43. Data quality checks
1. CONSISTENCY
2. ACCURACY
3. COMPLETENESS
4. AUDITABILITY
5. ORDERLINESS
6. UNIQUENESS
7. TIMELINESS
Q44. Types of indexes
1. NORMAL INDEX
2. BITMAP INDEX
3. UNIQUE & NON UNIQUE INDEX
4. CLUSTERED & NON CLUSTERED
5. PARTITIONED & NON PARTITIONED
6. EXPRESSION BASED
7. B- TREE
 
Q47. Surrogate key
A surrogate key is a unique, sequential-number key column generated by the database, not provided by the client.
It is normally used in a history table, where the natural PK is duplicated, to identify each row uniquely.

==============================================
When to use a surrogate key
If you want to allow duplicates in order to keep a history of changes (SCD), you can add a surrogate key (a sequential number column) so that the highest value among the duplicates identifies the latest entry.
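A minimal sketch of adding a surrogate key to a history table (SQL Server IDENTITY syntax shown; other databases use sequences or auto-increment, and the table is illustrative):

CREATE TABLE customer_history (
    customer_sk  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, generated by the database
    customer_id  INT NOT NULL,                   -- natural/business key, repeated across versions
    city         VARCHAR(50),
    load_date    DATE
);

-- Latest version of each customer = the duplicate row with the highest surrogate key
SELECT ch.*
FROM   customer_history ch
WHERE  ch.customer_sk = (SELECT MAX(customer_sk)
                         FROM   customer_history
                         WHERE  customer_id = ch.customer_id);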
Q48. Normalization
1NF: no multiple values in a single cell (each cell is atomic).
2NF: in 1NF, and every non-key attribute depends on the whole key (no partial dependency).
3NF: in 2NF, and there is no transitive dependency (see the sketch below).
4NF: avoids multivalued dependencies, e.g. student id, courses, and subjects that are unrelated, multivalued, and kept in the same table.
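A small illustrative sketch of the 3NF step (tables made up for the example): dept_name depends on dept_id, not on emp_id, so the transitive dependency is moved to its own table.

-- Not in 3NF: emp(emp_id, emp_name, dept_id, dept_name)
-- dept_name depends on dept_id, which is not the key of emp (transitive dependency)

CREATE TABLE dept (
    dept_id   INT PRIMARY KEY,
    dept_name VARCHAR(50)
);

CREATE TABLE emp3nf (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id  INT REFERENCES dept (dept_id)
);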
Q49. SQL query: find the nth highest salary

Difference between ROW_NUMBER(), RANK() and DENSE_RANK() when values tie:

For salaries ordered descending = 20, 20, 20, 10

ROW_NUMBER(): 1, 2, 3, 4 – always a sequential number, even for ties.

RANK(): 1, 1, 1, 4 – tied values share the same rank, and the next rank skips ahead by the number of tied rows (3 rows tied at rank 1, so the next rank is 4).

DENSE_RANK(): 1, 1, 1, 2 – tied values share the same rank, and the next rank is simply the next sequential number.

=========================================

a) row_number():

-- To find the 2nd highest salary using ROW_NUMBER --

select empid, salary from (
  select e.*, row_number() over (order by salary desc) as row_num
  from emp e )
where row_num = 2;


======================
b) using Rank ():
 

-- To show only 2nd highest --

select * from

(select e.*,rank() over (order by salary desc) as rank

from emp e )

where rank = 2;


-- ** To display ranks ** --

select rank() over (order by salary desc) as rank,e.*

from emp e ;


======================


c ) Dense_Rank ()   is the  BEST 

-- To show only 2nd highest --

select * from

(select e.*,dense_rank() over (order by salary desc) as rank

from emp e )

where rank = 2;


-- ** To display ranks ** --

select dense_rank() over (order by salary desc) as rank,e.*

from emp e ;


======================

d) using count distinct
-- comparing against N-1 gives the Nth highest salary; 3 here returns the 4th highest --
select empno, sal from emp e1
where 3 = (select count(distinct sal)
           from emp e2
           where e2.sal > e1.sal);
Q50. SQL query: find and delete duplicate rows
Methods to find and delete duplicate rows
=================================
1. Delete all rows other than the one with the minimum id value (used when the id column is unique but the other column values are duplicated)
===========================================
Method 1.
Using the id column (if the ID column is unique, but title is duplicated)
================================
delete from films
where film_id not in (select min(film_id) from films group by title, rel_date);
===========================================
Method 2.
Using a CTE (SQL Server style; the CTE and the DELETE run as one statement)
============================================
with CTE as
(select id, name, email,
        row_number() over (partition by name, email order by name, email) as rn
 from sales)
delete from CTE where rn > 1;
===========================================
Method 3. Using ROWID (Oracle)
============================================
delete from table1
where rowid not in (select min(rowid) from table1
                    group by col1, col2);
Q52. What are inner join and outer join? Row counts for the data below:

A       B
1       1
1       1
2       2
3
null    null

Inner join = count of matching rows
Outer join = inner join + unmatched rows from the outer side

inner = 6 rows
left outer = 6 + 1 = 7 rows
Q53. Which tools are used for data profiling?
Informatica Data Profiling
Open-source data quality / profiling tools
Q54. Which tools are used for schema comparison?
1) SQL Server Data Tools (SSDT)

2) SQLDBDiff: a powerful and intuitive tool that compares the schemas of two SQL Server databases.
Q55. What are the ACID properties?
Atomicity – a transaction is treated as a single unit: either all of its statements are executed, or none of them are.

Consistency – integrity constraints must be maintained, so that the database is consistent both before and after the transaction.

Isolation -Changes occurring in a particular transaction will not be visible to any other transaction until that transaction has been committed

Durability - once the transaction has completed execution, the updates and modifications to the database are stored in and written to disk and are permanent
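A tiny transaction sketch showing atomicity (accounts is an assumed table; BEGIN/COMMIT syntax varies slightly by database): both updates succeed together or are rolled back together.

BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;

-- If anything fails before this point, ROLLBACK undoes both updates;
-- COMMIT makes both changes permanent (durable) as a single unit.
COMMIT;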
Q56. How do you test a data model?
A data model can be tested using the following:
Unit testing: test each component of the data model individually to ensure it performs as expected.
Integration testing: test how the different components of the data model work together, using test data.
Q51. What are the roles of a data modeler?

Get the data from the data source
identify and connect to the data source
pick a shared data set
select the storage mode
identify query performance issues
use or create a data flow

Profile the data
identify data anomalies
examine data structures

Clean, transform and load the data
resolve inconsistencies, unexpected values, null values and other data quality issues
apply value replacements
identify the keys for joins
identify and transform column data types
apply naming conventions to columns
perform data loading
resolve data input errors

Model the data
design the data model
define the tables
configure table and column properties
flatten the parent-child hierarchy
create role-playing dimensions
define relationships and cardinality
resolve many-to-many relationships
design a common date table
define the granularity of the data
Q52. What are the layers in a data lake?
1. Landing layer
2. curated layer
3. application data layer ( summary data)
4. Sand box layer
5. Target layer
Q53. Layers of big data architecture
Q54. Roles of a data architect

Data architect tasks and responsibilities

Typical responsibilities range from evaluating the current data architecture to keeping databases secure. Depending on your organisation and industry, your day-to-day tasks might include:

  • Translating business requirements into databases, data warehouses, and data streams.

  • Creating procedures to ensure data accuracy and accessibility.

  • Analysing, planning, and defining data architecture framework, including security, reference data, metadata, and master data.

  • Creating and implementing data management processes and procedures.

  • Collaborating with other teams within the organisation to devise and implement data strategies, build models, and assess shareholder needs and goals.

  • Researching data acquisition opportunities.

  • Developing application programming interfaces (APIs) to retrieve data.

 
Q56. Steps to prepare an STTM (source-to-target mapping)

Steps involved in source-to-target mapping
You can map your data from a source of your choice to your desired destination by implementing the following steps:
Step 1: Defining the Attributes 
Step 2: Mapping the Attributes 
Step 3: Transforming the Data
Step 4: Define standard naming conventions
Step 5: Establish Flows for Data Mapping
Step 6: Establish Data Transformation Guidelines 
Step 7: Testing the Mapping Process 
Step 8: Deploying the Mapping Process 
Step 9: Maintaining the Mapping Process  


 
Q58. Which Azure features did you work with?
Q59. Which AWS features did you work with?
Q60. What is Data Vault modeling and what does it contain?
  
Q62. Steps for building a dimensional model
Q63. Difference between a view and a materialized view
Views and materialized views are both database objects that provide access to data, but they differ in several ways:
  • Storage
    Views are virtual and don't store data, while materialized views store precomputed results as a physical table.
  • Performance
    Views add no storage or refresh overhead, but every query re-executes the underlying SQL; materialized views can be much faster for complex queries and aggregations because the results are precomputed, at the cost of storage and periodic refreshes (see the sketch below).
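A minimal sketch, reusing the emp table from the salary examples later in these notes (CREATE MATERIALIZED VIEW and REFRESH shown in PostgreSQL/Oracle-style syntax): the view recomputes the aggregation on every query, while the materialized view stores the result and must be refreshed.

-- Plain view: no storage, always current, recomputed on every query
CREATE VIEW v_dept_salary AS
SELECT deptid, SUM(salary) AS total_salary
FROM   emp
GROUP BY deptid;

-- Materialized view: the result is stored physically as a table
CREATE MATERIALIZED VIEW mv_dept_salary AS
SELECT deptid, SUM(salary) AS total_salary
FROM   emp
GROUP BY deptid;

-- PostgreSQL-style manual refresh when the base table changes
REFRESH MATERIALIZED VIEW mv_dept_salary;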



Q64. What is data lineage?
Data lineage is the documentation of all the ETL steps that each table and each column goes through, from source to final target, in a diagram.
It is like building a data model (where we add entities and attributes); here we add tables, columns, and the next stage to the diagram.
Q65. Types of NoSQL databases
Q66. Types of data models
Q67. Kimball and Inmon model matrix






    61. What is data governance?

    Data governance is the process of managing the
     availability, usability, integrity and security
    of the data in enterprise systems,
    based on internal standards and policies
    that also control data usage.

     Effective data governance ensures that
    data is consistent and trustworthy
    and doesn't get misused.

    Data governance is increasingly critical as organizations
    face expanding data privacy regulations and
    rely more and more on data analytics
    to help optimize operations and
    drive business decision-making.

    https://www.thoughtspot.com/data-trends/data-governance/how-to-implement-data-governance

    https://data-modeling-made-easy.blogspot.com/2024/08/how-to-implement-data-governance.html

     

    Advantages of ELT (Extract, Load, Transform)

    ELT is better suited for larger data sets, situations where timeliness is crucial, or when you must keep raw data available for future analysis. For instance, ELT is faster than ETL because it loads data directly into the destination system and performs the transformations there, inside the destination system.

     

     

    What is a Data Lake?

    A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first.

    data lakehouse

    A data lakehouse
     
    is a data architecture that combines a data lake and a data warehouse to provide a flexible solution for business intelligence and data science. It allows organizations to use low-cost, flexible storage for all types of data, while also benefiting from the data management, schema, and governance of a warehouse.

    Data lakes and data warehouses are both storage systems for big data used by business analysts, data engineers, and data scientists. They have different architectures and serve different business needs:

    Data lakes
    Store raw, structured, and semi-structured data from internal and external sources. They are good for data analytics, machine learning, experimenting with data, and training AI models. Data lakes prioritize storage volume and cost over performance, and can store any type or volume of data in full fidelity. They also allow users to ingest data from any system at any speed, and process data in real time or batch mode.

    Data warehouses
    Store structured data from multiple sources. They are good for business intelligence (BI), data visualizations, reporting, and accessing reliable, quality data. Data warehouses are designed for the fastest query performance, and are optimized to analyze relational data coming from transactional systems and line-of-business applications.


    SQL : query to show Total salary  of each department


    SELECT
      id, empname, deptid, salary,
      SUM(salary) OVER (PARTITION BY deptid) AS sum_sal
    FROM emp
    ORDER BY id;

     ===========================

    # Id, empname, deptid, Salary, SUM_SAL

    '1', 'john', '1', '20000', '50000'

    '2', 'kem', '1', '30000', '50000'

    '3', 'chan', '2', '10000', '30000'

    '4', 'henry', '2', '20000', '30000'

    '5', 'bill', '3', '30000', '30000'

    '6', 'salt', '4', '20000', '20000'


     SQL : query to sum of salary by dept and 
    cumulative salary by each dept 

    SELECT deptid, salary,
           SUM(salary) OVER (PARTITION BY deptid
                             ORDER BY id) AS cumulative_sal
    FROM   emp
    ORDER  BY id;

    ===============================

    # deptid, salary, CUMulative_SAL

    '1', '20000', '20000'

    '1', '30000', '50000'

    '2', '10000', '10000'

    '2', '20000', '30000'

    '3', '30000', '30000'

    '4', '20000', '20000'




    =========================================================================

     68. When to use a star schema and when to go for a snowflake?

     If you have few dimensions and low cardinality (few unique values in the dimensions), but you require fast query execution, a star schema is the right choice.

     However, if you have several dimensions and high cardinality, a snowflake schema will be the better choice.
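     An illustrative DDL sketch of the same product dimension both ways (table names made up): the star keeps the dimension denormalized in one table, while the snowflake normalizes the category into its own table at the cost of an extra join.

     -- Star schema: one wide, denormalized dimension
     CREATE TABLE dim_product_star (
         product_key   INT PRIMARY KEY,
         product_name  VARCHAR(100),
         category_name VARCHAR(50),   -- category attributes repeated on every product row
         category_mgr  VARCHAR(50)
     );

     -- Snowflake schema: the category is normalized out into its own table
     CREATE TABLE dim_category (
         category_key  INT PRIMARY KEY,
         category_name VARCHAR(50),
         category_mgr  VARCHAR(50)
     );

     CREATE TABLE dim_product_snow (
         product_key  INT PRIMARY KEY,
         product_name VARCHAR(100),
         category_key INT REFERENCES dim_category (category_key)  -- extra join at query time
     );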

     


     


     



    =========================================================================
    69. features of snowflake db

    Snowflake can seamlessly integrate with popular cloud providers such as AWS, Azure, and Google Cloud Platform. It can be scaled up or down automatically, ensuring efficient data loading, integration, and analysis. This allows multiple users to run numerous workloads simultaneously without concerns about resources

    https://airbyte.com/data-engineering-resources/snowflake-features#:~:text=Snowflake%20can%20seamlessly%20integrate%20with,simultaneously%20without%20concerns%20about%20resources.

    https://docs.snowflake.com/en/user-guide/intro-supported-features

    https://staragile.com/blog/snowflake-features

    A Compact List of Snowflake Features

    1. Decoupling of storage and compute in Snowflake
    2. Auto-Resume, Auto-Suspend, Auto-Scale
    3. Workload Separation and Concurrency
    4. Snowflake Administration
    5. Cloud Agnostic
    6. Semi-structured Data Storage
    7. Data Exchange
    8. Time Travel
    9. Cloning
    10. Snowpark
    11. Snowsight
    12. Security Features
    13. Snowflake Pricing
    =========================================================================


    ERWIN - how maintain versions

     
    Two types of model versions exist:

    Delta Version: A delta version is created when you initially save a new model or when you save changes to an existing model.

    If you do not want to maintain multiple versions, you can clear the Maintain multiple versions check box. If you clear the checkbox, then erwin Data Modeler does not create a delta version for any incremental save. erwin Data Modeler updates the current version every time you save the model.

    Named Version:
    A named version is created from a delta version, or from another named version. A named version represents a milestone in the development of the model, or at least a version that you want to keep indefinitely. When you save the changes in a named version, the changes are saved in the same version.

    The Based upon field in the Edit Catalog pane displays the version number from which the Named version is created.

    =========================================================================

    Create a Delta Version

    You create a delta version when you save a model initially or when you incrementally save an existing model.

    Follow these steps:

    1. Click File, Mart, Open >
      > The Open Model dialog opens. > Select a model and click OK.
      > The model opens.
      >Make necessary changes to the model.
      >Click File, Mart, Save.

    2. A delta version of the model is created with the incremental changes.

    =========================================================================

    Create a Named Version

    A named version of a model represents a milestone in the development of the model. You create a named version to keep that model version indefinitely.

    Follow these steps:

    1. Click File, Mart, Catalog Manager   .

      > The Catalog Manager opens. > Select a model version,
      >right-click and click Mark Version. 

    2. A named version is created with a default name.
      > Edit the name of the named version (new name ) and press enter.

    3. A new named version is created in the catalog.

    =========================================================================

    Compare Model Versions

    Compare two model versions of a data model to view the differences between them.

    Follow these steps:

    1. Click File, Mart, Catalog Manager.  > The Catalog Manager dialog opens.
      Hold the Ctrl key and select two model versions of a model and
      > click Version Compare. >Click Compare.
    2. The Complete Compare wizard opens. >.

    3. The Resolve Difference dialog opens. Review the differences and use the tools on the toolbar to work with and manage the changes. > Finish  > Click Close.

    =========================================================================




    A surrogate key is a unique identifier for a row in a database table that is often used when a natural key is not suitable as a primary key. Surrogate keys can be useful in many situations, including:
    • while maintaining a history table, when the primary key is repeated, we add a
      sequential auto-generated number to identify each row uniquely; that is called a surrogate key.

    =========================================================================

    what is data lineage ? and tools


    Data lineage is the process of tracking the flow of data over time, providing a clear understanding of
    - where the data originated,
    - how it has changed, and
    - its ultimate destination within the 
    data pipeline.

    https://www.montecarlodata.com/blog-open-source-data-lineage-tools/

    =========================================================================

    How do you calculate the size of each table 


    What are the datatypes and their Memory size calculation 

    https://data-modeling-made-easy.blogspot.com/2024/07/sql-data-types-and-sizes.html


    =========================================================================

    What is Data governance

    • Data governance is 
      • a set of processes, policies, roles, metrics, and standards  
      • that help  management processes to keep 
        • data secure, private, accurate, and usable throughout its life cycle. 

    =========================================================================

    Contents of Data Governance :

    1. Data Architecture Management
    2. Data Development
    3. Database operations Management
    4. Data security Management
    5. Reference & Master data Management
    6. DWH & BI Management
    7. Document and content Management
    8. Meta Data Management
    9. Data Quality Management
    =========================================================================
    Data Governance Framework Steps :
    =========================================================================

    Goals:      Security, Privacy, Compliance, Quality, Integration, Metadata, Retention, Risk, Impact

    Methods:    Policies, Guides, Guardrails, Gates, Code of Ethics, Curating, Coaching

    People:     Sponsor, Owner, Steward, Curator, Coach, Consumer, Stakeholder

    Processes:  Issues Management, Change Management, Quality Management, Cataloging, Measurement, Monitoring

    Technology: Ingestion, Cataloging, Data Preparation, Data Analysis, Pipeline Management

    Culture:    Collaboration, Crowd Sourcing, Communication, Sharing, Reuse


    =========================================================

    Explain Gdpr & Hipaa in data governance? 

    • HIPAA (Health Insurance Portability and Accountability Act) is United States legislation.

    What are the primary HIPAA goals?

    • To limit the use of protected health information to those with a “need to know”
    • To penalize those who do not comply with confidentiality regulations

    What health information is protected?

    • Any healthcare information with an identifier that links a specific patient to healthcare information (name, social security number, telephone number, email address, street address, among others)

     
    • GDPR (The General Data Protection Regulation )

    The General Data Protection Regulation (GDPR) is a European law that protects individuals' personal information and their fundamental rights and freedoms

    =========================================================================
    What is Medallion architecture?

    ===============================================================

     
    ===============================================================

    where do we use data vault modeling

    A data vault is a data modeling approach and methodology used in enterprise data warehousing to handle complex and varying data structures. It combines the strengths of 3rd normal form and star schema.
    ===============================================================

    Problems / challenges of data modeling?
    Data modeling can face a number of challenges, including:
    • Data quality: Data may be missing, incorrect, or inconsistent, and it can be difficult to maintain data quality over time. 
       
    • Data security: The many interconnected data sources make it vulnerable to attacks from hackers. 
       
    • Integrating diverse data sources: Data from different sources may be in different structures, schemas, and formats. It's important to make sure the data is cleaned and transformed correctly before loading it into a hub. 
       
    • Scalability: Big data can be enormous, and the system may run too slowly or be unable to handle heavy pressure. Cloud computing can help with this challenge. 
       
    • Choosing the right data model: It can be challenging to choose the right data model. 
       
    • Balancing normalization and denormalization: It can be challenging to balance normalization and denormalization. 
       
    • Handling data changes and evolution: It can be challenging to handle data changes and evolution. 
       
    • Communicating and collaborating with stakeholders: It can be challenging to communicate and collaborate with stakeholder
    ============================================
    sql :  union union all difference
    ============================================

    The short answer: the key difference is that UNION removes duplicate records, whereas UNION ALL includes all duplicates.

    This distinction not only changes the number of rows in the query result, it also impacts performance.

    So, when you do not need to remove duplicates, use UNION ALL instead of UNION to speed up execution (see the sketch below).
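    A quick illustration, assuming two tables online_orders and store_orders with the same columns:

    -- UNION removes duplicate rows (requires an extra sort/distinct step)
    SELECT customer_id FROM online_orders
    UNION
    SELECT customer_id FROM store_orders;

    -- UNION ALL keeps every row, including duplicates, and skips the de-duplication work
    SELECT customer_id FROM online_orders
    UNION ALL
    SELECT customer_id FROM store_orders;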
    ============================================
    DIFFERENCE between joins and union

    joins combine data side-by-side, whereas unions combine data one on top of the other
     
    ===========================================================

    Data modeling tools list 

    Astah
    Database Workbench
    DbSchema
    Enterprise Architect
    ER/Studio
    Erwin Data Modeler
    MagicDraw
    SQL Server Management Studio
    ModelRight
    MySQL Workbench
    Navicat Data Modeler
    NORMA Object-Role Modeling
    Open ModelSphere
    Oracle SQL Developer Data Modeler
    SAP PowerDesigner
    Software Ideas Modeler
    SQLyog
    Toad Data Modeler

    ===========================================================
    Query optimization best practices

    When optimizing queries, best practices include:
    + using indexes effectively,
    + avoiding SELECT *; select only the columns you need (and limit rows where appropriate),
    + filtering data early in the WHERE clause,
    + choosing appropriate JOIN types,
    + minimizing subqueries,
    + utilizing stored procedures,
    + monitoring query execution time,
    + selecting the correct data types instead of defaulting to VARCHAR (which forces conversions),
    + and always reviewing and analyzing execution plans to identify areas for improvement.
    ============================================
    Key points to remember:
    • Select only necessary columns:
      Avoid using SELECT * and explicitly list the columns you need to retrieve only the relevant data. 
    • Filter early:
      Apply WHERE clauses as early as possible in the query to reduce the amount of data processed. 
    • Use indexes wisely:
      Create indexes on frequently used columns in WHERE clauses to speed up lookups, but be cautious of over-indexing which can slow down writes. 
    • Optimize joins:
      Choose the appropriate JOIN type (INNER JOIN, LEFT JOIN, etc.) based on your data relationships and avoid unnecessary joins. 
    • Minimize subqueries:
      Try to rewrite queries to avoid nested subqueries where possible as they can be inefficient. 
    • Consider data types:
      Use the most appropriate data type for each column to optimize storage and comparison operations. 
    • Analyze execution plans:
      Regularly review the query execution plan to identify potential bottlenecks and optimize accordingly. 
    • Use stored procedures:
      For complex logic or frequently used queries, consider using stored procedures to improve performance and maintainability. 
    Other important practices:
    • Partitioning and sharding:
      For very large datasets, consider partitioning tables by date or other relevant criteria to improve query performance on specific subsets. 
    • Avoid unnecessary calculations:
      Perform calculations only when needed and avoid redundant computations within the query. 
    • Use UNION ALL instead of UNION:
      When combining results from multiple queries, use UNION ALL if you don't need to remove duplicates. 
    • Monitor query performance:
      Implement monitoring tools to track query execution times and identify potential performance issues. 
    • Optimize for your database system:
      Understand the specific optimization features and best practices available for your database platform. 
    ==============================================================
    Best practices for SQL query 

    Best Practices For SQL Query Optimizations


    SQL stands for Structured Query Language which is used to interact with a relational database. It is a tool for managing, organizing, manipulating, and retrieving data from databases. SQL offers amazing advantages like faster query processing, highly portable, interactive language, cost-efficient, and many more. When data needs to be retrieved from SQL a request is sent. The DBMS processes the requests and returns them to us. 


    The main aim of SQL is to retrieve data from databases. So if these queries aren’t effective enough this can lead to a slowdown from the server. So SQL query optimizations need to be done to maximize the output. In this blog, we will discuss the Best Practices for SQL Query optimization. But first, let us understand what is SQL Query Optimization and their requirements.

    What is SQL Query Optimization?

    SQL query optimization is the process of refining SQL queries to improve their efficiency and performance. Optimization techniques help to query and retrieve data quickly and accurately. Without proper optimization, the queries would be like searching through this data unorganized and inefficiently, wasting time and resources.

    Requirement For SQL Query Optimization

    The main goal of SQL query optimization is to reduce the load on system resources and provide accurate results in lesser time. It makes the code more efficient which is important for optimal performance of queries. The major reasons for SQL Query Optimizations are:

    • Enhancing Performance: The main reason for SQL Query Optimization is to reduce the response time and enhance the performance of the query. The time difference between request and response needs to be minimized for a better user experience.
    • Reduced Execution Time: The SQL query optimization ensures reduced CPU time hence faster results are obtained. Further, it is ensured that websites respond quickly and there are no significant lags.
    • Enhances the Efficiency: Query optimization reduces the time spend on hardware and thus servers run efficiently with lower power and memory consumption.

    Best Practices For SQL Query Optimization

    The optimized SQL queries not only enhance the performance but also contribute to cost savings by reducing resource consumption. Let us see the various ways in which you can optimize SQL queries for faster performance.

    1. Use Indexes

    Indexes act like internal guides for the database to locate specific information quickly. Identify frequently used columns in WHERE clauses, JOIN conditions, and ORDER BY clauses, and create indexes on those columns. However, creating too many indexes can slow down adding and updating data, so use them strategically.

    The database needs to maintain the indexes in addition to the main table data, which adds some overhead. So, it’s important to strike a balance and only create indexes on columns that will provide significant search speed improvements.

    2. Use WHERE Clause instead of having

    Using the WHERE clause instead of HAVING improves efficiency to a great extent. WHERE executes more quickly than HAVING: WHERE filters rows before groups are created, while HAVING filters them only after the groups have been created. This means that using WHERE instead of HAVING, wherever the condition does not need aggregates, will enhance performance and minimize the time taken.

    For Example

    • SELECT name FROM table_name WHERE age >= 18; displays only those names whose age is greater than or equal to 18, whereas
    • SELECT age, COUNT(A) AS Students FROM table_name GROUP BY age HAVING COUNT(A) > 1; first forms the groups (with the renamed count column) and only then keeps the groups that pass the condition.

    3. Avoid Queries inside a Loop

    This is one of the best optimization techniques that you must follow. Running queries inside the loop will slow down the execution time to a great extent. In most cases, you will be able to insert and update data in bulk which is a far better approach as compared to queries inside a loop.

    The iterative pattern which could be visible in loops such as for, while and do-while takes a lot of time to execute, and thus the performance and scalability are also affected. To avoid this, all the queries can be made outside loops, and hence, the efficiency can be improved.

    4. Use Select instead of Select *

    One of the best ways to enhance efficiency is to reduce the load on the database. This can be done by limiting the amount of information retrieved by each query. Running queries with SELECT * retrieves all the information available in the database table, including unnecessary data, which takes a lot of time and increases the load on the database.

    Let’s understand this better with the help of an example. Consider a table name GeeksforGeeks which has columns names like Java, Python, and DSA. 

    • SELECT * FROM GeeksforGeeks; – gives you the complete table as output, whereas
    • SELECT <column list> FROM GeeksforGeeks; – gives you only the preferred (selected) columns.

    So the better approach is to use a Select statement with defined parameters to retrieve only necessary information. Using Select will decrease the load on the database and enhances performance.

    5. Add Explain to the Beginning of Queries

    The EXPLAIN keyword describes how a SQL query will be executed. This description includes how the tables are joined, in what order, and more. It is a beneficial query-optimization tool that gives the step-by-step details of execution. Add EXPLAIN and check whether the changes you made have reduced the runtime significantly or not. Running an EXPLAIN query takes time itself, so it should only be done during the query optimization process.

    6. Keep Wild cards at the End of Phrases

    A wildcard is used to substitute one or more characters in a string. It is used with the LIKE operator, and the LIKE operator is used with a WHERE clause to search for a specified pattern. Pairing a leading wildcard with an ending wildcard checks for all records matching between the two wildcards. Let's understand this with the help of an example.

    Consider a table Employee which has 2 columns name and salary. There are 2 different employees namely Rama and Balram.

    • Select name, salary From Employee Where name  like ‘%Ram%’;
    • Select name, salary From Employee Where name  like ‘Ram%’;

    Searching with '%Ram%' returns both Rama and Balram, whereas 'Ram%' returns just Rama. A leading wildcard also prevents the database from using an index on the column, so when there are many records, keeping the wildcard at the end of the phrase improves efficiency.

    7. Use EXISTS() instead of COUNT()

    Both EXISTS() and COUNT() can be used to check whether the table has a specific record. But in most cases EXISTS() is much more effective than COUNT(): EXISTS() stops running as soon as it finds the first matching entry, whereas COUNT() keeps running and counts every matching record. Hence this practice saves a lot of time and computation power. EXISTS stops as soon as the logical test proves to be true, whereas COUNT(*) must count each and every row, even after it has passed the test.
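    A sketch of the same existence check both ways, reusing the Authors/Books example from the JOIN section below: EXISTS can stop at the first matching book, while COUNT(*) counts them all before comparing.

    -- Using EXISTS: stops at the first matching row per author
    SELECT a.name
    FROM   Authors a
    WHERE  EXISTS (SELECT 1
                   FROM   Books b
                   WHERE  b.author_id = a.id);

    -- Using COUNT: counts every matching row before the comparison is evaluated
    SELECT a.name
    FROM   Authors a
    WHERE  (SELECT COUNT(*)
            FROM   Books b
            WHERE  b.author_id = a.id) > 0;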

    8. Avoid Cartesian Products

    Cartesian products occur when every row from one table is joined with every row from another table, resulting in a massive dataset. Accidental Cartesian products can severely impact query performance. Always double-check JOIN conditions to avoid unintended Cartesian products. Make sure you’re joining the tables based on the specific relationship you want to explore.

    For Example

    • Incorrect JOIN (Cartesian product): SELECT * FROM Authors JOIN Books; (This joins every author with every book)
    • Correct JOIN (retrieves books by author): SELECT Authors.name, Books.title FROM Authors JOIN Books ON Authors.id = Books.author_id; (This joins authors with their corresponding books based on author ID).

    9. Consider Denormalization

    Denormalization involves strategically adding redundant data to a database schema to improve query performance. It can reduce the need for JOIN operations but should be balanced with considerations for data integrity and maintenance overhead. JOIN operations, which combine data from multiple tables, can be slow, especially for complex queries. Denormalization aims to reduce the need for JOINs by copying some data from one table to another.

    For Example

    Imagine tables for “Customers” and “Orders.” Normally, you’d link them with a foreign key (e.g., customer ID) in the Orders table. To speed up queries that retrieve customer information along with their orders, you could denormalize by adding some customer details (e.g., name, email) directly into the Orders table.

    10. Optimize JOIN Operations

    JOIN operations combine rows from two or more tables based on a related column. Select the JOIN type that aligns with the data you want to retrieve. For example, to find all customers and their corresponding orders (even if a customer has no orders), use a LEFT JOIN on the customer ID column. The JOIN operation works by comparing values in specific columns from both tables (join condition). Ensure these columns are indexed for faster lookups. Having indexes on join columns significantly improves the speed of the JOIN operation.
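    Continuing the Authors/Books example above, a small sketch of supporting the JOIN with an index on the join column:

    -- Index the foreign-key / join column so the JOIN can use an index lookup
    CREATE INDEX idx_books_author_id ON Books (author_id);

    -- LEFT JOIN keeps authors with no books; the join condition column is now indexed
    SELECT Authors.name, Books.title
    FROM   Authors
    LEFT JOIN Books ON Books.author_id = Authors.id;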


    ===============================================

    Why do we need Data warehouse when Application database already exists ?

    Data warehouses are important for a number of reasons, including:
    • Centralizing data
      Data warehouses consolidate large amounts of data from multiple sources into a single, central database. This makes it easier to analyze the data and gain valuable insights. 
    • Improving decision-making
      Data warehouses help organizations make faster, more informed decisions by providing easy access to high-quality data. 
    • Maintaining historical data
      Data warehouses can store months or years of information, which can be useful for trend analysis and forecasting. 
    • Securing data
      Data warehouses store data in a single location, which makes it easier to control access and keep data secure. 
    • Handling big data
      Data warehouses can help solve big data challenges by making large amounts of information more usable. 
    • Optimizing for read access
      Data warehouses are usually optimized for read access, which can result in faster report generation. 

    Data warehouses are a critical component of business intelligence systems and data pipelines
    =======================================================
    Best Data modeling practices

    4 Best Practices for Data Modelling

    There are four principles and best practices for data modeling design to help you enhance the productivity of your data warehouse:

    Data Modeling Best Practices #1: Grain

    Indicate the level of granularity at which the data will be kept. Usually, the lowest (most detailed) grain is the starting point for data modeling. Then, you may aggregate and combine the data to obtain summary insights.

    Data Modeling Best Practices #2: Naming

    Naming things remains a problem in data modeling. The ideal practice is to pick and adhere to a naming scheme.

    Utilize schemas to identify name-space relations, such as data sources or business units. For instance, you might use the marketing schema to hold the tables most relevant to the marketing team, and the analytics schema to store advanced concepts such as long-term value.

    Data Modeling Best Practices #3: Materialization

    It is one of the most important tools for constructing an exceptional data model. If you build the relation as a table, you may precompute any required computations, resulting in faster query response times for your user base.

    If you expose your relation as a view, your users' queries will always return the most recent data sets. Nonetheless, response times will be slower. Depending on the data warehousing strategy and technologies you employ, you may have to make various trade-offs around materialization.

    Data Modeling Best Practices #4: Permissions and Governance

    Data modelers should be aware of the varying rights and data governance requirements of the enterprise. Working collaboratively with your security team to verify that your data warehouse adheres to all applicable regulations would be beneficial.

    For instance, firms that deal with medical data sets are subject to HIPAA data authorization and privacy rules. All customer-facing internet firms should be aware of the EU General Data Protection Regulation (EU GDPR), and SaaS enterprises are frequently constrained in their ability to exploit client data depending on the terms of their contracts.

    =======================================================

    ===============================================

