----------------------------------------------------------------------------
## Data Modeling Best Practices – Headings
------------------------------------------------------------------------------
1. Start with Business Requirements (Not Tables)
2. Define Grain First (Critical Rule)
3. Choose the Right Model Type
4. Use Clear Naming Conventions
5. Design Clean Dimension Tables
6. Keep Fact Tables Lean
7. Enforce Data Integrity
8. Handle Dates Properly
9. Optimize for Query Performance
10. Plan for Data Growth
11. Document Everything
12. Validate with Real Data
13. Version Control Your Models
14. Avoid Over-Normalization in Warehouses
15. Design for BI & Reporting Tools
-----------------------------------------------------------------------------------------------------------------------------
Common mistakes in data modeling include ignoring business requirements, neglecting stakeholder input, inconsistent naming conventions, not considering data quality, and failing to adequately test and validate the model.

| Mistake | Why it's a mistake | How to avoid it |
| --- | --- | --- |
| 1. Ignoring stakeholder input | Without input from users and business analysts, the model may not accurately reflect real-world requirements or user needs, leading to a model that is difficult to use or maintain. | Involve stakeholders throughout the data modeling process, gathering feedback on data structures, relationships, and naming conventions. |
| 2. Inconsistent naming conventions | Inconsistent naming makes the model confusing and difficult to understand, leading to errors and wasted time. | Establish clear naming conventions for entities, attributes, and relationships, and adhere to them consistently throughout the model. |
| 3. Failing to test and validate | A model that isn't thoroughly tested and validated will likely contain errors that can lead to inaccurate reporting, poor decision-making, and other problems. | Implement a testing strategy that includes unit tests, integration tests, and user acceptance tests. |
| 4. Neglecting data quality | Poor data quality can lead to inaccurate results, unreliable reports, and wasted resources. | Implement data quality checks to ensure data is accurate, complete, consistent, and up to date. |
| 5. Ignoring business requirements | If the data model doesn't meet the business requirements, it will be difficult to use effectively. | Thoroughly understand the business requirements before starting the data modeling process. |
| 6. Overlooking scalability | A model that isn't designed with scalability in mind may not be able to handle future growth in data volume or user load. | Consider the future needs of the application when designing the data model. |
| 7. Inadequate normalization | Too much or too little normalization can lead to performance issues or data redundancy. | Strike a balance between normalization and denormalization to optimize for both data integrity and query performance. |
| 8. Starting too late | Delaying data modeling can lead to rework later in the project. | Start the data modeling process early in the project lifecycle. |
--------------------------------------------------------------------------------------------------------------
VARCHAR and NVARCHAR: the difference
In database contexts, VARCHAR and NVARCHAR are data types used to store variable-length text strings. The key difference lies in their encoding: VARCHAR uses single-byte character sets (such as ASCII or a code page), while NVARCHAR uses Unicode, allowing a wider range of characters, including those from multiple languages. NVARCHAR typically requires more storage space and has a lower maximum character length than VARCHAR.
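A minimal sketch of the difference, assuming SQL Server syntax (the table and column names here are made up; exact byte counts vary by collation):

```sql
-- VARCHAR stores single-byte/code-page text; NVARCHAR stores Unicode text
CREATE TABLE demo_names (
    ascii_name   VARCHAR(50),    -- roughly 1 byte per character
    unicode_name NVARCHAR(50)    -- roughly 2 bytes per character, so multilingual text is safe
);

INSERT INTO demo_names (ascii_name, unicode_name)
VALUES ('Chennai', N'சென்னை');   -- the N prefix marks a Unicode string literal
```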
What is a data lakehouse? A data lakehouse combines elements of a data lake and a data warehouse. Data lakes and data warehouses are both storage systems for big data used by business analysts, data engineers, and data scientists; they have different architectures and serve different business needs, and a lakehouse layers warehouse-style management and querying on top of low-cost lake storage.
SQL: query to show the total salary of each department alongside each row

    SELECT id, empname, deptid, salary,
           SUM(salary) OVER (PARTITION BY deptid) AS sum_sal
    FROM emp
    ORDER BY id;

| id | empname | deptid | salary | sum_sal |
| --- | --- | --- | --- | --- |
| 1 | john | 1 | 20000 | 50000 |
| 2 | kem | 1 | 30000 | 50000 |
| 3 | chan | 2 | 10000 | 30000 |
| 4 | henry | 2 | 20000 | 30000 |
| 5 | bill | 3 | 30000 | 30000 |
| 6 | salt | 4 | 20000 | 20000 |
SQL: query to show the cumulative (running) salary within each department

    SELECT deptid, salary,
           SUM(salary) OVER (PARTITION BY deptid ORDER BY id) AS cumulative_sal
    FROM emp
    ORDER BY id;

| deptid | salary | cumulative_sal |
| --- | --- | --- |
| 1 | 20000 | 20000 |
| 1 | 30000 | 50000 |
| 2 | 10000 | 10000 |
| 2 | 20000 | 30000 |
| 3 | 30000 | 30000 |
| 4 | 20000 | 20000 |
If you have few dimensions with low cardinality (few unique values in the dimensions) but require fast query execution, a star schema is the right choice.
However, if you have many dimensions with high cardinality, a snowflake schema is the better choice.
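As a rough sketch of the structural difference (all table and column names below are hypothetical), the same product dimension can be kept denormalized for a star schema or split into a separate category table for a snowflake schema:

```sql
-- Star schema: one denormalized dimension, category attributes kept inline
CREATE TABLE dim_product_star (
    product_key      INT PRIMARY KEY,
    product_name     VARCHAR(100),
    category_name    VARCHAR(50),
    category_manager VARCHAR(100)
);

-- Snowflake schema: the category attributes are normalized into their own dimension
CREATE TABLE dim_category (
    category_key     INT PRIMARY KEY,
    category_name    VARCHAR(50),
    category_manager VARCHAR(100)
);

CREATE TABLE dim_product_snowflake (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category (category_key)
);
```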
https://airbyte.com/data-engineering-resources/snowflake-features#:~:text=Snowflake%20can%20seamlessly%20integrate%20with,simultaneously%20without%20concerns%20about%20resources.
https://docs.snowflake.com/en/user-guide/intro-supported-features
https://staragile.com/blog/snowflake-features
A Compact List of Snowflake Features
- Decoupling of storage and compute in Snowflake
- Auto-Resume, Auto-Suspend, Auto-Scale
- Workload Separation and Concurrency
- Snowflake Administration
- Cloud Agnostic
- Semi-structured Data Storage
- Data Exchange
- Time Travel
- Cloning
- Snowpark
- Snowsight
- Security Features
- Snowflake Pricing
erwin Data Modeler: how to maintain versions
Two types of model versions exist:
Delta Version: A delta version is created when you initially save a new model or when you save changes to an existing model. If you do not want to maintain multiple versions, you can clear the Maintain multiple versions check box. If you clear the check box, erwin Data Modeler does not create a delta version for any incremental save; instead, it updates the current version every time you save the model.
Named Version: A named version marks a milestone in the development of the model and is kept indefinitely. The Based upon field in the Edit Catalog pane displays the version number from which the named version is created.
Create a Delta Version
You create a delta version when you save a model initially or when you incrementally save an existing model.
Follow these steps:
1. Click File, Mart, Open. The Open Model dialog opens.
2. Select a model and click OK. The model opens.
3. Make the necessary changes to the model.
4. Click File, Mart, Save. A delta version of the model is created with the incremental changes.
=========================================================================
Create a Named Version
A named version of a model represents a milestone in the development of the model. You create a named version to keep that model version indefinitely.
Follow these steps:
1. Click File, Mart, Catalog Manager. The Catalog Manager opens.
2. Select a model version, right-click, and click Mark Version. A named version is created with a default name.
3. Edit the name of the named version and press Enter. The new named version is created in the catalog.
Compare Model Versions
Compare two model versions of a data model to view the differences between them.
Follow these steps:
1. Click File, Mart, Catalog Manager. The Catalog Manager dialog opens.
2. Hold the Ctrl key and select two model versions of a model, then click Version Compare.
3. Click Compare. The Complete Compare wizard opens.
4. The Resolve Difference dialog opens. Review the differences and use the tools on the toolbar to work with and manage the changes.
5. Click Finish, then click Close.
=========================================================================
What is data lineage, and what tools support it?
Data lineage is the process of tracking the flow of data over time, providing a clear understanding of
- where the data originated,
- how it has changed, and
- its ultimate destination within the data pipeline.
https://www.montecarlodata.com/blog-open-source-data-lineage-tools/
How do you calculate the size of each table?
What are the data types, and how are their storage sizes calculated?
https://data-modeling-made-easy.blogspot.com/2024/07/sql-data-types-and-sizes.html
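A minimal sketch, assuming SQL Server (the emp table is the sample used above; other platforms have their own catalog views or commands):

```sql
-- Report the row count and space used for one table
EXEC sp_spaceused 'dbo.emp';

-- Rough manual estimate: add up the fixed byte sizes of the columns per row
-- (e.g. INT = 4 bytes, BIGINT = 8, DATE = 3, CHAR(10) = 10, NCHAR(10) = 20),
-- then multiply by the expected number of rows, plus per-row and index overhead.
```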
=========================================================================
What is Data governance?
- Data governance is a set of processes, policies, roles, metrics, and standards that help an organization keep data secure, private, accurate, and usable throughout its life cycle.
(See also: https://data-modeling-made-easy.blogspot.com/2024/08/how-to-implement-data-governance.html)
Contents of Data Governance:
- Data Architecture Management
- Data Development
- Database operations Management
- Data security Management
- Reference & Master data Management
- DWH & BI Management
- Document and content Management
- Meta Data Management
- Data Quality Management
| Goals | Methods | People | Processes | Technology | Culture |
| --- | --- | --- | --- | --- | --- |
| Security | Policies | Sponsor | Issues management | Ingestion | Collaboration |
| Privacy | Guides | Owner | Change management | Cataloging | Crowdsourcing |
| Compliance | Guardrails | Steward | Quality management | Data preparation | Communication |
| Quality | Gates | Curator | Cataloging | Data analysis | Sharing |
| Integration | Code of ethics | Coach | Measurement | Pipeline management | Reuse |
| Metadata | Curating | Consumer | Monitoring | | |
| Retention | Coaching | Stakeholder | | | |
| Risk | | | | | |
| Impact | | | | | |
- HIPAA (Health Insurance Portability and Accountability Act) is United States legislation.
What are the primary HIPAA goals?
- To limit the use of protected health information to those with a “need to know”
- To penalize those who do not comply with confidentiality regulations
What health information is protected?
- Any healthcare information with an identifier that links a specific patient to healthcare information (name, social security number, telephone number, email address, street address, among others)
- GDPR (General Data Protection Regulation) is European Union legislation governing the protection of personal data and privacy.
Common challenges in data modeling and data integration:
- Data quality: Data may be missing, incorrect, or inconsistent, and it can be difficult to maintain data quality over time.
- Data security: The many interconnected data sources make it vulnerable to attacks from hackers.
- Integrating diverse data sources: Data from different sources may be in different structures, schemas, and formats. It's important to make sure the data is cleaned and transformed correctly before loading it into a hub.
- Scalability: Big data can be enormous, and the system may run too slowly or be unable to handle heavy pressure. Cloud computing can help with this challenge.
- Choosing the right data model: Different workloads suit different models (normalized relational, dimensional star/snowflake, data vault), and picking the wrong one early is costly to reverse.
- Balancing normalization and denormalization: Too much normalization hurts query performance, while too much denormalization introduces redundancy and update anomalies.
- Handling data changes and evolution: Source schemas and business rules change over time, so the model must accommodate new attributes and slowly changing dimensions without breaking existing reports.
- Communicating and collaborating with stakeholders: Business users, analysts, and engineers must agree on definitions, naming, and requirements throughout the project.
How do you optimize SQL queries? Quick tips:
- Avoid using SELECT * and explicitly list the columns you need to retrieve only the relevant data.
- Apply WHERE clauses as early as possible in the query to reduce the amount of data processed.
- Create indexes on frequently used columns in WHERE clauses to speed up lookups, but be cautious of over-indexing which can slow down writes.
- Choose the appropriate JOIN type (INNER JOIN, LEFT JOIN, etc.) based on your data relationships and avoid unnecessary joins.
- Try to rewrite queries to avoid nested subqueries where possible as they can be inefficient.
- Use the most appropriate data type for each column to optimize storage and comparison operations.
- Regularly review the query execution plan to identify potential bottlenecks and optimize accordingly.
- For complex logic or frequently used queries, consider using stored procedures to improve performance and maintainability.
- For very large datasets, consider partitioning tables by date or other relevant criteria to improve query performance on specific subsets.
- Perform calculations only when needed and avoid redundant computations within the query.
- When combining results from multiple queries, use UNION ALL if you don't need to remove duplicates.
- Implement monitoring tools to track query execution times and identify potential performance issues.
- Understand the specific optimization features and best practices available for your database platform.
Best Practices For SQL Query Optimization
SQL stands for Structured Query Language and is used to interact with relational databases. It is a tool for managing, organizing, manipulating, and retrieving data from databases. SQL offers advantages such as fast query processing, high portability, an interactive language, and cost efficiency. When data needs to be retrieved, a request (query) is sent to the DBMS, which processes it and returns the results.

The main aim of SQL is to retrieve data from databases. If queries are not written efficiently, they can slow down the server, so SQL query optimization is needed to maximize output. In this blog, we will discuss the best practices for SQL query optimization, but first let us understand what SQL query optimization is and why it is required.
What is SQL Query Optimization?
SQL query optimization is the process of refining SQL queries to improve their efficiency and performance. Optimization techniques help queries retrieve data quickly and accurately. Without proper optimization, queries end up searching through the data in an unorganized, inefficient way, wasting time and resources.
Requirement For SQL Query Optimization
The main goal of SQL query optimization is to reduce the load on system resources and provide accurate results in lesser time. It makes the code more efficient which is important for optimal performance of queries. The major reasons for SQL Query Optimizations are:
- Enhancing Performance: The main reason for SQL Query Optimization is to reduce the response time and enhance the performance of the query. The time difference between request and response needs to be minimized for a better user experience.
- Reduced Execution Time: SQL query optimization reduces CPU time, so results are obtained faster. It also ensures that websites respond quickly and without significant lag.
- Enhances Efficiency: Query optimization reduces the time spent on hardware, so servers run efficiently with lower power and memory consumption.
Best Practices For SQL Query Optimization
The optimized SQL queries not only enhance the performance but also contribute to cost savings by reducing resource consumption. Let us see the various ways in which you can optimize SQL queries for faster performance.
1. Use Indexes
Indexes act like internal guides for the database to locate specific information quickly. Identify frequently used columns in WHERE clauses, JOIN conditions, and ORDER BY clauses, and create indexes on those columns. However, creating too many indexes can slow down adding and updating data, so use them strategically.
The database needs to maintain the indexes in addition to the main table data, which adds some overhead. So, it’s important to strike a balance and only create indexes on columns that will provide significant search speed improvements.
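A minimal sketch using the emp sample table from earlier (the index name is made up):

```sql
-- Index a column that appears frequently in WHERE clauses and JOIN conditions
CREATE INDEX ix_emp_deptid ON emp (deptid);

-- This filter can now use an index seek instead of scanning the whole table
SELECT empname, salary
FROM emp
WHERE deptid = 2;
```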
2. Use WHERE Clause Instead of HAVING
Using the WHERE clause instead of HAVING, wherever possible, improves efficiency considerably. WHERE filters rows before groups are created, whereas HAVING filters only after the groups have been formed. Filtering early with WHERE reduces the number of rows the GROUP BY has to process, which minimizes the time taken.
For example:
- SELECT name FROM table_name WHERE age >= 18; displays only those names whose age is greater than or equal to 18, filtering rows before any grouping happens, whereas
- SELECT age, COUNT(*) AS Students FROM table_name GROUP BY age HAVING COUNT(*) > 1; first builds the groups and only then keeps the groups that pass the condition.
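A short sketch showing each clause doing its proper job (table and column names here are hypothetical):

```sql
-- WHERE removes rows before grouping; HAVING filters the aggregated groups afterwards
SELECT age, COUNT(*) AS students
FROM table_name
WHERE age >= 18          -- cheap row-level filter applied first
GROUP BY age
HAVING COUNT(*) > 1;     -- keeps only the groups that pass the aggregate condition
```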
3. Avoid Queries inside a Loop
This is one of the best optimization techniques that you must follow. Running queries inside the loop will slow down the execution time to a great extent. In most cases, you will be able to insert and update data in bulk which is a far better approach as compared to queries inside a loop.
Iterative patterns such as for, while, and do-while loops issue one query per iteration, which takes far longer to execute and hurts both performance and scalability. Move the work outside the loop into a single set-based statement wherever possible, as sketched below.
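A minimal sketch of replacing a per-row loop with one set-based statement (the sales_summary target table is hypothetical; emp is the sample table used earlier):

```sql
-- Instead of looping in application code and issuing one INSERT per department,
-- load all of the aggregated rows with a single set-based statement
INSERT INTO sales_summary (deptid, total_salary)
SELECT deptid, SUM(salary)
FROM emp
GROUP BY deptid;
```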
4. Use SELECT with Explicit Columns Instead of SELECT *
One of the best ways to enhance efficiency is to reduce the load on the database. This can be done by limiting the amount of information retrieved by each query. Running queries with SELECT * retrieves every column in the table, including columns you do not need, which takes more time and increases the load on the database.
Let's understand this better with the help of an example. Consider a table named GeeksforGeeks which has columns named Java, Python, and DSA.
- SELECT * FROM GeeksforGeeks; – gives you the complete table as output, whereas
- SELECT Java, Python FROM GeeksforGeeks; – gives you only the preferred (selected) columns.
So the better approach is to use a SELECT statement with an explicit column list to retrieve only the necessary information. Doing so decreases the load on the database and enhances performance.
5. Add EXPLAIN to the Beginning of Queries
The EXPLAIN keyword describes how a SQL query will be executed: which indexes are used, how tables are joined, in what order, and more. It is a valuable query optimization tool that shows the step-by-step details of execution. Add EXPLAIN while tuning and check whether your changes have actually reduced the runtime. Running EXPLAIN takes time itself, so it should only be used during the query optimization process.
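A minimal sketch (EXPLAIN as used in MySQL/PostgreSQL; SQL Server exposes the same information through its execution plans; emp is the sample table used earlier):

```sql
-- Prefix the query with EXPLAIN to see the planned access paths and join order
EXPLAIN
SELECT deptid, SUM(salary) AS total_salary
FROM emp
GROUP BY deptid;
```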
6. Keep Wildcards at the End of Phrases
A wildcard substitutes for one or more characters in a string and is used with the LIKE operator, which in turn is used in a WHERE clause to search for a specified pattern. A leading wildcard combined with a trailing wildcard ('%Ram%') forces the database to check every value for a match anywhere in the string, so an index on that column cannot be used for a seek. Let's understand this with the help of an example.
Consider a table Employee which has two columns, name and salary, and two employees named Rama and Balram.
- SELECT name, salary FROM Employee WHERE name LIKE '%Ram%';
- SELECT name, salary FROM Employee WHERE name LIKE 'Ram%';
Searching for '%Ram%' returns both Rama and Balram, whereas 'Ram%' returns just Rama. With many records, keeping the wildcard only at the end of the phrase lets the database narrow the search (and use an index range scan), which is where the efficiency gain comes from.
7. Use EXISTS() Instead of COUNT()
Both EXISTS() and COUNT() can be used to check whether a table has a matching record, but in most cases EXISTS() is much more efficient. EXISTS() stops scanning as soon as it finds the first matching row, whereas COUNT(*) must count every matching row even after the question "is there at least one?" has already been answered. Using EXISTS() for existence checks therefore saves time and computation.
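A minimal sketch of an existence check (emp is the sample table used earlier; the department id is arbitrary):

```sql
-- EXISTS stops at the first matching row; COUNT(*) would keep scanning
SELECT CASE
         WHEN EXISTS (SELECT 1 FROM emp WHERE deptid = 2)
         THEN 'department 2 has employees'
         ELSE 'department 2 is empty'
       END AS dept_status;
```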
8. Avoid Cartesian Products
Cartesian products occur when every row from one table is joined with every row from another table, resulting in a massive dataset. Accidental Cartesian products can severely impact query performance. Always double-check JOIN conditions to avoid unintended Cartesian products. Make sure you’re joining the tables based on the specific relationship you want to explore.
For example:
- Incorrect JOIN (Cartesian product): SELECT * FROM Authors JOIN Books; – with no ON condition, this pairs every author with every book.
- Correct JOIN (retrieves books by author): SELECT Authors.name, Books.title FROM Authors JOIN Books ON Authors.id = Books.author_id; – this joins authors with their corresponding books based on the author ID.
9. Consider Denormalization
Denormalization involves strategically adding redundant data to a database schema to improve query performance. It can reduce the need for JOIN operations but should be balanced with considerations for data integrity and maintenance overhead. JOIN operations, which combine data from multiple tables, can be slow, especially for complex queries. Denormalization aims to reduce the need for JOINs by copying some data from one table to another.
For Example
Imagine tables for “Customers” and “Orders.” Normally, you’d link them with a foreign key (e.g., customer ID) in the Orders table. To speed up queries that retrieve customer information along with their orders, you could denormalize by adding some customer details (e.g., name, email) directly into the Orders table.
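A minimal sketch of that denormalized Orders table (the column names are hypothetical):

```sql
-- Customer name and email are copied into Orders so common reports can skip the JOIN;
-- the trade-off is that these copies must be kept in sync with the Customers table
CREATE TABLE Orders (
    order_id       INT PRIMARY KEY,
    customer_id    INT NOT NULL,       -- foreign key still kept for integrity
    customer_name  VARCHAR(100),       -- redundant copy (denormalized)
    customer_email VARCHAR(255),       -- redundant copy (denormalized)
    order_date     DATE,
    amount         DECIMAL(10, 2)
);
```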
10. Optimize JOIN Operations
JOIN operations combine rows from two or more tables based on a related column. Select the JOIN type that aligns with the data you want to retrieve. For example, to find all customers and their corresponding orders (even if a customer has no orders), use a LEFT JOIN on the customer ID column. The JOIN operation works by comparing values in specific columns from both tables (join condition). Ensure these columns are indexed for faster lookups. Having indexes on join columns significantly improves the speed of the JOIN operation.
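A minimal sketch of the LEFT JOIN described above, with an index on the join column (the Customers/Orders tables and the index name are hypothetical):

```sql
-- Index the join column so looking up each customer's orders is fast
CREATE INDEX ix_orders_customer_id ON Orders (customer_id);

-- All customers and their orders, including customers who have no orders yet
SELECT c.customer_id, c.name, o.order_id, o.amount
FROM Customers c
LEFT JOIN Orders o
       ON o.customer_id = c.customer_id;
```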
===============================================
Why do we need a data warehouse when an application database already exists?
- Data warehouses consolidate large amounts of data from multiple sources into a single, central database. This makes it easier to analyze the data and gain valuable insights.
- Data warehouses help organizations make faster, more informed decisions by providing easy access to high-quality data.
- Data warehouses can store months or years of information, which can be useful for trend analysis and forecasting.
- Data warehouses store data in a single location, which makes it easier to control access and keep data secure.
- Data warehouses can help solve big data challenges by making large amounts of information more usable.
- Data warehouses are usually optimized for read access, which can result in faster report generation.
4 Best Practices for Data Modeling
There are four principles and best practices for data modeling design to help you enhance the productivity of your data warehouse:
Data Modeling Best Practices #1: Grain
Indicate the level of granularity at which the data will be kept. Usually, the lowest (most detailed) grain available is the best starting point for data modeling; you can then aggregate and combine the data to obtain summary insights, as sketched below.
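A minimal sketch of stating the grain explicitly (table and column names are hypothetical): one row per order line is the lowest grain here, and summaries are built by aggregating up from it.

```sql
-- Grain: one row per order line (the lowest level of detail kept)
CREATE TABLE fact_order_line (
    order_id     INT,
    line_number  INT,
    product_key  INT,
    customer_key INT,
    date_key     INT,
    quantity     INT,
    line_amount  DECIMAL(10, 2),
    PRIMARY KEY (order_id, line_number)
);

-- Summary insights are then derived by aggregating up from that grain, for example:
-- SELECT date_key, SUM(line_amount) AS daily_revenue FROM fact_order_line GROUP BY date_key;
```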
Data Modeling Best Practices #2: Naming
Naming things remains a problem in data modeling. The ideal practice is to pick and adhere to a naming scheme.
Utilize schemas to identify name-space relations, such as data sources or business units. For instance, you might use the marketing schema to hold the tables most relevant to the marketing team, and the analytics schema to store advanced concepts such as long-term value.
Data Modeling Best Practices #3: Materialization
Materialization is one of the most important levers when building a data model. If you materialize the relation as a table, you can precompute any required calculations, resulting in faster query response times for your users.
If you expose the relation as a view, your users' queries will always return the most recent data, but response times will be slower. Depending on the data warehousing strategy and technologies you employ, you may have to make different trade-offs around materialization, as sketched below.
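A minimal sketch of the two options (object names are hypothetical; CREATE TABLE ... AS SELECT syntax varies by platform, e.g. SQL Server uses SELECT ... INTO instead):

```sql
-- Option 1: a view — always returns fresh data, but the aggregation runs at query time
CREATE VIEW dept_salary_v AS
SELECT deptid, SUM(salary) AS total_salary
FROM emp
GROUP BY deptid;

-- Option 2: a precomputed table — fast to read, but must be refreshed on a schedule
CREATE TABLE dept_salary AS
SELECT deptid, SUM(salary) AS total_salary
FROM emp
GROUP BY deptid;
```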
Data Modeling Best Practices #4: Permissions and Governance
Data modelers should be aware of the varying rights and data governance requirements of the enterprise. Working collaboratively with your security team to verify that your data warehouse adheres to all applicable regulations would be beneficial.
For instance, firms that deal with medical data sets are subject to HIPAA data authorization and privacy rules. All customer-facing internet firms should be aware of the EU General Data Protection Regulation (EU GDPR), and SaaS enterprises are frequently constrained in their ability to exploit client data depending on the terms of their contracts.
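A minimal sketch of enforcing such permissions in the warehouse (role and object names are hypothetical; GRANT/REVOKE syntax varies slightly by platform):

```sql
-- Give the reporting role read-only access to approved analytics objects,
-- while access to sensitive data stays limited to governed roles
GRANT SELECT ON analytics.dept_salary TO reporting_role;
REVOKE SELECT ON analytics.patient_details FROM reporting_role;
```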
===============================================



