-------------------------------------
8apr25 - KPI partner
-------------------------------------------
How do you do data model optimization?
Data model optimization involves refining a data model to improve its performance, maintainability, and scalability. Here are some essential techniques to consider:
1. Normalization and Denormalization:
- Normalization: Organize data to eliminate redundancy and dependency by splitting tables and defining relationships.
- Denormalization: Combine tables to improve query performance, especially in read-heavy systems, such as reporting systems.
2. Indexing:
- Use indexes (e.g., clustered, non-clustered, or composite) to accelerate data retrieval and improve query performance (see the SQL sketches after this list).
- Be cautious with excessive indexing, as it can impact write operations.
3. Partitioning:
- Divide large tables into smaller, more manageable chunks (e.g., by date or region).
- Enhances performance for queries targeting specific partitions.
4. Data Caching:
- Use caching mechanisms to store frequently accessed data in memory for faster retrieval.
- Tools like Redis or Memcached can reduce database load.
5. Schema Optimization:
- Simplify the schema design, ensuring it aligns with the specific requirements of the application.
- Use schemas like star or snowflake for analytical workloads.
6. Materialized Views:
- Precompute and store complex query results to improve performance for frequently executed queries.
7. Query Optimization:
- Refactor queries to eliminate unnecessary joins or subqueries.
- Avoid selecting unnecessary columns (SELECT * should be avoided).
8. Data Types:
- Choose appropriate data types for each column to reduce storage size and improve performance.
9. Archiving and Purging:
- Move historical or less-used data to archive storage, ensuring the active database operates efficiently.
10. Concurrency Control:
- Ensure your database can handle multiple simultaneous transactions without bottlenecks or conflicts.
11. Monitoring and Feedback:
- Use database performance monitoring tools to identify and address bottlenecks, such as long-running queries or underutilized indexes.
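A minimal SQL sketch of items 2, 3, and 8 above (indexing, partitioning, and data types). The table and column names are hypothetical, and the partitioning DDL is PostgreSQL-specific; other databases use different syntax:
-- Choose compact, appropriate data types, and range-partition the
-- large table by date (PostgreSQL declarative partitioning).
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  INT,
    order_date   DATE,
    amount       NUMERIC(10, 2)
) PARTITION BY RANGE (order_date);
-- Queries filtering on order_date only scan the matching partition.
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Composite index to speed up a common lookup pattern;
-- remember that every extra index slows down writes.
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);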
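And a sketch of items 6 and 7 (materialized views and query optimization), continuing the hypothetical orders table above, again with PostgreSQL syntax:
-- Precompute a frequently requested aggregate once...
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
-- ...and refresh it on a schedule instead of recomputing per query.
REFRESH MATERIALIZED VIEW daily_sales;
-- Query optimization: select only the columns you need, not SELECT *.
SELECT order_date, total_amount
FROM daily_sales
WHERE order_date >= '2025-01-01';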
----------------------------
How do you handle late-arriving dimensions?
There are a few common approaches:
1. Placeholder ("unknown") dimension records:
- When a dimension record is missing, create a placeholder or "unknown" record in the dimension table with default values for all attributes.
- Prevents fact data from being lost or incomplete.
- Allows fact data to be loaded and reported on even when dimensions are missing.
- Simplifies the process of updating dimension records when they arrive.
- Example: if a product dimension is missing, create a "product_id_unknown" record with default values for the other attributes (see the first SQL sketch after this list).
2. Hold facts in a temporary table:
- When a dimension record is missing, store the fact record in a temporary table until the dimension is available.
- Ensures that all fact data is eventually processed.
- Simplifies the process of updating fact records when the dimension arrives.
- Example: store fact records with missing product dimensions in a temporary table, then re-process them when the product dimension is loaded.
3. Retry failed joins (a sketch also follows this list):
- Use a process to identify and retry failed joins when dimensions are missing.
- Ensures that all fact data is eventually processed; can be used with both streaming and batch ETL processes.
- Write unjoined records to a "streaming_pipeline_errors" table.
- Consolidate multiple failed retries for the same event into a single fact row.
- Union Kafka and retryable events, then transform/join this unioned data with the dimension table.
- Control how many retries are allowed by filtering out reconciliation records that exceed the configured number of retries.
Additional considerations:
- Identify Slowly Changing Dimensions: Determine how to handle changes to dimension attributes over time.
- Use Surrogate Keys: Employ surrogate keys for dimension tables to simplify joins and manage slowly changing dimensions.
- Data Deduplication: Ensure that you are not duplicating fact data when retrying joins.
- Schema Evolution: Plan for changes to the schema of your dimension tables.
- Inferred Flag: Use a flag to indicate whether a dimension record is inferred.
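A minimal SQL sketch of the placeholder/inferred-record approach (approach 1), assuming hypothetical staging_sales, dim_product, and fact_sales tables where product_key is an auto-generated surrogate key:
-- Insert an inferred dimension row for any product the fact feed
-- references but the dimension table has not received yet.
INSERT INTO dim_product (product_id, product_name, is_inferred)
SELECT DISTINCT s.product_id, 'Unknown', TRUE
FROM staging_sales s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_product d WHERE d.product_id = s.product_id
);
-- Facts can now join and load immediately instead of being dropped.
INSERT INTO fact_sales (product_key, sale_date, amount)
SELECT d.product_key, s.sale_date, s.amount
FROM staging_sales s
JOIN dim_product d ON d.product_id = s.product_id;
-- When the real dimension record arrives, update the inferred row in
-- place; existing facts keep pointing at the same surrogate key.
UPDATE dim_product
SET product_name = 'Actual Product Name',  -- attributes from the late feed
    is_inferred  = FALSE
WHERE product_id = 'P-123'                 -- hypothetical natural key
  AND is_inferred = TRUE;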
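And a sketch of the retry approach (approach 3), again with hypothetical names; the retry_count column and the limit of 5 retries are assumptions:
-- Combine fresh staged events with previously failed ones that are
-- still within the retry budget; older failures are dropped.
CREATE TEMPORARY TABLE unioned_events AS
SELECT product_id, sale_date, amount, 0 AS retry_count
FROM staging_sales
UNION ALL
SELECT product_id, sale_date, amount, retry_count
FROM streaming_pipeline_errors
WHERE retry_count < 5;
-- Clear the consumed error rows (simplified; a real pipeline would do
-- this transactionally).
DELETE FROM streaming_pipeline_errors;
-- Load the events whose dimension has arrived.
INSERT INTO fact_sales (product_key, sale_date, amount)
SELECT d.product_key, e.sale_date, e.amount
FROM unioned_events e
JOIN dim_product d ON d.product_id = e.product_id;
-- Re-queue the events that still failed to join, bumping the retry count.
INSERT INTO streaming_pipeline_errors (product_id, sale_date, amount, retry_count)
SELECT e.product_id, e.sale_date, e.amount, e.retry_count + 1
FROM unioned_events e
LEFT JOIN dim_product d ON d.product_id = e.product_id
WHERE d.product_key IS NULL;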
----------------------------
What are data model constraints?
Data model constraints refer to rules or limitations designed to maintain the integrity and validity of data within a database or system. Here are some common types of constraints:
Entity Constraints:
- Primary Key: Ensures that each row in a table is unique and identifiable (e.g., no duplicate or NULL values in the primary key column).
- Unique Constraint: Ensures all values in a column are unique; unlike the primary key, it allows NULL values (one in SQL Server, multiple in most other databases).
Domain Constraints:
- Restrict the type, format, or range of values for a column (e.g., specifying data types like INTEGER, VARCHAR, or a range for a numeric field).
Referential Constraints:
- Foreign Key: Ensures a column’s value matches the primary key of another table, maintaining referential integrity between tables.
Check Constraints:
- Specifies a condition that each row must satisfy (e.g., salary > 0 or age >= 18).
Default Constraints:
- Assigns a default value to a column if no value is provided during insertion.
Not Null Constraint:
- Ensures that a column cannot have NULL values (e.g., the name column must always have a value).
Functional Dependencies:
- Enforces relationships between attributes in a table, often used for normalization.
Key Constraints:
- Besides primary and unique keys, these may involve composite keys (combination of columns acting as a primary key).
Business Rules:
- Application-specific rules tailored to the business logic (e.g., an employee can only belong to one department).
Performance Constraints:
- Limitations imposed to enhance query performance, such as partitioning or indexing strategies.
These constraints ensure data reliability, prevent inconsistencies, and enforce rules for both operational and analytical use cases.
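Most of these constraints can be declared directly in the table DDL. A small illustrative sketch (table and column names are hypothetical, and a departments table is assumed to exist):
CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,                   -- entity constraint
    email         VARCHAR(255) UNIQUE,               -- unique constraint
    name          VARCHAR(100) NOT NULL,             -- not null constraint
    age           INT CHECK (age >= 18),             -- check constraint
    salary        NUMERIC(10, 2) CHECK (salary > 0), -- check constraint
    hire_date     DATE DEFAULT CURRENT_DATE,         -- default constraint
    department_id INT REFERENCES departments (department_id)  -- foreign key
);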
-----------------------------------
Find the SQL for the fourth highest salary?
To find the fourth highest salary in each department, you can use the ROW_NUMBER or DENSE_RANK window function. Here's an example:
Using ROW_NUMBER:
WITH RankedSalaries AS (
    SELECT
        department_id,
        salary,
        -- "rank" is a reserved word in several databases, so use a different alias
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT
    department_id,
    salary AS fourth_highest_salary
FROM RankedSalaries
WHERE salary_rank = 4;
Using DENSE_RANK (if there are duplicate salaries):
WITH RankedSalaries AS (
    SELECT
        department_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT
    department_id,
    salary AS fourth_highest_salary
FROM RankedSalaries
WHERE salary_rank = 4;
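Note the difference: with ROW_NUMBER, ties are broken arbitrarily, so WHERE salary_rank = 4 returns at most one row per department; with DENSE_RANK, duplicate salaries share a rank, so it returns the fourth highest distinct salary, possibly for several employees.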
--------------------------------
aug-25
--------------------------------------
20aug25
How is Snowflake different from other databases or data warehouses?
Tell me your knowledge of the three layers of ETL.
=============================
28aug25 - Teksystems
=============================================================
In a data warehouse, how do you update an SCD Type 2 history table if the last loaded data was corrupt and needs to be replaced with the correct values?