-------------------------------------
8apr25 - KPI partner
-------------------------------------------
How do you do data model optimization?
Data model optimization involves refining a data model to improve its performance, maintainability, and scalability. Here are some essential techniques to consider:
1. Normalization and Denormalization:
- Normalization: Organize data to eliminate redundancy and dependency by splitting tables and defining relationships.
- Denormalization: Combine tables to improve query performance, especially in read-heavy systems, such as reporting systems.
2. Indexing:
- Use indexes (e.g., clustered, non-clustered, or composite) to accelerate data retrieval and improve query performance (see the SQL sketches after this list).
- Be cautious with excessive indexing, as it can impact write operations.
3. Partitioning:
- Divide large tables into smaller, more manageable chunks (e.g., by date or region).
- Enhances performance for queries targeting specific partitions.
4. Data Caching:
- Use caching mechanisms to store frequently accessed data in memory for faster retrieval.
- Tools like Redis or Memcached can reduce database load.
5. Schema Optimization:
- Simplify the schema design, ensuring it aligns with the specific requirements of the application.
- Use schemas like star or snowflake for analytical workloads.
6. Materialized Views:
- Precompute and store complex query results to improve performance for frequently executed queries.
7. Query Optimization:
- Refactor queries to eliminate unnecessary joins or subqueries.
- Avoid selecting unnecessary columns (SELECT * should be avoided).
8. Data Types:
- Choose appropriate data types for each column to reduce storage size and improve performance.
9. Archiving and Purging:
- Move historical or less-used data to archive storage, ensuring the active database operates efficiently.
10. Concurrency Control:
- Ensure your database can handle multiple simultaneous transactions without bottlenecks or conflicts.
11. Monitoring and Feedback:
- Use database performance monitoring tools to identify and address bottlenecks, such as long-running queries or underutilized indexes.
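A minimal SQL sketch of items 2, 3, and 8 above (indexing, partitioning, and data types). The table and column names are hypothetical, and the partitioning DDL is PostgreSQL-specific; other databases use different syntax:
-- Choose compact, appropriate data types, and range-partition the
-- large table by date (PostgreSQL declarative partitioning).
CREATE TABLE orders (
    order_id     BIGINT,
    customer_id  INT,
    order_date   DATE,
    amount       NUMERIC(10, 2)
) PARTITION BY RANGE (order_date);
-- Queries filtering on order_date only scan the matching partition.
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Composite index to speed up a common lookup pattern;
-- remember that every extra index slows down writes.
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);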
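And a sketch of items 6 and 7 (materialized views and query optimization), continuing the hypothetical orders table above, again with PostgreSQL syntax:
-- Precompute a frequently requested aggregate once...
CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, SUM(amount) AS total_amount
FROM orders
GROUP BY order_date;
-- ...and refresh it on a schedule instead of recomputing per query.
REFRESH MATERIALIZED VIEW daily_sales;
-- Query optimization: select only the columns you need, not SELECT *.
SELECT order_date, total_amount
FROM daily_sales
WHERE order_date >= '2025-01-01';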
----------------------------
How do you handle late-arriving dimensions?
There are a few common approaches:
1. Placeholder ("unknown") dimension records:
- When a dimension record is missing, create a placeholder or "unknown" record in the dimension table with default values for all attributes.
- Prevents fact data from being lost or incomplete.
- Allows fact data to be loaded and reported on even when dimensions are missing.
- Simplifies the process of updating dimension records when they arrive.
- Example: if a product dimension is missing, create a "product_id_unknown" record with default values for the other attributes (see the first SQL sketch after this list).
2. Hold facts in a temporary table:
- When a dimension record is missing, store the fact record in a temporary table until the dimension is available.
- Ensures that all fact data is eventually processed.
- Simplifies the process of updating fact records when the dimension arrives.
- Example: store fact records with missing product dimensions in a temporary table, then re-process them when the product dimension is loaded.
3. Retry failed joins (a sketch also follows this list):
- Use a process to identify and retry failed joins when dimensions are missing.
- Ensures that all fact data is eventually processed; can be used with both streaming and batch ETL processes.
- Write unjoined records to a "streaming_pipeline_errors" table.
- Consolidate multiple failed retries for the same event into a single fact row.
- Union Kafka and retryable events, then transform/join this unioned data with the dimension table.
- Control how many retries are allowed by filtering out reconciliation records that exceed the configured number of retries.
Additional considerations:
- Identify Slowly Changing Dimensions: Determine how to handle changes to dimension attributes over time.
- Use Surrogate Keys: Employ surrogate keys for dimension tables to simplify joins and manage slowly changing dimensions.
- Data Deduplication: Ensure that you are not duplicating fact data when retrying joins.
- Schema Evolution: Plan for changes to the schema of your dimension tables.
- Inferred Flag: Use a flag to indicate whether a dimension record is inferred.
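A minimal SQL sketch of the placeholder/inferred-record approach (approach 1), assuming hypothetical staging_sales, dim_product, and fact_sales tables where product_key is an auto-generated surrogate key:
-- Insert an inferred dimension row for any product the fact feed
-- references but the dimension table has not received yet.
INSERT INTO dim_product (product_id, product_name, is_inferred)
SELECT DISTINCT s.product_id, 'Unknown', TRUE
FROM staging_sales s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_product d WHERE d.product_id = s.product_id
);
-- Facts can now join and load immediately instead of being dropped.
INSERT INTO fact_sales (product_key, sale_date, amount)
SELECT d.product_key, s.sale_date, s.amount
FROM staging_sales s
JOIN dim_product d ON d.product_id = s.product_id;
-- When the real dimension record arrives, update the inferred row in
-- place; existing facts keep pointing at the same surrogate key.
UPDATE dim_product
SET product_name = 'Actual Product Name',  -- attributes from the late feed
    is_inferred  = FALSE
WHERE product_id = 'P-123'                 -- hypothetical natural key
  AND is_inferred = TRUE;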
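And a sketch of the retry approach (approach 3), again with hypothetical names; the retry_count column and the limit of 5 retries are assumptions:
-- Combine fresh staged events with previously failed ones that are
-- still within the retry budget; older failures are dropped.
CREATE TEMPORARY TABLE unioned_events AS
SELECT product_id, sale_date, amount, 0 AS retry_count
FROM staging_sales
UNION ALL
SELECT product_id, sale_date, amount, retry_count
FROM streaming_pipeline_errors
WHERE retry_count < 5;
-- Clear the consumed error rows (simplified; a real pipeline would do
-- this transactionally).
DELETE FROM streaming_pipeline_errors;
-- Load the events whose dimension has arrived.
INSERT INTO fact_sales (product_key, sale_date, amount)
SELECT d.product_key, e.sale_date, e.amount
FROM unioned_events e
JOIN dim_product d ON d.product_id = e.product_id;
-- Re-queue the events that still failed to join, bumping the retry count.
INSERT INTO streaming_pipeline_errors (product_id, sale_date, amount, retry_count)
SELECT e.product_id, e.sale_date, e.amount, e.retry_count + 1
FROM unioned_events e
LEFT JOIN dim_product d ON d.product_id = e.product_id
WHERE d.product_key IS NULL;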
----------------------------
What are data model constraints?
Data model constraints refer to rules or limitations designed to maintain the integrity and validity of data within a database or system. Here are some common types of constraints:
Entity Constraints:
- Primary Key: Ensures that each row in a table is unique and identifiable (e.g., no duplicate or NULL values in the primary key column).
- Unique Constraint: Ensures all values in a column are unique; unlike the primary key, it allows NULL values (one in SQL Server, multiple in most other databases).
Domain Constraints:
- Restrict the type, format, or range of values for a column (e.g., specifying data types like INTEGER, VARCHAR, or a range for a numeric field).
Referential Constraints:
- Foreign Key: Ensures a column’s value matches the primary key of another table, maintaining referential integrity between tables.
Check Constraints:
- Specifies a condition that each row must satisfy (e.g., salary > 0 or age >= 18).
Default Constraints:
- Assigns a default value to a column if no value is provided during insertion.
Not Null Constraint:
- Ensures that a column cannot have NULL values (e.g., the name column must always have a value).
Functional Dependencies:
- Enforces relationships between attributes in a table, often used for normalization.
Key Constraints:
- Besides primary and unique keys, these may involve composite keys (combination of columns acting as a primary key).
Business Rules:
- Application-specific rules tailored to the business logic (e.g., an employee can only belong to one department).
Performance Constraints:
- Limitations imposed to enhance query performance, such as partitioning or indexing strategies.
These constraints ensure data reliability, prevent inconsistencies, and enforce rules for both operational and analytical use cases.
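Most of these constraints can be declared directly in the table DDL. A small illustrative sketch (table and column names are hypothetical, and a departments table is assumed to exist):
CREATE TABLE employees (
    employee_id   INT PRIMARY KEY,                   -- entity constraint
    email         VARCHAR(255) UNIQUE,               -- unique constraint
    name          VARCHAR(100) NOT NULL,             -- not null constraint
    age           INT CHECK (age >= 18),             -- check constraint
    salary        NUMERIC(10, 2) CHECK (salary > 0), -- check constraint
    hire_date     DATE DEFAULT CURRENT_DATE,         -- default constraint
    department_id INT REFERENCES departments (department_id)  -- foreign key
);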
-----------------------------------
Find the SQL for the fourth highest salary?
To find the fourth highest salary in each department, you can use the ROW_NUMBER or DENSE_RANK window function. Here's an example:
Using ROW_NUMBER:
WITH RankedSalaries AS (
    SELECT
        department_id,
        salary,
        -- "rank" is a reserved word in several databases, so use a different alias
        ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT
    department_id,
    salary AS fourth_highest_salary
FROM RankedSalaries
WHERE salary_rank = 4;
Using DENSE_RANK (if there are duplicate salaries):
WITH RankedSalaries AS (
    SELECT
        department_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT
    department_id,
    salary AS fourth_highest_salary
FROM RankedSalaries
WHERE salary_rank = 4;
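Note the difference: with ROW_NUMBER, ties are broken arbitrarily, so WHERE salary_rank = 4 returns at most one row per department; with DENSE_RANK, duplicate salaries share a rank, so it returns the fourth highest distinct salary, possibly for several employees.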
--------------------------------
aug-25
--------------------------------------
20aug25
How is Snowflake different from other databases or data warehouses?
Tell me your knowledge of the three layers of ETL.
=============================
28aug25 - Teksystems
=============================================================
In a data warehouse, how do you update an SCD Type 2 history table if the last loaded data was corrupt and needs to be replaced with the correct values?