Sunday, January 25, 2026

25jan26 - how to do profiling - in which tool and how

Here’s a clean, interview-ready walkthrough focused on one third-party tool (Informatica Data Quality), with step-by-step profiling and reporting. You can say this almost verbatim.


Tool used for data profiling

I use Informatica Data Quality (IDQ) to profile data files and generate data quality reports before loading them into the database.


Step-by-step: How I do data profiling in Informatica Data Quality

Step 1: Ingest the data file

  • Connect to the source (CSV / Excel / flat file / S3 / database)

  • Define the source metadata in IDQ

  • Validate column names and data types

📌 Purpose: Ensure the file is readable and structurally correct.
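
IDQ handles ingestion and metadata validation in its own UI; purely as an illustration, a minimal pandas sketch of the same structural check might look like this (the file path and the EXPECTED_SCHEMA columns such as order_id and order_amount are hypothetical examples, not IDQ objects):

```python
import pandas as pd

# Hypothetical expected layout of an orders file (column -> target dtype)
EXPECTED_SCHEMA = {
    "order_id": "string",
    "customer_name": "string",
    "email": "string",
    "order_date": "string",        # format validated later in pattern analysis
    "order_amount": "float64",
    "delivery_time_min": "float64",
    "order_status": "string",
}

def ingest_and_validate(path: str) -> pd.DataFrame:
    """Read the file and confirm it is structurally usable before profiling."""
    df = pd.read_csv(path, dtype="string")          # read everything as text first
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    # Coerce numeric columns; values that fail coercion become NaN for later review
    for col in ("order_amount", "delivery_time_min"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```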


Step 2: Run Column Profiling

  • Use Column Profile in Informatica Analyst

  • Analyze each column for:

    • Data type distribution

    • Null and blank percentage

    • Min / Max values

    • Distinct count

    • Value frequency

📌 Purpose: Understand the content and detect obvious issues.
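
The Column Profile in Informatica Analyst produces these statistics out of the box; as a rough stand-in, the same numbers can be computed with pandas, for example:

```python
import pandas as pd

def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column statistics: null/blank percentage, distinct count, min/max, top value."""
    rows = []
    for col in df.columns:
        s = df[col]
        non_null = s.dropna()
        as_text = s.astype("string")
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_pct": round(s.isna().mean() * 100, 2),
            "blank_pct": round(((as_text.str.strip() == "") & s.notna()).mean() * 100, 2),
            "distinct": s.nunique(dropna=True),
            "min": non_null.min() if len(non_null) else None,
            "max": non_null.max() if len(non_null) else None,
            "most_frequent": s.mode().iloc[0] if not s.mode().empty else None,
        })
    return pd.DataFrame(rows)
```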


Step 3: Run Data Domain & Pattern Analysis

  • Apply data domains (date, email, phone, numeric)

  • Use pattern analysis to detect invalid formats

Examples:

  • Invalid email formats

  • Date columns stored as strings

  • Mixed data types in one column

📌 Purpose: Validate format and consistency.
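
In IDQ this is done by attaching data domains and letting the profiler report pattern conformance; a hand-rolled equivalent for two of the examples above (email format, dates stored as strings) could look like this, where the simple regex and the YYYY-MM-DD date format are assumptions:

```python
import pandas as pd

# Deliberately simple illustrative pattern -- real email validation is stricter
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def pattern_checks(df: pd.DataFrame) -> dict:
    """Count rows whose format does not match the expected data domain."""
    has_email = df["email"].notna()
    bad_email = has_email & ~df["email"].fillna("").str.match(EMAIL_PATTERN)

    # Date columns stored as strings: anything that fails to parse as YYYY-MM-DD
    parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    bad_date = df["order_date"].notna() & parsed.isna()

    return {
        "invalid_email_rows": int(bad_email.sum()),
        "unparseable_date_rows": int(bad_date.sum()),
    }
```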


Step 4: Identify Duplicates

  • Use duplicate analysis on key fields

  • Identify exact and fuzzy duplicates (if needed)

Example:

  • Same order_id appearing multiple times

  • Same customer with slight name variations

📌 Purpose: Prevent double counting and incorrect metrics.
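
Duplicate analysis in IDQ covers both exact and fuzzy matching; a very rough sketch of the idea, using order_id as the business key and a simple name normalisation in place of real fuzzy matching, might be:

```python
import pandas as pd

def duplicate_checks(df: pd.DataFrame) -> dict:
    """Exact duplicates on the business key, plus a crude 'same customer' check."""
    # Same order_id appearing more than once
    exact_dupes = df[df.duplicated(subset=["order_id"], keep=False)]

    # Stand-in for fuzzy matching: normalise case and whitespace before comparing names
    norm_name = (
        df["customer_name"].fillna("")
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    fuzzy_dupes = df[norm_name.duplicated(keep=False) & (norm_name != "")]

    return {
        "exact_duplicate_rows": len(exact_dupes),
        "possible_fuzzy_duplicate_rows": len(fuzzy_dupes),
    }
```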


Step 5: Apply Business Rules

  • Create business rule transformations

  • Examples:

    • Order amount > 0

    • Delivery time between 0 and 180 minutes

    • Order status ∈ allowed values

📌 Purpose: Ensure data follows business logic, not just technical rules.
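
Inside IDQ these are rule specifications / transformations; expressed as plain checks over the same hypothetical columns, the three example rules come down to:

```python
import pandas as pd

ALLOWED_STATUSES = {"CREATED", "DISPATCHED", "DELIVERED", "CANCELLED"}   # assumed value list

def business_rule_failures(df: pd.DataFrame) -> dict:
    """Count rows violating each business rule (nulls count as failures)."""
    return {
        "amount_not_positive": int((~(df["order_amount"] > 0)).sum()),
        "delivery_time_out_of_range": int((~df["delivery_time_min"].between(0, 180)).sum()),
        "status_not_allowed": int((~df["order_status"].isin(ALLOWED_STATUSES)).sum()),
    }
```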


Step 6: Generate Data Quality Score

  • Assign weights to rules (critical vs non-critical)

  • Calculate overall data quality score

  • Categorize issues:

    • Critical

    • Warning

    • Informational

📌 Purpose: Measure readiness of the file.
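
A weighted score is easy to reason about: each rule gets a severity and a weight, the pass rate per rule is weighted, and the result is scaled to 0-100. A sketch, where the weights and severities are illustrative choices rather than IDQ defaults:

```python
# Hypothetical severity and weight per rule (critical rules dominate the score)
RULE_WEIGHTS = {
    "amount_not_positive":        ("critical", 3),
    "status_not_allowed":         ("critical", 3),
    "delivery_time_out_of_range": ("warning", 2),
    "invalid_email_rows":         ("informational", 1),
}

def quality_score(failures: dict, total_rows: int) -> float:
    """Weighted average pass rate across rules, scaled to 0-100."""
    weighted_pass = total_weight = 0.0
    for rule, failed_rows in failures.items():
        _severity, weight = RULE_WEIGHTS.get(rule, ("informational", 1))
        pass_rate = 1 - failed_rows / total_rows if total_rows else 0.0
        weighted_pass += weight * pass_rate
        total_weight += weight
    return round(100 * weighted_pass / total_weight, 2) if total_weight else 0.0
```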


Step 7: Create Data Quality Report

  • Generate profiling reports from Informatica Analyst

  • Report includes:

    • Column statistics

    • Failed rules

    • Duplicate counts

    • Data quality score

📌 Purpose: Provide transparency to stakeholders.
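
Informatica Analyst generates the report itself; if the same content had to be packaged by hand for stakeholders, it could be as little as dumping the pieces collected so far into one JSON file:

```python
import json
import pandas as pd

def build_report(profile: pd.DataFrame, failures: dict, dupes: dict,
                 score: float, path: str) -> None:
    """Bundle column stats, failed rules, duplicate counts and the DQ score into one file."""
    report = {
        "column_statistics": profile.to_dict(orient="records"),
        "failed_rules": failures,
        "duplicates": dupes,
        "data_quality_score": score,
    }
    with open(path, "w") as fh:
        json.dump(report, fh, indent=2, default=str)   # default=str handles dates etc.
```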


Step 8: Go / No-Go decision

  • If critical rules fail → Reject the file

  • Notify source system / upstream team

  • Reload only after correction

📌 Purpose: Prevent bad data from entering the warehouse.
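
The Go/No-Go gate itself is just a rule over the previous outputs. Reusing the RULE_WEIGHTS severities and the score from the Step 6 sketch (the 95 threshold is an assumed acceptance criterion, not a fixed standard):

```python
def go_no_go(failures: dict, score: float, min_score: float = 95.0) -> bool:
    """Reject the file if any critical rule fails or the overall score is below threshold."""
    critical_failed = any(
        failed_rows > 0 and RULE_WEIGHTS.get(rule, ("informational", 1))[0] == "critical"
        for rule, failed_rows in failures.items()
    )
    if critical_failed or score < min_score:
        print("NO-GO: reject the file, notify the upstream team, reload after correction")
        return False
    print("GO: file accepted for loading into the warehouse")
    return True
```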


How I ensure the file is in good shape

A file is considered ready when schema validation passes, nulls and duplicates are within thresholds, business rules are satisfied, and the overall data quality score meets acceptance criteria.


Strong closing line (interview gold)

“Using Informatica Data Quality, I profile the data structurally, statistically, and against business rules, generate a data quality report, and enforce Go/No-Go criteria before loading the data.”



