Data Visualization
CSV Data Profiling: Summary Stats, Missing Values, Duplicates, and Outliers (with Python)
Zedly AI Editorial Team
February 16, 2026
10 min read
You have a CSV file. Before you chart it, pivot it, or feed it into a model, you need to answer one question: what is actually in this file?
That is CSV data profiling. It is the step between "I have a file" and "I can work with this data." A profiling pass tells you the column types, how much data is missing, whether there are duplicates, what the distributions look like, and where the outliers hide.
This guide covers what a CSV profiling report should include, how to generate one with Python, and how to do it without writing any code using an AI sandbox. If you already know your data is clean and want to jump straight to charts, use the CSV data visualizer directly.
What a CSV Profiling Report Should Tell You
A useful profiling report answers five questions about your data. Skip this step and you will waste time debugging chart errors, chasing incorrect aggregations, or discovering too late that entire columns are null.
1. Shape and Structure
- Row count: How many records are in the file?
- Column count: How many fields per record?
- Column names: Are they clean, or do they have leading spaces, special characters, or duplicates?
- File size: Will this fit in memory for analysis?
2. Data Types
- Which columns are numeric (integers, floats)?
- Which are dates or timestamps?
- Which are categorical (text with repeated values)?
- Which are free-text (descriptions, notes, addresses)?
- Are any columns that look numeric actually stored as text (common with currency, percentages, or leading zeros)?
3. Missing Values
- Per-column null count and percentage: A column that is 80% null is probably not useful for analysis.
- Null patterns: Are the same rows missing across multiple columns (suggesting incomplete records)?
- Hidden nulls: Values like "N/A", "null", "-", "0", or empty strings that are not technically null but represent missing data.
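Hidden nulls like these can be surfaced with a short scan before they skew your counts. A minimal sketch, where the sample DataFrame and the sentinel list are illustrative, not exhaustive:

```python
import pandas as pd

# Sample data with hidden nulls (illustrative values)
df = pd.DataFrame({
    "city": ["Boston", "N/A", "", "Chicago"],
    "notes": ["ok", "-", "null", "ok"],
})

# Strings that usually mean "missing" but are not real NaN
SENTINELS = {"n/a", "na", "null", "none", "-", ""}

# Count sentinel occurrences per text column
for col in df.select_dtypes(include="object").columns:
    hits = df[col].str.strip().str.lower().isin(SENTINELS).sum()
    if hits:
        print(f"{col}: {hits} hidden nulls")
```

Normalizing with `strip` and `lower` first catches variants like `" N/A "` and `"Null"` that an exact match would miss.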
4. Duplicates
- Exact duplicates: Rows where every column matches another row.
- Key duplicates: Rows with the same ID, date, or transaction reference but different values in other columns (data entry errors or merge artifacts).
- Near-duplicates: "John Smith" vs "john smith" vs "JOHN SMITH" -- same entity, different formatting.
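A quick way to surface near-duplicates like these is to normalize before checking. A sketch with made-up names:

```python
import pandas as pd

# Illustrative names differing only in case and whitespace
df = pd.DataFrame({"name": ["John Smith", "john smith", " JOHN SMITH "]})

# Normalize first: strip whitespace, lowercase
normalized = df["name"].str.strip().str.lower()

print(f"Exact duplicates: {df['name'].duplicated().sum()}")
print(f"Near-duplicates after normalizing: {normalized.duplicated().sum()}")
```

The raw column reports zero duplicates; the normalized column reveals that all three rows are the same entity.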
5. Distributions and Outliers
- Summary statistics: Mean, median, standard deviation, min, max, and percentiles for numeric columns.
- Value counts: Top N most frequent values for categorical columns.
- Outliers: Values that fall far outside the expected range (a $50,000 lunch expense, a negative age, a date in the year 3025).
- Skewness: Is the distribution symmetric, or heavily skewed? This affects which charts and statistics are meaningful.
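As a quick illustration of why skew matters, one extreme value can pull the mean far from the median. The series here is made up:

```python
import pandas as pd

# A right-skewed column: one large value distorts the mean
s = pd.Series([10, 12, 11, 13, 12, 500])

print(f"mean:   {s.mean():.1f}")    # pulled up by the outlier
print(f"median: {s.median():.1f}")  # robust to the outlier
print(f"skew:   {s.skew():.2f}")    # values well above 1 suggest heavy right skew
```

When skew is this pronounced, prefer the median over the mean in summaries, and consider a log scale for charts.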
CSV Profiling with Python (The Manual Way)
If you write Python, here is the standard approach using pandas. This is what runs under the hood when you use an AI sandbox to profile data.
Basic Shape and Types
import pandas as pd
df = pd.read_csv("your_file.csv")
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
Missing Values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({"missing": missing, "pct": missing_pct})
      .query("missing > 0")
      .sort_values("pct", ascending=False))
Duplicates
exact_dupes = df.duplicated().sum()
print(f"Exact duplicate rows: {exact_dupes}")
# Check duplicates on a specific key column
key_dupes = df.duplicated(subset=["transaction_id"]).sum()
print(f"Duplicate transaction IDs: {key_dupes}")
Summary Statistics
# Numeric columns: count, mean, std, min, 25%, 50%, 75%, max
print(df.describe())
# Categorical columns: count, unique, top value, frequency
print(df.describe(include=["object"]))
Outlier Detection (IQR Method)
def flag_outliers_iqr(series, multiplier=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    return (series < lower) | (series > upper)

for col in df.select_dtypes(include="number").columns:
    outliers = flag_outliers_iqr(df[col]).sum()
    if outliers > 0:
        print(f"{col}: {outliers} outliers")
This is roughly 30 lines of code. It works, but you need a Python environment, you need to remember the API, and you need to adapt it for every new file. If you do this regularly, an AI sandbox that runs these checks from a plain-English prompt saves significant time.
CSV Profiling Without Writing Code
The alternative to writing Python yourself: upload your CSV to an AI-powered sandbox and describe what you want to know. The system writes and runs the code for you.
Example Prompts
Each of these runs real Python (pandas, numpy, scipy) in a sandboxed environment and returns formatted results:
"Profile this CSV"
Returns: row/column count, data types, missing values per column, summary statistics, duplicate count, and top values for categorical columns.
"Show me missing values by column, sorted by percentage"
Returns: table of columns with null counts and percentages, sorted descending. Highlights columns above 50% missing.
"Are there duplicate rows? Also check for duplicate invoice numbers"
Returns: exact duplicate count, plus duplicate key analysis on the specified column, with sample rows shown.
"Flag outliers in the Amount column using IQR. Show me the outlier rows."
Returns: IQR bounds, count of outliers, and a table of the flagged rows with all columns visible.
"What data types should each column be? Are any numbers stored as text?"
Returns: current vs recommended data types, with flags for columns where numeric data is stored as strings (common with currency, ZIP codes, or ID fields).
The advantage over writing your own scripts: you can ask follow-up questions. "Now remove those outlier rows and show me the updated statistics." "Which rows have missing values in both the Date and Amount columns?" "Convert the Amount column to numeric and re-run the profile." Each follow-up runs incrementally on the same data.
Open in the CSV data visualizer to try these prompts with your own file.
Profiling Checklist: What to Check Before Any Analysis
Use this as a pre-analysis checklist. Run through it every time you get a new CSV, whether from an export, a client, or a data pipeline.
Shape: How many rows and columns? Is this the complete dataset or a sample?
Headers: Are column names clean? Any leading/trailing spaces, special characters, or duplicate names?
Types: Are dates parsed as dates? Are numbers parsed as numbers? Are IDs stored as text (they should be)?
Nulls: Which columns have missing values? What percentage? Are there hidden nulls ("N/A", "null", "-", empty strings)?
Duplicates: Any exact duplicate rows? Any duplicate keys (IDs, dates, references)?
Ranges: Do min/max values make sense? Negative amounts where you expect positive? Dates in the future?
Outliers: Any extreme values that could skew averages or charts? Flag them before they distort your analysis.
Cardinality: How many unique values in each column? A "Category" column with 10,000 unique values is probably not categorical.
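The cardinality check at the end of this list can be scripted in a few lines. The sample columns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "B", "A"],
    "order_id": [101, 102, 103, 104, 105, 106],
})

# Unique values per column, and as a share of the row count
cardinality = df.nunique()
ratio = (cardinality / len(df)).round(2)
print(pd.DataFrame({"unique": cardinality, "ratio": ratio}))
```

A ratio near 1.0 (like `order_id` here) marks an identifier, not a category; a low ratio marks a genuinely categorical column.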
If everything checks out, you are ready to move on to visualization. The CSV data visualizer lets you go from profiling to charting in the same session -- profile your data, clean it, then ask for charts, all in one place.
Common Data Quality Issues (and How to Handle Them)
Mixed Data Types in One Column
A column has "42", "N/A", "37.5", and "pending". Pandas will read this as object (string) type. Fix: replace non-numeric values with NaN, then convert to float. In the sandbox: "Convert the Amount column to numeric, replacing non-numeric values with null."
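That fix can be sketched with pd.to_numeric, using the same example values:

```python
import pandas as pd

df = pd.DataFrame({"Amount": ["42", "N/A", "37.5", "pending"]})

# errors="coerce" turns non-numeric entries into NaN instead of raising
df["Amount"] = pd.to_numeric(df["Amount"], errors="coerce")

print(df["Amount"])        # 42.0, NaN, 37.5, NaN
print(df["Amount"].dtype)  # float64
```

Count the resulting NaNs before and after: if the conversion creates far more nulls than expected, the column held more junk than a quick glance suggested.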
Date Format Inconsistency
Same column has "01/15/2026", "2026-01-15", and "Jan 15, 2026". In pandas 2.x, pd.to_datetime(df['date'], format='mixed') can usually handle this (the older infer_datetime_format argument is deprecated), but ambiguous day/month orders fail silently. Profile first to catch them.
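A minimal sketch of mixed-format parsing; the dates are illustrative and format="mixed" requires pandas 2.0 or later:

```python
import pandas as pd

s = pd.Series(["01/15/2026", "2026-01-15", "Jan 15, 2026"])

# format="mixed" infers the format per element (pandas >= 2.0);
# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(s, format="mixed", errors="coerce")
print(parsed)
```

Note the caveat: without dayfirst=True, an ambiguous value like "01/02/2026" parses month-first with no warning, so spot-check a sample of parsed dates against the raw strings.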
Encoding Issues
Names with accents, currency symbols, or non-Latin characters appear as garbled text. This is usually a UTF-8 vs Latin-1 encoding mismatch. Fix: specify encoding on load, or let the sandbox auto-detect it.
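A common fallback pattern is to try UTF-8 first and retry with Latin-1. A minimal sketch using an in-memory byte buffer in place of a real file:

```python
import io
import pandas as pd

# "café" encoded as Latin-1: reading it as UTF-8 fails
raw = "name\ncafé\n".encode("latin-1")

try:
    df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
except UnicodeDecodeError:
    # Latin-1 maps every byte to a character, so this read always succeeds
    df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

print(df["name"].iloc[0])
```

Because Latin-1 never raises, verify the result visually: a successful read with the wrong encoding produces mojibake, not an error.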
Trailing Whitespace
"New York" and "New York " are different values to a computer. This inflates unique value counts and breaks groupby operations. Fix: df.columns = df.columns.str.strip() for headers, df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x) for values.
Implicit Nulls
Values like "0", "-", "None", "N/A", "NA", "null", or empty strings that represent missing data but are not read as null by pandas. Profile by checking value counts for these sentinel values: "Show me rows where Amount is 0 or Amount is 'N/A'."
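Declaring the sentinels at load time is often the simplest fix. A sketch with an in-memory CSV; the sentinel list is illustrative, not exhaustive:

```python
import io
import pandas as pd

csv_text = "Amount\n120\nN/A\n-\n95\n"

# na_values tells pandas to read these strings as real NaN
df = pd.read_csv(io.StringIO(csv_text), na_values=["N/A", "NA", "null", "-"])

print(df["Amount"].isna().sum())  # sentinel rows are now proper nulls
print(df["Amount"].dtype)         # column parses as numeric
```

Empty fields are already treated as NaN by default; na_values extends that treatment to your file's specific sentinel strings, which also lets the column parse as numeric instead of text.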
Profiling Libraries Worth Knowing
If you write Python and want to go beyond df.describe(), these libraries generate comprehensive profiling reports:
- ydata-profiling (formerly pandas-profiling): Generates a full HTML report with distributions, correlations, missing value patterns, and alerts. One line of code after importing ProfileReport: ProfileReport(df).to_file("report.html"). Best for exploratory analysis on medium-sized datasets.
- sweetviz: Generates visual comparison reports. Useful when you want to compare two datasets (e.g., training vs test, this month vs last month). Produces side-by-side HTML reports.
- Great Expectations: More of a data validation framework than a profiler. Define expectations ("this column should never be null", "values should be between 0 and 100") and run them as automated checks. Best for pipelines and recurring data quality monitoring.
All three can be run in the Zedly Python sandbox without local installation. Upload your file and ask: "Run ydata-profiling on this CSV and show me the report."
From Profiling to Visualization
Profiling is not the end goal. It is the quality gate before analysis. Once you know your data is clean (or you have cleaned it), the next step is visualization: charting trends, comparing categories, spotting patterns. If you are evaluating platforms for this full workflow, our guide to online AI systems for data analysis covers what to look for in privacy, repeatability, and export capabilities.
The workflow:
- Upload your CSV to the sandbox
- Profile: "Profile this data" -- check types, nulls, duplicates, outliers
- Clean: "Remove duplicate rows", "Drop rows where Amount is null", "Flag outliers in Revenue using IQR"
- Visualize: "Chart revenue by month", "Show me a scatter plot of price vs quantity", "Create a bar chart of top 10 categories"
All four steps happen in the same session, on the same data. No exporting, no switching tools, no re-uploading. Try it in the CSV data visualizer.
Frequently Asked Questions
What is CSV data profiling?
CSV data profiling is the process of examining a CSV file to understand its structure and quality before analysis or visualization. A profiling report typically includes column data types, row and column counts, missing value percentages, duplicate detection, unique value counts, and basic statistics like mean, median, min, and max for numeric columns.
How do I profile a CSV file online without installing anything?
Upload your CSV to an AI-powered tool like Zedly's CSV data visualizer and ask questions like "profile this data" or "show me missing values by column." The system runs Python (pandas) in a sandboxed environment and returns a full profiling report. No local installation required.
What Python library is best for CSV profiling?
For quick profiling, pandas df.describe() and df.info() cover the basics. For comprehensive reports, ydata-profiling (formerly pandas-profiling) generates full HTML reports with correlations, distributions, and alerts. If you want profiling without writing code, an AI sandbox can run these libraries for you from a plain-English prompt.
How do I find duplicates in a CSV file?
In Python, use df.duplicated().sum() for exact row duplicates, or df.duplicated(subset=['column_name']) to check specific columns. For fuzzy duplicates (similar but not identical rows), you need string similarity libraries like fuzzywuzzy or rapidfuzz.
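rapidfuzz is the better tool for real workloads, but the idea behind fuzzy matching can be sketched with the standard library's difflib. The names and the 0.85 threshold here are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; a stdlib stand-in for rapidfuzz's fuzz.ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["John Smith", "Jon Smith", "Jane Doe"]

# Compare every pair and flag likely duplicates
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        score = similarity(names[i], names[j])
        if score > 0.85:
            print(f"{names[i]!r} ~ {names[j]!r} ({score:.2f})")
```

Pairwise comparison is O(n²), so for large files use rapidfuzz's batch functions or block on a cheap key (first letter, ZIP code) before comparing.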
How do I detect outliers in CSV data?
Common methods include IQR (interquartile range), z-score, and MAD (median absolute deviation). In Python, calculate IQR bounds with pandas quantile functions, or use scipy for z-scores. For a quick check without code, ask an AI sandbox to "flag outliers in the Amount column using IQR."
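The z-score variant needs only pandas; the series and the 2-standard-deviation cutoff below are illustrative:

```python
import pandas as pd

s = pd.Series([100, 102, 98, 101, 99, 500])

# Z-score: how many standard deviations each value sits from the mean
z = (s - s.mean()) / s.std()

# Flag values more than 2 standard deviations out
print(s[z.abs() > 2])
```

Note that extreme outliers inflate the mean and standard deviation themselves, which is why the IQR and MAD methods (built from quantiles and medians) are more robust on heavily skewed data.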