How to Use Python’s Pingouin Library to Level Up Your Exploratory Data Analysis

Stop Doing EDA the Hard Way

If you’ve ever spent three hours writing the same descriptive statistics, correlation matrices, and normality tests you wrote last week — just for a different dataset — you’ll understand the frustration. Exploratory Data Analysis (EDA) is non-negotiable, but the manual grind of it doesn’t have to be. Enter Pingouin, a Python statistics library that quietly does some of the heaviest lifting in your analysis workflow, and does it cleanly.

For data analysts working across South African industries — whether you’re interrogating claims data in insurance, sales trends in retail, or operational metrics in manufacturing — Pingouin adds statistical rigour without the complexity tax. Here’s how to put it to work.

What Is Pingouin and Why Should You Care?

Pingouin (French for penguin) is an open-source statistical library built on top of NumPy, SciPy, and Pandas. What sets it apart isn’t just what it calculates but how it presents results. Where SciPy functions typically return a compact result object holding little more than a test statistic and a p-value, Pingouin returns a tidy Pandas DataFrame for almost every test. That means your results are immediately readable, exportable, and reportable.

Install it in seconds:

  • pip install pingouin

Then import it alongside your usual stack:

  • import pingouin as pg
  • import pandas as pd

From that point, you have access to dozens of statistical tests, effect size calculations, power analyses, and more — all returning structured, human-readable output.

Key Features That Accelerate Your EDA

Normality Testing is often the first gate in any statistical workflow. Pingouin’s pg.normality() function runs the Shapiro-Wilk test by default across one or multiple columns simultaneously and returns a DataFrame with the test statistic, the p-value, and a True/False normal flag evaluated at a significance level of 0.05 unless you change the alpha argument. No more looping through columns manually.

Correlation Analysis gets a significant upgrade too. pg.pairwise_corr() gives you Pearson, Spearman, or Kendall correlations across all variable pairs, complete with confidence intervals, sample sizes, and p-values — all in one table. For analysts reporting to non-technical stakeholders in a boardroom in Sandton or a co-op in the Western Cape, that kind of structured output is gold.

T-tests and ANOVA are handled with equal elegance. pg.ttest() returns not just the t-statistic and p-value, but also Cohen’s d (effect size), confidence intervals, and statistical power. SciPy’s t-test functions do not report effect size or power, yet serious analysts need both. Similarly, pg.anova() and pg.welch_anova() handle between-group comparisons cleanly, with post-hoc tests available via pg.pairwise_tukey(), or via pg.pairwise_gameshowell() when group variances are unequal.

Outlier Detection rounds out the toolkit nicely. pg.madmedianrule() applies the MAD-median rule to flag outliers in a numeric series — a robust alternative to the standard IQR method when your data isn’t normally distributed.

A Practical EDA Workflow with Pingouin

Here’s a simplified sequence that works well in real-world projects:

  • Step 1 (Normality check): Run pg.normality(df) across your numeric columns before deciding which tests to use downstream.
  • Step 2 (Correlation sweep): Use pg.pairwise_corr(df, method='spearman') if normality fails, or Pearson if it holds. Filter results to surface only significant relationships, and consider the padjust argument to correct the p-values when you are testing many pairs at once.
  • Step 3 (Group comparisons): If you’re comparing segments, say customer spend by region or defect rates by production shift, use pg.anova() with a follow-up Tukey test to identify which specific groups differ.
  • Step 4 (Effect size review): Don’t just chase p-values. Cohen’s d from pg.ttest() and partial eta-squared from pg.anova() (the np2 column) tell you whether a statistically significant finding is actually meaningful in business terms.
  • Step 5 (Document and export): Because Pingouin outputs DataFrames, you can concatenate results and export directly to Excel or feed them into a reporting pipeline.

As a rough estimate, this workflow can cut EDA time by around 30–50% compared to assembling the same outputs from SciPy and manual formatting; the actual saving depends on the dataset and the reporting requirements.

The Bottom Line for South African Analysts

In an environment where data teams are often lean and turnaround times are tight, tools like Pingouin matter. You’re not sacrificing statistical depth — you’re removing the administrative overhead that slows you down between raw data and actionable insight. Whether you’re a solo analyst, part of a corporate BI team, or a student building your Python toolkit, Pingouin earns a permanent place in your stack.

The best analysis isn’t the most complex one — it’s the one that gets done accurately, communicated clearly, and acted on quickly.

If you want to sharpen your Python analytics skills or need a consulting partner to help turn your data into decisions, reach out to the team at oCode360. We work with businesses across South Africa to build analytics capability that sticks.

oCode360 (t/a JVW Business Solutions (Pty) Ltd) — Making data make sense.
