New Dataset Onboarding
Use exploratory studies when you receive raw data with no predefined business question. You're setting the stage for everything that follows.
EDA Toolbox
Summary statistics, distribution plots, missing value analysis, correlation heatmaps, pairplots, outlier detection using IQR or Z-score.
Insights Report
A documented list of interesting findings, data quality issues, and suggested next steps — the foundation for deeper studies.
Scenario: You receive a CSV file from a retail company with 200,000 rows of transaction data covering 3 years. No business question is given. Your job: figure out what's interesting.
You start by loading the data and immediately look at the shape, dtypes, missing values, and basic statistics. You then visualize sales distribution — it's right-skewed with a few massive transactions. You notice a monthly seasonal pattern and a dip in February every year. You detect 3 product categories that account for 70% of revenue. These findings drive all future analyses.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ── Step 1: Load & Inspect ──────────────────────
df = pd.read_csv('transactions.csv', parse_dates=['date'])
print(df.shape)        # rows × columns
df.info()              # dtypes and non-null counts (prints directly)
print(df.describe())   # summary statistics for numeric columns

# ── Step 2: Missing Values ──────────────────────
missing = df.isnull().sum() / len(df) * 100   # percent missing per column
print(missing[missing > 0].sort_values(ascending=False))

# ── Step 3: Distribution of Sales ───────────────
plt.figure(figsize=(12, 4))
sns.histplot(df['sales_amount'], bins=60, kde=True, color='#f0a500')
plt.title('Sales Distribution')
plt.show()

# ── Step 4: Correlation Heatmap ─────────────────
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

# ── Step 5: Seasonal Pattern ────────────────────
df['month'] = df['date'].dt.month_name()
monthly = df.groupby('month')['sales_amount'].sum()
print(monthly.sort_values(ascending=False))
```
A bar chart or histogram of a key metric is a good first stop in any exploratory study.
Exploratory EDA Checklist — run through this for every new dataset:
- Check shape: rows × columns, and data types
- Identify and quantify missing values per column
- Summarize numerical columns with .describe()
- Visualize distributions (histograms, boxplots) for every numeric feature
- Count unique values in categorical columns
- Detect outliers using IQR or Z-score method
- Plot correlation heatmap for numeric features
- Check for duplicate rows
- Parse and explore any date/time columns for trends
- Document surprising findings for stakeholder review
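The outlier step in the checklist can be sketched with the standard library alone. This is a minimal illustration, assuming a small hypothetical list of sale amounts; the function names, the 1.5 × IQR fence, and the threshold of 3 standard deviations are conventional choices rather than anything prescribed here.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical sale amounts with one very large transaction
sales = [20, 22, 25, 27, 30, 31, 33, 35, 38, 40, 5000]

print("IQR method:    ", iqr_outliers(sales))
print("Z-score method:", zscore_outliers(sales))
```

The IQR fence is based on quartiles, so it is less affected by the extreme value itself. The z-score uses the mean and standard deviation, both of which a large outlier inflates, so it can be less sensitive on small or heavily skewed samples.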
KPI Dashboards
Summaries of business metrics like revenue, churn rate, engagement, or patient visits — presented in a clear, scannable format.
Aggregation & Grouping
GroupBy operations, pivot tables, rolling windows, percentile rankings, and frequency distributions presented visually.
Dashboards & Reports
Power BI, Tableau, or Plotly Dash dashboards with interactive filters. Ideal for stakeholder consumption.
Scenario: A healthcare company wants a descriptive report on patient admissions for the last 12 months. No hypothesis — just the facts.
You compute: total admissions by month, top 5 diagnoses, average length of stay by department, readmission rate by region, and gender-age breakdown. These are presented in a clean Power BI dashboard with filters for department and date range. The CMO uses this every Monday morning to track performance.
```python
import pandas as pd
import plotly.express as px

df = pd.read_csv('patient_admissions.csv', parse_dates=['admission_date'])

# ── Monthly admissions ──────────────────────────
df['month'] = df['admission_date'].dt.to_period('M')
monthly = df.groupby('month').agg(
    total_admissions=('patient_id', 'count'),
    avg_stay_days=('length_of_stay', 'mean'),
    readmissions=('is_readmission', 'sum'),
).reset_index()
# Plotly cannot serialize pandas Period values, so convert to timestamps
monthly['month'] = monthly['month'].dt.to_timestamp()

# ── Top 5 diagnoses ─────────────────────────────
top_dx = df['diagnosis'].value_counts().head(5).reset_index()
top_dx.columns = ['diagnosis', 'count']

# ── Readmission rate by region ──────────────────
# Assumes is_readmission is coded 0/1, so its mean is the rate
region_stats = df.groupby('region').agg(
    total=('patient_id', 'count'),
    readmit_rate=('is_readmission', 'mean'),
).sort_values('readmit_rate', ascending=False)

# ── Interactive chart with Plotly ───────────────
fig = px.bar(monthly, x='month', y='total_admissions',
             title='Monthly Patient Admissions',
             color='total_admissions',
             color_continuous_scale='teal')
fig.show()
```
Heatmaps are ideal for descriptive multi-dimensional summaries.
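The scenario also calls for a gender-age breakdown and average length of stay by department. One way to sketch those two tables with pandas is shown below. The tiny DataFrame, its column names, and the age bands are all hypothetical stand-ins for the real admissions extract.

```python
import pandas as pd

# Hypothetical admissions extract; column names are assumptions
admissions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age": [34, 71, 52, 68, 67, 45],
    "department": ["Cardiology", "Cardiology", "Oncology", "ER", "Oncology", "ER"],
    "length_of_stay": [3, 7, 5, 1, 9, 2],
})

# Bucket ages into bands (right-inclusive bins: 0-17, 18-39, 40-64, 65+)
admissions["age_band"] = pd.cut(
    admissions["age"],
    bins=[0, 17, 39, 64, 120],
    labels=["0-17", "18-39", "40-64", "65+"],
)

# Pivot table: admissions count by age band (rows) and gender (columns)
breakdown = pd.pivot_table(
    admissions,
    index="age_band",
    columns="gender",
    values="patient_id",
    aggfunc="count",
    fill_value=0,
    observed=True,   # keep only age bands that actually occur
)

# Average length of stay by department
avg_stay = admissions.groupby("department")["length_of_stay"].mean()

print(breakdown)
print(avg_stay)
```

Tables like these can be exported to a BI tool or plotted directly; the pivot table is already in the shape a heatmap expects.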
Hypothesis Testing
t-tests, chi-square, ANOVA, Mann-Whitney U test. Determine if observed differences are statistically significant.
Regression & A/B Tests
Linear and logistic regression to measure effect sizes. A/B testing to isolate variable impact in controlled experiments.
Correlation ≠ Causation
Always consider confounding variables. Causal inference methods such as difference-in-differences (DiD) or instrumental variables (IV) can support causal claims when their assumptions hold.
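As a rough illustration of the DiD idea, the estimate is the change in the treated group minus the change in a comparison group over the same period. The sketch below uses hypothetical weekly conversion rates and a hypothetical helper named `did_estimate`; it assumes both groups would have followed parallel trends without the intervention, which is the key assumption behind the method.

```python
from statistics import fmean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change

# Hypothetical weekly conversion rates (%) before and after a change
treated_pre = [9.8, 10.1, 10.0, 10.1]
treated_post = [12.4, 12.6, 12.5, 12.5]
control_pre = [8.9, 9.1, 9.0, 9.0]
control_post = [9.9, 10.1, 10.0, 10.0]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated effect: {effect:.2f} percentage points")
```

The treated group rose by 2.5 points, but the control group also rose by 1.0 point, so only the remaining 1.5 points are attributed to the intervention under the parallel-trends assumption.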
Scenario: An e-commerce company ran a new checkout page design for 2 weeks. Conversion rates seem higher — but is it real or random noise?
You analyze the A/B test results: control group (old page, n=15,000) vs treatment group (new page, n=14,800). Using a two-proportion z-test, you find an absolute conversion lift of +1.8 percentage points with p=0.003, which is statistically significant at the 5% level. You also run logistic regression, with segment features alongside the treatment flag, to understand which customer segments benefited most, and find that mobile users drove the entire uplift.
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# ── Sample Data ─────────────────────────────────
# Illustrative counts only; they are not the scenario's exact figures
control_n, control_conv = 15000, 1350   # 9.0%
treat_n, treat_conv = 14800, 1628       # 11.0%

# ── Conversion Rates ────────────────────────────
p_control = control_conv / control_n
p_treat = treat_conv / treat_n
print(f"Control: {p_control:.1%} | Treatment: {p_treat:.1%}")
print(f"Absolute lift: {p_treat - p_control:.1%}")

# ── Two-proportion z-test ───────────────────────
count = np.array([treat_conv, control_conv])
nobs = np.array([treat_n, control_n])
z_stat, p_val = proportions_ztest(count, nobs)
print(f"z={z_stat:.3f}, p={p_val:.4f}")

if p_val < 0.05:
    print("✓ Statistically significant: roll out the new design!")
else:
    print("✗ Not significant: more data needed.")
```
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumes df has one row per session and contains the columns below

# ── Feature Matrix ──────────────────────────────
features = ['is_treatment', 'is_mobile', 'page_load_time',
            'items_in_cart', 'is_returning_user']
X = df[features]
y = df['converted']

# ── Scale & Fit ─────────────────────────────────
# Scaling puts coefficients on a comparable per-standard-deviation basis
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# ── Visualize Coefficients ──────────────────────
coef_df = pd.DataFrame({
    'feature': features,
    'coef': model.coef_[0],
}).sort_values('coef')

coef_df.plot(kind='barh', x='feature', y='coef',
             color='#ef476f', legend=False)
plt.title('What Drives Conversion? (Logistic Coeff.)')
plt.show()
```
⚠️ Common Pitfalls in Explanatory Studies:
- Confounding variables — Two things correlate only because a third variable drives both. Always ask "what else could explain this?"
- Multiple comparisons problem — Running 20 t-tests at α = 0.05 will, on average, produce about 1 false positive even when no real effect exists. Use Bonferroni or FDR correction.
- Underpowered tests — Too small a sample means you'll miss real effects. Always calculate required sample size before running an A/B test.
- p-hacking — Don't keep adding data until p < 0.05. Pre-register your hypothesis and sample size.
- Survivorship bias — Only analyzing the data that "survived" a filter skews all conclusions.
- Reversing causality — E.g., "hospitals make people sick" (people go to hospitals because they're sick).
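Two of the pitfalls above lend themselves to quick arithmetic. The sketch below, using only the standard library, shows how fast the chance of a false positive grows with the number of tests, the Bonferroni-adjusted threshold, and an approximate per-group sample size for a two-proportion test. The function names are hypothetical, the error-rate formula assumes independent tests, and the sample-size formula is the common normal-approximation version rather than an exact power calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests, alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests

def sample_size_per_group(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treat) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    ) ** 2
    return ceil(numerator / (p_control - p_treat) ** 2)

print(f"Chance of >=1 false positive in 20 tests: {familywise_error_rate(20):.1%}")
print(f"Bonferroni threshold for 20 tests: {bonferroni_alpha(20):.4f}")
print(f"Users per group to detect 9% -> 11%: {sample_size_per_group(0.09, 0.11):,}")
```

Running the sample-size calculation before the experiment fixes the stopping point in advance, which also guards against the p-hacking pitfall of adding data until the result looks significant.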
Forecasting Models
ARIMA, Prophet, LSTM for sequential data. Ideal for sales forecasting, stock prices, demand planning, and web traffic.
Churn & Risk Prediction
Random Forest, XGBoost, Logistic Regression. Predict binary or multi-class outcomes like churn, fraud, or default.
Metrics & Validation
RMSE, MAE for regression. Accuracy, AUC-ROC, F1 for classification. Always validate with held-out test data.
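To make these metrics concrete, the sketch below computes MAE, RMSE, precision, recall, and F1 by hand on small hypothetical held-out predictions. In practice a library such as scikit-learn would normally be used; the hand-written versions are only meant to show what each number measures.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics for labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical held-out regression targets and predictions
sales_true = [100, 150, 200, 250]
sales_pred = [110, 140, 210, 230]
print(f"MAE={mae(sales_true, sales_pred):.1f}, RMSE={rmse(sales_true, sales_pred):.2f}")

# Hypothetical held-out churn labels and predictions
churn_true = [1, 0, 1, 1, 0, 0, 1, 0]
churn_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(precision_recall_f1(churn_true, churn_pred))
```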
Scenario: A logistics company wants to predict delivery delays before they happen, so dispatchers can intervene proactively.
You engineer features: distance, carrier, day of week, weather severity score, historical delay rate for that route. You train an XGBoost classifier. It achieves AUC-ROC = 0.87. You visualize feature importance — route distance and weather score top the list. The model is deployed and saves $2M annually in customer compensation.
```python
import pandas as pd
import matplotlib.pyplot as plt
from prophet import Prophet
from sklearn.metrics import mean_absolute_error

# ── Prepare data in Prophet format ──────────────
# Prophet requires 'ds' (date) and 'y' (value).
# Assumes df is a daily series with 'date' and 'sales' columns.
df_prophet = (df.rename(columns={'date': 'ds', 'sales': 'y'})
                .sort_values('ds'))

# ── Hold out the last 90 days as a test set ─────
train, test = df_prophet.iloc[:-90], df_prophet.iloc[-90:]

# ── Build & Fit Model ───────────────────────────
model = Prophet(
    seasonality_mode='multiplicative',
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.05,
)
model.fit(train)

# ── Forecast the next 90 days ───────────────────
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)

# ── Evaluate on the held-out test set ───────────
y_pred = model.predict(test[['ds']])['yhat']
mae = mean_absolute_error(test['y'], y_pred)
print(f"Mean Absolute Error: {mae:,.0f}")

# ── Plot results ────────────────────────────────
fig = model.plot(forecast)
plt.show()
```
In a forecast plot, the shaded region represents the uncertainty interval around the prediction. Most forecasting visualizations include one.
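For the delivery-delay scenario, one of the engineered features is the historical delay rate for a route. A possible pandas sketch is shown below, assuming a hypothetical shipments table with `ship_date`, `route`, and a 0/1 `delayed` column. The important detail is that each row only uses earlier shipments, so the feature does not leak the outcome it is meant to predict.

```python
import pandas as pd

# Hypothetical shipment log; column names are assumptions
shipments = pd.DataFrame({
    "ship_date": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05",
    ]),
    "route": ["A", "B", "A", "A", "B"],
    "delayed": [1, 0, 0, 1, 1],
}).sort_values("ship_date").reset_index(drop=True)

# Calendar feature: Monday=0 ... Sunday=6
shipments["day_of_week"] = shipments["ship_date"].dt.dayofweek

# Delay rate over *previous* shipments on the same route.
# shift(1) excludes the current row; expanding().mean() averages all earlier ones.
shipments["route_hist_delay_rate"] = (
    shipments.groupby("route")["delayed"]
    .transform(lambda s: s.shift(1).expanding().mean())
)

print(shipments)
```

The first shipment on each route has no history, so its rate is missing. Tree-based models such as XGBoost can handle missing values directly; other models would need an imputation rule.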
Budget & Resource Allocation
Linear programming, portfolio optimization, and multi-objective optimization to find the best allocation of limited resources.
Scenario Modeling
Monte Carlo simulations to model uncertainty. "What if we increased marketing spend by 20%?" — run 10,000 scenarios and show the distribution.
Decision Support Dashboards
Interactive dashboards where decision-makers can tweak inputs (budget, headcount, pricing) and see how projected outcomes shift in real time.
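A Monte Carlo "what if" can be sketched in a few lines. The example below simulates 10,000 scenarios for a 20% increase in marketing spend, under the purely hypothetical assumption that revenue equals spend times a normally distributed return per dollar. The spend level and the distribution parameters are made up for illustration, and the model ignores diminishing returns.

```python
import random
import statistics

rng = random.Random(42)           # fixed seed so the run is repeatable

CURRENT_SPEND = 100_000           # hypothetical current marketing spend ($)
N_SCENARIOS = 10_000

def simulate_revenue(spend):
    """One scenario: revenue = spend x an uncertain return per dollar."""
    return_per_dollar = max(rng.gauss(3.0, 0.8), 0.0)   # hypothetical distribution
    return spend * return_per_dollar

# "What if we increased marketing spend by 20%?"
increased = [simulate_revenue(CURRENT_SPEND * 1.20) for _ in range(N_SCENARIOS)]

# 19 cut points at 5% steps: index 0 is the 5th percentile, 9 the median, 18 the 95th
cuts = statistics.quantiles(increased, n=20)
p5, median, p95 = cuts[0], cuts[9], cuts[18]

print(f"5th percentile:  ${p5:,.0f}")
print(f"Median:          ${median:,.0f}")
print(f"95th percentile: ${p95:,.0f}")
```

Reporting the spread between the 5th and 95th percentiles, rather than a single expected value, is what lets decision-makers see the downside risk of the scenario.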
Scenario: A marketing team has $500K to allocate across 5 channels (email, paid search, social, display, influencer). Past data tells you the ROI per dollar for each channel. What's the optimal split?
You model the diminishing returns curve for each channel (ROI declines as spend increases past a threshold), then use scipy.optimize to find the allocation that maximizes total conversions subject to the budget constraint. You present the result as a Sankey diagram showing the recommended flow of budget. The recommendation increases projected conversions by 23% vs. the current equal-split approach.
```python
import numpy as np
from scipy.optimize import minimize

# ── Channel configuration ───────────────────────
channels = ['email', 'paid_search', 'social', 'display', 'influencer']
total_budget = 500_000

# ── Diminishing returns model ───────────────────
# Returns = saturation_level * (1 - exp(-spend / scale))
saturation = [50000, 80000, 60000, 40000, 70000]
scale = [80000, 120000, 100000, 60000, 150000]

def channel_returns(spend, sat, sc):
    return sat * (1 - np.exp(-spend / sc))

# ── Objective: maximize total returns ───────────
# minimize() minimizes, so return the negative of total returns
def objective(x):
    return -sum(channel_returns(s, sat, sc)
                for s, sat, sc in zip(x, saturation, scale))

constraints = {'type': 'eq', 'fun': lambda x: sum(x) - total_budget}
bounds = [(10_000, 300_000)] * len(channels)   # min and max spend per channel
x0 = [total_budget / len(channels)] * len(channels)   # start from an equal split

result = minimize(objective, x0, method='SLSQP',
                  bounds=bounds, constraints=constraints)
if not result.success:
    print(f"Optimizer did not converge: {result.message}")

for ch, spend in zip(channels, result.x):
    print(f"{ch:15}: ${spend:,.0f}")
```
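A useful sanity check on the optimizer is a simple greedy allocation: start every channel at its minimum, then repeatedly give the next $1,000 to whichever channel gains the most from it. Because each channel's return curve is concave, this should land close to the same answer. The sketch below reuses the saturation and scale values from the code above; the $1,000 step size and the greedy approach itself are illustrative choices, not the method described in the scenario.

```python
import math

channels = ['email', 'paid_search', 'social', 'display', 'influencer']
saturation = [50000, 80000, 60000, 40000, 70000]
scale = [80000, 120000, 100000, 60000, 150000]
TOTAL_BUDGET, MIN_SPEND, MAX_SPEND, STEP = 500_000, 10_000, 300_000, 1_000

def channel_returns(spend, sat, sc):
    """Diminishing-returns curve: sat * (1 - exp(-spend / sc))."""
    return sat * (1 - math.exp(-spend / sc))

def total_returns(alloc):
    return sum(channel_returns(s, sat, sc)
               for s, sat, sc in zip(alloc, saturation, scale))

# Start at the minimum spend, then hand out the rest one STEP at a time
alloc = [MIN_SPEND] * len(channels)
remaining = TOTAL_BUDGET - sum(alloc)
while remaining >= STEP:
    # Marginal gain of one more STEP per channel (ignored once at the cap)
    gains = [
        channel_returns(a + STEP, sat, sc) - channel_returns(a, sat, sc)
        if a + STEP <= MAX_SPEND else float('-inf')
        for a, sat, sc in zip(alloc, saturation, scale)
    ]
    best = gains.index(max(gains))
    alloc[best] += STEP
    remaining -= STEP

equal_split = [TOTAL_BUDGET / len(channels)] * len(channels)
for ch, spend in zip(channels, alloc):
    print(f"{ch:15}: ${spend:,.0f}")
print(f"Greedy total returns:      {total_returns(alloc):,.0f}")
print(f"Equal-split total returns: {total_returns(equal_split):,.0f}")
```

If the greedy allocation and the SLSQP result disagree by more than a step or two per channel, that is a hint to check the optimizer's convergence message or starting point.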
User Behavior by Segment
Compare retention, LTV, or engagement across user cohorts acquired in different months or via different channels.
Performance vs. Industry
Compare your metrics against competitors or industry averages. Identify gaps and opportunities through variance analysis.
Significance of Differences
Don't just say "Region A is 15% higher." Use ANOVA or t-tests to verify the difference is real, not sampling variance.
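As one way to check whether a gap between two regions is more than sampling noise, the sketch below runs a permutation test using only the standard library. This is a distribution-free alternative to the t-test named above, swapped in here so the example needs no extra packages. The weekly sales figures and the function name are hypothetical.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                       # randomly reassign group labels
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical weekly sales (in $K) for two regions
region_a = [115, 120, 118, 125, 122, 119, 121, 117]
region_b = [100, 104, 102, 99, 105, 101, 103, 98]

p_value = permutation_test(region_a, region_b)
print(f"Mean A={fmean(region_a):.1f}, Mean B={fmean(region_b):.1f}, p={p_value:.4f}")
```

The test asks how often a random relabeling of the data produces a gap at least as large as the observed one. A small p-value means the observed gap would be unusual if the regions were interchangeable.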
Scenario: A SaaS company wants to compare user retention between 2023 and 2024 cohorts to see if recent product changes improved stickiness.
You build a cohort retention matrix: rows = signup month, columns = months since signup. Each cell shows the % of users still active. The 2024 cohort shows 8-point higher retention at month 3. You visualize this as a heatmap for 2023 vs 2024 side-by-side, and run a chi-square test confirming the difference is statistically significant at p < 0.01.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes df holds one row per user activity event,
# with a 'user_id' column and a datetime 'date' column.

# ── Create Cohort Retention Matrix ──────────────
df['cohort'] = df.groupby('user_id')['date'].transform('min').dt.to_period('M')
df['period'] = df['date'].dt.to_period('M')
df['cohort_index'] = (df['period'] - df['cohort']).apply(lambda x: x.n)

cohort_data = (df.groupby(['cohort', 'cohort_index'])['user_id']
                 .nunique()
                 .reset_index())
cohort_pivot = cohort_data.pivot(index='cohort',
                                 columns='cohort_index',
                                 values='user_id')

# ── Convert to retention rates ──────────────────
# Column 0 is each cohort's size in its first month
retention = cohort_pivot.divide(cohort_pivot.iloc[:, 0], axis=0) * 100

# ── Heatmap ─────────────────────────────────────
plt.figure(figsize=(14, 7))
sns.heatmap(retention, annot=True, fmt='.0f',
            cmap='YlOrRd', vmin=0, vmax=100)
plt.title('Monthly Cohort Retention Rates (%)')
plt.show()
```
Grouped bar charts are the go-to for comparative studies across categories.
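The chi-square test mentioned in the scenario can be worked by hand for a 2×2 table of retained vs churned users per cohort. The counts below are hypothetical and chosen only to mirror an 8-point retention gap. For a 2×2 table the test has one degree of freedom, so the p-value can be computed from the complementary error function without any statistics library.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical month-3 counts per cohort: (retained, churned)
retained_2023, churned_2023 = 300, 700   # 30% retained
retained_2024, churned_2024 = 380, 620   # 38% retained

chi2, p_value = chi_square_2x2(retained_2023, churned_2023,
                               retained_2024, churned_2024)
print(f"chi2={chi2:.2f}, p={p_value:.5f}")
```

With real data, `scipy.stats.chi2_contingency` performs the same test and, by default, also applies a continuity correction for 2×2 tables.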
Decomposition
Separate a time series into trend, seasonality, and residual components using STL decomposition or classical additive/multiplicative models.
Line & Area Charts
Line charts with rolling averages smooth noise. Area charts show volume. Candlestick charts for financial data. Annotate key events on the timeline.
Rolling Windows & Lags
7-day, 30-day rolling averages reveal underlying trends. Lag features (value N days ago) are powerful for prediction.
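Rolling averages and lag features are both one-liners in pandas. The sketch below uses a hypothetical 10-day series of daily site visits; the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical daily series
dates = pd.date_range("2024-01-01", periods=10, freq="D")
daily = pd.DataFrame(
    {"visits": [100, 120, 90, 110, 130, 95, 105, 115, 125, 140]},
    index=dates,
)

# 7-day rolling average: undefined until 7 observations are available
daily["ma7"] = daily["visits"].rolling(window=7).mean()

# Lag features: the value 1 day ago and 7 days ago
daily["lag_1"] = daily["visits"].shift(1)
daily["lag_7"] = daily["visits"].shift(7)

print(daily)
```

Both operations leave missing values at the start of the series, so rows without enough history are usually dropped before the features are fed to a model.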
Scenario: A public health agency tracks weekly COVID-19 case counts across 5 years, overlaid with vaccination rollout dates and policy changes.
You apply STL decomposition to separate the seasonal flu pattern from true COVID waves. You annotate the chart with vaccine rollout events and lockdown policies. The analysis shows that vaccination rollout was associated with a 60% reduction in hospitalization peaks in subsequent waves, even as case counts remained elevated. A longitudinal view makes these long-term dynamics visible in a way a single snapshot cannot.
```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import STL

# ── Load time series ────────────────────────────
df = pd.read_csv('weekly_cases.csv', parse_dates=['date'], index_col='date')

# ── Rolling averages ────────────────────────────
# The data is weekly, so windows are counted in weeks, not days
df['ma4w'] = df['cases'].rolling(4).mean()     # roughly one month
df['ma13w'] = df['cases'].rolling(13).mean()   # roughly one quarter

# ── STL Decomposition ───────────────────────────
# period=52 assumes a yearly cycle in weekly data;
# seasonal=13 is the length of the seasonal smoother (must be odd)
stl = STL(df['cases'], period=52, seasonal=13)
result = stl.fit()

fig, axes = plt.subplots(4, 1, figsize=(14, 10), sharex=True)
result.observed.plot(ax=axes[0], title='Observed')
result.trend.plot(ax=axes[1], title='Trend')
result.seasonal.plot(ax=axes[2], title='Seasonality')
result.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.show()
```
Step 1 — Data Collection & Cleaning
Collect the full historical series. Handle missing dates by forward-filling or interpolation. Ensure consistent frequency (daily/weekly/monthly).
Step 2 — Visualize the Raw Series
Plot the full time series first. Look for obvious trends, seasonality, structural breaks, or sudden level shifts.
Step 3 — Smooth with Rolling Averages
Apply 7-day and 30-day rolling means to distinguish signal from noise. Choose window size based on your data frequency.
Step 4 — Decompose Trend & Seasonality
Use STL or classical decomposition to isolate the underlying trend from seasonal patterns. This reveals what the "true" trajectory looks like.
Step 5 — Annotate Key Events
Mark policy changes, product launches, crises, or interventions on the chart. Longitudinal value comes from understanding WHY the series moved.
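Step 1 above is often where longitudinal analyses go wrong, so here is a minimal pandas sketch of it. It takes a hypothetical series with two missing days, puts it on a regular daily frequency, and fills the gaps both ways. Which fill is appropriate depends on the data: forward-fill suits values that persist, such as a price, while interpolation suits quantities that change smoothly.

```python
import pandas as pd

# Hypothetical series with Jan 3 and Jan 4 missing
raw = pd.Series(
    [10.0, 12.0, 18.0, 20.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05", "2024-01-06"]),
    name="cases",
)

# Enforce a consistent daily frequency; missing dates appear as NaN
regular = raw.asfreq("D")

# Option A: carry the last known value forward
filled_ffill = regular.ffill()

# Option B: interpolate linearly in time between known values
filled_interp = regular.interpolate(method="time")

print(pd.DataFrame({"raw": regular, "ffill": filled_ffill, "interp": filled_interp}))
```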
📈 Portfolio Risk Analysis
Correlation matrices across assets, VaR calculations, Sharpe ratio analysis, sector exposure heat maps, and Monte Carlo portfolio simulations.
🏥 Patient Outcomes
Survival analysis (Kaplan-Meier), readmission prediction, drug efficacy comparisons, SDOH (social determinants) mapping on choropleth maps.
⚽ Performance Metrics
Shot maps, player heatmaps, xG (expected goals) models, clustering players by playing style, and win probability models using real-time data.
🛒 Customer Analytics
RFM segmentation (Recency, Frequency, Monetary), funnel analysis, basket analysis with association rules, and LTV (Lifetime Value) modeling.
🗺️ Geospatial Analysis
Traffic congestion maps (Folium/Kepler.gl), crime cluster detection, accessibility scoring, demographic overlay on census geographies.
📱 Sentiment & Trends
Sentiment analysis (VADER, BERT), viral content pattern detection, network graph analysis of influencer connections, and topic modeling (LDA).
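As one concrete example from the list above, an RFM table for customer analytics takes only a few lines of pandas. The order data, column names, and analysis date below are hypothetical.

```python
import pandas as pd

# Hypothetical order history
orders = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-01", "2023-11-20",
        "2024-02-10", "2024-02-25", "2024-03-10",
    ]),
    "amount": [50.0, 70.0, 200.0, 20.0, 30.0, 25.0],
})
snapshot = pd.Timestamp("2024-03-31")   # hypothetical analysis date

rfm = orders.groupby("customer_id").agg(
    last_order=("order_date", "max"),
    frequency=("order_date", "count"),   # number of orders
    monetary=("amount", "sum"),          # total spend
)
# Recency: days since the customer's most recent order
rfm["recency_days"] = (snapshot - rfm["last_order"]).dt.days

print(rfm[["recency_days", "frequency", "monetary"]])
```

From here, each of the three columns is typically binned into scores (for example, quintiles) and combined into segments such as "recent high spenders" or "lapsed customers".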
Scenario (Finance): Build a portfolio risk dashboard for a multi-asset portfolio of 10 stocks, showing correlation, volatility, and risk-adjusted returns.
You pull 5 years of daily prices via yfinance. You compute rolling 30-day volatility, build the correlation matrix (visualized as a heatmap), calculate Sharpe ratios, and run a Monte Carlo simulation of 10,000 future portfolio paths. The dashboard shows the 5th percentile "worst case" outcome and optimal portfolio weights using Modern Portfolio Theory.
```python
import yfinance as yf
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# ── Download price data ─────────────────────────
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'JPM',
           'GS', 'JNJ', 'PFE', 'GLD', 'BND']
# auto_adjust=True returns split- and dividend-adjusted prices in 'Close'
prices = yf.download(tickers, start='2019-01-01', auto_adjust=True)['Close']

# ── Daily returns & correlation ─────────────────
returns = prices.pct_change().dropna()
corr = returns.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdYlGn', vmin=-1, vmax=1)
plt.title('Asset Correlation Matrix')
plt.show()

# ── Sharpe Ratio ────────────────────────────────
risk_free_rate = 0.05                       # assumed annual risk-free rate
mean_ret = returns.mean() * 252             # annualized mean return
std_ret = returns.std() * np.sqrt(252)      # annualized volatility
sharpe = (mean_ret - risk_free_rate) / std_ret
print(sharpe.sort_values(ascending=False))

# ── Monte Carlo Portfolio Simulation ────────────
n_sims, n_days = 10_000, 252
weights = np.ones(len(tickers)) / len(tickers)   # equal weights
cov_matrix = returns.cov() * 252                 # annualized covariance
# ... simulate 10,000 paths using Cholesky decomposition
```
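The Monte Carlo step is left as a comment in the code above. A minimal sketch of one way to do it with a Cholesky factor is shown below. To keep the example self-contained it uses three hypothetical assets with made-up annualized returns, volatilities, and correlations instead of downloaded prices, and it assumes daily returns are normally distributed, which real markets only approximate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical annualized inputs for 3 assets (not estimated from real prices)
mean_ret = np.array([0.08, 0.06, 0.03])
vols = np.array([0.20, 0.15, 0.05])
corr = np.array([[1.0, 0.6, 0.1],
                 [0.6, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
cov_annual = np.outer(vols, vols) * corr
weights = np.ones(3) / 3                      # equal-weight portfolio

n_sims, n_days = 10_000, 252
daily_mean = mean_ret / 252
daily_cov = cov_annual / 252

# Cholesky factor L satisfies L @ L.T == daily_cov
chol = np.linalg.cholesky(daily_cov)

# Independent standard normals -> correlated daily asset returns
z = rng.standard_normal((n_sims, n_days, 3))
daily_returns = daily_mean + z @ chol.T

# Portfolio return per day, then compound over the year (growth of $1)
portfolio_daily = daily_returns @ weights            # shape (n_sims, n_days)
final_value = np.prod(1 + portfolio_daily, axis=1)   # shape (n_sims,)

p5 = np.percentile(final_value, 5)
print(f"Median outcome:       {np.median(final_value):.3f}")
print(f"5th percentile (bad): {p5:.3f}")
```

Multiplying independent normal draws by the Cholesky factor is what gives the simulated assets the intended correlations. The 5th percentile of `final_value` corresponds to the "worst case" figure described in the scenario.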
Free Datasets to Practice Industry Case Studies:
- Finance: Yahoo Finance (yfinance), Nasdaq Data Link (formerly Quandl), FRED (Federal Reserve Economic Data), SEC EDGAR
- Healthcare: CMS.gov, MIMIC-III (physionet.org), CDC WONDER, WHO Global Health Data
- Sports: StatsBomb Open Data, FBref.com, Basketball-Reference, Kaggle sports datasets
- E-commerce: UCI Online Retail dataset, Instacart dataset (Kaggle), Brazilian e-commerce (Kaggle)
- Geospatial: OpenStreetMap, US Census TIGER files, NYC Open Data, World Bank Geospatial Data
- Social Media: Twitter/X API (free academic access ended in 2023), Reddit Pushshift (access restricted since 2023), Yelp Open Dataset, Google Trends
- General Purpose: Kaggle.com, UCI ML Repository, data.gov, Our World in Data, Google Dataset Search
Knowledge Check
// Test your understanding — 5 questions
A data analyst receives a brand new dataset with no business question attached. What is the best first type of case study to apply?
An e-commerce company wants to know if a new checkout page design caused a statistically significant increase in conversions. Which case study type is most appropriate?
Which visualization is most appropriate for showing a user retention cohort matrix where rows are signup months and columns are months-since-join?
A marketing team has a $500K budget and wants to know the optimal split across 5 channels to maximize conversions, considering diminishing returns. Which case study type applies?
Which Python library is specifically designed for time series forecasting with support for seasonality, holidays, and trend changepoints?
Complete Summary Table
| Type | Question Answered | Key Methods | Best Visualization | Difficulty |
|---|---|---|---|---|
| 🔍 Exploratory | What's interesting in this data? | EDA, distributions, correlation | Histograms, heatmaps, pairplots | Beginner |
| 📊 Descriptive | What happened? | Aggregation, KPIs, pivot tables | Dashboards, bar charts, tables | Beginner |
| ⚗️ Explanatory | Why did it happen? | A/B tests, regression, ANOVA | Coeff. plots, funnel charts | Intermediate |
| 🔮 Predictive | What will happen? | ML models, time series, cross-validation | Forecast lines, ROC curves, feature importance | Intermediate |
| 🎯 Prescriptive | What should we do? | Optimization, Monte Carlo, simulation | Sankey diagrams, scenario sliders | Advanced |
| ⚖️ Comparative | How does X compare to Y? | Cohort analysis, benchmarking, t-tests | Grouped bars, side-by-side heatmaps | Beginner |
| 📈 Longitudinal | How did it change over time? | Time series, decomposition, rolling avg | Line charts, area charts, annotated timelines | Intermediate |
| 🏭 Industry-Specific | Domain-specific insights | Domain methods + any of the above | Domain-specific charts (shotmaps, choropleths) | Advanced |
Portfolio Roadmap for Data Analysts
Foundations
- Build 2 Exploratory studies on Kaggle datasets
- Create a Descriptive KPI dashboard in Tableau or Power BI
- Master pandas, matplotlib, seaborn
- Publish findings on GitHub with notebooks
Analysis Skills
- Run a complete A/B test analysis (Explanatory)
- Build a Cohort Retention study (Comparative)
- Learn scipy.stats and statsmodels
- Add statistical narrative to visualizations
Advanced Methods
- Build a time series forecast (Predictive / Longitudinal)
- Create a churn prediction model with XGBoost
- Learn Plotly Dash or Streamlit for interactivity
- Pick your industry focus (finance, health, etc.)
Specialization
- Build a Prescriptive optimization case study
- Create 2 deep industry-specific studies
- Present findings publicly (blog, talk, LinkedIn)
- Contribute to open-source analytics projects