This Power BI dashboard transforms the Online Retail II dataset into strategic sales insights — from ETL and cleaning to SQL analysis and visual YOY comparisons. Note: December 2011 is incomplete in the source data and has been excluded from YOY calculations.
This project aimed to create a complete end-to-end data pipeline for the Online Retail II dataset, implementing ETL processes, rigorous data cleaning, SQL analysis, and an interactive Power BI dashboard with a focus on year-over-year (YOY) comparison between 2010 and 2011. The goal was to transform raw transactional data into actionable business insights that could drive strategic decisions around customer retention, product performance, and geographic expansion.
The project followed a structured approach: First, the ~1M row dataset was ingested into MS SQL Server using Python (pandas + pyodbc) with chunked processing for reliability. A multi-step data cleaning pipeline was then implemented in SQL to handle issues like non-standard invoices, missing customer IDs, inconsistent product codes, and pricing anomalies. After creating an analysis-ready dataset, a comprehensive Power BI data model was built with a proper date table and relationships. Finally, DAX measures were created to support YOY analysis, and an interactive dashboard was designed to visualize key insights about sales performance, customer behavior, and product trends.
The final deliverables give stakeholders clear insight into sales performance, customer behavior, and product trends to support data-driven decision making.
Dataset: Online Retail II (UCI Machine Learning Repository) — transactions of a UK online retailer
Columns (raw): InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
Record count: 1,067,371 rows
Key Notes:
Business model: Non-store, UK online retailer selling giftware across global customers (retail + wholesalers)
Key assumptions made during cleaning: only 6-digit invoices are valid sales orders (cancellations and other non-standard invoices are excluded); valid product StockCodes begin with a 5-digit code; zero-priced rows carry no revenue and are removed; negative quantities are converted to absolute values; a Customer ID of 0 denotes an unknown customer (set to NULL).
/OnlineRetailProject/
├─ data/
│ └─ online_retail_II.csv
├─ notebooks/
│ └─ Import Dataset from CSV to SQL.ipynb
├─ sql/
│ ├─ create_tables.sql
│ ├─ cleaning_queries.sql
│ └─ analysis_queries.sql
├─ powerbi/
│ └─ OnlineRetail_Dashboard.pbix
├─ docs/
│ └─ Online Retail II — Full Project Documentation.md
└─ scripts/
└─ import_to_sql.py
Purpose: Reliably import the large CSV into MS SQL Server with robust type coercion and chunked inserts
Key patterns used: chunked executemany inserts with fast_executemany enabled, type coercion via pandas errors='coerce', a commit per chunk, and dumping any failed chunk to CSV for inspection.
Python code (condensed example):
import pyodbc
import pandas as pd

# 1) Read CSV
csv_path = r"path\to\online_retail_II.csv"
df = pd.read_csv(csv_path)
print('Total rows to insert:', len(df))

# 2) Preprocessing: align raw column names with the SQL schema and coerce types
df = df.rename(columns={'InvoiceNo': 'Invoice'})  # raw export uses InvoiceNo; the target table uses Invoice
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce').fillna(0).astype(int)
df['Price'] = pd.to_numeric(df['UnitPrice'], errors='coerce').fillna(0.0)
df['CustomerID'] = df['CustomerID'].replace(0, pd.NA)  # treat 0 as a missing customer
df['Description'] = df['Description'].fillna('').astype(str).str.strip()

# Convert remaining NaN/NaT/pd.NA to None so pyodbc binds them as SQL NULL
df = df.astype(object).where(pd.notnull(df), None)

# 3) Connect to SQL Server (Windows authentication)
conn = pyodbc.connect(
    Driver='ODBC Driver 17 for SQL Server',
    Server=r'YUU\SQLEXPRESS',
    Database='online_retail',
    Trusted_Connection='yes',
    autocommit=False
)
cur = conn.cursor()
cur.fast_executemany = True  # send parameters in batches for much faster inserts

# 4) Prepare statement (column names match the dbo.invoices DDL below)
sql = '''INSERT INTO dbo.invoices (Invoice, StockCode, Description, Quantity, InvoiceDate, Price, [Customer ID], Country)
VALUES (?,?,?,?,?,?,?,?)'''
params = df[['Invoice','StockCode','Description','Quantity','InvoiceDate','Price','CustomerID','Country']].values.tolist()

# 5) Chunked insert: commit per chunk; dump a failed chunk to CSV for inspection
chunk_size = 50000
for start in range(0, len(params), chunk_size):
    chunk = params[start:start + chunk_size]
    try:
        cur.executemany(sql, chunk)
        conn.commit()
        print(f'Committed rows {start}–{start + len(chunk) - 1}')
    except Exception as e:
        print(f'Error inserting rows {start}–{start + len(chunk) - 1}: {e}')
        pd.DataFrame(chunk).to_csv(f'failed_chunk_{start}.csv', index=False)
        break

cur.close()
conn.close()
Target table (cleaned workspace): dbo.retail_data_cleaned
Example CREATE TABLE for initial invoices:
CREATE TABLE dbo.invoices (
Invoice VARCHAR(20),
StockCode VARCHAR(50),
Description VARCHAR(255),
Quantity INT,
InvoiceDate DATETIME2(0),
Price DECIMAL(10,2),
[Customer ID] VARCHAR(50),
Country VARCHAR(50)
);
Safe copy creation (work on a copy of the raw dbo.retail_data table so the original stays intact):
SELECT * INTO dbo.retail_data_cleaned FROM dbo.retail_data;
Important indexed columns (recommended):
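The exact index set is not specified in the source; a plausible sketch, assuming the queries elsewhere in this document filter and group mainly on Invoice, StockCode, and [Customer ID]:

CREATE NONCLUSTERED INDEX IX_retail_cleaned_Invoice
    ON dbo.retail_data_cleaned (Invoice);       -- order-level lookups and distinct invoice counts
CREATE NONCLUSTERED INDEX IX_retail_cleaned_StockCode
    ON dbo.retail_data_cleaned (StockCode);     -- SKU-level grouping in the analysis queries
CREATE NONCLUSTERED INDEX IX_retail_cleaned_CustomerID
    ON dbo.retail_data_cleaned ([Customer ID]); -- customer-level aggregation

InvoiceDate is deliberately left unindexed at this stage: the cleaning pipeline below drops and renames that column, and an index on it would block the ALTER TABLE ... DROP COLUMN step.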
-- 1) Treat Customer ID 0 as missing
UPDATE dbo.retail_data_cleaned
SET [Customer ID] = NULL
WHERE [Customer ID] = 0;

-- 2) Keep only standard 6-digit invoices (drops cancellations prefixed 'C' and other non-standard rows)
DELETE FROM dbo.retail_data_cleaned
WHERE Invoice NOT LIKE '[0-9][0-9][0-9][0-9][0-9][0-9]';

-- 3) Keep only product rows whose StockCode starts with a 5-digit code (drops postage, fees, etc.)
DELETE FROM dbo.retail_data_cleaned
WHERE StockCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]%';

-- 4) Convert remaining negative quantities to positive values
UPDATE dbo.retail_data_cleaned
SET Quantity = ABS(Quantity);

-- 5) Remove zero-priced rows (no revenue contribution)
DELETE FROM dbo.retail_data_cleaned
WHERE Price = 0;

-- 6) Normalize descriptions: sentence case, then trim whitespace
UPDATE dbo.retail_data_cleaned
SET Description = CONCAT(UPPER(LEFT(Description,1)), LOWER(SUBSTRING(Description,2,LEN(Description))))
WHERE Description IS NOT NULL;

UPDATE dbo.retail_data_cleaned
SET Description = LTRIM(RTRIM(Description))
WHERE Description IS NOT NULL;

-- 7) Truncate InvoiceDate to whole seconds via a staging column, then swap it in
ALTER TABLE dbo.retail_data_cleaned ADD InvoiceDate_Converted DATETIME2(0);
UPDATE dbo.retail_data_cleaned
SET InvoiceDate_Converted = DATEADD(SECOND, DATEDIFF(SECOND, '1900-01-01', InvoiceDate), '1900-01-01');
ALTER TABLE dbo.retail_data_cleaned DROP COLUMN InvoiceDate;
EXEC sp_rename 'dbo.retail_data_cleaned.InvoiceDate_Converted', 'InvoiceDate', 'COLUMN';

-- 8) Split out date-only and time-only columns for the heatmap and hourly analysis
ALTER TABLE dbo.retail_data_cleaned
ADD InvoiceDateOnly DATE, InvoiceTimeOnly TIME;
UPDATE dbo.retail_data_cleaned
SET InvoiceDateOnly = CAST(InvoiceDate AS DATE), InvoiceTimeOnly = CAST(InvoiceDate AS TIME);

-- 9) Add a computed revenue column
ALTER TABLE dbo.retail_data_cleaned
ADD Sales AS (Price * Quantity);

-- 10) Validation: row count, zero-price check, and overall data loss vs. the raw table
SELECT COUNT(*) AS TotalRows FROM dbo.retail_data_cleaned;
SELECT SUM(CASE WHEN Price = 0 THEN 1 ELSE 0 END) AS ZeroPriceCount FROM dbo.retail_data_cleaned;
SELECT (1.0 - CAST((SELECT COUNT(*) FROM dbo.retail_data_cleaned) AS FLOAT)
            / (SELECT COUNT(*) FROM dbo.retail_data)) * 100 AS TotalDataLossPercent;
Columns added/derived: InvoiceDateOnly (DATE), InvoiceTimeOnly (TIME), [Hour] (computed), and Sales (computed as Price * Quantity).
SQL examples:
-- InvoiceDateOnly and InvoiceTimeOnly were split out in the cleaning pipeline above;
-- derive the hour of day for the heatmap as a computed column (SQL Server):
ALTER TABLE dbo.retail_data_cleaned ADD [Hour] AS DATEPART(HOUR, InvoiceTimeOnly);
Checks performed: row counts before vs. after cleaning, description frequency (to spot duplicates and junk descriptions), and the StockCode length distribution.
SQL examples:
SELECT COUNT(*) FROM dbo.retail_data; -- original
SELECT COUNT(*) FROM dbo.retail_data_cleaned; -- cleaned
SELECT Description, COUNT(*) AS Frequency FROM dbo.retail_data_cleaned GROUP BY Description ORDER BY Frequency DESC;
SELECT LEN(StockCode) AS StockCodeLength, COUNT(*) AS Cnt FROM dbo.retail_data_cleaned GROUP BY LEN(StockCode) ORDER BY StockCodeLength;
Data model: the retail_data_cleaned fact table related to a dedicated Date Table dimension.
Date Table (DAX):
Date Table = CALENDAR(MIN(retail_data_cleaned[InvoiceDate]), MAX(retail_data_cleaned[InvoiceDate]))
Columns created on Date table: Year and Month (both referenced by the measures below; see the sketch that follows).
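A minimal sketch of how those calculated columns could be defined; the names come from the measures in this document, but the exact definitions and the Month Number sort helper are assumptions:

Year = YEAR('Date Table'[Date])
Month = FORMAT('Date Table'[Date], "MMMM")  // full month name, matching the "is not December" page filter
Month Number = MONTH('Date Table'[Date])    // assumed helper used to sort Month chronologically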
Relationships: a one-to-many, single-direction relationship from 'Date Table'[Date] to the fact table's invoice date column (in practice the date-only column, so time components do not break the match).
Modeling tips: mark 'Date Table' as the model's official date table so time-intelligence functions such as DATEADD and TOTALMTD work over a contiguous date range.
Core KPIs:
Total Sales = SUM(retail_data_cleaned[Sales])
Total Orders = DISTINCTCOUNT(retail_data_cleaned[Invoice])
Total Quantity sold = SUM(retail_data_cleaned[Quantity])
Avg Basket Value = DIVIDE([Total Sales],[Total Orders])
Avg Items per Order = DIVIDE([Total Quantity sold],[Total Orders])
CY / PY / NY measures:
CY Sales = CALCULATE([Total Sales], VALUES('Date Table'[Year]))
PY Sales = CALCULATE([CY Sales], DATEADD('Date Table'[Date], -1, YEAR))
NY Sales = CALCULATE([CY Sales], DATEADD('Date Table'[Date], +1, YEAR))
CY Order = CALCULATE([Total Orders], VALUES('Date Table'[Year]))
PY Order = CALCULATE([CY Order], DATEADD('Date Table'[Date], -1, YEAR))
CY Quantity = CALCULATE([Total Quantity sold], VALUES('Date Table'[Year]))
PY Quantity = CALCULATE([CY Quantity], DATEADD('Date Table'[Date], -1, YEAR))
Problem:
The dataset ends on December 9, 2011, making December 2011 incomplete. Comparing partial December 2011 to full December 2010 would lead to artificially low YoY metrics and misleading KPIs.
Solution:
1. Page Filter: A filter was applied on the "Sales Trend Over the Period" page to exclude December:
Date Table[Date].[Month] → "is not December"
2. Measure Adjustment: The original CY measures used ALL('Date Table')
which removed the month filter context. They were updated to use VALUES('Date Table'[Year])
to preserve the year context while respecting the month filter.
Original CY Sales Measure (Problematic):
CY Sales =
VAR _Year = SELECTEDVALUE('Date Table'[Year])
RETURN
    CALCULATE(
        [Total Sales],
        FILTER(
            ALL('Date Table'),
            'Date Table'[Year] = _Year
        )
    )
Revised CY Sales Measure (Corrected):
CY Sales =
CALCULATE(
    [Total Sales],
    VALUES('Date Table'[Year])
)
The same adjustment was applied to:
CY Quantity
CY Orders
Result:
All YoY comparisons now accurately reflect January–November data for both years, avoiding biased results due to incomplete December data.
Average & trend helpers:
Yearly Month Avg Sales = AVERAGEX(SUMMARIZE(ALLSELECTED('Date Table'), 'Date Table'[Month]), [Total Sales])
Daily Avg Sales = AVERAGEX(ALLSELECTED(retail_data_cleaned[InvoiceDate]), [Total Sales])
Color rules:
Color For columns = IF([Total Sales] > [Yearly Month Avg Sales], "Above Average", "Below Average")
Color For Daily Bars = IF([Total Sales] > [Daily Avg Sales], "Above Average", "Below Average")
CM/PM for MOM:
CM Sales =
VAR selected_month = SELECTEDVALUE('Date Table'[Month])
RETURN
    TOTALMTD(
        CALCULATE([Total Sales], 'Date Table'[Month] = selected_month),
        'Date Table'[Date]
    )
PM Sales = CALCULATE([CM Sales], DATEADD('Date Table'[Date], -1, MONTH))
MOM Growth & Diff Sales =
VAR month_diff = [CM Sales] - [PM Sales]
VAR mom = DIVIDE(month_diff, [PM Sales])
VAR sign = IF(month_diff > 0, "+", "")
VAR sign_trend = IF(month_diff > 0, "▲", "▼")
RETURN
IF(
ISBLANK([PM Sales]) || [PM Sales] = 0,
FORMAT(month_diff / 1000, "0.0k") & " vs LM",
sign_trend & " " & sign & FORMAT(mom, "0.0%") & " | " & sign & FORMAT(month_diff / 1000, "0.0k") & " vs LM"
)
YOY & labeling:
YOY Growth & Diff Sales =
VAR year_diff = [CY Sales] - [PY Sales]
VAR yoy = DIVIDE(year_diff, [PY Sales])
VAR _sign = IF(year_diff > 0, "+", "")
VAR _sign_trend = IF(year_diff > 0, "▲", "▼")
RETURN
IF(
ISBLANK([PY Sales]) || [PY Sales] = 0,
FORMAT(year_diff / 1000, "0.0k") & " vs LY",
_sign_trend & " " & _sign & FORMAT(yoy, "#0.00%") & " | " & _sign & FORMAT(year_diff / 1000, "0.0k") & " vs LY"
)
Label for Product Type (Sales) =
VAR MostCommonDescription =
    MAXX(
        TOPN(
            1,
            VALUES(retail_data_cleaned[Description]),
            CALCULATE(COUNTROWS(retail_data_cleaned)), // CALCULATE forces context transition so rows are counted per description
            DESC
        ),
        retail_data_cleaned[Description]
    )
RETURN MostCommonDescription & " | " & FORMAT([Total Sales]/1000, "$0.00k")
Label for Store Location = SELECTEDVALUE(retail_data_cleaned[Country]) & " | " & FORMAT([Total Sales]/1000, "$0.00k")
PlaceHolder = 0
Items per Order Label = FORMAT(DIVIDE([Total Quantity Sold], [Total Orders]), "0") & " Items per Order"
Main Dashboard (Overview — YOY-first): KPI cards (Total Sales, Total Orders, Total Quantity, Avg Basket Value) with YOY growth and difference labels, plus monthly and daily sales trends.
December Data Adjustment: December excluded from trend visuals and CY measures rewritten with VALUES('Date Table'[Year]), as described above.
Heatmap page: hour-of-day by weekday sales matrix to surface peak order windows.
Product Insights page: top and declining products by sales and quantity, labeled with the most common description, plus country-level breakdowns.
Styling:
Finding: Total Sales held at ≈ €7.68M (2010 → 2011: -0.03%), while Total Quantity fell -8.4% and Total Orders fell -2.8%; Avg Basket Value rose +2.8%.
Interpretation: The business preserved revenue per order by growing basket value while volume declined. This is risky long-term: the customer and volume base is eroding.
Recommendation: Launch volume-lifting promotions (bundles, multi-buy incentives, and re-engagement campaigns) that preserve AOV via cross-sells/upsells.
Finding: Australia and Japan grew rapidly from small bases; EU countries (Netherlands, Germany, France) provide steady revenue.
Recommendation: Allocate test ad spend to AU/JP (10% of Q3 for AU, targeted JP email), localize content and test shipping thresholds in Netherlands.
Finding: Top customers show mixed changes; some champions grew by more than 10% while others declined by more than 20%.
Recommendation: Implement VIP loyalty for champions; create churn playbook for at-risk customers; run 2nd purchase onboarding for new large customers.
Finding: Storage & lighting SKUs show strong growth; some décor top-sellers are declining.
Recommendation: Reallocate ad & homepage real estate to high-growth categories; negotiate supplier discount for high-volume new SKUs; plan clearance for declining SKUs.
Finding: Peak months: Nov (Black Friday / pre-Christmas); troughs: Feb.
Recommendation: Reallocate 60% of Q4 promo to early Nov; run a February stimulus campaign focused on storage & décor.
Finding: 11 AM–1 PM weekdays (especially Wed) are peak order windows.
Recommendation: Staff live chat and CS during peak; schedule flash deals during midweek midday window.
KPIs to monitor weekly: Total Sales, Total Orders, Conversion Rate (if web data available), AOV, Top 10 SKUs by revenue and quantity, New vs Returning buyers.
Stakeholders:
Deliverables: the interactive Power BI dashboard (.pbix), SQL cleaning and analysis scripts, the Python ingestion script, and this documentation.
Runbook (manual): run scripts/import_to_sql.py to load the CSV, execute sql/cleaning_queries.sql against the database, then refresh OnlineRetail_Dashboard.pbix.
Automating: schedule the import and cleaning steps (e.g., SQL Server Agent or Windows Task Scheduler) and configure scheduled refresh for the published report.
Top descriptions:
SELECT Description, COUNT(*) AS Frequency
FROM dbo.retail_data_cleaned
GROUP BY Description
ORDER BY Frequency DESC;
StockCode length distribution:
SELECT LEN(StockCode) AS StockCodeLength, COUNT(*) AS Cnt
FROM dbo.retail_data_cleaned
GROUP BY LEN(StockCode)
ORDER BY StockCodeLength;
Hourly sales heatmap source:
SELECT DATEPART(HOUR, InvoiceTimeOnly) AS [Hour],
       DATENAME(WEEKDAY, InvoiceDate) AS DayName,
       SUM(Sales) AS TotalSales
FROM dbo.retail_data_cleaned
GROUP BY DATEPART(HOUR, InvoiceTimeOnly), DATENAME(WEEKDAY, InvoiceDate);
| Column | Type | Description |
|---|---|---|
| Invoice | VARCHAR(20) | Invoice number (6-digit standard orders) |
| StockCode | VARCHAR(50) | SKU code; main codes start with 5 digits |
| Description | VARCHAR(255) | Product description (normalized casing, trimmed) |
| Quantity | INT | Units in the invoice row (converted to positive) |
| InvoiceDate | DATETIME2(0) | Transaction datetime (truncated to whole seconds) |
| InvoiceDateOnly | DATE | Date part of InvoiceDate |
| InvoiceTimeOnly | TIME | Time part of InvoiceDate |
| [Hour] | INT | Computed column: DATEPART(HOUR, InvoiceTimeOnly) |
| Price | DECIMAL(10,2) | Unit price |
| [Customer ID] | VARCHAR(50) | Customer identifier (NULL when missing) |
| Country | VARCHAR(50) | Billing country |
| Sales | DECIMAL(18,2) | Computed column: Price * Quantity |
Why MySQL? MySQL and MariaDB are open-source RDBMSs with strong bulk-import tooling (LOAD DATA INFILE) and are widely used in web stacks.
Possible benefits: faster local CSV imports, flexible open-source tooling, and an easier migration path to managed MySQL cloud engines.
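For comparison, a hypothetical sketch of the equivalent bulk import in MySQL, assuming an invoices table mirroring the SQL Server schema (the table and column names here are assumptions, not from the source):

LOAD DATA LOCAL INFILE 'data/online_retail_II.csv'
INTO TABLE invoices
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES  -- skip the CSV header row
(Invoice, StockCode, Description, Quantity, InvoiceDate, Price, Customer_ID, Country);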
Download the Power BI Desktop file with embedded report and model: Get Power BI File
Download the Python scripts used for data ingestion and processing: Get Python Scripts
Full technical documentation, all queries and measures, and the dataset are hosted in a GitHub repository with detailed README files and inline comments: View Repository