Tutorial: Analyzing COVID-19 Data with CSV Tools (Johns Hopkins Dataset)

Learn practical data analysis techniques using real-world COVID-19 data from Johns Hopkins University. This hands-on tutorial shows you how to clean, analyze, and gain insights from a real public dataset, no programming required.

Dataset: COVID-19 Data Repository by CSSE at Johns Hopkins University
License: Public domain
Size: ~50MB
Rows: 200,000+ time series records
Difficulty: Beginner to Intermediate

What You'll Learn

By the end of this tutorial, you'll know how to:

✅ Clean messy real-world data
✅ Transform wide data into analysis-ready format
✅ Calculate daily changes and trends
✅ Smooth noisy data with rolling averages
✅ Find patterns and peak dates
✅ Create country-specific analysis
✅ Export results for presentations

No coding required, just point, click, and analyze!

Getting the Data

Step 1: Download the Dataset

Visit the Johns Hopkins COVID-19 repository
Navigate to: csse_covid_19_data/csse_covid_19_time_series/
Download: time_series_covid19_confirmed_global.csv

What this data contains: Confirmed COVID-19 cases for every country, updated daily from January 2020 onwards.

Step 2: Understand the Data Structure

The file looks like this (simplified):

Province/State	Country/Region	Lat	Long	1/22/20	1/23/20	1/24/20	...
	Afghanistan	33.93	67.71	0	0	0	...
	Albania	41.15	20.17	0	0	0	...

Challenges with this format:

❌ One column for EACH date (hundreds of columns!)
❌ Hard to sort by date
❌ Can't easily filter date ranges
❌ Cumulative totals (not daily new cases)

Don't worry, we'll fix all of this!

Part 1: Load and Inspect

Step 1: Upload to ILoveCSV

Open ILoveCSV in your browser
Drag and drop your CSV file, or click to browse
Wait for it to load (should take 5-10 seconds)

Step 2: Initial Inspection

Look at the preview to check:

Total rows: Should be around 280 (one per country/region)
Total columns: 1000+ (location info + daily data)
Missing values: Some regions have empty Province/State
Data type: Numbers for cases, text for locations

What you see: The data is messy but structured. Let's clean it up!

Part 2: Reshape the Data

The data is in "wide" format (one column per date). For analysis, we need "long" format (one row per date).

Step 3: Unpivot Date Columns

Click Transform → Unpivot/Melt Columns
Keep these columns:
- Province/State
- Country/Region
- Lat
- Long
Unpivot all other columns (the date columns)
Name the new columns:
- Variable column: Date
- Value column: Confirmed_Cases
Click Apply

Before (wide format):

Country	1/22/20	1/23/20	1/24/20
Italy	0	0	0

After (long format):

Country	Date	Confirmed_Cases
Italy	1/22/20	0
Italy	1/23/20	0
Italy	1/24/20	0

Why this matters: Now each row is one observation (country + date). Much easier to work with!

Part 3: Clean Date Formatting

Step 4: Fix Date Format

Currently dates look like 1/22/20, hard to sort correctly.

Select the Date column
Click Transform → Parse Dates
Input format: M/D/YY
Output format: YYYY-MM-DD
Click Apply

Result: 1/22/20 becomes 2020-01-22, properly sortable!

Part 4: Calculate Daily New Cases

The data shows cumulative cases (running total). We want daily new cases.

Step 5: Calculate Differences

Click Calculate → Row Difference
Group by: Country/Region (so we calculate within each country)
Sort by: Date (ascending)
Column to difference: Confirmed_Cases
New column name: Daily_New_Cases
Click Apply

What this does:

For each country, subtract yesterday's total from today's total
Result: How many NEW cases were reported each day

Example:

Date	Cumulative	Daily New
2020-03-01	1,000	,
2020-03-02	1,200	200
2020-03-03	1,500	300

Part 5: Focus on a Specific Country

Let's analyze one country in detail.

Step 6: Filter to Your Country

Example: Italy

Click Filter
Column: Country/Region
Condition: Equals
Value: Italy
Click Apply

Result: Only Italian data remains. You can now see the full timeline for Italy!

Part 6: Smooth Out the Noise

Daily numbers jump around due to reporting delays, weekends, etc.

Step 7: Calculate 7-Day Rolling Average

Click Calculate → Moving Average
Column: Daily_New_Cases
Window size: 7 days
New column name: 7Day_Average
Click Apply

What this does: Takes the average of the last 7 days for each date. Smooths out weekend effects and reporting delays.

Before: Daily numbers jump around (e.g., Mon: 5,000 / Tue: 4,500 / Wed: 6,000 / Thu: 2,000 / Fri: 3,000)

After (7-day average): Smooth trend line around 4,000-4,500

Part 7: Find Important Dates

Step 8: Find Peak Days

Click Sort
Column: Daily_New_Cases
Order: Descending (highest first)
Click Apply
Look at the top 10 rows

Analysis: When did cases peak? Was there more than one wave?

Part 8: Monthly Summary

Step 9: Aggregate by Month

Click Transform → Extract Date Parts
From column: Date
Extract: Year and Month
New columns: Year, Month
Click Apply

Then:

Click Aggregate → Group By
Group by columns: Country/Region, Year, Month
Aggregate column: Daily_New_Cases
Function: Sum
New column: Monthly_Total_Cases
Click Apply

Result: Total cases for each month, easier to see big picture trends!

Part 9: Calculate Growth Rates

Step 10: Week-over-Week Growth

Make sure data is sorted by Country/Region and Date
Click Calculate → Percent Change
Column: Confirmed_Cases
Periods: 7 (compare to 7 days ago)
New column: Weekly_Growth_Rate
Click Apply

Result: Shows percentage increase week-over-week. Negative values = cases declining!

Part 10: Export Your Analysis

Step 11: Save Results

Click Export
Choose your format:
- CSV: For further analysis in Excel/Python/R
- Excel (.xlsx): For presentations with formatting
- JSON: For web dashboards
Click Download

Your cleaned, analyzed data is ready to use!

Real Insights You Can Discover

Questions This Data Can Answer

1. When did your country's cases peak?

Sort by Daily_New_Cases descending
Look at the top dates

2. Which countries had the worst outbreaks?

Aggregate by country (sum all cases)
Sort descending

3. When did growth slow down?

Look at Weekly_Growth_Rate
Find when it went from positive to negative

4. How many distinct waves occurred?

Plot the 7-day average over time
Count the peaks

5. How effective were lockdowns?

Mark lockdown dates
Check if growth rate decreased after

Going Further

Advanced Analysis Ideas

Compare with Population

Download World Bank population data and join it:

Load population CSV
Use Join tool on Country/Region
Calculate: Cases_Per_100K = (Total_Cases / Population) * 100000

Vaccination Impact

Get vaccination data and analyze:

Before vaccination: growth rate
After vaccination: growth rate
Calculate percentage reduction

Regional Patterns

Group countries by region (Europe, Asia, etc.):

Add a Region column
Aggregate by region
Compare regional trends

Common Issues and Solutions

"My numbers look wrong!"

Check:

Did you group by country before calculating differences?
Are dates sorted correctly?
Did the source data have errors? (It happens!)

"I see negative daily cases!"

This happens when:

Data was retroactively corrected
Reporting errors were fixed
Duplicates were removed

Solution: Filter out or manually fix obvious errors.

"The data is still too noisy"

Try:

Increase moving average window (14 days instead of 7)
Aggregate to weekly instead of daily
Filter out countries with incomplete data

Key Takeaways

After completing this tutorial, you've learned:

✅ Real data is messy, expect to spend time cleaning
✅ Reshaping is crucial, wide vs long format matters
✅ Rolling averages, essential for noisy time series
✅ Growth rates, better than absolute numbers for comparisons
✅ Aggregation, monthly/weekly summaries reveal patterns
✅ Filtering, focus on what matters

These skills apply to any dataset:

Sales data
Website analytics
Stock prices
Sensor readings
Survey results

Try It Yourself!

Open ILoveCSV and start analyzing →

Download the dataset: Johns Hopkins COVID-19 Data

Time required: 30-45 minutes for first time; 10 minutes once you know the steps

Data Source: Johns Hopkins University CSSE
Last Tutorial Update: February 2026
Data License: Public Domain

This tutorial demonstrates real-world data analysis techniques using publicly available data. The same methods work for any CSV dataset, business data, research data, personal projects, and more.

Before: 1/22/20
After: 2020-01-22

Step 4: Calculate Daily New Cases

The dataset shows cumulative cases. To get daily new cases:

Use Calculate → Difference
Group by: Country/Region
Order by: Date
Calculate: Confirmed Cases - Previous Row
New column: Daily New Cases

Formula logic:

Daily New Cases = Current Day Total - Previous Day Total

Step 5: Filter to Specific Country

Example: Analyze Italy's data

Use Filter tool
Column: Country/Region
Condition: Equals
Value: Italy

Result: Only rows for Italy remain

Step 6: Calculate 7-Day Rolling Average

Smooth out daily fluctuations:

Use Calculate → Moving Average
Column: Daily New Cases
Window size: 7
New column: 7-Day Average

Why this matters: Daily reporting is noisy (weekends, holidays). Rolling averages show true trends.

Step 7: Find Peak Dates

Use Sort tool
Column: Daily New Cases
Order: Descending
View top 10 rows

Analysis: When did daily cases peak?

Step 8: Aggregate by Month

Use Extract → Date Parts
Extract: Year, Month from Date
Use Aggregate tool
Group by: Country/Region, Year, Month
Sum: Daily New Cases

Result: Total cases per country per month

Step 9: Calculate Growth Rates

Growth Rate = (Current - Previous) / Previous * 100

Sort by Country/Region, Date
Use Calculate → Percent Change
Column: Confirmed Cases
Periods: 7 (week-over-week growth)

Step 10: Export Results

Click Export
Choose format:
- CSV for further analysis
- Excel for presentations
- JSON for web applications

Real Insights You Can Find

Example Analyses

1. Which countries had the fastest growth?

Sort by growth rate
Filter to countries with >1M cases

2. When did waves occur?

Plot 7-day average over time
Identify peaks in the data

3. Compare countries

Normalize by population
Calculate cases per 100K people

4. Vaccination impact

Compare pre/post vaccination rollout
Calculate reduction in growth rates

Advanced Techniques

Merge with Population Data

Combine with World Bank population data:

Download population CSV
Load both datasets
Use Join tool
Join on: Country/Region
Calculate: Cases Per 100K = (Cases / Population) * 100000

Time Series Forecasting

Export cleaned data
Use Python/R for forecasting:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load cleaned data from iLoveCSV
df = pd.read_csv('covid_cleaned.csv')

# Fit ARIMA model
model = ARIMA(df['Daily New Cases'], order=(5,1,0))
forecast = model.fit().forecast(steps=30)

Data Quality Lessons

Issues We Found

Reporting delays: Weekend dips, Monday spikes
Definition changes: Testing criteria varied by country
Missing data: Some countries stopped reporting
Retroactive adjustments: Historical data revised

How We Handled Them

Rolling averages: Smooth out reporting noise
Outlier detection: Flag unrealistic spikes
Gap filling: Interpolate missing dates
Documentation: Note data quality issues

Conclusion

This tutorial demonstrated real-world data analysis using publicly available COVID-19 data. The techniques you learned apply to any time series dataset:

Government statistics
Financial market data
Web analytics
Sensor readings

Key Takeaways:

Real data is messy—expect to spend 60% of time cleaning
Reshaping data (pivot/unpivot) is crucial for analysis
Time series require special handling (rolling averages, growth rates)
Always document data sources and assumptions

Try this analysis yourself with ILoveCSV →

Data Source: Johns Hopkins University CSSE
Last Updated: February 2026
License: Public Domain

This tutorial uses real public data for educational purposes. The analysis techniques are general-purpose and apply to any CSV dataset.