Tutorial: Analyzing COVID-19 Data with CSV Tools (Johns Hopkins Dataset)
Learn practical data analysis techniques using real-world COVID-19 data from Johns Hopkins University. This hands-on tutorial shows you how to clean, analyze, and gain insights from a real public dataset, no programming required.
Dataset: COVID-19 Data Repository by CSSE at Johns Hopkins University
License: Public domain
Size: ~50MB
Rows: 200,000+ time series records
Difficulty: Beginner to Intermediate
What You'll Learn
By the end of this tutorial, you'll know how to:
- ✅ Clean messy real-world data
- ✅ Transform wide data into analysis-ready format
- ✅ Calculate daily changes and trends
- ✅ Smooth noisy data with rolling averages
- ✅ Find patterns and peak dates
- ✅ Create country-specific analysis
- ✅ Export results for presentations
No coding required, just point, click, and analyze!
Getting the Data
Step 1: Download the Dataset
- Visit the Johns Hopkins COVID-19 repository
- Navigate to:
csse_covid_19_data/csse_covid_19_time_series/ - Download:
time_series_covid19_confirmed_global.csv
What this data contains: Confirmed COVID-19 cases for every country, updated daily from January 2020 onwards.
Step 2: Understand the Data Structure
The file looks like this (simplified):
| Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | ... |
|---|---|---|---|---|---|---|---|
| Afghanistan | 33.93 | 67.71 | 0 | 0 | 0 | ... | |
| Albania | 41.15 | 20.17 | 0 | 0 | 0 | ... |
Challenges with this format:
- ❌ One column for EACH date (hundreds of columns!)
- ❌ Hard to sort by date
- ❌ Can't easily filter date ranges
- ❌ Cumulative totals (not daily new cases)
Don't worry, we'll fix all of this!
Part 1: Load and Inspect
Step 1: Upload to ILoveCSV
- Open ILoveCSV in your browser
- Drag and drop your CSV file, or click to browse
- Wait for it to load (should take 5-10 seconds)
Step 2: Initial Inspection
Look at the preview to check:
- Total rows: Should be around 280 (one per country/region)
- Total columns: 1000+ (location info + daily data)
- Missing values: Some regions have empty Province/State
- Data type: Numbers for cases, text for locations
What you see: The data is messy but structured. Let's clean it up!
Part 2: Reshape the Data
The data is in "wide" format (one column per date). For analysis, we need "long" format (one row per date).
Step 3: Unpivot Date Columns
- Click Transform → Unpivot/Melt Columns
- Keep these columns:
Province/StateCountry/RegionLatLong
- Unpivot all other columns (the date columns)
- Name the new columns:
- Variable column:
Date - Value column:
Confirmed_Cases
- Variable column:
- Click Apply
Before (wide format):
| Country | 1/22/20 | 1/23/20 | 1/24/20 |
|---|---|---|---|
| Italy | 0 | 0 | 0 |
After (long format):
| Country | Date | Confirmed_Cases |
|---|---|---|
| Italy | 1/22/20 | 0 |
| Italy | 1/23/20 | 0 |
| Italy | 1/24/20 | 0 |
Why this matters: Now each row is one observation (country + date). Much easier to work with!
Part 3: Clean Date Formatting
Step 4: Fix Date Format
Currently dates look like 1/22/20, hard to sort correctly.
- Select the
Datecolumn - Click Transform → Parse Dates
- Input format:
M/D/YY - Output format:
YYYY-MM-DD - Click Apply
Result: 1/22/20 becomes 2020-01-22, properly sortable!
Part 4: Calculate Daily New Cases
The data shows cumulative cases (running total). We want daily new cases.
Step 5: Calculate Differences
- Click Calculate → Row Difference
- Group by:
Country/Region(so we calculate within each country) - Sort by:
Date(ascending) - Column to difference:
Confirmed_Cases - New column name:
Daily_New_Cases - Click Apply
What this does:
- For each country, subtract yesterday's total from today's total
- Result: How many NEW cases were reported each day
Example:
| Date | Cumulative | Daily New |
|---|---|---|
| 2020-03-01 | 1,000 | , |
| 2020-03-02 | 1,200 | 200 |
| 2020-03-03 | 1,500 | 300 |
Part 5: Focus on a Specific Country
Let's analyze one country in detail.
Step 6: Filter to Your Country
Example: Italy
- Click Filter
- Column:
Country/Region - Condition:
Equals - Value:
Italy - Click Apply
Result: Only Italian data remains. You can now see the full timeline for Italy!
Part 6: Smooth Out the Noise
Daily numbers jump around due to reporting delays, weekends, etc.
Step 7: Calculate 7-Day Rolling Average
- Click Calculate → Moving Average
- Column:
Daily_New_Cases - Window size:
7days - New column name:
7Day_Average - Click Apply
What this does: Takes the average of the last 7 days for each date. Smooths out weekend effects and reporting delays.
Before: Daily numbers jump around (e.g., Mon: 5,000 / Tue: 4,500 / Wed: 6,000 / Thu: 2,000 / Fri: 3,000)
After (7-day average): Smooth trend line around 4,000-4,500
Part 7: Find Important Dates
Step 8: Find Peak Days
- Click Sort
- Column:
Daily_New_Cases - Order:
Descending(highest first) - Click Apply
- Look at the top 10 rows
Analysis: When did cases peak? Was there more than one wave?
Part 8: Monthly Summary
Step 9: Aggregate by Month
- Click Transform → Extract Date Parts
- From column:
Date - Extract:
YearandMonth - New columns:
Year,Month - Click Apply
Then:
- Click Aggregate → Group By
- Group by columns:
Country/Region,Year,Month - Aggregate column:
Daily_New_Cases - Function:
Sum - New column:
Monthly_Total_Cases - Click Apply
Result: Total cases for each month, easier to see big picture trends!
Part 9: Calculate Growth Rates
Step 10: Week-over-Week Growth
- Make sure data is sorted by
Country/RegionandDate - Click Calculate → Percent Change
- Column:
Confirmed_Cases - Periods:
7(compare to 7 days ago) - New column:
Weekly_Growth_Rate - Click Apply
Result: Shows percentage increase week-over-week. Negative values = cases declining!
Part 10: Export Your Analysis
Step 11: Save Results
- Click Export
- Choose your format:
- CSV: For further analysis in Excel/Python/R
- Excel (.xlsx): For presentations with formatting
- JSON: For web dashboards
- Click Download
Your cleaned, analyzed data is ready to use!
Real Insights You Can Discover
Questions This Data Can Answer
1. When did your country's cases peak?
- Sort by
Daily_New_Casesdescending - Look at the top dates
2. Which countries had the worst outbreaks?
- Aggregate by country (sum all cases)
- Sort descending
3. When did growth slow down?
- Look at
Weekly_Growth_Rate - Find when it went from positive to negative
4. How many distinct waves occurred?
- Plot the 7-day average over time
- Count the peaks
5. How effective were lockdowns?
- Mark lockdown dates
- Check if growth rate decreased after
Going Further
Advanced Analysis Ideas
Compare with Population
Download World Bank population data and join it:
- Load population CSV
- Use Join tool on
Country/Region - Calculate:
Cases_Per_100K = (Total_Cases / Population) * 100000
Vaccination Impact
Get vaccination data and analyze:
- Before vaccination: growth rate
- After vaccination: growth rate
- Calculate percentage reduction
Regional Patterns
Group countries by region (Europe, Asia, etc.):
- Add a
Regioncolumn - Aggregate by region
- Compare regional trends
Common Issues and Solutions
"My numbers look wrong!"
Check:
- Did you group by country before calculating differences?
- Are dates sorted correctly?
- Did the source data have errors? (It happens!)
"I see negative daily cases!"
This happens when:
- Data was retroactively corrected
- Reporting errors were fixed
- Duplicates were removed
Solution: Filter out or manually fix obvious errors.
"The data is still too noisy"
Try:
- Increase moving average window (14 days instead of 7)
- Aggregate to weekly instead of daily
- Filter out countries with incomplete data
Key Takeaways
After completing this tutorial, you've learned:
✅ Real data is messy, expect to spend time cleaning
✅ Reshaping is crucial, wide vs long format matters
✅ Rolling averages, essential for noisy time series
✅ Growth rates, better than absolute numbers for comparisons
✅ Aggregation, monthly/weekly summaries reveal patterns
✅ Filtering, focus on what matters
These skills apply to any dataset:
- Sales data
- Website analytics
- Stock prices
- Sensor readings
- Survey results
Try It Yourself!
Open ILoveCSV and start analyzing →
Download the dataset: Johns Hopkins COVID-19 Data
Time required: 30-45 minutes for first time; 10 minutes once you know the steps
Data Source: Johns Hopkins University CSSE
Last Tutorial Update: February 2026
Data License: Public Domain
This tutorial demonstrates real-world data analysis techniques using publicly available data. The same methods work for any CSV dataset, business data, research data, personal projects, and more.
Before: 1/22/20
After: 2020-01-22
Step 4: Calculate Daily New Cases
The dataset shows cumulative cases. To get daily new cases:
- Use Calculate → Difference
- Group by:
Country/Region - Order by:
Date - Calculate:
Confirmed Cases - Previous Row - New column:
Daily New Cases
Formula logic:
Daily New Cases = Current Day Total - Previous Day Total
Step 5: Filter to Specific Country
Example: Analyze Italy's data
- Use Filter tool
- Column:
Country/Region - Condition: Equals
- Value:
Italy
Result: Only rows for Italy remain
Step 6: Calculate 7-Day Rolling Average
Smooth out daily fluctuations:
- Use Calculate → Moving Average
- Column:
Daily New Cases - Window size: 7
- New column:
7-Day Average
Why this matters: Daily reporting is noisy (weekends, holidays). Rolling averages show true trends.
Step 7: Find Peak Dates
- Use Sort tool
- Column:
Daily New Cases - Order: Descending
- View top 10 rows
Analysis: When did daily cases peak?
Step 8: Aggregate by Month
- Use Extract → Date Parts
- Extract: Year, Month from
Date - Use Aggregate tool
- Group by:
Country/Region,Year,Month - Sum:
Daily New Cases
Result: Total cases per country per month
Step 9: Calculate Growth Rates
Growth Rate = (Current - Previous) / Previous * 100
- Sort by
Country/Region,Date - Use Calculate → Percent Change
- Column:
Confirmed Cases - Periods: 7 (week-over-week growth)
Step 10: Export Results
- Click Export
- Choose format:
- CSV for further analysis
- Excel for presentations
- JSON for web applications
Real Insights You Can Find
Example Analyses
1. Which countries had the fastest growth?
- Sort by growth rate
- Filter to countries with >1M cases
2. When did waves occur?
- Plot 7-day average over time
- Identify peaks in the data
3. Compare countries
- Normalize by population
- Calculate cases per 100K people
4. Vaccination impact
- Compare pre/post vaccination rollout
- Calculate reduction in growth rates
Advanced Techniques
Merge with Population Data
Combine with World Bank population data:
- Download population CSV
- Load both datasets
- Use Join tool
- Join on:
Country/Region - Calculate:
Cases Per 100K = (Cases / Population) * 100000
Time Series Forecasting
- Export cleaned data
- Use Python/R for forecasting:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# Load cleaned data from iLoveCSV
df = pd.read_csv('covid_cleaned.csv')
# Fit ARIMA model
model = ARIMA(df['Daily New Cases'], order=(5,1,0))
forecast = model.fit().forecast(steps=30)
Data Quality Lessons
Issues We Found
- Reporting delays: Weekend dips, Monday spikes
- Definition changes: Testing criteria varied by country
- Missing data: Some countries stopped reporting
- Retroactive adjustments: Historical data revised
How We Handled Them
- Rolling averages: Smooth out reporting noise
- Outlier detection: Flag unrealistic spikes
- Gap filling: Interpolate missing dates
- Documentation: Note data quality issues
Conclusion
This tutorial demonstrated real-world data analysis using publicly available COVID-19 data. The techniques you learned apply to any time series dataset:
- Government statistics
- Financial market data
- Web analytics
- Sensor readings
Key Takeaways:
- Real data is messy—expect to spend 60% of time cleaning
- Reshaping data (pivot/unpivot) is crucial for analysis
- Time series require special handling (rolling averages, growth rates)
- Always document data sources and assumptions
Try this analysis yourself with ILoveCSV →
Data Source: Johns Hopkins University CSSE
Last Updated: February 2026
License: Public Domain
This tutorial uses real public data for educational purposes. The analysis techniques are general-purpose and apply to any CSV dataset.
