6 Session 3: Data Wrangling
Clean and prepare real-world data
- Duration: 4 hours
- Goal: Master the skills to clean, reshape, and prepare messy real-world data for analysis.
Real data is messy. Column names are inconsistent, values are missing, formats are wrong. Data wrangling is where analysts spend most of their time – and it’s the step that’s invisible in Excel. Today you learn to do it reproducibly.
6.1 Learning Outcomes
- Import data from CSV and Excel files
- Handle missing values and data quality issues
- Reshape data between wide and long formats
- Join multiple datasets together
- Apply data cleaning best practises
- Use the
here()package for robust file paths
6.2 Session Structure
6.2.1 Part 1: Homework Review & Setup (30 min)
Homework Review (25 min):
- Volunteer demos of homework indicators
- Quick discussion: What datasets did you bring? What challenges do you anticipate?
Setting Up Your Practise Script (5 min):
As in Session 2, create a new R script for today’s exercises:
- In Positron,
File > New File > R Script(orCtrl+Shift+N) - Save as
scripts/R/session-03-practice.R - Add a header and load the libraries you’ll need:
# Session 3 Practice - Data Wrangling
# Your name, today's date
library(tidyverse)
library(readxl)
library(janitor)
source(here::here("scripts", "R", "helpers.R"))Run lines with Ctrl+Enter or highlight a block and use Ctrl+Shift+Enter. Keep your code as you go – this file becomes a useful reference for your own indicator work.
6.2.2 Part 2: Importing Data (45 min)
Your indicator starts with a CSV or Excel download. Getting it into R cleanly is the first step.
Reading CSV Files:
library(tidyverse)
library(readxl)
source(here::here("scripts", "R", "helpers.R"))
# The WECA way
data <- load_csv(here::here("data", "raw", "indicator_data.csv"))Reading Excel Files:
# Basic read
data <- read_excel("data/raw/employment_data.xlsx")
# Specific sheet
data <- read_excel("data/raw/employment_data.xlsx", sheet = "Q4_2024")
# Skip header rows
data <- read_excel("data/raw/employment_data.xlsx", sheet = "Data", skip = 5)File Path Best Practises:
# ✅ Good - relative paths with here()
data <- load_csv(here::here("data", "raw", "myfile.csv"))
# ❌ Bad - absolute paths (breaks on other computers)
data <- read_csv("C:/Users/steve/projects/indicators/data/raw/myfile.csv")Try It (10 min): Load your own CSV with
read_csv(), then inspect it withglimpse(). How many rows and columns does it have? What types are the columns?
BREAK (15 min)
6.2.3 Part 3: Data Cleaning Essentials (60 min)
Real government data has messy names, missing values, and mixed types. This is how you fix them.
Cleaning Column Names:
library(janitor)
data_clean <- data %>%
clean_names() # Converts "Area Name" to "area_name"Try It (5 min): Run
clean_names()on your dataset. Comparenames(data)before and after.
Handling Missing Values:
# Check for missing data
check_missing(data)
# Remove rows with missing values
data_complete <- data %>%
drop_na(year, value)
# Replace missing values
data_filled <- data %>%
mutate(
value = if_else(is.na(value), 0, value),
category = replace_na(category, "Unknown")
)Try It (10 min): Cheque your data for NAs with
check_missing(). Decide how to handle them – drop the rows, fill with zeros, or replace with “Unknown”?
Type Conversions:
data <- data %>%
mutate(
year = as.numeric(year),
date = as.Date(date, format = "%d/%m/%Y"),
region = factor(region, levels = c("Bristol", "Bath", "South Glos", "North Somerset"))
)BREAK (15 min)
6.2.4 Part 4: Reshaping Data (60 min)
ONS data often arrives in wide format (one column per year). Your charts need long format.
Understanding Wide vs. Long Format:
Wide format (like Excel):
year bristol_value bath_value sg_value
2020 1000 500 750
Long format (tidy):
year area value
2020 Bristol 1000
2020 Bath 500
2020 South Glos 750
Pivot Longer - Wide to Long:
data_long <- data_wide %>%
pivot_longer(
cols = bristol_value:sg_value,
names_to = "area",
values_to = "value"
) %>%
mutate(
area = str_remove(area, "_value"),
area = str_to_title(area)
)Try It (10 min): Reshape a wide example dataset to long format using
pivot_longer(). Cheque the result withglimpse().
BREAK (15 min)
6.2.5 Part 5: Joining Datasets (30 min)
Many indicators combine data from multiple sources – e.g. population denominators with raw counts.
Types of Joins:
# Left join - keep all rows from first table
combined <- data1 %>%
left_join(data2, by = "area")
# Join on multiple columns
combined <- data1 %>%
left_join(data2, by = c("area", "year"))6.2.6 Common Pitfalls
pivot_longer()selects wrong columns – Usecols = starts_with("x")or name columns explicitly. Cheque withnames(data)first.- Join produces more rows than expected – Duplicate keys in one table cause row multiplication. Cheque with
count(data, key_column). read_excel()reads numbers as text – Excel formatting can trick R. Usemutate(value = as.numeric(value))and cheque for introduced NAs.- File not found with
here()– Ensure you’re in the R Project. Runhere::here()alone to cheque the project root.
6.2.7 Part 6: Real Data Workshop (45 min)
Before coding, write down 2-3 specific questions your indicator data could answer. Pick one to focus on. What chart type would best communicate the answer?
Work in pairs. Each pair takes one dataset. Divide the work: one person handles import and cleaning, the other handles reshaping and visualisation. Review each other’s code before combining.
Work with real WECA datasets:
- Import the data
- Inspect and document issues
- Clean column names
- Handle missing values
- Reshape if needed
- Create a visualisation
- Calculate summary statistics
6.2.8 Part 7: Wrap-up & Homework (15 min)
Reflection (5 min):
Before we finish, take 2 minutes to write down:
- One thing you understand now that you didn’t at the start
- One thing that’s still unclear or you’d like more practise with
What was the messiest thing about your data, and how did you fix it? Share with the group if you’re comfortable.
Homework (1-2 hours):
- Complete: Prepare your real indicator data (cleaned version in
data/processed/) - Create: A draft indicator section with data loading, cleaning, transformation, chart, and findings
- Read: CONTRIBUTING.md sections on “Code Style Guide”