9 Practice Datasets
9.1 Example Datasets
Two small, clean datasets are provided for practice exercises in Sessions 1-2. From Session 3 onwards, analysts work with their own real indicator data.
9.1.1 Bus Ridership
File: bus_ridership.csv
Description: Annual bus ridership data for the WECA region
Columns:
- year - Year (2015-2024)
- ridership - Annual ridership in millions
- routes - Number of bus routes
- region - Geographic area
Use cases:
- Session 1: Creating first visualisation
- Session 2: Line charts and trends with ggplot2
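As a rough sketch of the Session 2 use case, the snippet below builds a small stand-in data frame with the same column names as bus_ridership.csv and draws a line chart from it. The ridership and routes figures are invented placeholders, not the contents of the real file, and the "WECA" region label is an assumption. In practice you would load the file instead, for example with readr::read_csv("bus_ridership.csv"), adjusting the path to wherever the file is stored.

```r
library(ggplot2)

# Stand-in for bus_ridership.csv: same column names, made-up values.
bus <- data.frame(
  year      = 2015:2024,
  ridership = seq(50, 59, length.out = 10),  # placeholder figures, in millions
  routes    = rep(100L, 10),                 # placeholder count
  region    = "WECA"                         # assumed label
)

# Line chart of annual ridership over time, with a point per year.
ridership_plot <- ggplot(bus, aes(x = year, y = ridership)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = bus$year) +
  labs(
    x = "Year",
    y = "Annual ridership (millions)",
    title = "Bus ridership, 2015-2024 (placeholder values)"
  )

ridership_plot
```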
9.1.2 Area Employment
File: area_employment.csv
Description: Employment rate statistics by WECA local authority (2019-2023)
Columns:
- area - Local authority name (Bath and North East Somerset, Bristol, North Somerset, South Gloucestershire)
- year - Year (2019-2023)
- employment_rate - Proportion employed, 16-64 age group (0-1 scale)
- population - Total population
Use cases:
- Session 2: Filtering, grouping, summarising with dplyr verbs
- Session 2: Bar charts and comparisons with ggplot2
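The two use cases above might look roughly like this. The data frame is a stand-in with the same column names as area_employment.csv, but every number in it is an invented placeholder, and the choice to average the rate over 2022-2023 is only for illustration.

```r
library(dplyr)
library(ggplot2)

# Stand-in for area_employment.csv: same column names, made-up values.
employment <- data.frame(
  area = rep(c("Bath and North East Somerset", "Bristol",
               "North Somerset", "South Gloucestershire"), each = 2),
  year = rep(c(2022L, 2023L), times = 4),
  employment_rate = c(0.70, 0.72, 0.74, 0.76, 0.71, 0.73, 0.78, 0.80),
  population = c(1000, 1000, 3000, 3000, 1500, 1500, 2000, 2000)
)

# Filter to the years of interest, then average the rate within each area.
rate_by_area <- employment %>%
  filter(year >= 2022) %>%
  group_by(area) %>%
  summarise(mean_rate = mean(employment_rate), .groups = "drop")

# Bar chart comparing areas; flipped so long area names stay readable.
rate_plot <- ggplot(rate_by_area, aes(x = area, y = mean_rate)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Mean employment rate (0-1 scale)")

rate_plot
```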
9.2 Real Indicator Data
From Session 3 onwards, analysts bring their own real WECA datasets. These will typically be messier than the example data above – see Data Cleaning Notes below for common issues and solutions.
9.3 Data Cleaning Notes
Common issues to watch for:
- Column naming:
  - Mixed case: “Area Name”, “area_name”, “AREA_NAME”
  - Special characters: “Year (2023)”, “Percentage (%)”
  - Solution: Use janitor::clean_names()
- Missing values:
  - Coded as: “N/A”, “n/a”, “-”, “*”, empty cells
  - Solution: Convert these codes to NA first (is.na() does not recognise text such as “N/A”), then use explicit is.na() checks and replace_na()
- Date formats:
  - “31/12/2023”, “2023-12-31”, “December 2023”
  - Solution: Use as.Date() with appropriate format strings
- Numeric values as text:
  - “1,234”, “£50.00”, “12.5%”
  - Solution: Strip non-numeric characters with str_remove_all(), then convert with as.numeric()
- Excel header rows:
  - Title rows, merged cells, footnotes
  - Solution: Use the skip = parameter in read_excel()
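A minimal sketch of how several of these fixes could be combined in one pipeline is shown below. The table is made up to imitate the issues listed above; the column names, the "Unknown" fill value and the na_codes vector are illustrative assumptions, not part of any real WECA dataset.

```r
library(janitor)
library(dplyr)
library(tidyr)
library(stringr)

# Made-up messy table: awkward names, coded missing values, text numbers.
raw <- data.frame(
  "Area Name"   = c("Bristol", "North Somerset", "n/a"),
  "Report Date" = c("31/12/2023", "31/12/2023", "31/12/2023"),
  "Spend"       = c("£50.00", "1,234", "-"),
  "Rate (2023)" = c("12.5%", "N/A", "*"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Text codes that should be treated as missing (assumed list).
na_codes <- c("N/A", "n/a", "-", "*", "")

clean <- raw %>%
  clean_names() %>%  # names become area_name, report_date, spend, rate_2023
  mutate(
    # Turn the text codes into real NA values before any parsing.
    across(everything(), ~ if_else(.x %in% na_codes, NA_character_, .x)),
    # Day/month/year strings need an explicit format.
    report_date = as.Date(report_date, format = "%d/%m/%Y"),
    # Keep only digits and the decimal point, then convert.
    spend     = as.numeric(str_remove_all(spend, "[^0-9.]")),
    rate_2023 = as.numeric(str_remove_all(rate_2023, "[^0-9.]"))
  )

# Explicit check: number of missing values per column.
missing_counts <- colSums(is.na(clean))

# Fill gaps only where a default is justified; "Unknown" is illustrative.
clean <- clean %>%
  mutate(area_name = replace_na(area_name, "Unknown"))
```

Converting the codes to NA before parsing means as.numeric() has no stray text to convert and raises no coercion warnings. When reading from a file, the same conversion can be done at import time with the na argument of readr::read_csv(), and for Excel sources read_excel(path, skip = n) skips the first n rows before the header row is read; path and n here are placeholders.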
9.4 Dataset Documentation Template
When adding new practice datasets, document them using this template:
### Dataset Name
**File:** `filename.csv`
**Description:** Brief description of what the data contains
**Columns:**
- `column1` - Description
- `column2` - Description
**Source:** Data source and URL
**Date range:** YYYY-YYYY
**Use cases:** Which sessions/exercises use this data
**Known issues:** Any data quality notes