Data journalism methodology
Systematic approaches for finding, analyzing and presenting data in journalism.
Story structure for data journalism
Data journalism framework
The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book The New Precision Journalism, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:
- Make observations / formulate a question
- Research the question / collect, store, and retrieve data
- Formulate a hypothesis
- Test the hypothesis, using both qualitative (interviews, documents) and quantitative (data analysis) methods
- Analyze the results and reduce them to the most important findings
- Present them to the audience
The process is iterative, not sequential: results from testing often send you back to collect more data or revise the hypothesis.
The data story arc
1. The hook (nut graf)
- What's the key finding?
- Why should readers care?
- What's the human impact?
2. The evidence
- Show the data
- Explain the methodology
- Acknowledge limitations
3. The context
- How does this compare to the past?
- How does this compare to elsewhere?
- What's the trend?
4. The human element
- Individual examples that illustrate the data
- Expert interpretation
- Affected voices
5. The implications
- What does this mean going forward?
- What questions remain?
- What actions could result?
6. The methodology box
- Where did the data come from?
- How was it analyzed?
- What are the limitations?
- How can readers explore further?
Methodology documentation template
## How we did this analysis
### Data sources
[List all data sources with links and access dates]
### Time period
[Specify exactly what time period is covered]
### Definitions
[Define key terms and how you operationalized them]
### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]
### Limitations
- [Limitation 1]
- [Limitation 2]
### What we excluded and why
- [Excluded category]: [Reason]
### Verification
[How findings were verified/checked]
### Code and data availability
[Link to GitHub repo if sharing code/data]
### Contact
[How readers can reach you with questions]
Data acquisition
Public data sources
Federal data sources
General:
- Data.gov — Federal open data portal. Many datasets were removed between Feb 2025 and 2026; consult the Harvard LIL Data.gov archive and the Data Rescue Project for preserved copies before assuming anything is still accessible.
- Census Bureau (census.gov) — Demographics, economic data. Many research pages were removed during the 2025 transition; the End of Term Web Archive holds snapshots.
- BLS (bls.gov) — Employment, inflation, wages. Following the 2025 funding lapse, the October 2025 Employment Situation release was canceled and the CPS October 2025 reference period is permanently uncollected. Check revised release dates before relying on series continuity.
- BEA (bea.gov) — GDP, economic accounts.
- FRED / Federal Reserve (fred.stlouisfed.org) — Financial and macroeconomic data; expanded API access through 2026.
- SEC EDGAR — Corporate filings.
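The BLS note above is a reminder that a monthly series can contain holes. The following is a minimal sketch, assuming observations are keyed by `YYYY-MM` period strings, of listing missing months before computing month-over-month changes. The function name and the sample periods are hypothetical and are not drawn from any real release.

```python
def missing_months(periods: list[str]) -> list[str]:
    """Return 'YYYY-MM' periods absent between the first and last observation."""

    def to_index(period: str) -> int:
        # Convert 'YYYY-MM' to a running month count so gaps are easy to spot.
        year, month = period.split("-")
        return int(year) * 12 + (int(month) - 1)

    def to_period(index: int) -> str:
        return f"{index // 12:04d}-{index % 12 + 1:02d}"

    seen = {to_index(p) for p in periods}
    if not seen:
        return []
    return [to_period(i) for i in range(min(seen), max(seen) + 1) if i not in seen]


# Hypothetical series with one uncollected month.
observed = ["2025-08", "2025-09", "2025-11", "2025-12", "2026-01"]
print(missing_months(observed))  # → ['2025-10']
```

A non-empty result is a prompt to check the agency's release calendar and footnotes, not proof of an error: some gaps are permanent and should be disclosed in the methodology box.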
Specific domains:
- EPA (epa.gov/data) — Environmental data. At least 80 climate webpages were removed in Dec 2025, the endangerment finding was repealed Feb 12, 2026, and the Climate Change Indicators site was largely gutted. The Environmental Data & Governance Initiative maintains mirrors.
- FDA / openFDA (open.fda.gov) — Drug approvals, recalls, adverse events.
- CDC WONDER — Health statistics. Many datasets were removed from data.cdc.gov after Jan 2025, partially restored under Doctors for America v. Trump (TRO Feb 11, 2025) but with altered terminology in some returns. The volunteer-run RestoredCDC.org mirrors removed content.
- NHTSA FARS / vPIC APIs — Vehicle safety data.
- DOT — Transportation statistics.
- FEC — Campaign finance; 2025-2026 cycle data live.
- USASpending.gov — Federal contracts and grants; API v2 operational.
Court records:
- CourtListener / RECAP (courtlistener.com) — Free PACER alternative covering federal court filings; RECAP Search Alerts launched June 2025 ("Google Alerts for federal courts").
- PACER — Federal court filings; $0.10 per page, $30 per quarter waiver threshold.
State and local:
- State open data portals (search: "[state] open data")
- Tyler Data & Insights (formerly Socrata, rebranded May 2025) hosts many city and state portals
- OpenStreetMap, municipal GIS portals
- State comptroller and auditor reports
International:
- Eurostat, OECD, World Bank Open Data, UN Data — major comparative datasets, mostly stable through 2026.
Specialized:
- NICAR Data Library (IRE) — curated datasets, IRE members only.
- IPUMS (University of Minnesota) — free with account; canonical for harmonized microdata.
- ICPSR (University of Michigan) — social-science data archive.
- ProPublica Data Store — frozen; datasets only run through 2023.
Federal-data preservation (use when source data has been removed):
- Data Rescue Project — citizen + library mirrors of removed federal data; more than 1,230 datasets across 85 offices as of Aug 2025.
- End of Term Web Archive — 500TB / 100M-page snapshot of federal sites at the 2024-2025 transition.
- Internet Archive Wayback Machine — useful for individual page-level recovery.
Data request strategies
Public records requests for datasets
For request mechanics (templates, fee-waiver language, NJ OPRA, appeals, FOIA Improvement Act statutory citations), see the foia-requests skill. Data-specific guidance:
- Request databases, not just documents
- Ask for the data dictionary or schema
- Request in native format (CSV, SQL dump) — not PDFs or scanned printouts
- Specify field-level needs and any computed columns you want included
- For active datasets, ask about the update cadence (daily, monthly, quarterly) and request standing access if your reporting will continue
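Once an agency delivers both a dataset and its data dictionary, a quick comparison catches fields that were promised but not delivered, or delivered but not documented. This is a minimal sketch, assuming the dictionary has already been reduced to a list of field names; the function name, column names, and sample CSV are hypothetical.

```python
import csv
import io


def compare_to_dictionary(csv_text: str, documented_fields: list[str]) -> dict[str, list[str]]:
    """Compare a CSV header row against the fields listed in a data dictionary."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    delivered = [name.strip() for name in header]
    return {
        "missing_from_data": [f for f in documented_fields if f not in delivered],
        "undocumented_in_data": [f for f in delivered if f not in documented_fields],
    }


# Hypothetical delivery: the dictionary lists a field the export lacks.
sample_csv = "case_id,filed_date,disposition\n1001,2024-03-15,closed\n"
dictionary_fields = ["case_id", "filed_date", "judge_id", "disposition"]
print(compare_to_dictionary(sample_csv, dictionary_fields))
# → {'missing_from_data': ['judge_id'], 'undocumented_in_data': []}
```

A missing field is worth a follow-up request: it may have been withheld under an exemption, which the agency should state explicitly.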
Building your own dataset
- Scraping public information (respect robots.txt, ToS, and rate limits)
- Crowdsourcing from readers
- Systematic document review
- Surveys with documented methodology
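The scraping caveats above can be partly automated. The sketch below uses the standard library's `urllib.robotparser` to filter URLs and pick a delay between requests. The user-agent string, the two-second default delay, the sample robots.txt, and the `fetch` callable are all assumptions for illustration. Terms of service cannot be checked this way and still need to be read.

```python
import time
from typing import Callable
from urllib.robotparser import RobotFileParser

USER_AGENT = "newsroom-research-bot"  # hypothetical; identify yourself honestly


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt permits for our user agent."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def polite_delay(parser: RobotFileParser, default_seconds: float = 2.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default (assumed)."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def crawl(
    parser: RobotFileParser,
    urls: list[str],
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch permitted URLs one at a time, pausing after each request."""
    delay = polite_delay(parser)
    results = {}
    for url in allowed_urls(parser, urls):
        results[url] = fetch(url)
        sleep(delay)
    return results


# Hypothetical robots.txt for an illustrative site.
rules = load_rules("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
candidates = ["https://example.gov/data/a.csv", "https://example.gov/private/b.csv"]
print(allowed_urls(rules, candidates))
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without touching the network. In real use, `fetch` would wrap your HTTP client and send the same user-agent string.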
Commercial data sources for newsrooms
- LexisNexis, Refinitiv, Bloomberg
- Industry-specific databases (often via library proxy through your institution)
Data cleaning and preparation
Common data problems
from typing import Any
import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations
# Inflation adjustment
import cpi
import wbdata
def standardize_name(name: Any) -> str | None:
    """Standardize a name to uppercase 'FIRST LAST' order."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format; anything after a second comma
    # (e.g. a suffix) is dropped
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name
def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None
    # Formats are tried in order, so for ambiguous strings such as
    # 03/04/2024 the month-first reading wins
    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]
    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            continue
    # Fall back to pandas parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None
def handle_missing(
    df: pd.DataFrame,
    thresh: int | None = None,
    per_thresh: float | None = None,
    required_col: str