Data journalism methodology

Systematic approaches for finding, analyzing and presenting data in journalism.

Story structure for data journalism

Data journalism framework

The framework for data journalism was established by Philip Meyer, a journalist for Knight-Ridder, Harvard Nieman Fellow and professor at UNC-Chapel Hill. In his book The New Precision Journalism, Meyer encourages journalists to treat journalism "as if it were a science" by adopting the scientific method:

Make observations / formulate a question
Research the question / collect, store, and retrieve data
Formulate a hypothesis
Test the hypothesis, using both qualitative (interviews, documents) and quantitative (data analysis) methods
Analyze the results and reduce them to the most important findings
Present them to the audience

The process is iterative rather than strictly sequential: results at any step can send you back to revise the question or the hypothesis.

The data story arc

1. The hook (nut graf)

What's the key finding?
Why should readers care?
What's the human impact?

2. The evidence

Show the data
Explain the methodology
Acknowledge limitations

3. The context

How does this compare to the past?
How does this compare to elsewhere?
What's the trend?

4. The human element

Individual examples that illustrate the data
Expert interpretation
Affected voices

5. The implications

What does this mean going forward?
What questions remain?
What actions could result?

6. The methodology box

Where did the data come from?
How was it analyzed?
What are the limitations?
How can readers explore further?
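
The context questions in step 3 usually come down to a few small calculations: percent change against an earlier period, and a rate per 100,000 residents so that places of different size can be compared. A minimal sketch, with hypothetical function names and made-up numbers used only for illustration:

```python
def percent_change(old: float, new: float) -> float:
    """Percent change from old to new; undefined when the baseline is zero."""
    if old == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (new - old) / abs(old) * 100


def rate_per_100k(count: float, population: float) -> float:
    """Events per 100,000 people, for comparing places of different size."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count / population * 100_000


# Made-up figures, for illustration only.
change = percent_change(old=200, new=250)
city_rate = rate_per_100k(count=45, population=150_000)
state_rate = rate_per_100k(count=900, population=6_000_000)
print(f"Change since last year: {change:+.1f}%")
print(f"City rate: {city_rate:.1f} per 100k vs. state {state_rate:.1f}")
```

Raising on a zero baseline is a deliberate choice here: a silent `inf` or `nan` tends to end up in a chart or a headline unnoticed.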

Methodology documentation template

## How we did this analysis

### Data sources
[List all data sources with links and access dates]

### Time period
[Specify exactly what time period is covered]

### Definitions
[Define key terms and how you operationalized them]

### Analysis steps
1. [First step of analysis]
2. [Second step]
3. [Continue...]

### Limitations
- [Limitation 1]
- [Limitation 2]

### What we excluded and why
- [Excluded category]: [Reason]

### Verification
[How findings were verified/checked]

### Code and data availability
[Link to GitHub repo if sharing code/data]

### Contact
[How readers can reach you with questions]

Data acquisition

Public data sources

Federal data sources

General:

Data.gov — Federal open data portal. Many datasets were removed between Feb 2025 and 2026; consult the Harvard LIL Data.gov archive and the Data Rescue Project for preserved copies before assuming anything is still accessible.
Census Bureau (census.gov) — Demographics, economic data. Many research pages were removed during the 2025 transition; the End of Term Web Archive holds snapshots.
BLS (bls.gov) — Employment, inflation, wages. Following the 2025 funding lapse, the October 2025 Employment Situation release was canceled and the CPS October 2025 reference period is permanently uncollected. Check revised release dates before relying on series continuity.
BEA (bea.gov) — GDP, economic accounts.
FRED / Federal Reserve (fred.stlouisfed.org) — Financial and macroeconomic data; expanded API access through 2026.
SEC EDGAR — Corporate filings.
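
The BLS note above is a reminder that a monthly series can contain periods that were never collected. Before computing month-over-month changes, it may help to check for missing months explicitly. A minimal sketch, assuming the series has already been reduced to `YYYY-MM` period labels (the function name and the sample labels are hypothetical):

```python
def missing_months(periods: list[str]) -> list[str]:
    """Return the YYYY-MM labels absent between the first and last period."""
    def to_index(label: str) -> int:
        year, month = label.split("-")
        return int(year) * 12 + (int(month) - 1)

    present = {to_index(p) for p in periods}
    if not present:
        return []
    gaps = []
    for idx in range(min(present), max(present) + 1):
        if idx not in present:
            gaps.append(f"{idx // 12:04d}-{idx % 12 + 1:02d}")
    return gaps


# Illustrative labels only, not real release data.
observed = ["2025-08", "2025-09", "2025-11", "2025-12", "2026-01"]
print(missing_months(observed))
```

Any gap this reports should be explained in the methodology box rather than interpolated silently.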

Specific domains:

EPA (epa.gov/data) — Environmental data. At least 80 climate webpages were removed in Dec 2025, the endangerment finding was repealed Feb 12, 2026, and the Climate Change Indicators site was largely gutted. The Environmental Data & Governance Initiative maintains mirrors.
FDA / openFDA (open.fda.gov) — Drug approvals, recalls, adverse events.
CDC WONDER — Health statistics. Many datasets were removed from data.cdc.gov after Jan 2025, partially restored under Doctors for America v. Trump (TRO Feb 11, 2025) but with altered terminology in some returns. The volunteer-run RestoredCDC.org mirrors removed content.
NHTSA FARS / vPIC APIs — Vehicle safety data.
DOT — Transportation statistics.
FEC — Campaign finance; 2025-2026 cycle data live.
USASpending.gov — Federal contracts and grants; API v2 operational.

Court records:

CourtListener / RECAP (courtlistener.com) — Free PACER alternative covering federal court filings; RECAP Search Alerts launched June 2025 ("Google Alerts for federal courts").
PACER — Federal court filings; $0.10 per page, $30 per quarter waiver threshold.
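
The PACER figures above make it possible to estimate a bill before pulling documents. A rough sketch that uses the $0.10 per page rate and the $30 quarterly waiver threshold from the entry above. The $3.00 per-document cap is an assumption added here and does not apply to every document type, so check it against the current fee schedule. Amounts are kept in cents to avoid floating-point rounding:

```python
PER_PAGE_CENTS = 10             # $0.10 per page (from the entry above)
WAIVER_THRESHOLD_CENTS = 3000   # quarterly charges at or below $30 are waived
DOC_CAP_CENTS = 300             # ASSUMPTION: $3.00 cap per document


def estimate_quarterly_bill(page_counts: list[int]) -> int:
    """Estimated quarterly PACER bill in cents for the given documents."""
    total = sum(min(pages * PER_PAGE_CENTS, DOC_CAP_CENTS) for pages in page_counts)
    return 0 if total <= WAIVER_THRESHOLD_CENTS else total


# Hypothetical page counts for documents pulled in one quarter.
light_quarter = [12, 45, 8]
heavy_quarter = [40] * 12
print(estimate_quarterly_bill(light_quarter) / 100)
print(estimate_quarterly_bill(heavy_quarter) / 100)
```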

State and local:

State open data portals (search: "[state] open data")
Tyler Data & Insights (formerly Socrata, rebranded May 2025) hosts many city and state portals
OpenStreetMap, municipal GIS portals
State comptroller and auditor reports

International:

Eurostat, OECD, World Bank Open Data, UN Data — major comparative datasets, mostly stable through 2026.

Specialized:

NICAR Data Library (IRE) — curated datasets, IRE members only.
IPUMS (University of Minnesota) — free with account; canonical for harmonized microdata.
ICPSR (University of Michigan) — social-science data archive.
ProPublica Data Store — frozen; datasets only run through 2023.

Federal-data preservation (use when source data has been removed):

Data Rescue Project — citizen + library mirrors of removed federal data; more than 1,230 datasets across 85 offices as of Aug 2025.
End of Term Web Archive — 500TB / 100M-page snapshot of federal sites at the 2024-2025 transition.
Internet Archive Wayback Machine — useful for individual page-level recovery.

Data request strategies

Public records requests for datasets

For request mechanics (templates, fee-waiver language, NJ OPRA, appeals, FOIA Improvement Act statutory citations), see the foia-requests skill. Data-specific guidance:

Request databases, not just documents
Ask for the data dictionary or schema
Request in native format (CSV, SQL dump) — not PDFs or scanned printouts
Specify field-level needs and any computed columns you want included
For active datasets, ask how often the data is updated (daily, monthly, quarterly) and request standing access if your reporting will continue
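
Once an agency delivers both the data and its data dictionary, a quick comparison of the two catches missing or undocumented fields before analysis starts. A minimal sketch using only the standard library; the function name, field names and inline CSV are invented for illustration:

```python
import csv
import io


def compare_to_dictionary(csv_text: str, documented_fields: set[str]) -> dict[str, set[str]]:
    """Compare a CSV header row against the fields a data dictionary documents."""
    header = next(csv.reader(io.StringIO(csv_text)))
    delivered = {name.strip() for name in header}
    return {
        "missing_from_data": documented_fields - delivered,
        "undocumented": delivered - documented_fields,
    }


# Invented example: the dictionary promises "amount", the file delivers "amt_adj".
dictionary_fields = {"case_id", "filed_date", "agency", "amount"}
delivered_csv = "case_id,filed_date,agency,amt_adj\n1001,2024-03-01,DPW,2500\n"
report = compare_to_dictionary(delivered_csv, dictionary_fields)
print(report)
```

A non-empty result in either set is a reason to go back to the records custodian before drawing conclusions from the file.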

Building your own dataset

Scraping public information (respect robots.txt, ToS, and rate limits)
Crowdsourcing from readers
Systematic document review
Surveys with documented methodology
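
The scraping caveat above can be enforced in code rather than left to memory. Python's standard `urllib.robotparser` can evaluate robots.txt rules and read a `Crawl-delay` directive. A minimal sketch that parses an inline, made-up robots.txt so it runs offline; the user agent string and the fallback delay are assumptions:

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; in practice, fetch it from the site you are scraping.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "newsroom-research-bot"  # hypothetical identifier
DEFAULT_DELAY_SECONDS = 2.0           # assumed fallback when no Crawl-delay is set


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str) -> float:
    """Seconds to sleep between requests, honoring Crawl-delay when present."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch(USER_AGENT, "https://example.gov/records/2024.csv"))
print(rules.can_fetch(USER_AGENT, "https://example.gov/private/staff.csv"))
print(polite_delay(rules, USER_AGENT))
```

robots.txt is only one of the three constraints listed above; terms of service still have to be read by a person.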

Commercial data sources for newsrooms

LexisNexis, Refinitiv, Bloomberg
Industry-specific databases (often via library proxy through your institution)

Data cleaning and preparation

Common data problems

from typing import Any

import pandas as pd
import numpy as np
from rapidfuzz import fuzz
from itertools import combinations

# Inflation adjustment
import cpi
import wbdata

def standardize_name(name: Any) -> str | None:
    """Standardize name order to 'FIRST LAST' (the result is uppercased)."""
    if pd.isna(name):
        return None
    name = str(name).strip().upper()
    # Handle "LAST, FIRST" format
    if ',' in name:
        parts = name.split(',')
        name = f"{parts[1].strip()} {parts[0].strip()}"
    return name

def parse_date(date_str: Any) -> pd.Timestamp | None:
    """Parse dates in various formats."""
    if pd.isna(date_str):
        return None

    formats = [
        '%m/%d/%Y', '%Y-%m-%d', '%B %d, %Y',
        '%d-%b-%y', '%m-%d-%Y', '%Y/%m/%d'
    ]

    for fmt in formats:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except (ValueError, TypeError):
            # Format did not match; try the next one
            continue

    # Fall back to pandas' general parser
    try:
        return pd.to_datetime(date_str)
    except (ValueError, TypeError):
        return None


def handle_missing(
    df: pd.DataFrame,
    thresh: int | None = None,
    per_thresh: float | None = None,
    required_col: str
