Personal Intelligence — Build a Topic Feed from Scratch
This skill captures the full architecture and hard-won lessons from NeuralField, an AI+Sports/Gaming intelligence system, and makes them reusable for any topic. Each section explains not just what to do, but why it matters.
1. Define Your Intelligence Domain
Before touching code, get crisp on what you're tracking. Vague domains create noisy feeds.
Answer three questions:
- What is the intersection? The sharpest feeds track the intersection of two things: not "AI" (too broad), not "sports" (too broad), but "AI applied to sports." Think: [emerging force] × [domain you care about].
- Who are the 5-10 publications that would publish a perfect article for this feed? These become your direct RSS sources. If you can't name them, the domain isn't focused enough yet.
- What would a false positive look like? That is, a story you'd find in the feed that doesn't belong. This shapes your filter logic later. NeuralField's false positive was "police recruiting" articles: AI-related but not sports-related.
Topic template:
Domain: [primary subject area]
Intersection: [what's new or evolving within it]
Ideal sources: [list 5-10 publications]
False positive example: [what would be off-topic but sound on-topic]
Categories: [3-7 sub-groupings of articles, e.g. "Analytics", "Industry", "Performance"]
2. Build Your Feed Inventory
A good feed combines two complementary source types. Use both.
Direct RSS Feeds
Curated, high-confidence sources. These are known publications you trust. Each article is almost certainly on-domain — the question is just whether it's relevant enough to show.
{"url": "https://frontofficesports.com/feed/", "category": "Industry", "source_name": "Front Office Sports"},
{"url": "https://sportstechtoday.com/feed/", "category": "Industry", "source_name": "Sports Tech Today"},
{"url": "https://aws.amazon.com/blogs/machine-learning/tag/sports/feed/", "category": "Analytics", "source_name": "AWS"},
Finding direct RSS feeds: Try appending /feed/, /rss, /feed.xml, or /rss.xml to the publication's base URL; many sites expose a feed at one of these paths. Tools like rss.app or fetchrss.com can generate feeds from sites without native RSS.
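The path-guessing step above can be sketched as a small helper that lists the URLs worth probing. This is a minimal illustration, not NeuralField's code: the function name `candidate_feed_urls` is hypothetical, and the helper only builds the candidate list; checking which one actually returns a feed is left to your fetcher.

```python
from urllib.parse import urlsplit, urlunsplit

# The four common feed paths named in the text above.
FEED_SUFFIXES = ["/feed/", "/rss", "/feed.xml", "/rss.xml"]


def candidate_feed_urls(site_url: str) -> list[str]:
    """Return feed URLs worth probing for a publication's base URL."""
    parts = urlsplit(site_url)
    base_path = parts.path.rstrip("/")  # avoid a double slash before the suffix
    return [
        urlunsplit((parts.scheme, parts.netloc, base_path + suffix, "", ""))
        for suffix in FEED_SUFFIXES
    ]


for url in candidate_feed_urls("https://frontofficesports.com/"):
    print(url)
```

Probing the candidates in order and keeping the first one that parses as a feed is usually enough for a hand-curated source list.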
Google News Search Feeds
Broader coverage that surfaces stories from anywhere on the web, including smaller outlets.
https://news.google.com/rss/search?q={keywords}&hl=en-US&gl=US&ceid=US:en
Keyword strategy:
- Use 2-4 terms, not 1 (too broad) and not 8 (too narrow)
- Include year for freshness: AI+game+developer+tools+2026
- Run multiple queries covering different angles of your topic
- Rotate queries slightly to avoid Google News caching stale results
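As a rough sketch of the keyword strategy, the helper below builds a search-feed URL from the template shown earlier. The function name `google_news_feed_url` and its `year` parameter are hypothetical; the 2-4 term check simply encodes the guideline from the list and treats the year as an extra term.

```python
from typing import Optional
from urllib.parse import urlencode

GOOGLE_NEWS_SEARCH = "https://news.google.com/rss/search"


def google_news_feed_url(keywords: list[str], year: Optional[int] = None) -> str:
    """Build a Google News RSS search URL from 2-4 keywords plus an optional year."""
    if not 2 <= len(keywords) <= 4:
        raise ValueError("use 2-4 keywords per query")
    terms = list(keywords)
    if year is not None:
        terms.append(str(year))  # year appended for freshness
    params = {"q": " ".join(terms), "hl": "en-US", "gl": "US", "ceid": "US:en"}
    # safe=":" keeps ceid=US:en readable, matching the template above.
    return f"{GOOGLE_NEWS_SEARCH}?{urlencode(params, safe=':')}"


print(google_news_feed_url(["AI", "game", "developer", "tools"], year=2026))
```

Generating several URLs from different keyword lists covers the "multiple angles" advice without hand-editing query strings.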
Warning on Google News links: Google News returns redirect URLs (e.g. https://news.google.com/rss/articles/...). These must be resolved to the real article URL at ingest time by following the HTTP redirect. Otherwise the same article surfaced by different queries arrives under different URLs and ends up stored as duplicates.
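A minimal sketch of redirect resolution using only the standard library follows. It assumes the link resolves through ordinary HTTP redirects, which `urllib.request.urlopen` follows by default; the function name `resolve_url` and the injectable `opener` parameter are hypothetical, and the fallback to the original URL is a design choice rather than something the source specifies.

```python
import urllib.request


def resolve_url(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> str:
    """Follow HTTP redirects and return the final URL, or the input on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            # After redirects, the response reports the URL it actually came from.
            return response.url
    except (OSError, ValueError):
        # Network errors or malformed URLs: keep the original so ingest can continue.
        return url
```

Because a failed lookup returns the unresolved link, it is worth logging those cases; title-based dedup then acts as the backstop for anything that stays unresolved.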
Source Authority Scoring
Assign trust scores to known publications — this is used in the final ranking formula:
SOURCE_AUTHORITY = {
    # Tier 1: major outlets (score 10)
    "ESPN": 10, "Reuters": 10, "Bloomberg": 10, "The Guardian": 10,
    # Tier 2: strong domain-specific (score 8)
    "TechCrunch": 8, "Wired": 8, "VentureBeat": 8,
    # Tier 3: solid niche (score 6)
    "Axios": 6, "Business Insider": 6,
    # Tier 4: domain specialists (score 4)
    "your-niche-publication.com": 4,
}
DEFAULT_SOURCE_SCORE = 3  # unknown sources
3. The Filtering Architecture
Filtering is the hardest part to get right. The goal is to eliminate noise without losing signal. NeuralField uses a three-layer approach, applied both before and after ingestion.
Why run filters both before and after fetching?
A critical lesson: if you only filter before fetching, the fetch step re-ingests articles you already removed. Always run cleanup after fetch_feeds() completes too. See the pipeline ordering section for the exact pattern.
Layer 1: URL/Title Blocklist
Specific articles or sources you know are off-topic. The most surgical tool.
URL_BLOCKLIST = [
    ("sportico.com/law/", "legal section, rarely AI-relevant"),
    ("example.com/ads/", "promotional content"),
]
TITLE_BLOCKLIST = [
    ("College Athlete Feedback Site", "confirmed off-topic"),
]
Use title-based blocking as a belt-and-suspenders measure for articles that may be fetched via a redirect URL that doesn't contain the original domain pattern.
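One way to apply both blocklists is a single check that returns the reason an article was blocked. This is a sketch under assumptions: the function name `blocklist_reason` is hypothetical, and case-insensitive substring matching is an illustrative choice.

```python
from typing import Optional

URL_BLOCKLIST = [
    ("sportico.com/law/", "legal section, rarely AI-relevant"),
    ("example.com/ads/", "promotional content"),
]
TITLE_BLOCKLIST = [
    ("College Athlete Feedback Site", "confirmed off-topic"),
]


def blocklist_reason(url: str, title: str) -> Optional[str]:
    """Return the blocklist reason for an article, or None if it is allowed."""
    url_lower, title_lower = url.lower(), title.lower()
    for pattern, reason in URL_BLOCKLIST:
        if pattern.lower() in url_lower:
            return reason
    # Title check catches articles whose URL is an unresolved redirect.
    for pattern, reason in TITLE_BLOCKLIST:
        if pattern.lower() in title_lower:
            return reason
    return None
```

Returning the reason rather than a bare boolean makes it easy to log why an article was dropped when you audit the filter later.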
Layer 2: Domain Gate (Keyword Filter)
Require articles to contain at least one keyword from your domain's vocabulary. This kills the "AI in healthcare" articles that slip into a sports AI feed.
SPORTS_GATE_TERMS = [
    "sport", "athlete", "game", "team", "player", "coach", "league",
    "nba", "nfl", "nhl", "mlb", "fifa", "esport", "stadium", "match",
    "tournament", "championship", "olympics",
]
GAMING_GATE_TERMS = [
    "video game", "game dev", "gaming", "npc", "game engine", "unity",
    "unreal", "steam", "playstation", "xbox", "nintendo", "esport",
]
# Article must match at least one gate term to survive
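A minimal sketch of the gate check, assuming simple case-insensitive substring matching over the title and description. The function name `passes_domain_gate` is hypothetical and the term lists are abbreviated for space.

```python
# Abbreviated versions of the gate lists above.
SPORTS_GATE_TERMS = ["sport", "athlete", "league", "nba", "nfl", "stadium", "tournament"]
GAMING_GATE_TERMS = ["video game", "gaming", "game engine", "playstation", "xbox", "esport"]


def passes_domain_gate(title: str, description: str = "") -> bool:
    """True if the article text contains at least one domain gate term."""
    text = f"{title} {description}".lower()
    return any(term in text for term in SPORTS_GATE_TERMS + GAMING_GATE_TERMS)


print(passes_domain_gate("NBA teams adopt tracking cameras"))
print(passes_domain_gate("Hospital deploys machine learning triage"))
```

Substring matching is cheap but loose: "sport" also matches "transport", and short terms like "game" or "match" match many unrelated words. Word-boundary matching is a stricter alternative if the gate lets too much through.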
Layer 3: Primary Topic Gate
If your feed tracks an intersection, both halves need to be present. For AI+Sports, an article must contain AI vocabulary AND sports vocabulary. A pure sports article with no AI content doesn't belong.
For a feed about "AI in finance," an article about bond yields with no ML mentions doesn't belong. For "climate policy," an article about policy with no climate terms doesn't belong.
AI_GATE_TERMS = [
    "artificial intelligence", "machine learning", " ai ", "ai-powered",
    "neural network", "deep learning", "large language model", "llm",
    "generative ai", "computer vision", "natural language",
]
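The intersection rule can be sketched as a check that requires a hit from both vocabularies. The names `passes_topic_gate` and `DOMAIN_TERMS` are hypothetical, and the domain list is abbreviated. Punctuation is converted to spaces (hyphens are kept) so that the space-padded term " ai " still matches "AI," or a title that starts or ends with "AI"; that normalisation is an assumption added here, not something the source describes.

```python
import re

AI_GATE_TERMS = [
    "artificial intelligence", "machine learning", " ai ", "ai-powered",
    "neural network", "deep learning", "large language model", "llm",
    "generative ai", "computer vision", "natural language",
]
# Abbreviated domain vocabulary for illustration.
DOMAIN_TERMS = ["sport", "athlete", "league", "nba", "nfl", "stadium", "tournament"]


def _normalise(title: str, description: str) -> str:
    """Lowercase, pad with spaces, and turn punctuation (except '-') into spaces."""
    text = f" {title} {description} ".lower()
    return re.sub(r"[^a-z0-9\-]+", " ", text)


def passes_topic_gate(title: str, description: str = "") -> bool:
    """True only if the text has both AI vocabulary and domain vocabulary."""
    text = _normalise(title, description)
    has_ai = any(term in text for term in AI_GATE_TERMS)
    has_domain = any(term in text for term in DOMAIN_TERMS)
    return has_ai and has_domain
```

Swapping `AI_GATE_TERMS` and `DOMAIN_TERMS` for your own two vocabularies adapts the same check to any intersection feed.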
Categorisation
Map articles to sub-categories so users can filter by interest. Categories should reflect different angles on your topic, not just different keywords. For AI+Sports, the angles are: Industry (business news), Analytics (data science), Performance (athlete tech), Officiating (computer vision), Esports (gaming).
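The source does not say how articles are assigned to categories, so the following is only one plausible approach: a keyword map checked in priority order, with a fallback category. The keyword lists, the name `categorise`, and the choice of "Industry" as the default are all assumptions for illustration.

```python
# Hypothetical keyword map; order matters because the first match wins.
CATEGORY_KEYWORDS = {
    "Officiating": ["referee", "officiating", "offside"],
    "Performance": ["injury", "training load", "wearable", "recovery"],
    "Analytics": ["analytics", "predictive model", "tracking data"],
    "Esports": ["esport", "gaming", "video game"],
}
DEFAULT_CATEGORY = "Industry"  # assumed catch-all for business news


def categorise(title: str, description: str = "") -> str:
    """Return the first category whose keywords appear in the article text."""
    text = f"{title} {description}".lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return DEFAULT_CATEGORY
```

Putting the narrowest categories first keeps a broad word such as "analytics" from claiming an article that is really about officiating.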
4. Scoring and Ranking
Every article gets a numerical score. Higher = more prominent placement. The formula:
score = (
    source_authority_score
    + recency_bonus             # +3.0 if < 1 day old, +2.0 if < 2 days, tapering
    + keyword_relevance         # count of domain-specific "prestige" terms × 0.3
    + description_length_bonus  # small bonus for articles with full summaries
)
Recency decay matters because intelligence feeds are about what's happening now, not what was important two weeks ago. Implement a rolling window (7 days is typical) — articles older than 7 days are dropped from the live feed but preserved in the archive.
Keyword relevance — define a list of "prestige terms" for your domain that signal high-quality, on-topic coverage. For AI+Sports: terms like "computer vision," "predictive model," "performance analytics" score higher than just "AI" which is now generic.
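The formula can be sketched end to end as follows. Several values are assumptions, because the source only gives the first two recency steps and the 0.3 multiplier: the linear taper to zero at day 7, the 0.5 description bonus, and its 200-character cut-off are illustrative, and the authority table is abbreviated.

```python
from datetime import datetime, timedelta, timezone

SOURCE_AUTHORITY = {"ESPN": 10, "TechCrunch": 8, "Axios": 6}  # abbreviated
DEFAULT_SOURCE_SCORE = 3
PRESTIGE_TERMS = ["computer vision", "predictive model", "performance analytics"]
WINDOW_DAYS = 7


def recency_bonus(age_days: float) -> float:
    """+3.0 under one day, +2.0 under two, then an assumed linear taper to 0 at day 7."""
    if age_days < 1:
        return 3.0
    if age_days < 2:
        return 2.0
    return max(0.0, 2.0 * (WINDOW_DAYS - age_days) / (WINDOW_DAYS - 2))


def score_article(source: str, title: str, description: str,
                  published: datetime, now: datetime) -> float:
    """Combine authority, recency, prestige-term count and summary length."""
    age_days = (now - published).total_seconds() / 86400
    text = f"{title} {description}".lower()
    keyword_relevance = 0.3 * sum(term in text for term in PRESTIGE_TERMS)
    length_bonus = 0.5 if len(description) >= 200 else 0.0  # assumed values
    authority = SOURCE_AUTHORITY.get(source, DEFAULT_SOURCE_SCORE)
    return authority + recency_bonus(age_days) + keyword_relevance + length_bonus


now = datetime.now(timezone.utc)
print(score_article("ESPN", "Computer vision tracks every pass", "",
                    now - timedelta(hours=12), now))
```

Because authority ranges up to 10 while the other terms add only a few points, source trust dominates the ranking; rescale the bonuses if fresh stories from small outlets should be able to outrank older major-outlet pieces.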
5. Deduplication
The same story will appear across multiple RSS feeds. Without dedup, your feed fills with 5 copies of the same announcement.
TF-IDF title similarity: compute the cosine similarity between TF-IDF vectors of article titles. Pairs scoring above a threshold of roughly 0.65-0.70 are treated as duplicates; keep the one with the higher source authority score.
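A dependency-free sketch of title dedup is shown below. It assumes a smoothed IDF and a greedy pass that visits articles from highest to lowest authority; a library vectoriser would work equally well. The function names and the `(title, authority)` tuple format are hypothetical.

```python
import math
import re
from collections import Counter


def _tokens(title: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", title.lower())


def tfidf_vectors(titles: list[str]) -> list[dict[str, float]]:
    """Build sparse TF-IDF vectors (token -> weight) for a batch of titles."""
    docs = [Counter(_tokens(title)) for title in titles]
    n = len(docs)
    doc_freq = Counter(token for doc in docs for token in doc)
    # Smoothed IDF keeps a non-zero weight for tokens present in every title.
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in doc.items()} for doc in docs]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def dedupe_by_title(articles: list[tuple[str, int]], threshold: float = 0.65) -> list[int]:
    """Return indices of articles to keep; each article is (title, authority)."""
    vectors = tfidf_vectors([title for title, _ in articles])
    by_authority = sorted(range(len(articles)), key=lambda i: -articles[i][1])
    kept: list[int] = []
    for i in by_authority:
        # Keep an article only if it is not too similar to a stronger one.
        if all(cosine(vectors[i], vectors[j]) < threshold for j in kept):
            kept.append(i)
    return sorted(kept)
```

The pairwise comparison is quadratic in the number of titles, which is fine for a personal feed of a few hundred articles per run.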
URL normalization — strip UTM parameters, ?ref=... suffixes, and trailing slashes before hashing. Two URLs pointing to the same article should hash identically.
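URL normalization might look like the sketch below, using only `urllib.parse` and `hashlib`. Lowercasing the host and dropping the fragment go slightly beyond what the text lists and are labelled assumptions; the names `normalize_url`, `url_hash`, and `TRACKING_PARAMS` are hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"ref"}  # exact-match parameters to drop, besides utm_*


def normalize_url(url: str) -> str:
    """Strip tracking parameters and trailing slashes so equal articles compare equal."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_") and key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/")
    # Assumption: host case and #fragments never identify a different article.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def url_hash(url: str) -> str:
    """Stable hash of the normalized URL, usable as a dedup key."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

Storing `url_hash` in a uniquely indexed column lets the database reject repeat inserts on its own.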
Domain dedup — limit to N articles per source domain per day (typically 2-3). This prevents one prolific publisher from dominating the feed.
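The per-domain cap can be sketched as a greedy pass over articles sorted by score. The dict keys `url`, `published_date`, and `score`, and the name `cap_per_domain`, are hypothetical; stripping a leading "www." is an added assumption so that both host forms count as one source.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def cap_per_domain(articles: list[dict], max_per_day: int = 3) -> list[dict]:
    """Keep at most `max_per_day` articles per (domain, day), best-scored first."""
    counts: dict[tuple[str, str], int] = defaultdict(int)
    kept = []
    for article in sorted(articles, key=lambda a: a["score"], reverse=True):
        domain = urlsplit(article["url"]).netloc.lower().removeprefix("www.")
        key = (domain, article["published_date"])  # e.g. ("espn.com", "2026-01-10")
        if counts[key] < max_per_day:
            counts[key] += 1
            kept.append(article)
    return kept
```

Sorting by score first means the cap removes a publisher's weakest stories rather than whichever happened to be fetched last.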
6. Database Schema
SQLite works well for a personal intelligence feed. Two tables:
CREATE TABLE articles (
id INTEGER PRIMARY KEY,