FPGA Hardware Design Guide
A practical FPGA hardware design guide based on real-world project experience.
Core Design Philosophy
1. Pipeline Architecture First
When processing high-speed data streams (video, network packets), adopt multi-stage pipeline design:
- Single-stage processing: Combinational logic delay too large, prone to timing violations
- Multi-stage pipeline: Insert registers at each stage, distribute delay, increase clock frequency
- Typical applications: RGB-to-YUV conversion, image filtering, protocol parsing
Real Case: RGB-to-YUV converter with 5-stage pipeline
- Stage 0: Input register (synchronize input signals)
- Stage 1: Multiply operation (coefficient * pixel value)
- Stage 2: Partial accumulation (R*coef_r + G*coef_g)
- Stage 3: Final accumulation (+ B*coef_b)
- Stage 4: Shift and saturation (truncate result to 8-bit)
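The five stages above can be sketched for the luma (Y) channel as follows. This is a minimal illustration, not the original project's code: the module name, port names, and the BT.601-style coefficients scaled by 256 (77, 150, 29) are assumptions, and reset and data-valid signalling are left out for brevity.

```verilog
// Sketch of a 5-stage RGB-to-Y pipeline (assumed coefficients: 77/150/29 over 256).
module rgb2y_pipe (
    input  wire       clk,
    input  wire [7:0] r_in,
    input  wire [7:0] g_in,
    input  wire [7:0] b_in,
    output reg  [7:0] y_out
);
    // 9-bit signed coefficients (assumed values; they sum to 256)
    localparam signed [8:0] COEF_R = 9'sd77;
    localparam signed [8:0] COEF_G = 9'sd150;
    localparam signed [8:0] COEF_B = 9'sd29;

    reg        [7:0]  r0, g0, b0;      // Stage 0: input registers
    reg signed [17:0] pr, pg, pb;      // Stage 1: 18-bit products
    reg signed [19:0] sum_rg, pb_d;    // Stage 2: partial sum, delayed B product
    reg signed [19:0] sum;             // Stage 3: full 20-bit sum

    // Arithmetic right shift by 8: keep the upper 12 bits of the sum
    wire signed [11:0] shifted = sum[19:8];

    always @(posedge clk) begin
        // Stage 0: register the inputs
        r0 <= r_in;
        g0 <= g_in;
        b0 <= b_in;

        // Stage 1: zero-extend each pixel to 9-bit signed, then multiply
        pr <= $signed({1'b0, r0}) * COEF_R;
        pg <= $signed({1'b0, g0}) * COEF_G;
        pb <= $signed({1'b0, b0}) * COEF_B;

        // Stage 2: partial accumulation; delay pb so it stays aligned
        sum_rg <= pr + pg;
        pb_d   <= pb;

        // Stage 3: final accumulation
        sum <= sum_rg + pb_d;

        // Stage 4: saturate the shifted result to 0..255
        if (shifted < 0)
            y_out <= 8'd0;
        else if (shifted > 255)
            y_out <= 8'd255;
        else
            y_out <= shifted[7:0];
    end
endmodule
```

The result appears on y_out five clock cycles after the inputs are presented. With these all-positive coefficients the saturation branches never fire; they matter for U and V channels, where coefficients are negative and an offset is added.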
2. The Art of Bit-Width Management
Bit-width calculation principles:
Multiplication bit-width = input bit-width + coefficient bit-width + 1 (sign bit added when an unsigned input is zero-extended to signed)
Accumulation bit-width = multiplication bit-width + log2(number of additions) + 1 (guard bit)
Rules of thumb:
- 8-bit unsigned * 9-bit signed coefficient = 18-bit signed result
- 3 numbers of 18-bit addition = 20-bit (leave 2 guard bits to prevent overflow)
- After right-shifting by 8 bits and saturating to the range 0..255 = 8-bit final result
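The width arithmetic above can be checked in simulation with a few parameters. This is a small sketch with hypothetical parameter names; it sizes the accumulator as the product width plus ceil(log2(number of terms)), which for three terms gives the same 20 bits as the rule of thumb.

```verilog
// Simulation-only sketch: derive the widths used in the rules of thumb.
module width_calc_demo;
    localparam IN_W    = 8;  // unsigned pixel width
    localparam COEF_W  = 9;  // signed coefficient width
    localparam N_TERMS = 3;  // number of products being summed

    // Unsigned input is zero-extended by one bit before the signed multiply
    localparam PROD_W = IN_W + COEF_W + 1;          // 18
    // Summing N terms can grow the result by ceil(log2(N)) bits
    localparam SUM_W  = PROD_W + $clog2(N_TERMS);   // 20

    initial begin
        $display("PROD_W=%0d SUM_W=%0d", PROD_W, SUM_W);
        $finish;
    end
endmodule
```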
3. Iron Rules of Synchronous Design
Rules that must be followed:
- All flip-flops use the same clock domain (unless CDC is explicitly needed)
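- Synchronous reset preferred over asynchronous reset (an asynchronous reset released close to a clock edge can drive flip-flops metastable)
- Asynchronous inputs (cross-clock-domain or external signals) must pass through two registers in the destination clock domain before use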
- Combinational logic outputs must be registered (avoid glitch propagation)
Lessons learned:
- Asynchronous reset leads to unpredictable behavior when clock is unstable
- Unregistered combinational outputs may produce glitches after place-and-route
- Direct use of cross-clock domain signals causes metastability
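A minimal sketch of the two-register rule for a single-bit asynchronous input is shown below. The module and port names are hypothetical; ASYNC_REG is a Vivado synthesis attribute that asks the tool to keep the two flip-flops close together. This pattern is only safe for single-bit signals; multi-bit buses need Gray coding, a handshake, or an asynchronous FIFO.

```verilog
// Two-flop synchronizer for one asynchronous bit, with synchronous reset.
module sync_2ff (
    input  wire clk,       // destination clock
    input  wire rst,       // synchronous reset, active high
    input  wire async_in,  // signal from another clock domain or a pin
    output wire sync_out   // safe to use in the clk domain
);
    (* ASYNC_REG = "TRUE" *) reg [1:0] sync_ff;

    always @(posedge clk) begin
        if (rst)
            sync_ff <= 2'b00;
        else
            sync_ff <= {sync_ff[0], async_in};  // shift the input through two FFs
    end

    assign sync_out = sync_ff[1];
endmodule
```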
Timing Closure Practical Techniques
Delay Analysis and Optimization
Identifying critical paths:
- Check Worst Negative Slack (WNS) in synthesis report
- Analyze Total Negative Slack (TNS) distribution
- Locate logic levels with maximum delay
Optimization strategies:
- Insert pipeline registers (most effective)
  - Insert FF in the middle of combinational logic
  - Each stage delay < 70% of target clock period
- Logic retiming
  - Use set_property RETIMING true
  - Let tool automatically move register positions
- Critical signal optimization
  - Use set_property HIGH_PRIORITY true for critical paths
  - Manual placement for critical modules: set_property LOC ...
Real data:
- Original design: critical path 15ns, target 10ns (not met)
- After inserting 2 pipeline stages: critical path 7ns (met + 30% margin)
- Latency cost: 2 clock cycles (acceptable)
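As an illustration of inserting a register in the middle of a combinational path, the sketch below splits a four-input adder into two stages. It is a generic example with hypothetical names, not the design measured above.

```verilog
// Four-input adder split into two pipeline stages.
// Each stage contains one adder level instead of two back to back.
module add4_pipelined (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [15:0] c,
    input  wire [15:0] d,
    output reg  [17:0] sum
);
    reg [16:0] ab, cd;  // one extra bit per pairwise sum

    always @(posedge clk) begin
        // Stage 1: two independent pairwise sums
        ab <= a + b;
        cd <= c + d;
        // Stage 2: combine the registered partial sums
        sum <= ab + cd;
    end
endmodule
```

The output is valid two clock cycles after the inputs, which is the same kind of latency cost noted in the data above.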
Resource Optimization Strategies
LUT Optimization
Methods to reduce LUT usage:
- Use case statements instead of if-else chains (more efficient LUT synthesis)
- Avoid complex nested ternary operators
- Use DSP Slices instead of LUTs for multiplication
Comparison example:
// Inefficient: nested if-else (infers priority logic)
if (condition1) out = a;
else if (condition2) out = b;
else if (condition3) out = c;
else out = d; // final else avoids an inferred latch
// Uses ~20 LUTs
// Efficient: case statement (assumes at most one condition is true at a time)
case ({condition1, condition2, condition3})
    3'b100: out = a;
    3'b010: out = b;
    3'b001: out = c;
    default: out = d;
endcase
// Uses ~8 LUTs
BRAM Usage Techniques
When to use BRAM:
- Storage depth > 16 (typically)
- Dual-port access required
- Large lookup tables (>1KB)
When to use distributed RAM:
- Small storage (depth of 16 or less)
- Asynchronous read needed
- Save BRAM resources
Code example:
// Automatically inferred as BRAM (8 Kbits fits well within one 36Kb block)
reg [7:0] mem [0:1023]; // 8 Kbits
reg [7:0] dout;
always @(posedge clk) begin
    if (we) mem[addr] <= din;
    dout <= mem[addr]; // Synchronous read (returns the old data on a same-address write)
end
// Small capacity automatically uses LUTRAM
reg [7:0] small_mem [0:15]; // 128 bits
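When the tool's automatic choice is not the one wanted, the memory type can be requested explicitly. The sketch below assumes Vivado synthesis, whose ram_style attribute accepts "block" and "distributed"; the module and port names are hypothetical.

```verilog
// One block RAM with synchronous read and one LUTRAM with asynchronous read.
module ram_styles (
    input  wire       clk,
    // 1024 x 8 memory, synchronous read
    input  wire       we,
    input  wire [9:0] addr,
    input  wire [7:0] din,
    output reg  [7:0] dout,
    // 16 x 8 memory, asynchronous read
    input  wire       small_we,
    input  wire [3:0] small_addr,
    input  wire [7:0] small_din,
    output wire [7:0] small_dout
);
    (* ram_style = "block" *) reg [7:0] mem [0:1023];

    always @(posedge clk) begin
        if (we) mem[addr] <= din;
        dout <= mem[addr];  // registered read, required for BRAM inference
    end

    (* ram_style = "distributed" *) reg [7:0] small_mem [0:15];

    always @(posedge clk) begin
        if (small_we) small_mem[small_addr] <= small_din;
    end

    // Combinational (asynchronous) read, which only LUTRAM supports
    assign small_dout = small_mem[small_addr];
endmodule
```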
DSP Slice Optimization
Fully utilize DSP48E1:
- 25×18 signed (two's complement) multiplier; unsigned operands fit after zero-extension (up to 24×17)
- 48-bit accumulator
- Pre-adder (for symmetric FIR filters)
Avoid DSP waste:
- Don't use DSP for small multiplications (<8bit), LUTs are more efficient
- Use dedicated routing (ACIN/ACOUT) when cascading DSPs
- Use CE and SCLR controls to save power
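The pre-adder, multiplier, and accumulator can usually be inferred from plain RTL when the widths match the slice. The following is a single-tap sketch for a symmetric FIR filter; the module and port names are hypothetical, and use_dsp is a Vivado synthesis attribute (other tools use different names). A full filter would also need the sample and partial-sum delays balanced between taps.

```verilog
// One symmetric FIR tap: (x_a + x_d) * coef + p_in, written to map onto a DSP slice.
(* use_dsp = "yes" *)
module fir_tap_preadd (
    input  wire               clk,
    input  wire signed [23:0] x_a,    // sample sharing this coefficient
    input  wire signed [23:0] x_d,    // its mirror-image sample
    input  wire signed [17:0] coef,   // filter coefficient
    input  wire signed [47:0] p_in,   // partial sum from the previous tap
    output reg  signed [47:0] p_out   // partial sum to the next tap
);
    reg signed [24:0] pre;    // pre-adder result: 24 + 1 bits
    reg signed [42:0] prod;   // 25 x 18 product

    always @(posedge clk) begin
        pre   <= x_a + x_d;     // pre-adder stage
        prod  <= pre * coef;    // multiplier stage
        p_out <= p_in + prod;   // 48-bit accumulate stage
    end
endmodule
```

Keeping the samples at 24 bits leaves room for the pre-adder's carry within the 25-bit multiplier input.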
Debugging and Verification Methods
Simulation Strategy
Three-level verification system:
- Behavioral simulation (pre-synthesis)
  - Verify algorithm correctness
  - Use ideal delay models
- Post-synthesis simulation
  - Verify synthesis result functionality
  - Check rough timing estimates
- Post-implementation simulation
  - Include actual routing delays
  - Closest to real hardware
Testbench writing essentials:
// 1. Self-checking test
initial begin
    // Apply stimulus
    apply_stimulus();
    // Wait for processing
    repeat (10) @(posedge clk);
    // Check results (!== also catches X and Z values)
    if (dout !== expected) begin
        $error("Test failed! Expected %h, got %h", expected, dout);
        $finish;
    end
    $display("Test passed!");
    $finish;
end
// 2. Coverage check
covergroup cg @(posedge clk);
    coverpoint state {
        bins idle = {IDLE};
        bins busy = {BUSY};
        bins done = {DONE};
    }
endgroup
cg cg_inst = new(); // a covergroup samples nothing until it is instantiated
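For reference, a complete self-checking testbench in the same spirit might look like the sketch below. The device under test is a hypothetical registered incrementer written inline so the example is self-contained; in a real project it would be an instance of the actual module, and check would replace the apply_stimulus placeholder.

```verilog
`timescale 1ns/1ps
// Self-contained, self-checking testbench sketch.
module tb_reg_inc;
    reg       clk = 1'b0;
    reg [7:0] din;
    reg [7:0] dout;

    // 100 MHz clock
    always #5 clk = ~clk;

    // Hypothetical DUT: registered increment (stands in for a real instance)
    always @(posedge clk) dout <= din + 8'd1;

    // Drive one value, wait for it to be registered, then compare
    task check(input [7:0] value);
        begin
            din = value;       // driven on the falling edge, away from the sampling edge
            @(posedge clk);    // DUT captures din
            @(negedge clk);    // dout is stable by now
            if (dout !== value + 8'd1) begin
                $error("Test failed! din=%h, got %h", value, dout);
                $finish;
            end
        end
    endtask

    initial begin
        @(negedge clk);
        check(8'h00);
        check(8'h7F);
        check(8'hFF);  // wraps around to 8'h00
        $display("Test passed!");
        $finish;
    end
endmodule
```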
On-board Debugging Techniques
Using ILA (Integrated Logic Analyzer):
- Mark critical signals as mark_debug
- Set trigger conditions (e.g., error flags, specific states)
- Capture data to Vivado for analysis
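In RTL, marking a signal for ILA probing is done with a synthesis attribute on its declaration. The sketch below assumes Vivado's mark_debug attribute; the module and signal names are hypothetical and loosely follow the multiplier stage discussed in this guide.

```verilog
// Multiplier stage with its intermediate values marked for ILA probing.
module mult_with_probes (
    input  wire              clk,
    input  wire        [7:0] pixel,
    input  wire signed [8:0] coef,
    (* mark_debug = "true" *) output reg signed [17:0] product
);
    // The sign-extended operand is marked too, so extension errors are visible
    (* mark_debug = "true" *) reg signed [8:0] pixel_ext;

    always @(posedge clk) begin
        pixel_ext <= $signed({1'b0, pixel});  // zero-extend, then treat as signed
        product   <= pixel_ext * coef;
    end
endmodule
```

Marked signals are preserved through synthesis and can then be connected to an ILA core in the debug setup flow.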
Using VIO (Virtual Input/Output):
- Modify parameters in real-time (e.g., filter coefficients)
- Monitor internal status registers
- Debug without recompilation
Real debugging case:
- Issue: YUV output occasionally shows wrong values
- Method: ILA captured multiplication intermediate results
- Finding: Sign extension error caused high-bit overflow
- Solution: Fixed signed number extension logic
Reference Documentation
Detailed design patterns: See references/design-patterns.md
- CDC synchronizer design
- FIFO implementation
- AXI-Stream interface
Common issues troubleshooting: See references/troubleshooting.md
- Timing violation diagnosis process
- Metastability handling
- Resource conflict resolution
Device selection guide: See references/device-selection.md
- Selection based on resource requirements
- Package and speed grade selection
- Cost optimization suggestions
Golden Rules
- Function first, optimization second — Make the design work correctly first, then optimize timing and resources
- Constrain early, relax late — Strict timing constraints early, relax based on situation later
- Register all boundaries — Module inputs and outputs must be registered to avoid timing coupling
- Documentation is code — Clear comments and documentation are more important than complex designs
- Test-driven development — Write testbench first, then implement functionality
This guide is based on real-world project experience and is continuously updated.