# Cisco Connect

Dubrovnik, Croatia, South East Europe 20-22 May, 2013

# Anatomy of Internet Routers




Josef Ungerman Cisco, CCIE #6167

© 2013 Cisco and/or its affiliates. All rights reserved.


### Agenda

### On the Origin of Species

- Router Evolution
- Router Anatomy Basics

### **Packet Processors**

- Lookup, Memories, ASIC, NP, TM, parallelism
- Examples, evolution trends

### **Switching Fabrics**

- Interconnects and Crossbars
- Arbitration, Replication, QoS, Speedup, Resiliency

### **Router Anatomy**

- Past, Present, Future: CRS, ASR9000
- 1Tbps per slot?

BRKSPG-2772

© 2012 Cisco and/or its affiliates. All rights reserved.

### Hardware Router: Control Plane vs. Data Plane




### Hardware Router: Centralized Architecture




## Scaling the Forwarding Plane: NP Clustering




## Scaling the Forwarding Plane: Switching Fabric




## Scaling the Forwarding Plane: Distributed Architecture








# **Packet Processors**

# Packet Processing Trade-offs: Performance vs. Flexibility

#### **CPU (Central Processing Unit)**

- multi-purpose processors
- high software flexibility [weeks], but low performance [1's of Mpps]
- high power, low cost
- usage example: access routers (ISRs)

#### ASIC (Application Specific Integrated Circuit)

- mono-purpose hard-wired functionality
- complex design process [years]
- high performance [100's of Mpps]
- high development cost (but cheap production)
- usage example: switches (Catalysts)

#### NP (Network Processor) = "something in between"

- performance [10's of Mpps] + programmability [months]
- cost vs. performance vs. flexibility vs. latency vs. power
- high development cost
- **usage example**: core → edge, aggregation routers
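To relate the Mpps figures above to interface speeds, the sketch below computes the packet rate needed to fill a link with frames of one size. It assumes standard Ethernet framing (8 bytes of preamble and 12 bytes of inter-frame gap per frame); the function names are illustrative and not taken from the slides.

```python
# Per-frame overhead on the wire that is not part of the Ethernet frame itself.
PREAMBLE_BYTES = 8          # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum idle time between frames, in byte times


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Packets per second needed to fill a link with frames of one size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


def per_packet_budget_ns(pps: float) -> float:
    """Time available per packet if packets are handled strictly one at a time."""
    return 1e9 / pps


if __name__ == "__main__":
    for frame in (64, 512, 1518):
        pps = line_rate_pps(10e9, frame)
        print(f"10GE, {frame:>4}B frames: {pps / 1e6:6.2f} Mpps, "
              f"{per_packet_budget_ns(pps):7.1f} ns per packet")
```

With minimum-size 64-byte frames, a single 10GE port already needs about 14.88 Mpps, which is why a CPU in the "1's of Mpps" class fits access routers rather than core line cards.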






"It is always something

(corollary). Good, Fast, Cheap: Pick any two (you can't have all three)."

RFC 1925 "The Twelve Networking Truths"


Cisco Public


# Hardware Routing Terminology



# FIB Memory & Forwarding Chain

#### TLU/PLU

- memories storing trie data (today typically RLDRAM)
- typically multiple channels for parallel/pipelined lookup
- PLU (Packet Lookup Unit): L3 lookup data (the FIB itself)
- TLU (Table Lookup Unit): L2 adjacency data (hierarchy, load-sharing)
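The split between trie data and adjacency data can be sketched in software. The toy below is a minimal illustration, assuming a plain one-bit-per-level binary trie for IPv4; it is not the lookup structure of any real line card, and the class and method names are hypothetical. The trie plays the PLU role (longest-prefix match) and returns an index into an adjacency list that plays the TLU role.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "adj_index")

    def __init__(self):
        self.children = [None, None]  # one child per bit value
        self.adj_index = None         # set when a prefix ends at this node


class Fib:
    """Toy FIB: a binary trie (PLU role) plus an adjacency table (TLU role)."""

    def __init__(self):
        self.root = TrieNode()
        self.adjacencies = []  # index -> (egress port, next-hop MAC)

    def add_route(self, prefix: str, port: str, mac: str) -> None:
        net = ipaddress.ip_network(prefix)
        self.adjacencies.append((port, mac))
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.adj_index = len(self.adjacencies) - 1

    def lookup(self, address: str):
        """Longest-prefix match; returns (port, mac) or None."""
        bits = int(ipaddress.ip_address(address))
        node, best = self.root, None
        for i in range(32):
            if node.adj_index is not None:
                best = node.adj_index      # remember the longest match so far
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                break
        else:
            # All 32 bits consumed: a /32 host route may end here.
            if node.adj_index is not None:
                best = node.adj_index
        return None if best is None else self.adjacencies[best]


if __name__ == "__main__":
    fib = Fib()
    fib.add_route("10.0.0.0/8", "port-1", "00:00:5e:00:53:01")
    fib.add_route("10.1.0.0/16", "port-2", "00:00:5e:00:53:02")
    print(fib.lookup("10.1.2.3"))   # matches the more specific /16
    print(fib.lookup("10.2.0.1"))   # falls back to the /8
```

Each trie level costs one memory read, which is one way to see why the slide stresses multiple memory channels for parallel or pipelined lookup.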




# CAM



### L2 Switching (also VPLS)

**Destination MAC** address lookup → Find the egress port (**Forwarding**)

Read @ Line-rate = Wire-speed Switching

Source MAC address lookup → Find the ingress port (Learning)

Write @ Line-rate = Wire-speed Learning
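The two lookups above can be illustrated with a small software model. This is only a sketch: a Python dict stands in for the CAM (an exact-match table), and the class name, port numbers and flooding behaviour for unknown destinations are assumptions for the example, not details from the slides.

```python
class L2Switch:
    """Toy L2 switch; a dict plays the role of the CAM (exact-match table)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Process one frame and return the set of ports it is sent out of."""
        # Learning: one table write per frame, keyed by the source MAC.
        self.mac_table[src_mac] = in_port
        # Forwarding: one table read per frame, keyed by the destination MAC.
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            return self.ports - {in_port}   # unknown destination: flood
        if out_port == in_port:
            return set()                    # destination sits behind the ingress port
        return {out_port}


if __name__ == "__main__":
    sw = L2Switch([1, 2, 3])
    print(sw.receive(1, "aa", "bb"))  # "bb" unknown, flooded to ports 2 and 3
    print(sw.receive(2, "bb", "aa"))  # "aa" was learned on port 1
```

Every frame causes one read and one write, so wire-speed switching with wire-speed learning needs a table that sustains both at the full packet rate.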


# TCAM

### TCAM (Ternary CAM)

- "CAM with a wildcard" (VMR)
- CAM with a selector at some cells
- stable O(1) lookup performance
- 3<sup>rd</sup> state: "don't care" bit (mask)
- **usage:** IP lookup (addr/mask)

### **IP Lookup Applications**

- L3 switching (Dst lookup) & RPF (Src lookup)
- Netflow implementation (flow lookup)
- ACL implementation (filters, QoS, policers...)
- various other lookups

| Content (Value/Mask) | Result |
|----------------------|--------|
| ...                  | ...    |
| 192.168.100.xxx      | 801    |
| 192.168.200.xxx      | 802    |
| 192.168.300.xxx      | 803    |
| ...                  | ...    |

Query: 192.168.200.111 → Result: 802 (a pointer, or an ACL PERMIT/DENY)
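A ternary match can be modelled as "compare only the bits the mask cares about". The sketch below is a software stand-in, assuming first-match priority; a real TCAM compares all entries in parallel, whereas this loop walks them in order. The first two entries come from the table above (the 192.168.300.xxx row is left out because 300 is not a valid octet), and the catch-all entry is a hypothetical addition.

```python
import ipaddress


def ip(text: str) -> int:
    """Dotted-quad IPv4 string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


# (value, mask, result); a 0 bit in the mask means "don't care".
TCAM = [
    (ip("192.168.100.0"), ip("255.255.255.0"), 801),
    (ip("192.168.200.0"), ip("255.255.255.0"), 802),
    (ip("0.0.0.0"),       ip("0.0.0.0"),       0),    # hypothetical catch-all entry
]


def tcam_lookup(key: str):
    """Return the result of the first (highest-priority) matching entry."""
    k = ip(key)
    for value, mask, result in TCAM:
        if (k & mask) == (value & mask):
            return result
    return None


if __name__ == "__main__":
    print(tcam_lookup("192.168.200.111"))  # the query shown on the slide
```

Because the first match wins, entry order encodes priority: for IP lookup, longer prefixes have to sit above shorter ones.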

#### **TCAM Evolution**

- CAM2: 180nm, 80 Msps, 4 Mb, 72/144/288b wide
- CAM3: 130nm, 125 Msps, 18 Mb, 72/144/288b wide
- CAM4: 90nm, 250 Msps, 40 Mb, 80/160/320b wide


# **Pipelining Programmable ASIC**

2002: Engine3 (ISE) – Cisco 12000

- 4 Mpps, 3 Gbps
- µ-programmable stages
- 2 per LC (Rx, Tx)




# **SMP Pipelining Programmable ASIC**

2004: Engine5 (SIP) – Cisco 12000

- 16 Mpps, 10 Gbps
- 130/90nm, µ-programmable
- 2 per LC (Rx, Tx)
- 240 W / 10G = 24 W/Gbps







### "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"

Seymour Cray

### "What would Cinderella pick to separate peas from ashes?"

Unknown IP Engineer

### "Good multiprocessors are built from good uniprocessors"

Steve Krueger


# **PPE (Packet Processing Elements)**

Generic CPUs or COT?

- NP does not need many generic CPU features
  - floating point ops, BCD or DSP arithmetic
  - complex instructions that the compiler does not use (vector, graphics, etc.)
  - privilege/protection/hypervisor
  - large caches
- Custom improvements
  - H/W assists (TCAM, PLU, HMR...)
  - Faster memories
  - Low power
  - C language programmable! (portable, code reuse)

|                                             | Cisco QFP   | Sun UltraSPARC T2 | Intel Core 2 Mobile U7600 |
|---------------------------------------------|-------------|-------------------|---------------------------|
| Total number of processes (cores × threads) | 160         | 64                | 2                         |
| Power per process                           | 0.51W       | 1.01W             | 5W                        |
| Scalable traffic management                 | 128k queues | None              | None                      |





#### QFP:

- >1.3B transistors
- >100 engineers
- >5 years of development
- >40 patents

#### **Packaging Examples:**

- ESP5 = 20 PPEs @ 900MHz
- ESP10 = 40 PPEs @ 900MHz
- ESP20 = 40 PPEs @ 1200MHz
- etc.

#### SMP NPU (full packet processing)

QFP (Quantum Flow Processor) – ASR1000 (ESP), ASR9000 (SIP)

2008 QFP

- 16 Mpps, 20 Gbps
- 90nm, C-programmable
- sees full packet bodies
- central or distributed

2012 QFP

- 32 Mpps, 60 Gbps
- 45nm, C-programmable
- clustering capabilities
- SOC, integrated TM
- sees full packet bodies
- central engine (ASR1K)

Block diagram: Processing Pool of 160 engines (40 PPEs x 4 threads); Distribute & Gather Logic (complete packets); Fast Memory Access to on-chip resources, SRAM, TCAM4 and RLDRAM2; Resources & Memory Interconnect; Pkt DRAM; TM ASIC (128K queues, 5L shaping)






# **Switching Fabrics**

# **Interconnects Technology Primer**

Capacity vs. Complexity



#### Bus

- half-duplex, shared media
- standard examples: PCI [800Mbps], PCIe [Nx 2.5Gbps], LDT/HT [25Gbps]
- simple and cheap

#### **Serial Interconnect**

- full-duplex, point-to-point media
- standard examples: SPI [10Gbps], Interlaken [100Gbps]
- Ethernet interfaces are very common (SGMII, XAUI,...)

#### Switching Fabric (cross-bar)

- full-duplex, any-to-any media
- proprietary systems [up to multiple Tbps]
- often uses double-counting (Rx+Tx) to express the capacity




# **Fabric Port engineering – examples**



# "Non-Blocking" voodoo

RFC 1925: It is more complicated than you think.




### **Example: 16x Multicast Replication**

Egress Replication




# What if the fabric can't replicate multicast?

**Ingress Replication Flavors** 



#### Bad: **Ingress Replication**

- central replication or encapsulation engines
- \*) of course, this is used in centralized routers

**10Gbps of multicast eats 160Gbps fabric bw!** (10G multicast impossible)
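The 160 Gbps figure follows from simple multiplication in the 16x replication example. The sketch below covers only plain ingress replication versus replication inside the fabric; it does not attempt to model the binary variant that follows. The function and parameter names are illustrative.

```python
def fabric_load_gbps(stream_gbps: float, egress_cards: int,
                     replicate_in_fabric: bool) -> float:
    """Bandwidth the ingress card must push into the fabric for one multicast stream."""
    if replicate_in_fabric:
        return stream_gbps                 # one copy in, the fabric fans it out
    return stream_gbps * egress_cards      # one unicast copy per egress card


if __name__ == "__main__":
    print(fabric_load_gbps(10, 16, replicate_in_fabric=False))  # ingress replication
    print(fabric_load_gbps(10, 16, replicate_in_fabric=True))   # fabric replication
```

With ingress replication the fabric port of the ingress card becomes the bottleneck long before the multicast stream reaches 10 Gbps.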

#### Good-enough/Not-bad-enough: Binary Ingress Replication


- dumb switching fabric
- non-Cisco

10Gbps of multicast eats 80Gbps fabric bw! (10G multicast impossible)

# **Cell dip explained**



# Cell dip

Q: Is this SF non-blocking?

A: (MARKETING) Yes Yes Yes!

A: (ENGINEERING) Non-blocking for unicast packet sizes above 53B.
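Cell dip comes from chopping packets into fixed-size cells: a packet one byte larger than a cell payload needs a second, almost empty cell. The sketch below shows the arithmetic. The 64-byte payload and 8-byte cell header are hypothetical values chosen for illustration; the actual cell format behind the 53B figure above is not given in the slides.

```python
import math


def cells_needed(packet_bytes: int, cell_payload_bytes: int) -> int:
    """Number of fixed-size cells required to carry one packet."""
    return math.ceil(packet_bytes / cell_payload_bytes)


def fabric_efficiency(packet_bytes: int, cell_payload_bytes: int,
                      cell_header_bytes: int) -> float:
    """Fraction of fabric bandwidth that carries packet data for one packet size."""
    cell_bytes = cell_payload_bytes + cell_header_bytes
    return packet_bytes / (cells_needed(packet_bytes, cell_payload_bytes) * cell_bytes)


def required_speedup(packet_bytes: int, cell_payload_bytes: int,
                     cell_header_bytes: int) -> float:
    """Fabric speedup needed to carry this packet size at line rate."""
    return 1.0 / fabric_efficiency(packet_bytes, cell_payload_bytes, cell_header_bytes)


if __name__ == "__main__":
    PAYLOAD, HEADER = 64, 8   # hypothetical cell format
    for size in (64, 65, 128, 129, 1500):
        print(f"{size:>5}B: {cells_needed(size, PAYLOAD):>3} cells, "
              f"speedup needed {required_speedup(size, PAYLOAD, HEADER):.2f}x")
```

The worst case sits just above each multiple of the payload size, so a fabric is non-blocking at every packet size only if its speedup covers that worst-case dip.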



# Cell dip gets bad – too low speedup (non-Cisco)

### MARKETING: this is non-blocking fabric\*

\*) because we can find at least one packet size that does not block



# Cell dip gets worse – multicast added (non-Cisco)

### MARKETING: this is non-blocking fabric\*

\*) because we can still find at least one packet size that does not block, and your network does not have that much multicast anyway



# **Router is blocking – 1 fabric card fails (non-Cisco)**

### MARKETING: this is non-blocking fabric\*

\*) because nobody said the non-blocking capacity is "protected"



# What is "Protected Non-Blocking"



# HoLB (Head of Line Blocking) problem



# Good HoLB Solutions

### Fabric Scheduling + Backpressure + QoS




# **Multi-stage Switching Fabrics**

#### Multi-stage Switching Fabric

- constructing a large switching fabric out of smaller SF elements
- **50's: Ch. Clos** – general theory of multi-stage telephony switch
- **60's: V. Beneš** – special case of *rearrangeably non-blocking* Clos (n = m = 2)
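The classic results for a symmetric three-stage Clos network can be written as a small check: with n inputs per ingress-stage switch and m middle-stage switches, the network is strict-sense non-blocking when m ≥ 2n − 1 and rearrangeably non-blocking when m ≥ n. The sketch below encodes just these two conditions from circuit-switching theory; the function name is illustrative, and packet fabrics add scheduling and buffering effects that it does not capture.

```python
def clos_blocking_class(n: int, m: int) -> str:
    """Classify a symmetric 3-stage Clos network.

    n: inputs per ingress-stage switch
    m: number of middle-stage switches
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"


if __name__ == "__main__":
    print(clos_blocking_class(2, 2))  # the Benes case, n = m = 2
    print(clos_blocking_class(2, 3))
    print(clos_blocking_class(4, 3))
```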





- Multi-chassis capabilities (2+0, N+2,...)
- Massive scalability: up to 1296 slots !!!
- Output-buffered, speedup, backpressure




#### ASR9000 – Clos

- Single-chassis so far
- Scales to 22 slots today
- Arbitrated VOQ's



## **Virtual Output Queuing**

- VOQ on ingress modules represents fabric capacity on egress modules
- VOQ is "virtual" because it represents egress capacity but resides on ingress modules; it is nevertheless a physical buffer where packets are stored
- VOQ is not equivalent to ingress or egress fabric channel buffers/queues
- VOQ is not equivalent to ingress or egress NP/TM queues
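Why per-destination queues on the ingress side help against head-of-line blocking can be shown with a toy model. In the sketch below each packet is represented only by its destination name, and `ready` is the set of destinations currently able to accept traffic; both are assumptions made for illustration, and no arbitration or priority handling is modelled.

```python
from collections import deque


def drain_fifo(packets, ready):
    """Single ingress FIFO: only the packet at the head may be sent."""
    queue, sent = deque(packets), []
    while queue and queue[0] in ready:
        sent.append(queue.popleft())
    return sent


def drain_voq(packets, ready):
    """One queue per egress destination: a congested destination stalls only itself."""
    voqs = {}
    for dest in packets:
        voqs.setdefault(dest, deque()).append(dest)
    sent = []
    for dest, queue in voqs.items():
        if dest in ready:
            sent.extend(queue)
    return sent


if __name__ == "__main__":
    arrivals = ["A", "B", "B", "C"]   # destination of each queued packet
    ready = {"B", "C"}                # destination A is congested
    print(drain_fifo(arrivals, ready))
    print(drain_voq(arrivals, ready))
```

With the single FIFO the packet for congested destination A blocks everything behind it; with VOQs the traffic for B and C still gets through.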



#### ASR9000

- Multi-stage fabric
- Granular central VOQ arbiter
- VOQ set per destination
  - destination is the NP, not just the slot
  - 4 per 10G, 8 per 40G, 16 per 100G
- 4 VOQs per set
  - 4 VOQs per destination, strict priority
  - up to 4K VOQs per ingress FIA
- Example (ASR9922): 20 LCs × 8 10G NPs × 4 VOQs = up to 640 VOQs per ingress FIA
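The ASR9922 example is a straight product, sketched below together with a check against the per-FIA limit. It assumes that "4K" means 4096; the function name is illustrative.

```python
VOQ_LIMIT_PER_FIA = 4 * 1024   # "up to 4K VOQs per ingress FIA", assumed to mean 4096


def voqs_per_ingress_fia(line_cards: int, nps_per_card: int, priorities: int = 4) -> int:
    """One VOQ set per destination NP, with `priorities` VOQs in each set."""
    return line_cards * nps_per_card * priorities


if __name__ == "__main__":
    needed = voqs_per_ingress_fia(line_cards=20, nps_per_card=8)
    print(needed, "VOQs, within limit:", needed <= VOQ_LIMIT_PER_FIA)
```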






# Router Anatomy

## 2004: Cisco CRS – 40G+ per slot




## 2010: Cisco CRS – 100G+ per slot

## Next: 400G+ per slot (4x 100GE)

same backward-compatible architecture, same upgrade process



## **CRS Multi-Chassis**




## CRS Multi-Chassis (Back-to-Back, 2+0)



## CRS Multi-Chassis (N+1, N+2, N+4)



## 2009: Cisco ASR9000 – 80G+ per slot



## 2011: Cisco ASR9000 – 200G+ per slot

## Next: 800G+ per slot

new RSP, faster fabric, faster NPU's, backwards compatible



## 2012: Cisco ASR9922 – 500G+ per slot

## Next: 1.5T+ per slot

7 fabric cards, faster traces, faster NPU's, backwards compatible



# **Entering the 100GE world**

#### Router port cost break-down



#### **Core Routing Example**

- 130nm (2004) → 65nm (2009): 3.5x more capacity, 60% less Watt/Gbps, ~8x less \$/Gbps
- 40nm (2013): up to 1Tbps per slot, adequate Watt/Gbps reduction...

- Silicon keeps following Moore's Law
- Optics is fundamentally an analog problem
- Cisco puts 13% of revenue (almost \$6B annually) into R&D: about 20,000 engineers


## **Terabit per slot**... CMOS Photonics

#### What is CMOS Photonics?

- Silicon is semi-transparent for SM wavelengths
- Use case: Externally modulated lasers



- 10x 10GE breakout cable (100x 10GE LR ports per slot)


## Terabit per-slot...

### How to make it practically useful?

#### Silicon Magic @ 40nm

- Power zones inside the NPU: low-power mode
- Duplicate processing elements: in-service µ-code upgrade

#### **Optical Magic @ 100G**

- Single-carrier DP-QPSK modulators for 100GE (>3000km)
- CMOS Photonics
- ROADM



#### NPU Model (40nm)



#### Data Plane Magic – Multi-Chassis Architectures

- nV Satellite
- DWDM and OTN shelves

#### Control Plane Magic – SDN (Software Defined Networks)

- nLight: IP+Optical Integration
- MLR: Multi-Layer Restoration
- Optimal Path & Orchestration (DWDM, OTN, IP, MPLS)


## **Evolution: Keeping up with Moore's Law**

higher density = less W/Gbps, less \$/Gbps

#### **Switching Fabrics**

- Faster, smaller, less power hungry
- Elastic multi-stage, extensible
- Integrated VOQ systems, arbiter systems, multi-functional fabric elements

#### **Packet Processors**

- 45nm, 40nm, 28nm, 20nm process
- Integrated functions: TM, TCAM, CPU, RLDRAM, OTN,...
- ASIC slices: firmware ISSU, low-power mode, partial power-off

#### **Router Anatomy**

- Control plane enhancements: SDN, IP+Optical
- DWDM density: OTN/DWDM satellites
- 100GE density: CMOS optics, CPAK/CFP4
- 10GE density (TGE breakout cables) and GE density (GE satellites)


# Thank You.
