What happens when you reach VLAN tag 4,094? If your switches haven't crashed by then, you simply run out of space. That's fine if you're only building virtual networks in small environments, but if you want to create scalable logical networks that link your virtual machines across different networks, then you need something else.
That something else is Virtual eXtensible Local Area Network (VxLAN), the current pop star of overlay technologies. A protocol originally driven by VMware and several network vendors, VxLAN lets you create a Layer 2 network on top of a Layer 3 network using encapsulation, increasing the number of segments you can build to about 16 million. Since virtual machines (VMs) can't connect across multiple Layer 2 networks without breaking their links, it's a game-changer for the software-defined network and is now fully supported in Linux. VxLAN bounced into the cloud services environment in 2011.
In addition, it reduces the need for spanning tree, trunking, and stretched VLANs. Virtual machines can move between hosts without the need to fiddle with different IPs. The nice thing about it is that it's standards-based, defined by RFC 7348, and not vendor-locked.
VxLAN Basics
VNI and VLANs
- Every VLAN has a 12-bit VLAN ID, allowing for 2^12 = 4,096 VLANs. Of those, 1–1001 are the standard range, 1002–1005 are reserved, and 1006–4094 are the extended range; IDs 0 and 4095 cannot be assigned.
- The VxLAN Network Identifier (VNI) is 24 bits, allowing for 2^24 = 16,777,216 segments.
- VNIs separate traffic; if you want communication between them, you need a router.
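The difference in scale between the two comes straight from the field widths. A quick sketch of the arithmetic:

```python
# The ID-space math behind VLANs and VxLAN VNIs.
VLAN_ID_BITS = 12
VNI_BITS = 24

vlan_space = 2 ** VLAN_ID_BITS    # 4,096 possible VLAN IDs (0-4095)
usable_vlans = vlan_space - 2     # IDs 0 and 4095 cannot be assigned -> 4,094

vni_space = 2 ** VNI_BITS         # 16,777,216 possible VxLAN segments

print(f"VLAN IDs: {vlan_space} ({usable_vlans} usable)")
print(f"VNIs:     {vni_space:,}")
```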
Overlays and Underlays
- VxLAN overlays the underlay (Layer 3). All underlay traffic is routed, so no trunking or spanning tree is needed. Dynamic routing with OSPF or EIGRP works, and IS-IS is fine as well. This makes good use of ECMP for load sharing and fast recovery. You could use BGP, but it gets complex.
- In VxLAN, each VNI is a separate virtual network that runs over the underlay; each VNI is called a bridge domain. To create this virtual network, traffic is encapsulated in UDP and IP before it is sent out. When it reaches its destination, the encapsulation is stripped off.
- A hidden advantage of this: as long as you have IP connectivity between points, you can change your underlay without affecting the overlay.
VTEPs and Encapsulation
Switches and routers that participate in VxLAN have a special interface called the VTEP (VxLAN Tunnel End Point). The VTEP provides the connection between the overlay and the underlay. Each VTEP has an IP address in the underlay network; it also serves one or more VNIs (VNI 1701, VNI 2403, etc.).
To deliver traffic from one host to another, a source and destination VTEP create a stateless tunnel. These tunnels exist only long enough to deliver the VxLAN frame. When a frame for a remote host reaches a switch, the frame is encapsulated in IP and UDP headers, and the switch then forwards it through the underlay.
Hosts and Gateways
VxLAN can be supported in hardware or software. The advantage of software support is that a hypervisor like ESX or Hyper-V can run it. This is the host-based method: the vSwitch on the host has a VTEP, which encapsulates traffic from VMs before it touches physical switches. The physical switches just see IP traffic and are unaware of the VxLAN. The result is a simplified physical network that focuses on transport.
The VTEPs can also be on physical switches or routers, for example when configuring VxLAN on a Nexus switch; this is called a VxLAN gateway. The VM sends traffic through the vSwitch as normal, and when the frame reaches the switch, the VTEP encapsulates it. VxLAN in hardware is faster.
Hybrid networks also work, where some hosts sit behind gateways and other hosts have VxLAN enabled.
VxLAN – Header Format and Encapsulation
VxLAN Header Format
When traffic comes in, the VTEP encapsulates it in IP and UDP headers. We start with a standard frame that a host would send; we call this the inner MAC frame...
It has a source MAC, destination MAC, source IP, destination IP, data, and perhaps even a VLAN tag.
In this example, traffic stays within the VNI, so no routing is required. The host sends the frame to the switch; the switch adds the VxLAN header:
The VxLAN header contains the VNI; the VTEP adds several additional headers while preserving the inner frame. VxLAN uses UDP for transport; the destination port is 4789 and the source port is random. ECMP, if available, uses a hashing algorithm to decide which link to put the traffic on, and the random source port helps the algorithm use the links evenly.
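To illustrate the source-port trick, here is a minimal sketch of how a VTEP might derive the UDP source port from a hash of the inner frame. The hash function and exact scheme are illustrative assumptions; the spec only asks that the port fall in the dynamic range where possible:

```python
import zlib

def vxlan_src_port(inner_frame: bytes) -> int:
    """Derive a stable per-flow UDP source port from the inner frame.

    Hashing keeps all packets of one flow on the same ECMP path, while
    different flows land on different ports (and thus different links).
    """
    h = zlib.crc32(inner_frame)
    return 49152 + (h % (65536 - 49152))   # map into the dynamic port range

frame_a = b"\x00" * 12 + b"flow-a-payload"   # stand-in inner frames
frame_b = b"\x00" * 12 + b"flow-b-payload"
print(vxlan_src_port(frame_a), vxlan_src_port(frame_b))
```

Because the port is derived from the frame rather than truly random, packets of one flow never reorder across links, yet the underlay's 5-tuple hash still spreads distinct flows.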
An IP header is now added with the address of the destination VTEP. An Ethernet header with a MAC address is added for delivery to the next physical device. As normal, the source and destination MAC addresses change at each device the traffic passes through. When the traffic arrives at the destination VTEP, the headers are removed, leaving the original frame to be delivered to the host.
VxLAN Header (a lot of unused space)
There are four parts:
Flags 8 bits: bit 3, the I flag, is set to 1 to show that the VNI is valid
Reserved 24 bits: for future use
VNI 24 bits: the VxLAN ID
Reserved 8 bits: for future use
The VxLAN encapsulation adds about 50 bytes of overhead; hence, you will need to enable jumbo frames everywhere.
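Both the header layout and the 50-byte figure can be checked with a few lines of Python. This is a sketch using `struct`, not a full implementation:

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08   # the I flag: VNI field is valid

def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VxLAN header: flags(8) + reserved(24) + VNI(24) + reserved(8)."""
    assert 0 <= vni < 2 ** 24
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def parse_vni(header: bytes) -> int:
    """Extract the VNI, checking the I flag first."""
    word1, word2 = struct.unpack("!II", header)
    assert (word1 >> 24) & VXLAN_FLAG_VNI_VALID, "VNI not marked valid"
    return word2 >> 8

# Outer Ethernet (14) + outer IPv4 (20) + UDP (8) + VxLAN (8) = 50 bytes
OVERHEAD = 14 + 20 + 8 + 8

hdr = build_vxlan_header(1701)
print(len(hdr), parse_vni(hdr), OVERHEAD)   # 8 1701 50
```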
Spine and Leaf Topology
Also known as the fabric, the spine-and-leaf topology is often synonymous with VxLAN; however, it's not exclusive to VxLAN. FabricPath and ACI also use spine and leaf.
Traditional Hierarchical Architecture
Traditional hierarchical architecture makes use of Core, Distribution, and Access layers…
Access
The Access layer is where hosts and devices connect
Aggregation (Distribution)
Controls traffic in an area, e.g., a floor in the building
Core
Provides fast transport between areas
This design is common in campus and data center environments. It concentrates traffic at aggregation points, and traffic follows a north-south pattern, where the majority of traffic needs to leave the local area.
If a smartphone moves from one area to another, it gets a new IP address from DHCP. As an alternative, one option would be to span VLANs across the access layer; although this addresses mobility, it increases broadcast and failure domain size.
This only works within the distribution block, for the core is a routed network.
In a data center architecture, there would be servers with static IPs. If you had different subnets in each rack and you wanted to migrate a VM, it would need a new IP address. This is where the spine-and-leaf architecture, along with VxLAN, comes in.
Spine and Leaf
The Clos network has been around since the 1950s. Charles Clos created the mathematical theory of this architecture in 1953, hence the name. Briefly, this architecture is based on associating each output with an input point. In other words, the number of outputs equals the number of inputs, and there is precisely one connection between the nodes of one stage and those of the next stage.
This architecture can be considered a two-tier hierarchy, in which each access-layer (leaf) switch connects to every spine switch.
Figure 1 shows the hardware distributed VxLAN using the spine/leaf two-layer architecture. Spine nodes and gateways are converged and function as VxLAN egress devices, and leaf nodes function as distributed VxLAN gateways.
Figure 2-25 Hardware distributed VxLAN using the spine/leaf two-layer architecture
- Overall design:
- Flexibly configure the number of spine nodes and leaf nodes, as shown in Figure 2.
- Spine nodes and leaf nodes are connected at Layer 3, and ECMP is configured on the entire network, achieving load balancing of traffic, non-blocking forwarding, and fast convergence.
- Deploy ARP broadcast suppression globally and traffic suppression on interfaces to prevent broadcast traffic from being flooded. ARP proxy can also be configured as a secondary choice. Traffic is then directed to the corresponding gateway, and the gateway monitors Layer 2 traffic.

Spine node:
- Spine nodes can form an M-LAG or a stack. A stack is easier to deploy and maintain, but the service interruption time during a version upgrade is long; M-LAG is therefore recommended.
- The spine node is used as the RR of BGP EVPN.
- It is recommended that the CE12800 be used as the spine node to meet expansion requirements of the future network.
Leaf node:
- When the NICs of a server are connected in load-balancing mode, leaf nodes support multiple networking options, such as a stack, M-LAG, or an SVF composed of fixed devices. M-LAG is recommended because of its high reliability. When the NICs of a server are connected in active/standby mode, leaf nodes run in standalone mode.
- When server leaf nodes form an M-LAG, a Monitor Link group needs to be deployed. The uplink is associated with all downlinks, preventing traffic interruption when the uplink fails.
Router:
- Routers and spine nodes are fully meshed, ECMP-based forwarding is implemented between spine nodes and routers, and links between routers and between spine nodes are used as backup links.
- Routers are used as egress devices and are connected to extranets.
As a result, this architecture offers a high degree of network redundancy, a simplified design, and support for a large volume of bisectional (east-west) traffic. In the two-tier network, every leaf switch connects to every spine switch over a routed Layer 3 link. Every link needs an IP address at each end and uses a /30 or /31 subnet. A routing protocol like OSPF or EIGRP manages the routing. What all of this means is that every path is equal and every destination is only two hops away.
If a link or switch fails, ECMP reroutes the path instantly. Notice there are no links between spine switches or between leaf switches. This keeps the topology consistent and avoids unnecessary paths. The exception is vPC, which connects a server to two leaf switches using a Layer 2 peer link and a Layer 3 keepalive link carrying vPC traffic only. (A completely different discussion, but vPC is a multi-chassis link aggregation technology, not a variant of VxLAN.)
VTEP redundancy is achieved on Cisco Nexus 9300 platform switches by using a pair of virtual PortChannel (vPC) switches that function as one logical VTEP device sharing an anycast VTEP address.
The vPC switches use the vPC concept for redundant host connectivity while individually running Layer 3 protocols with the upstream switches in the underlay network. Both VTEPs join the multicast group for the same VxLAN VNI and use the same anycast VTEP address as the source when sending VxLAN-encapsulated packets. To the devices in the underlay network, including the multicast rendezvous point and the remote VTEP devices, the two vPC VTEP switches appear to be one logical VTEP entity.
Spine and leaf is well suited for the data center. A routed network limits the failure domain, and adding VxLAN on top allows devices to move without changing IP addresses. The design also suits an east-west traffic pattern, which favors the data center, where much of the traffic is between servers. Another major advantage is scalability. If you need more hosts, just add more leaf switches; need more bandwidth, add a spine switch or two. The point here is that the design remains the same as it scales. You just need more IP addresses, links, and switches; the routing protocol takes care of the rest.
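The scaling math above is simple enough to sketch. The helper below is hypothetical and assumes one /31 subnet per leaf-spine link:

```python
def fabric_plan(spines: int, leaves: int) -> dict:
    """Sketch the link and addressing cost of a spine-and-leaf fabric.

    Every leaf connects to every spine; there are no leaf-leaf or
    spine-spine links, so the link count is simply spines * leaves.
    """
    links = spines * leaves
    return {
        "links": links,
        "p2p_subnets": links,             # one /31 (or /30) per link
        "underlay_addresses": links * 2,  # a /31 uses both of its addresses
        "ecmp_paths_per_leaf": spines,    # one equal-cost path via each spine
    }

print(fabric_plan(spines=2, leaves=8))
print(fabric_plan(spines=4, leaves=8))   # more spines -> more bandwidth/paths
```

Note how adding leaves grows host capacity and adding spines grows ECMP fan-out, while the structure of the plan never changes.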
Adding VxLAN
Extending the Fabric
So now we have a nice fabric for our hosts. However, what about internet and WAN connectivity? What about firewalls, load balancers, and routers? This functionality is added at the leaf layer. Leaf switches with routers and firewalls attached are called border leaves.
The border leaf switches represent connectivity to and from the fabric. This is also where routes from the overlay are redistributed into the core routing protocol. In practice, these switches are not particularly special; hosts can still be connected to border leaves.
What happens when your fabric grows out of control? You can break the fabric into smaller fabrics and connect them with a super spine; however, this is for goliath-sized networks.
VxLAN Address Learning
Address Learning
It is time for things to get interesting. We know that VxLAN is an overlay technology: it creates tunnels across an IP underlay network. The ingress VTEP encapsulates the overlay traffic and sends it to the egress VTEP, which decapsulates it, so the traffic is delivered across the underlay with no host the wiser.
How does the ingress VTEP find the egress VTEP? With all the VTEPs out there, how does the switch find the right one? And how does it learn where the destination MAC addresses are?
There are two conventions for Address Learning.
The first is called Data Plane Learning. This is the traditional method for learning addresses, much like traditional Ethernet in many ways.
The second is called Control Plane Learning, a newer and more sophisticated approach. This method uses BGP to share MAC addresses, similar to the way BGP learns and shares routes.
Before going further, let us look at the two different traffic types we will encounter. The first is unicast traffic, which is sent to a single, specific destination. The second is BUM traffic, which requires special handling.
BUM Traffic
BUM is an acronym for Broadcast, Unknown Unicast, and Multicast traffic: simply put, any traffic that goes to more than one destination. ARP is an example of BUM traffic.
In a traditional Ethernet network, BUM traffic is flooded to many destinations. VxLAN has to be more discerning about this type of traffic; otherwise, it would not be able to scale to a large network.
There are two possible ways VxLAN can handle BUM traffic: multicast and headend replication.
Multicast is probably the most common solution. Each VNI is mapped to a single multicast group; conversely, each multicast group may map to one or more VNIs. When a VTEP comes online, it uses IGMP to join the multicast groups its VNIs use. If there is a VNI the VTEP does not use, it does not need to join that group. When a VTEP needs to send BUM traffic, it sends only to the relevant multicast group. This is one method of VTEP discovery.
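A sketch of that VNI-to-group mapping in Python; the 239.1.0.0 group range and the modulo scheme are illustrative assumptions, since real deployments choose their own allocation:

```python
import ipaddress

BASE_GROUP = ipaddress.IPv4Address("239.1.0.0")   # assumed admin-scoped range
GROUP_COUNT = 256                                 # groups set aside for VNIs

def vni_to_group(vni: int) -> ipaddress.IPv4Address:
    """Map a VNI to one multicast group; several VNIs may share one group."""
    return BASE_GROUP + (vni % GROUP_COUNT)

print(vni_to_group(1701))        # each VNI hashes into the group pool
print(vni_to_group(1701 + 256))  # a VNI 256 apart shares the same group
```

Because many VNIs can share a group, a VTEP joining that group may receive BUM traffic for VNIs it does not serve, which it simply drops; fewer groups means less multicast state, at the cost of some wasted delivery.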
As an alternative, headend replication can be used instead of multicast; however, it is only available if you use BGP EVPN. When BUM traffic arrives, the VTEP creates several unicast packets and sends one to every VTEP that supports the VNI. This is not as efficient as multicast and does not scale as well; however, it is much simpler than a multicast infrastructure. This method is fine for about 20 VTEPs or fewer.
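Headend replication is easy to picture as a loop over the VNI's flood list. A minimal sketch, assuming the per-VNI VTEP list has already been learned (in real EVPN, from type-3 routes); the addresses are made up:

```python
def headend_replicate(bum_frame: bytes, vni: int, flood_list: dict) -> list:
    """Turn one BUM frame into one unicast copy per remote VTEP in the VNI."""
    return [(vtep_ip, bum_frame) for vtep_ip in flood_list.get(vni, [])]

# Hypothetical flood list: VNI -> remote VTEP addresses
flood_list = {1701: ["10.0.0.2", "10.0.0.3", "10.0.0.4"]}

copies = headend_replicate(b"<arp-broadcast>", 1701, flood_list)
print(len(copies))   # one unicast copy per remote VTEP
```

The linear fan-out is exactly why this approach stops scaling: N remote VTEPs means N copies of every broadcast leaving the ingress switch.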
Data Plane Learning
Data plane learning (flood and learn) is very similar to regular Ethernet. Ethernet works by shouting the desired destination to everyone; all destinations receive this, but only the correct destination responds. Data plane learning is a little better in that flooding only goes to specific multicast group members. However, data plane learning has a serious limitation: it has no built-in support for routing, so it is only used for bridging devices at Layer 2. To reach the outside world, you need an external router as your gateway. Additionally, if you want to route between VNIs, you also require an external router, which causes traffic to hairpin.
From a security perspective, VTEPs are not authenticated; hence, there is nothing preventing a rogue VTEP from joining. In most cases, Control Plane Learning is recommended.
Control Plane Learning
Learning through the data plane has its drawbacks; Control Plane Learning is much more functional and efficient. With Control Plane Learning, switches learn MAC addresses before they're needed. This works much like a routing protocol: switches peer with each other using BGP and share the addresses they know about, using the EVPN address family. Address families are the way BGP can carry reachability information for different protocols; there are IPv4 and IPv6 families, VPN address families for MPLS, and others.
Each switch runs BGP and peers with every other switch on the IP network. Hence, you use either a full mesh or route reflectors:

Some or all of these switches will have VTEPs, which means all of them will learn where each VTEP is in the network. This also provides VTEP authentication: when VTEPs are learned through BGP, they are whitelisted, so any rogue VTEPs will be rejected. BGP authentication may also be used to prevent rogue peers.
Host MAC addresses are added through the BGP process. They are discovered when hosts start up or send GARP messages. The MAC address is then shared among all BGP peers. Therefore, when a host sends a frame to another host, the switch looks up the MAC address via BGP and sends the frame to the correct destination without flooding.
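The resulting forwarding state can be pictured as a simple table keyed by MAC address. A sketch with made-up addresses, standing in for entries a switch would populate from BGP EVPN type-2 routes:

```python
# MAC -> (VNI, remote VTEP), populated by the control plane rather than
# data-plane flooding. All addresses here are invented for illustration.
evpn_table = {
    "00:50:56:aa:bb:01": {"vni": 1701, "vtep": "10.0.0.2"},
    "00:50:56:aa:bb:02": {"vni": 1701, "vtep": "10.0.0.3"},
}

def forward(dst_mac: str) -> str:
    entry = evpn_table.get(dst_mac)
    if entry is None:
        return "unknown unicast -> BUM handling"
    return f"encapsulate for VNI {entry['vni']} -> VTEP {entry['vtep']}"

print(forward("00:50:56:aa:bb:02"))
print(forward("00:50:56:ff:ff:ff"))
```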
You may ask: is ARP gone? When you use Control Plane Learning, you are also using ARP suppression. The hosts on the network have no idea that VxLAN or EVPN exist, so they will still send ARP requests when they need to reach another host. However, when an ARP request reaches the switch, it is not flooded out as normal; the switch looks up the information via BGP, which already knows the IP-to-MAC address mapping, and responds to the host with an ARP reply.
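ARP suppression boils down to answering from the EVPN-learned table instead of flooding. A sketch, with hypothetical addresses:

```python
# IP -> MAC mappings the switch already knows via BGP EVPN (invented here).
suppression_cache = {"10.1.1.20": "00:50:56:aa:bb:02"}

def handle_arp_request(target_ip: str) -> str:
    mac = suppression_cache.get(target_ip)
    if mac is not None:
        return f"reply locally: {target_ip} is-at {mac}"   # no flooding
    return "cache miss -> treat as BUM traffic"

print(handle_arp_request("10.1.1.20"))
print(handle_arp_request("10.1.1.99"))
```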
What about BUM traffic then? Is it still relevant when Control Plane Learning is used? Yes: applications still send broadcast and multicast traffic, and unknown unicast traffic may appear if there are silent hosts on the network. To handle BUM traffic, you still have multicast or headend replication, as mentioned earlier. The general recommendation is to use multicast, as it is the most scalable solution.
Routing and Multitenancy
Now for a major advantage of using BGP EVPN: it supports integrated routing and bridging (IRB). Unlike Data Plane Learning, you do not need an external router. VNIs can be configured as Layer 2 VNIs or Layer 3 VNIs, and both Layer 2 and Layer 3 information is carried in BGP. A Layer 2 VNI is used for bridging, where traffic stays in the same LAN segment.
A layer 3 VNI is used for routing, where traffic needs to leave a layer 2 VNI. Layer 3 VNIs are optional, but if you want to route through the local switch, you need them.
VTEPs only need to know the Layer 2 VNIs they serve locally; on the other hand, all VTEPs need to know about all Layer 3 VNIs. This is so they can support a feature called the anycast gateway. When IRB is used (not all platforms support it), each switch acts as a default gateway for hosts in a VNI. Rather than giving each switch a different IP address, they all share the same IP and the same virtual MAC address. There is no need for hello messages or timers, as HSRP or VRRP have. This means all hosts can have the same default gateway regardless of which switch they connect to, which affords the virtual machine mobility mentioned earlier.
Multitenancy
To support multitenancy, layer 3 VNIs are attached to a VRF.
Virtual Routing and Forwarding (VRF) allows a router to run more than one routing table simultaneously. When running multiple routing tables at the same time, they are completely independent: for example, you can use overlapping IP addresses in different VRFs on the same router, and they will function independently without conflict. It is also possible to use the same VRF instance on multiple routers, connecting each instance separately using a dedicated router port or a sub-interface per VRF.
VRFs are commonly used on the ISP side. Provider Edge (PE) routers usually run one VRF per customer VPN, so one router can act as a PE router for multiple Customer Edge (CE) routers, even with customers using the same subnets across their VPNs.
By running a VRF per customer, those subnets never mix. If you have used MPLS, this is very familiar. In VxLAN, this means many VNIs can be associated with a customer or tenant; routes and routing tables are kept separate by using route targets and route distinguishers.
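The key property, independent tables that tolerate overlapping prefixes, can be sketched as nested lookups. The VRF names, VNIs, and routes here are invented:

```python
# Per-VRF routing tables: the same prefix exists twice without conflict.
vrf_tables = {
    "tenant-a": {"10.0.0.0/24": "via 192.168.1.1 (L3 VNI 50001)"},
    "tenant-b": {"10.0.0.0/24": "via 192.168.2.1 (L3 VNI 50002)"},
}

def lookup(vrf: str, prefix: str) -> str:
    """Resolve a prefix inside one VRF; other VRFs are never consulted."""
    return vrf_tables.get(vrf, {}).get(prefix, "no route")

print(lookup("tenant-a", "10.0.0.0/24"))
print(lookup("tenant-b", "10.0.0.0/24"))   # same prefix, different answer
```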
We hope we were able to clear up the concepts of VxLAN. The virtual overlay rides over the physical underlay, allowing us to move virtual machines freely without the need for an IP address change. In addition, VxLAN solves the problem of wanting overlay networks in virtualized data centers with multiple tenants and Layer 2 connectivity across several data centers.
It can address the security issues that come with VLAN groupings. It provides the flexibility to create new networks on the fly when used in conjunction with other VMware solutions like vCloud Director. It fools VMs into believing that they are part of one big, flat network, and it's massively scalable. Perfect for bigger virtual networks.
So, enjoy...
___________________________________________
“Once more unto the breach, dear friends, once more;”
___________________________________________
References
- http://www.netdesignarena.com/index.php/2018/11/05/clos-spine-leaf-architecture-overview/
- https://support.huawei.com/enterprise/en/doc/EDOC1100004176?section=j00p&topicName=hardware-distributed-vxlan-using-the-spine-leaf-two-layer-architecture
- http://sdntutorials.com/difference-between-control-plane-and-data-plane/
- https://learningnetwork.cisco.com/thread/33735
- https://networkdirection.net/articles/routingandswitching/vxlanoverview/
- https://subscription.packtpub.com/book/virtualization_and_cloud/9781782172932/2/ch02lvl1sec18/a-vxlan-overview
- http://www.enterprisenetworkingplanet.com/netsp/vxlan-beyond-the-hype.html
About Rick Ricker