While rolling out Meraki MX64s to an organisation, I noticed a slow build-up of sites where either the primary or the secondary connection was down. Investigating this build-up led me to identify a global issue with the firmware on these devices.
The symptoms observed were as follows:
1 – OFFLINE on our monitoring platform (this was a simple ICMP ping check, so Layer 3 was down).
2 – Session ONLINE on our PPPoE servers.
3 – OFFLINE on the WAN interface on the router itself (we could see this because we were still able to connect to the router via its backup connection).
At this stage we had to pinpoint which device was causing this behaviour, and the only way to do this was to understand what each device was doing with the packets during the PPPoE Discovery (PPPoED) process.
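For context, PPPoE Discovery frames are easy to pick out of a raw capture: per RFC 2516 they use EtherType 0x8863 and carry a one-byte code identifying the stage (PADI, PADO, PADR, PADS or PADT). The following is only a minimal sketch of decoding that header, assuming untagged Ethernet frames (no VLAN tag); the function name `decode_pppoe_discovery` and the sample frame are made up for illustration and are not taken from the actual capture.

```python
import struct

# Values defined in RFC 2516.
PPPOE_DISCOVERY_ETHERTYPE = 0x8863
PPPOE_CODES = {
    0x09: "PADI",  # Initiation (client broadcast)
    0x07: "PADO",  # Offer (server reply)
    0x19: "PADR",  # Request (client picks a server)
    0x65: "PADS",  # Session-confirmation (server assigns a session ID)
    0xA7: "PADT",  # Terminate
}


def decode_pppoe_discovery(frame: bytes):
    """Return (code_name, session_id) for a PPPoE Discovery frame, else None.

    Assumes an untagged Ethernet II frame: 6 bytes destination MAC,
    6 bytes source MAC, 2 bytes EtherType, then the 6-byte PPPoE header.
    """
    if len(frame) < 20:
        return None
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != PPPOE_DISCOVERY_ETHERTYPE:
        return None
    _ver_type, code, session_id, _length = struct.unpack("!BBHH", frame[14:20])
    return PPPOE_CODES.get(code, f"unknown(0x{code:02x})"), session_id


# Hypothetical PADI: broadcast destination, made-up source MAC, empty payload.
sample_padi = bytes.fromhex("ffffffffffff" "001122334455" "8863" "11" "09" "0000" "0000")
print(decode_pppoe_discovery(sample_padi))
```

A PADI always carries session ID 0, because the session ID is only assigned by the server in the PADS.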
Our PPPoE servers were ruled out as no other device/customer was experiencing this exact issue. Logs also showed that the device had already completed the PPPoED process and should be online. Nothing untoward was detected here.
Each site had two broadband services (with different providers) that terminated on a Draytek modem. As such, there was some initial suspicion that the modem could be the cause. We engaged Draytek, who replicated the environment and provided evidence that these devices simply couldn’t be dropping packets at Layer 2.
A packet capture pulled from the router showed that it was continuously attempting to send out the first packet needed to initiate the broadband connection. Meraki used this to point the blame at one of the upstream devices (modem or PPPoE server). See below the map I built, which shows the location of the packet capture (WAN1) and which packets it could see:
At this stage, however, I felt we had more evidence pointing towards the router, so I configured a switch with port mirroring to capture all the traffic between the modem and the router.
Once the switch was installed between the router and the modem and was capturing all the packets traversing this link, we noticed the following:
1 – When we unplugged the Ethernet cable from the Draytek modem and plugged it into the packet capture switch, the PPPoE session dropped on our PPPoE servers but came back up shortly afterwards with the same symptoms.
2 – The packet capture showed that when the router detects a change on the WAN uplink, it starts the PPPoED process and initiates the conversation correctly. What happened next, however, explained why the router’s interface remained down in these situations: the router would keep sending out the first packet, the PADI, even after completing the PPPoED process!
See below my final network map notes, which show this:
As the router keeps sending out PADI packets after completing the PPPoED process, it’s no surprise that our PPPoE servers thought the router had an active broadband session while the router itself believed the interface was down.
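If you want to check your own capture for the same pattern, the logic is simple: once a PADS has been seen, discovery is complete, so any further PADI from the router before a PADT is suspicious. Here is a rough sketch of that check. It assumes you have already reduced the capture to an ordered list of discovery code names; the function name `find_stray_padis` and the sample sequence are illustrative only and are not the real capture data.

```python
def find_stray_padis(codes):
    """Return the indexes of PADI frames seen while a session is confirmed.

    `codes` is an ordered list of PPPoE Discovery code names, e.g. "PADI".
    A session counts as confirmed from a PADS until the next PADT.
    """
    session_confirmed = False
    stray = []
    for index, code in enumerate(codes):
        if code == "PADS":
            session_confirmed = True
        elif code == "PADT":
            session_confirmed = False
        elif code == "PADI" and session_confirmed:
            stray.append(index)
    return stray


# Illustrative sequence: a normal handshake followed by repeated PADIs.
example_capture = ["PADI", "PADO", "PADR", "PADS", "PADI", "PADI", "PADI"]
print(find_stray_padis(example_capture))  # [4, 5, 6]
```

A healthy client should return an empty list here, since a fresh PADI is only expected after the previous session has been torn down.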
Once we had provided this evidence to Meraki, they quickly developed and pushed out a new firmware version. If you experience the symptoms above, they can now be fixed by upgrading your routers to firmware version 14.31 or above.