Question

High Latency and Packet loss

  • 19 October 2022
  • 20 replies
  • 2608 views

For the past couple of days I have been seeing ping latencies of over 200 ms and packet loss of 85%-100% when pinging. Is anyone else experiencing this?


The short answer is yes, others have seen it. There’s another thread on here about it: a few days ago, something broke ping on TMHI. For me, only around 1 in 20 pings gets a reply, and today the replies take over 200 ms (usually 40 ms).

I’m running PingInfoView from NirSoft, which lets you set the port it uses. Web servers usually respond on port 80, DNS on port 53. This lets me do a periodic ping to check my connection. My connection has been more stable since they broke ping, so there is a bright side.
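For anyone without PingInfoView, the same idea can be roughed out in a few lines of Python. This is only a sketch of a port-based check: it times a TCP connect instead of sending ICMP, and the function name `tcp_ping` and its defaults are my own, not anything from NirSoft.

```python
import socket
import time
from typing import Optional


def tcp_ping(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if the connect fails."""
    start = time.perf_counter()
    try:
        # create_connection performs the TCP three-way handshake; no ICMP is involved.
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None
```

Calling something like `tcp_ping("8.8.8.8", 53)` every minute or so would give a connectivity check that does not depend on ICMP echo replies getting through.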

Yeah, that’s a problem. I am using a Netgate 4100 firewall with Starlink and T-Mobile in a failover configuration, with Starlink as the primary and TMHI as the secondary. If Starlink goes offline, which it does in heavy rain, the Netgate switches to TMHI until Starlink comes back online, then switches back. This worked well until a couple of days ago. The problem is that the Netgate uses ping latency and packet loss to detect an offline condition, and with the excessive latency and packet loss on TMHI it thinks TMHI is offline all the time and won’t switch to it, so there is no internet access when Starlink goes offline.
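To make the failure mode concrete, here is a rough sketch of that kind of threshold logic. It is not Netgate's actual code; the names and the 20% / 500 ms limits are placeholders I picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class GatewaySample:
    loss_pct: float     # packet loss over the probe window, 0-100
    latency_ms: float   # average round-trip time over the same window


def gateway_is_down(sample: GatewaySample,
                    loss_limit_pct: float = 20.0,      # hypothetical limit
                    latency_limit_ms: float = 500.0    # hypothetical limit
                    ) -> bool:
    """Treat the gateway as down when either limit is exceeded."""
    return (sample.loss_pct > loss_limit_pct
            or sample.latency_ms > latency_limit_ms)


def pick_wan(primary: GatewaySample, secondary: GatewaySample) -> str:
    """Prefer the primary; use the secondary only if it looks healthy."""
    if not gateway_is_down(primary):
        return "primary"
    if not gateway_is_down(secondary):
        return "secondary"
    return "none"
```

With Starlink at 100% loss and TMHI showing 85% ICMP loss, `pick_wan` returns `"none"` even though TMHI still carries normal traffic, which matches what I am seeing.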

I tried to find out how to set the Netgate to use a different port for its ping, but had no luck. Another user mentioned the same thing yesterday. It’s funny, because when I run a speed test, the pings seem normal. I don’t know what’s different. Both the app and www.speedtest.net still show a normal ping response.

IMO ping to an open internet host is not a reliable gauge of connectivity.

I have been running tailscale to another host (also behind TMHI at another house) to determine if we are both connected.  “tailscale ping <host>” runs every 5 minutes and has been working very reliably.  You might consider that or a similar VPN solution to determine when you want to failover.
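A check like that could be wrapped in a small script. This is a sketch only: it assumes `tailscale ping` exits with a non-zero status when the peer does not answer, and the helper name `peer_reachable` is made up for the example.

```python
import subprocess
from typing import Sequence


def peer_reachable(command: Sequence[str], timeout_s: float = 30.0) -> bool:
    """Run a probe command and report True only if it exits with status 0."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or it hung past the timeout.
        return False
    return result.returncode == 0
```

Usage would be along the lines of `peer_reachable(["tailscale", "ping", "other-house"])`, where `other-house` stands in for your own peer name, run from cron or a loop every 5 minutes.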

It is correct that ping is not a reliable gauge of connectivity, but it is often used. It seems to be what Skull52’s Netgate firewall uses.

Mark,

You are correct, ping is the only way to detect a down condition with pfSense. This was working for 2 months until just a couple of days ago. I have the Arcadyan KVD21 gateway. I contacted support today and told them about the ridiculous latency, the 1 in 20 ping replies, and the 85%-100% packet loss, and that others were complaining about the same issue. The tech said they were aware of an issue but that the Arcadyan was not affected, so they went through the same old process of re-provisioning the gateway, which of course didn’t fix it. Now they are sending me a replacement Arcadyan; we will see if that fixes it.

I can tell you it probably won’t fix the ping. I have the Sagemcom and see the same thing: over 90% of pings fail. Something has broken ICMP on TMHI. I would expect them to fix it eventually, since way too many people and devices use ping to monitor connectivity. It is technically not a correct gauge, but it is often the only one available.

Yeah, I don’t hold much hope that it will. I think you are correct that something has broken ICMP on TMHI, but they won’t admit it. The tech didn’t tell me which gateways were affected, just that the Arcadyan was not, which I don’t buy. I think it is all of them: the Arcadyan and the Sagemcom are both experiencing it, so I am sure the Nokia is too.

I have the Nokia gateway, and the excessive ping latency is common to all three of the T-Mobile gateways. A gateway replacement is a waste of time, effort & money. The problem has been reported by users pretty much coast to coast. You can do an ICMP ping over IPv4 or IPv6 and the result is the same. I have tested from Apple & Linux clients and it is pretty poor either way. It is common to see 70-80% packet loss running pings. Sure, ICMP gets low priority & can be ignored, but this is a recent behavioral change.

--- 8.8.8.8 ping statistics ---

100 packets transmitted, 19 packets received, 81.0% packet loss

round-trip min/avg/max/stddev = 75.896/110.618/150.142/18.738 ms
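If you want to log these numbers over time rather than eyeball them, the summary line is easy to parse. A minimal sketch, assuming the macOS/Linux wording shown above; the function name is my own.

```python
import re
from typing import Optional, Tuple

# Matches both "19 packets received" (macOS/BSD) and "19 received" (Linux).
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_from_summary(text: str) -> Optional[Tuple[int, int, float]]:
    """Return (sent, received, loss percent) from a ping summary, or None."""
    match = SUMMARY.search(text)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        return None
    return sent, received, 100.0 * (sent - received) / sent


line = "100 packets transmitted, 19 packets received, 81.0% packet loss"
print(loss_from_summary(line))  # (100, 19, 81.0)
```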

from netstat info:

Input histogram:

                echo reply: 56

                destination unreachable: 6175

                time exceeded: 104

Regarding the speedtest.net operation:

When a speedtest.net test is run against a server, a Wireshark capture shows the client opening a TCP session to the target test server. In my test run the session was set up between TCP ports 8080 & 52830 for the packet exchange between the local client and the destination server. The local client’s TCP source port changes depending upon the packets. There are also some UDP exchanges from time to time between the test server and the test client.

82.83.133.40.in-addr.arpa name = charlotte02.speedtest.windstream.net. < Target Server

Just prior to the session setup the local client, my MacBook Pro, repeatedly sends echo requests to 40.133.83.82. Over the course of the test there are 22 of these ICMP packets, and all of them fail.

Result: All fail to reach the server. (no response found) ← Reason

Curious behavior in the trace route: after the gateway, 192.0.0.1 answers for 4 hops in a row.

traceroute to 40.133.83.82 (40.133.83.82), 64 hops max, 52 byte packets

1  www.webgui.nokiawifi.com (192.168.12.1)  1.042 ms  0.393 ms  0.330 ms

2  192.0.0.1 (192.0.0.1)  0.533 ms  0.560 ms  0.454 ms

3  * 192.0.0.1 (192.0.0.1)  27.806 ms *

4  * 192.0.0.1 (192.0.0.1)  42.273 ms  30.483 ms

5  192.0.0.1 (192.0.0.1)  27.970 ms  36.486 ms  28.959 ms

6  * * *

7  * * *

8  * * *

9  * * *

10  * 10.164.165.59 (10.164.165.59)  495.730 ms *

Non-authoritative answer:

10.164.165.59.in-addr.arpa name = 59.165.164.10.man-static.vsnl.net.in.
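One thing worth double-checking here is the direction of the reverse lookup. A quick sketch with Python's standard `ipaddress` module suggests the answer above is the PTR record for 59.165.164.10, not for the 10.164.165.59 hop, which is a private address and would not normally have a public reverse record.

```python
import ipaddress

hop = ipaddress.ip_address("10.164.165.59")
print(hop.is_private)        # True: 10.0.0.0/8 is private (RFC 1918) space
print(hop.reverse_pointer)   # 59.165.164.10.in-addr.arpa

# The name in the answer above is what a lookup of this address would query.
other = ipaddress.ip_address("59.165.164.10")
print(other.reverse_pointer)  # 10.164.165.59.in-addr.arpa
```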

So it appears the traffic goes out the gateway, but even when performing the trace route using port 8080 or 443 it has issues. I am not sure what 192.0.0.1 actually is.

Is anyone else seeing the trace route where 192.0.0.1 is the next hop after the gateway? 

(00:50:b6:88:1a:f8 is the MAC address associated with the 192.0.0.1 IPv4 address from the Wireshark packet cap)
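On the 192.0.0.1 question: as far as I know, 192.0.0.0/29 is the IANA "IPv4 Service Continuity Prefix" (RFC 7335), set aside for IPv4-over-IPv6 transition mechanisms such as DS-Lite and 464XLAT. So this hop may simply be a local translation interface rather than a routable host, though that is my reading and not something confirmed by T-Mobile. A small sketch to check whether an address falls in that block:

```python
import ipaddress

# IANA IPv4 Service Continuity Prefix (RFC 7335).
SERVICE_CONTINUITY = ipaddress.ip_network("192.0.0.0/29")


def in_service_continuity_prefix(addr: str) -> bool:
    """True if addr lies inside 192.0.0.0/29."""
    return ipaddress.ip_address(addr) in SERVICE_CONTINUITY


print(in_service_continuity_prefix("192.0.0.1"))     # True
print(in_service_continuity_prefix("192.168.12.1"))  # False
```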

I am still picking the packet capture apart as there are a variety of odd issues but the speed test is completed regardless of the issues with the exchange of packets. 

 

The odd thing is that the failover to T-Mobile on the Netgate worked very well until a couple of days ago. I have been working with Netgate Support and we can’t find anything wrong with the Netgate.

I would like to think it is a temporary issue, but given that it is pretty consistently 80% packet loss and extreme latency, it is more than likely overly aggressive throttling. In another conversation a user in LA reported normal behavior, without the excessive packet loss and extreme latency. I haven't seen enough responses to tell whether there is some regional pattern, and since another user who chimed in from AZ sees the same thing, it is hard to say. It is probably some change T-Mobile has made. I doubt T-Mobile will readily communicate on the matter.

It is interesting to see Speedtest.net run. I think the ping latency reporting by the application might not actually be from the pings but rather from the TCP session establishment. I know from my packet capture the TCP ports used and can see the three way handshake process etc… plus I see the pings that all seem to fail to get a response. Since I am seeing good speeds up and down and low jitter values it makes me a little suspicious about the reporting methods. Using fast.com the results are pretty similar so both tend to agree and report 47-48 ms latency. 

If you want to know more about ICMP packet propagation and latency I found a very good resource. Using ICMP for the mechanics of the Netgate failover is probably a poor choice. The document pretty much covers trace routing and the mechanics extremely well. Working in the industry for 22 years I know ICMP does receive low priority but is used all the time.

Based on that article it would be interesting to know if Ookla uses an MPLS core. If they are doing ICMP tunneling, that would explain why their ping latency results look the way they do compared with pings from a client through the normal T-Mobile path. I sort of think they do.

https://archive.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf

OK, well, I got the new router and of course it didn’t fix the ICMP issue, but I did get the Netgate to work by disabling all down-detection monitoring (ping) for T-Mobile. It is the Tier 2 secondary ISP, so if Starlink is down the Netgate fails over to T-Mobile, and down detection for T-Mobile is kind of moot anyway. The concern is that it did work before.

Well, that makes sense. Now you have 100% confirmation. Support is just throwing hardware at the problem blindly. At least the solution is still working.

Yep, not a hardware issue. There are some network issues that they are probably aware of but don’t know how to fix, and as you said they are just throwing hardware at it hoping it may fix it. The weird thing is that if you run an Ookla speed test the ping is not too bad, like upper 50s or low 60s, though it does pick the nearest server. However, if you actually run a ping from the command line, about 1 in 10 will return a reply, with 65% to 95% packet loss. I think the only ones complaining are the techie users or gamers, so it is not a huge priority for them. It really wouldn't have been an issue for me had I not set up dual WAN with pfSense, because T-Mobile is my backup ISP.

I ran the Ookla speedtest.net application and took a Wireshark packet capture to see how it works. At the beginning of the session between the client on my machine and the speedtest.net server I was testing against, I could see a series of pings, and ALL of them failed: there was no response to the ICMP packets sent. The two hosts connect via a TCP session, with packets targeted at port 8080 from varying source ports; the source port changes depending upon the packet type/content. If Ookla tunnels some ICMP packets between the two hosts, or just has an alternate way of calculating latency, which may be the case, then it can and does report a value. In the capture the packets flow between the two hosts at a rapid rate, and I recorded duplicate ACKs, retransmissions, etc. It appeared rather chaotic at times, but the test did succeed and values were reported. Whether or not they run an MPLS tunnel for some of the session I cannot say, but it works. Of the 22 ICMP packets sent, all failed, so I am not sure how the latency value was derived, other than from the session establishment via the TCP handshake.

My 5G home internet is having packet loss. After I spent well over an hour with T-Mobile support, they insisted there is nothing wrong and argued with me that this was normal. They continue to say speed and latency are fine… however, they REFUSE to understand that this is LOSS OF PACKETS and has nothing to do with latency or speed. Latency and speed are fine… but reliability IS NOT. How does this manifest? During my work audio/video calls, I see a pause every 30 seconds or so. When I watch Netflix, it sporadically bombs out. When my kids try to play a game, they see sporadic hangs (aka lag). I ran a ping test and clearly see the packet drops: over roughly 10 minutes of pinging, 3% of packets are lost. T-Mobile says they can run a ping test for a duration and things are great on their end… yep… thanks for listening to the customer.

 

Ping statistics for 8.8.8.8: ~10 mins
    Packets: Sent = 629, Received = 607, Lost = 22 (3% loss),
Approximate round trip times in milli-seconds:
    Minimum = 90ms, Maximum = 137ms, Average = 97ms

 

I guess T-Mobile isn’t ready to understand that packet loss at this level is unacceptable, or doesn’t want to bother their network team to take a deeper look. Until they lose a lot more customers, what else can I do? The painful options are to go back to the much more expensive and GB-constrained Cricket (at least I can do my real job) or to look at (also expensive) Starlink.

 

Sad... I was optimistic t-mobile was really trying to help the rural community get connected and they deliver a sub-par experience (no static IPs, poor network reliability) at an affordable price.  I guess you get what you pay for.

The latency is rather high so that doesn’t help and probably contributes to the packet loss. In effect you would have poor performance and retransmission of packets. Check your cellular metrics for both the 4G and 5G signals and identify the frequencies. If the 5G is n41 and the tower is not too far away it is good to know and understand the metric values. If your RSRP is good be sure to try to improve the RSRQ and SINR values. If you are able to reposition the gateway to improve the radio signal receive quality and the signal to noise ratio then the result should be improved performance as there should be less packet damage and fewer retransmissions. Sure if you can get improved signal receive power that also will help but try to improve the quality of the signal and don’t overlook that. Your location may not have the best signal from the tower so that can be a challenge. 

Thanks for the suggestions.  My latency and speeds are absolutely fine.  Here are my 5G metrics:

BAND n41

RSRQ 2

RSRP -93

SINR 22

From my reading on these metrics, this is all in the Excellent range. There is a systematic loss every 30 seconds or so. This didn’t happen in January, the last time I used the service a lot. I experienced this issue 20+ years ago with a DSL provider, and it ended up being a bad switch/router somewhere in their network. Back then it was a smaller phone company that LISTENED to the customer and actually had someone who knew networks.
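For what it is worth, 22 lost packets over roughly 10 minutes works out to about one loss every 27 seconds, which lines up with the pause I see every 30 seconds or so. To show support that the loss is periodic rather than random, something like the sketch below could be run over a log of one-per-second ping results. The function names and the list-of-booleans input format are my own choices for illustration.

```python
from statistics import median
from typing import List, Optional


def loss_burst_starts(replies: List[bool]) -> List[int]:
    """Indexes where a run of lost probes begins (False = no reply)."""
    starts = []
    previous_ok = True
    for index, ok in enumerate(replies):
        if not ok and previous_ok:
            starts.append(index)
        previous_ok = ok
    return starts


def typical_gap(replies: List[bool], interval_s: float = 1.0) -> Optional[float]:
    """Median time in seconds between the starts of consecutive loss bursts."""
    starts = loss_burst_starts(replies)
    gaps = [(b - a) * interval_s for a, b in zip(starts, starts[1:])]
    return median(gaps) if gaps else None
```

A median gap that stays close to a fixed value across several runs would point to something on a timer in the network path rather than radio noise.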

T-Mobile support says “we pinged you, your speeds are good… sorry, nothing more on our end” and then argues with me using their talking points of “speed good, ping worked”. Knowing that I and many others are experiencing sub-par RELIABILITY for similar reasons, this is the mark of a failing service.

They are so close to a great service at a great price… but devil’s in the details.  They need to have static IPs for their modems, improve network resiliency and raise the bar on their technical customer support.  I am pursuing Starlink and guess will have to make that investment for a quality internet option in a remote location.

With the n41 metrics it should be working well. The problem is probably in the backhaul with a router in the network path to the internet. If they would and could dedicate a knowledgeable engineer to investigate the behavior in more detail they could probably find the problem. From what I have seen in examining traffic flows from here in the past the external IPv4 addresses rotate in the Atlanta area. I don’t have intimate knowledge of the hardware in place so I’m speculating on some aspects of their network architecture. They do use 464XLAT so offer no port forwarding to the end user. Since the mobile aspect/focus is on cellular service the phones do get priority over the fixed cellular gateways. In some places the service is really good but in others not so much. I am pretty sure the controls they leverage for bandwidth distribution are also a big factor as it appears they will compromise on service levels to expand subscription delivery. They may be planning to build out service delivery in a given location to better handle the loads but it seems the cart is before the horse in some places. Just my impression as I have observed a good number of subscribers in urban areas with more population density complaining of the same behavior. They are making an aggressive push to try to be the bigger dog in the fight. 
