Monday, March 28, 2022

network troubleshooting with net.ipv4.tcp_timestamps

for some work context: our company primarily develops proxy servers.

as a proxy server, our product has to take client requests, translate and transform them, and then pass them on to the real backend servers, which we call the source server/site, or more commonly the real server.


during one fault investigation, we found that our proxy server could not connect to a real server: www.cnki.net.

inside the docker container where the proxy server runs, the TCP SYN packet just went silent: no response came back at all.


in further testing, we found:

1. other real servers could be reached and worked normally.

2. after a seemingly random period, we could reach the real server for a short while.

3. the host machine that runs the docker container could reach the real server correctly, without interruption.

4. mtr/traceroute showed the real server as TCP-reachable, even from the container that runs the proxy server (see the example commands below).
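
for reference, TCP-mode probes along these lines can be used for that kind of check (the port here is illustrative, the hostname is the one from this incident):

    traceroute -T -p 80 www.cnki.net
    mtr -T -P 80 www.cnki.net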


this was weird, and we had no option left other than using tcpdump and wireshark to analyze the outgoing packets and try to find the difference.

so we did that, but found no difference between the SYN packets: same fields, same flags, even the same TCP options, just the same everything.


we were stuck there for a while, until we suddenly noticed one option that was not quite the same: the timestamp option in the TCP header.


the TSval from inside the container was significantly smaller than the one from outside.
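
for reference, a tcpdump invocation along these lines is enough to compare the TSval of outgoing SYNs, since tcpdump prints the TCP options (including TS val) of SYN packets; the hostname here is the one from this incident:

    tcpdump -ni any '(tcp[tcpflags] & tcp-syn) != 0 and host www.cnki.net'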


 

we were instantly reminded that linux has an option named net.ipv4.tcp_tw_recycle, which uses the timestamp value as proof of whether a packet was sent before or after a TIME_WAIT socket was recycled. the option is so buggy that it was removed in Linux 4.12, in 2017.

see also: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4396e46187ca5070219b81773c4e65088dac50cc


if a packet is considered to predate the TIME_WAIT recycling, the linux kernel drops it silently, without any notice.


this behavior was fine in the early days of the internet. but as IPv4 addresses were depleted and NAT gateways appeared all over the world, it became buggy whenever clients sit behind a NAT gateway.

a NAT gateway translates TCP packets: it rewrites the source IP to its own address, assigns a random port of its own, and thus hides the real source IP and port.

but net.ipv4.tcp_tw_recycle treats packets from the same source IP as coming from the same remote host, and that source IP has been replaced with the NAT gateway's.


here is the problem: the NAT gateway presents the same public IP for many different real clients. when the linux kernel recycles TIME_WAIT sockets, that one source IP represents more than one real remote host, and the TCP timestamp clock differs between those hosts. for example, host A may be using 100k as its timestamp while host B is using 800k; once they share the NAT gateway's IP and the sockets have been recycled, host A's SYN carrying timestamp 101k is silently dropped, because the linux kernel has recorded 800k as the last timestamp for that source IP.
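
a minimal sketch of that per-source-IP check (my own simplification for illustration, not the kernel's actual code), driven by the host A / host B numbers above:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* simplified model of the check tcp_tw_recycle relied on: remember the
     * highest TSval seen from each source IP, and silently drop a new SYN
     * whose TSval is lower than that. */
    struct peer_entry {
        char     ip[16];      /* source IP as the server sees it        */
        uint32_t last_tsval;  /* last (highest) timestamp seen from it  */
    };

    static struct peer_entry peers[64];
    static int npeers;

    static int syn_accepted(const char *src_ip, uint32_t tsval)
    {
        for (int i = 0; i < npeers; i++) {
            if (strcmp(peers[i].ip, src_ip) == 0) {
                if (tsval < peers[i].last_tsval)
                    return 0;                 /* looks "old": dropped silently */
                peers[i].last_tsval = tsval;
                return 1;
            }
        }
        snprintf(peers[npeers].ip, sizeof(peers[npeers].ip), "%s", src_ip);
        peers[npeers].last_tsval = tsval;
        npeers++;
        return 1;
    }

    int main(void)
    {
        /* host A and host B sit behind the same NAT, so the server sees a
         * single source IP (an example address here). */
        const char *nat_ip = "203.0.113.10";

        printf("host A, TSval 100000: %s\n", syn_accepted(nat_ip, 100000) ? "accepted" : "dropped");
        printf("host B, TSval 800000: %s\n", syn_accepted(nat_ip, 800000) ? "accepted" : "dropped");
        printf("host A, TSval 101000: %s\n", syn_accepted(nat_ip, 101000) ? "accepted" : "dropped");
        return 0;
    }

the third line of output is the incident in miniature: the host with the smaller clock gets its SYN dropped while the host with the larger clock keeps working.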


so we had found the root cause: an incorrect configuration on the real site's server. the normal resolution is to notify the manager of the website to change that setting, i.e. set net.ipv4.tcp_tw_recycle to 0 on the server.
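
assuming the server runs a pre-4.12 kernel where the knob still exists, the change we would have asked for is a one-liner (plus persisting it in /etc/sysctl.conf):

    sysctl -w net.ipv4.tcp_tw_recycle=0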


but the site manager was unreachable, and our product should not be at the mercy of a remote website's settings. in the end we found the option net.ipv4.tcp_timestamps: when it is set to zero, our TCP stack omits the timestamp option from outgoing TCP packets, which defeats the website server's recycling logic; once there is no timestamp to remember, it has no basis for deciding a packet is stale and dropping it.
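
on a plain host this is also a one-line change (note the trade-off: disabling timestamps also disables PAWS and timestamp-based RTT measurement for our connections):

    sysctl -w net.ipv4.tcp_timestamps=0
    cat /proc/sys/net/ipv4/tcp_timestamps   # verify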

we found the same solution works for docker containers too, via the --sysctl option, which modifies the container's (network-namespaced) system configuration.
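
for example, with a hypothetical image name standing in for our proxy:

    docker run --sysctl net.ipv4.tcp_timestamps=0 our-proxy-image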


PS

after the incident, we did some further investigation into why mtr & traceroute could reach the real server from the same machine.

we found that traceroute does not go through the linux kernel's TCP stack; instead it builds its own probe packets and uses random_req() to generate a large number. this explains why traceroute could reach the real server while the container's TCP stack could not.



PS2

we also investigated the TCP stack's timestamp generation: it comes from tcp_time_stamp_raw(), which invokes ktime_get_ns().

https://github.com/torvalds/linux/blob/master/kernel/time/timekeeping.c#L817

https://www.kernel.org/doc/html/latest/core-api/timekeeping.html
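
as a rough illustration (an assumption-laden sketch, not the actual kernel code; it assumes the TSval ticks at roughly 1ms in recent kernels): the TSval is essentially a monotonic nanosecond clock scaled down to coarse ticks, which userspace can approximate with clock_gettime(CLOCK_MONOTONIC), keeping in mind the note below that this is not guaranteed to be the exact clock source the kernel uses:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* hypothetical userspace approximation of a TCP TSval: take a monotonic
     * nanosecond clock and scale it down to ~1ms ticks. the real kernel path
     * (tcp_time_stamp_raw() -> ktime_get_ns()) may differ in clock source
     * and granularity. */
    static uint32_t approx_tsval(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        uint64_t ns = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        return (uint32_t)(ns / 1000000ull);  /* nanoseconds -> millisecond ticks */
    }

    int main(void)
    {
        printf("approximate TSval: %u\n", approx_tsval());
        return 0;
    }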


see also
1. on getting something close to the kernel's ktime_get_ns() from userspace (a userspace alternative to bpf_ktime_get_ns); note however that `bpf_ktime_get_ns and clock_gettime(MONOTONIC) are not based on the same time source`.
https://stackoverflow.com/questions/60970877/xdp-bpf-is-there-an-user-space-alternative-to-bpf-ktime-get-ns


2. the RFC that describes the TCP timestamp option.

https://datatracker.ietf.org/doc/html/rfc1323