Sending GET Requests From Linux to a SmartLinc Home Automation Controller is Very, Very Slow
Update! The problem turned out to be the microscopic window size advertised by the SmartLinc on initial connection. If I break up the initial message into two packets, with the first one only one byte long, the delay doesn't happen. So, I've written a little proxy that does exactly that. I've attached the proxy here:
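For anyone reading without the attachment, the core trick can be sketched in a few lines of Python. This is not the attached proxy, just a minimal illustration of the idea described above: the function names `split_first_byte` and `send_split` are invented for this example, and setting TCP_NODELAY to keep the two writes from being merged is an assumption rather than something taken from the proxy.

```python
import socket


def split_first_byte(request):
    """Return (first byte, remainder) of a request given as bytes."""
    return request[:1], request[1:]


def send_split(sock, request):
    """Send the first byte in its own write, then the rest in a second write."""
    head, tail = split_first_byte(request)
    try:
        # Assumption: disabling Nagle keeps the kernel from merging the writes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except OSError:
        pass  # not a TCP socket (for example, a local socketpair in a test)
    if head:
        sock.sendall(head)
    if tail:
        sock.sendall(tail)
```

A proxy built on this would accept the browser's connection, open its own connection to the controller, pass the first chunk of the request through `send_split`, and relay everything else unchanged.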
I'll go ahead and leave the rest of my earlier comments here as documentation of the problem.
When we were getting ready to go out of town a couple of weeks ago, doing a good job of putting some of our lights on timers seemed like a good idea. I wound up buying a SmartLinc controller to run the system (there wasn't time to try to understand MisterHouse before we left; I've now got a controller that talks USB on order).
This controller sits on the home network, and runs a little web server. Surprisingly, the interaction is vastly better when a Windows machine talks to it than when a Linux machine does (I'll show the wireshark traces in just a second). So I find myself with three questions:
- The important one is, how can I work around this?
- The less important one is, why is it happening in the first place?
- And how can I hand-craft a test server to duplicate the behavior of the SmartLinc so I can try to figure out exactly what's triggering it?
OK, first attachment is a wireshark dump of firefox on my laptop establishing a connection to the controller, and sending its initial GET command. All the attached traces except the very last are filtered to show only packets involving the laptop, and to leave out traffic with a VPN server it communicates with. Someone who is trying to help with this wanted to see more, so the last trace is unfiltered: it shows all the activity on the network from a few seconds before I tried to load the SmartLinc's page until after most of it had loaded.
In the first tiny fraction of a second, the laptop performs a DNS query and the three-way handshake is completed. The laptop then sits for a full 4 seconds (in fact, almost exactly 4 seconds) before starting to send the GET request. It sits there for so long before sending the request that the SmartLinc actually sends a keepalive request! The delay isn't always four seconds; it varies from roughly one to roughly four but seems to always be quite close to an integral number of seconds. This delay is the real problem here; everything else in this post deals with comparing the response in various environments.
By contrast, here's what happens when firefox on the same laptop sends the request, running under XP:
This time, it sends the request immediately.
I see the fast behavior under XP, with both firefox and explorer. I just checked, and I also see good behavior from my daughter's iPod Touch.
I see the slow behavior under Linux with a bunch of different browsers, and with firefox or explorer under XP in a virtualbox VM, hosted by Linux on the laptop. My google phone shows the same behavior.
Next, let's take a look at the laptop (under linux) sending a GET to an apache server on another Linux machine in the house:
This time, we get a good fast request. One thing that's different here from both the Linux and Windows requests to the SmartLinc is that the whole GET is sent in a single packet. Talking to the SmartLinc, both clients sent the G (in GET) as a packet by itself, and didn't continue until getting an ACK. This surprised me at first, but of course the SmartLinc advertised a window of 1 byte in the handshake.
I wrote a trivial client program, which establishes a connection to a server and sends a GET command, logging the time each system call completes. Here's the program --
(you'll notice I set TCP_NODELAY -- that was in case the Nagle algorithm was causing the delay. There's no difference whether it's set or not)
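In case the attachment doesn't come through, here is a rough Python sketch of the same kind of client. It is not the program used for the log and trace in this post; the function name `timed_get`, the HTTP/1.0 request line, and the 10 second timeout are all choices made just for this illustration.

```python
import socket
import time


def timed_get(host, port, path="/"):
    """Connect, send a GET, read until EOF; return (log, response bytes)."""
    log = []
    start = time.monotonic()

    def mark(label):
        # Record how long after start each call returned.
        log.append((label, time.monotonic() - start))

    sock = socket.create_connection((host, port), timeout=10)
    mark("connect")
    # Rule out the Nagle algorithm as the source of any delay.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    mark("setsockopt TCP_NODELAY")
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    sock.sendall(request.encode("ascii"))
    mark("send")
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    mark("recv until EOF")
    sock.close()
    mark("close")
    return log, b"".join(chunks)
```

Keep in mind that the time logged for "send" only says when the kernel accepted the bytes into its send buffer, not when they went out on the wire, so a log like this has to be read side by side with a packet trace.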
Here's a log of the program
and here's the wireshark trace.
So, the program thinks it got all the way past the writes almost immediately (from the log), but the first packet sent as a result of those writes has that four second delay (as indicated by the trace). I also tried corking and uncorking the socket; no difference. As you'd guess, using the client to talk to apache is fast.
The only significant differences I'm seeing in the server-side behavior are in the options sent back in the SYN/ACK packet during the handshake, and the TCP window size. The controller is sending a four byte options field, containing
- Maximum segment size: 532 bytes
Apache's host is sending a 20 byte field, setting
- Maximum segment size: 1460 bytes
- SACK permitted
- Timestamps: TSval and TSecr
- Window scale: 6
The window size from apache is 5792, while the size from the SmartLinc is 1.
(of course, if I'm missing a different significant difference, please let me know!)
So, my guess is that the missing options or the different window size are causing some sort of problem. I tried to write a little server program that listens for (and discards) a GET request and sends a little bit of HTML back, but I simply wasn't able to run a lot of the tests I wanted (I couldn't get the window size anywhere near 1, for instance). Is there a good way to control the options and parameters above at a fine enough grain to duplicate the SmartLinc's handshake?
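To show the kind of limit I mean, here is a small Python sketch of the one knob an ordinary test server has over its advertised window: SO_RCVBUF, set before listen(). The function name `probe_rcvbuf` is made up for this example. As I understand socket(7), Linux doubles the requested value and enforces a minimum, so asking for 1 byte does not produce a 1 byte window; the exact number you get back depends on the kernel.

```python
import socket


def probe_rcvbuf(requested):
    """Request a receive buffer of `requested` bytes; return what was granted."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set before listen() so accepted connections inherit the setting.
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        srv.bind(("127.0.0.1", 0))
        srv.listen(1)
        # The kernel may round the request up; read back the real value.
        return srv.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    finally:
        srv.close()


if __name__ == "__main__":
    print("asked for 1 byte, kernel granted", probe_rcvbuf(1))
```

Getting closer to the SmartLinc's handshake than this probably means building the SYN/ACK by hand with a raw-packet tool instead of going through the kernel's TCP stack.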
So, this is where I'm at. I'd like any ideas people might have; how to work around it, how to proceed in diagnosing exactly what's wrong....
Addendum -- I was asked to put a longer trace up to see if there might be some insights gleaned from that. Here it is; this trace is unfiltered; it shows everything that went across my home network while I was getting a page from the SmartLinc.