Sunday, June 29, 2008

Non-blocking sockets and Linux

Hi.

I just got through mucking around with system calls under Linux to make the network subsystem of the Sphere RPG engine work. Very painful experience.

Anyway, I fixed it, almost totally rewrote it, and I noticed a few issues with the old code.

First, reusing a port on a machine was a pain, because my listening sockets weren't using SO_REUSEADDR (which can be set with setsockopt() at the SOL_SOCKET level). From what I read online, a port only frees itself up about a minute after the socket closes (the TCP TIME_WAIT state, as far as I can tell). Not fun when you're debugging network apps.
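For reference, here is a minimal sketch of setting that option. The helper name enable_reuseaddr is made up for this example and is not from the Sphere source; it assumes you call it after socket() and before bind().

```c
#include <sys/socket.h>

/* Hypothetical helper (not from the Sphere code): let a listening socket
 * rebind its port without waiting for the old one to time out.
 * Call after socket() and before bind().
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_reuseaddr(int fd)
{
    int yes = 1;
    return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);
}
```

The option has to be set before bind(), otherwise it has no effect on that bind.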

It didn't help that somebody conveniently forgot to convert the port number to network byte order (that's what htons() is for) when packing the listening port into the address structure.
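A rough sketch of what the packing should look like, assuming IPv4 and a socket that listens on every interface. The function name make_listen_addr is invented for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build an IPv4 address for listening on any interface.
 * The port is passed in host byte order and converted here. */
struct sockaddr_in make_listen_addr(uint16_t port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);   /* host order -> network byte order */
    return addr;
}
```

Without the htons() call, a little-endian machine ends up listening on the byte-swapped port: 8080 (0x1F90) turns into 36895 (0x901F).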

Then, there was an issue with sockets being blocking. The Sphere network API is meant to be asynchronous, so obviously that was a big problem. But it's what followed that really caused problems.
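Switching a socket to non-blocking mode is usually done with fcntl(). This is only a sketch of the common idiom, with a made-up helper name, not necessarily what ended up in the engine.

```c
#include <fcntl.h>

/* Hypothetical helper: put a descriptor into non-blocking mode.
 * Reads the current flags first so the other flags are preserved.
 * Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

After this, calls like recv() and accept() return -1 with errno set to EAGAIN instead of waiting when there is nothing to do.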

The routine for checking the state of a given socket wasn't working at all. The code wasn't terrible; it just didn't do the job it advertised. You'd think it'd be simple: checking whether a socket is connected or not.

What. The. Hell.

Sockets are fairly well documented; you can find a lot of info about them by entering even vaguely related terms into Google. Non-blocking sockets are a whole different ball game. It took me days of debugging and trawling through Google to find links that were even remotely helpful.

A guy who was looking at the same code had his father help with a stop-gap solution that worked under Mac OS X. Polling the socket was a step in the right direction. But sockets under Linux must act a fair bit differently, because the code he got didn't work for me on Linux.

I spent most of yesterday reading pages of returned event flags from the poll() system call in an effort to find out how to check if a socket was connected or not.

The solution? When a peer closes their end of a connection, your socket receives a POLLIN event. An attempt to read() or recv() will give you zero.

The link that changed it all: I was operating under the false assumption that a disconnecting peer would raise a POLLHUP or POLLERR event. How wrong I was.

I also treat a return of -1 as a real error only when errno isn't EAGAIN, since these are asynchronous sockets we're talking about. Tests seem to show that everything is working. It even works for that guy using Mac OS X.
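Put together, the check looks roughly like the sketch below. The enum and function names are invented for this example. Using MSG_PEEK so the check doesn't eat any incoming data is my own assumption about how you'd want it to behave, not something lifted from the Sphere code. It also assumes the socket is already non-blocking.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical names, for illustration only. */
enum conn_state { CONN_OPEN, CONN_CLOSED, CONN_ERROR };

/* Report whether the peer has closed its end of a connected stream socket. */
enum conn_state check_connection(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    int ready = poll(&pfd, 1, 0);        /* timeout 0: never wait */
    if (ready < 0)
        return errno == EINTR ? CONN_OPEN : CONN_ERROR;
    if (ready == 0)
        return CONN_OPEN;                /* no events, nothing has changed */
    if (pfd.revents & POLLNVAL)
        return CONN_ERROR;               /* not a valid open descriptor */

    /* A close by the peer shows up as readable; peek one byte to find out. */
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK);
    if (n > 0)
        return CONN_OPEN;                /* real data is waiting */
    if (n == 0)
        return CONN_CLOSED;              /* end of stream: peer closed */
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return CONN_OPEN;                /* -1, but only "try again later" */
    return CONN_ERROR;
}
```

The zero-byte read is what signals the disconnect; POLLHUP and POLLERR are not what the code waits for.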

I need some tea.

1 comment:

Anonymous said...

everything was discovered before:
http://www.ibm.com/developerworks/library/l-sockpit/

as always