While I wrote this in September 2007, for various reasons I did not get around to putting the finishing touches on it. Please pretend you’re reading it then and not now!
After the hubbub over Sierra’s announcement that they were ceasing multiplayer support for Tribes 1, and the resulting scramble to locate a replacement master server, I decided to take a shot at writing one. The required feature set appeared simple enough to take only a week or so to implement, but with enough gotchas to keep it suitably interesting. While I only had a vague idea of what was required, I got a jump start on proper design by finding Half-Life and Team Fortress Networking: Closing the Loop on Scalable Network Gaming Backend Services by Yahn W. Bernier, an article detailing the design, implementation, and potential problems of the Half-Life master server. Even though some of the topics did not apply to the Tribes 1 requirements (e.g. I can’t alter the client’s behavior to auto rate-limit the server list transmission), the article was still quite valuable, and it is an interesting read even if you aren’t implementing a master server.
My initial server was as basic as you can get: a hash table of servers behind a UDP socket, which wakes up periodically to time out stale servers. Tribes servers do not send a “going offline” packet, but since their heartbeat interval is 2 minutes, a dead server will not remain in the list for long.
If the intended environment was a closed setting with a limited number of servers to track, this would be more than sufficient, but in the real world it is vulnerable to many issues and attacks, intentional or otherwise.
Issues and Attacks
Bogus Servers: The ability to add bogus servers is probably the most damaging problem: it inflates the server list, which makes it possible to saturate the master server’s upload simply by requesting the server list repeatedly. There are two ways to add bogus servers: IP spoofing, or simply sending thousands of keepalives from your real IP with different ports. The Half-Life networking article goes into this problem in depth and offers the solution of a challenge-response system: on receiving a heartbeat, the master server sends a random value to the server and only registers the heartbeat if the server sends the correct value back.
While the challenge-response system solves the problem of IP spoofing completely, it is still trivial to listen for and respond to challenges on your real IP. To defeat this attack, it is necessary to limit each IP to a certain number of registered servers. That way, even if you can intercept and respond to 20,000 challenges, only X will be accepted and the server list will be none the worse for wear. To be honest, limiting an IP to a low number of servers, say 3-5, should be in place regardless of the chance of spam. The only reasons to run multiple servers on a single IP are NAT, or that the additional servers are backups and rarely used. Server CPU lag is one of the worst experiences to have online, although many mods run so poorly that they seem to anticipate this fact and over-compensate with ridiculous weapons and items which leave skill completely out of the equation. That is a topic for another day though.
Of note is the fact that while the Tribes network protocol does include a sequence number which can be used as a challenge, it is only 16 bits wide. To prevent brute-force attacks (however unlikely), the challenge-response value should have a timeout of a few seconds at most.
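Putting those two constraints together, the challenge table might look something like this sketch (illustrative names, with the clock passed in for testability; since the value is only 16 bits, a wrong or stale guess burns the outstanding challenge rather than allowing retries):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical challenge-response table: issue a random 16-bit value
// per (ip, port) heartbeat, and only register the server if the same
// value comes back before the expiry window closes.
class ChallengeTable {
public:
    explicit ChallengeTable(long expiry_secs) : expiry_(expiry_secs) {}

    // Heartbeat received: remember the challenge and when it was issued.
    // The caller supplies the random value (e.g. from a seeded PRNG).
    void issue(uint64_t key, long now, uint16_t random_value) {
        pending_[key] = {random_value, now};
    }

    // Response received: accept only if the value matches and the
    // challenge hasn't expired. One shot either way.
    bool verify(uint64_t key, uint16_t value, long now) {
        auto it = pending_.find(key);
        if (it == pending_.end()) return false;
        bool ok = (it->second.value == value) &&
                  (now - it->second.issued <= expiry_);
        pending_.erase(it);  // wrong or stale guesses consume the challenge
        return ok;
    }

private:
    struct Pending { uint16_t value; long issued; };
    long expiry_;
    std::unordered_map<uint64_t, Pending> pending_;
};
```

Consuming the challenge on any response means a brute forcer gets one guess out of 65,536 per heartbeat rather than a free replay window.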
Upload Starvation: If an IP is spamming list requests to the master server, it might be possible to saturate the master server’s upload even with a relatively small server list. I wanted to be able to address the problem of spamming requests without requiring human intervention, but at the same time not penalize legitimate users who may click the refresh button a few times too often or have multiple players behind NAT.
The solution, ironically, came from Tribes. I implemented a penalty per IP system similar to the in-game chat spam penalty. Each request accrues a certain amount of penalty, and when the penalty reaches a certain limit the master server stops responding to that IP. The penalty is decreased by 1 for each second that passes, allowing the client to eventually access the master again. I also added a maximum penalty cap (currently at 10 seconds) so that bans will be over fairly quickly once the IP ceases to spam.
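The penalty accounting can be sketched as follows (a hypothetical illustration with made-up constants: 4 points per request, a block threshold of 10, and the 10-point cap mentioned above; the real tunables live in t1master.cpp):

```cpp
#include <cstdint>
#include <unordered_map>

// Per-IP penalty in the style of the in-game chat spam penalty:
// each request adds penalty, penalty decays by 1 per second, and
// requests are ignored while the penalty sits at or above the limit.
class PenaltyTable {
public:
    PenaltyTable(long per_request, long limit, long cap)
        : per_request_(per_request), limit_(limit), cap_(cap) {}

    // Returns true if the request from this IP should be served.
    bool allow(uint32_t ip, long now) {
        Entry& e = table_[ip];
        // Decay 1 point per second elapsed since we last saw this IP.
        long decayed = e.penalty - (now - e.last_seen);
        e.penalty = decayed > 0 ? decayed : 0;
        e.last_seen = now;
        bool ok = e.penalty < limit_;
        e.penalty += per_request_;
        if (e.penalty > cap_) e.penalty = cap_;  // so bans end quickly
        return ok;
    }

private:
    struct Entry { long penalty = 0; long last_seen = 0; };
    long per_request_, limit_, cap_;
    std::unordered_map<uint32_t, Entry> table_;
};
```

The cap is what keeps the scheme forgiving: a spammer is unblocked at most cap seconds after going quiet, while a legitimate player mashing refresh accrues only a short pause.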
Unintentional flooding: Because Tribes has no auto rate-limiting mechanism to control the server list download speed, the master server could either upload the server list faster than the client can handle or saturate its own upload by firing off too many packets at once when multiple requests come in. The obvious solution is for the master server to throttle its upload, i.e. queue up packets instead of sending them off instantly. While a simple rate-limited FIFO queue would keep the server’s upload from becoming saturated, it would also mean that past a certain point, packets at the tail end of a large queue would be timed out by the client before they had a chance to arrive. This means rate limiting needs to be done per connection, not globally.
Tribes, however, throws a wrench into how low you can rate-limit a connection. When the first packet of a master server response arrives, Tribes creates X pending “ping” responses to wait for, where X is the number of packets left in the master server response. Because the master server list request and server ping packets are the same (\x10\x03), Tribes piggybacks the master server list request on the server pending-ping list and simply re-pings any entry in the list which times out. Unfortunately, the timestamps for a particular set of master server response packets are never updated when subsequent packets come in, meaning that if Tribes does not receive every master server packet within $pref::pingTimeoutTime milliseconds (default 900ms), it sends a new request for each remaining packet in the list.
So let’s say the server list contains 3000 servers, and the master server batches 64 servers per packet. This would result in 47 packets at ~450 bytes each, or about 21k of data. If the master server uploads at 4k/s to the client, only 4k × 0.9s = 3.6k of data will make it to the client before the remaining packets are timed out. Tribes would then see 38 “pings” that timed out and re-ping each one, updating their packet keys and timestamps so that the packets still coming in from the previous request would be discarded. If Tribes hits $pref::pingRetryCount on any single master server packet, it decides it can’t contact the master server and either moves on to the next master or pops up an error box. This unfortunately means the per-client rate limit has to be set high enough that the server list can be transmitted within a couple of seconds at most, or 900ms to be safe. Assuming a maximum of 500 servers, or ~4k of data, 5k/s per client should be a decent limit.
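A per-client send queue along these lines might look like the following sketch (illustrative only; real code would also evict clients whose backlog lingers past the client-side timeout):

```cpp
#include <cstddef>
#include <deque>

// Per-client throttling: each client gets its own queue of pending
// list packets, drained at a fixed byte budget per tick so one large
// response can't starve other clients or outrun its own rate limit.
class ClientQueue {
public:
    explicit ClientQueue(size_t bytes_per_sec) : rate_(bytes_per_sec) {}

    void enqueue(size_t packet_bytes) { pending_.push_back(packet_bytes); }

    // Called once per tick with the elapsed seconds; returns how many
    // whole packets may be sent now. Unused budget carries over.
    size_t drain(double elapsed_secs) {
        budget_ += rate_ * elapsed_secs;
        size_t sent = 0;
        while (!pending_.empty() && budget_ >= pending_.front()) {
            budget_ -= pending_.front();
            pending_.pop_front();
            ++sent;
        }
        return sent;
    }

    size_t backlog() const { return pending_.size(); }

private:
    size_t rate_;
    double budget_ = 0;
    std::deque<size_t> pending_;  // packet sizes, in bytes
};
```

At 5k/s per client, the ~4k worst-case list clears the queue in under a second, safely inside the 900ms default ping timeout.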
Due to a poor choice of data structures, Tribes itself limits how large the list can grow. The server list and pending ping list are stored in vectors which grow by 5 items at a time(!), leading to tons of potential copying on vector resizes and painful O(N) lookups. When the “to ping” list reaches ~6000 servers, Tribes locks up for longer than I care to wait. This is exacerbated when you raise $pref::pingTimeoutTime to accommodate larger server lists, leading to a realistic ceiling of around 4000 servers regardless of the timeout bug.
Because Tribes has no challenge-response mechanism available for server list requests, there is nothing to be done about Upload Starvation by IP spoofing. Please try not to annoy anyone who has the ability to spoof and the desire to spam your master server!
I thought about doing something with mirrors, but there really is no good reason to bother.
1. CPU load is a non-issue. Even if there were 50,000 live servers sending heartbeats every 2 minutes, that’s only ~417 heartbeats a second. A heartbeat requires 1 hashtable lookup for the IP penalty, 1 hashtable lookup for the server table, and 3 hashtable lookups for the pending servers table if the server doesn’t exist. A worst case of 5 hashtable lookups × 417 ≈ 2,085 hashtable lookups a second, which would not even register on a load monitor.
Using a separate process to spam my master with heartbeats, I can get up to ~50,000 heartbeats a second without sending a challenge packet and ~35,000 heartbeats sending the challenge (and using ~1000kb/s up in the process). Using a separate machine to spam heartbeats peaks at around ~14,000 a second with around ~30% CPU load for the master server.
A “realistic” test of 18,000 servers, 22 server list requests a second, and 450 heartbeats a second resulted in ~3-10% CPU load and 2,800 kb/s upload. Memory usage peaked at around 40mb, largely due to the 530 concurrent 120k server lists being served from memory at 5kb/s. Reducing the server count to 1,000 while keeping the requests and heartbeats the same results in no measurable CPU load and 900k peak memory usage.
2. Bandwidth, while an issue with a community supporting 50,000 servers, is not an issue with the Tribes community.
- There are currently 121 active servers in the current master list.
- According to archive.org’s snapshot of gamespy’s stats page, Tribes only had ~250 servers up in July 2004.
- Going back even farther is one of Tim Sweeney’s old news posts from October 1999, showing 589 servers during Tribes’ heyday.
Let’s do the math:
Year   Servers   Bytes/List   Req/s @ 40kb up   Req/s @ 160kb up
1999   589       4323         9.47              37.90
2004   250       1950         21.01             84.02
2007   121       1047         39.12             156.49
Even a crappy cable modem can handle 39 reqs/s with the current list, and a T1 could do just as well with the server count from 1999. Obviously it makes more sense to find a host that will stay up than to make a tangled net of mirrors. There are also other problems with unorganized mirrors such as all of the mirrors a server reports to going down and the fact that the Tribes client queries mirrors sequentially instead of randomly.
If a host can’t be found that is guaranteed to be up, then finding 5 such hosts and needlessly complicating the master server isn’t going to make the system more robust.
The only thing I don’t really like about the current code is the packet queuing. While the current server can obviously scale quite high, it should be possible to construct packets on the fly instead of batching them up in memory. The problem to be solved is how to keep multiple iterators in a hash table which can have elements added/removed at any time, and since the master has to tell Tribes how many packets it will be sending, to not send any additional servers which are added after the time of the list request. I have some ideas, but I should probably post this first instead of continuing to put it off.
The code is cross-platform, although I don’t have anything other than Linux to test on so you will probably need to jump through some hoops to compile on BSD or OSX.
All of the tunable parameters are at the top of t1master.cpp and should be fairly self-explanatory. I didn’t want any external dependencies, so the MOTD is “hardcoded”, although you can pass it on the command line. I also didn’t see a point in backing up the server list, since the list will fully regenerate itself within 2 minutes. Note that I do randomize the hash seed, so even if a spoofer floods the server with bogus IPs, he won’t be able to logjam the hash tables with collisions!
t1spam.cpp is the program I used for stress testing by hammering the server with heartbeats and listreqs. See the lines to comment/uncomment in t1master.cpp (masterserver::process), and optionally in t1master.cpp (main), to simulate random request sources and get realistic loads for the hash tables and penalties.
- GitHub Repo: t1master