Back in the day, online games were often released country by country, and even soft launch (or beta) user volumes were kept under tight control. Scaling up backend services was very different then from today, when apps are usually released globally at once after a very limited soft launch period. Launch usually includes a global marketing push, which requires the game backend to scale from 0 to 100 very fast. Marketing we can control ourselves quite well, but the fortunate ones who get an additional push through app store featuring will lose a significant portion of potential new players if the game becomes unresponsive or, at worst, stops accepting new users altogether. This is particularly true for Supernauts, as the concurrent multiplayer features and social aspects of the game make it quite a bit heavier on the server side than many other, more single-player-oriented games.
The purpose of this post is to shed some light on the choices we’ve made on the backend to guarantee a smooth launch for Supernauts later this year.
The main design principles of the Supernauts backend have been to avoid single points of failure (SPOFs) everywhere and to allow each component to be scaled up or down based on demand and load.
Our first point of contact with a user logging in is our custom HTTP API, which logs the player in and selects which game server the user should be routed to. This server can be chosen based on load, or, if the player needs to visit an already existing live world, we can direct the user to the instance running it. If a world becomes too full, we simply instance out new read-only versions of it so that visiting and socializing aren’t hurt by worlds filling up.
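In rough pseudocode terms, the routing decision looks something like the sketch below. This is a hedged illustration, not our actual API: the names (`WorldInstance`, `route_player`) and the capacity limit are invented for the example, and the real logic also weighs server load.

```python
# Illustrative sketch of the routing described above: join a player to a
# non-full instance of a world, or spin up a read-only copy when every
# existing instance is full. Names and the capacity limit are assumptions.

MAX_PLAYERS_PER_WORLD = 50  # assumed capacity limit, purely illustrative

class WorldInstance:
    def __init__(self, world_id, read_only=False):
        self.world_id = world_id
        self.read_only = read_only
        self.players = set()

    @property
    def full(self):
        return len(self.players) >= MAX_PLAYERS_PER_WORLD

def route_player(player_id, world_id, instances):
    """Return an instance of `world_id` the player can join.

    Prefers an existing non-full instance; otherwise creates a
    read-only copy so visiting and socializing still work.
    """
    for inst in instances:
        if inst.world_id == world_id and not inst.full:
            inst.players.add(player_id)
            return inst
    # Every existing instance of this world is full: instance out
    # a fresh read-only version.
    copy = WorldInstance(world_id, read_only=True)
    copy.players.add(player_id)
    instances.append(copy)
    return copy
```

The key point the sketch shows is that "world full" never translates into "player rejected" — overflow players land in a read-only copy instead.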
Scaling up both the API and the game servers, together or individually, is trivial; scalability problems usually appear deeper down, in the base layer of a distributed game server: the database and the messaging system.
Databases and messaging
Database scalability is something we’ve kept in mind from day one of Supernauts development. We wanted to set up a database that is flexible for developers and enables fast production updates (at least compared to a MySQL ALTER TABLE on a big database…). The database had to support sharding out of the box and also be reliable, as we are providing a service for users who are investing their time and money in the game. We understand it’s “just a game”, but the highest quality and reliability standards need to be kept throughout the project.
We investigated a lot of different systems and ended up choosing MongoDB, as it provides sharding and replica sets out of the box. Its handy document model lets us handle server-to-database communication without any additional object-relational mapping layers in between. In comparison to some other NoSQL solutions, MongoDB also provides indexes and direct queries on collections in case we need something more complex than a simple “object for id” fetch.
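To make the document model concrete, a player record in this style could look roughly like the fragment below. Every field here is invented for illustration — it is not the actual Supernauts schema.

```json
{
  "_id": "user-12345",
  "name": "BuilderBob",
  "level": 7,
  "currency": { "hard": 40, "soft": 1200 },
  "worlds": ["world-abc", "world-def"]
}
```

The whole player is one document, so the common case is a single fetch by `_id` (`db.players.find({_id: "user-12345"})`), while an index on a field like `level` keeps richer queries (say, all players above a level threshold) possible without any mapping layer.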
For launch we’re doing our best to overestimate initial traffic and to guarantee there’s some headroom in the database for surprises (there are always surprises). We can always scale the database cluster down once we know the real player volumes, but it’s better to worry about that later than to spend the launch week scaling services up under load. Even if we do hit the full capacity of our launch cluster (a handful of replicated shards), we can add new shards on the fly or use our secret weapon: even more powerful Amazon EC2 servers to run the database. We’re of course already using very powerful machines here, but having the option of going even further is always nice.
“Redis just works” – Roope Kangas, Lead Server Developer
MongoDB isn’t optimal for all purposes, though. For more temporary data and service-to-service messaging we’re using a wonderful tool called Redis. Redis is basically an in-memory database offering very fast writes and reads. We use it mainly as a server index to figure out which server each world is running on, but also to store temporary listings such as high score lists and users’ online status.
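The two Redis usages mentioned above map naturally onto Redis hashes (the server index) and sorted sets (high score lists). To keep the sketch self-contained and runnable, the example below uses a tiny in-memory stand-in instead of a live Redis connection; the comments note which real Redis commands each method mirrors.

```python
# Pure-Python stand-in for the two Redis patterns described above.
# In production these would be actual Redis commands over redis-py;
# here plain dicts keep the example dependency-free.

class FakeRedis:
    def __init__(self):
        self.hashes = {}  # server index: world id -> game server
        self.zsets = {}   # score listings: member -> score

    def hset(self, key, field, value):      # mirrors Redis HSET
        self.hashes.setdefault(key, {})[field] = value

    def hget(self, key, field):             # mirrors Redis HGET
        return self.hashes.get(key, {}).get(field)

    def zadd(self, key, score, member):     # mirrors Redis ZADD
        self.zsets.setdefault(key, {})[member] = score

    def zrevrange(self, key, start, stop):  # mirrors Redis ZREVRANGE
        ranked = sorted(self.zsets.get(key, {}).items(),
                        key=lambda kv: kv[1], reverse=True)
        return [member for member, _ in ranked[start:stop + 1]]

r = FakeRedis()
# Server index: which game server is world-abc running on?
r.hset("world-servers", "world-abc", "gameserver-3")
# Temporary listing: high scores as a sorted set, highest first.
r.zadd("highscores", 120, "alice")
r.zadd("highscores", 300, "bob")
```

The key names (`world-servers`, `highscores`) and server ids are made up for the example; only the command semantics are meant to match Redis.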
The killer feature of Redis for us is its pub/sub (publish & subscribe) system, which allows fast messaging between different services. This is critical for features like team chat, where team members might not sit on the same server at the same time.
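The pattern is simple: each team gets a channel, and every game server hosting a member of that team subscribes to it, so a chat message reaches everyone regardless of which server they sit on. A minimal in-process sketch of that idea, with invented channel names and with callbacks standing in for Redis SUBSCRIBE/PUBLISH:

```python
# Minimal in-process pub/sub sketch of the cross-server team chat
# pattern. In production Redis PUBLISH/SUBSCRIBE plays this role;
# the channel naming scheme here is an assumption for the example.

from collections import defaultdict

class PubSub:
    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):   # ~ Redis SUBSCRIBE
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):      # ~ Redis PUBLISH
        for callback in self.subscribers[channel]:
            callback(message)
        # Like Redis, report how many subscribers received it.
        return len(self.subscribers[channel])

bus = PubSub()
server_a_inbox, server_b_inbox = [], []
# Two game servers hosting members of team 42 both subscribe.
bus.subscribe("chat:team:42", server_a_inbox.append)
bus.subscribe("chat:team:42", server_b_inbox.append)
bus.publish("chat:team:42", "hello team!")
```

The point of the pattern is that neither publisher nor subscriber needs to know which servers the other players are on — the channel is the only shared coordinate.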
Overall, our philosophy is to start with components that are as generic as possible and add specialization only when needed. Off-loading temporary items from MongoDB to Redis is a good example of this.
Load testing
Launching a game with nothing but an estimated cluster size would be close to flying blind, so we’ve done very extensive load tests on both the HTTP API and our game servers.
We’ve estimated, beyond our wildest dreams, the number of potential new users trying to enter the game right at launch, and ran tests to make sure these are volumes we can handle. We’ve been using both fully programmatic testing and anonymized traffic recorded in in-house test sessions of the game. The most critical part to test is the login flow of a new user: this is one of the heaviest operations on the backend, as we need to set up all the data structures for a full player. Later in the game we only make small updates to this data.
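The programmatic side of such a test boils down to firing many concurrent new-user logins and counting successes. The sketch below shows the shape of that driver; `fake_login` is a stub standing in for the real HTTP request to the login endpoint, so the example stays self-contained.

```python
# Hedged sketch of a programmatic load generator for the new-user
# login flow. fake_login is a stand-in: the real test would POST to
# the login endpoint, which creates the full player data structures.

from concurrent.futures import ThreadPoolExecutor

def fake_login(user_id):
    """Stub for the heavy new-user login request."""
    return {"user": user_id, "status": "ok"}

def run_load_test(n_users, concurrency):
    """Fire n_users logins with the given concurrency; return the
    number of successful responses."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(fake_login, range(n_users)))
    return sum(1 for r in results if r["status"] == "ok")
```

In a real run the interesting outputs are latency percentiles and error rates under increasing concurrency, not just the success count — but the driver structure is the same.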
During load testing we’ve caught multiple issues that would have wreaked havoc at launch had they not been discovered beforehand. Still, one of our biggest challenges was generating enough load for the tests in a well-orchestrated fashion. We actually ended up setting up a very sizeable cluster of small machines in Amazon EC2 just to generate load on other servers on Amazon :)
Monitoring and management
Despite extensive planning and testing, it would be ridiculously stupid to assume that everything will be smooth sailing once users start pouring in. Of course, in the happy accident that it actually is, we can just sit back and relax with a well-deserved craft beer, but it’s more likely that something surprising will come up. This is why it’s absolutely critical that we have all the monitoring, logging, debugging and management tools in place.
If the assumption is that anything can go wrong, then you simply need to be prepared for everything. All backend components need to run with high enough log levels to make sure we see what’s going on. In MongoDB we have to be able to pinpoint any database operations that slow things down. On a higher level, we need to be able to monitor the health of every Amazon EC2 instance out there, whether it’s used by the API, the game servers, MongoDB, Redis or something else. Everything has to be properly configured and in the same timezone, as otherwise sorting out which component caused the initial failure can be very hard. In the odd event that something happens without giving us enough information about the cause, the first priority after fixing the issue is to make sure a similar failure is easier to spot in the future.
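The timezone point has a very concrete form in practice: pin every log timestamp to UTC so lines from the API, game servers and database hosts can be correlated directly. A small sketch of that setup with Python's standard `logging` module (logger name and format are illustrative):

```python
# Force UTC timestamps on every log line, regardless of the host's
# local timezone, so logs from different machines line up directly.

import logging
import time

formatter = logging.Formatter(
    fmt="%(asctime)sZ %(name)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
formatter.converter = time.gmtime  # UTC instead of local time

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("gameserver")  # illustrative logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("player %s logged in", "user-12345")
```

Whatever logging stack is actually in use, the same principle applies: one clock for the whole fleet, applied at the formatter, not left to each host's configuration.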
To keep up with scale, everything is managed by Amazon CloudFormation. We can do lots of cool stuff with this, from guaranteeing minimum instance counts to adding more servers on the fly where needed.
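In CloudFormation terms, "guaranteeing minimum instance counts" comes down to an Auto Scaling group resource in the template. The fragment below is purely illustrative — resource names and sizes are invented, and it shows only the shape of such a declaration, not our actual stack:

```json
{
  "GameServerGroup": {
    "Type": "AWS::AutoScaling::AutoScalingGroup",
    "Properties": {
      "MinSize": "4",
      "MaxSize": "40",
      "LaunchConfigurationName": { "Ref": "GameServerLaunchConfig" },
      "AvailabilityZones": { "Fn::GetAZs": "" }
    }
  }
}
```

With `MinSize` set, AWS replaces failed instances automatically to keep the floor, and raising `MaxSize` (or attaching scaling policies) is how capacity gets added on the fly.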
With Amazon services, a wide set of logging and monitoring tools, and, not least, our incredibly talented backend team with its “Combat Engineering” attitude, I’m very confident that we’re well set up for the launch and beyond. Just bring it on!