pikdum's blog


Building a World of Warcraft server in Elixir


Thistle Tea is my new World of Warcraft private server project. You can log in, create a character, run around, and cast spells to kill mobs, with everything synchronized between players as expected for an MMO. It was floating around in my head to build this for a while, since I have an incurable nostalgia for early WoW. I first mentioned this on May 13th and didn’t expect to get any further than login, character creation, and spawning into the map. Here’s a recap of the first month of development.

Prep Work

Before coding, I did some research and came up with a plan.

Day 1 - June 2nd

There are two main parts to a World of Warcraft server: the authentication server and the game server. Up first was authentication, since you can’t do anything without logging in.

To learn more about the requests and responses, I built a quick MITM proxy between the client and a MaNGOS server to log all packets. It wasn’t as useful as expected, since not much was consistent, but it did help me internalize how the requests and responses worked.

The first byte of an authentication packet is the opcode, which indicates which message it is, and the rest is a payload with the relevant data. I was able to extract the fields from the payload by pattern matching on the binary data.
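As a rough illustration of that kind of binary pattern matching, here is a minimal sketch. The payload layout (a one-byte name length followed by the name) and the module and atom names are made up for the example and are not the real logon challenge format.

```elixir
defmodule AuthPacketSketch do
  # Illustrative opcode value; treat it as an assumption for this sketch.
  @cmd_logon_challenge 0x00

  # The first byte is the opcode, everything after it is the payload.
  def parse(<<opcode::8, payload::binary>>), do: parse_payload(opcode, payload)
  def parse(<<>>), do: {:error, :empty}

  # Hypothetical payload: 1 byte name length, then that many bytes of name.
  defp parse_payload(@cmd_logon_challenge, <<len::8, name::binary-size(len), _rest::binary>>) do
    {:ok, :logon_challenge, name}
  end

  # Anything else (unknown opcode or malformed payload) is reported as unhandled.
  defp parse_payload(opcode, _payload), do: {:error, {:unhandled, opcode}}
end
```

For example, `AuthPacketSketch.parse(<<0, 6, "PIKDUM">>)` matches the first clause and extracts the name, while an unknown opcode falls through to the catch-all clause.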

The auth flow can be simplified as:

It uses SRP6, which I hadn’t heard of before this. Seems like the idea is to avoid transmitting an unencrypted password and instead both the client and server independently calculate a proof that only matches if they both know the correct password. If the proof matches, then authentication is successful.

So basically, what I needed to do was:

This whole part is well documented, but I still ran into some issues with the cryptography. Luckily, I found a blog post and an accompanying Elixir implementation, so I was able to substitute my broken cryptography with working cryptography. Without that, I would’ve been stuck at this part for a very long time (maybe forever). Wasn’t able to get login working on day 1, but I was close.

Links:

Day 2 - June 3rd

I spent some time cleaning up the code and found a logic error where I reversed some crypto bytes that weren’t supposed to be reversed. Fixing that made auth work, and I finally got a success with hardcoded credentials.

Next up was getting the realm list to work, by handling CMD_REALM_LIST and returning which game server to connect to.

This got me out of the tedious auth bits and I could get to building the game server.

Links:

Day 3 - June 4th

The goal for today was to get spawned into the world. But first more tedious auth bits.

The game server auth flow can be simplified as:

This negotiates how to encrypt/decrypt future packet headers. Luckily Shadowburn also had crypto code for this, so I was able to use it here. The server proof requires a value previously calculated by the authentication server, so I used an Agent to store that session value. It worked, but I later refactored it to use ETS for a simpler interface.
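A session store on top of ETS could look roughly like the sketch below. The table name, module name, and the choice of username as the key are assumptions for illustration, not the project's actual code.

```elixir
defmodule SessionStoreSketch do
  # Hypothetical table name; a named table lets any process look up sessions.
  @table :session_keys_sketch

  # Must be called once; the table is owned by the calling process
  # and disappears when that process exits.
  def init do
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    :ok
  end

  # Called by the authentication side once a session value is known.
  def put(username, session_key), do: :ets.insert(@table, {username, session_key})

  # Called by the game side when it needs the value for the server proof.
  def get(username) do
    case :ets.lookup(@table, username) do
      [{^username, session_key}] -> {:ok, session_key}
      [] -> :error
    end
  end
end
```

Compared to an Agent, reads here do not go through a single process, which is one reason ETS can feel simpler for this kind of shared lookup.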

After that, it’s something like:

First was handling CMSG_CHAR_CREATE and CMSG_CHAR_ENUM, so I could create and list characters. I originally used an Agent for storage here as well, which made things quick to prototype.

Then I got side-tracked for a bit trying to get equipment to show up, since I had all the equipment display ids hardcoded to 0. I looked through the MaNGOS database and hardcoded a few proper display ids before moving on.

After that was handling CMSG_PLAYER_LOGIN. I found an example minimal SMSG_UPDATE_OBJECT spawn packet, which was supposed to spawn me in Northshire Abbey.

That’s probably the most important packet, since it does everything from:

It has a lot of different forms, can batch multiple object updates into a single packet, and has a compressed variant.

Whoops, had the coordinates a bit off.

After fixing that, I was in the human starting area as expected. No player model yet, though.

I thought movement was broken, but it turns out all keybinds were being unset on every login, so the movement keys just weren’t bound. Manually navigating to the keybinding configuration and resetting them to default allowed me to move around.

Next up was adding more to that spawn packet to use the player race and proper starting area. The starting areas were grabbed from a MaNGOS database that I converted over to SQLite and wired up with Ecto.

Last for the night was to get logout working.

The implementation was something like:

This was the first piece that really took advantage of Elixir’s message passing.

The white chat box was weird, but it was nice being able to log in.

Links:

Day 4 - June 5th

First up was reorganizing the code, since my game.ex GenServer was getting too large.

My strategy for that was:

It worked, but it messed with line numbers in error messages and made things harder to debug.

After that, I wanted to generate that spawn packet properly rather than hardcoding. The largest piece of this was figuring out the update mask for the update fields.

There are a ton of fields for the different types of objects SMSG_UPDATE_OBJECT handles. Before the raw object fields in the payload, there’s a bit mask with bits set at offsets that correspond to the fields being sent. Without that, the client wouldn’t know what to do with the values.

So, I needed to write a function that would generate this bit mask from the fields I pass in. Luckily it’s all well documented, but it still took a while to get to a working implementation.
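A minimal sketch of generating such a mask is below. It assumes each field is a single 32-bit value keyed by its offset, and that the wire layout is a block count byte, then the mask as 32-bit little-endian blocks, then the values in ascending offset order. The offsets in the usage example are arbitrary, not real field offsets.

```elixir
defmodule UpdateMaskSketch do
  import Bitwise

  # fields: a non-empty map of field offset => unsigned 32-bit value.
  def build(fields) when map_size(fields) > 0 do
    offsets = fields |> Map.keys() |> Enum.sort()

    # One 32-bit mask block covers 32 field offsets.
    block_count = div(List.last(offsets), 32) + 1
    mask_bits = block_count * 32

    # Set the bit at each offset that has a value.
    mask = Enum.reduce(offsets, 0, fn offset, acc -> acc ||| (1 <<< offset) end)

    # Values follow the mask, ordered by offset.
    values =
      for offset <- offsets, into: <<>> do
        <<Map.fetch!(fields, offset)::little-32>>
      end

    <<block_count::8, mask::little-size(mask_bits), values::binary>>
  end
end
```

With `UpdateMaskSketch.build(%{0 => 1, 2 => 25})`, bits 0 and 2 are set, so the single mask block is `<<5, 0, 0, 0>>` and the two values follow it.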

Links:

Day 5 - June 6th

Referencing MaNGOS, I added some more messages that the server sends to the client after a CMSG_PLAYER_LOGIN. One of these, SMSG_ACCOUNT_DATA_TIMES, fixed the white chat box and keybinds being reset.

I also added SMSG_COMPRESSED_UPDATE_OBJECT, which compresses the SMSG_UPDATE_OBJECT packet with :zlib.compress/1. This was more straightforward than expected, and I made things use the compressed variant if it’s actually smaller. I’m expecting this to have even more benefits when I get to batching object updates, but right now I’m only updating objects one by one.
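The "only if it's actually smaller" check can be sketched as follows. The module name and the tagged return tuples are hypothetical, and the sketch leaves out any extra framing the compressed packet needs on the wire.

```elixir
defmodule CompressSketch do
  # Returns the compressed payload only when it is smaller than the original.
  def maybe_compress(payload) when is_binary(payload) do
    compressed = :zlib.compress(payload)

    if byte_size(compressed) < byte_size(payload) do
      {:compressed, compressed}
    else
      {:plain, payload}
    end
  end
end
```

Very small payloads usually come back as `:plain`, since the zlib header and checksum cost more than they save.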

Movement would come up soon, so I started adding the handlers for those packets.

Day 6 - June 7th

In the update packet, I still had the object guid hardcoded. This is because it wants a packed guid and I needed to write some functions to handle that. Rather than the entire guid, a packed guid is a byte mask followed by all non-zero bytes. The byte mask has bits set that correspond to where the following bytes go in the unpacked guid. This is for optimizing packet size, since a guid is always 8 bytes but a packed guid can be as small as 2 bytes.

This took a while, because the client was crashing when I changed the packed guid from <<1, 4>> to anything else. After trying different things and wasting a lot of time, I realized that the guid was in two places in the packet and they needed to match. A quick fix later and things were working as expected.
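Based on the description above, packing and unpacking a guid could be sketched like this. It assumes the guid bytes are taken in little-endian order, which is consistent with guid 4 packing to `<<1, 4>>`; the module and function names are hypothetical.

```elixir
defmodule PackedGuidSketch do
  import Bitwise

  # Pack a 64-bit guid into a mask byte followed by its non-zero bytes.
  def pack(guid) when is_integer(guid) and guid >= 0 and guid <= 0xFFFFFFFFFFFFFFFF do
    bytes = :binary.bin_to_list(<<guid::little-64>>)

    {mask, kept} =
      bytes
      |> Enum.with_index()
      |> Enum.reduce({0, []}, fn
        # Zero bytes are skipped and leave their mask bit unset.
        {0, _index}, acc -> acc
        {byte, index}, {mask, kept} -> {mask ||| (1 <<< index), [byte | kept]}
      end)

    body = kept |> Enum.reverse() |> :binary.list_to_bin()
    <<mask::8, body::binary>>
  end

  # Unpack a guid from the front of a binary, returning {guid, rest}.
  def unpack(<<mask::8, rest::binary>>) do
    {bytes, remaining} =
      Enum.reduce(0..7, {[], rest}, fn index, {acc, data} ->
        if (mask &&& (1 <<< index)) != 0 do
          <<byte::8, tail::binary>> = data
          {[byte | acc], tail}
        else
          {[0 | acc], data}
        end
      end)

    <<guid::little-64>> = bytes |> Enum.reverse() |> :binary.list_to_bin()
    {guid, remaining}
  end
end
```

Returning the remaining bytes from `unpack/1` is a design choice for the sketch: a packed guid has a variable length, so the caller needs to know where the rest of the packet starts.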

Links:

Day 7 - June 8th

It was about time to start implementing the actual MMO features, starting with seeing other players. To test, I hardcoded another update packet after the player’s with a different guid, to try and spawn something other than the player.

Then I used a Registry to keep track of logged in players and their spawn packets. After entering the world, I would use Registry.dispatch/3 to:

After that, I added a similar dispatch when handling movement packets to broadcast movement to all other players. This is where the choice of Elixir really started to shine, and I quickly had players able to see each other move around the screen.
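A broadcast built on `Registry.dispatch/3` might look something like the sketch below. The registry name, the `:players` key, the guid stored as the value, and the `{:send_packet, packet}` message shape are all assumptions made for the example.

```elixir
defmodule BroadcastSketch do
  # Hypothetical registry name; duplicate keys let many players share one key.
  @registry PlayerRegistrySketch

  def start, do: Registry.start_link(keys: :duplicate, name: @registry)

  # Each player process registers itself, storing its guid as the value.
  def join(guid), do: Registry.register(@registry, :players, guid)

  # Send a packet to every registered player except the sender.
  def broadcast(from_guid, packet) do
    Registry.dispatch(@registry, :players, fn entries ->
      for {pid, guid} <- entries, guid != from_guid do
        send(pid, {:send_packet, packet})
      end
    end)
  end
end
```

Each receiving process would then handle `{:send_packet, packet}` in its own message loop and write the packet to its socket. Entries are removed automatically when a registered process exits.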

I tested this approach with multiple windows open and it was very cool to see everything synchronized.

I added a handler for CMSG_NAME_QUERY to get names to stop showing up as Unknown and also despawned players with SMSG_DESTROY_OBJECT when logging out.

This is where I started noticing a bug: occasionally I wouldn’t be able to decrypt a packet successfully, which would lead to all future attempts for that connection failing too, since there’s a counter as part of the decryption function. I couldn’t figure out how to resolve it yet, though, or how to reliably reproduce it.

Links:

Day 8 - June 9th

To get chat working, I handled CMSG_MESSAGECHAT and broadcast SMSG_MESSAGECHAT to players, using Registry.dispatch/3 here too. I only focused on /say here, and it goes to all players rather than just nearby ones. Something to fix later.

Related to that weird decryption bug, I handled the case where the server received more than one packet at once. This might’ve helped a bit, but didn’t completely resolve the issue.

Links:

Day 9 - June 10th

I still had authentication with a hardcoded username, password, and salt, so it was about time to fix that. Rather than go with PostgreSQL or SQLite for the database, I decided to go with Mnesia, since one of my goals was to learn more about Elixir and its ecosystem. I briefly tried plain :mnesia, but decided to use Memento for a cleaner interface.

So, I added models for Account and Character and refactored everything to use them. The character object is kept in process state and only persisted to the database on logout or disconnect. Saving on a CMSG_PING or just periodically could be a good idea too, eventually. Right now data isn’t persisted to disk, since I’m still iterating on the data model, but that should be straightforward to toggle later.

Links:

Day 10 - June 11th

Today was standardizing the logging, handling a bit more of chat, and handling an unencrypted CMSG_PING. I was thinking that could be part of the intermittent decryption issues too, but looking back I don’t think I’ve ever had my client send that unencrypted anyways.

Day 11 - June 12th

I wanted equipment working so players weren’t naked all the time, so I started on that. Using the MaNGOS item_template table, I wired things up to set random equipment on character creation. Then I added that to the response to CMSG_CHAR_ENUM so they would show up in the login screen.

Up next was getting it showing in game.

Day 12 - June 13th

It took a bit to figure out the proper offsets for each piece of equipment in the update mask, but I eventually got it working.

Since equipment is part of the update object packet, it just worked for other players, which was nice.

Day 13 - June 14th

I had player movement synchronizing between players properly so I wanted to get sitting working too.

Whoops. Weird things happen when field offsets or sizes are incorrect when building that update mask.

After that, I wanted to play around a bit by randomizing equipment on every jump. Here I learned that you need to send all fields in the update object packet, like health, or they get reset. I was trying to just send the equipment changes but I’d die on every jump.

After making sure to send all fields, it was working as expected.

Day 14 - June 15th

Took a break.

Day 15 - June 16th

Today was refactoring and improvements. I reworked things into proper modules, since it was getting hard to debug when all the line numbers were wrong. Now game.ex called the appropriate module’s handle_packet/3 function, rather than combining everything with use.

I also reworked things so players were spawned with their current position instead of the initial position saved in the registry. This included some changes to make building an update packet more straightforward.

Day 16 - June 17th

Today was just playing around and no code changes.

Not sure why the model is messed up here, but it seems like it’s something with my computer rather than anything server related.

Day 17 - June 18th

The world was feeling a bit empty, so I wanted to spawn in mobs. First was hardcoding an update packet that should spawn a mob and having it trigger on /say.

After that, I used the creature table of the MaNGOS database to get proper mobs spawning. I used a GenServer for this so every mob would be a process and keep track of their own state. Communication between mobs and players would happen through message passing. First I hardcoded a few select ids in the starting area to load, and after that worked I loaded them all.

Rather than spawn all ~57k mobs for the player, I wired things up to only spawn mobs within a certain range. This looked like:

It worked really well and I could run around and see the mobs.

Next up was optimization and despawning mobs that were now out of range.

Day 18 - June 19th

For optimization, I didn’t want to send duplicate spawn packets for mobs that were already spawned. I also wanted to despawn mobs that were out of range. To do this, I used ETS to track which guids were spawned for a player.

In the dispatch, the logic was:

Despawning was done through the same SMSG_DESTROY_OBJECT packet used for despawning a player after logging out.
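One way to sketch the spawn and despawn bookkeeping with ETS is shown below. The table name, the `{player_guid, mob_guid}` key, the straight-line distance check, and the returned atoms are illustrative assumptions rather than the actual implementation.

```elixir
defmodule SpawnTrackerSketch do
  # Hypothetical table holding one row per {player_guid, mob_guid} that is spawned.
  @table :spawned_guids_sketch

  def init do
    :ets.new(@table, [:named_table, :set, :public])
    :ok
  end

  # Straight-line distance check, comparing squared values to avoid a sqrt.
  def in_range?({x1, y1, z1}, {x2, y2, z2}, range) do
    dx = x1 - x2
    dy = y1 - y2
    dz = z1 - z2
    dx * dx + dy * dy + dz * dz <= range * range
  end

  # Decides what to send for one player/mob pair and records the new state.
  def update(player_guid, mob_guid, player_pos, mob_pos, range) do
    key = {player_guid, mob_guid}
    spawned = :ets.member(@table, key)
    near = in_range?(player_pos, mob_pos, range)

    cond do
      near and not spawned ->
        :ets.insert(@table, {key})
        :spawn

      spawned and not near ->
        :ets.delete(@table, key)
        :despawn

      true ->
        :noop
    end
  end
end
```

The caller would send a spawn packet on `:spawn`, an SMSG_DESTROY_OBJECT on `:despawn`, and nothing on `:noop`, which is what avoids duplicate spawn packets.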

After getting that working, I ran around the world and explored for a bit.

I noticed something wrong when exploring Westfall. Bugs were spawning in the air and then falling down to the ground. Turns out I wasn’t separating mobs by map, so Westfall had mobs from Silithus mixed in. To fix, I reworked both the mob and player registries to use the map as the key.

Having mobs standing in place was a bit boring and I wanted them to move around. Turns out this is pretty complicated and I’ll actually have to use the map files to generate paths that don’t float or clip through the ground. There are a few projects for this, all a bit difficult to include in an Elixir project. I’m thinking RPC could work, but not sure if it’ll be performant enough yet.

The standard update object packet can be used for mob movement here, since it has a movement block, but there might be some more specialized packets to look into later too.

Without using the map data, I couldn’t get the server movement to line up with what happened in the client. So, I settled with getting mobs to spin at random speeds.

That was a bit silly and used a lot of CPU, so I tweaked it to just randomly change orientation instead.

Links:

Day 19 - June 20th

Here I got mob names working by implementing CMSG_CREATURE_QUERY. This crashed the client when querying mobs that didn’t have a model, so I removed them from being loaded. I also started loading in mob movement data and optimized the query a bit to speed up startup.

I finally got some people to help me test the networking later that day. It didn’t start very well.

Turns out I hadn’t tested this locally since adding mobs and the player/mob spawn/despawns were conflicting with each other due to guid collisions. Players were being constantly spawned in and out.

I did some emergency patching to make it so players are never despawned, even out of range. I also turned off /say spawning boars since that was getting annoying. That worked for now.

There were still some major issues. My helper had 450 ms latency and would crash when running to areas with a lot of mobs. I couldn’t reproduce, though, with my 60 ms latency.

Links:

Day 20 - June 21st

To reproduce the issue from the previous night, I connected to my local server from my laptop on the same network. On my laptop, I used tc to simulate a ton of latency and wired things up so equipment would change on any movement instead of just jump. This sent a lot of packets when spinning and I was finally able to reproduce.

Turns out the crashing issues were from not receiving a complete packet, but still trying to decrypt and handle it. I was handling the case where the server got more than one packet at once, but not the case where it got a partial packet.

Referencing Shadowburn’s implementation, the fix for this was to let the packet data accumulate until there’s enough to handle. This finally resolved the weird decryption issue I ran into on day 7.
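The accumulate-until-complete idea can be sketched as below. To keep it short, the sketch assumes an unencrypted, hypothetical framing of a 2-byte big-endian body length followed by the body; the real headers are encrypted, so the decryption step would need the same care about only running once per header.

```elixir
defmodule PacketBufferSketch do
  # Appends newly received bytes to the buffer and returns
  # {complete_packet_bodies, leftover_buffer}.
  def push(buffer, data), do: extract(buffer <> data, [])

  # A full packet is available: take it and keep looking for more.
  defp extract(<<size::big-16, body::binary-size(size), rest::binary>>, acc) do
    extract(rest, [body | acc])
  end

  # Not enough bytes for a header or a full body yet: keep them for next time.
  defp extract(incomplete, acc), do: {Enum.reverse(acc), incomplete}
end
```

The same function covers both cases from the text: several packets arriving in one read come back as a list, and a partial packet stays in the leftover buffer until the next read completes it.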

For the guid collision issue, I added a large offset to creature guids so they won’t conflict with player guids.

Day 21 - June 22nd

Took a break.

Day 22 - June 23rd

Worked on CMSG_ITEM_NAME_QUERY a bit, but there’s still something wrong here. It could be that it’s trying to calculate damage using some values I’m not passing to the client yet.

Decided spells would be next, so I started on that. First was sending spells over with SMSG_INITIAL_SPELLS on login, using the initial spells in MaNGOS, so I’d have something in the spellbook. Everything was instant cast though, for some reason.

Turns out I needed to set unit_mod_cast_speed in the player update packet for cast times to show up properly in the client.

I started by handling CMSG_CAST_SPELL, which would send a successful SMSG_CAST_RESULT after the spell cast time, so other spells could be cast. I also handled CMSG_CANCEL_CAST, to cancel that timer. This implementation looked a bit like the logout logic.
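A timer-based cast with cancellation could be sketched with `Process.send_after/3` and `Process.cancel_timer/1`, as below. The module name, message shapes, and the idea of notifying an owner process instead of writing SMSG_CAST_RESULT to a socket are assumptions made to keep the example self-contained.

```elixir
defmodule CastTimerSketch do
  use GenServer

  def start_link(owner), do: GenServer.start_link(__MODULE__, owner)
  def cast_spell(pid, spell_id, cast_time_ms), do: GenServer.call(pid, {:cast, spell_id, cast_time_ms})
  def cancel(pid), do: GenServer.call(pid, :cancel)

  @impl true
  def init(owner), do: {:ok, %{owner: owner, cast: nil}}

  @impl true
  def handle_call({:cast, spell_id, cast_time_ms}, _from, state) do
    cancel_timer(state.cast)
    # The ref tags this cast so a late message from a cancelled cast is ignored.
    ref = make_ref()
    timer = Process.send_after(self(), {:cast_complete, ref, spell_id}, cast_time_ms)
    {:reply, :ok, %{state | cast: {ref, timer}}}
  end

  def handle_call(:cancel, _from, state) do
    cancel_timer(state.cast)
    {:reply, :ok, %{state | cast: nil}}
  end

  @impl true
  def handle_info({:cast_complete, ref, spell_id}, %{cast: {ref, _timer}} = state) do
    # Stand-in for sending a successful cast result to the client.
    send(state.owner, {:cast_result, spell_id, :ok})
    {:noreply, %{state | cast: nil}}
  end

  # A completion message for a cast that was cancelled or replaced.
  def handle_info({:cast_complete, _stale_ref, _spell_id}, state), do: {:noreply, state}

  defp cancel_timer(nil), do: :ok
  defp cancel_timer({_ref, timer}), do: Process.cancel_timer(timer)
end
```

The extra reference guards against the case where the timer fires just before the cancel arrives, so its message is already in the mailbox when the timer is cancelled.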

The starting animation for casting a spell would play, but no cast bar or anything further.

Links:

Days 23 to 26 - June 24th to 27th

Took a longer break.

Day 27 - June 28th

I was able to get a cast bar showing up by sending SMSG_SPELL_START after receiving the cast spell packet.

The projectile effect took a bit longer to figure out. I needed to send a SMSG_SPELL_GO after the cast was complete, with the proper target guids.

Links:

Day 28 - June 29th

I got self-cast spells working by setting the target guid to the player’s guid.

Day 29 - June 30th

Another break.

Day 30 - July 1st

Since I had spells somewhat working, next I had to clean up the implementation. I dispatched the SMSG_SPELL_START and SMSG_SPELL_GO packets to nearby players and fixed spell cancelling.

Day 31 - July 2nd

I added levels to mobs, picked at random between their minimum and maximum level instead of the previously hardcoded 1. Then I made spells do some hardcoded damage, so mobs could die. Mobs would still randomly change orientation when dead, so I added a check to only move if alive.

That seemed like a good stopping point, and it was one month since I started writing code for the project.

Future Plans

I’ll slowly work on this, adding more functionality as I go. My goal isn’t a 1:1 Vanilla server, but more something that fits well with Elixir’s capabilities, so I don’t plan on accepting limitations for the sake of accuracy or similar. I’d like to see how many players this approach can handle and how it compares in performance to other implementations eventually too.

Some things on the list:

So still plenty more work to do. :)

Thanks to all the projects I’ve referenced for this, most of which I’ve tried to link here.

I wouldn’t have gotten very far without them and their awesome documentation.

