A 3rd Person Camera in a Complex Voxel World

DreamCat Games
19 min read · Aug 25, 2020

Who’d have thought that in the first year of making my video game, the feature that would take the most time would be the camera? Definitely not me. The amount of camera code I have now isn’t terribly much, but the time-consuming part was failing over, and over, and over again in the design phase. Thankfully, I now have a newfound appreciation for the complexities involved, and I came out of this experience with some effective algorithms that I want to share.

In this article, I’ll explain the core parts of my camera controller, and how it manages collisions with the environment.

Here’s a quick demo of my camera in action:

Camera collision resolution

Arc Ball Camera Primer

Just to start us off on solid ground, I’m going to explain the geometric principles of a simple 3rd person camera, referred to as the “arc ball camera” or “orbit camera.” The camera will always lie on a sphere centered at the target. Thus, we only require three parameters to describe the location of the camera: the radius R of the sphere and two angles theta and phi. We will refer to the angles by more descriptive names, where theta is the yaw and phi is the pitch.

Spherical coordinates

In order to get the position of the camera in Cartesian coordinates, we must start with the vector (0, 0, 1), then rotate it around the Y axis by the yaw, and then rotate it around the “pitch axis” by the pitch. The pitch axis must be some vector which is orthogonal to both the Y axis and the vector resulting from the first rotation. You can find this orthogonal vector using the cross product. Once you have a unit vector pointing in the right direction, scale it by the sphere’s radius. The final result of this calculation is a vector from the target to the camera, known as the “eye vector.”

The full procedure is implemented in unit_vector_from_yaw_and_pitch, making good use of the nalgebra library. (Scaling by the radius is done later).
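Here’s roughly what that looks like with nalgebra. This is just a minimal sketch of the procedure described above; the sign conventions (and the actual Voxel Mapper code) may differ slightly.

```rust
use nalgebra::{Unit, UnitQuaternion, Vector3};

// A sketch of the procedure: start at +Z, yaw about the world Y axis, then
// pitch about the axis orthogonal to both.
fn unit_vector_from_yaw_and_pitch(yaw: f32, pitch: f32) -> Vector3<f32> {
    let forward = Vector3::z();

    // Rotate +Z around the Y axis by the yaw angle.
    let yaw_rot = UnitQuaternion::from_axis_angle(&Vector3::y_axis(), yaw);
    let yawed = yaw_rot * forward;

    // The pitch axis is orthogonal to both the Y axis and the yawed vector.
    let pitch_axis = Unit::new_normalize(yawed.cross(&Vector3::y()));
    let pitch_rot = UnitQuaternion::from_axis_angle(&pitch_axis, pitch);

    pitch_rot * yawed
}
```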

Bonus exercise for the reader: how do you implement the inverse function, i.e. take a vector in Cartesian coordinates and extract the pitch and yaw? This is actually a useful function for doing things like initializing the camera’s polar coordinates with a given position.
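If you want to check your answer, here is one way to do it that’s consistent with the conventions in the sketch above (not necessarily how Voxel Mapper does it):

```rust
use nalgebra::Vector3;

// Recover (yaw, pitch) from an eye vector, assuming the +Z / yaw-about-Y
// conventions used in the sketch above.
fn yaw_and_pitch_from_vector(v: &Vector3<f32>) -> (f32, f32) {
    // Yaw is the angle of the vector's shadow in the XZ plane.
    let yaw = v.x.atan2(v.z);
    // Pitch is the elevation above the XZ plane.
    let pitch = v.y.atan2((v.x * v.x + v.z * v.z).sqrt());

    (yaw, pitch)
}
```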

Armed with this knowledge, it’s fairly straightforward to implement a camera controller that takes some input and moves the camera around the sphere or even scales the size of the sphere to zoom the camera in and out. Take mouse input for example. When holding the right mouse button, the (dx, dy) of the cursor will induce (yaw, pitch) rotations. The rotation amounts will be proportional to some sensitivity constants in the game’s settings. To scale the radius, I use the mouse’s scroll wheel.

One detail you must be sure to consider is what happens when the pitch is close to ±90 degrees. When using a function like UnitQuaternion::face_towards, this is a degenerate case, as the eye vector is collinear with the “up vector.” This gives garbage results where the camera loses its mind. To avoid this, I do not allow the absolute value of my pitch angle to be greater than 89.9 degrees.
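To make that concrete, a controller for this might look something like the sketch below. The struct and the sensitivity constants are purely illustrative, not the actual Voxel Mapper types.

```rust
use std::f32::consts::PI;

const MOUSE_SENSITIVITY: f32 = 0.005; // radians per pixel of cursor movement
const SCROLL_SENSITIVITY: f32 = 0.5; // world units per scroll tick
const MAX_PITCH: f32 = 89.9 * PI / 180.0;

// Illustrative controller state: the three spherical coordinates.
struct CameraController {
    yaw: f32,
    pitch: f32,
    radius: f32,
}

impl CameraController {
    // Apply one frame of mouse input (while the right button is held).
    fn handle_input(&mut self, dx: f32, dy: f32, scroll: f32) {
        self.yaw += dx * MOUSE_SENSITIVITY;
        self.pitch += dy * MOUSE_SENSITIVITY;

        // Keep the pitch away from the degenerate +/-90 degree case.
        self.pitch = self.pitch.max(-MAX_PITCH).min(MAX_PITCH);

        // The scroll wheel zooms by scaling the sphere's radius.
        self.radius = (self.radius - scroll * SCROLL_SENSITIVITY).max(1.0);
    }
}
```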

Designing a Camera for Your Game

Before we get much further, it’s important to start considering what we need our camera to be good at. You should be prepared to answer a lot of questions about your game before designing a camera, and you will still probably get it wrong many times. I definitely failed to have a clear set of requirements for my game before designing a camera, and that’s certainly one reason I’ve spent so much time on it.

There are some good resources out there that can help you answer important questions about how you want to design your camera.

Because I’m making a 3D strategy game, the most useful vantage point for the player will be from overhead, getting a full view of the battlefield. And I’ll need to be able to use the cursor to select interactive objects, like characters, items, and floor tiles.

This top-down view is relatively common. The best example I could find of what I want is the camera from Dragon Age: Origins. This camera is pretty standard, but from a high enough angle, it actually chops off the top of all geometry so it doesn’t occlude the important parts of the scene, and it makes it easier to see a lot of the map at once. And what happens if you want to look up? One thing you could do is lock in close under the target and effectively use a first-person camera to look around. That gives the player a lot of viewing options!

Look at that beautiful chop top

Clearly the hardest part of implementing this kind of camera is chopping off all of the geometry when viewing from far above. How do you decide the height to chop at? Does the chopping height change as the height of the floor changes? How do you actually do the chopping? How do you remove bounding volumes so you can do raycasts through the transparent ceilings? Etc. This is a complex problem in its own right. I’m not saying it’s a bad idea, but it has a lot of implications for the way you design your maps and other game mechanics. While I expect that I will eventually come back to this technique or something similar, before taking a swing at map chopping, I decided to try something different.

Instead of having different camera modes depending on the viewing angle, why not just always do camera collisions with all of the map terrain? Reasons not to do this: it’s hard, and it can put your camera in tight or ambiguous situations. Even now, I think the biggest deficiency of my camera is that it can’t always give a wide enough view of the scene when it’s resolving collisions in a tight space, like a cave or hallway. And in a lot of cases, the problem is ill-defined! You have to make choices for the player about where the camera is going, even when there are multiple “valid” locations for it.

But even so, I found it useful to tackle the problem of camera collisions, since I expect there will be many scenarios where it is relevant. I think the collision resolution strategy I have now could function pretty well for a traditional 3rd person game.

Understanding the Map Geometry

In my game, I want to take full advantage of the 3D setting, and that means characters won’t often be traveling on simple, flat terrain. There will be tunnels, cliffs, ramps, ladders, bridges, etc. With such a range of possible geometries and topologies in the map, the camera controller will need to be smart about how it places the camera, and it will need a lot of information about the map to do its job.

I represent my game map as a voxel lattice. I talked about this in detail in my previous article on smooth voxel meshing. Every voxel has some properties that determine things like material, surface geometry, and whether the voxel is “empty.” This emptiness property is precisely what my camera system uses when checking for collisions.

To transform this discrete map into something that can be used for continuous collision detection, I create a bounding volume tree (BVT) of axis-aligned bounding boxes (AABBs), where each non-empty voxel gets a box. In fact, only those voxels which are adjacent to (share a face with) an empty voxel are chosen to have AABBs, since collisions should not be able to reach a voxel otherwise. So effectively, all “surface voxels” will have an AABB for doing collisions, which corresponds to the location of mesh surfaces. The overall effect is that if something collides with one of these AABBs, it will stop before hitting the surface of a mesh.

AABBs generated for surface voxels
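Here is a sketch of how such a BVT could be built with ncollide3d, assuming unit-sized voxels and a hypothetical VoxelMap interface that can answer the emptiness query:

```rust
use nalgebra::Point3;
use ncollide3d::bounding_volume::AABB;
use ncollide3d::partitioning::BVT;

// A hypothetical voxel map interface; only the emptiness query and a way to
// enumerate solid voxels matter for this sketch.
trait VoxelMap {
    fn is_empty(&self, p: Point3<i32>) -> bool;
    fn solid_voxels(&self) -> Vec<Point3<i32>>;
}

const FACE_NEIGHBORS: [[i32; 3]; 6] = [
    [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1],
];

// Build a BVT with one AABB per "surface voxel": a solid voxel that shares a
// face with an empty voxel. Each leaf keeps the voxel coordinates as its key.
fn build_surface_voxel_bvt(map: &impl VoxelMap) -> BVT<Point3<i32>, AABB<f32>> {
    let mut leaves = Vec::new();
    for p in map.solid_voxels() {
        let on_surface = FACE_NEIGHBORS.iter().any(|&[dx, dy, dz]| {
            map.is_empty(Point3::new(p.x + dx, p.y + dy, p.z + dz))
        });
        if on_surface {
            // Assuming voxels with unit edge length at integer coordinates.
            let mins = Point3::new(p.x as f32, p.y as f32, p.z as f32);
            let maxs = Point3::new(mins.x + 1.0, mins.y + 1.0, mins.z + 1.0);
            leaves.push((p, AABB::new(mins, maxs)));
        }
    }
    BVT::new_balanced(leaves)
}
```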

You might wonder why we don’t just use the mesh itself for collisions. This should be possible, although it is a more complex geometric problem. The ncollide3d library has a TriMesh type that implements the Shape trait, so it can be used for ray-casts and time of impact calculations. I personally haven’t tried this yet, but I am considering it. I chose the voxel AABBs for doing 3D picking of voxels in my UI. So it made sense to kill two birds with one stone.

Regardless of the type of shape used for collisions, the shapes will have bounding volumes which allow them to be stored in a BVT. The BVT implements the BVH trait from ncollide3d, which allows me to use a variety of useful functions to facilitate ray-casting and sphere-casting.

Camera Collisions 101: Sphere Casting

If we follow the rule, “never occlude the player character,” in the strictest sense, then the camera must always be the closest object to the target (usually the character) along the line of sight. We could try to cast a ray from the target along the line of sight and the point where it first collides with an object could be the camera’s location.

Unfortunately, there is a fundamental problem with using rays for this. A ray is simply too skinny. It can fit inside of any crevice, no matter how small. This will inevitably cause problems where the camera is seemingly inside of the map geometry.

Camera in a crevice

Fortunately the solution is simple. Instead of ray-casting, we’ll “sphere-cast.” This involves surrounding the camera with a sphere and moving the sphere along the line of sight until it collides with something. This ensures that any map geometry will be at a minimum radius from the camera.

The full path of the sphere forms a capsule shape

Although sphere-casting is conceptually similar to ray-casting, the implementation is not that similar. Calculating the contact point between two volumes is quite different from calculating the contact point between a ray and a volume. But ncollide3d has us covered again with the time_of_impact function. We can give it any two shapes and their velocity vectors, and it tells us when they will collide.

But time_of_impact only works between two shapes. We have a whole BVT of AABBs! But don’t worry, ncollide3d has a Compound shape which combines many shapes into a single shape. This will work for most use cases.

However, because I have a map of many voxels, it is actually less efficient for me to store a Cuboid and its Isometry inside of a Compound for every single solid voxel. You could even say it’s already inefficient to store an AABB for every solid voxel, but I made this tradeoff to be compatible with ncollide’s algorithms and keep things simple until I’m operating at a scale that necessitates more efficiency.

So how do I accomplish sphere-casting with a bare BVT and no compound shapes? The trick is to cheaply narrow down the set of AABBs that our sphere could collide with before doing contact point calculations. This can be done by calculating the AABB of the sphere along the entire ray, and only considering AABBs inside of it.

Note: the longer the ray, the less effective the filter!

ncollide3d has the BoundingVolumeInterferencesCollector that does exactly this for us. Given an AABB, it will traverse our BVT and find all the AABBs that “interfere” with it.

Then once we have the narrower set of AABBs to collide with, we have to convert them to Cuboids in order to use time_of_impact. We’ll do this one at a time and remember which cube had the earliest time of impact. That will be the final contact point we choose for setting the position of the camera.
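Putting the broad-phase filter and the narrow-phase loop together, the sphere-cast might look like the sketch below. It assumes the voxel-keyed BVT from earlier, unit-sized voxels, a unit-length cast direction, and the ncollide3d 0.23-era time_of_impact signature (older versions return a bare time instead of a TOI struct).

```rust
use nalgebra::{Isometry3, Point3, Vector3};
use ncollide3d::bounding_volume::{self, BoundingVolume, AABB};
use ncollide3d::partitioning::{BoundingVolumeInterferencesCollector, BVH, BVT};
use ncollide3d::query;
use ncollide3d::shape::{Ball, Cuboid};

// Cast a sphere of radius `camera_radius` from `start` along the unit vector
// `dir`, for at most `max_toi` units, against the BVT of surface-voxel AABBs.
// Returns the earliest time of impact, if any.
fn sphere_cast(
    bvt: &BVT<Point3<i32>, AABB<f32>>,
    start: Point3<f32>,
    dir: Vector3<f32>,
    max_toi: f32,
    camera_radius: f32,
) -> Option<f32> {
    let ball = Ball::new(camera_radius);
    let start_iso = Isometry3::translation(start.x, start.y, start.z);
    let end = start + dir * max_toi;
    let end_iso = Isometry3::translation(end.x, end.y, end.z);

    // Broad phase: the AABB of the sphere swept over the whole cast.
    let swept_aabb = bounding_volume::aabb(&ball, &start_iso)
        .merged(&bounding_volume::aabb(&ball, &end_iso));
    let mut candidates = Vec::new();
    bvt.visit(&mut BoundingVolumeInterferencesCollector::new(
        &swept_aabb,
        &mut candidates,
    ));

    // Narrow phase: time of impact against each candidate voxel cube,
    // keeping the earliest hit.
    let voxel_cube = Cuboid::new(Vector3::repeat(0.5));
    let mut earliest: Option<f32> = None;
    for voxel_coords in candidates {
        let center = voxel_coords.map(|i| i as f32 + 0.5);
        let voxel_iso = Isometry3::translation(center.x, center.y, center.z);
        if let Some(toi) = query::time_of_impact(
            &start_iso, &dir, &ball,
            &voxel_iso, &Vector3::zeros(), &voxel_cube,
            max_toi, 0.0,
        ) {
            earliest = Some(earliest.map_or(toi.toi, |t: f32| t.min(toi.toi)));
        }
    }
    earliest
}
```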

Doing contact point calculations for many objects can get expensive, so it’s important to keep an eye on this.

Where Basic Sphere-casting Fails

Great! We know how to cast spheres from the camera target to avoid occlusion. But this will often cause problems. For example, consider this terrain:

Sphere casting in a constant direction as the target moves along bumpy terrain

How do you think it will feel for the player to run along while the camera is quickly zooming in and out? Not good. Notice that if we want to solve this problem, we need to break the rule we mentioned earlier, “never occlude the player character.” Surely it’s better to bend this rule than to force the camera to bounce around and make the player sick.

Starting Simple

Just to understand the complexity of this problem, let’s analyze some really simple candidate solutions.

Let’s say we just ignored all collisions that happen below some threshold distance from the target. This would prevent the camera from zooming in too close to the target. But wait… then if the target gets too close to a wall, the camera can jump right through it. This solution won’t work in general.

What if we add a temporal smoothing filter to the camera radius so it’s less jumpy? This is actually a good idea, but it won’t solve the problem in general. Across different time scales, the smoothing will behave differently, and it will still be noticeable to the player.

What if we move the target above the ground by some distance so the sphere cast is less likely to collide with the ground? The target could have an anatomy, with feet and a head. The feet move along the ground and the head is the camera’s target. That actually works pretty well for being so simple! But again, it’s not a general solution. If the feet move under some overhang, the head might actually go inside of the geometry. Then your sphere cast will start with an immediate collision, and that makes the player feel really claustrophobic.

Something a Little Smarter

You may have noticed that, with a large enough radius from target to camera, the camera should never actually come in contact with the terrain. It’s only the fact that we cast the sphere starting from the target that gets us into trouble. In fact, the camera could just hover at a constant height, and there would be no problems, so long as it doesn’t run into a tall hill or maybe a stalactite.

So we might consider this fact and say, “we should be checking for collisions along the actual path of the camera.”

If the camera stays at a constant offset from the target, it only collides with that stalactite.

And we can use sphere-casting for that! If we remember the camera’s position on the last frame, then we can cast a sphere from that position to the newly desired position on the current frame. And what do we do if we encounter a collision? We could just stop the camera there, but that would suddenly change the viewing angle. We could try to slide the camera along the object it collided with towards the target. This opens up a whole new can of worms. What does it mean to “slide along” an arbitrarily complex piece of terrain? Well, we should at least try to avoid occlusion again, right? Does that put us back where we started? Yes.

“Is this a room? When did we enter it?”

After some deep thought about what kinds of camera obstacles are important and which ones should be ignored, you might begin to ask yourself silly questions like, “what is a room, in the geometric sense?” Or, “where is the boundary between being outside and inside of a room?”

At what point does a hump become a wall become an overhang become a room?

And even when you’re pretty sure something is a hallway, when do you actually cross the threshold between outside and in?

When did Gary really enter the hallway?

Don’t worry; you aren’t crazy. A room is an entirely human concept, and unless a space was built with the intention of being a room, then it’s hard to say whether it is one.

In a sense, this is the kind of ambiguous problem that our camera must solve. But this is just too much trouble for us right now. We need to simplify somehow. Thankfully, we don’t need the camera to be entirely automatic. We have a human who is willing to control the camera some of the time, and she just doesn’t want the camera logic to get in her way.

Clarifying the Problem

The most we should ask right now is for the camera to behave consistently and minimize movement while satisfying a couple “simple” constraints:

  1. Stay outside of solid geometry
  2. Respect the player’s input

We don’t have to get too smart yet and try to decide when we should enter a room. The human can do that for us if we let them.

Thankfully, sphere-casting remains a useful tool for staying outside of solid geometry, and it’s compatible with our constraint to respect the player’s input. At a minimum, the player expects that the viewing angle they requested is presented to them. Ideally, the requested radius would be too, but that’s not always possible while resolving collisions.

It seems like we are checking all of the boxes except for “minimize movement.” And in all of the example scenarios we’ve considered, it seems like there must be at least one point on the sight line that satisfies our constraints and minimizes movement. This should be our goal.

Minimizing movement means that we must have some memory of the previous camera position. And then in the next frame we must choose a point on the new sight line that is close to the previous camera position. In the example below, we would be choosing between placing the camera at t2A or t2B. Since t2A is closer to the camera at t1, we should choose that!

An occluder got in our way in frame t2. Should we care?

The next question is, how can we be sure that t2A is a valid position? After all, we have to get around an obstacle to find it. And it might actually be the case that there is a second collision on the sight line that we must account for! Do we need to do multiple sphere casts? And where do we start those sphere-casts?

At this point it was clear to me that this is inherently a search problem. We are searching for the right place to start our sphere-cast so that it ends up both on the new sight line and close to our previous position.

Searching Around Obstacles

So what do you do when you need to search for stuff in a continuous environment and you have limited CPU cycles to spare? You could use ray-casting. Just send out a bunch of rays inside a cone centered towards the camera, and if any of them get sufficiently far without hitting an obstacle, send out another ray straight at the camera.

Ray-cast in many directions and hope one of the rays gets free

I think if you are clever enough, you can probably make this strategy work. Personally, I found it both expensive and confusing. How do you know how far is far enough to get around an obstacle? What angle should you choose? How do you know where to start a sphere-cast? I never figured it out. Maybe you can!

Instead, I decided to cheat a little and take advantage of the fact that I have a voxel grid that tells me where the obstacles are!

My Solution: Graph Search

Graph search, AKA pathfinding, is a big part of the solution I’m using right now in the camera for the Voxel Mapper project. In short, I start the path at the voxel containing the camera’s target, and find a path through empty voxels until I reach the voxel containing the desired camera position.

The grid is super low resolution in this example; in reality the voxels would be much smaller.

It’s possible that the desired camera position is inside of the terrain, i.e. a non-empty voxel, so it’s important to limit the number of search iterations before giving up. The nice part is that even if I give up, I’ll still remember the path that got me closest to the camera.

But then what should we do with the path? We must use it to figure out where we should start our sphere-cast, and it should be a place that:

  1. is on the sight line
  2. isn’t already colliding with terrain
  3. is close to the previous camera position

So what I do is project the path onto the sight line and extract some information about those points to determine whether they are “obstructed.”

Projecting the path points onto the sight line

Notice that the points of the path which are farthest from the sight line are the most obstructed. And some of the projected points lie inside of non-empty voxels, so we obviously can’t start our sphere-cast there. But some of those points are actually good candidates for the starting point of a sphere-cast.

So all of this information goes into my heuristic classification. I say a path point is “obstructed” if either:

  1. The point strays too far from the sight line. (I use some minimum distance threshold).
  2. The voxel containing the projection of the point onto the sight line is not path-connected to empty space.

I determine #2 by trying to find a path between that projected voxel and the path.

Classifying the path points as obstructed or not and grouping them into contiguous ranges
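In code, the classification might look something like the sketch below, where the distance threshold is just a placeholder value and is_connected_to_path stands in for the secondary path search used for condition #2.

```rust
use nalgebra::{Point3, Vector3};

// Placeholder threshold for condition #1; the real value is a tuning knob.
const MAX_DIST_FROM_SIGHT_LINE: f32 = 2.0;

// Which voxel contains a world-space point? (Assumes unit-sized voxels.)
fn voxel_containing(p: &Point3<f32>) -> Point3<i32> {
    p.map(|x| x.floor() as i32)
}

// Classify each path point as obstructed (true) or not, per the two
// conditions above. `is_connected_to_path` stands in for the search that
// checks whether a voxel is path-connected to the empty space of the path.
fn classify_path_points(
    path: &[Point3<f32>],
    target: Point3<f32>,
    sight_dir: Vector3<f32>, // unit vector from the target toward the camera
    is_connected_to_path: impl Fn(Point3<i32>) -> bool,
) -> Vec<bool> {
    path.iter()
        .map(|p| {
            // Project the path point onto the sight line.
            let along = (p - target).dot(&sight_dir).max(0.0);
            let projection = target + sight_dir * along;

            // Condition 1: the point strays too far from the sight line.
            let too_far = (p - projection).norm() > MAX_DIST_FROM_SIGHT_LINE;

            // Condition 2: the voxel containing the projection is not
            // path-connected to empty space.
            let cut_off = !is_connected_to_path(voxel_containing(&projection));

            too_far || cut_off
        })
        .collect()
}
```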

Then once I’ve classified each point as either obstructed or not, I take the contiguous ranges of non-obstructed voxels and select the most promising point from each range for a sphere-cast. The goal is to choose a point in the range which is not too close to either end of the range, since that would put it close to an obstacle and potentially interrupt the sphere-cast too soon.

Of all of those promising points, I choose the one which is closest (along the path) to the previous camera position (minimizing camera movement), and from that point, I start a sphere-cast along the sight line.

That’s it! I just summarized the whole algorithm which you saw in action at the top of the article. Of course, there are some finer details to work out. For that, please take a look at the open source Voxel Mapper project. The code in question can be found here.

Choosing a Good Graph Search Algorithm

One detail I’d like to explain a bit is my choice of graph search algorithm. There are so many to choose from. At first, I tried A*, since it is so familiar and guarantees a shortest path. However, it became evident that I would not be able to maintain good performance when pathfinding across large distances, since A* will expand many, many path options in order to achieve optimality. For my use case, I don’t actually need an optimal path. For this reason, I switched to using greedy best-first search, which always expands the most promising vertex strictly by heuristic, disregarding the actual path cost. This won’t always find an optimal path, but it is fast.
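For reference, a greedy best-first search over voxels with an iteration limit might look like the sketch below. This is not the Voxel Mapper code; it leans on the ordered-float crate so f32 heuristic values can live in a BinaryHeap, and it remembers the best vertex reached, so a useful partial path comes back even when it gives up.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap, HashSet};

use ordered_float::OrderedFloat;

// Greedy best-first search over voxel coordinates: the frontier is ordered by
// heuristic alone (no accumulated path cost), and the search stops after
// `max_iters` expansions or when the goal is reached.
fn greedy_best_first(
    start: [i32; 3],
    goal: [i32; 3],
    is_empty: impl Fn([i32; 3]) -> bool,
    heuristic: impl Fn([i32; 3]) -> f32,
    max_iters: usize,
) -> Vec<[i32; 3]> {
    let mut frontier = BinaryHeap::new();
    frontier.push(Reverse((OrderedFloat(heuristic(start)), start)));

    let mut came_from: HashMap<[i32; 3], [i32; 3]> = HashMap::new();
    let mut visited: HashSet<[i32; 3]> = HashSet::new();
    visited.insert(start);

    // Remember the best vertex seen, in case we give up before reaching the goal.
    let (mut best, mut best_h) = (start, heuristic(start));

    for _ in 0..max_iters {
        let Some(Reverse((h, current))) = frontier.pop() else { break };
        if h.0 < best_h {
            best = current;
            best_h = h.0;
        }
        if current == goal {
            best = goal;
            break;
        }
        // Expand the six face-adjacent neighbors that are empty and unvisited.
        for d in [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]] {
            let next = [current[0] + d[0], current[1] + d[1], current[2] + d[2]];
            if is_empty(next) && visited.insert(next) {
                came_from.insert(next, current);
                frontier.push(Reverse((OrderedFloat(heuristic(next)), next)));
            }
        }
    }

    // Walk back from the best vertex to recover the path.
    let mut path = vec![best];
    while let Some(&prev) = came_from.get(path.last().unwrap()) {
        path.push(prev);
    }
    path.reverse();
    path
}
```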

And for a heuristic, at first I chose the L1 or Manhattan distance. But maybe you notice that this isn’t a great choice for my use case. Considering obstruction condition #1 above, it may happen that once the path goes around an obstruction, it stays far away from the sight line until it gets very close to the finish. That would mean that a good portion of the path is wrongfully classified as obstructed. Ideally, we would like the path to stay on the sight line for as long as possible.

In order to give preference to the path that follows the sight line, I will simply calculate the distance from the path vertex to the sight line and add it to the heuristic cost to break ties.

The actual definition lives in the Voxel Mapper code linked above; the part to focus on is the heuristic.
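As a rough illustration (not the literal Voxel Mapper definition), it might look like this:

```rust
use nalgebra::{Point3, Vector3};

// Manhattan distance to the goal voxel, with the distance from the voxel
// center to the sight line added on to break ties in favor of paths that
// hug the sight line.
fn heuristic(
    voxel: Point3<i32>,
    goal: Point3<i32>,
    target: Point3<f32>,     // start of the sight line
    sight_dir: Vector3<f32>, // unit direction of the sight line
) -> f32 {
    // L1 (Manhattan) distance in voxel units.
    let l1 = ((voxel.x - goal.x).abs()
        + (voxel.y - goal.y).abs()
        + (voxel.z - goal.z).abs()) as f32;

    // Perpendicular distance from the voxel center to the sight line.
    let center = voxel.map(|i| i as f32 + 0.5);
    let to_center = center - target;
    let along = to_center.dot(&sight_dir);
    let dist_to_line = (to_center - sight_dir * along).norm();

    l1 + dist_to_line
}
```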

For this heuristic, I owe thanks to Amit at RedBlobGames for his wonderful article on pathfinding heuristics.

Smoothing Your Camera Movements

As a palate cleanser, let’s talk about something a little different: camera smoothing.

If you want your game to feel smoother than Keith Stone, you’ll want to avoid stilted, discontinuous movements from your camera. If your camera shakes or jumps, the entire screen will do the same. It leaves a bad taste in my mouth.

So what I do is put what is essentially the discrete version of a low-pass filter on my camera transform. Here’s what I mean by low-pass filter, also known as exponential smoothing:

s(n + 1) = (1 - w) * x(n) + w * s(n)

x(n) is the input value and s(n) is the smoothed value. w is some weight in [0, 1). The closer w is to 1, the smoother s(n) will be. I use w = 0.9, but it should probably be up to the player to configure this.

I apply this filter to the camera’s position and target. Then I can reconstruct the sight line from these two points, which requires calculating the pitch and yaw from a vector like we talked about before. I don’t recommend trying to do smoothing directly on the camera angles.
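Here’s a minimal sketch of that filter applied to the two points (the names are illustrative):

```rust
use nalgebra::Point3;

// Exponentially smoothed camera transform; w close to 1 means heavier smoothing.
struct SmoothedTransform {
    position: Point3<f32>,
    target: Point3<f32>,
    w: f32, // e.g. 0.9
}

impl SmoothedTransform {
    // Blend the newly computed (raw) transform into the smoothed one.
    fn update(&mut self, raw_position: Point3<f32>, raw_target: Point3<f32>) {
        let w = self.w;
        self.position =
            Point3::from(raw_position.coords * (1.0 - w) + self.position.coords * w);
        self.target =
            Point3::from(raw_target.coords * (1.0 - w) + self.target.coords * w);
    }
}
```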

Future Work

There is definitely still a lot of work to be done on this camera. It works quite well for honoring the player’s input and avoiding obstacles, but it could be even better at minimizing movement and understanding context.

For example, if the player forces the camera into a tight space between the target and a wall, I could push the target away from the wall to give more space. I expect this to be a pretty simple near-term improvement.

In the long term, I will need the camera to be able to automatically adjust the viewing angle and distance as the target moves into new surroundings. One example of this is if the target goes around a corner in the terrain, the camera should pan around the corner instead of just colliding.

Another way to make the camera movement smoother is to minimize vertical movement as the target moves up and down along bumpy terrain. The yaw could stay the same, but instead of maintaining a constant pitch and radius, the camera could just stay at the same height (in world coordinates). This would make it so that, on the screen, the target moves up and down while the terrain around it stays relatively still. This is an important feature in platformer cameras, so that the player has a constant reference frame when jumping.

Aside from controlling the camera’s configuration, it would also be nice to apply some rendering post-effects. One effect I’ve seen in other games is to dither the geometry that gets close to the camera or occludes the target. Then you can see your character through walls or other obstructions.

If I get around to implementing any of this stuff, I’ll try to write about it, and as always, keep it open source. ’Til next time, keep making awesome stuff and sharing with the world. Bye!
