**UPDATE** I’ve just posted a new video to YouTube, embedded at the bottom of this post, showing the point cloud visualisation running in real time.
You may know from this blog that I work for a company called Square Enix, and before that Eidos. SE is a video games publisher, famous for Final Fantasy and, through the acquisition of UK publisher Eidos, Tomb Raider. Among other duties, I manage the metrics system and create data and visualisations that the business uses to make better games. All the data we collect is anonymous. For a while now I’ve been using this data to create heatmaps of player activity. You can see some of these on the Just Cause website.
I’ve been using Processing to create tools that render the heatmaps, but while the logical structure of the program is fairly simple, there are significant challenges in working with large datasets. The primary challenge is loading the data into memory. The data is all held in a SQL database, and while I could connect to the DB directly from Processing, the DB is optimised for data-in operations, not data-out, so you don’t want to be pulling the data out too often. Instead, I dump the raw spatial data (X, Y, Z coordinates) into a CSV file, one record per row. I usually create heatmaps from datasets in excess of 1 million rows, and most of them are between 5 and 20 million rows (I have one that is 22 million rows!). A CSV file containing 10 million rows of spatial data is about 364MiB in size (the 22.3m row CSV is 802MiB!). In order to create the in-memory data structures to hold sets this large, I have to work in 64-bit mode to get past the Windows 32-bit memory restrictions.
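To give a feel for the loading step, here is a minimal Java sketch of reading an “X,Y,Z per row” CSV into parallel primitive float arrays (the class and method names are my own, not the original tool’s). Three floats per row is 12 bytes, so 10 million rows need roughly 120MB of heap for the raw points alone — which is why a set this size, once rendering structures are added, pushes past 32-bit limits.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch: load a large "X,Y,Z" CSV into parallel primitive
// float arrays, which are far cheaper per row than boxed objects.
public class PointLoader {
    // Parse one CSV row into {x, y, z}.
    static float[] parseRow(String line) {
        String[] f = line.split(",");
        return new float[] { Float.parseFloat(f[0]),
                             Float.parseFloat(f[1]),
                             Float.parseFloat(f[2]) };
    }

    static float[][] load(String path, int rowCount) throws IOException {
        float[] xs = new float[rowCount];
        float[] ys = new float[rowCount];
        float[] zs = new float[rowCount];
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int n = 0;
            while ((line = in.readLine()) != null && n < rowCount) {
                float[] p = parseRow(line);
                xs[n] = p[0]; ys[n] = p[1]; zs[n] = p[2];
                n++;
            }
        }
        return new float[][] { xs, ys, zs };
    }
}
```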
When I wrote the heatmapping tool, I was pretty fresh to Processing, and I can now see many ways to make this process more efficient with memory, but I’m not great at refactoring, and as it works, I’ve not been back to optimise it. The heatmapper application works by taking the resolution of the image you want to create (e.g. 2000x2000px) and subdividing it into a cell structure whose size is determined by the “cell resolution” variable. The cells are essentially buckets into which I put the rows of raw data. Each row of raw data (the row’s X and Y values) is processed to find out which bucket it falls in, and a cell’s value is incremented by one every time an element of the raw data lands in it.
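The bucketing step can be sketched in a few lines of Java. This is an assumption about the layout, not the original tool’s code: the image is divided into a grid of cells of `cellRes` pixels, each point is mapped to a cell index, and that cell’s count is incremented.

```java
// Minimal sketch of the cell/bucket structure described above.
// Names (cellRes, counts) are hypothetical.
public class HeatGrid {
    final int cols, rows;
    final int cellRes;    // cell size in pixels (the "cell resolution")
    final int[] counts;   // one bucket per cell, row-major

    HeatGrid(int imageW, int imageH, int cellRes) {
        this.cellRes = cellRes;
        this.cols = (imageW + cellRes - 1) / cellRes;  // ceiling division
        this.rows = (imageH + cellRes - 1) / cellRes;
        this.counts = new int[cols * rows];
    }

    // Increment the bucket this point lands in; skip out-of-range points.
    void add(float x, float y) {
        int cx = (int) (x / cellRes);
        int cy = (int) (y / cellRes);
        if (cx < 0 || cx >= cols || cy < 0 || cy >= rows) return;
        counts[cy * cols + cx]++;
    }
}
```

One pass over the raw dataset calling `add(x, y)` per row fills the grid, after which the raw points are no longer needed for 2D rendering.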
Once the whole raw dataset has been parsed and the cells populated with counts, we render the cells. This is a simple process of drawing a circle of the same diameter as the cell size, whose colour is determined by the number of raw data elements that landed in that cell. The palette is scaled so that the cell with the largest count is always pure white, so there is no clipping in the colour (although you can remove this scaling, meaning that any cell with more than 255 counts in it will be white; the trade-off is increased colour resolution for the less populated cells). I use a simple alpha blend to help smooth out the rendering. The results are pretty good.
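The two palette options described above come down to a one-line mapping each. This is my own reading of the scheme, not the tool’s actual code:

```java
// Sketch of the palette scaling: the busiest cell always maps to pure
// white (255), versus the unscaled variant that clips at 255.
public class Palette {
    // Scaled: busiest cell -> 255, everything else proportional.
    static int brightness(int count, int maxCount) {
        if (maxCount == 0) return 0;
        return (int) (255L * count / maxCount);
    }

    // Unscaled: any cell with more than 255 counts is white, but sparse
    // cells keep one brightness step per count.
    static int brightnessClipped(int count) {
        return Math.min(count, 255);
    }
}
```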
What was bugging me was that I was getting this huge set of raw data but only using two thirds of it: I never ventured into the third dimension by factoring the Z coordinate (the ‘height’ spatial component) into the process. At the beginning of this year, I decided to do something about this and started a project to rebuild the heatmapping application in 3D.
My first results were truly awful. I rendered 9,000 points in rows using the OpenGL renderer, with colour showing the relative height of the points. With only 9k rows, I couldn’t manage more than a couple of frames a second. The culprit was the map() function that I was calling to scale the colour palette for each point, every frame. Once I removed that, life got a lot better, but I still couldn’t render more than a few thousand points without the framerate falling off a cliff.
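The fix amounts to hoisting the colour computation out of the draw loop. A sketch of the idea in plain Java, with a stand-in for Processing’s map() (the names here are hypothetical): compute each point’s shade once at load time, so the per-frame cost is just an array read.

```java
// Precompute per-point colours once, instead of remapping height to
// colour for every point on every frame.
public class ColorCache {
    // Equivalent of Processing's map(): rescale v from [lo, hi] to [outLo, outHi].
    static float map(float v, float lo, float hi, float outLo, float outHi) {
        return outLo + (v - lo) * (outHi - outLo) / (hi - lo);
    }

    // One pass at load time: height -> brightness (0..255) per point.
    static int[] precompute(float[] heights, float minZ, float maxZ) {
        int[] shades = new int[heights.length];
        for (int i = 0; i < heights.length; i++) {
            shades[i] = (int) map(heights[i], minZ, maxZ, 0f, 255f);
        }
        return shades;  // the draw loop now just reads shades[i]
    }
}
```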
It became obvious that I would need to get more low level with OpenGL to squeeze out more performance. This is where I discovered the work of Andrés Colubri and his GLGraphics library for Processing. With some great support through the Processing forums, I was able to move a lot of the heavy lifting from the CPU to the graphics card, and ultimately create a vertex buffer object holding the points. I can now render 11 million points in realtime at 30fps on my desktop computer. This was several orders of magnitude more performance than I ever thought would be possible.
The Just Cause 2 dataset I was using was a selection of player deaths, specifically where the player had died from an impact event. This was great because players tend to spend a lot of time jumping off tall buildings or riding around in helicopters and planes, so impact in this context is generally impact with the geometry of the environment. When the data is rendered, you can see the underlying world almost as clearly as if we were rendering the 3D mesh itself.
Finally, arguably the most important aspect of any data visualisation is working out the best way to communicate it. The data looked great, but I couldn’t distribute it as source because of the size of the CSV files that accompany it. I needed some way of taking the viewer through the data and showing them points of interest. Up stepped another Processing hero, Jean Pierre Charalambos, and the ProScene library. With ProScene, and Jean Pierre’s help, I was able to direct the camera around to create a visual tour of the data, which I then rendered out as individual 1080p frames and later assembled into an animation.
It’s not over; in fact, this journey into 3D visualisation of big data is only just starting. The JC2 point cloud animation isn’t really a heatmap at all, more like a 3D scatter graph of points. Next I will be investigating how to build a 3D cellular structure, to create a true study of interactions in 3D space. I will most likely be using data from the upcoming Square Enix game Deus Ex: Human Revolution, as its level-based structure is more suitable for this project than the vast open world of Just Cause 2.