Heatmaps, Point Clouds and Big Data in Processing

**UPDATE** I’ve just posted a new video to YouTube, embedded at the bottom of this post, showing the point cloud visualisation running in realtime.

As you may know from this blog, I work for a company called Square Enix, and before that I worked for Eidos.  SE is a video games publisher, famous for Final Fantasy and, through the acquisition of UK publisher Eidos, Tomb Raider.  Among other duties, I manage the metrics system and create data and visualisations that the business uses to make better games.  All the data we collect is anonymous. For a while now I’ve been using this data to create heatmaps of player activity. You can see some of these on the Just Cause website.

I’ve been using Processing to create tools that render the heatmaps, but while the logical structure of the program is fairly simple, there are significant challenges in working with large datasets. The primary challenge is loading the data into memory. The data is all held in a SQL database, and while I could connect to the DB directly from Processing, the DB is optimised for data-in operations, not data-out, so you don’t want to be pulling the data out too often. Instead, I dump the raw spatial data (X, Y, Z coordinates) into a CSV file, one record per row. I usually create heatmaps from datasets in excess of 1 million rows, and most of them are between 5 and 20 million rows (I have one that is 22.3 million rows!). A CSV file containing 10 million rows of spatial data is about 364MiB in size (the 22.3m row CSV is 802MiB!). To create data structures in memory that can hold sets this large, I have to work in 64-bit mode to get past the memory limits of 32-bit Windows.
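To give a feel for the loading step, here is a minimal sketch in Python (not the Processing code the tool is actually written in). It assumes a comma-separated file with no header row and exactly X, Y, Z per line; the function name `load_points` is made up for illustration. It streams the file one row at a time into compact 32-bit float arrays, so each point costs about 12 bytes in memory, compared with roughly 38 bytes per row of CSV text (364MiB spread over 10 million rows).

```python
import csv
import io
from array import array


def load_points(csv_file):
    """Stream X,Y,Z rows from an open CSV file into three float arrays."""
    xs, ys, zs = array("f"), array("f"), array("f")
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        xs.append(float(row[0]))
        ys.append(float(row[1]))
        zs.append(float(row[2]))
    return xs, ys, zs


# Tiny in-memory stand-in for a multi-million-row CSV dump.
sample = io.StringIO("10.5,20.0,3.25\n-4.0,8.5,100.0\n")
xs, ys, zs = load_points(sample)
```

With a real dump you would pass `open("points.csv", newline="")` instead of the `StringIO` sample; the file name here is only a placeholder.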

When I wrote the heatmapping tool, I was pretty new to Processing, and I can now see many ways to make it more memory efficient, but I’m not great at refactoring and, as it works, I’ve not been back to optimise it. The heatmapper application works by looking at the resolution of the image you want to create (e.g. 2000x2000px) and then subdividing this into a cell structure whose size is determined by the “cell resolution” variable. The cells are essentially buckets, into which I put the rows of raw data. Each row of raw data (its X and Y values) is processed to find out which bucket it falls in, and that cell’s value is incremented by one every time an element of the raw data lands in it.
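The bucketing described above can be sketched roughly as follows. This is Python rather than Processing, and every name in it (`bin_points`, the world bounds, `cell_size`) is an assumption made for illustration, not taken from the original tool. It assumes a square image and known minimum and maximum world coordinates.

```python
def bin_points(points, min_x, max_x, min_y, max_y, image_size, cell_size):
    """Count how many (x, y) points land in each cell of a square grid."""
    cells_per_side = image_size // cell_size
    counts = [[0] * cells_per_side for _ in range(cells_per_side)]
    for x, y in points:
        # Scale world coordinates to a cell index along each axis.
        col = int((x - min_x) / (max_x - min_x) * cells_per_side)
        row = int((y - min_y) / (max_y - min_y) * cells_per_side)
        # Clamp so points on or beyond the bounds fall in the edge cells.
        col = min(max(col, 0), cells_per_side - 1)
        row = min(max(row, 0), cells_per_side - 1)
        counts[row][col] += 1
    return counts


# A 2000px image with 500px cells gives a 4x4 grid of buckets.
points = [(10, 10), (12, 20), (99, 99), (100, 100), (50, 10)]
counts = bin_points(points, 0, 100, 0, 100, image_size=2000, cell_size=500)
```

Only the counts are kept, so the memory needed for the grid depends on the image and cell size, not on the number of rows in the dataset.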

A heatmap of more than 22.3 million player 'Extraction' events from Just Cause 2

Once the whole raw dataset has been parsed and the cells populated with counts, we render the cells. This is a simple process of drawing a circle of the same diameter as the cell size, the colour of which is determined by the number of raw data elements that landed in that cell. The palette is scaled so that the cell with the largest count is always pure white, so there is no clipping in the colour. (You can remove this scaling, meaning that any cell with more than 255 counts in it will be white; the trade-off is increased colour resolution for the less populated cells.) I use a simple alpha blend to help smooth out the rendering. The results are pretty good.
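The two palette modes can be compared with a small sketch. This is Python, and the function name `cell_brightness` and its `scale_to_max` flag are invented for illustration; it assumes a single 0 to 255 brightness value per cell rather than a full colour palette.

```python
def cell_brightness(count, max_count, scale_to_max=True):
    """Turn a cell count into a 0-255 brightness value."""
    if scale_to_max:
        if max_count == 0:
            return 0
        # The busiest cell maps to 255 (pure white); nothing clips.
        return 255 * count // max_count
    # No scaling: one brightness step per count, clipped at white.
    return min(count, 255)


max_count = 1000
scaled_sparse = cell_brightness(10, max_count)                        # 2
clipped_sparse = cell_brightness(10, max_count, scale_to_max=False)   # 10
clipped_busy = cell_brightness(300, max_count, scale_to_max=False)    # 255
```

With scaling on, a cell holding 10 events out of a maximum of 1000 is almost black; with scaling off it gets its own distinct shade, at the price of every cell above 255 counts looking identical.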

What was bugging me was that I was getting this huge set of raw data, but only using two thirds of it. I never ventured into the third dimension by factoring the Z coordinate (the ‘height’ spatial component) into the process. At the beginning of this year, I decided to do something about this. I started a project to rebuild the heatmapping application in 3D.

My first results were truly awful. I rendered 9,000 points in rows using the OpenGL renderer, with colour to show the relative height of the points. With only 9k rows, I couldn’t manage more than a couple of frames a second. The culprit was the map() function that I was calling to scale the colour palette for every point, every frame. Once I removed that, life got a lot better, but I still couldn’t render more than a few thousand points without the framerate falling off a cliff.
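One common way to avoid that kind of per-frame work is to compute each point's colour once, at load time, and only look it up while drawing. The sketch below shows the idea in Python; it is not necessarily what was done in the original tool. `linear_map` mirrors the arithmetic of Processing's `map(value, start1, stop1, start2, stop2)`, while `precompute_shades` and `draw_frame` are hypothetical names.

```python
def linear_map(value, start1, stop1, start2, stop2):
    """Re-map a value from one range to another, like Processing's map()."""
    return start2 + (stop2 - start2) * (value - start1) / (stop1 - start1)


def precompute_shades(heights):
    """Scale every height to a 0-255 shade once, outside the draw loop."""
    lo, hi = min(heights), max(heights)
    if hi == lo:
        return [0] * len(heights)
    return [int(linear_map(h, lo, hi, 0, 255)) for h in heights]


heights = [0.0, 50.0, 100.0]
shades = precompute_shades(heights)  # done once, when the data is loaded


def draw_frame():
    """Per-frame work is now just a lookup for each point."""
    return list(zip(heights, shades))
```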

Early rendering of a point cloud from the JC2 Data

It became obvious that I would need to start getting a bit more low level with OpenGL to get more performance. This is where I discovered the work of Andrés Colubri and his GLGraphics library for Processing. With some great support through the Processing Forums, I was able to move a lot of the heavy lifting from the CPU to the GFX card, and ultimately create a vertex buffer object holding the points. I can render 11 million points in realtime at 30fps, on my desktop computer. This was several orders of magnitude greater performance than I ever thought would be possible.
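A vertex buffer is essentially one flat block of numbers that is handed to the graphics card once, rather than point by point every frame. The Python sketch below only shows how such a block might be packed on the CPU side; it does not use the GLGraphics API, and `pack_vertices` is a made-up name. It assumes positions only, stored as 32-bit floats.

```python
from array import array


def pack_vertices(points):
    """Pack (x, y, z) tuples into one flat float buffer: x0,y0,z0,x1,y1,z1,..."""
    buf = array("f")
    for x, y, z in points:
        buf.extend((x, y, z))
    return buf


points = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
buf = pack_vertices(points)
size_bytes = len(buf) * buf.itemsize  # 3 floats per point, 4 bytes per float
```

At 12 bytes per point, 11 million positions come to about 132 million bytes (roughly 126MiB), before any per-point colour data is added.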

The Just Cause 2 dataset that I was using was a selection of player deaths specifically where the player had died from an impact event. This was great because players tend to spend a lot of time jumping off tall buildings or riding around in helicopters and planes, so impact in this context is generally impact with the geometry of the environment. When the data is rendered, you can see the underlying world, almost as clearly as if we were rendering the 3D mesh itself.

Finally, arguably the most important aspect of any data visualisation is working out the best way to communicate it. The data looked great, but I couldn’t distribute it as source because of the size of the CSV files that accompany it. I needed some way of taking the viewer through the data and showing them points of interest. Up stepped another Processing hero, Jean Pierre Charalambos, and the ProScene library. With ProScene and Jean Pierre’s help, I was able to direct the camera around to create a visual tour of the data, which I then rendered out as individual 1080p frames and assembled later into an animation.
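The general idea of a camera tour rendered as numbered frames can be sketched as below. This is a plain Python stand-in that uses straight-line interpolation between hand-picked camera positions; it does not reproduce ProScene's own interpolation or API, and the names `camera_path` and the `frame-00000.png` naming pattern are assumptions for illustration.

```python
def camera_path(keyframes, frames_per_segment):
    """Linearly interpolate camera positions between consecutive keyframes."""
    path = []
    for (x0, y0, z0), (x1, y1, z1) in zip(keyframes, keyframes[1:]):
        for i in range(frames_per_segment):
            t = i / frames_per_segment
            path.append((x0 + (x1 - x0) * t,
                         y0 + (y1 - y0) * t,
                         z0 + (z1 - z0) * t))
    path.append(keyframes[-1])  # finish exactly on the last keyframe
    return path


keyframes = [(0, 0, 0), (10, 0, 0), (10, 20, 0)]
path = camera_path(keyframes, frames_per_segment=2)

# One image file per camera position, to be assembled into a video later.
frame_names = [f"frame-{i:05d}.png" for i in range(len(path))]
```

Rendering offline like this means each frame can take as long as it needs, since the final animation's frame rate is set when the images are assembled.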

It’s not over; in fact, this journey into 3D visualisation of big data is only just starting. The JC2 Point Cloud animation isn’t really a heatmap at all, more like a 3D scatter graph of points. I will be investigating how to build a 3D cellular structure next, to create a true study of interactions in the 3D space. I will most likely be using data from the upcoming Square Enix game, Deus Ex: Human Revolution, as its level-based structure is more suitable for this project than the vast open world of Just Cause 2.
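One possible shape for such a 3D cellular structure is the 2D bucket grid extended with a third index. The Python sketch below is speculative, since this part of the project had not been built at the time of writing: the name `bin_points_3d` and the choice of a sparse dictionary keyed by cell index are assumptions, picked because most cells in a 3D volume are likely to be empty.

```python
from collections import Counter


def bin_points_3d(points, cell_size):
    """Count points per cubic cell, storing only the cells that are occupied."""
    counts = Counter()
    for x, y, z in points:
        # Floor division gives the cell index on each axis, including negatives.
        key = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        counts[key] += 1
    return counts


points = [(1, 2, 3), (4, 4, 4), (12, 0, -1), (11, 3, -4)]
cells = bin_points_3d(points, cell_size=10)
```

Each occupied cell can then be drawn as a cube or point whose colour depends on its count, in the same way as the 2D heatmap cells.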

 

[Embedded YouTube video: the point cloud visualisation running in realtime]

 

23 Responses to “Heatmaps, Point Clouds and Big Data in Processing”

  1. [...] is a visualization of multiplayer data from the open world action-adventure video game Just Cause 2. Each point represents a player death event during the game, and rendering all 10 million+ points [...]

  2. Simon says:

    Awesome job Jim – this looks simply beautiful!

  3. Dan Brickley says:

    Very nice! Any sense for how much can be done in this direction in modern browsers with processing.js and webgl?

  4. [...] is still in-development and he hopes to transition to the new Deus Ex game for his next effort.via Heatmaps, Point Clouds and Big Data in Processing – :: JimBlackhurst.com ::. This story written by Randall Hand Randall Hand is a visualization scientist working for a federal [...]

  5. [...] sight. Jim Blackhurst reports on his blog about working with SQL-based datasets in combination with OpenGL rendering of large [...]

  6. Max says:

    Can I ask what is the music in that video? Really beautiful.

  7. paul says:

    Very happy to know that I participated (in many deaths ^^) in this beautiful work you’ve done

    congrats

  8. Thad Guidry says:

    In our next presentation, Jim is going to show us Pac-man eating all 11 million dots. Great Job Jim !

  9. [...] From the developer site of Jim Black, who created this video. via flowingdata [...]

  10. [...] of this map contains a bridge and the flattening distorts the values. Jim Blackhurst recently posted an article about his experiences rendering this type of data as 3D point [...]

  11. [...] this data and overlay a kill map while you choose which building to enter or which hill to climb. [Jim Blackhurst via Fast Company Tagged:gamesgamingjust cause 2mapssquare enixvideo [...]

  12. [...] this data and overlay a kill map while you choose which building to enter or which hill to climb. [Jim Blackhurst via Fast [...]

  13. [...] Heatmaps, Point Clouds and Big Data in Processing, Fast Company: Infographic Of The Day: Using 11.3M Player Deaths To Map A Videogame’s World (via Gizmodo) [...]

  14. [...] this data and overlay a kill map while you choose which building to enter or which hill to climb. [Jim Blackhurst via Fast CompanyArticle source: [...]

  15. Steve says:

    Very nice visualization. Have you considered simply using logarithmic scaling of the colors to avoid the clipping that you mentioned? (“The palette is scaled so that the cell with the largest value of counts is always pure white, so there is no clipping in the colour…”)

  16. [...] and music. He is clearly a data geek and loves it in its raw, visualized form. There is really neat post on his blog about the creation of this video that I recommend if you want to know more. Posted in Games | [...]

  17. [...] this data and overlay a kill map while you choose which building to enter or which hill to climb. [Jim Blackhurst via Fast Company Read full post from Tags : 3d Plane, Anonymous, Blackhurst, Brings, Create [...]

  18. Aidan says:

    Hi Jim,
    Love this video. Well done. Quite a few people on YouTube (myself included) also love the music used in this video, and I’d be eternally grateful if you could let us know what it is, as I’d like to hear it in full and more from the same artist.

  19. Jim says:

    hi Aidan!
    Thanks for the comment. Yes, I’ve seen a lot of comments about the music too, although the answer is a bit less interesting than I’m sure you hoped for. The music is a licence-free track that comes with the pro edition of Sony Vegas, which I used for the editing. If you are interested, I originally cut the video to ‘Heartbeats’ by José González, but wasn’t allowed to publicly release it for copyright reasons. I still think it fits Heartbeats better.
    Jim

  20. gregg says:

    11 million data points is actually fairly small. for instance a 3D CT image is typically 20-100 million gray samples. we can render that in volume renderers at 30 FPS with a few pages of openGL code. it’s actually a homework assignment, and students typically complete it in a few hours. and when real-time texture mapping reaches webGL, you’ll be able to render in browsers.

    you also might want to look into compressing your data. after binning your data into discrete pixels, flatten the 3D world into a single 1D array, then just list the positions of the non-zero values each as a single index into this 1D array, so 8 bytes per sample is only 80M bytes. encode the difference between neighboring pixels (difference of indices) as a distance, gzip, and you’ll get it down to 5MB, depending on the spatial distribution of the points. simple enough. oct-trees and other more complex encodings can reduce it more, but there’s no need for that once you’ve hit 5MB.

    good luck. keep us updated!
