How Python is Used at CloudMade
This is a repost from my personal blog.
I’m starting a series explaining how and why CloudMade uses Python. This first post explains why we ditched OSM’s stack in favour of an in-house solution.
Intro
It’s been almost 2 years since CloudMade ditched mod_tile and renderd as its main rendering solution in favour of an in-house one. As the principal designer of that alternative, I must say that this decision led to a faster development pace. This article will try to cover the general architectural approach, the reasons behind the decisions we made and a short comparison to other rendering alternatives.
Before The Switch
As some of you might know, CloudMade has its roots in OpenStreetMap, so it was quite natural to adopt OSM’s software stack to have something to start with. But as CloudMade grew, the needs and requirements changed rapidly, and supporting and developing mod_tile became more of a burden. So we decided to switch to a higher-level language as our main one. The language of choice was Python, due to its generous set of existing spatial libraries (e.g. Shapely, GeoAlchemy, Mapnik bindings), ease of deployment and simpler support for cross-platform development. And, well, I knew it better than Scala, Ruby or Perl at that moment. Here is a list of the tasks we had with mod_tile and renderd that we found easier to implement in Python:
- Variable priorities
- mod_tile has the notion of “dirty” and “general” requests, with dirty requests having lower priority and thus being rendered only when there’s little to no on-demand rendering required. While this seems enough for most applications, it does have its warts, as it makes the priority system less general overall. What this means in practice is that every time we need to add some special priority (e.g. when we need to health-check the system by forcing rendering) we end up adding quite a lot of code, rather than changing the “priority” property of the request. It might seem silly, but off the top of my head I can remember that we have at least 6 different priorities now.
- Replicating cache
- When it comes to scaling rendering and serving of tiles, the simplest solution that comes to mind is adding more servers. It’s as simple as clicking several links in a web interface, or even using an automated process and the Amazon Web Services API. But when you add a new server with the rendering stack installed, you lose all the cache that has been built up on the other servers, and furthermore the instances don’t share cache, which makes the cache system less effective. There are several solutions to this issue, each of them making use of networking or database libraries, and programming against those is a tedious task in C (and C++).
- Being tied to Apache
- mod_tile is an Apache module, which makes it less interesting if you look at it from a “commodity server” perspective. Having to program against a monster that is Apache, using its APR library, is one giant leap into full-blown programmer depression. The autogenerated documentation makes matters even worse. And two last things about Apache are its comparatively slow serving of static files and complicated configuration scheme. One might say that Apache might win in other parts of a comparison, but the things mentioned here were essential to our rendering services.
These were the main reasons to switch, as mod_tile and renderd didn’t seem like the right thing for CloudMade. Of course, there were a lot of other, more and less subjective reasons, but even the ones mentioned above were enough to seriously consider a switch.
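To illustrate the “variable priorities” point above, here is a minimal sketch of a render queue in which priority is just a property of the request. This is not CloudMade’s actual code: the class names, the priority constants and the tile tuple layout are all assumptions made up for this example.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority levels (lower number is rendered first).
# Adding a new level means adding a constant, not a new code path.
HEALTH_CHECK = 0
ON_DEMAND = 1
DIRTY = 2


@dataclass(order=True)
class RenderRequest:
    priority: int
    seq: int  # insertion counter, keeps FIFO order within one priority
    tile: tuple = field(compare=False)  # assumed layout: (zoom, x, y)


class RenderQueue:
    """A priority queue of render requests backed by a binary heap."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, tile, priority):
        request = RenderRequest(priority, next(self._counter), tile)
        heapq.heappush(self._heap, request)

    def pop(self):
        """Return the tile of the most urgent pending request."""
        return heapq.heappop(self._heap).tile

    def __len__(self):
        return len(self._heap)
```

With a structure like this, a forced health-check render is simply `queue.push(tile, HEALTH_CHECK)`, and it will be picked up ahead of on-demand and dirty requests.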
The Switch
With all the warts of the existing system and the requirements for the future in mind, we decided to move on with a new approach. There were several things to consider in our system:
- Decoupling
- This was our main goal: a thoroughly decoupled system, where every part does one thing and does it well. This makes scaling much easier, but also incurs an additional penalty in the amount of code, because of the need to write communication utilities. This also makes the system as a whole much more stable, as other parts of the system can step in as replacements in case of failure. Of course, the price is network overhead and having to supervise the system parts.
- Handling styles
- One of the main CloudMade web services is the style editor, which lets users edit map styles in a WYSIWYG fashion. Handling thousands of Mapnik styles wasn’t something any existing system was prepared for, so a unique way of doing exactly this had to be devised. Of course, this meant that style state in every part of the system had to be consistent at any given moment, making this even harder to accomplish.
- Cache expiry
- To minimize load on the system, as much cache as possible has to be available. But with rapidly changing OpenStreetMap data, keeping all tiles cached for a month wouldn’t work, and at the same time rendering all images on the fly would be an enormously heavy task. Whatever cache update approach is taken, unless the hardware can render maps on the fly, someone will be unhappy about the cache expiry scheme.
- Health monitoring and high availability
- To meet the requirement of having usable web services, one of the most important things to consider is keeping service uptime as high as possible. Without health monitoring that knows the state of every part of the system, that objective is almost unreachable. Of course, the ideal cannot be achieved, but a setup that covers at least 80% of the nodes would satisfy our needs.
The system that’s currently in use at CloudMade has been developed with exactly these goals in mind, with minor additions and subtractions along the way. To summarize, the goal was a system where every part is as independent as possible from every other, while serving the general goal of a fast and easily deployed rendering stack.
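The cache expiry trade-off from the list above can be sketched as a three-way decision per tile. This is only an illustration of one common approach (serve stale tiles while re-rendering them in the background), not a description of the scheme CloudMade ended up with; `CachedTile`, `classify` and the `max_age` parameter are hypothetical names.

```python
import time
from dataclasses import dataclass


@dataclass
class CachedTile:
    data: bytes
    rendered_at: float  # Unix timestamp of the last render


def classify(tile, max_age, now=None):
    """Classify a cache lookup as 'miss', 'fresh' or 'stale'.

    tile    -- a CachedTile, or None if nothing is cached
    max_age -- how many seconds a rendered tile is considered up to date
    now     -- current time, injectable to make the function testable
    """
    if now is None:
        now = time.time()
    if tile is None:
        return "miss"
    if now - tile.rendered_at <= max_age:
        return "fresh"
    return "stale"
```

A server built around this could render immediately on a miss, serve the cached image on fresh, and on stale serve the old image while queueing a low-priority re-render. Picking `max_age` is exactly the part that will leave someone unhappy.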
To Be Continued
I’ll continue the talk about moving from mod_tile to our in-house system in follow-ups, where I’ll try to get into technical details and explain our shortcomings and the issues that arose during development.
Stay tuned.
January 17th, 2011 - Posted by Andrii Mishkovskyi in Uncategorized, developers, openstreetmap | 11 Comments

on January 17th, 2011 at 12:47 pm
Ruby also provides spatial libraries. Bindings for GeOS are available. There’s also a Ruby library for accessing Mapnik. Granted, there’s no Geoalchemy.
Supporting Ruby for cross-platform deployment is no different from supporting Python.
Looks like the real reason “why Cloudmade uses Python” is the fact that you know it better than Ruby. Which is a very valid reason. If you had known Ruby better, maybe you wouldn’t have used Python. I’m pretty sure Java also provides decent spatial libraries. The title is a bit misleading. It should have been “why we ditched mod_tile”, not “why we use Python”.
on January 17th, 2011 at 1:17 pm
Steve,
I don’t think starting a flame war in comments is acceptable, but nevertheless: there were no Ruby bindings to Mapnik in late 2008 and early 2009. Yes, there is something that I would call “attempted bindings”, which can be found here: https://github.com/aub/ruby-mapnik but that’s all that I know of.
Now, Java does have a *great* spatial library, called JTS, but I don’t regard Java as a huge improvement over C++ or C in terms of development speed. We could of course have jumped to Scala or Clojure, both of which combine simplicity of writing concurrent programs with easy access to all JVM libraries, but that would have been highly risky at the time, as the future of both languages was uncertain (as in, no big companies were using them).
Of course, nothing stops us from rewriting some parts of the system in Ruby, Java, Haskell or whatever right now, because the system itself is composed of independent nodes.
Furthermore, the title of the article didn’t come up when I posted it (because I forgot to put it there, duh) but now it’s correctly called “How Python is Used at CloudMade”. I hope this explains the intention behind the article a little more clearly.
on January 17th, 2011 at 1:40 pm
“And two last things about Apache are its comparatively slow serving of static files…”. I’d be curious if you have any numbers about this and how it affects tile serving. From my benchmarks when adding locking for the throttling code in mod_tile, I could easily achieve in excess of 1Gbit/s in tile serving and several tens of thousands of tiles per second (can’t remember the exact numbers) on a small laptop. (Always requesting the same tile to benchmark mod_tile rather than disk access). At that point I suspect the computation of the md5 hash of tiles for the etag was what was using most CPU, which could be stored once on render in the metatile. But I concluded that disk I/O in serving random tiles would much more likely be the bottleneck than software limitations of apache or mod_tile.
Are you planning on releasing the code to your rendering stack? It would be interesting to see how it compares to current mod_tile with either renderd or tirex.
on January 17th, 2011 at 2:08 pm
>> And, well, I knew it better than Scala, Ruby or Perl at that moment.
LOLWUT?
on January 17th, 2011 at 2:25 pm
i have to say sorry: i wrongly read your sentence.
i read it as
>> And, well, I knew it IS better than Scala, Ruby or Perl at that moment.
on January 17th, 2011 at 4:05 pm
Nice to read some details on the professional use of Python. Just learned a bit of Python a few months ago.
on January 17th, 2011 at 8:52 pm
There are tons of websites that simply produce garbage information, but yours is different. I am so happy to read your post!
on January 18th, 2011 at 8:11 am
@apmon
As it was quite a long time ago, I don’t think I’ll be able to get you concrete numbers any time soon. I must say that we didn’t use ab for benchmarking, as it doesn’t really show much, but rather JMeter, tsung and http_perf with the list of most popular tiles. I’ve also written a small tool that “replays” logs at a given speed. I’m not going to speculate on numbers, but I remember that when run through these benchmarks Apache was around 5 times slower than nginx or lighty when serving tiles from cache under normal circumstances (which is still very good, as nginx and lighty served cache at around 1 ms on average), but degraded under really heavy load. I must say that improving serving from cache was a nice side effect rather than our main purpose, but the ability to easily switch between different web servers was indeed one of our goals.
If you think that benchmarking this once again would make sense, I’ll try to allocate some time for that.
As for releasing the code: I’m all for it, but this is something I don’t have any control over whatsoever. I’ll try to raise this question soon, but the last time this idea was rejected.
on January 18th, 2011 at 5:07 pm
Very interesting. A 5 times difference seems huge! And rather surprising at least to me. I agree that ab (which is what I used for the above numbers) is not particularly representative for real performance on a tile server, but with respect to benchmarking the tile server software performance it actually seems more accurate. From the web server point of view it shouldn’t matter if it always serves the same tile or a wide distribution as it does the same processing, what matters there should be the OS and below. So for benchmarking the software, rather than the full server, ab should give you a bigger difference than a “typical pattern”. Do you use a different file(system) layout than the hashed metatiles mod_tile uses? Another question is if the 1ms you quote for lighty is latency or average throughput? (The OSM tile server manages more than 1600 tiles/s with real traffic, although the latency is presumably way higher than 0.6 ms).
Imho, it would be interesting to do a proper benchmarking comparison between the various tile serving options that exist by now, e.g. your implementation, mod_tile, static serving through apache / lighty, tilecache and squid. Until now, I would have guessed the difference is likely to be irrelevant for practical purposes, but perhaps not if you see 5 times differences.
on January 19th, 2011 at 12:08 am
Nice article. Would love to hear more about the technical aspects. I more or less hate mod_tile & tirex and currently I’m writing a solution in nodejs using Tokyo Cabinet and mapnik. Perhaps you can give me some hints about your distributed cache system? The protocol and render queue would also be of great interest.
Is the benchmark tool for replaying server logs open source? I’m currently planning to write this myself…
on January 20th, 2011 at 12:47 pm
I’ll try to be more technical in future articles, as this one was basically an intro. I’m going to be quite busy in the next month (preparing a tutorial for PyCon), but it’s a safe bet that the next article will appear in the first part of February at the latest. It seems like this is an interesting theme.
The benchmark is not open source; what’s more, it has never been checked into any repository. It seemed so simple to write that I didn’t think releasing it would be beneficial. It took me around two days to get it completely functional for our needs and there was not a lot of code.
BTW, I’ve been looking at your implementation of XAPI in node.js and I must say it looks cool. Keep up the good work!