Saturday, 18 February 2012

Some thoughts on Clojure performance

Edit: This post recently re-surfaced on Hacker News and caused a bit of a stir, mainly because of a slightly sensational/misleading title (it was "Why is Clojure so slow?"). I wrote this before Rich Hickey's Clojure/Conj 2011 keynote was published, in which he addresses most of my concerns (and outlines possible solutions).

Clojure is great in many ways, but one thing it can't be accused of is being particularly fast. What I mean by fast here is the speed at which Clojure programs execute. This is a well-known issue in the Clojure community and has been discussed on the mailing list and Stack Overflow. I am not intending to troll or fuel any fires here, just jotting down some of my findings from looking into the matter.

Measuring the speed of code is quite tricky, because there are many parts to a program. This is especially true for Clojure programs, where the JVM, Clojure bootstrapping and dynamic compilation are all part of the puzzle. But the fact remains: running tools like Leiningen (lein) or "jack-in" is annoyingly slow, mainly because Clojure programs take a long time to "spin up". Furthermore, even when Clojure is up and running, it's not particularly fast compared to its JVM cousins, see The Computer Language Benchmarks Game.

Getting going...
The slow startup time is probably what you will notice first when starting to code in Clojure. Anything from running your first "hello world" to starting the REPL is painful. Here are some average numbers for a simple "Hello world" program, taken on a 2.5GHz Core2Duo MacBook Pro:

Language              Total running time OSX  Relative  Total running time Ubuntu  Relative
C (printf)            0.011s                  1         0.001s                     1
C# (mono)             0.085s                  7.7       0.027s                     27
F# (mono)             0.105s                  9.5       0.036s                     36
Java                  0.350s                  32        0.038s                     38
Scala                 0.710s                  64.5      0.228s                     228
Groovy                0.740s                  67.4      0.335s                     335
JRuby (non-compiled)  0.820s                  74.5
Clojure (uberjar)     1.250s                  113.5     0.652s                     652

What we can see is that, on OSX, Java itself accounts for 0.35s of the startup time, but unfortunately Clojure adds nearly another second(!) on top of that. This 1.25s pause before main gets called is why Clojure is unsuitable for "terminal scripts". The running time of any script (like lein or starting the REPL) will be totally dominated by the startup time. Most Clojure developers will not notice this too much, since they spend almost all their time in the REPL itself, but users of those Clojure programs will!
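
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing harness in Python. It is not the script I used for the table; the function names (average_runtime, relative_to) are made up for illustration, and the command being timed is a placeholder that you would swap for your own "hello world" launcher.

```python
import subprocess
import sys
import time


def average_runtime(cmd, runs=5):
    """Return the mean wall-clock seconds taken to run cmd to completion."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output so terminal I/O does not skew the timing.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        total += time.perf_counter() - start
    return total / runs


def relative_to(baseline, times):
    """Express each time as a multiple of the baseline entry (one decimal)."""
    base = times[baseline]
    return {name: round(t / base, 1) for name, t in times.items()}


if __name__ == "__main__":
    # Placeholder command: times the Python interpreter itself.
    # Swap in something like ["java", "-jar", "hello.jar"] (hypothetical jar).
    print(average_runtime([sys.executable, "-c", "pass"], runs=3))
    # OSX figures copied from the table above.
    print(relative_to("C", {"C": 0.011, "C#": 0.085, "Java": 0.350}))
```

With the OSX figures from the table, relative_to gives 7.7 for C# and 31.8 for Java (shown as 32 in the table). Wall-clock timing of a whole process is deliberately crude, but it is exactly what a user waiting at the terminal experiences.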

The CLR (mono) is about 4x faster getting going than the JVM on OSX (the gap is much smaller on Ubuntu). This is a big plus for the "#-py" languages. Also note that the difference between F# and C# is much smaller than the difference between Clojure and Java, so there isn't any excuse/precedent for more functional/declarative languages to start slower than their "bare" counterparts.

So why don't we use Clojure/CLR on mono for stuff like lein then? Well, as it currently stands, the Clojure startup times are even worse on the CLR. The same hello world example as above clocks in at 1.8s (using the debug Clojure/CLR assembly)! The gap between ClojureCLR and C# is far wider than the gap between Clojure and Java, so there is some work left to be done in the ClojureCLR project...

What's taking so long?
Daniel Solano gave a talk at conj/2011 about Clojure and Android (slides). The performance part of that talk gives some valuable insights into Clojure internals and what happens when it starts up. My summary is that it spends 95% of the startup time loading the clojure.core namespace (the clojure.lang.RT class in particular) and filling out all the metadata/docstrings etc. for the methods. This process stresses the GC quite a bit: some 130k objects are allocated and 90k freed during multiple invocations of the GC (3-6 times). Building up the metadata is one big source of this massive object churn.

Edit: By using the "-verbose:gc" flag when running the Clojure test above, I notice a single collection, taking some 0.018s. This differs from Daniel's findings, which is hardly surprising since he measured performance on the Dalvik VM.

Daniel mentions a few ideas to improve the situation, and some of those ideas sound pretty good to me:
  • Having a separate jar for development (when you want all the docstrings etc. in the REPL) and a slim "runtime" jar with all that stuff removed (not to be confused with the existing clojure-slim jar file)
  • Serialising the clojure.core initialisation so it can be loaded into memory from disk when starting up

ClojureScript to the rescue
ClojureScript and Google's blisteringly fast JavaScript engine V8 is another way to go. When using the ClojureScript compiler on a hello world example with advanced optimisation, we end up with some 100kb of JavaScript. The V8 engine runs this in 0.140s, which is 2.5x faster than the "bare" Java version and 9x faster than the Clojure/JVM version! The Google Closure compiler certainly helps here by removing lots of unused code, and the resulting JavaScript file is indeed free from all docstrings etc.

Also, Rich Hickey did mention ClojureScript as "Clojure's script story" when he unveiled ClojureScript last year - one of the main benefits is the much improved startup time.

Up and running...
How fast is Clojure at running your code once it has finally got going? A look around The Computer Language Benchmarks Game gives you a good idea. Clojure is on average 4x slower than Java and 2x slower than Scala. There are a couple of reasons, and the biggest factor is Clojure's immutable data structures. The fact is that immutable data structures will always be slower than their mutable counterparts. The promise of Clojure's persistent data structures is that they have the same time complexity as the mutable equivalents, but they are not as fast, because constant factors play a big role in running times. Most of the benchmarks above run for 50-200 seconds, so Clojure's startup time will be a factor in the results as well. Finally, dynamic languages are generally slower than static ones, because of the extra boxing overheads etc.
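
To make the "same complexity, bigger constant" point concrete, here is a minimal sketch of a persistent vector in Python. It is not Clojure's implementation: the class and method names are made up for illustration, it uses 4-way nodes (Clojure's vectors use 32-way nodes plus further optimisations) and it only supports lookup and update on a fixed-size vector. An update copies just the nodes on the path from the root to one leaf and shares everything else with the old version.

```python
BITS = 2            # illustrative; Clojure's vectors use 5 bits (32-way nodes)
WIDTH = 1 << BITS   # children per node
MASK = WIDTH - 1


class PersistentVector:
    """Fixed-size persistent vector backed by a bit-partitioned trie."""

    def __init__(self, root, shift, count):
        self._root = root
        self._shift = shift   # bit offset used to index the root level
        self._count = count

    @classmethod
    def from_list(cls, items):
        items = list(items)
        # Leaves hold up to WIDTH items each.
        level = [items[i:i + WIDTH] for i in range(0, len(items), WIDTH)] or [[]]
        shift = 0
        # Group nodes under parents until a single root remains.
        while len(level) > 1:
            level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
            shift += BITS
        return cls(level[0], shift, len(items))

    def get(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        node, shift = self._root, self._shift
        while shift > 0:
            node = node[(i >> shift) & MASK]
            shift -= BITS
        return node[i & MASK]

    def assoc(self, i, value):
        """Return a new vector with index i set to value; self is unchanged."""
        if not 0 <= i < self._count:
            raise IndexError(i)

        def copy_path(node, shift):
            new_node = list(node)            # copy only this node
            slot = (i >> shift) & MASK
            if shift == 0:
                new_node[slot] = value       # leaf: store the new item
            else:
                new_node[slot] = copy_path(node[slot], shift - BITS)
            return new_node

        return PersistentVector(copy_path(self._root, self._shift),
                                self._shift, self._count)


if __name__ == "__main__":
    v1 = PersistentVector.from_list(range(20))
    v2 = v1.assoc(17, "x")
    print(v1.get(17), v2.get(17))            # old version is untouched
    print(v1._root[0] is v2._root[0])        # untouched subtree is shared
```

A mutable array does the same update with a single store. The persistent version allocates and copies one small node per tree level, then leaves the old nodes for the GC. The number of levels grows only logarithmically, but that per-update allocation is the constant factor that shows up in the benchmarks.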

Conclusion
Where does all this leave us? Clojure is a beautiful, powerful and very useful language, but (in its current incarnation) not great for small script-y programs. The problems with startup time can be solved, either by changes to Clojure itself or by exploring the ClojureScript route. I personally like the JavaScript track; JavaScript has lower processor and space overhead than the JVM, so by making ClojureScript-scripting better, Clojure can be more widely used, reaching embedded systems etc.

However, in order to make ClojureScript a viable option for non-browser programs, there is certainly more work to be done. Some Node.js integration exists today, but a ":bare" compilation target would be a good addition. Then comes the small task of building up good out-of-browser APIs.

26 comments:

  1. The benchmark is more a boot-time benchmark than a language-speed benchmark... obviously Java and Java's descendants are not well suited for scripts... I would like to see a comparison with real code solving a real problem. Still, I think your article is correct: when people sold me on Clojure they said it was really fast, many times as fast as a statically typed language, but really, compared with Python or even Node.js, it's not so fast. Even so... Clojure is my favorite compared to Python or JavaScript.

  2. There are two aspects of the startup time for Clojure on the CLR: (1) JITting the code, and (2) executing code to set up the environment (clojure.core &co).

    Based on my experiments (see
    http://clojureclr.blogspot.com/2011/12/using-ngen-to-improve-clojureclr.html), (1) is much more significant than (2). This is relevant to ClojureCLR vs Clojure/JVM and in comparison to others. Using ngen, ClojureCLR can have startup times 1/4 that of Clojure/JVM. A simple println completes in 0.23 seconds on .Net (running on Dell Optiplex 960 with an Intel® Core™2 Quad Processor Q9550 2.83 GHz).

  3. @coco "real code" comparisons - take a look at the Debian language shootout I linked to in the article. Some of these long-running tests eliminate the startup time factor. Clojure performs pretty well, especially if you consider that it's using immutable data structures...

  4. @dmiller Thanks for the info, I wonder if you can get the same speed improvements using mono? If yes, I think you can have a very interesting conversation with the Leiningen boys :P

    Replies
    1. That's right, the CLR and mono have faster boot times... now... Clojure on Android sucks because the startup is really slow... and mono runs pretty well on Android and iOS. I don't know much about that... but maybe using Clojure on mono on Android would be a solution for the slow boot time... http://www.koushikdutta.com/2009/01/microsoft-dlr-and-mono-bring-python-and.html

    2. In Mono, it is called AOT, Ahead Of Time:
      http://www.mono-project.com/AOT

  5. I wonder why there was a difference between F# and C#?

    I would have expected the IL to be almost the same...

  6. @dthomas It was the smallest possible hello world for all languages, not sure why F# is slower. Might be a mono thing?

  7. Both the way Clojure calls functions through vars instead of using method invocation directly and the large size of the bytecode generated by the Clojure compiler (which seems to include many redundant instructions) may prevent method inlining by the JIT, which in turn reduces the JIT's potential "horizon" of optimization. Yet I don't know how much this actually contributes to the bad runtime performance.

    Also what variant of the VM was used to measure the startup time? Both OpenJDK and Oracle JDK include two variants of the HotSpot VM: client and server. The client VM starts up faster but does less optimization.

  8. @Mikhail I used the same JVM settings for all my tests, so I guess the relative numbers are of more value than the actual running times. Also, launching these tests with different parameters (like -client, -server etc) doesn't have any measurable effect on the startup times.

  9. > Debian's language shootout.

    It isn't "Debian's" anything - alioth.debian.org is a project hosting service like sourceforge or savannah.

    It isn't called "language shootout". The Virginia Tech shooting in April 2007 once again pushed gun violence into the media headlines. There was no wish to be associated with or to trivialise the slaughter behind the phrase shootout so the project was renamed back on 20th April 2007.

  10. @igouy Thanks for pointing this out, I've updated the blog post accordingly.

  11. I'd be interested to see what the difference is if the docstrings etc were removed from the core.

  12. I just saw Rich Hickey's keynote presentation from Conj2011; http://blip.tv/clojure/rich-hickey-keynote-5970064

    He pretty much starts off by talking about making Clojure "leaner", faster at starting up etc.

    He mentions stuff like a "production" jar with less metadata, hoisted evaluator and even some kind of tree shaking ala ProGuard.

  13. Regarding diff between C#/F#, I think it is such a small difference that we can probably just attribute it to having the additional burden of loading FSharp.Core.

  14. I bet F# is slower than C# because you're using printfn in F# which requires FSharp.Core.dll to be pulled in and compiled. Try using System.Console.WriteLine from F# instead?

  15. The missing aspect to these tests is threading. On modern machines the ability to run in parallel efficiently is becoming ever more important. Clojure uses immutable data and thus makes this much easier. When one ports Java or (shudder) C over to using immutable data, these languages slow down as well, and the gap in performance shrinks whilst the time-to-write gap opens up. I am not a big Clojure fan-boy, but I do think it is worth comparing the strengths of languages as well as the weaknesses. The cost of malloc in C programs generally makes using immutable data impractical and stupid slow, so a proper immutable structure test would be very interesting to see.

    Replies
    1. 5 out of 10 Clojure programs for those tasks make significant use of 4 cores.

      -- So what do you mean "The missing aspect to these tests is threading"?

      The thread-ring test is about task switching.

      The chameneos-redux test is about peer-to-peer symmetrical rendez-vous.

      -- So what do you mean "The missing aspect to these tests is threading"?

  16. It might be interesting to do some other interpreted languages, just out of curiosity... NodeJS, Python, IronPython (.Net/Mono/DLR), IronRuby, and Rhino for comparison sake... though this is an interesting take, as startup time beyond a second is unsuitable for a command-line/console script, it isn't necessarily bad for a service that keeps running.

    Replies
    1. See what The Wayback Machine archived in 2008

      http://web.archive.org/web/20080912193818/http://shootout.alioth.debian.org/gp4/benchmark.php?test=hello&lang=all

  17. I'm much more interested in seeing how Clojure code competes on a "warm" server; one that has loaded all code, and given Hotspot a chance to optimize. And, as usual, this requires a more complex benchmark, one which leaves implementation much more open, so that Clojure and the other functional languages can work in idiomatic ways.

    That is, if you restrict Clojure to operating on a single thread, doing work that's really imperative, then it'll never have its chance to shine. On the other hand, if you have an algorithm that can be worked on in parallel across multiple threads, including something based on ForkJoin, then you will see Clojure as fast as, or possibly faster than, Java ... especially if the Java code is restricting itself to a single thread, or using multiple threads and managing locks.

    As much as you want to avoid I/O in a benchmark, I think you are only going to see Clojure's strength in an I/O-based operation where work can be split across threads.

    You can clearly see in Clojure's evolution that they are following "Make it work. Make it right. Make it fast." Many of the changes over the last few years have been about speed ... chunked collections first, then transient collections, ... the next iteration will have composable mapping, reducing, and filtering functions.

    Replies
    1. Do you know how large or small the time difference is between "warmed" and "cold start" for the benchmarks game Clojure programs Martin Trojer mentioned?

      For example, given that the fastest Clojure mandelbrot program takes ~55s CPU do you think the "warmed" program will be more than 5s CPU faster?


      >>if you restrict Clojure to operating on a single thread<<

      Is that what was done? Here's the quad-core Q6600 CPU Load that's shown on the benchmarks game website for that Clojure mandelbrot program: 91% 95% 93% 93%

    2. Can you point to this mandelbrot anecdote?

    3. Didn't you see the benchmarks game link in Martin Trojer's blog post?

      Just 1 click from the home page and...

      ×    Program Source Code  CPU secs  Elapsed secs  Memory KB  Code B  ≈ CPU Load
      4.1  Clojure #6           54.60     14.78         109,868    1069    96% 91% 92% 92%

      Just 2 clicks from the home page and ...



  18. Is there any reason all that metadata/docstrings has to be loaded at startup time rather than on demand for individual classes when required?

    Replies
    1. Why is OSX so slow compared to Ubuntu?

