Wordmaps without Java
Word maps generated by wordle.net have been making the rounds. They are very cool representations of how often various words appear in a hunk of text (such as a blog feed). Unfortunately, the code to generate these word maps is not open source, and it requires Java.
So I decided to take on Johnath’s challenge and produce something similar using HTML canvas and JavaScript:
You can take it for a spin too, but only if you have Firefox 3.1. Try it out! I’m currently using some features that are specific to Firefox 3.1, such as JavaScript 1.8 and Canvas.measureText. I think I can backport this code to support Firefox 3 by checking for .mozMeasureText and .mozTextStyle. I don’t know whether Safari currently supports text drawing or measurement in its canvas implementation. If it does, this can probably be made to work there as well.
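The backport idea could look roughly like the sketch below. The helper name measureTextWidth is hypothetical and is not taken from the repository; the sketch also assumes that mozMeasureText returns the width directly as a number and that mozTextStyle takes the same kind of string as font.

```javascript
// Hypothetical helper (not from the actual code): measure the rendered width
// of `text` on a 2D canvas context, preferring the standard API and falling
// back to the moz-prefixed Firefox 3 one.
function measureTextWidth(ctx, text, font) {
  if (typeof ctx.measureText === "function") {
    // Standard path: set the font, then read the width from the metrics object.
    ctx.font = font;
    return ctx.measureText(text).width;
  }
  if (typeof ctx.mozMeasureText === "function") {
    // Firefox 3 path (assumption: mozTextStyle plays the role of `font`
    // and mozMeasureText returns a plain number).
    ctx.mozTextStyle = font;
    return ctx.mozMeasureText(text);
  }
  return null; // no text measurement available on this context
}
```

A caller would pass in the result of canvas.getContext("2d") and treat a null return as "text layout is not supported here".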
If you’re interested in the code, a Mercurial repository is available on hg.mozilla.org. There are a couple of possible improvements noted in the README file. Some other possibilities that I’m just thinking of now:
- Produce an image map to make all the terms link to the relevant post(s).
- Produce SVG output to make the output scalable.
December 12th, 2008 at 6:54 pm
This is great. Nice job! I’ve been wanting to look into cross-browser canvas support some more.
December 12th, 2008 at 6:55 pm
Using Firefox 3.1 beta 2, it works, but it is far too slow (much slower than Wordle).
December 12th, 2008 at 7:28 pm
Very nice. One question: is it slow as all-get-out generally, or is it just me/the text I fed in/Wednesday’s minefield on Linux?
December 13th, 2008 at 4:50 am
That’s what I love about the web, modern browsers and open source. Everything is possible. Who needs proprietary closed source plugins?
But yes, the performance needs a lot of improvements. Even with 3 Words, it takes 5s or more.
And SVG would definitely be nice. With linkable text and more.
December 13th, 2008 at 6:16 am
[…] Here is Wordle applied to this blog’s feed (the meme of the moment on Planet Mozilla, to the point that someone has even made a JavaScript version). […]
December 13th, 2008 at 10:31 am
“Who needs proprietary closed source plugins?”
I guess Wordle does, in order to provide the performance that its users enjoy.
Given that the source code for the Java 6 and 7 platforms is readily available, I question your use of the phrase “closed source”. I agree with you that Java is still “proprietary”, but its source is there if you need it.
I chose Java for Wordle because it’s the only platform that can do what Wordle does in a reasonably cross-platform browser-based experience. The strategy has paid off, I’d say, given the hundreds of emails I’ve received from folks who are not technically inclined, but who have gotten a lot of joy from using Wordle.
December 13th, 2008 at 10:42 am
I forgot to give sincere kudos to Benjamin Smedberg for this really cool implementation; excuse me! That’s why I came here, but then got distracted by… political matters.
In looking at your source I learned about the “let” and “yield” keywords, which I’ve always yearned for in JavaScript, and never knew existed. I look forward to the day when I can use them. It would be cool if someone could do something analogous to what the Objective-J folks have done, and create a compiler that turns JavaScript 1.8 into JavaScript 1.5.
For its tight layouts, Wordle depends on Java’s Font and Java2D APIs, which give intersection-testing in an arbitrary coordinate space, using knowledge of a word’s “shape” (a series of splines). Does a canvas give arbitrary shape hit-testing? If not, you’ll probably be restricted to the word’s bounding box, and you won’t be able to get, say, a word between an “i” and its dot, or in the counters of a big “g”, or nestled above the “o”s in “Food”.
In case you’re interested, Wordle tries to put each word in its “preferred” location, then moves it around in a *spiral* until it fits.
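The spiral search described above can be sketched as follows. This is only an illustration of the idea, not Wordle’s code (which is not public): the names spiral and placeOnSpiral, the Archimedean spiral shape, and the step and maxRadius parameters are all assumptions, and it uses standard function* generators rather than the JavaScript 1.8 syntax.

```javascript
// Yield integer points along an Archimedean spiral centred on (cx, cy).
// The first point yielded is the centre itself. `step` controls how fast the
// spiral grows; `maxRadius` bounds the search. Both are illustrative knobs.
function* spiral(cx, cy, step, maxRadius) {
  for (let t = 0; ; t += 0.25) {
    const r = step * t;
    if (r > maxRadius) return;
    yield {
      x: Math.round(cx + r * Math.cos(t)),
      y: Math.round(cy + r * Math.sin(t)),
    };
  }
}

// Try the preferred location first, then walk outward along the spiral until
// the caller-supplied `fits(x, y)` test accepts a position. Returns null if
// nothing fits within maxRadius.
function placeOnSpiral(preferred, fits, step = 2, maxRadius = 500) {
  for (const p of spiral(preferred.x, preferred.y, step, maxRadius)) {
    if (fits(p.x, p.y)) return p;
  }
  return null;
}
```

The fits callback is where the hit-testing goes, so the same search works with bounding boxes, bitmaps, or shape intersection.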
December 13th, 2008 at 11:26 am
Seems that your script is unable to handle non-latin1 characters. How can we fix that?
December 13th, 2008 at 11:40 am
Oh look, http://www.visophyte.org/blog/2008/12/12/thunderbird-and-gloda-go-to-meme-town/ refers to “canvas.mozPathText and canvas.isPointInPath”!
December 13th, 2008 at 1:09 pm
[…] and make a Wordle of your own. If you’re using the beta for Firefox 3.1 you can check out Benjamin Smedberg’s version that not only doesn’t use Java like Wordle does, it also uses open source code. He said he […]
December 13th, 2008 at 2:31 pm
Awesome! Shame it’s ASCII-only, though.
December 14th, 2008 at 8:53 am
Johnathan: sorry, I did not mean to imply that Java was closed-source: I only meant that the wordle.net code was closed. I don’t even think this is necessarily a bad thing; it just hinders hackability, and since I’m a hacker at heart…
Canvas has isPointInPath, but it doesn’t have intersection testing between two paths… the hit-testing I’m doing now uses bitmaps, which is slow but effective. I’d be interested to see whether canvas or SVG could grow intersection testing.
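To make the bitmap approach concrete, here is a minimal sketch of a pixel overlap test. It is not the code from the repository: the function name overlaps is hypothetical, and it assumes both bitmaps are ImageData-like objects ({ width, height, data } with RGBA bytes, as returned by getImageData) where a non-zero alpha byte means "ink".

```javascript
// Return true if any inked pixel of `word` lands on an inked pixel of `board`
// when `word` is drawn with its top-left corner at (ox, oy).
// Assumption: both arguments look like ImageData ({ width, height, data }).
function overlaps(board, word, ox, oy) {
  for (let wy = 0; wy < word.height; wy++) {
    const by = oy + wy;
    if (by < 0 || by >= board.height) continue; // row falls outside the board
    for (let wx = 0; wx < word.width; wx++) {
      const bx = ox + wx;
      if (bx < 0 || bx >= board.width) continue; // column falls outside the board
      // Alpha is the fourth byte of each RGBA pixel.
      const wordAlpha = word.data[(wy * word.width + wx) * 4 + 3];
      const boardAlpha = board.data[(by * board.width + bx) * 4 + 3];
      if (wordAlpha !== 0 && boardAlpha !== 0) return true;
    }
  }
  return false;
}
```

Every candidate position costs a scan of up to width × height pixels, which is why this is slow but lets a word sit inside the empty parts of another word’s bounding box.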
Thanks for the tip about spirals!
Tomer/Simon: I don’t know why it’s ASCII-only… perhaps something about the \b in regexps? Patches accepted!
December 14th, 2008 at 10:16 pm
\b is defined to match between two chars if one of them is in the class [a-zA-Z0-9_] and the other one is not. See section 15.10.2.6 of . In other words, it’s completely useless for non-ASCII text in JS. So are the \w and \W character classes, for the same reason. I don’t see an obvious simple way to do splitting on words using a JS regexp, basically. You could try splitting on /[^\w\u0100-\uffff]/, I guess, to filter out ASCII non-word chars only. You’ll still lose if someone puts in Unicode spaces of various sorts, but dealing with that sort of thing is a pain. You’ll also lose with Kanji, but the concept of “word” there gets pretty fuzzy anyway.
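A minimal sketch of that splitting idea, written with the four-digit escape \u0100 that JavaScript requires for \u sequences. The splitWords name is hypothetical, and the added + (to collapse runs of separators) and the empty-string filter are small additions, not part of the suggestion itself.

```javascript
// Split on runs of ASCII-range non-word characters. Any character at or above
// U+0100 is kept as part of a word, so non-Latin scripts survive the split.
const SEPARATORS = /[^\w\u0100-\uffff]+/;

// Hypothetical helper: return the non-empty "words" in `text`.
function splitWords(text) {
  return text.split(SEPARATORS).filter((w) => w.length > 0);
}
```

Note the limits this inherits: characters between U+0080 and U+00FF (accented Latin-1 letters, for example) are still treated as separators, and Unicode spaces such as U+2003 are treated as word characters.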
As for performance, you’re basically off-trace here, for a few reasons:
1) JSOP_YIELD isn’t traced at the moment. So any |yield| usage throws you off trace.
2) JSOP_ARRAYPUSH is not traced. So things like |let wordlist = [t for (t in getOwnProperties(wordmap))];| fall off trace.
3) JSOP_ENTERBLOCK is not traced. In particular, that’s causing a trace abort at the |for (let wy = wordimg.height - 1; wy >= 0; --wy) {| in merge().
That’s probably enough. So basically, all of merge(), range() (uses yield), normalInt() (uses arraypush), hitTest() (ends up with enterblock), cause this code to fall off trace.
December 14th, 2008 at 10:39 pm
In my previous comment the section is of the ECMA spec, but the URL in angle brackets got eaten. I wish blogs would make it clear whether they escape angle brackets or just eat them. :(
In any case, s/let/var/ gives me a 3x speedup on second word placement here (the loop in placeWord), because all the ENTERBLOCK aborts drop out, and in particular because that more or less traces the loops in hitTest, modulo two incompatible inner tree aborts.
One does wish that improving performance against TM were not quite so much black magic without a debug build. ;)
In any case, with the above substitution, a profile shows something like 25% of the time in double-to-int conversions, 10% in unboxing doubles, at least 35% of the time in jit-generated code. So I’m not sure how much faster this would really get at this point with this algorithm. Maybe another 2x speedup, but I doubt it’d be more than that.
December 14th, 2008 at 11:02 pm
Oh, digging more into let, here’s what jsparse.cpp has to say about ENTERBLOCK:
* Each let () {…} or for (let …) … compiles to:
*
* JSOP_ENTERBLOCK … JSOP_LEAVEBLOCK
…
* Each var declaration in a let-block binds a name in at
* compile time, and allocates a slot on the operand stack at
* runtime via JSOP_ENTERBLOCK.
December 15th, 2008 at 12:59 pm
@Benjamin,
If you look again at my first comment, you’ll see that I was quoting and responding to a commenter about the Java plugin, not to you, when mentioning “closed source.” Of course, you’re right, I cannot share the Wordle source code, as it belongs to IBM.
If canvas supports intersection between a shape and a rectangle, then you can decompose a word into a tree of ever-smaller rectangles, and use those for hit testing. If not… you’re out of luck!
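The rectangle-tree idea can be sketched as below. This is an illustration only, not Wordle’s implementation: the names buildRectTree and treesIntersect are hypothetical, and the tree is built from a boolean ink mask (mask[y][x]) rather than from a shape/rectangle intersection call, which is a substitution made so the example is self-contained.

```javascript
// Build a rectangle tree over a boolean ink mask (mask[y][x] is true where
// the glyph has ink). Empty regions become null; leaves are fully inked.
function buildRectTree(mask, x, y, w, h) {
  let inked = 0;
  for (let j = y; j < y + h; j++) {
    for (let i = x; i < x + w; i++) {
      if (mask[j][i]) inked++;
    }
  }
  if (inked === 0) return null;
  const node = { x, y, w, h, children: [] };
  if (inked === w * h) return node; // solid leaf
  // Partly inked: split the longer side in two and recurse into each half.
  const halves = w >= h
    ? [[x, y, Math.floor(w / 2), h],
       [x + Math.floor(w / 2), y, w - Math.floor(w / 2), h]]
    : [[x, y, w, Math.floor(h / 2)],
       [x, y + Math.floor(h / 2), w, h - Math.floor(h / 2)]];
  node.children = halves
    .map(([hx, hy, hw, hh]) => buildRectTree(mask, hx, hy, hw, hh))
    .filter((child) => child !== null);
  return node;
}

// Test two trees for ink overlap, with tree `a` offset by (ax, ay) and tree
// `b` offset by (bx, by). Subtrees whose rectangles are apart are pruned.
function treesIntersect(a, ax, ay, b, bx, by) {
  const apart =
    a.x + ax + a.w <= b.x + bx || b.x + bx + b.w <= a.x + ax ||
    a.y + ay + a.h <= b.y + by || b.y + by + b.h <= a.y + ay;
  if (apart) return false;
  // Two overlapping solid leaves mean real ink overlap.
  if (a.children.length === 0 && b.children.length === 0) return true;
  if (a.children.length > 0) {
    return a.children.some((c) => treesIntersect(c, ax, ay, b, bx, by));
  }
  return b.children.some((c) => treesIntersect(a, ax, ay, c, bx, by));
}
```

The tree is built once per word, and each placement attempt then only descends into the rectangles that actually overlap, instead of scanning every pixel.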
December 15th, 2008 at 1:09 pm
My current intersection/hit-testing code is all based on actual pixel testing. When it traces to native code, it performs pretty well; updated code is now live that traces better per bz’s suggestions. It does work and allows for placing words within large letters, around the dots in “i”, etc. It has some drawbacks, particularly because there’s not a good way to optimize the “move this shape along a path until it hits something” operation, which is a lot faster using vector graphics and splines.
May 17th, 2013 at 9:59 pm
May I just say to Mr. Smedberg, as a person that wanted to use a word map like Tagxedo or Wordle and did NOT want the hassle of Java… Thank you. Keep up the good work.