ObMimic: Very, very late – but still alive

21 11 2009

Updated Feb 2018: OpenBrace Limited has closed down, and its ObMimic product is no longer supported.

It’s been a long, long time since I last mentioned ObMimic, so an update seems necessary.

At the time I thought it wouldn’t take more than a couple of months or so to get everything ready and put up a public website, discussion forums, bug list, and everything we need internally to release and support ObMimic.

Well, here we are, well over a year later, and there’s still no public website or ObMimic release.

So it seems worth a quick post to explain that ObMimic isn’t dead; it isn’t forgotten; it has just been much delayed and postponed, and is still “on the way”.

Why this long delay?

Well, partly this is the usual story that everything takes much longer than expected, even after you’ve allowed for the fact that everything takes much longer than expected.

Partly it’s from needing longer than expected to handle all the bugs and kludges encountered in the many third-party applications, tools and services involved, and getting everything to work nicely together.

Partly it’s a deliberate decision to wait until everything is reasonably complete and reliable before going public, even if that causes delay, rather than rushing out a quick-and-nasty website with lots of problems (and leaving us with a backlog of work and ongoing firefighting).

But mainly it’s due to having been sucked into other commitments, and spending time on several other things in order to “keep afloat” in the meantime. Such distractions have always seemed like a bad idea, but over the last year or so they’ve been a necessary evil.

It’s not that we haven’t been working on it. It’s just that there are always so many things that need doing, and the months fly by at alarming speed. As Brooks says in The Mythical Man Month: “How does a project get to be a year late? … One day at a time.” (but as also quoted there, “Good cooking takes time. If you are made to wait, it is to serve you better, and to please you”).

Anyway, during all these delays I’ve been deliberately keeping a low profile, and have wanted to steer clear of talking much about ObMimic until I can be sure that it’s genuinely imminent.

Rest assured though, a public beta and full release are still intended as soon as it’s ready (and we are getting there, despite the absence of any outward signs of progress); the product itself is still being maintained and improved in the meantime; and for anyone that wants an early look at it, a “private beta release” is still available if you contact me.

ArbitraryObject: A useful useless class

23 10 2009

I’ve recently introduced a trivial little class called “ArbitraryObject” into my Java test-case code. Here’s the full story…

When writing test cases in Java, every now and then one comes across a situation where an object is needed but its precise type doesn’t matter, and you just need to pick some arbitrary class to use as an example.

Sometimes any class at all will do; sometimes there are constraints on what the class must not be, but anything else will do (e.g. anything that isn’t class X or one of its subclasses or superclasses).

Most commonly this happens for tests of methods that take “Object” arguments – an obvious example is testing that an implementation of “equals(Object)” returns false for any argument that isn’t of the appropriate type.

Another common case is testing of generic classes and methods with “any class” generic parameters, where one needs to pick a class to be used as the generic parameter’s actual type.

In these situations, what class should one use?

Perhaps the simplest choices are Object or String. However, Object seems a poor choice for this in general – if you’re testing something that takes any Object, you probably want to test it with something more specific than Object itself (even if you do also want to test it with a basic Object). It’s also not going to work where you need something that isn’t inheritance-related to some particular class.

Similarly, although String can be very convenient for this, strings are so common as argument values and in test-case code that their use tends to blend into the background. So it’s hard to see when a string is being used for this purpose as opposed to being a “real” use of a string.

More generally, if you’re trying to show how some code handles any arbitrary type, neither Object nor String seem the most useful or convincing examples to pick.

What we’re really looking for is a class that meets the following criteria:

  • It shouldn’t be relevant in any way to the class being tested (isn’t the class being tested, doesn’t appear as a method argument anywhere in the class being tested, and isn’t a superclass or subclass of such types);
  • It shouldn’t be used otherwise in the test-case code (so as to avoid any confusion);
  • Ideally it ought to be somewhat out-of-the-ordinary (so that we can reasonably assume that the code being tested doesn’t give it any special treatment, and so that its use in the test-case code stands out as something unusual, and so as to emphasise that it’s just an arbitrary example representing any class you might happen to use);
  • It should be easy to construct instances of the class (it should have a public constructor that doesn’t require any non-trivial arguments or other set-up or configuration);
  • There shouldn’t be any significant side-effects or performance impact from creating and disposing of instances and using their Object methods such as equals/hashCode/toString (e.g. these shouldn’t do anything like thread creation, accessing of system or network resources etc).

Until now I’ve been picking classes for this fairly arbitrarily. Sometimes I just grab one of the primitive-wrapper classes like java.lang.Float or perhaps java.math.BigInteger if these aren’t otherwise involved in the code – even though they’re rather too widely used to be ideal for this. Otherwise I’ve picked something obscure but harmless from deep within the bowels of the JDK, such as java.util.zip.Adler32.

The problems with this approach are:

  • The intention and reason for using the chosen class aren’t obvious from the code;
  • The test-case ends up with an otherwise-unnecessary and rather misleading “import” and dependency on the chosen class (unless it’s a java.lang class, but the most suitable of those suffer the drawback of being too widely used);
  • Any searches for the chosen class will find these uses of it as well as its “genuine” uses;
  • There’s no easy way to find everywhere that this has been done (for example, if I ever want to change how I handle these situations).

So instead I’ve now started using a purpose-built “ArbitraryObject” class.

The only purpose of this class is to provide a suitably-named class that isn’t related to any other classes, isn’t otherwise relevant to either the test-case code or the code being tested, and isn’t used for any other purpose.

The main benefit is that this makes the intention of the test-case entirely explicit. Wherever ArbitraryObject is used, it’s clear that it represents the use of any class, at a point where a test needs this. In addition, the test-case code no longer has any dependencies on obscure classes that aren’t actually relevant; it’s easy to find all the places where this is being done; and searches for other classes aren’t going to find any “accidental” appearances of a class where it’s been used for this purpose.

ArbitraryObject must be the most trivial class I’ve ever written. Not even worth showing the code! It’s just a subclass of Object with a public no-argument constructor and nothing else.
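For the record, though, the entire class is just this (the String here is only a stand-in for whatever class is actually under test):

```java
/** A deliberately trivial class whose only purpose is its well-chosen name. */
class ArbitraryObject {
    public ArbitraryObject() {
    }
}

/** Example use: checking that equals(Object) rejects an unrelated type. */
class ArbitraryObjectExample {
    public static void main(String[] args) {
        // ArbitraryObject stands in for "any class unrelated to the type
        // being tested"; its presence makes the intent of the test explicit.
        System.out.println("hello".equals(new ArbitraryObject())); // prints "false"
    }
}
```

That really is all there is to it.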

Potentially one could argue for additional features, such as giving each instance a “name” to be shown in its “toString” representation, making it Serializable, and so forth. But none of that seems worth bothering with.

So this ArbitraryObject class is entirely trivial, and as a class it’s kind of useless, but the name in itself is useful to me.

Sometimes all you need is an explicit name.

The Java EE Verifier and indirect and optional dependencies

3 08 2009

Running the Java EE 5 Verifier can be a useful way of checking EAR files and other Java EE artifacts before deploying and running them.

However, once you start using third-party libraries there’s one set of rules in the verifier that is rather too idealistic: the requirement that all referenced classes need to be present in the application. If any classes are referenced but can’t be found, these are reported by the verifier as failures.

In theory, it’s perfectly reasonable that Java EE applications are basically supposed to be “self-contained”, and that all classes referenced within them need to be present within the application itself (obviously excluding those of the Java EE environment itself). Actually, Java’s “extension” mechanism is also supported as a way of using jars from outside of the application, but this has limitations and drawbacks of its own and doesn’t really change the overall picture. There’s a useful overview of this subject in the “Sun Developer Network” article Packaging Utility Classes or Library JAR Files in a Portable J2EE Application (this dates from J2EE 1.4, but is still broadly appropriate for Java EE 5).

Anyway, verifying that the application’s deliverable includes all referenced classes seems better than risking sudden “class not found” errors at run-time (possibly on a “live” system and possibly only in very specific situations). The trouble is that once you start using third-party libraries, you then also need to satisfy their own dependencies on further libraries, even where these are only needed by optional facilities that you never actually use. Then you also need all the libraries that those libraries reference, and so on. This can easily get out of hand, and require all sorts of libraries that aren’t ever actually used by your application.

As a simple example, take the UrlRewriteFilter library for rewriting URLs within Java EE web-applications. This is limited in scope and its normal use only involves a single jar, so you’d think it would be relatively self-contained.

However, one of its features is that you can configure its “logging” facilities to use any of a number of different logging APIs. In practice, I don’t use anything other than the default setting, which uses the normal servlet-context log. But its code includes references to log4j, commons-logging and SLF4J so that it can offer these as options. The documentation says that you need the relevant jar in your classpath if you’re using one of these APIs, but the Java EE Verifier tells you that they all need to be present – even if you’re not actually using them (on the perfectly reasonable basis that there’s code present that can call them).

That’s not the end of the story. The SLF4J API in turn uses “implementation” jars to talk to actual logging facilities, and includes references to classes that are only present in such implementation jars. So you also need at least one such SLF4J implementation jar. At this point you’re now looking at the SLF4J website and trying to figure out which of its many jars you need. What are they all? Does it matter which one you pick? Perhaps you need all of them? Do they have any further dependencies on yet more jars? Are there any configuration requirements? Are these safe to include in your application without learning more about SLF4J? Do they introduce any security risks?

So apart from anything else, you’re now having to find out more than you ever wanted to know about SLF4J, just because a third-party library you’re using has chosen to include it as an option. Ironically, a mechanism intended to give you a choice between several logging APIs has ended up requiring you to bundle all of them, even when you’re not actually using any of them!

Anyway, in addition to the log4j jar, the commons-logging jar, the SLF4J API jar, and an SLF4J implementation jar, the UrlRewriteFilter also needs a commons-httpclient jar (though again, nothing in my own particular use of UrlRewriteFilter appears to actually use this). That in turn also requires a commons-codec jar.

Fortunately, that’s the limit of it for UrlRewriteFilter. But it’s easy to see how a third-party jar could have a whole chain of dependencies due to “optional” facilities that you’re not actually using.

As a rather different example, another library that I’ve used recently appears to have an optional feature that allows the use of Python scripts for something or other. This is an optional feature in one particular corner of the library, and is something I have no need for. To support this feature, the code includes references to what I presume are Jython classes. As a result the verifier requires Jython to be present (and then presumably any other libraries that Jython might depend on in turn).

Now, bundling Jython into my Java EE application just to satisfy the verifier and avoid a purely-theoretical risk of a run-time “class not found” error seems plain crazy. If the code ever does unexpectedly try to use Jython, I’d much rather have it fail with a run-time exception than have it work successfully and silently do who-knows-what.

To add insult to injury, Jython is presumably able to call Python libraries that might or might not be present but that the verifier will know nothing about – so bundling Jython in order to satisfy the verifier might actually make the application more vulnerable to code not being found at run-time.

With the mass of third-party libraries available these days, and the variety of dependencies these sometimes have, I suspect there must be cases that are far, far worse than this. (Anyone out there willing to put forward a “worst case”?)

So what’s the answer? Obviously you do need to bundle the jars for all classes that are actually used, but for jars whose classes are referenced but never actually used (and any further jars that they reference in turn) I can see a number of alternatives:

  • Work through all the dependencies and bundle all the jars so that the verifier is happy with everything. Often this is entirely appropriate or at least acceptable, but as we’ve seen above, this cure isn’t always very practical, and in some cases it can be worse than the disease.
  • A variation on the above is to leave the “unnecessary” jars out of the application but run the verifier on an adjusted copy of the application that does include them. That is, produce a “real” deliverable with just the jars that are actually needed, and a separate adjusted copy of it that also includes any other jars necessary to keep the verifier happy but that you know aren’t actually needed by the application. The verification is run on this adjusted copy, which is then discarded. The drawback is that you still have to work through the entire chain of dependencies and track down and get hold of all of the jars, even for those that aren’t really needed. There’s also the risk that you’ll treat a jar as unnecessary when it isn’t, which is exactly the mistake that the verifier is trying to protect you from.
  • Another alternative is to just give up and not use the verifier. But it seems a shame to miss out on the other verification rules just because one particular rule isn’t always practical.
  • Ideally, it’d be nice to be able to configure the verifier to allow particular exceptions (perhaps to specify that this particular rule should be ignored, or maybe to specify an application-specific list of packages or classes whose absence should be tolerated). But as far as I can see there’s no way to do this at present.
  • Another approach is to inspect the verifier’s results manually so that you can ignore these failures where you want to, but can still see any other problems reported by the verifier. However, it’s always cumbersome and error-prone to have to manually check things after each build, especially where you might have to wade through a long list of “acceptable” errors in order to pick out any unexpected problems.
  • Potentially you could script something to examine the verifier output, pick which warnings and failures should and shouldn’t be ignored, and produce a filtered report and overall outcome based on just the failures you’re interested in. In the absence of suitable options built into the verifier, you could use this approach to support appropriate options yourself. This is probably the most flexible approach (in that you could also use it for any other types of verifier-reported errors that you want to ignore). But it seems like more work than this deserves, and it’d be rather fragile if the messages produced by the verifier ever change.
  • As a last resort, if the library containing the troublesome reference is open-source you could always try building your own customised version with the dependency removed (e.g. find and remove the relevant “import” statements and replace any use of the relevant classes with a suitable run-time exception). Clearly, even where this is possible it will usually be more trouble than it’s worth and will usually be a bad idea, but it’s another option to keep up your sleeve for extreme cases (e.g. to remove a dependency on an unnecessary jar that you can no longer obtain).
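As a rough sketch of the “script something to examine the verifier output” option above, something like the following could work. Note that the “FAILED” marker and the package names are purely illustrative assumptions on my part, not the verifier’s actual report format, which would need checking:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of post-processing a verifier report to drop "acceptable" failures. */
class VerifierReportFilter {

    // Packages whose "class not found" failures we have decided to tolerate
    // (illustrative examples only):
    private static final List<String> IGNORABLE_PACKAGES = Arrays.asList(
            "org.apache.log4j.",
            "org.apache.commons.logging.",
            "org.slf4j.");

    /** Returns only the failure lines that don't mention an ignorable package. */
    static List<String> unexpectedFailures(List<String> reportLines) {
        List<String> result = new ArrayList<String>();
        for (String line : reportLines) {
            if (!line.contains("FAILED")) {
                continue;  // only interested in failures
            }
            boolean ignorable = false;
            for (String pkg : IGNORABLE_PACKAGES) {
                if (line.contains(pkg)) {
                    ignorable = true;
                    break;
                }
            }
            if (!ignorable) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Whether this is worth the fragility is another matter, as any change to the verifier’s message format would silently break the filtering.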

The approach I’ve adopted for the time being is to run the verifier on “adjusted” copies of my applications, but only use this for jars that I’m very confident aren’t needed and aren’t wanted in the “real” application. The actual handling of this is built into my standard build script, which builds the “adjusted” application based on an application-specific list of which extra jars need to be added into it.

In the longer term, I’m hoping that the entire approach to this might all change anyway… in a world of dynamic languages, OSGi bundles, and whatever eventually comes of Project Jigsaw and other such “modularization” efforts, the existing Java EE rules and packaging mechanisms just don’t seem very appropriate anymore. It all feels like part of the mess that has grown up around packaging, jar dependencies, classpaths, “extension” jars etc, together with the various quirks and work-arounds that have found their way into individual specifications, APIs and tools (often to handle corner-cases and real-world practicalities that weren’t obvious when the relevant specification was first written).

So I’m hoping that at some point we’ll have a cleaner and more general solution to packaging and modularization, and this little quirk and all the complications around it will simply go away.

Oracle-Sun: The winner is Google?

25 04 2009

I think GWT, Android and its Dalvik VM and Google App Engine for Java are all very interesting. But as a developer I’ve been a bit put off by this nagging worry that they’re not “real” Java, and that going off in that direction would progressively take me outside the (so far) relatively-benign stewardship of Sun and the JCP.

The worry is that things like this would take me away from the community standards with a choice of competing and compatible implementations, and into a world of arbitrary single-vendor products with quirks and foibles, lock-in, arbitrary changes and surprise announcements that can suddenly turn your world upside-down.

With the Oracle acquisition of Sun, that perception could change very rapidly.

Hopefully Oracle will play nice and keep the good bits of Sun’s approach to Java whilst sorting out the problems. Hopefully they’ll manage to retain at least the key people, put in sufficient resources to finish and “polish” the things that Sun never managed to, and make better decisions on strategy and marketing. Make the JCP better, or at least not make it worse. And find a way to make good money from Java without having to do anything untoward. If we’re really lucky they’ll do all of that without making a mess of the technology itself.

But back in the real world, does anyone really think that’s how it’s going to go? I’ve only a little experience of Oracle products and their way of doing things, but what I’ve seen from them in the past doesn’t fill me with any great confidence for the future of Java.

So in my head I’m expecting this to degenerate into a rather old-school fight between Oracle, IBM and to some extent Google, with them all pushing increasingly proprietary and “integrated” stacks that live in their own idiosyncratic universes and gradually diverge from each other. With lots of theatre about joint specifications and standards, but increasingly nasty disagreements and battles over licensing. There will always be some independent open-source alternatives and smaller players for particular parts of the stack, but in this scenario I can’t see these managing to keep up with the big guys (nor having any real impact outside of their own small niches).

On paper, out of Oracle, IBM and Google, the advantage might now appear to lie with Oracle. But in practice they are all big enough to mess things up very badly. There’s certainly plenty of scope for them to do so.

On that basis, the Google stuff suddenly looks no more proprietary or risky than anyone else’s solutions, and could easily be the least-bad (and least-evil?) of an imperfect bunch. Arguably they already have a head-start down the road that the others are now about to follow. They’re also way more “zeitgeist” and future-oriented than IBM or Oracle.

Theoretically Microsoft and .NET might gain from all this. If people stop seeing the Java world as a neutral, no-lock-in platform based on community-agreed specifications, then .NET starts to look like just as valid an option as Oracle’s Java stack or IBM’s Java stack or Google’s Java offerings. But I’d expect that anyone with reason to choose Java rather than .NET would still do so even if Java were to fragment into separate vendor universes.

In the end this all depends on what Oracle actually do with Java, and with products such as Glassfish. At the moment we’re all just speculating based on our own experiences of Oracle and their products; their history with previous acquisitions; and an awful lot of guesswork. So for now I guess I have to give them the benefit of the doubt and wait until we see their strategy and plans and their actual behaviour.

But if I ever get any spare time, I’m now far more likely than before to spend it looking into Google’s java-related platforms and tools, and I can’t imagine I’m the only one…

So if I was Google I’d be feeling pretty happy about it on the whole.

Kibibytes and Mebibytes

19 03 2009

Whilst searching for something else, I stumbled across the use of the prefix “kibi-” for the quantity 1,024 (i.e. 2^10), with this standing for “kilobinary” and having a corresponding symbol Ki.

For example, this leads to the term kibibyte for 1,024 bytes, kibibit for 1,024 bits, 4 KiB to indicate 4,096 bytes etc.

Apparently, as much as ten years ago the IEC started adopting a whole set of such “binary prefixes” to parallel the normal decimal SI prefixes of “kilo-”, “mega-”, “giga-” etc.

So as well as “kibi-”, we also have the prefixes “mebi-” (2^20), “gibi-” (2^30) and so forth, all the way up to “yobi-” (2^80).

For further details, see (for example):

I’m very surprised to have not come across any of this before, but I guess you learn something every day!

It all seems like a great idea for clearing up the ambiguity in the use of the normal SI prefixes when dealing with binary stuff (e.g. whether someone saying “megabyte” means 1,000,000 bytes or 1,024 x 1,024 bytes).
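The two readings are easy to contrast in code. As a trivial Java illustration (the class and constant names are mine), the two candidate values for “mega” differ by nearly 5%:

```java
/** Decimal (SI) prefixes versus binary (IEC) prefixes, as exact values. */
class Prefixes {
    // Decimal SI prefixes:
    static final long KILO = 1000L;
    static final long MEGA = 1000L * 1000L;         //         1,000,000

    // IEC binary prefixes:
    static final long KIBI = 1L << 10;              //             1,024
    static final long MEBI = 1L << 20;              //         1,048,576
    static final long GIBI = 1L << 30;              //     1,073,741,824

    // The sequence continues up to yobi (2^80), which overflows a long
    // and needs BigInteger:
    static final java.math.BigInteger YOBI =
            java.math.BigInteger.valueOf(2).pow(80);
}
```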

Some of the names don’t exactly trip off the tongue, and gave me the giggles at first, but that’s probably just from being unfamiliar.

I imagine it’s very hit-and-miss to get something like this into common usage, but I’d have thought that plenty of geeks would have latched onto these terms by now. At any rate, I’d like to think that Maurice Moss regularly puts these prefixes to good use.

Does anyone actually use these prefixes in real life or are they just an idea that has never gained any traction? Am I alone in having not heard of them or did this pass everybody by?

Maybe it’s time to start spreading the word and using these terms more widely…

Forcing Glassfish V2 to reload an auto-deployed web-application

31 01 2009

If you auto-deploy a war archive on Glassfish V2, any changes to the deployed application’s JSP files are picked up automatically. However, if you make changes to the deployed application’s web.xml file or any other such configuration files, you need some way to make Glassfish “reload” the application using the updated files.

It isn’t immediately apparent how to trigger this. At any rate, it had me scratching my head yesterday when I found myself trying to install a third-party application. The installation instructions led me to auto-deploy its war archive and then edit the deployed files, but the changes didn’t take effect.

I couldn’t see anything in the Glassfish admin console to make it stop and re-load the application, and the command-line facilities that I found for this don’t seem to apply to auto-deployed applications.

The obvious solution was to shut down and restart Glassfish, but even that seemed to leave the application still using its original configuration and ignoring the changes.

Apparently the trick is that you have to put a file named .reload into the root of the deployed application’s directory structure.

This file’s timestamp is then checked by Glassfish and used to trigger reloading of the application. So you can force a reload at any time by “touching” or otherwise updating this “.reload” file.
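Any convenient means of touching the file will do (a shell “touch”, an Ant task, etc.). As a minimal Java sketch, with the class name and error handling purely illustrative:

```java
import java.io.File;
import java.io.IOException;

/** Creates or "touches" the .reload marker in a deployed app's root directory. */
class ReloadToucher {
    static void touchReload(File applicationRoot) {
        File marker = new File(applicationRoot, ".reload");
        try {
            if (marker.exists()) {
                // Content is irrelevant; only the timestamp change matters.
                marker.setLastModified(System.currentTimeMillis());
            } else {
                // An empty file is fine.
                marker.createNewFile();
            }
        } catch (IOException e) {
            throw new RuntimeException("Could not touch " + marker, e);
        }
    }
}
```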

I can’t claim any detailed knowledge in this area, and have only had a quick look, but I get the impression that this “.reload” mechanism is used by Glassfish for the reloading of all “exploded” directory deployments. For applications that are explicitly deployed from a specified directory structure, you can use the deploydir command with a “--force=true” option to force re-deployment (there might be other ways to do this, but that’s the most obvious I’ve seen so far). But on Glassfish V2 that doesn’t appear possible for auto-deployed applications, so the answer for those is to manually maintain the “.reload” file yourself.

For some other descriptions and information about this, see:

Some notes:

  • Manually touching/updating a “.reload” file also works for exploded archives that have been deployed via “deploydir” (i.e. as an alternative to using the “deploydir” command to force reloading).
  • The content of the “.reload” file doesn’t matter, and it can even be empty. It just has to be named “.reload” and must be in the root directory of the deployed application (that is, alongside the WEB-INF directory, not inside it).
  • Because the “.reload” file is in the root of the web-application and outside of its WEB-INF, it’s accessible to browsers just like a normal JSP, HTML or other such file would be. So it’s not something you’d want to have present in a live system (or you might want to take other steps to prevent it being accessible).

I haven’t looked in detail at whether Glassfish V3 has any improved mechanism for this, but:

  • The V3 Prelude’s “Application Deployment Guide” does have a page “To Reload Code or Deployment Descriptor Changes” that shows the same solution still in place.
  • Glassfish V3 also seems to have a new redeploy command for redeploying applications, which appears to be equivalent to “deploydir” with “--force=true” but doesn’t require a directory path, so can presumably be used on any application, including auto-deployed applications.

As a personal opinion, I’m quite happy with using auto-deployment for most purposes, but in general I’m very much against the idea of editing the resulting “deployed” files. It just doesn’t seem right to me, and I can see all sorts of potential problems.

So even where a third-party product is delivered as a war archive and requires customisation of its files, I prefer to make the necessary changes to an unzipped copy. I can then use my normal processes to build a finished, already-customized archive that can be deployed without needing any further changes.

But there are still times when it’s handy to auto-deploy a web-application or other component by just dropping its archive into Glassfish, and then be able to play around with it “in place” – for example, when first evaluating a third-party product, or when doing some quick experiments just to try something.

So being able to force reloading of an auto-deployed application remains useful.

Java’s String.trim has a strange idea of whitespace

11 11 2008

Java represents strings using UTF-16, so one might assume that its “trim” method for trimming whitespace would be based on Unicode’s view of which characters are whitespace. Or on Java’s. Or would at least be consistent with other JDK methods.

To my surprise, I’ve just realised that’s far from the case.

The String.trim() method talks about “whitespace”, but defines this in a very precise but rather crude and idiosyncratic way – it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.

This results in it trimming the U+0020 space character and all “control code” characters below U+0020 (including the U+0009 tab character), but not the control codes or Unicode space characters that are above that.

Note that:

  • Some of the characters below U+0020 are control codes that I wouldn’t necessarily always want to regard as whitespace (e.g. U+0007 bell, U+0008 backspace).
  • There are further control codes in the range U+007F to U+009F, which String.trim() treats as non-whitespace.
  • There are plenty of other Unicode characters above U+0020 that should normally be recognized as whitespace (such as U+2003 EM SPACE, U+2007 FIGURE SPACE, U+3000 IDEOGRAPHIC SPACE).

So whilst String.trim() does trim tabs and spaces, it also trims some characters that you might not expect to be treated as whitespace, and ignores other characters that genuinely are whitespace.
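A quick demonstration, using U+0007 (BEL) and U+2003 (EM SPACE) as examples:

```java
class TrimDemo {
    public static void main(String[] args) {
        String s = "\u0007 abc\u2003";

        // trim() removes the leading BEL and space (both <= U+0020),
        // but leaves the trailing EM SPACE untouched:
        System.out.println(s.trim().equals("abc\u2003")); // prints "true"

        // Character.isWhitespace disagrees with trim() on both counts:
        System.out.println(Character.isWhitespace('\u0007')); // prints "false"
        System.out.println(Character.isWhitespace('\u2003')); // prints "true"
    }
}
```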

This seems far from ideal, and not what you might expect from a method whose headline says “… with leading and trailing whitespace omitted”.

In contrast:

  • The Character.isWhitespace(char) and Character.isWhitespace(int) methods are defined in terms of which characters are “whitespace according to Java”. This in turn is specified as the characters classified by Unicode as whitespace except for a few Unicode space characters that are “non-breaking” (though quite why these should always be considered to be non-whitespace isn’t obvious to me), plus a specified list of some other characters that aren’t classified as whitespace by Unicode but which you’d normally want to regard as whitespace (such as U+0009 TAB).
  • The Character.isSpaceChar(char) and Character.isSpaceChar(int) methods test whether a Unicode character is “specified to be a space character by the Unicode standard”.
  • The deprecated Character.isSpace(char) method tests for 5 specific characters that are “ISO-LATIN-1 white space”. Ironically, I suspect this deprecated method’s idea of whitespace is what many people are imagining when they use the non-deprecated String.trim() method.
  • The Character.isISOControl(char) and Character.isISOControl(int) methods test for the control codes below U+0020 whilst also recognising the control codes in the U+007F to U+009F range.

One can argue over which of these is the best definition of whitespace for any particular purpose, but the one thing that does seem clear is that String.trim() isn’t consistent with any of them, and doesn’t do anything particularly meaningful. It certainly doesn’t seem special enough to deserve being the String class’s only such “trim” method, and having a name that doesn’t indicate what set of characters it trims.

There is an old, old entry for this in Sun’s bug database (bug ID 4080617). However, this was long-ago closed as “not a defect”, on the basis that String.trim() does exactly what its Javadoc specifies (it trims characters which are not higher than U+0020). Never mind whether this is desirable or not, or how misleading it could be.

The most reasonable approach might be to add new methods to java.lang.String for “trimWhitespace” and “trimSpaceChars”, based respectively on the corresponding Character.isWhitespace and Character.isSpaceChar definitions of whitespace. Arguably it might also be worth having a “trimWhitespaceAndSpaceChars” method to trim all characters recognised as whitespace by either of those methods (because each includes characters that the other doesn’t, such as U+0009 TAB and Unicode’s non-breaking spaces, and sometimes you might want to treat all of these as whitespace).
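To make the proposal concrete, here is a minimal sketch of what such methods might look like. These are hypothetical helpers, not JDK methods; each one simply delegates the “is this whitespace?” decision to the corresponding Character method:

```java
public final class Trim {
    private Trim() {}

    // Trims characters for which Character.isWhitespace(char) is true.
    public static String trimWhitespace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isWhitespace(s.charAt(start))) start++;
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    // Trims characters for which Character.isSpaceChar(char) is true.
    public static String trimSpaceChars(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isSpaceChar(s.charAt(start))) start++;
        while (end > start && Character.isSpaceChar(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }
}
```

Note that the two methods genuinely differ: trimWhitespace removes a leading TAB but leaves a NO-BREAK SPACE, while trimSpaceChars does the opposite.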

It might also be safer if String.trim() were deprecated, as has been done with Character.isSpace(), possibly replacing it with a more accurately-named method for the existing behaviour (maybe “trimLowControlCodesAndSpace”?).

But in practice, at this point the damage has long since been set in stone, and changing this now could have such widespread impact that it probably isn’t feasible.

As for me, I’ll be removing all use of String.trim() from my code and treating it as if it were deprecated, on the basis that it’s misleading, often inappropriate, and too easy to misuse.

That leaves me looking for an alternative.

There are some existing widely-used libraries with relevant methods:

  • The Apache Commons Lang library has a StringUtils class with a “trim(String)” method that clearly documents that it trims “control characters” by using String.trim(), but also has a separate “strip(String)” method that trims whitespace based on Character.isWhitespace(char).
  • The Spring framework has a StringUtils class with a “trimWhitespace(String)” method (and various other such “trim…” methods) which appears to be based on Character.isWhitespace(char). Its Javadoc doesn’t explicitly commit to any particular definition of “whitespace”, but it does refer to Character.isWhitespace(char) as a “see also”.

There are probably lots of other utility libraries with similar methods for this.

However, many of my projects don’t currently use these libraries, and introducing an additional library just for this doesn’t seem worthwhile. On top of which, some of my current code is critically dependent on which characters are trimmed, and “isWhitespace” might not always be what I want (e.g. if I want to treat both “breaking” and “non-breaking” spaces as whitespace).

Of course, this comes down to the usual arguments and trade-offs between using an existing library from elsewhere versus writing the code yourself (effort vs. further dependencies, licensing/redistribution, other useful facilities in the libraries, versioning, potential for “JAR hell” etc.).

At the moment my judgement for my own particular circumstances and current projects is to avoid any dependency on these libraries, and handle this myself instead.

So I’ll probably add “trimWhitespace” and “trimSpaceChars” methods to my own utility routines to use in place of String.trim(). Possibly also a “trimWhitespaceAndSpaceChars”.

These will just be convenience methods built on top of a more fundamental method that takes an argument specifying which characters to regard as whitespace. That in turn will be provided by an interface for “filters” for Unicode characters, with each filter instance indicating yes/no for each character passed to it. Some predefined filters can then be provided for various sets of characters (Unicode whitespace, Java whitespace, ISO control codes etc), and others can be constructed for particular requirements as necessary.

I’ll probably also include a mechanism for combining and negating such filters, so that I can define filters for various sets of characters but also use combinations of them when trimming. Ideally this all needs to cater for Unicode code points rather than just chars, so as to cope correctly with Unicode supplementary characters above U+FFFF and represented within Strings by pairs of chars (in case any of these ever need to be recognised as whitespace, or in any other use of such filters).
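The filter design described above might be sketched roughly as follows. All of these names are hypothetical (none exist in the JDK); the key points are that the filter works on code points rather than chars, that filters can be combined and negated, and that the trim loop steps by Character.charCount so that supplementary characters are handled correctly:

```java
// Hypothetical filter over Unicode code points.
@FunctionalInterface
interface CodePointFilter {
    boolean accepts(int codePoint);

    // Combinators, so filters can be combined and negated.
    default CodePointFilter or(CodePointFilter other) {
        return cp -> this.accepts(cp) || other.accepts(cp);
    }
    default CodePointFilter negate() {
        return cp -> !this.accepts(cp);
    }

    // Predefined filters for the definitions discussed earlier.
    CodePointFilter JAVA_WHITESPACE = Character::isWhitespace;
    CodePointFilter UNICODE_SPACE = Character::isSpaceChar;
    CodePointFilter ISO_CONTROL = Character::isISOControl;
}

class FilterTrim {
    // Trims leading and trailing code points accepted by the filter,
    // stepping by charCount() to cope with surrogate pairs.
    static String trim(String s, CodePointFilter filter) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!filter.accepts(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!filter.accepts(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }
}
```

With this in place, trimming both “breaking” and “non-breaking” spaces becomes a one-liner: FilterTrim.trim(s, CodePointFilter.JAVA_WHITESPACE.or(CodePointFilter.UNICODE_SPACE)).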

An alternative approach might be to supply an explicit java.util.Set of the desired whitespace characters, but that’s not as convenient when you want to base the whitespace definition on an existing method such as Character.isWhitespace. In contrast, the “filter” approach can easily support building a filter from either such a method or from a given set of characters. So I think I’ve talked myself into the “filter” approach.

But more generally, is anyone else surprised by the String.trim() method’s definition? Or has everybody else known it all along? Or does nobody use String.trim anyway? Is everybody using Commons Lang’s “strip” or Spring’s “trimWhitespace” instead?

Or does nobody worry about which characters get trimmed when they trim whitespace?
