I’ve just read Christopher Owen’s very interesting article The Monthly Code Smell – Issue 1. During the last year or two I’ve encountered all of the points he raises, and have some further thoughts on them.
Sets as Filters
I’ve recently come across “sets as filters” in the context of Java 5 “Enums”.
The java.util.EnumSet class provides methods for easily constructing and manipulating sets of Enum values. When you need to do this sort of thing, EnumSet is easily overlooked but can sometimes be easier and cleaner than dealing with multiple enum values individually. It can also be a useful type for passing multiple enum values to a method.
In particular, EnumSet’s “complementOf” method is a neat way to obtain a set of all values except for some specified ones, without having to iterate the values or explicitly specify all of the non-excluded values (and then having to maintain that).
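A quick sketch of what that looks like in practice (the Status enum here is just invented for illustration):

```java
import java.util.EnumSet;

public class EnumSetDemo {
    // Hypothetical enum, purely for illustration
    enum Status { NEW, ACTIVE, SUSPENDED, CLOSED }

    public static void main(String[] args) {
        // Every status except the excluded one, without listing each survivor
        // (so adding a new Status value later needs no change here):
        EnumSet<Status> open = EnumSet.complementOf(EnumSet.of(Status.CLOSED));
        System.out.println(open); // contains NEW, ACTIVE, SUSPENDED
    }
}
```

The nice property is that the set stays correct as the enum grows, which is exactly the maintenance burden the explicit-list approach carries.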
The “ResourceReader” approach is one I’ve used myself, as I’ve previously described.
Christopher’s article did prompt me to go back and check what I’ve documented in my own interfaces and implementations. Sure enough, I found a minor omission – and also realised that my tests weren’t checking for distinct instances from repeated calls (as documented by the interface). So I’ve improved my code and tests as a result, and that’s always a good feeling.
However, I do think there tends to be a limit to how completely one can specify these sorts of things, because underneath it all you tend to hit JDK classes that don’t themselves document these issues (e.g. whether repeated calls return same or different instances, whether defensive copying is used or not etc).
This uncertainty tends to ripple out, leaving you with methods where all you can do is document that certain things are unspecified and client code shouldn’t make assumptions about them. I see this a lot with concurrency issues too – it’s hard to be precise about mutability, thread-safety etc when your code makes use of underlying JDK classes that aren’t explicit about it.
To some extent you can sometimes break the uncertainty with defensive copying, wrapping etc – but many JDK classes don’t lend themselves to that either, and it’s questionable how far it’s reasonable to go with this.
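For what it's worth, the defensive-copying idea in its simplest form looks like this – using java.util.Date as the classic mutable offender, with an invented Schedule class purely for illustration:

```java
import java.util.Date;

public class Schedule {
    private final Date deadline;

    public Schedule(Date deadline) {
        // Copy on the way in, so later changes to the caller's Date
        // can't reach our internal state
        this.deadline = new Date(deadline.getTime());
    }

    public Date getDeadline() {
        // Copy on the way out: callers get a distinct instance on each
        // call, and this can be documented with certainty
        return new Date(deadline.getTime());
    }

    public static void main(String[] args) {
        Schedule s = new Schedule(new Date(0L));
        System.out.println(s.getDeadline());
    }
}
```

But as noted above, this only works where the type in question can actually be copied or wrapped cheaply, which many JDK classes can't.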
The “getBytes” trap is a good one too.
It’s not just “getBytes” and “charsets” either. There are various other JDK methods that ideally take a charset, but also have overloads that can be called without one (e.g. java.util.Formatter constructors, java.io.PrintStream and PrintWriter constructors, java.net.URLDecoder’s “decode” method and java.net.URLEncoder’s “encode” method etc).
Much the same issues also apply to methods that can take a Locale (e.g. String’s “toUpperCase” and “toLowerCase” methods). And, of course, you may well have methods of this kind in your own code and other libraries. Even Java source files can be misinterpreted by javac if the source file’s encoding doesn’t match what javac assumes – which, as per “getBytes”, defaults to a platform-dependent value (as I’ve previously blogged about here).
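The classic demonstration of the Locale case is the Turkish locale, where ASCII ‘i’ upper-cases to a dotted capital ‘İ’ (U+0130) – so locale-sensitive case mapping can silently corrupt things like identifiers or protocol keywords:

```java
import java.util.Locale;

public class LocaleDemo {
    public static void main(String[] args) {
        String id = "title";
        // Under the Turkish locale, 'i' maps to dotted capital 'İ':
        System.out.println(id.toUpperCase(new Locale("tr"))); // TİTLE
        // An explicit fixed locale gives the ASCII result regardless
        // of the platform's default locale:
        System.out.println(id.toUpperCase(Locale.ENGLISH));   // TITLE
    }
}
```

The point being that without the explicit Locale argument you get whichever of those two results the platform default happens to produce.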
As Christopher points out, this is all made worse when the JDK methods involved are specified as having “undefined” behaviour when things go wrong. For these methods you can’t even reliably detect failures, can’t easily test that you’re handling any failures correctly, and can’t easily check that you’re catering for the different ways such failures might manifest themselves.
As a general rule I aim to always supply a charset, locale etc for all calls to all methods that can take such arguments. Similarly, where my own methods need such arguments I try to avoid providing overloads that assume a default, and instead force all callers to specify the charset or locale they want to use, or make it otherwise configurable (e.g. as a property of the relevant object). At worst, if a default is supported it should be something reasonably “safe” like UTF-8, shouldn’t vary depending on the underlying JVM or platform, and should be clearly documented.
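As a rough sketch of the “force the caller to decide” approach – this Config class is invented for illustration, and deliberately has no constructor that falls back to the platform default:

```java
import java.nio.charset.Charset;

public class Config {
    private final Charset charset;

    // Charset is mandatory: there is deliberately no no-arg constructor
    // that would silently pick up the platform default
    public Config(Charset charset) {
        this.charset = charset;
    }

    public byte[] encode(String text) {
        // Note: getBytes(Charset) is a Java 6+ overload; on Java 5 you'd
        // pass charset.name() to the String-named overload instead
        return text.getBytes(charset);
    }
}
```

The same shape works for a locale, or for any other “ambient default” you want surfaced as an explicit, documented decision.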
However, it’s all too easy to miss calls without a charset or locale to methods where they are optional. Static analysis tools like FindBugs, PMD, Checkstyle etc should be able to help with this. FindBugs already has rules to spot some of these situations (e.g. use of String.toUpperCase or String.toLowerCase without a Locale). In practice this isn’t comprehensive, and at best it would still depend on knowing exactly which methods you need to avoid.
If I can find some time I might try and pin this down a bit further, and either develop suitable rules or raise it on FindBugs and/or PMD forums. Unless anyone else wants to tackle this in the meantime…
Beyond that, test cases can help a lot, but as you can gather from Christopher’s article and the above discussion it’s non-trivial to make sure these hit the spot.
My own approach is to try and make all character encodings and locales either configurable or supplied as method arguments rather than hard-coded. Then I have tests that iterate through a set of character encodings that are varied enough to spot most problems, and using some example strings that include “tricky” non-latin characters. But where failures have “undefined” behaviour, checking the results and testing the error handling still has to cater for all possible exceptions and apparently-normal-but-incorrect results. In particular it can be tricky to test the exception handling where the underlying JVM might or might not throw the relevant exception. It isn’t pretty, but I’ve seen bugs and portability problems creep through where I haven’t done this.
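The core of such a test can be sketched like this – a round-trip check over a handful of deliberately varied encodings, with a sample string containing non-latin characters (the Charset-taking String constructor and getBytes overload shown are Java 6+; on Java 5 use the charset-name String variants):

```java
import java.nio.charset.Charset;

public class RoundTripCheck {
    // Encodings varied enough to expose most single-charset assumptions
    static final String[] CHARSETS = { "UTF-8", "UTF-16", "ISO-8859-1" };
    // Mix of plain latin, accented latin and non-latin characters
    static final String SAMPLE = "caf\u00e9 \u65e5\u672c";

    public static boolean roundTrips(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] bytes = text.getBytes(cs);
        // An "apparently normal but incorrect" result shows up here as a
        // mismatch (unmappable characters are silently replaced, typically
        // with '?', rather than throwing)
        return text.equals(new String(bytes, cs));
    }

    public static void main(String[] args) {
        for (String name : CHARSETS) {
            System.out.println(name + ": " + roundTrips(SAMPLE, name));
        }
    }
}
```

Note how ISO-8859-1 fails this check silently rather than with an exception – which is precisely the “apparently-normal-but-incorrect results” case that straightforward exception-based testing misses.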
As Christopher points out, passing a charset to methods such as “getBytes” also leaves you having to deal with potential UnsupportedEncodingExceptions even when you know the encoding is guaranteed to be present. Which takes us back to how to handle impossible exceptions, my own views on which I’ve previously blogged about here.
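One common idiom for this – shown here as a sketch, not the only reasonable policy – is to translate the “impossible” checked exception into an unchecked error, since the charsets named in the Charset javadoc (UTF-8 among them) are required to be present in every JVM:

```java
import java.io.UnsupportedEncodingException;

public class Utf8 {
    public static byte[] encode(String text) {
        try {
            return text.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // "Can't happen": every JVM is required to support UTF-8,
            // so reaching this line indicates a broken JVM
            throw new AssertionError(e);
        }
    }
}
```

This keeps the clutter in one place instead of forcing every caller to declare or swallow an exception that can never actually occur.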
That smells good…
I’m really looking forward to further “Monthly Code Smells” from Christopher – hope he manages to keep it up.