Java represents strings using UTF-16, so one might assume that its “trim” method for trimming whitespace would be based on Unicode’s view of which characters are whitespace. Or on Java’s. Or would at least be consistent with other JDK methods.
To my surprise, I’ve just realised that’s far from the case.
The String.trim() method talks about “whitespace”, but defines this in a very precise but rather crude and idiosyncratic way – it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.
This results in it trimming the U+0020 space character and all “control code” characters below U+0020 (including the U+0009 tab character), but not the control codes or Unicode space characters that are above that.
- Some of the characters below U+0020 are control codes that I wouldn’t necessarily always want to regard as whitespace (e.g. U+0007 bell, U+0008 backspace).
- There are further control codes in the range U+007F to U+009F, which String.trim() treats as non-whitespace.
- There are plenty of other Unicode characters above U+0020 that should normally be recognized as whitespace (such as U+2003 EM SPACE, U+2007 FIGURE SPACE, U+3000 IDEOGRAPHIC SPACE).
So whilst String.trim() does trim tabs and spaces, it also trims some characters that you might not expect to be treated as whitespace, whilst ignoring other whitespace characters.
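To make that concrete, here is a small sketch that wraps a short string in each of a handful of characters and reports whether String.trim() removes them. The class and method names are my own, purely for illustration.

```java
// Illustrative only: shows which characters String.trim() strips from the ends of a string.
class TrimDemo {
    // Surrounds "text" with the given character and checks whether trim() removed it.
    static boolean isTrimmed(int codeUnit) {
        String pad = String.valueOf((char) codeUnit);
        return (pad + "text" + pad).trim().equals("text");
    }

    public static void main(String[] args) {
        // Space, tab, bell, backspace, then some control codes and spaces above U+0020.
        int[] samples = {0x0020, 0x0009, 0x0007, 0x0008, 0x007F, 0x0085, 0x00A0, 0x2003, 0x2007, 0x3000};
        for (int c : samples) {
            System.out.printf("U+%04X trimmed by String.trim(): %b%n", c, isTrimmed(c));
        }
    }
}
```

Going by the Javadoc's definition, everything up to and including U+0020 should report true and everything above it false.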
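This seems far from ideal, and not what you might expect from a method whose headline says “… with leading and trailing whitespace omitted”. Nor is it consistent with the other definitions of whitespace and control characters elsewhere in the JDK: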
- The Character.isWhitespace(char) and Character.isWhitespace(int) methods are defined in terms of which characters are “whitespace according to Java”. This in turn is specified as the characters classified by Unicode as whitespace except for a few Unicode space characters that are “non-breaking” (though quite why these should always be considered to be non-whitespace isn’t obvious to me), plus a specified list of some other characters that aren’t classified as whitespace by Unicode but which you’d normally want to regard as whitespace (such as U+0009 TAB).
- The Character.isSpaceChar(char) and Character.isSpaceChar(int) methods test whether a Unicode character is “specified to be a space character by the Unicode standard”.
- The deprecated Character.isSpace(char) method tests for 5 specific characters that are “ISO-LATIN-1 white space”. Ironically, I suspect this deprecated method’s idea of whitespace is what many people are imagining when they use the non-deprecated String.trim() method.
- The Character.isISOControl(char) and Character.isISOControl(int) methods test for the control codes below U+0020 whilst also recognising the control codes in the U+007F to U+009F range.
One can argue over which of these is the best definition of whitespace for any particular purpose, but the one thing that does seem clear is that String.trim() isn’t consistent with any of them, and doesn’t do anything particularly meaningful. It certainly doesn’t seem special enough to deserve being the String class’s only such “trim” method, and having a name that doesn’t indicate what set of characters it trims.
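A quick way to see the mismatch is to print how each definition classifies a few sample characters. This is only a sketch: the class and helper names are invented for this post, and the deprecated Character.isSpace is left out to avoid a deprecation warning.

```java
// Illustrative only: compares String.trim() with the Character classification methods.
class WhitespaceDefinitions {
    // True if String.trim() strips this char from the ends of a string.
    static boolean trimmedByTrim(char c) {
        return (c + "x" + c).trim().length() == 1;
    }

    // One-line summary of how each definition classifies the character.
    static String classify(char c) {
        return String.format("U+%04X trim=%b isWhitespace=%b isSpaceChar=%b isISOControl=%b",
                (int) c, trimmedByTrim(c), Character.isWhitespace(c),
                Character.isSpaceChar(c), Character.isISOControl(c));
    }

    public static void main(String[] args) {
        // Bell, tab, space, a high control code, no-break space, em space, figure space.
        int[] samples = {0x0007, 0x0009, 0x0020, 0x0085, 0x00A0, 0x2003, 0x2007};
        for (int c : samples) {
            System.out.println(classify((char) c));
        }
    }
}
```

For example, U+0007 (bell) is trimmed but is neither Java whitespace nor a Unicode space character, whilst U+00A0 (no-break space) is a Unicode space character that String.trim() leaves alone.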
There is an old, old entry for this in Sun’s bug database (bug ID 4080617). However, this was closed long ago as “not a defect”, on the basis that String.trim() does exactly what its Javadoc specifies (it trims characters which are not higher than U+0020). Never mind whether this is desirable or not, or how misleading it could be.
The most reasonable approach might be to add new methods to java.lang.String for “trimWhitespace” and “trimSpaceChars”, based respectively on the corresponding Character.isWhitespace and Character.isSpaceChar definitions of whitespace. Arguably it might also be worth having a “trimWhitespaceAndSpaceChars” method to trim all characters recognised as whitespace by either of those methods (because each includes characters that the other doesn’t, such as U+0009 TAB and Unicode’s non-breaking spaces, and sometimes you might want to treat all of these as whitespace).
It might also be safer if String.trim() were deprecated, as has been done with Character.isSpace(), possibly replacing it with a more accurately-named method for the existing behaviour (maybe “trimLowControlCodesAndSpace”?).
But in practice, the existing behaviour has long since been set in stone, and changing it now could have such widespread impact that it probably isn’t feasible.
As for me, I’ll be removing all use of String.trim() from my code and treating it as if it were deprecated, on the basis that it’s misleading, often inappropriate, and too easy to misuse.
That leaves me looking for an alternative.
There are some existing widely-used libraries with relevant methods:
- The Apache Commons Lang library has a StringUtils class with a “trim(String)” method that clearly documents that it trims “control characters” by using String.trim(), but also has a separate “strip(String)” method that trims whitespace based on Character.isWhitespace(char).
- The Spring framework has a StringUtils class with a “trimWhitespace(String)” method (and various other such “trim…” methods) which appears to be based on Character.isWhitespace(char). Its Javadoc doesn’t explicitly commit to any particular definition of “whitespace”, but it does refer to Character.isWhitespace(char) as a “see also”.
There are probably lots of other utility libraries with similar methods for this.
However, many of my projects don’t currently use these libraries, and introducing an additional library just for this doesn’t seem worthwhile. On top of which, some of my current code is critically dependent on which characters are trimmed, and “isWhitespace” might not always be what I want (e.g. if I want to treat both “breaking” and “non-breaking” spaces as whitespace).
Of course, this comes down to the usual arguments and trade-offs between using an existing library from elsewhere versus writing the code yourself (effort vs. further dependencies, licensing/redistribution, other useful facilities in the libraries, versioning, potential for “JAR hell” etc.).
At the moment my judgement for my own particular circumstances and current projects is to avoid any dependency on these libraries, and handle this myself instead.
So I’ll probably add “trimWhitespace” and “trimSpaceChars” methods to my own utility routines to use in place of String.trim(). Possibly also a “trimWhitespaceAndSpaceChars”.
These will just be convenience methods built on top of a more fundamental method that takes an argument specifying which characters to regard as whitespace. That in turn will be provided by an interface for “filters” for Unicode characters, with each filter instance indicating yes/no for each character passed to it. Some predefined filters can then be provided for various sets of characters (Unicode whitespace, Java whitespace, ISO control codes etc), and others can be constructed for particular requirements as necessary.
I’ll probably also include a mechanism for combining and negating such filters, so that I can define filters for various sets of characters but also use combinations of them when trimming. Ideally this all needs to cater for Unicode code points rather than just chars, so as to cope correctly with Unicode supplementary characters above U+FFFF and represented within Strings by pairs of chars (in case any of these ever need to be recognised as whitespace, or in any other use of such filters).
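As a rough sketch of what I have in mind, the following uses java.util.function.IntPredicate (Java 8 onwards) as the “filter” interface rather than a purpose-built one. That is an assumption made to keep the example short, and the class, constant and method names are all hypothetical. The trimming works in code points, so supplementary characters are handled as single units.

```java
import java.util.function.IntPredicate;

// Hypothetical utility: trims leading/trailing code points accepted by a filter.
class CodePointTrimmer {
    // Predefined filters, each answering yes/no for a Unicode code point.
    static final IntPredicate JAVA_WHITESPACE = Character::isWhitespace;
    static final IntPredicate UNICODE_SPACE = Character::isSpaceChar;
    static final IntPredicate ISO_CONTROL = Character::isISOControl;

    // The fundamental method: removes code points matching the filter from both ends.
    static String trim(String s, IntPredicate trimmable) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!trimmable.test(cp)) break;
            start += Character.charCount(cp); // 2 for supplementary characters
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!trimmable.test(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    // Convenience methods built on the fundamental one.
    static String trimWhitespace(String s) {
        return trim(s, JAVA_WHITESPACE);
    }

    static String trimSpaceChars(String s) {
        return trim(s, UNICODE_SPACE);
    }

    static String trimWhitespaceAndSpaceChars(String s) {
        // Filters combine with or/and/negate, which IntPredicate already provides.
        return trim(s, JAVA_WHITESPACE.or(UNICODE_SPACE));
    }

    public static void main(String[] args) {
        // Tab, no-break space, space, "text", em space, figure space.
        String s = "\t" + (char) 0x00A0 + " text" + (char) 0x2003 + (char) 0x2007;
        System.out.println("[" + trimWhitespaceAndSpaceChars(s) + "]");
        // Negation: trim everything that is NOT a letter or digit.
        IntPredicate notAlphanumeric = ((IntPredicate) Character::isLetterOrDigit).negate();
        System.out.println("[" + trim("--abc42!!", notAlphanumeric) + "]");
    }
}
```

With the sample string above, trimWhitespace stops at the no-break space and the figure space, trimSpaceChars stops at the leading tab, and only the combined filter gets all the way down to “text”.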
An alternative approach might be to supply an explicit java.util.Set of the desired whitespace characters, but that’s not as convenient when you want to base the whitespace definition on an existing method such as Character.isWhitespace. In contrast, the “filter” approach can easily support building a filter from either such a method or from a given set of characters. So I think I’ve talked myself into the “filter” approach.
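To illustrate that last point, here is a minimal sketch of both ways of obtaining a filter, again assuming java.util.function.IntPredicate as the filter type. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical helpers showing the two sources a filter can be built from.
class FilterSources {
    // Builds a filter from an explicit set of code points.
    static IntPredicate fromSet(Set<Integer> codePoints) {
        Set<Integer> copy = new HashSet<>(codePoints); // defensive copy
        return cp -> copy.contains(cp);
    }

    // Builds a filter from an existing classification method.
    static IntPredicate javaWhitespace() {
        return Character::isWhitespace;
    }

    public static void main(String[] args) {
        // An explicit set: ordinary space plus no-break space.
        IntPredicate explicit = fromSet(new HashSet<>(Arrays.asList(0x0020, 0x00A0)));
        IntPredicate either = explicit.or(javaWhitespace());
        System.out.println(explicit.test(0x00A0)); // in the set
        System.out.println(explicit.test(0x0009)); // tab is not in the set
        System.out.println(either.test(0x0009));   // but is Java whitespace
    }
}
```

Going the other way, from an arbitrary method to a Set, would mean enumerating every code point up front, which is part of why the filter seems the more natural starting point.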
But more generally, is anyone else surprised by the String.trim() method’s definition? Or has everybody else known it all along? Or does nobody use String.trim anyway? Is everybody using Commons Lang’s “strip” or Spring’s “trimWhitespace” instead?
Or does nobody worry about which characters get trimmed when they trim whitespace?