Java’s String.trim has a strange idea of whitespace

11 11 2008

Java represents strings using UTF-16, so one might assume that its “trim” method for trimming whitespace would be based on Unicode’s view of which characters are whitespace. Or on Java’s. Or would at least be consistent with other JDK methods.

To my surprise, I’ve just realised that’s far from the case.

The String.trim() method talks about “whitespace”, but defines this in a very precise but rather crude and idiosyncratic way – it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.

This results in it trimming the U+0020 space character and all “control code” characters below U+0020 (including the U+0009 tab character), but not the control codes or Unicode space characters that are above that.

Note that:

  • Some of the characters below U+0020 are control codes that I wouldn’t necessarily always want to regard as whitespace (e.g. U+0007 bell, U+0008 backspace).
  • There are further control codes in the range U+007F to U+009F, which String.trim() treats as non-whitespace.
  • There are plenty of other Unicode characters above U+0020 that should normally be recognized as whitespace (such as U+2003 EM SPACE, U+2007 FIGURE SPACE, U+3000 IDEOGRAPHIC SPACE).

So whilst String.trim() does trim tabs and spaces, it also trims some characters that you might not expect to be treated as whitespace, whilst ignoring other whitespace characters.
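As a concrete illustration, the following minimal sketch prints the code points that survive a call to String.trim(). The TrimDemo class and its codes helper are hypothetical names made up for this example; only String.trim() itself comes from the JDK.

```java
class TrimDemo {
    // Renders each char as U+XXXX so that invisible characters show up in the output.
    static String codes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BELL, TAB, SPACE and BACKSPACE are all at or below U+0020, so trim() removes them.
        System.out.println(codes("\u0007\t A \u0008".trim()));  // U+0041

        // EM SPACE (U+2003) and NEXT LINE (U+0085) are above U+0020, so they survive.
        System.out.println(codes("\u2003A\u0085".trim()));      // U+2003 U+0041 U+0085
    }
}
```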

This seems far from ideal, and not what you might expect from a method whose headline says “… with leading and trailing whitespace omitted”.

In contrast:

  • The Character.isWhitespace(char) and Character.isWhitespace(int) methods are defined in terms of which characters are “whitespace according to Java”. This in turn is specified as the characters classified by Unicode as whitespace except for a few Unicode space characters that are “non-breaking” (though quite why these should always be considered to be non-whitespace isn’t obvious to me), plus a specified list of some other characters that aren’t classified as whitespace by Unicode but which you’d normally want to regard as whitespace (such as U+0009 TAB).
  • The Character.isSpaceChar(char) and Character.isSpaceChar(int) methods test whether a Unicode character is “specified to be a space character by the Unicode standard”.
  • The deprecated Character.isSpace(char) method tests for 5 specific characters that are “ISO-LATIN-1 white space”. Ironically, I suspect this deprecated method’s idea of whitespace is what many people are imagining when they use the non-deprecated String.trim() method.
  • The Character.isISOControl(char) and Character.isISOControl(int) methods test for the control codes below U+0020 whilst also recognising the control codes in the U+007F to U+009F range.
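The differences between these definitions are easiest to see side by side. The sketch below prints, for a handful of sample characters, whether String.trim() would remove the character (that is, whether it is at or below U+0020) alongside the results of the Character methods listed above. The WhitespaceTable class, its row method and the choice of sample characters are assumptions made for illustration. The deprecated Character.isSpace is left out to avoid a deprecation warning.

```java
class WhitespaceTable {
    // One row per code point: removed by trim(), isWhitespace, isSpaceChar, isISOControl.
    static String row(int cp) {
        return String.format(
                "U+%04X trim=%b isWhitespace=%b isSpaceChar=%b isISOControl=%b",
                cp,
                cp <= ' ',                      // the rule String.trim() applies
                Character.isWhitespace(cp),
                Character.isSpaceChar(cp),
                Character.isISOControl(cp));
    }

    public static void main(String[] args) {
        // BELL, TAB, SPACE, NEXT LINE, NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE
        int[] samples = {0x0007, 0x0009, 0x0020, 0x0085, 0x00A0, 0x2003, 0x3000};
        for (int cp : samples) {
            System.out.println(row(cp));
        }
    }
}
```

None of the four columns agrees with any other across all of these samples, which is the inconsistency in question.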

One can argue over which of these is the best definition of whitespace for any particular purpose, but the one thing that does seem clear is that String.trim() isn’t consistent with any of them, and doesn’t do anything particularly meaningful. It certainly doesn’t seem special enough to deserve being the String class’s only such “trim” method, and having a name that doesn’t indicate what set of characters it trims.

There is an old, old entry for this in Sun’s bug database (bug ID 4080617). However, this was long-ago closed as “not a defect”, on the basis that String.trim() does exactly what its Javadoc specifies (it trims characters which are not higher than U+0020). Never mind whether this is desirable or not, or how misleading it could be.

The most reasonable approach might be to add new methods to java.lang.String for “trimWhitespace” and “trimSpaceChars”, based respectively on the corresponding Character.isWhitespace and Character.isSpaceChar definitions of whitespace. Arguably it might also be worth having a “trimWhitespaceAndSpaceChars” method to trim all characters recognised as whitespace by either of those methods (because each includes characters that the other doesn’t, such as U+0009 TAB and Unicode’s non-breaking spaces, and sometimes you might want to treat all of these as whitespace).

It might also be safer if String.trim() was deprecated, as has been done with Character.isSpace(), possibly replacing it with a more accurately-named method for the existing behaviour (maybe “trimLowControlCodesAndSpace”?).

But in practice, at this point the damage has long since been set in stone, and changing this now could have such widespread impact that it probably isn’t feasible.

As for me, I’ll be removing all use of String.trim() from my code and treating it as if it were deprecated, on the basis that it’s misleading, often inappropriate, and too easy to misuse.

That leaves me looking for an alternative.

There are some existing widely-used libraries with relevant methods:

  • The Apache Commons Lang library has a StringUtils class with a “trim(String)” method that clearly documents that it trims “control characters” by using String.trim(), but also has a separate “strip(String)” method that trims whitespace based on Character.isWhitespace(char).
  • The Spring framework has a StringUtils class with a “trimWhitespace(String)” method (and various other such “trim…” methods) which appears to be based on Character.isWhitespace(char). Its Javadoc doesn’t explicitly commit to any particular definition of “whitespace”, but it does refer to Character.isWhitespace(char) as a “see also”.

There are probably lots of other utility libraries with similar methods for this.

However, many of my projects don’t currently use these libraries, and introducing an additional library just for this doesn’t seem worthwhile. On top of which, some of my current code is critically dependent on which characters are trimmed, and “isWhitespace” might not always be what I want (e.g. if I want to treat both “breaking” and “non-breaking” spaces as whitespace).

Of course, this comes down to the usual arguments and trade-offs between using an existing library from elsewhere versus writing the code yourself (effort vs. further dependencies, licencing/redistribution, other useful facilities in the libraries, versioning, potential for “JAR hell” etc).

At the moment my judgement for my own particular circumstances and current projects is to avoid any dependency on these libraries, and handle this myself instead.

So I’ll probably add “trimWhitespace” and “trimSpaceChars” methods to my own utility routines to use in place of String.trim(). Possibly also a “trimWhitespaceAndSpaceChars”.

These will just be convenience methods built on top of a more fundamental method that takes an argument specifying which characters to regard as whitespace. That in turn will be provided by an interface for “filters” for Unicode characters, with each filter instance indicating yes/no for each character passed to it. Some predefined filters can then be provided for various sets of characters (Unicode whitespace, Java whitespace, ISO control codes etc), and others can be constructed for particular requirements as necessary.

I’ll probably also include a mechanism for combining and negating such filters, so that I can define filters for various sets of characters but also use combinations of them when trimming. Ideally this all needs to cater for Unicode code points rather than just chars, so as to cope correctly with Unicode supplementary characters above U+FFFF and represented within Strings by pairs of chars (in case any of these ever need to be recognised as whitespace, or in any other use of such filters).

An alternative approach might be to supply an explicit java.util.Set of the desired whitespace characters, but that’s not as convenient when you want to base the whitespace definition on an existing method such as Character.isWhitespace. In contrast, the “filter” approach can easily support building a filter from either such a method or from a given set of characters. So I think I’ve talked myself into the “filter” approach.
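A rough sketch of how the “filter” approach could look is given below. It is not the actual utility code: it assumes Java 8 or later and borrows java.util.function.IntPredicate as the filter interface (which postdates this post), and the Trimmer class, its constants and its method names are all hypothetical.

```java
import java.util.function.IntPredicate;

class Trimmer {
    // Predefined filters; each answers yes/no for a single Unicode code point.
    static final IntPredicate JAVA_WHITESPACE = Character::isWhitespace;
    static final IntPredicate UNICODE_SPACE = Character::isSpaceChar;
    static final IntPredicate ISO_CONTROL = Character::isISOControl;

    // Fundamental method: strips leading and trailing code points accepted by the filter.
    static String trim(String s, IntPredicate filter) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!filter.test(cp)) break;
            start += Character.charCount(cp);   // 2 for supplementary characters
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!filter.test(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    // Convenience methods built on the fundamental one.
    static String trimWhitespace(String s) { return trim(s, JAVA_WHITESPACE); }
    static String trimSpaceChars(String s) { return trim(s, UNICODE_SPACE); }
    static String trimWhitespaceAndSpaceChars(String s) {
        return trim(s, JAVA_WHITESPACE.or(UNICODE_SPACE));   // combined filter
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, TAB, SPACE, "hello", EM SPACE
        String s = "\u00A0\t hello\u2003";
        System.out.println(trimWhitespace(s).length());              // 8
        System.out.println(trimSpaceChars(s).length());              // 7
        System.out.println(trimWhitespaceAndSpaceChars(s).length()); // 5
    }
}
```

Combining and negating filters comes for free here through IntPredicate’s or, and and negate methods, and a filter for an explicit set of characters can be written as a lambda such as cp -> cp == 0x00A0. Working in code points rather than chars means a supplementary character is always tested and removed as a whole, never as half of a surrogate pair.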

But more generally, is anyone else surprised by the String.trim() method’s definition? Or has everybody else known it all along? Or does nobody use String.trim anyway? Is everybody using Commons Lang’s “strip” or Spring’s “trimWhitespace” instead?

Or does nobody worry about which characters get trimmed when they trim whitespace?

14 responses

12 11 2008
Jason Morris

I must say I’m not in the least bit surprised by String.trim()’s implementation. The basic problem they’re dealing with is that generally the control characters will appear before the “whitespace” you’re looking to trim off. If this is the case, the front of the string will still have a nice bit of whitespace at the beginning of it.

Part of the advantage of the trim() method (being built on top of substring()) is that the resulting String shares the char[] with the original String object (just with new offset and length values). This yields very good space / time performance. In order to trim the whitespace of a String while leaving the leading control characters in, you’d need a completely new char[].

12 11 2008
ClosingBraces

Jason, thanks for the reply.

I guess I’d be even more surprised if it retained leading and trailing non-whitespace control characters but then removed leading and trailing whitespace from the remainder. I agree that couldn’t have as neat or efficient an implementation as just removing the start and end of the string. Think it’d be questionable anyway (e.g. retaining trailing “backspace” whilst trimming whitespace that precedes it; whether to retain line-feeds and/or carriage-returns as control characters or remove them as whitespace).

So we’re probably still only talking about which characters are trimmed from the very start and end of the string. I can see the argument for trimming control codes as well as whitespace, but think it depends on the source and nature of the string and what you’re trying to do with it. Hence it still seems odd to have a single trim method that offers no choice in this matter but describes itself as trimming “whitespace”. And odd to specify this as all of the low-range control characters but not “DEL” and none of the U+0080 to U+009F ISO control codes (which include “pad”, “next line”, “reverse line feed” etc).

12 11 2008
Scott Vachalek

I think the way trim() defines whitespace would have been entirely intuitive to any developer in the pre-Unicode era. In the ASCII world the only debate would have been whether it should have included 127 or not, and most developers would have accepted ignoring the issue in favor of more or less doubling the performance. It’s just a sign of its age, but I think you’re right in never using it. Given the number of decisions that need to be made in Unicode I don’t think it makes sense to add a modern version; leave it to the utility classes.

13 11 2008
ClosingBraces

Scott,

Good point! I would have found this entirely intuitive in the “pre-Unicode” era.

Overall I think it’s great that Java has used Unicode from the start, but inevitable that with hindsight some of the early APIs now look a bit quirky or half-baked. Personally I think a lot of the core APIs are showing their age like this.

13 11 2008
ClosingBraces

Just found yet another definition of whitespace within the JDK: the predefined “\s” character class in regular expressions, as defined by java.util.regex.Pattern.

This takes whitespace as being the U+0020 space character, ‘\t’ (U+0009), ‘\n’ (U+000A), U+000B, ‘\f’ (U+000C), and ‘\r’ (U+000D).

That’s different again from String.trim, Character.isWhitespace, Character.isSpaceChar, and even the deprecated Character.isSpace (which is similar but doesn’t include U+000B).
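A quick way to check this is to compare the default “\s” class against Character.isWhitespace for a few characters. This is only a sketch: the RegexWhitespace class, its method names and the sample characters are made up for illustration, and the pattern is compiled without any flags.

```java
import java.util.regex.Pattern;

class RegexWhitespace {
    // The predefined whitespace class, compiled with default flags.
    static final Pattern WS = Pattern.compile("\\s");

    // True if the single code point is matched by the regex whitespace class.
    static boolean matchesRegexWhitespace(int cp) {
        return WS.matcher(new String(Character.toChars(cp))).matches();
    }

    static String row(int cp) {
        return String.format("U+%04X regex=%b isWhitespace=%b",
                cp, matchesRegexWhitespace(cp), Character.isWhitespace(cp));
    }

    public static void main(String[] args) {
        // TAB, VERTICAL TAB, FILE SEPARATOR, NO-BREAK SPACE, EM SPACE
        int[] samples = {0x0009, 0x000B, 0x001C, 0x00A0, 0x2003};
        for (int cp : samples) {
            System.out.println(row(cp));
        }
    }
}
```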

14 11 2009
ClosingBraces

Another little twist – Unicode’s definition of whitespace can itself change as Unicode evolves, so Java methods that say “as defined by Unicode” can potentially change as JDKs implement later versions of Unicode.

In particular, I’ve just tested some code against Sun’s “milestone 4” beta of JDK 7, and code that was treating U+200B ZERO WIDTH SPACE as Unicode whitespace is now treating that character as non-whitespace.

The Unicode 4.0.1 web-page says that the “general category” of this character has been changed (it’s changed from category Zs “Separator, Space” to category Cf “Other, Format”), which presumably accounts for this.

The relevant JDK method is specified in terms of Unicode “whitespace”, so even though the method hasn’t been explicitly changed in JDK 7, it covers a slightly different set of characters.

3 06 2010
johann azanza

I have been using String.trim and found no problems with it. I am just a normal programmer, while u guys sound and seem like “expert” programmers. So what is the problem then? I have been using it and found no problems so far.

3 06 2010
closingbraces

Hi Johann,

If it’s fine for your needs, fine, no problem. It’s just that if one is dealing with strings that can include control codes or non-ASCII characters, it doesn’t do quite what one might expect and isn’t entirely consistent with Java’s other “whitespace” methods.

For some people and situations that will matter (e.g. for “internationalized” strings), for others it won’t. As usual, you either need to know precisely what you want and how to achieve it, or you can just assume String.trim is roughly what you want and accept whatever it happens to do.

20 10 2010
g

huh???

I juz now it today, omg …!!
why… why why this happened on trim of java sdk. 😦

28 12 2010
priyavenkat

Hi that was a fantastic post!!Was wondering why trim() was not trimming a deliberately introduced unicode whitespace character and found the answer here! Hats off!

29 12 2010
priyavenkat

And adding to this was yet another post here

26 08 2011
Tim McCormack

I’m finding [\p{javaWhitespace}\p{Zs}] to be a pretty good range for trimming whitespace. As a utility class: https://gist.github.com/1173941

18 02 2013
Raj

13 03 2015
What Is Whitespace? - GodMod.Eu

[…] String class offers a trim command. But that command has a strange definition of whitespace. Read this blog post by Mike Kaufman for details. The upshot: ‘trim’ only deletes characters numbered 32 […]
