When is a string not a string?

by Jon Skeet, from Jon Skeet's coding blog

As part of my "work" on the ECMA-334 TC49-TG2 technical group, standardizing C# 5 (which will probably be completed long after C# 6 is out - but it's a start!) I've had the pleasure of being exposed to some of the interesting ways in which Vladimir Reshetnikov has tortured C#. This post highlights one of the issues he's raised. As usual, it will probably never impact 99.999% of C# developers - but it's a lovely little problem to look at.

Relevant specifications referenced in this post:
- The Unicode Standard, version 7.0.0 - in particular, chapter 3
- C# 5 (Word document)
- ECMA-335 (CLI specification)

What is a string?

How would you define the string (or System.String) type? I can imagine a number of responses to that question, from vague to pretty specific, and not all well-defined:

  • "Some text"
  • A sequence of characters
  • A sequence of Unicode characters
  • A sequence of 16-bit characters
  • A sequence of UTF-16 code units

The last of these is correct. The C# 5 specification (section 1.3) states:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
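
Just to illustrate that definition, here's a minimal sketch (mine, not from the original post) showing that a string's Length counts UTF-16 code units rather than what a user would think of as characters:

using System;

class Utf16CodeUnits
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
        // Plane, so it takes two UTF-16 code units (a surrogate pair).
        string clef = "\U0001D11E";
        Console.WriteLine(clef.Length);              // 2 - code units, not "characters"
        Console.WriteLine(((int) clef[0]).ToString("x4")); // d834 - high surrogate
        Console.WriteLine(((int) clef[1]).ToString("x4")); // dd1e - low surrogate
    }
}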

So far, so good. But that's C#. What about IL? What does that use, and does it matter? It turns out that it does. Strings need to be represented in IL as constants, and the nature of that representation is important, not only in terms of the encoding used, but how the encoded data is interpreted. In particular, a sequence of UTF-16 code units isn't always representable as a sequence of UTF-8 code units.

I feel ill (formed)

Consider the C# string literal "X\uD800Y". That is a string consisting of three UTF-16 code units:

  • 0x0058 - 'X'
  • 0xD800 - High surrogate
  • 0x0059 - 'Y'

That's fine as a string - it's even a Unicode string according to the spec (item D80). However, it's ill-formed (item D84). That's because the UTF-16 code unit 0xD800 doesn't map to a Unicode scalar value (item D76) - the set of Unicode scalar values explicitly excludes the high/low surrogate code points.

Just in case you're new to surrogate pairs: UTF-16 only deals in 16-bit code units, which means it can't cope with the whole of Unicode (which ranges from U+0000 to U+10FFFF inclusive). If you want to represent a value greater than U+FFFF in UTF-16, you need to use two UTF-16 code units: a high surrogate (in the range 0xD800 to 0xDBFF) followed by a low surrogate (in the range 0xDC00 to 0xDFFF). So a high surrogate on its own makes no sense. It's a valid UTF-16 code unit in itself, but it only has meaning when followed by a low surrogate.
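
If it helps to see the arithmetic, here's a minimal sketch (my own illustration, not part of the original post) of how a high/low pair combines into a single scalar value:

using System;

class SurrogatePairs
{
    static void Main()
    {
        char high = '\uD800';
        char low = '\uDC00';

        Console.WriteLine(char.IsHighSurrogate(high)); // True
        Console.WriteLine(char.IsLowSurrogate(low));   // True

        // The pair maps to a single Unicode scalar value:
        // 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
        int scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        Console.WriteLine("U+{0:X4}", scalar); // U+10000

        // The framework does the same combination for you.
        Console.WriteLine("U+{0:X4}", char.ConvertToUtf32(high, low)); // U+10000
    }
}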

Show me some code!

So what does this have to do with C#? Well, string constants have to be represented in IL somehow. As it happens, there are two different representations: most of the time, UTF-16 is used, but attribute constructor arguments use UTF-8.

Let's take an example:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "X\ud800Y";

    static void Main()
    {
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

The output of this code (under .NET) is:

Attribute: 0058 fffd fffd 0059
Constant: 0058 d800 0059

As you can see, the "constant" (Test.Value) has been preserved as a sequence of UTF-16 code units, but the attribute property has U+FFFD (the Unicode replacement character which is used to indicate broken data when decoding binary to text). Let's dig a little deeper and look at the IL for the attribute and the constant:

.custom instance void [System]System.ComponentModel.DescriptionAttribute::.ctor(string)
= ( 01 00 05 58 ED A0 80 59 00 00 )

.field private static literal string Value
= bytearray (58 00 00 D8 59 00 )

The format of the constant (Value) is really simple - it's just little-endian UTF-16. The format of the attribute is specified in ECMA-335 section II.23.3. Here, the meaning is:

  • Prolog (01 00)
  • Fixed arguments (for specified constructor signature)
    • 05 58 ED A0 80 59 (a single string argument as a SerString)
      • 05 (the length, i.e. 5, as a PackedLen)
      • 58 ED A0 80 59 (the UTF-8-encoded form of the string)
  • Number of named arguments (00 00)
  • Named arguments (there aren't any)

The interesting part is the "UTF-8-encoded form of the string" here. It's not valid UTF-8, because the input isn't a well-formed string. The compiler has taken the high surrogate, determined that there isn't a low surrogate after it, and just treated it as a value to be encoded in the normal UTF-8 way of encoding anything in the range U+0800 to U+FFFF inclusive.
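
Working that through by hand (my sketch, not part of the original post), the standard three-byte UTF-8 pattern applied to the code point 0xD800 produces exactly the bytes shown in the IL - which is precisely what well-formed UTF-8 is not allowed to do for surrogate code points:

using System;

class LoneSurrogateUtf8
{
    static void Main()
    {
        int cp = 0xD800; // the lone high surrogate, treated as an ordinary code point

        // Three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
        byte b1 = (byte) (0xE0 | (cp >> 12));         // 0xED
        byte b2 = (byte) (0x80 | ((cp >> 6) & 0x3F)); // 0xA0
        byte b3 = (byte) (0x80 | (cp & 0x3F));        // 0x80

        Console.WriteLine("{0:X2} {1:X2} {2:X2}", b1, b2, b3); // ED A0 80
    }
}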

It's worth noting that if we had a full surrogate pair, UTF-8 would encode the single Unicode scalar value being represented, using 4 bytes. For example, if we change the declaration of Value to:

const string Value = "X\ud800\udc00Y";

then the UTF-8 bytes in the IL are 58 F0 90 80 80 59 - where F0 90 80 80 is the UTF-8 encoding for U+10000. That's a well-formed string, and we get the same value for both the description attribute and the constant.
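
Again as a rough sketch of my own, the four-byte UTF-8 pattern applied to U+10000 gives the bytes quoted above:

using System;

class SurrogatePairUtf8
{
    static void Main()
    {
        int cp = 0x10000; // the scalar value represented by \ud800\udc00

        // Four-byte UTF-8 pattern: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        byte b1 = (byte) (0xF0 | (cp >> 18));          // 0xF0
        byte b2 = (byte) (0x80 | ((cp >> 12) & 0x3F)); // 0x90
        byte b3 = (byte) (0x80 | ((cp >> 6) & 0x3F));  // 0x80
        byte b4 = (byte) (0x80 | (cp & 0x3F));         // 0x80

        Console.WriteLine("{0:X2} {1:X2} {2:X2} {3:X2}", b1, b2, b3, b4); // F0 90 80 80
    }
}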

So in our original example, the string constant (encoded as UTF-16 in the IL) is just decoded without checking whether or not it's ill-formed, whereas the attribute argument (encoded as UTF-8) is decoded with extra validation, which detects the ill-formed code unit sequence and replaces it.

Encoding behaviour

So which approach is right? According to the Unicode specification (item C10) both could be fine:

When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition and shall not interpret such sequences as characters.

and

Conformant processes cannot interpret ill-formed code unit sequences. However, the conformance clauses do not prevent processes from operating on code unit sequences that do not purport to be in a Unicode character encoding form. For example, for performance reasons a low-level string operation may simply operate directly on code units, without interpreting them as characters. See, especially, the discussion under D89.

It's not at all clear to me whether either the attribute argument or the constant value "purports to be in a Unicode character encoding form". In my experience, very few pieces of documentation or specification are clear about whether they expect a piece of text to be well-formed or not.

Additionally, System.Text.Encoding implementations can often be configured to determine how they behave when encoding or decoding ill-formed data. For example, Encoding.UTF8.GetBytes(Value) returns the byte sequence 58 EF BF BD 59 - in other words, it spots the bad data and replaces it with U+FFFD as part of the encoding, so decoding this value will result in X U+FFFD Y with no problems. On the other hand, if you use new UTF8Encoding(true, true).GetBytes(Value), an exception will be thrown. The first constructor argument is whether or not to emit a byte order mark under certain circumstances; the second one is what dictates the encoding behaviour in the face of invalid data, along with the EncoderFallback and DecoderFallback properties.
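
Here's a small sketch of that difference in behaviour (my own illustration, not from the original post):

using System;
using System.Text;

class EncodingFallbacks
{
    static void Main()
    {
        string value = "X\ud800Y";

        // Default UTF-8 encoding: the lone surrogate is replaced with U+FFFD (EF BF BD).
        byte[] replaced = Encoding.UTF8.GetBytes(value);
        Console.WriteLine(BitConverter.ToString(replaced)); // 58-EF-BF-BD-59

        // throwOnInvalidBytes: true makes the encoder throw instead of replacing.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true, throwOnInvalidBytes: true);
        try
        {
            strict.GetBytes(value);
        }
        catch (EncoderFallbackException e)
        {
            Console.WriteLine("Encoding failed: " + e.Message);
        }
    }
}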

Language behaviour

So should this compile at all? Well, the language specification doesn't currently prohibit it - but specifications can be changed :)

In fact, both csc and Roslyn do prohibit the use of ill-formed strings with certain attributes. For example, with DllImportAttribute:

[DllImport(Value)]
static extern void Foo();

This gives an error when Value is ill-formed:

error CS0591: Invalid value for argument to 'DllImport' attribute

There may be other attributes this is applied to as well; I'm not sure.

If we take it as read that the ill-formed value won't be decoded back to its original form when the attribute is instantiated, I think it would be entirely reasonable to make it a compile-time failure - for attributes. (This is assuming that the runtime behaviour can't be changed to just propagate the ill-formed string.)

What about the constant value though? Should that be allowed? Can it serve any purpose? Well, the precise value I've given is probably not terribly helpful - but it could make sense to have a string constant which ends with a high surrogate or starts with a low surrogate - because it can then be combined with another string to form a well-formed UTF-16 string. Of course, you should be very careful about this sort of thing - read the Unicode Technical Report 36 "Security Considerations" for some thoroughly alarming possibilities.
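
For example, here's a minimal sketch (mine, not from the original post) of the kind of constant-splitting that can still produce a well-formed result:

using System;

class SplitConstants
{
    // Each half is ill-formed UTF-16 on its own...
    const string Left = "X\ud800";
    const string Right = "\udc00Y";

    // ...but concatenating them (even as a compile-time constant) gives a
    // well-formed string: X, the surrogate pair for U+10000, then Y.
    const string Combined = Left + Right;

    static void Main()
    {
        Console.WriteLine(Combined.Length); // 4 UTF-16 code units
        Console.WriteLine(char.IsSurrogatePair(Combined[1], Combined[2])); // True
    }
}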

Corollaries

One interesting aspect to all of this is that "string encoding arithmetic" doesn't behave as you might expect it to. For example, consider this method:

// Bad code!
string SplitEncodeDecodeAndRecombine(string input, int splitPoint, Encoding encoding)
{
    byte[] firstPart = encoding.GetBytes(input.Substring(0, splitPoint));
    byte[] secondPart = encoding.GetBytes(input.Substring(splitPoint));
    return encoding.GetString(firstPart) + encoding.GetString(secondPart);
}

You might expect that this would be a no-op so long as everything is non-null and splitPoint is within range - but if you happen to split in the middle of a surrogate pair, it's not going to be happy. There may well be other potential problems lurking there, depending on things like normalization form - I don't think so, but at this point I'm unwilling to bet too heavily on string behaviour.
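
To make that concrete, here's a small usage sketch (my own, not from the post) that splits right in the middle of a surrogate pair:

// Assuming the SplitEncodeDecodeAndRecombine method above is in scope:
string original = "X\ud800\udc00Y"; // X, U+10000 as a surrogate pair, Y

string roundTripped = SplitEncodeDecodeAndRecombine(original, 2, Encoding.UTF8);

// The split lands between the high and low surrogates, so each half is
// ill-formed on its own, and UTF-8 replaces each lone surrogate with U+FFFD.
Console.WriteLine(original == roundTripped); // False
Console.WriteLine(roundTripped);             // X, two replacement characters, Y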

If you think the above code is unrealistic, just imagine partitioning a large body of text, whether that's across network packets, files, or whatever. You might feel clever for realizing that without a bit of care you'd get binary data split between UTF-16 code units - but even handling that doesn't save you. Yikes.

I'm tempted to swear off text data entirely at this point. Floating point is a nightmare, dates and times - well, you know my feelings about those. I wonder what projects are available that only need to deal with integers, and where all operations are guaranteed not to overflow. Let me know if you have any.

Conclusion

Text is hard.

