Mastodawn

Nils Hayat Jun 30, 2023

Periodic reminder:

"Strings in Swift are Unicode correct"

It is not possible to have a String that is not valid Unicode, and so it is not possible for a String to fail UTF-8 conversion.

So there is no reason to use the failable `string.data(using: .utf8)!` that Swift bridges from NSString.

You can just use `Data(string.utf8)` and avoid the Optional.

In many cases, String(decoding:as:) is also a better choice than String(data:encoding:), but that depends on your use case a little.
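
A minimal sketch of the conversions mentioned in this post, assuming Foundation is available. The sample string is made up for illustration.

```swift
import Foundation

let string = "Grüße, 世界 👋"

// Bridged NSString API: returns Data?, so callers have to unwrap it.
let bridged: Data? = string.data(using: .utf8)

// Standard-library route: the utf8 view is a collection of UInt8,
// and Data can be built from any sequence of UInt8. No Optional involved.
let direct = Data(string.utf8)

// Going back: String(decoding:as:) never fails,
// while String(data:encoding:) returns an Optional.
let repaired = String(decoding: direct, as: UTF8.self)
let validated: String? = String(data: direct, encoding: .utf8)
```

Both routes should produce the same bytes for any String; the difference is only in the Optional the caller has to deal with.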

robb Jun 29, 2023

@cocoaphony I keep forgetting to use that overload and wish that particular idiom would throw a warning

Adam Wulf Jun 29, 2023

@cocoaphony I’d be very curious to hear what sort of use case would need either decoding method, or, when would those decoding methods give different results?

Rob Napier Jun 30, 2023

@adamwulf It's all about errors. Not every Data is valid UTF-8. What should happen if it isn't? `init(data:encoding:)` is Optional, and will return nil. `init(decoding:as:)` will decode any invalid bytes as � (REPLACEMENT CHARACTER).

If you know for certain that the data is valid, then `decoding` is the better choice. It's also good if you would rather have a partial, corrupted string than no string. For example, if "François" were Latin-1 encoded, would you rather get "Fran�ois" or nil? It depends.
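
The "François" case can be sketched like this. The byte array is an assumption: it is the Latin-1 encoding of the name written out by hand, where ç is the single byte 0xE7.

```swift
import Foundation

// "François" in Latin-1. The byte 0xE7 followed by "o" is not valid UTF-8.
let latin1Bytes: [UInt8] = [0x46, 0x72, 0x61, 0x6E, 0xE7, 0x6F, 0x69, 0x73]
let data = Data(latin1Bytes)

// Failable: the whole conversion is rejected when any byte is invalid.
let strict: String? = String(data: data, encoding: .utf8)

// Repairing: the invalid byte becomes U+FFFD and the rest is kept.
let lenient = String(decoding: data, as: UTF8.self)

// If the real encoding is known, decoding with it recovers the text.
let asLatin1: String? = String(data: data, encoding: .isoLatin1)

print(strict as Any, lenient, asLatin1 as Any)
```

Which result is preferable depends on whether a damaged string is more useful to the caller than no string at all.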

Chris Eidhof Jun 30, 2023

@cocoaphony @adamwulf don't know if this helps, we wrote this five years ago, but I think it's still relevant: https://www.objc.io/blog/2018/02/13/string-to-data-and-back/

Swift Tip: String to Data and Back

When to force-unwrap, when to check for nil

Adam Wulf Jun 30, 2023

@chris @cocoaphony TIL! Thank you both!

Remko Tronçon Jun 30, 2023

@cocoaphony Thanks! Immediately put that tip to good use: https://github.com/remko/age-plugin-se/commit/5fed056a5b89aaccedf51128a67eec545bb88d53

Avoid bridged/optional utf8 conversion · remko/age-plugin-se@5fed056

Michael Tsai Jun 30, 2023

@cocoaphony Maybe that’s true with pure Swift strings, but if there’s an NSString underneath, calling certain non-failable APIs can make it explode at runtime as Swift finds that one of its assertions was violated during indexing. You can also see this in that sometimes String.decodeCString() will fail when given a string’s UTF-8 if you don’t allow it to repair.

Rob Napier Jun 30, 2023

@mjtsai I don't understand what you mean here. How can you construct an NSString where the bridged `string.utf8` would fail? This would be a major stdlib/compiler bug.

Can you give an example of your decodeCString issue (that isn't a many-years deprecated API)? I'm not clear how that's related to `string.utf8`.

If you mean String(decoding:as:), that always repairs. That's what makes it different than String(data:encoding:).

Michael Tsai Jun 30, 2023

@cocoaphony I’m not saying that `string.utf8` fails. I’m saying that you can have a Swift String that isn’t valid Unicode. For example, this can happen if the NSString was created with invalid data (happened to me with data from e-mails) or if you modify it without respecting composed character sequences (happened to me when using NSRegularExpression with the `UTF16View`).

Michael Tsai Jun 30, 2023

@cocoaphony One of the places it will fail (crash) is calling `canBeConverted(to:)` on such an invalid String because Swift will belatedly discover that something isn’t right.

There is no issue with String.decodeCString(). Rather, you can use that to ensure that a String really *is* valid Unicode.

Karoy Lorentey Jul 2, 2023

@mjtsai @cocoaphony A Swift String value always, *always* contains valid Unicode data. String values that hold a bridged NSString instance validate its contents on every access, transparently checking for and correcting unpaired surrogates as needed. Converting a String to UTF-8 or UTF-16 can never fail, and always produces valid data.

If you see otherwise, please do report! This is a crucial invariant, any violations need to get fixed.

Karoy Lorentey Jul 2, 2023

@mjtsai @cocoaphony (The invariant does not apply when a String gets converted to NSString — for bridged strings, you get back the original instance in that case, which can of course hold invalid UTF-16 data.)

Miguel Arroz Jul 2, 2023

@lorentey @mjtsai @cocoaphony It would be nice to have a String.utf8Data to make this explicit and avoid both the optional from .data(using: .utf8) and the detour through Data(string.utf8) (which I didn't know could be done). Same for UTF-16.
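
A property like the one wished for here could be sketched as a small extension. The name `utf8Data` is hypothetical; it is not part of Foundation or the standard library.

```swift
import Foundation

extension String {
    /// Hypothetical convenience (not an existing API):
    /// the string's UTF-8 bytes as Data, with no Optional.
    var utf8Data: Data { Data(utf8) }
}

let payload = "naïve café".utf8Data
let roundTripped = String(decoding: payload, as: UTF8.self)
```

This only wraps `Data(string.utf8)`, so it adds discoverability rather than any new behavior.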

Karoy Lorentey Jul 3, 2023

@arroz @mjtsai @cocoaphony The `utf8` view is precisely that!

Miguel Arroz Jul 3, 2023

@lorentey @mjtsai @cocoaphony I believe so, but it’s not obvious, at least to me, that I can create a Data directly from that view.

Karoy Lorentey Jul 3, 2023

@arroz @mjtsai @cocoaphony What makes it non-obvious? Is there anything we could do to make that self-evident?

Michael Tsai Jul 3, 2023

@lorentey @arroz @cocoaphony To me, the part that’s non-obvious is whether there’s a performance penalty for using `Data(string.utf8)`. It *looks* like it’s going to do some sort of slow iteration over every byte.

Miguel Arroz Jul 3, 2023

@mjtsai @lorentey @cocoaphony That is a good reason. Another is that there's nothing here that tells me this is a collection of UInt8: https://developer.apple.com/documentation/swift/string/utf8view
Finally, if I want data from a string, and there's a data(using:) method on String, why would I even think about going through an intermediate UTF8View? People usually go for the most obvious and direct API they can find. If utf8 and utf16 never fail, there should be non-optional methods that provide the data directly.

String.UTF8View | Apple Developer Documentation

A view of a string’s contents as a collection of UTF-8 code units.

Karoy Lorentey Jul 3, 2023

@mjtsai @arroz @cocoaphony That isn’t a good instinct! The utf8 view is the most efficient way to access a String’s underlying code units, which (again!) are *always* valid UTF-8. Copying data out of it is going to be faster than any other option, whether or not Data’s initializer works properly.

Going through Foundation APIs can be really slow in comparison — depending on the API, it can involve transcoding the UTF-8 string to UTF-16, only to call some NSString API to convert it directly back.

Karoy Lorentey Jul 3, 2023

@mjtsai @arroz @cocoaphony The picture does change if the String is actually a bridged NSString — in that case, the old Foundation transcoding APIs can work without unnecessary roundtrips. I expect simply copying data out of the utf8 view will still work quite well though.

Michael Tsai Jul 3, 2023

@lorentey @arroz @cocoaphony Yeah, I have read the source and see why it actually is fast. Just sharing my first impression, which may have been from before Strings were UTF-8 under the hood. I think it would be good to explain this more in the docs. UTF8View doesn’t really say that it’s the native storage, and half of the struct’s docs are about how C interop “creates a buffer.” So that sounds like the underlying storage is not UTF-8 or that it can’t be directly accessed.

Michael Tsai Jul 3, 2023

@lorentey @arroz @cocoaphony And then the Data initializer talks about using the “elements of the sequence,” which sounds slower than the other initializers, which say they use “copied memory content.”

Rob Napier Jul 3, 2023

@lorentey @arroz @mjtsai While I'm not convinced that anything should be changed, I absolutely see the problem. There is an extremely strong bias in Swift programmers for explicit methods. That's how things are discovered. You have a String. You want a Data. There is no search term you would use on the String API docs that would tell you that you want Data(string.utf8). When you understand everything, it's quite obvious, but you can't discover it organically.

Rob Napier Jul 3, 2023

@lorentey @arroz @mjtsai This is also the case for pretty much everything new in the SwiftUI and "new Foundation" world. There is no way to "discover" how to use Regex (as I learned again today). You just have to find examples, and Apple doesn't give a lot of them.

In this case, you look for examples of String->Data, and you see years of `.data(using:)` examples because it's the ObjC way. Apple has to counter that with a torrent of "the right way."

Miguel Arroz Jul 3, 2023

@cocoaphony @lorentey @mjtsai It's not that obvious because of what @mjtsai said. One would naturally assume that .data(using:) would be the fastest implementation possible (ideally even sharing the same buffer when appropriate), while other methods might be less efficient.

Rob Napier Jul 3, 2023

@arroz @lorentey @mjtsai This is definitely also fair. Apple doesn't make a lot of performance promises, and historically hasn't provided a lot of source code, so we have to guess at what is most efficient, and methods on the type "feel" efficient.

When you know more, it feels obvious that there's no particular reason an NSString extension would be the most efficient way to interact with String, but Xcode completion doesn't make it obvious what are ObjC bridging extensions.

Karoy Lorentey Jul 3, 2023

@arroz @mjtsai @cocoaphony The complication, of course, is that the backing store may not be in a matching format, so there isn’t always a UTF-8 buffer available for direct access. When there is one, withContiguousStorageIfAvailable gives you read-only direct access to it.
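
A rough sketch of that access pattern, using only the standard library. The function name `byteSum(of:)` is made up for illustration, and the fallback path is an assumption about how a caller might handle strings without contiguous UTF-8 storage.

```swift
/// Hypothetical helper: sums the UTF-8 code units of a string.
func byteSum(of string: String) -> Int {
    // The closure runs only when contiguous UTF-8 storage is available;
    // otherwise the call returns nil.
    let fast = string.utf8.withContiguousStorageIfAvailable { buffer in
        buffer.reduce(0) { $0 + Int($1) }
    }
    // Fallback: iterate the utf8 view, which works for every String.
    return fast ?? string.utf8.reduce(0) { $0 + Int($1) }
}

let total = byteSum(of: "abc")
```

Either path sees the same code units, so the result does not depend on which one is taken.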

Michael Tsai Jul 3, 2023

@lorentey @cocoaphony OK, I will try to get this into a small project that can reproduce the problem. Here is the stack trace in case that gives you any ideas. It then prints “Swift/StringUTF16View.swift:147: Fatal error: String index is out of bounds”.

amonshiz Jul 3, 2023

@mjtsai @lorentey @cocoaphony Swift String going out of bounds on operations that seem to use old NSString Foundation methods bites us occasionally. We find that the issue usually comes down to mixing index types. If we stay entirely in the Swift String world it is fine, and the same goes for staying entirely in the NSString world. Mix and match and you are in for a bad time.

Michael Tsai Jul 3, 2023

@amonshiz @lorentey @cocoaphony Yeah, I had some code that would get an NSRange from an NSString method and then do something with the String’s UTF16View. I thought the offsets would be equivalent, but they are not always. (Perhaps because Swift is fixing up some bad Unicode?) It works much better to not mix the types of indexes. But then I had to add a cover method for `-substringWithRange:` to check the parameter because if there’s an NSInvalidArgumentException I can’t catch it in Swift.
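
One way such a cover method might look in Swift, as a sketch only. The function name and the exact checks are assumptions, not the code described in the post: it validates the NSRange against the UTF-16 length first, then converts to String indices, and returns nil rather than raising.

```swift
import Foundation

/// Hypothetical helper: returns nil when the NSRange does not fit the
/// string or cannot be expressed as a range of String indices.
func substring(of string: String, nsRange: NSRange) -> Substring? {
    let utf16Count = string.utf16.count
    guard nsRange.location != NSNotFound,
          nsRange.location >= 0, nsRange.length >= 0,
          nsRange.location <= utf16Count,
          nsRange.length <= utf16Count - nsRange.location,
          let range = Range(nsRange, in: string)
    else { return nil }
    return string[range]
}

let sample = "a👍b"   // 4 UTF-16 code units: the emoji is a surrogate pair
let hit = substring(of: sample, nsRange: NSRange(location: 1, length: 2))
let miss = substring(of: sample, nsRange: NSRange(location: 3, length: 5))
```

The explicit bounds check is redundant where `Range(_:in:)` already returns nil for out-of-range input, but it keeps the behavior predictable without relying on that.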

Rob Napier Jul 3, 2023

@mjtsai @amonshiz @lorentey Yeah, in my experience you should not mix NSRange methods (which all come from NSString) with String if at all possible. I don't believe there's any promise that an NSRange can ever be converted to a Range<String.Index>.

That's an interesting point about UTF16View mismatches. I have used that assumption before, and I see how it might be invalid in the presence of surrogate pairs. I'll have to dig into that one a bit more.

Michael Tsai Jul 3, 2023

@cocoaphony @amonshiz @lorentey Indeed, I stopped using `Range(nsRange, in: string)` because on certain OS versions that would crash instead of returning nil.

Karoy Lorentey Jul 3, 2023

@mjtsai @amonshiz @cocoaphony The way String fixes up invalid UTF-16 in NSString instances does not change the length of the string or invalidate offsets. It’s merely replacing unpaired surrogates with a single replacement character within the BMP.
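
The same kind of repair can be observed with the standard library alone. This sketch uses `String(decoding:as:)` on hand-written UTF-16 code units rather than a bridged NSString, so it illustrates the replacement behavior but does not exercise the bridging path itself.

```swift
// 0xD800 is a high surrogate with no low surrogate following it.
let units: [UInt16] = [0x61, 0xD800, 0x62]

// Decoding repairs the unpaired surrogate instead of failing.
let repaired = String(decoding: units, as: UTF16.self)

// The replacement is a single BMP code unit (U+FFFD),
// so the UTF-16 length and offsets are unchanged.
let repairedUnits = Array(repaired.utf16)
```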

Karoy Lorentey Jul 3, 2023

@mjtsai @cocoaphony This isn’t easily diagnosable without the code, but I’d really like to look at it — if there is a stdlib issue here, now is the time to resolve it for Swift 5.9.

Lukas Valenta Jul 3, 2023

@cocoaphony @nicklockwood and when is String(decoding:as:) better than String(data:encoding:), please?

Nick Lockwood Jul 3, 2023

@lvalenta @cocoaphony it sounds like it does error correction on invalid data rather than just returning nil? https://mastodon.social/@cocoaphony/110630843219077064

Lukas Valenta Jul 3, 2023

@nicklockwood @cocoaphony right, that would be it, thanks! Looking at these two options, it’s sad there’s no third one that throws an error you could handle, ideally one that also shows why decoding failed.
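
A throwing variant along those lines could be sketched with the standard library's `UTF8` codec. Everything named here (`InvalidUTF8Error`, `decodeUTF8Strictly`) is hypothetical, and reporting how many scalars were decoded before the failure is just one possible way to explain the error.

```swift
/// Hypothetical error type: records how far decoding got before failing.
struct InvalidUTF8Error: Error, Equatable {
    let scalarsDecoded: Int
}

/// Hypothetical helper: decodes UTF-8 and throws on the first invalid
/// sequence instead of returning nil or inserting U+FFFD.
func decodeUTF8Strictly<Bytes: Sequence>(_ bytes: Bytes) throws -> String
where Bytes.Element == UInt8 {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var result = ""
    var count = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            result.unicodeScalars.append(scalar)
            count += 1
        case .emptyInput:
            return result          // reached the end without errors
        case .error:
            throw InvalidUTF8Error(scalarsDecoded: count)
        }
    }
}

// Usage with the Latin-1 "François" bytes from earlier in the thread.
let latin1: [UInt8] = [0x46, 0x72, 0x61, 0x6E, 0xE7, 0x6F, 0x69, 0x73]
do {
    print(try decodeUTF8Strictly(latin1))
} catch {
    print("decoding failed:", error)
}
```

Decoding scalar by scalar is slower than the built-in initializers, so this is a sketch of the idea rather than something tuned for large inputs.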