The change in PR #699 is mostly about characters that should never reasonably appear in a name.
It started with just the space char (See #696).
I tried creating an nrs name with a non-breaking space to see if it would work, and it did.
Then I tried the tab char; it worked.
I tried a newline; it worked too.
So this needed exclusion for everything matching char::is_whitespace; no big deal, easy to do.
Then we also have control characters, easy to exclude using char::is_control, and there’s no reason to have control characters in any normal url, so that was an obvious exclusion to include.
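Both of those categories can be screened with the standard library alone. A minimal sketch; the function name and shape are mine, not necessarily what the PR does:

```rust
/// Sketch of the whitespace/control screen, assuming the check runs
/// per character; not necessarily how the PR implements it.
fn has_forbidden_char(name: &str) -> bool {
    name.chars().any(|c| c.is_whitespace() || c.is_control())
}

fn main() {
    assert!(has_forbidden_char("my\u{00A0}name")); // non-breaking space
    assert!(has_forbidden_char("my\tname"));       // tab
    assert!(has_forbidden_char("my\nname"));       // newline
    assert!(!has_forbidden_char("myname"));        // plain ascii is fine
}
```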
But then there’s also the zero width space, which is neither whitespace nor a control character, so we need some special exceptions. There are 30 of those that are pretty obviously never going to be part of any url; they were manually selected from the Unicode Character Database.
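Characters like that have to be listed explicitly. A hedged sketch of what such an exception table could look like; the code points below are illustrative picks from the zero-width/invisible range, not the PR’s actual 30:

```rust
/// Invisible code points that are neither whitespace nor control chars.
/// Illustrative subset only; not the PR's actual list of 30.
const INVISIBLE_EXCEPTIONS: &[char] = &[
    '\u{200B}', // ZERO WIDTH SPACE
    '\u{2060}', // WORD JOINER
    '\u{FEFF}', // ZERO WIDTH NO-BREAK SPACE (BOM)
];

fn is_forbidden(c: char) -> bool {
    c.is_whitespace() || c.is_control() || INVISIBLE_EXCEPTIONS.contains(&c)
}
```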
This is the point the PR gets to: it excludes the ‘obviously not gonna be in a url’ chars.
But we can (and probably will need to) go further.
There’s U+00C0 Latin Capital Letter A with Grave (À) vs U+0041 + U+0300 Latin Capital Letter A + Combining Grave Accent (À). They look identical but hash to different xorurls, yet both are linguistically valid and useful.
There are also homoglyphs: U+006F Latin Small Letter O (o) vs U+03BF Greek Small Letter Omicron (ο). They look identical but hash to different xorurls, and again both are linguistically valid and useful.
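You can see both problems at the byte level with nothing but the standard library. A small demo; hash_of here is just a stand-in for however the xorurl is really derived:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real xorurl derivation; any hash function
// shows the same divergence.
fn hash_of(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

fn main() {
    let precomposed = "\u{00C0}"; // À as a single code point
    let decomposed = "A\u{0300}"; // A followed by a combining grave accent

    // They render identically but the bytes differ, so the hashes differ.
    println!("{} vs {}", precomposed, decomposed);
    println!("{:?} vs {:?}", precomposed.as_bytes(), decomposed.as_bytes());
    assert_ne!(hash_of(precomposed), hash_of(decomposed));

    // Same story for the homoglyph pair: Latin o vs Greek omicron.
    assert_ne!(hash_of("\u{006F}"), hash_of("\u{03BF}"));
}
```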
Some questions are:
- how are people entering these variations?
- how do we display these variations to people?
- how do we convert the unicode to bytes?
- percent encode?
- punycode?
- upper or lowercase?
- utf-8?
- utf-16?
- something else?
- do we use unicode normalization? which one? (see the sketch after this list)
- how do we convert these bytes to an xorurl? (this is already solved)
- is it possible to convert back and forth between these representations?
- what are the risks and benefits of allowing confusables / homoglyphs / symbols / emoji etc? I always remind myself “I can only read English so would never be able to use Kanji urls, please do not limit your software to Kanji only”, then flip the two character sets to see that ascii-only is just as limiting for everyone else.
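For the normalization question specifically, here’s a sketch of what NFC buys and what it doesn’t. The unicode-normalization crate is my choice for illustration; the PR doesn’t settle on any of this:

```rust
// Assumed dependency in Cargo.toml: unicode-normalization = "0.1"
use unicode_normalization::UnicodeNormalization;

fn main() {
    // NFC folds combining sequences into precomposed code points,
    // so the two spellings of À become byte-identical...
    let a: String = "\u{00C0}".nfc().collect();
    let b: String = "A\u{0300}".nfc().collect();
    assert_eq!(a, b);

    // ...but no normalization form merges homoglyphs: Latin o and
    // Greek omicron stay distinct after NFC.
    let latin: String = "\u{006F}".nfc().collect();
    let greek: String = "\u{03BF}".nfc().collect();
    assert_ne!(latin, greek);
}
```

Normalization only settles the combining-character half of the question; confusables would need a separate policy on top (Unicode publishes confusables.txt and a skeleton algorithm in UTS #39 for exactly this), if we decide to handle them at all.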
There’s a lot written elsewhere on this topic. This post is a bit of an intro to show that nrs beyond A-Za-z0-9 has some tricky situations. It could easily be 10x longer…