⚓ T360575 Cannot use diacritics in Meeting URL and Group chat invite

Cannot use diacritics in Meeting URL and Group chat invite
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Create an event.
Add a URL with diacritics or non-Latin characters to the Meeting URL and Group chat invite fields. Use this, for example:
https://testchat.com/Iñtërnâtiônàlizætiønمثال

What happens?:
Receive the following errors:
Enter a valid chat URL.
Enter a valid event meeting URL.

Screenshot 2024-03-20 at 3.14.16 PM.png (1×2 px, 295 KB)

What should have happened instead?:

The URL should be accepted as valid

Software version (skip for WMF-hosted wikis like Wikipedia):
testing on betacluster

Also see related ticket T360396

Event Timeline

The application validates URLs using PHP's native FILTER_VALIDATE_URL filter, which is based on RFC 2396. Internationalized URLs are not covered by that RFC, which instead restricts the usable characters to ASCII, with percent-encoding for anything else. So this behaviour in PHP is intentional and documented; see for instance this bug report.

Part of the confusion is due to the fact that browsers tend to hide the percent-encoding from users as much as they can. For instance, take this URL: https://zh.wikipedia.org/wiki/永不放弃你. Phabricator does recognize it as a link, and if you open the page, you will see the exact same string in your browser's address bar (at least in Chrome and Firefox). However, if you then copy the URL again from the address bar and paste it elsewhere, e.g. here on phab, it will come out as https://zh.wikipedia.org/wiki/%E6%B0%B8%E4%B8%8D%E6%94%BE%E5%BC%83%E4%BD%A0 (percent-encoded). That's because your browser has transparently encoded the URL under the hood, and is only using the non-encoded version for presentational purposes.
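To make the encoding step concrete, here is a minimal Python sketch of the round trip described above. It only illustrates UTF-8 percent-encoding with the standard library; it is not the code the browser or the extension runs, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# Path segment from the example URL, as a reader sees it in the address bar.
readable = "永不放弃你"

# quote() encodes the text as UTF-8 and percent-escapes every resulting byte
# that is not an unreserved ASCII character.
encoded = quote(readable)
print(encoded)

# unquote() reverses the escaping, which is roughly what the address bar
# does when it shows the readable form.
decoded = unquote(encoded)
print(decoded == readable)
```

Both strings identify the same page; only the percent-encoded one is plain ASCII.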

As for actually fixing this: I guess it's doable, but I think it would be more of a feature request than a bug. Assuming that someone will copy the chat link from their browser or whatever service they're using, they should automatically get the percent-encoded version that will work everywhere. So, you would only be affected by this if you're manually typing the URL instead. Still, it should be possible to fix this. Whether it's a good idea to do so depends on what we're doing with the URLs: for instance, internationalised URIs can be used in HTML <a> elements and will work as expected, at least in modern browsers. However, it's hard to predict what happens when someone copies the URL and puts it somewhere, as it's up to the place where the URL is pasted to decide if they want to accept it or not. Same with returning the non-encoded URI from our API endpoints: whether the client will be able to recognize the URL is up to them. A related read is this SO thread.

All in all, it's more a question of "should we do it" than "is it possible". Perhaps we could wait and see if someone actually finds the current behaviour problematic?
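If the decision were to accept such URLs, one possible approach is to percent-encode the non-ASCII characters before handing the string to an ASCII-only validator. The Python sketch below illustrates that idea under stated assumptions: `percent_encode_non_ascii` and `looks_like_ascii_url` are hypothetical helpers, the latter is only a crude stand-in for a strict validator such as FILTER_VALIDATE_URL, and none of this is claimed to be what the merged patch does.

```python
from urllib.parse import quote, urlsplit


def percent_encode_non_ascii(url: str) -> str:
    """Percent-encode each non-ASCII character as UTF-8; leave ASCII untouched."""
    return "".join(ch if ord(ch) < 128 else quote(ch) for ch in url)


def looks_like_ascii_url(url: str) -> bool:
    """Hypothetical stand-in for an ASCII-only validator (not PHP's actual logic)."""
    if not url.isascii():
        return False
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


raw = "https://testchat.com/Iñtërnâtiônàlizætiønمثال"
normalized = percent_encode_non_ascii(raw)

print(looks_like_ascii_url(raw))         # False: contains non-ASCII characters
print(looks_like_ascii_url(normalized))  # True: same URL, ASCII-only form
```

Note that this sketch treats the whole string uniformly. A non-ASCII host name would need Punycode (IDNA) rather than percent-encoding, so a real implementation would have to handle the host separately.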

For instance, take this URL: https://zh.wikipedia.org/wiki/永不放弃你. Phabricator does recognize it as a link, and if you open the page, you will see the exact same string in your browser's address bar (at least in Chrome and Firefox). However, if you then copy the URL again from the address bar and paste it elsewhere, e.g. here on phab, it will come out as https://zh.wikipedia.org/wiki/%E6%B0%B8%E4%B8%8D%E6%94%BE%E5%BC%83%E4%BD%A0 (percent-encoded). That's because your browser has transparently encoded the URL under the hood, and is only using the non-encoded version for presentational purposes.

Hmmm, thanks for the research and thoughts here @Daimona. I see what you are saying about copy-pasting from the URL bar, but if you copy from pretty much anywhere else, or don't "copy as link address" and instead just do a Ctrl-C (or whatever the copy shortcut is on a given machine), it will not copy the percent-encoded value, and then the link will trigger the error shown in the description. I see this as a problem going forward, as we enable this on other wikis where non-Latin characters might be more prominent. That said, I don't know in what percentage of cases a client would be unable to recognize the URL, or whether this would be a problem in practice (seeing as that SO thread was from 2010). @ifried thoughts?

cmelo changed the task status from Open to In Progress. Apr 3 2024, 12:24 PM
cmelo claimed this task.

Change #1017844 had a related patch set uploaded (by Cmelo; author: Cmelo):

[mediawiki/extensions/CampaignEvents@master] fix diacritics urls on meetging and chat url

https://gerrit.wikimedia.org/r/1017844

Change #1017844 merged by jenkins-bot:

[mediawiki/extensions/CampaignEvents@master] fix diacritics urls on meeting and chat url

https://gerrit.wikimedia.org/r/1017844

✅ The URL https://testchat.com/Iñtërnâtiônàlizætiønمثال was accepted as valid in Meeting URL and Group chat invite. Marking this as done/resolved.

Screen Recording 2024-04-23 at 4.20.37 PM.gif (1×2 px, 3 MB)