Pipe 2T4 Unicode?


in ask on (#2T4)
I'm wondering what the long-term plans are for Unicode support on |.

From my limited experimentation with posting so far, it seems any attempt to post a comment which contains Unicode characters directly is rejected. (As are comments containing the ampersand character, so encoding sequences are also verboten.) I've heard that there is some potential for abuse, which is why 'some other news aggregation sites' forbid non-ASCII posts, but there are some potential benefits, such as better punctuation (e.g. em-dashes, 'proper' open/close quotation marks) and reproducing some non-English words and names of people. There could also be benefits if somebody is trying to reproduce mathematical or physical formulae.

There are almost certainly other features which are higher priority, but I'd be interested to know if there are any plans to support character sets outside of ASCII/basic European?
score 0
  • Closed (not a story)
3 comments

UTF-8 (Score: 1)

by bryan@pipedot.org on 2014-02-20 08:02 (#38)

The "Content-Type" header has been: "text/html; charset=utf-8" (as well as the equivalent meta tag) since the first day.

As for posting comments, I'm currently being "overly safe" and only allowing characters that can be typed on a US keyboard (minus the ampersand). The idea is that I'll eventually loosen the rules a bit for most Western languages and useful symbols (like the euro, pound, and yen).
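A minimal sketch of what such an "overly safe" check could look like, not the site's actual code. It assumes "typeable on a US keyboard" means printable ASCII plus space, newline, and tab; the names `ALLOWED` and `is_allowed_comment` are hypothetical.

```python
import string

# Assumption: "US keyboard" = ASCII letters, digits, punctuation,
# plus space, newline and tab. The ampersand is removed explicitly.
ALLOWED = set(
    string.ascii_letters + string.digits + string.punctuation + " \n\t"
) - {"&"}


def is_allowed_comment(text: str) -> bool:
    """Return True only if every character is in the allowed set."""
    return all(ch in ALLOWED for ch in text)
```

Loosening the rules later would then be a matter of adding characters (for example the euro, pound, and yen signs) to the allowed set.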

The full set is just a huge potential source of abuse with non-printing characters, right-to-left switching, and CJK characters. Within minutes of Soylent News adding it, for example, people were posting pages of braille and other crap.
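As a rough illustration of the abuse classes named above, the sketch below flags them using Python's standard `unicodedata` module. The function names, the labels, and the choice of which code points count as "bidi controls" are assumptions made for this example, and the name-prefix test for CJK is deliberately crude (it does not cover kana or hangul).

```python
import unicodedata

# Assumption: the explicit directional marks, embeddings, overrides
# and isolates are what "right-to-left switching" refers to.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def classify(ch):
    """Return a label for a risky character, or None if it looks harmless."""
    if ch in "\n\t":
        return None  # ordinary whitespace controls are tolerated
    if ch in BIDI_CONTROLS:
        return "bidi-control"
    # General categories starting with "C": control, format,
    # surrogate, private use, unassigned.
    if unicodedata.category(ch).startswith("C"):
        return "non-printing"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK "):
        return "cjk"
    if name.startswith("BRAILLE PATTERN"):
        return "braille"
    return None


def flag_comment(text):
    """Return the sorted list of distinct risk labels found in text."""
    return sorted({label for label in map(classify, text) if label is not None})
```

A comment containing only accented Latin letters or currency signs yields an empty list, while one containing a right-to-left override or a zero-width space is flagged.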

Re: UTF-8 (Score: 1)

by danieldvorkin@pipedot.org on 2014-02-21 20:04 (#46)

Yeah, letting things in a bit at a time seems like a good idea. Figure out what the "abuse threshold" is and stop just short of it, if possible. ;)

Re: UTF-8 (Score: 1)

by survivorz@pipedot.org on 2014-02-28 19:34 (#8C)

Hi. I have written code in the past, a certain set of magical regular expressions, that complies with the Internationalized Domain Names (IDN) version 2 standard and only allows UTF-8 letters and punctuation (no symbols).

If you want, just tell me where to send this.
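The code offered above is not shown in the thread. As a stand-in, here is a hedged sketch of the "letters and punctuation, no symbols" idea using Unicode general categories instead of regular expressions. It makes no claim of IDN compliance; the function name and the decision to allow plain spaces are assumptions.

```python
import unicodedata


def letters_and_punctuation_only(text: str) -> bool:
    """True if text holds only letters (L*), punctuation (P*) and spaces."""
    for ch in text:
        if ch == " ":
            continue  # assumption: plain spaces between words are fine
        major = unicodedata.category(ch)[0]
        if major not in ("L", "P"):
            return False
    return True
```

Note that Unicode classifies characters such as `+`, `$`, and the euro sign as symbols rather than punctuation, and digits as numbers, so a filter this strict would reject them too.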