Even our fonts will betray us?

According to Christina Warren at mashable.com, the switch to allowing non-Latin alphabet characters in web domains could give scammers a brand new toolkit. That's because browsers can't render many non-Latin characters, and the approximations may be doppelgangers for trusted sites. Alternatively, an address in an alphabet like Cyrillic, which shares certain letterforms with the Latin alphabet, can appear indistinguishable from pre-existing Latin-alphabet addresses:

[Image: Picture 2.png]


It's only fair that users of different alphabets get to register their own addresses, but clearly there needs to be some kind of fix here, either from ICANN or from the tech side. But Warren cites a TimesOnline article from two days ago in which it sounds like no one is really taking a hard look at solutions.

In fact, people have been working on this since at least as far back as 2001.

Also, take note of who exactly is being quoted in those articles; the TimesOnline article quotes a trademark lawyer and a representative of brand-protection agency MarkMonitor, rather than actual experts in "cyber-crime" (I hate that term). For example:

“They [Icann] seem to have started the process of allowing people to register domain names in non-Roman characters but don’t seem to have put in place anything that obligates any registry to safeguard trademark rights or the rights of legitimate businesses that use the same name,” Mr Bennett said.

A registry can't preemptively deny a registration on trademark grounds because trademarks are not unique: several different companies can hold the same trademark in different fields of business, in different countries, or even in different parts of the same country. Furthermore, FooBar Inc. has no right to interfere with, for example, my attempt to register foobar-sucks.com or whatever, and distinguishing that from a scammer's attempt to register foobar-inc.com would require actual human judgement.

The Wikipedia article "IDN homograph attack" seems to cover the basics of the issue reasonably well.

By Andrew G. (not verified) on 01 Jan 2010 #permalink

It's both a lot worse and a lot better than that.

And it's also not particularly new.

And the TimesOnline article is bad and misinformed. What else is new? (using a lawyer as a tech source? sheesh!)

First the bad.
The basic Latin and Greek alphabets occur several times in unicode for math/physics purposes. Thus, U+1D400 is Mathematical Bold Capital A, U+1D434 is Mathematical Italic Capital A, U+1D468 is Mathematical Bold Italic Capital A, etc.
Here are some of the Mathematical Capital A's: 𝐀 𝑨𝒜𝓐𝔄𝔸𝕬𝖠𝗔𝘈𝘼𝙰.
Here are some Mathematical Capital Alphas: 𝚨𝛢𝜜𝝖𝞐.
Here are various 0's: 𝟎𝟘𝟢𝟬𝟶.
And this is not a 'K' or a Kappa but the kelvin sign: K.
This is Latin Letter Small Capital A from the phonetic extensions: ᴀ.

And then there are the various accented versions of ordinary Latin characters (which are critical in some -- most -- European languages). Would you notice the dífference/difference? ;)
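To make the lookalike problem concrete, here is a minimal Python sketch that prints the code point, the official Unicode name, and the NFKC compatibility form of a few of the characters mentioned above. The helper name `describe` and the choice of samples are illustrative assumptions, and NFKC folding is only one possible way of spotting such characters, not what any particular browser does.

```python
import unicodedata

# A few characters from the comment above, written as escapes so the
# example does not depend on fonts or on the page encoding.
SAMPLES = [
    "\U0001D400",  # Mathematical Bold Capital A
    "\U0001D434",  # Mathematical Italic Capital A
    "\u212A",      # Kelvin sign
    "\u1D00",      # Latin Letter Small Capital A
    "\u00ED",      # Latin small letter i with acute
]

def describe(ch):
    """Return (code point, Unicode name, NFKC-normalized form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )

for ch in SAMPLES:
    print(describe(ch))
```

The mathematical letters and the kelvin sign fold to plain ASCII letters under NFKC, while the small capital A and the accented i do not, so normalization alone does not catch every lookalike.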

Then the good.
The good news is that it's not much of a problem in practice and it's going to become even less of a problem.

First of all, if the browser doesn't support IDN (Internationalized Domain Names) then they will look like http://xn--5cab8c.dk-hostmaster.dk/ instead of http://æøå.dk-hostmaster.dk/.
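The mapping between the two forms of that address can be reproduced with Python's built-in `idna` codec, as a rough sketch. The function names `to_ascii` and `to_unicode` are made up for this example, and the built-in codec follows the older IDNA 2003 rules, so treat it as an illustration rather than what a current browser does.

```python
def to_ascii(hostname):
    """Convert a Unicode hostname to its xn-- (ASCII-compatible) form."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname):
    """Convert an xn-- hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")

print(to_ascii("æøå.dk-hostmaster.dk"))
print(to_unicode("xn--5cab8c.dk-hostmaster.dk"))
```

Each dot-separated label is converted on its own, which is why only the first label changes.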

Secondly, not all top-level domains accept the full range of unicode characters -- the Danish .dk only accepts æøåäöüé (which are necessary for our language + loan words from Sweden and Germany).

Thirdly, the problem has been known for years -- we've been able to use æøå etc in Danish domains since 2004 -- and the browser writers are fully aware of the phishing possibilities. They tend to disallow the special interpretation of xn-- domains or severely filter them using whitelists if the registrar doesn't do it. The Danish ones do, only allowing the handful of letters that are both necessary for us and won't cause problems for Danes.
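A registry-side whitelist of that kind could look roughly like the sketch below. The extra letters are the ones listed above for .dk; the ASCII letters, digits and hyphen, the name `label_allowed`, and the lowercasing step are assumptions made for the example, not the registry's actual rules.

```python
import string

# Assumed base set (ASCII letters, digits, hyphen) plus the extra
# letters the comment lists for .dk.
ALLOWED = set(string.ascii_lowercase + string.digits + "-" + "æøåäöüé")

def label_allowed(label):
    """Return True if every character of a domain label is on the whitelist."""
    label = label.lower()
    return bool(label) and all(ch in ALLOWED for ch in label)

print(label_allowed("æøå"))            # Danish letters only
print(label_allowed("ex\u0430mple"))   # contains Cyrillic U+0430
```

Anything outside the small set is rejected outright, so a Cyrillic or mathematical lookalike never reaches the point where a human would have to tell it apart from a Latin letter.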

Fourthly, the better browsers already have other anti-phishing measures built-in. Microsoft wrote a lot about it on their blog during the development of IE7, for example. Script mixing in URLs is a major red flag for such systems.
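A crude version of a script-mixing check can be sketched in Python. It uses the first word of each character's Unicode name as a stand-in for its script, which is a simplifying assumption (real implementations would use proper Unicode script data), and the names `scripts_in` and `is_mixed_script` are invented for the example.

```python
import unicodedata

def scripts_in(label):
    """Collect a rough script tag for every letter in a domain label.

    The tag is just the first word of the Unicode character name,
    e.g. "LATIN" or "CYRILLIC". This is a heuristic, not real script data.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in(label)) > 1

print(scripts_in("ex\u0430mple"))      # Latin letters plus Cyrillic U+0430
print(is_mixed_script("ex\u0430mple"))
print(is_mixed_script("æøå"))
```

A check like this leaves ordinary accented names alone and only complains when a label draws letters from more than one alphabet.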

And the not new stuff:
Lots of domains have had this for years. The world has not ended. Browsers have protection built in. The registration authorities also try to protect against this. It has not caused major problems (not much compared to other kinds of phishing). For example, the URL can contain a username and password in front of the domain -- and if the domain is specified as an IP address then most people skip it and read the username/password part instead... which might look suspiciously like a domain name. This trick has worked since the nineties (but I think most browsers catch it now).
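The login-in-the-URL trick is easy to demonstrate with Python's standard `urllib.parse`. In URL syntax the login part sits before the `@`, and whatever follows it is the real host. The address below is invented for illustration (example.com and a documentation-range IP), and `real_host` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def real_host(url):
    """Return (actual host, login part) as a URL parser sees them."""
    parts = urlsplit(url)
    return parts.hostname, parts.username

# Looks like it points at www.example.com, but that part is only the login name.
url = "http://www.example.com@192.0.2.1/login"
print(real_host(url))
```

The parser reports the IP address as the host and treats the familiar-looking name as a username.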

Wikipedia covers the background pretty well.

By Peter Lund (not verified) on 01 Jan 2010 #permalink

Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?

Kevin, thanks also for the detailed reply - but I'm really not worried about trademark law as Mr. Bennett is. I'm more concerned because I think people tend to be amazingly credulous, which is why phishing works at all. So if the problem doesn't exist that's great. But if it does exist, I think it opens up more possibilities for people to be dumb.

"...looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least."

Nope. The problem is in the receiving end. The browser & OS need to support the character set and font.

I'm running Firefox and Crunchbang Linux, which have pretty good coverage. I could see almost all characters, only four Math A's are missing.

By Lassi Hippeläinen (not verified) on 01 Jan 2010 #permalink

What the heck is Cyrilliac? A digestive intolerance for Russian wheat?

Ah, then the problem is probably that I am approving comments on the go from my iPhone. But the fact that tech-savvy people running better systems will have fewer problems doesn't comfort me. I'm not worried about YOU falling victim to phishing, Lassi, but people like my mom (sorry mom).

"Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?"

Nope, it's a lack of fonts with sufficient coverage. 5 of the Math A/Alphas and 1 of the numbers are missing at my end, too. I have them covered by other fonts, though, so they displayed fine in the Character Map applet (Ubuntu 9.10).

What is ironic is that the scienceblogs.com software doesn't quite support UTF8 correctly and that most of its blogs don't use UTF8. The few that do were fixed manually because their bloggers knew what they were doing.

If a web page is using an 8-bit character set then someone has had to make a choice at some point of which tiny set of a few hundred characters to support directly. All the rest have to be supported by character entities (the ampersand-hash-digits-semicolon or ampersand-name-semicolon thingies). For example, 'å' has the name 'aring' and the number 229.
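Both entity forms can be produced with Python's standard library, as a small sketch. The function name `to_entities` is made up for the example, and the named form is only emitted for characters that appear in the HTML entity table.

```python
import html
from html.entities import codepoint2name

def to_entities(text):
    """Return (numeric form, named form) of text with non-ASCII replaced by entities."""
    # Numeric form: ampersand-hash-digits-semicolon
    numeric = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Named form: ampersand-name-semicolon, where a name exists
    named = "".join(
        f"&{codepoint2name[ord(ch)]};"
        if ord(ch) > 127 and ord(ch) in codepoint2name
        else ch
        for ch in text
    )
    return numeric, named

print(to_entities("å"))
print(html.unescape("&aring; and &#229;"))
```

Either form decodes back to the same character regardless of which 8-bit encoding the page itself uses.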

It's great fun when mixing stuff from different sources that don't fit in the same 8-bit encoding, such as the "Most German" list on scienceblogs :/

If you are not using UTF8 then you are doing it wrong.

By Peter Lund (Denmark) (not verified) on 02 Jan 2010 #permalink