bioephemera

a blog about the intersection of science, art, and culture by Jessica Palmer, PhD

Profile

Jessica Palmer has a PhD in Molecular Biology and has been blogging about the intersection of art and biology since 2006.

The contents of this blog are the personal opinions of the author, independent of any organizations with which she is affiliated, and should not be construed as professional advice.

Even our fonts will betray us?

Category: Blogosphere; Web 2.0, New Media, and Gadgets; Words; Yikes!
Posted on: January 1, 2010 9:08 PM, by Jessica Palmer

According to Christina Warren at mashable.com, the switch to allowing non-Latin alphabet characters in web domains could give scammers a brand new toolkit. That's because browsers can't render many non-Latin characters, and the approximations may be doppelgangers for trusted sites. Alternatively, an address in an alphabet like Cyrillic, which shares certain letterforms with the Latin alphabet, can appear indistinguishable from pre-existing Latin-alphabet addresses:

[Image: Picture 2.png]

Uh-oh.

It's only fair that users of different alphabets get to register their own addresses, but clearly there needs to be some kind of fix here, either from ICANN or from the tech side. But Warren cites a TimesOnline article from two days ago in which it sounds like no one is really taking a hard look at solutions.


Comments

1

In fact, people have been working on this since at least as far back as 2001.

Also, take note of who exactly is being quoted in those articles; the TimesOnline article quotes a trademark lawyer and a representative of brand-protection agency MarkMonitor, rather than actual experts in "cyber-crime" (I hate that term). For example:

“They [Icann] seem to have started the process of allowing people to register domain names in non-Roman characters but don’t seem to have put in place anything that obligates any registry to safeguard trademark rights or the rights of legitimate businesses that use the same name,” Mr Bennett said.

A registry can't preemptively deny a registration on trademark grounds because trademarks are not unique; several different companies can have the same trademark in different fields of business, or in different countries, or even in different parts of the same country; furthermore, FooBar Inc. has no right to interfere with, for example, my attempt to register foobar-sucks.com or whatever, and distinguishing that from a scammer's attempt to register foobar-inc.com would require actual human judgement.

The Wikipedia article "IDN homograph attack" seems to cover the basics of the issue reasonably well.

Posted by: Andrew G. | January 1, 2010 10:32 PM

2

It's both a lot worse and a lot better than that.

And it's also not particularly new.

And the TimesOnline article is bad and misinformed. What else is new? (using a lawyer as a tech source? sheesh!)


First the bad.
The basic Latin and Greek alphabets occur several times in Unicode for math/physics purposes. Thus, U+1D400 is Mathematical Bold Capital A, U+1D434 is Mathematical Italic Capital A, U+1D468 is Mathematical Bold Italic Capital A, etc.
Here are some of the Mathematical Capital A's: 𝖠𝑨𝒜𝓐𝔄𝔸𝕬𝖠𝗔𝘈𝘼𝙰.
Here are some Mathematical Capital Alphas: 𝚨𝛢𝜜𝝖𝞐.
Here are various 0's: 𝟎𝟘𝟢𝟬𝟶.
And this is not a 'K' or a Kappa but the kelvin sign (U+212A): K.
This is Latin Letter Small Capital A from the phonetic extensions: ᴀ.
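
To make the lookalikes above concrete, here is a minimal Python sketch using the standard-library unicodedata module. The helper name describe is made up for illustration. It prints each character's code point, its official Unicode name, and what compatibility normalization (NFKC) turns it into.

```python
import unicodedata

# Characters mentioned above, written as escapes so the source stays plain ASCII.
LOOKALIKES = ["\U0001D400", "\U0001D434", "\U0001D468", "\u212A", "\u1D00"]

def describe(ch):
    """Return (code point, Unicode name, NFKC-normalized form) for one character."""
    code = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch)
    folded = unicodedata.normalize("NFKC", ch)
    return (code, name, folded)

for ch in LOOKALIKES:
    print(describe(ch))
```

The mathematical A's and the kelvin sign fold back to a plain 'A' and 'K' under NFKC, which is one reason normalization is a common first step before comparing names. It does not help with lookalikes from a genuinely different script, such as Cyrillic.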

And then there are the various accented versions of ordinary Latin characters (which are critical in some -- most -- European languages). Would you notice the dífference/difference? ;)


Then the good.
The good news is that it's not much of a problem in practice and it's going to become even less of a problem.

First of all, if the browser doesn't support IDN (Internationalized Domain Names) then they will look like http://xn--5cab8c.dk-hostmaster.dk/ instead of http://æøå.dk-hostmaster.dk/.
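
For the curious, the xn-- form can be reproduced with Python's built-in "idna" codec, which implements the older IDNA 2003 rules. This is only a sketch; the function names to_ascii and to_unicode are mine, not part of any library.

```python
def to_ascii(hostname):
    """Encode a hostname label by label into its ASCII (xn--) form."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(hostname):
    """Decode an ASCII (xn--) hostname back into its Unicode form."""
    return hostname.encode("ascii").decode("idna")

# \u00e6\u00f8\u00e5 is the three-letter label from the example above.
name = "\u00e6\u00f8\u00e5.dk-hostmaster.dk"
encoded = to_ascii(name)
print(encoded)
print(to_unicode(encoded) == name)
```

A browser without IDN support simply shows the ASCII form, which is hard to mistake for a familiar address.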

Secondly, not all top-level domains accept the full range of Unicode characters -- the Danish .dk only accepts æøåäöüé (which are necessary for our language + loan words from Sweden and Germany).

Thirdly, the problem has been known for years -- we've been able to use æøå etc in Danish domains since 2004 -- and the browser writers are fully aware of the phishing possibilities. They tend to disallow the special interpretation of xn-- domains or severely filter them using whitelists if the registrar doesn't do it. The Danish ones do, only allowing the handful of letters that are both necessary for us and won't cause problems for Danes.
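
A registry-side whitelist of this kind could look roughly like the sketch below. The seven extra letters are the ones listed above; allowing ASCII letters, digits and the hyphen alongside them is my assumption, and the function name allowed_under_dk is made up. This is not the actual registry's code.

```python
import string

# Assumption: plain ASCII letters, digits and hyphen, plus the seven extra letters.
DK_EXTRA = "\u00e6\u00f8\u00e5\u00e4\u00f6\u00fc\u00e9"  # the seven letters listed above
ALLOWED = set(string.ascii_lowercase + string.digits + "-" + DK_EXTRA)

def allowed_under_dk(label):
    """True if the label is non-empty and every lowercased character is whitelisted."""
    return bool(label) and all(ch in ALLOWED for ch in label.lower())

print(allowed_under_dk("bl\u00e5b\u00e6rgr\u00f8d"))  # Danish letters only
print(allowed_under_dk("p\u0430ypal"))                # contains a Cyrillic 'a'
```

Because the whitelist contains no Cyrillic or Greek letters at all, a lookalike label is rejected before anyone has to judge whether it resembles an existing name.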

Fourthly, the better browsers already have other anti-phishing measures built-in. Microsoft wrote a lot about it on their blog during the development of IE7, for example. Script mixing in URLs is a major red flag for such systems.
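
As a rough illustration of what a script-mixing check does, here is a Python sketch. It is a crude heuristic of my own, not how any particular browser does it: Python's standard library has no script property, so it takes the first word of each letter's Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) as a stand-in.

```python
import unicodedata

def scripts_in_label(label):
    """Approximate the set of scripts used by the letters in a label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # First word of the Unicode name, e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1

# Hypothetical spoof: "paypal" with the first 'a' swapped for Cyrillic U+0430.
spoof = "p\u0430ypal"
print(scripts_in_label("paypal"), looks_mixed("paypal"))
print(scripts_in_label(spoof), looks_mixed(spoof))
```

A label written entirely in Cyrillic lookalikes would pass this check, which is why it is only one red flag among several and why registry whitelists still matter.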


And the not new stuff:
Lots of domains have had this for years. The world has not ended. Browsers have protection built in. The registration authorities also try to protect against this. It has not caused major problems (not much compared to other kinds of phishing). For example, a URL can contain a user name and password before the host -- and if the host is specified as an IP address then most people skip it and read the user name instead... which might look suspiciously like a domain name. This trick has worked since the nineties (but I think most browsers catch it now).
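
To see that older trick in action: in URL syntax the optional user name and password sit before an "@", and the real host is whatever follows it. A small Python sketch with the standard urllib.parse module shows the split. The URL is a made-up example, and 192.0.2.1 is an address reserved for documentation.

```python
from urllib.parse import urlsplit

# Hypothetical phishing-style URL: the part before "@" is only a user name.
URL = "http://www.example-bank.com@192.0.2.1/login"

def real_host(url):
    """Return the host a browser would actually connect to."""
    return urlsplit(url).hostname

parts = urlsplit(URL)
print("user name:", parts.username)
print("real host:", real_host(URL))
```

The familiar-looking name ends up in the user name field, while the connection goes to the IP address.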


Wikipedia covers the background pretty well.

Posted by: Peter Lund | January 1, 2010 11:25 PM

3

Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?

Kevin, thanks also for the detailed reply - but I'm really not worried about trademark law as Mr. Bennett is. I'm more concerned because I think people tend to be amazingly credulous, which is why phishing works at all. So if the problem doesn't exist, that's great. But if it does exist, I think it opens up more possibilities for people to be dumb.

Posted by: Jessica Palmer Author Profile Page | January 2, 2010 1:41 AM

4

"...looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least."

Nope. The problem is in the receiving end. The browser & OS need to support the character set and font.

I'm running Firefox and Crunchbang Linux, which have pretty good coverage. I could see almost all characters, only four Math A's are missing.

Posted by: Lassi Hippeläinen | January 2, 2010 2:31 AM

5

i think no absolutely no...it won't betray (:

Posted by: dizi izle | January 2, 2010 7:19 AM

6

What the heck is Cyrilliac? A digestive intolerance for Russian wheat?

Posted by: Jason R | January 2, 2010 7:31 AM

7

Ah, then the problem is probably that I am approving comments on the go from my iPhone. But that tech-savvy people running better systems will have fewer problems doesn't comfort me. I'm not worried about YOU falling victim to phishing, Lassi, but people like my mom (sorry, mom).

Posted by: Jessica Palmer Author Profile Page | January 2, 2010 8:50 AM

8

I think Cyrilliac is a character from Dragon Age, Jason.

Posted by: Jessica Palmer Author Profile Page | January 2, 2010 8:54 AM

9

"Peter, thanks for the detailed reply- looks though like our comment functionality has failed you, since most of the characters in your comment don't display on my computer at least. Ironic?"

Nope, it's a lack of fonts with sufficient coverage. 5 of the Math A/Alphas and 1 of the numbers are missing at my end, too. I have them covered by other fonts, though, so they displayed fine in the Character Map applet (Ubuntu 9.10).

What is ironic is that the scienceblogs.com software doesn't quite support UTF8 correctly and that most of its blogs don't use UTF8. The few that do were fixed manually because their bloggers knew what they were doing.

If a web page is using an 8-bit character set then someone has had to make a choice at some point of which tiny set of a few hundred characters to support directly. All the rest have to be supported by character entities (ampersand-hash-digits-semicolon or ampersand-name-semicolon thingies). For example, 'å' has the name 'aring' and the number 229.
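
If you want to check which number goes with which name, Python's standard html module can resolve the named form for you. This is a sketch; the helper name entity_number is mine.

```python
import html

def entity_number(name):
    """Return the decimal code point that the named entity &name; resolves to."""
    ch = html.unescape(f"&{name};")
    if len(ch) != 1:
        # html.unescape leaves unknown names untouched, so the result is longer than 1.
        raise ValueError(f"unknown or multi-character entity: {name}")
    return ord(ch)

for name in ("aring", "Aring", "AElig"):
    number = entity_number(name)
    # The named and numeric forms decode to the same character.
    assert html.unescape(f"&{name};") == html.unescape(f"&#{number};")
    print(f"&{name}; is the same as &#{number};")
```

Neighbouring letters such as 'å', 'Å' and 'Æ' have adjacent numbers, so they are easy to mix up when typed from memory.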

It's great fun when mixing stuff from different sources that don't fit in the same 8-bit encoding, such as the "Most German" list on scienceblogs :/
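
Here is a small Python sketch of what that mixing looks like in practice, assuming a page stored as ISO-8859-1 (Latin-1). The function name as_latin1_html is made up. The built-in "xmlcharrefreplace" error handler swaps any character the 8-bit set lacks for a numeric entity, whereas UTF-8 stores everything directly.

```python
def as_latin1_html(text):
    """Encode text as ISO-8859-1, falling back to numeric entities where needed."""
    return text.encode("latin-1", errors="xmlcharrefreplace")

# Danish a-ring (in Latin-1) next to Cyrillic de (not in Latin-1).
mixed = "\u00e5 \u0434"

print(as_latin1_html(mixed))   # the Cyrillic letter becomes a numeric entity
print(mixed.encode("utf-8"))   # both letters are stored directly as bytes
```

With the 8-bit encoding the two letters end up represented in two different ways on the same page. With UTF-8 they are treated alike.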

If you are not using UTF8 then you are doing it wrong.

Posted by: Peter Lund (Denmark) | January 2, 2010 10:09 AM

10

Here's the technical plenary presentation from the last Internet Engineering Task Force (IETF) meeting (November in Hiroshima); it's a PDF, and it's on this topic:
Internationalization in Names and Other Identifiers

Posted by: Barry Leiba | January 2, 2010 11:01 AM

11

Give Barry a white horse! :)

Posted by: Jessica Palmer Author Profile Page | January 2, 2010 11:21 AM

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.