Posts tagged: Unicode

Feb 28 2010

Accent Folding


A List Apart has been a steady source of thought-provoking inspiration over the years, not only from a website building perspective, but also because much of what they publish crosses boundaries and impacts other projects and interests in my life.

Their current article, Accent Folding, has a direct bearing on library data in general, and library catalogs in particular.  It deals with Unicode and pattern matching, namely how one builds search tools that tolerate variations in how users type words containing accents, stress marks, and other non-ASCII characters.  The most succinct example:

There is no excuse for your software to play dumb when the user types “cafe” instead of “café.”

The article presents methods of “normalizing” text to allow for proper matching, and should be read by anyone who gets to deal with library data for reports and searching aids.  If you know how to use regular expressions, you will likely be in for a treat.
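To give a feel for what "normalizing" can look like, here is a minimal sketch in Python.  It is my own illustration and not necessarily the approach the article takes: it assumes that decomposing text with the standard unicodedata module and dropping the combining marks is good enough, and the helper names fold_accents and matches are made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks (hypothetical helper, not from the article)."""
    # NFD splits a character such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters and non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, candidate: str) -> bool:
    """Compare two strings after folding accents and case."""
    return fold_accents(query).casefold() == fold_accents(candidate).casefold()


print(fold_accents("café"))      # cafe
print(matches("cafe", "Café"))   # True
```

A real catalog search would more likely fold both the index and the query in the same way, so that the original, accented data is still what gets displayed.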

The other example they present, this time to demonstrate the limitations of accent folding, uses Japanese to illustrate just how differently the same data can be presented:

These four sentences all say “Children like to watch television” in Japanese:

  • Kanji: 子供はテレビを見るのが好きです。
  • Hiragana: こども は てれび を みる の が すき です 。
  • Romaji: kodomo wa terebi o miru noga suki desu.
  • Cyrillic: кодомо ва тэрэби о миру нога суки дэсу.
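A rough sketch of why folding alone does not help here, assuming a simple fold built on Python's unicodedata module (my assumption, not the article's code; fold_accents is a made-up helper).  The kanji and hiragana sentences remain different strings after folding, so a search for one will not find the other:

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks (hypothetical helper for illustration)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Same sentence in two scripts; spaces removed from the hiragana version.
kanji = "子供はテレビを見るのが好きです。"
hiragana = "こどもはてれびをみるのがすきです。"

# Folding works character by character and knows nothing about readings,
# so the two spellings still do not match.
print(fold_accents(kanji) == fold_accents(hiragana))  # False
```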

Even if you don’t end up applying this directly to your work, the information in this article will deepen your appreciation of the challenges contained within your data, and of how tough it can be to make it “just work” sometimes.

Jan 24 2009

Fonterrific


A recent post, and the resulting discussion, on Metafilter has put me in a Unicode font frenzy.  A few links of note from there and elsewhere:

I have worked with an ILS that lacked Unicode support and was later upgraded to support it, and I have since changed jobs and now work with an ILS whose Unicode support is very limited.  That experience has given me a great appreciation for the benefits of Unicode.

Libraries should, in all that they do, attempt to store and present data in Unicode.  This includes our catalogs, web sites, and other data repositories.  Even if you offer very little outside of the standard Western characters, it makes your data that much more accessible and useful.
