Discussion:
utf8only and normalization properties
Haudy Kazemi
2009-08-12 23:17:44 UTC
Hello,

I'm wondering what are some use cases for ZFS's utf8only and
normalization properties. They are off/none by default, and can only be
set when the filesystem is created. When should they specifically be
enabled and/or disabled? (i.e. Where is using them a really good idea?
Where is using them a really bad idea?)

Looking forward, starting with Windows XP and OS X 10.5 clients, is
there any reason to change the defaults in order to minimize problems?

From the documentation at
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html :

utf8only
Boolean
Off
This property indicates whether a file system should reject file names
that include characters that are not present in the UTF-8 character code
set. If this property is explicitly set to off, the normalization
property must either not be explicitly set or be set to none. The
default value for the utf8only property is off. This property cannot be
changed after the file system is created.

normalization
String
None
This property indicates whether a file system should perform a unicode
normalization of file names whenever two file names are compared, and
which normalization algorithm should be used. File names are always
stored unmodified, names are normalized as part of any comparison
process. If this property is set to a legal value other than none, and
the utf8only property was left unspecified, the utf8only property is
automatically set to on. The default value of the normalization property
is none. This property cannot be changed after the file system is created.

Background: I've built a test system running OpenSolaris 2009.06 (b111)
with a ZFS RAIDZ1, with CIFS in workgroup mode. I'm testing with
Windows XP and Mac OS X 10.5 clients connecting via CIFS (no NFS or AFP).
I've set these properties during zfs create or immediately afterwards:
casesensitivity=mixed
compression=on
snapdir=visible

and ran this to set up nonrestrictive ACLs as suggested by Alan Wright
at the thread "[cifs-discuss] CIFS and permission mapping" at
http://opensolaris.org/jive/message.jspa?messageID=365620#365947
chmod A=everyone@:full_set:fd:allow /tank/home

Thanks!

-hk
Nicolas Williams
2009-08-12 23:48:06 UTC
Post by Haudy Kazemi
I'm wondering what are some use cases for ZFS's utf8only and
normalization properties. They are off/none by default, and can only be
set when the filesystem is created. When should they specifically be
enabled and/or disabled? (i.e. Where is using them a really good idea?
Where is using them a really bad idea?)
These are for interoperability.

The world is converging on Unicode for filesystem object naming. If you
want to exclude non-Unicode strings then you should set utf8only (some
non-Unicode strings in some codesets can look like valid UTF-8 though).
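
To illustrate the caveat in parentheses, here is a minimal Python sketch. It is not ZFS code; the helper name is_valid_utf8 is made up for this example, and Python's decoder is only a stand-in for whatever check the filesystem really performs. It shows one ISO 8859-1 name that fails a UTF-8 check and another that passes it by accident.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "cafe" with e-acute, encoded in ISO 8859-1: the lone 0xE9 byte is not valid UTF-8.
latin1_name = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
print(is_valid_utf8(latin1_name))             # False

# The two ISO 8859-1 characters U+00C3 U+00A9 encode to the bytes C3 A9,
# which happen to be the valid UTF-8 encoding of e-acute.
lookalike = "\u00c3\u00a9".encode("latin-1")  # b'\xc3\xa9'
print(is_valid_utf8(lookalike))               # True
```

So a UTF-8-only check catches most legacy-codeset names, but cannot catch all of them.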

But Unicode has multiple canonical and non-canonical ways of
representing certain characters (e.g., ´). Solaris and Windows
input methods tend to conform to NFKC, so they will interop even if you
don't enable the normalization feature. But MacOS X normalizes to NFD.

Therefore, if you need to interoperate with MacOS X then you should
enable the normalization feature.
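
As a rough illustration of why this matters (a Python sketch using the standard unicodedata module, not anything taken from ZFS; the file name and the same_name helper are invented for the example), the same visible name arrives as different bytes depending on whether the client sends a precomposed or a decomposed form:

```python
import unicodedata

# The same visible name, "a with acute accent" + ".txt", in two forms.
precomposed = unicodedata.normalize("NFC", "\u00e1.txt")  # e.g. typed on Windows
decomposed = unicodedata.normalize("NFD", "\u00e1.txt")   # e.g. sent by Mac OS X

print(precomposed.encode("utf-8"))  # b'\xc3\xa1.txt'
print(decomposed.encode("utf-8"))   # b'a\xcc\x81.txt'
print(precomposed == decomposed)    # False: a byte-wise comparison sees two names

def same_name(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two names after normalizing both (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(same_name(precomposed, decomposed))  # True
```

Without normalization, a file created from one client may not be found by name from the other.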
Post by Haudy Kazemi
Looking forward, starting with Windows XP and OS X 10.5 clients, is
there any reason to change the defaults in order to minimize problems?
You should definitely enable normalization (see above).

It doesn't matter what normalization form you use, but "nfd" runs faster
than "nfc".

The normalization feature doesn't cost much if you use all US-ASCII file
names. And it doesn't cost much if your file names are mostly US-ASCII.

Nico
--
Haudy Kazemi
2009-08-13 22:57:57 UTC
Post by Nicolas Williams
Post by Haudy Kazemi
I'm wondering what are some use cases for ZFS's utf8only and
normalization properties. They are off/none by default, and can only be
set when the filesystem is created. When should they specifically be
enabled and/or disabled? (i.e. Where is using them a really good idea?
Where is using them a really bad idea?)
These are for interoperability.
The world is converging on Unicode for filesystem object naming. If you
want to exclude non-Unicode strings then you should set utf8only (some
non-Unicode strings in some codesets can look like valid UTF-8 though).
But Unicode has multiple canonical and non-canonical ways of
representing certain characters (e.g., ´). Solaris and Windows
input methods tend to conform to NFKC, so they will interop even if you
don't enable the normalization feature. But MacOS X normalizes to NFD.
Therefore, if you need to interoperate with MacOS X then you should
enable the normalization feature.
Thank you for the reply. My goal is to configure the filesystem for the
lowest common denominator without knowing up front which clients will be
used. OS X and Win XP are listed because they are commonly used as
desktop OSes. Ubuntu Linux is a third potential desktop OS.

The normalization property documentation says "this property indicates
whether a file system should perform a unicode normalization of file
names whenever two file names are compared. File names are always
stored unmodified, names are normalized as part of any comparison
process." Where does the file system use filename comparisons and what
does it use them for? Filename collision checking? Sorting?

Is it used for any other operation, say when returning a filename to an
application? Would applications reading/writing files to a ZFS
filesystem ever notice the difference in normalization settings as long
as they produce filenames that do not conflict with existing names or
create invalid UTF8? The documentation says filenames are stored
unmodified, which sounds like things should be transparent to applications.

(In regard to filename collision checking, if non-normalized unmodified
filenames are always stored on disk, and they don't conflict in
non-normalized form, what would the point be of normalizing the
filenames for a comparison? To verify there isn't conflict in
normalized forms, and if there is no conflict with an existing file to
allow the filename to be written unmodified?)
Post by Nicolas Williams
Post by Haudy Kazemi
Looking forward, starting with Windows XP and OS X 10.5 clients, is
there any reason to change the defaults in order to minimize problems?
You should definitely enable normalization (see above).
It doesn't matter what normalization form you use, but "nfd" runs faster
than "nfc".
The normalization feature doesn't cost much if you use all US-ASCII file
names. And it doesn't cost much if your file names are mostly US-ASCII.
Nico
The ZFS documentation doesn't list the valid values for the
normalization property other than 'none'. From your reply and from the
official Unicode docs at
http://unicode.org/reports/tr15/ and
http://unicode.org/faq/normalization.html
would it be correct to conclude that none, NFD, NFC, NFKC, and NFKD are
the only valid values for the ZFS normalization property? If so, I
suggest they be added to the documentation at
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html

Thanks,

-hk
Nicolas Williams
2009-08-13 23:33:45 UTC
Post by Haudy Kazemi
Post by Nicolas Williams
Therefore, if you need to interoperate with MacOS X then you should
enable the normalization feature.
Thank you for the reply. My goal is to configure the filesystem for the
lowest common denominator without knowing up front which clients will be
used. OS X and Win XP are listed because they are commonly used as
desktop OSes. Ubuntu Linux is a third potential desktop OS.
Right, so set normalization=formD.
Post by Haudy Kazemi
The normalization property documentation says "this property indicates
whether a file system should perform a unicode normalization of file
names whenever two file names are compared. File names are always
stored unmodified, names are normalized as part of any comparison
process." Where does the file system use filename comparisons and what
does it use them for? Filename collision checking? Sorting?
The system does filename comparisons when doing lookups
(open("/foo/bar/baz", ...) does at least three such lookups, for
example), and on create (since that involves a lookup).

Yes, this is about collisions. Consider a file named "á" (that's "a"
with an acute accent). There are _two_ possible encodings for that name
in UTF-8. That means that you could have two files in the same
directory and with the same name, though they'd have different names if
you looked at the bytes that make up the names. That would be
confusing, at the very least.

To avoid such collisions you can enable normalization.

You can find more here:

http://blogs.sun.com/nico/entry/filesystem_i18n
Post by Haudy Kazemi
Is it used for any other operation, say when returning a filename to an
application? Would applications reading/writing files to a ZFS
No, directory listings always return the filename used when the file
name was created, without any normalization.
Post by Haudy Kazemi
filesystem ever notice the difference in normalization settings as long
as they produce filenames that do not conflict with existing names or
create invalid UTF8? The documentation says filenames are stored
unmodified, which sounds like things should be transparent to applications.
Applications shouldn't notice normalization being enabled. The only
reasons to disable normalization are: a) you don't want to force the use
of UTF-8, or b) you consistently use a single normalization form and you
don't want to pay a penalty for normalizing on lookup.

(b) is probably not a problem -- the normalization code is fast if you
use all US-ASCII strings, and it's linear with the number of non-ASCII,
Unicode codepoints in file names. But I don't have performance numbers
to share. I think that normalization should be enabled by default if
you enable utf8only, and utf8only should probably be enabled by default
in Solaris, but that's just my personal opinion.
Post by Haudy Kazemi
(In regard to filename collision checking, if non-normalized unmodified
filenames are always stored on disk, and they don't conflict in
non-normalized form, what would the point be of normalizing the
filenames for a comparison? To verify there isn't conflict in
normalized forms, and if there is no conflict with an existing file to
allow the filename to be written unmodified?)
Yes.
Post by Haudy Kazemi
The ZFS documentation doesn't list the valid values for the
normalization property other than 'none'. From your reply and from the
The zfs(1M) manpage lists them:

normalization = none | formD | formKCf

That's not all existing Unicode normalization forms, no. The reason for
this is that we only normalize on lookup (the file names returned by
readdir are not normalized), and for that the forms C and D are
semantically equivalent, but K and non-K forms are not semantically
equivalent, so we need one K form and one non-K form. NFD is faster
than NFC, but the K forms require a trip through form C, so NFKC is
faster than NFKD (at least if I remember correctly). Which means that
NFD and NFKC were sufficient, and there's no reason to ever want NFC or
NFKD.
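
A small Python sketch of that argument, using the standard unicodedata module rather than ZFS itself (the equivalent helper and the sample strings are made up for illustration): for pure comparison, NFC and NFD give the same answers, and NFKC and NFKD give the same answers, but the K and non-K pairs disagree on compatibility characters such as the "fi" ligature.

```python
import unicodedata

def equivalent(a: str, b: str, form: str) -> bool:
    """True if a and b compare equal after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonically equivalent: precomposed a-acute vs. "a" + combining acute.
canonical_pair = ("\u00e1", "a\u0301")
# Only compatibility-equivalent: the "fi" ligature (U+FB01) vs. plain "fi".
compat_pair = ("\ufb01le", "file")

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent(*canonical_pair, form), equivalent(*compat_pair, form))

# NFC True False
# NFD True False
# NFKC True True
# NFKD True True
```

Since only the comparison result matters on lookup, one form from each pair covers both behaviours.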
Post by Haudy Kazemi
suggest they be added to the documentation at
http://dlc.sun.com/osol/docs/content/ZFSADMIN/gazss.html
Yes, that's a good point.

PS: ZFS directories are hashed. When normalization is enabled, the
hash keys are normalized on create, but the hash contents are not,
so filenames remain unnormalized.
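
The idea in the PS can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the ToyDirectory class and its methods are invented here, a plain dict stands in for the on-disk hash, and none of it reflects the actual ZFS implementation.

```python
import unicodedata

class ToyDirectory:
    """Toy directory: keys are normalized, stored names are kept as created."""

    def __init__(self, form: str = "NFD"):
        self.form = form
        self._entries = {}  # normalized key -> name exactly as created

    def _key(self, name: str) -> str:
        return unicodedata.normalize(self.form, name)

    def create(self, name: str) -> None:
        key = self._key(name)
        if key in self._entries:
            # Same name in a different normalization form: treat as a collision.
            raise FileExistsError(name)
        self._entries[key] = name  # the stored name is NOT normalized

    def lookup(self, name: str):
        # Normalize only for the comparison; return the name as created.
        return self._entries.get(self._key(name))

    def listing(self):
        return list(self._entries.values())

d = ToyDirectory()
d.create("\u00e1")                       # precomposed a-acute
print(d.lookup("a\u0301") == "\u00e1")   # True: decomposed lookup finds it
print(d.listing() == ["\u00e1"])         # True: listing returns the original form
```

Creating "a" followed by a combining acute accent in the same toy directory raises FileExistsError, which is the collision case discussed earlier in the thread.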
Nicolas Williams
2009-08-27 23:24:22 UTC
So, the manpage seems to have a bug in it. The valid values for the
normalization property are:

none | formC | formD | formKC | formKD

Nico
--
