Discussion:
dedup accounting anomaly / dedup experiments
Lutz Schumann
2010-07-01 08:33:41 UTC
Permalink
Hello list,

I wanted to test deduplication a little and did an experiment.

My question was: can I dedupe infinitely, or is there an upper limit?

So for that I did a very basic test.
- I created a ramdisk-pool (1GB)
- enabled dedup and
- wrote zeros to it (in one single file) until an error is returned.
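
The steps above can be sketched as follows (a sketch only; the device, pool, and file names are assumptions, not taken from the test):

```shell
# Sketch of the described test; names are assumptions.
ramdiskadm -a ddtest 1g                    # 1 GB ramdisk (OpenSolaris)
zpool create ddpool /dev/ramdisk/ddtest    # pool backed by the ramdisk
zfs set dedup=on ddpool                    # enable deduplication

# write zeros into one single file until "no space left on device"
dd if=/dev/zero of=/ddpool/zeros bs=128k

zpool list ddpool    # allocated space and dedup ratio (pool view)
zfs list ddpool      # AVAIL as seen by the filesystem layer
```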

The size of the pool was 1046 MB; I was able to write 62 GB to it before it returned "no space left on device". The block size was 128k, so I was able to write ~507,000 blocks to the pool.

With this device being full, I see the following:

1) zfs list reports that no space is left (AVAIL=0)
2) zpool reports that the dedup factor was ~507,000x
3) zpool also reports that 8.6 MB of space were allocated in the pool (0% used)

So to me it looks like there is something broken in ZFS accounting with dedup.

- zpool and zfs free-space reporting do not align
- the real deduplication factor was not 507,000x (that would mean I could have written 507,000 x 1 GB = a lot to the pool)
- calculating 1046 MB / 507,000 = 2.1 KB: somehow, for each 128k block, 2.1 KB of data has been written (assuming zfs list is correct). What is this? Metadata? Does that mean I have approx. 1.6% of metadata in ZFS (1/(128k/2.1k))?
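
The per-block arithmetic can be checked quickly (the numbers are the ones quoted in this post):

```shell
# Back-of-envelope check: KiB consumed per logical 128K block, and the
# implied metadata overhead, using the figures quoted above.
awk 'BEGIN {
    pool_kb   = 1046 * 1024          # 1046 MB pool, in KiB
    blocks    = 507000               # 128k blocks written before ENOSPC
    per_block = pool_kb / blocks     # KiB actually consumed per block
    printf "%.1f KiB per block, %.2f%% overhead\n",
           per_block, 100 * per_block / 128
}'
# prints: 2.1 KiB per block, 1.65% overhead
```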

I repeated the same thing with a recordsize of 32k. The funny thing is:
- again ~60 GB could be written before "no space left"
- 31 MB of space were allocated in the pool (zpool list)

The version of the pool is 25.

During the experiment I could nicely see:
- that performance on ramdisk is CPU bound, doing ~125 MB/s per core
- that performance scales linearly when adding CPU cores (125 MB/s for 1 core, 253 MB/s for 2 cores, 408 MB/s for 4 cores)
- that the upper size of the deduplication table is blocks * ~150 bytes, independent of the dedup factor
- that the DDT does not grow for deduplicatable blocks (zdb -D)
- that performance goes down by a factor of ~4 when switching from the "closest" allocation policy to "best fit" (as the pool fills, the rate drops from 250 MB/s to 67 MB/s). I suspect even worse results for spinning media because of the head movements (>10x slowdown).
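
That DDT-size observation (~150 bytes per unique block) turns into a handy sizing rule. A quick illustration (the 1 TiB / 128 KiB figures below are illustrative, not from the experiment):

```shell
# Sizing rule from the observation above: DDT size ~= unique blocks * 150 B.
# Illustrative example: 1 TiB of unique data in 128 KiB blocks.
awk 'BEGIN {
    blocks = (1024 * 1024 * 1024 * 1024) / (128 * 1024)   # 8388608 blocks
    bytes  = blocks * 150                                 # ~150 B per entry
    printf "%d blocks -> %.0f MiB of DDT\n", blocks, bytes / (1024 * 1024)
}'
# prints: 8388608 blocks -> 1200 MiB of DDT
```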

Does anyone know why the dedup factor is wrong? Any insight into what has actually been written (compressed metadata, deduped metadata, etc.) would be greatly appreciated.

Regards,
Robert
--
This message posted from opensolaris.org
Will Murnane
2010-07-02 07:14:27 UTC
Permalink
Post by Lutz Schumann
Hello list,
I wanted to test deduplication a little and did an experiment.
My question was: can I dedupe infinitely, or is there an upper limit?
So for that I did a very basic test.
- I created a ramdisk-pool (1GB)
- enabled dedup and
- wrote zeros to it (in one single file) until an error is returned.
I don't know about the rest of your test, but writing zeroes to a ZFS
filesystem is probably not a very good test, because ZFS recognizes
these blocks of zeroes and doesn't actually write anything. Unless
maybe encryption is on, but maybe not even then.

I'd write a little program that initializes 128k of memory to a
particular pattern, then writes it to disk until it gets ENOSPC (or
some other error code, I suppose). That should force the first block
to actually be written, and all the others to point to it.
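A shell equivalent of that idea might look like this (a sketch; the paths are made up, and it writes a fixed number of copies to a temp file rather than looping until ENOSPC on a real dedup pool):

```shell
# Build a 128 KiB buffer of a fixed non-zero pattern, then append
# identical copies of it. On a dedup-enabled dataset you would loop
# until the write fails with ENOSPC; here 8 copies go to a temp file
# just to show the mechanics (paths are assumptions).
BLOCK=$(mktemp)
OUT=$(mktemp)

perl -e 'print "\xAB" x (128 * 1024)' > "$BLOCK"   # not all-zero

i=0
while [ "$i" -lt 8 ]; do    # real test: while cat "$BLOCK" >> ...; do :; done
    cat "$BLOCK" >> "$OUT" || break    # stop on ENOSPC or any other error
    i=$((i + 1))
done

wc -c < "$OUT"    # 8 * 131072 = 1048576 bytes
rm -f "$BLOCK" "$OUT"
```

With dedup on, every appended copy is identical, so the pool should end up holding one physical data block plus references to it.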
Post by Lutz Schumann
- that performance on ramdisk is CPU bound doing ~125 MB /sec per Core.
Ouch! That's all?

Will
Lutz Schumann
2010-07-02 16:56:37 UTC
Permalink
Hi,
Post by Will Murnane
I don't know about the rest of your test, but writing zeroes to a ZFS
filesystem is probably not a very good test, because ZFS recognizes
these blocks of zeroes and doesn't actually write anything. Unless
maybe encryption is on, but maybe not even then.
Not true. If I want ZFS to write zeros, ZFS writes zeros. You can check this simply by doing a dd; ZFS does not filter "zero" writes.
Post by Will Murnane
Post by Lutz Schumann
- that performance on ramdisk is CPU bound doing
~125 MB /sec per Core.
Ouch! That's all?
Per core! So for an 8-core system this is ~1 GB/s - more than most disks can handle. If you use 50% (must save some for scrub), you are OK.

Robert
Darren J Moffat
2010-07-02 18:45:14 UTC
Permalink
Post by Lutz Schumann
Post by Will Murnane
I don't know about the rest of your test, but writing zeroes to a ZFS
filesystem is probably not a very good test, because ZFS recognizes
these blocks of zeroes and doesn't actually write anything. Unless
maybe encryption is on, but maybe not even then.
Not true. If I want ZFS to write zeros, ZFS writes zeros. You can check this simply by doing a dd; ZFS does not filter "zero" writes.
Actually it does if you have compression turned on and the blocks
compress away to 0 bytes.

See
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#zio_write_bp_init

Specifically line 1005:

1005     if (psize == 0) {
1006         zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1007     } else {
--
Darren J Moffat
Lutz Schumann
2010-07-03 08:44:08 UTC
Permalink
Post by Darren J Moffat
Actually it does if you have compression turned on and the blocks
compress away to 0 bytes.
See
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#zio_write_bp_init
1005     if (psize == 0) {
1006         zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1007     } else {
Interesting; I did a quick test:
Writing zeros async: dedup=on, compression=off -> 60 MB/s
Writing zeros async: dedup=off, compression=on -> 480 MB/s
Writing zeros async: dedup=off, compression=off -> 12 MB/s

It seems that if I do a sync write with dedup=off, compress=on, I get 12 MB/s.

In both cases I see disk I/O. So to me it looks like metadata writes - the I/O being limited to 480 MB/s seems to be limited by the async metadata updates (this is a VM, so slow).

So does that mean that zero data blocks are not written, but metadata blocks are?

Robert
Brandon High
2010-07-09 08:45:18 UTC
Permalink
On Thu, Jul 1, 2010 at 1:33 AM, Lutz Schumann
Post by Lutz Schumann
Does anyone know why the dedup factor is wrong? Any insight into what
has actually been written (compressed metadata, deduped metadata, etc.)
would be greatly appreciated.
Metadata and ditto blocks. Even with dedup, ZFS will write multiple
physical copies of a block once its reference count passes a certain threshold.
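
On pools of this vintage that threshold is controlled by the dedupditto pool property (a sketch; the pool name is an assumption, and the property has since been deprecated in modern OpenZFS):

```shell
# dedupditto: once a deduped block's reference count crosses this
# threshold, ZFS stores an additional physical copy of it.
# Pool name "tank" is an assumption.
zpool set dedupditto=100 tank
zpool get dedupditto tank
```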

-B
--
Brandon High : ***@freaks.com