Discussion:
RFE: Un-dedup for unique blocks
Jim Klimov
2013-01-19 16:42:38 UTC
Hello all,

While revising my home NAS, which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact "unique": they have entries in the DDT with count=1
and the dedup bit set in the blkptr_t. They are not actually deduped, and
with my pouring of backups complete, they are unlikely to ever become deduped.

Thus these many unique "deduped" blocks are just a burden when my
system writes into the datasets with dedup enabled, when it walks the
superfluously large DDT, when it has to store this DDT on disk and in
ARC, maybe during the scrubbing... These entries bring lots of headache
(or performance degradation) for zero gain.

So I thought it would be a nice feature to let ZFS go over the DDT
(I wouldn't mind if it required offlining/exporting the pool), evict the
entries with count==1, locate the corresponding block-pointer tree entries
on disk and clear their dedup bits, making such blocks into regular unique
ones. This would require rewriting metadata (a smaller DDT, new block pointers)
but should not touch or reallocate the already-saved userdata (blocks'
contents) on the disk. The new BP without the dedup bit set would have
the same contents in its other fields (though its parents would of course
have to be changed more - new DVAs, new checksums...).

In the end my pool would only track as deduped those blocks which
already have two or more references - which, given the "static" nature
of such a backup box, should be enough (i.e. new full backups of the same
source data would remain deduped and use no extra space, while unique
data wouldn't waste resources by being accounted as deduped).
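
A toy user-space model in C of the pass described above (not ZFS code; struct ddt_entry,
struct blkptr and BP_FLAG_DEDUP are made-up stand-ins for the real on-disk structures, and
a real implementation would also have to rewrite the parents' checksums as noted above):
entries with a reference count of one are dropped from the table and the dedup flag on the
block pointer that references them is cleared, while the block data itself is never moved.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define BP_FLAG_DEDUP   0x1

struct blkptr {                 /* stand-in for the real blkptr_t */
    uint64_t checksum;
    unsigned flags;
};

struct ddt_entry {              /* stand-in for a real DDT entry */
    uint64_t checksum;
    int refcount;
    struct blkptr *bp;          /* the single BP referencing this entry */
    int live;                   /* still present in the DDT? */
};

/* Evict count==1 entries and turn their blocks back into plain unique ones. */
static void
undedup_unique(struct ddt_entry *ddt, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!ddt[i].live || ddt[i].refcount != 1)
            continue;                           /* genuinely deduped: keep */
        ddt[i].bp->flags &= ~BP_FLAG_DEDUP;     /* metadata-only rewrite */
        ddt[i].live = 0;                        /* drop the DDT entry */
    }
}

int
main(void)
{
    struct blkptr a = { 0x1111, BP_FLAG_DEDUP };
    struct blkptr b = { 0x2222, BP_FLAG_DEDUP };
    struct ddt_entry ddt[] = {
        { 0x1111, 1, &a, 1 },   /* unique block: gets un-deduped */
        { 0x2222, 3, &b, 1 },   /* three references: stays deduped */
    };

    undedup_unique(ddt, 2);
    printf("a dedup=%u b dedup=%u\n",
        a.flags & BP_FLAG_DEDUP, b.flags & BP_FLAG_DEDUP);
    return (0);
}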

What do you think?
//Jim
Nico Williams
2013-01-20 01:59:04 UTC
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.

I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.

To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter. This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.

This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.

The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.
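
To make the write-path decision concrete, here is a minimal self-contained C sketch.
It is not ZFS code: the filter size and the two hash functions are arbitrary choices,
and the "paths" are just printfs standing in for the real dedup and non-dedup write paths.

#include <stdio.h>
#include <stdint.h>

#define NBITS   (1u << 20)              /* 1 Mbit filter, ~128 KiB */
static uint8_t filter[NBITS / 8];

/* two cheap multiplicative hashes over the block checksum */
static uint32_t h1(uint64_t x) { return (uint32_t)(x * 0x9E3779B97F4A7C15ULL >> 44) % NBITS; }
static uint32_t h2(uint64_t x) { return (uint32_t)(x * 0xC2B2AE3D27D4EB4FULL >> 40) % NBITS; }

static int
bf_test(uint64_t h)
{
    return ((filter[h1(h) / 8] >> (h1(h) % 8)) & 1) &&
        ((filter[h2(h) / 8] >> (h2(h) % 8)) & 1);
}

static void
bf_set(uint64_t h)
{
    filter[h1(h) / 8] |= 1 << (h1(h) % 8);
    filter[h2(h) / 8] |= 1 << (h2(h) % 8);
}

static void
write_block(uint64_t block_hash)
{
    if (bf_test(block_hash)) {
        /* probably seen before: go through the DDT (dedup path) */
        printf("%#llx: dedup path\n", (unsigned long long)block_hash);
    } else {
        /* first sighting: plain write, remember the hash for later */
        bf_set(block_hash);
        printf("%#llx: non-dedup path\n", (unsigned long long)block_hash);
    }
}

int
main(void)
{
    write_block(0xAAAA);    /* unique so far: non-dedup path */
    write_block(0xBBBB);    /* unique so far: non-dedup path */
    write_block(0xAAAA);    /* repeat: dedup path (its first copy stays outside the DDT) */
    return (0);
}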

It's very likely that this is a bit too obvious to just work.

Of course, it is easier to just use flash. It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.

Nico
--
Richard Elling
2013-01-20 02:32:20 UTC
bloom filters are a great fit for this :-)

-- richard
Post by Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.
I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.
To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter. This
means that the filesystem would store *two* copies of any
deduplicatious block, with one of those not being in the DDT.
This would allow most writes of non-duplicate blocks to be faster than
normal dedup writes, but still slower than normal non-dedup writes:
the Bloom filter will add some cost.
The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.
It's very likely that this is a bit too obvious to just work.
Of course, it is easier to just use flash. It's also easier to just
not dedup: the most highly deduplicatious data (VM images) is
relatively easy to manage using clones and snapshots, to a point
anyways.
Nico
--
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-20 16:02:48 UTC
Post by Nico Williams
I've wanted a system where dedup applies only to blocks being written
that have a good chance of being dups of others.
I think one way to do this would be to keep a scalable Bloom filter
(on disk) into which one inserts block hashes.
To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path,
How is this different or better than the existing dedup architecture? If you found that some block about to be written in fact matches the hash of an existing block on disk, then you've already determined it's a duplicate block, exactly as you would, if you had dedup enabled. In that situation, gosh, it sure would be nice to have the extra information like reference count, and pointer to the duplicate block, which exists in the dedup table.

In other words, exactly the way existing dedup is already architected.
Post by Nico Williams
The nice thing about this is that Bloom filters can be sized to fit in
main memory, and will be much smaller than the DDT.
If you're storing all the hashes of all the blocks, how is that going to be smaller than the DDT storing all the hashes of all the blocks?
Nico Williams
2013-01-20 17:29:04 UTC
Bloom filters are very small, that's the difference. You might only need a
few bits per block for a Bloom filter. Compare to the size of a DDT entry.
A Bloom filter could be cached entirely in main memory.
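
A back-of-the-envelope comparison (my own numbers, not from the thread): the usual Bloom
filter sizing formula m = -n * ln(p) / (ln 2)^2 gives roughly 9.6 bits per block at a 1%
false-positive rate, while an in-core DDT entry is commonly quoted at around 320 bytes;
both the block count and the per-entry figure below are assumptions.

#include <stdio.h>
#include <math.h>

int
main(void)
{
    double n = 100e6;       /* assume 100 million blocks in the pool */
    double p = 0.01;        /* target 1% false-positive rate */
    double bits = -n * log(p) / (log(2) * log(2));

    printf("Bloom filter: %.0f MiB (%.1f bits/block)\n",
        bits / 8 / (1 << 20), bits / n);
    printf("DDT at ~320 bytes/entry: %.1f GiB\n",
        n * 320 / (1 << 30));
    return (0);
}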
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-20 18:29:36 UTC
Post by Nico Williams
To decide if a block needs dedup one would first check the Bloom
filter, then if the block is in it, use the dedup code path, else the
non-dedup codepath and insert the block in the Bloom filter.
Sorry, I didn't know what a Bloom filter was when I replied earlier - now I've read the Wikipedia article and am consequently an expert. *sic* ;-)

It sounds like, what you're describing... The first time some data gets written, it will
not produce a hit in the Bloom filter, so it will get written to disk without dedup. But
now it has an entry in the Bloom filter. So the second time the data block gets written
(the first duplicate) it will produce a hit in the Bloom filter, and consequently get a
dedup DDT entry. But since the system didn't dedup the first one, it means the second one
still needs to be written to disk independently of the first one. So in effect, you'll
always "miss" the first duplicated block write, but you'll successfully dedup n-1
duplicated blocks. Which is entirely reasonable, although not strictly optimal. And
sometimes you'll get a false positive out of the Bloom filter, so sometimes you'll be
running the dedup code on blocks which are actually unique, but with some intelligently
selected parameters such as Bloom table size, you can get this probability to be
reasonably small, like less than 1%.

In the wikipedia article, they say you can't remove an entry from the Bloom filter table,
which would over time cause consistent increase of false positive probability (approaching
100% false positives) from the Bloom filter and consequently high probability of dedup'ing
blocks that are actually unique; but with even a minimal amount of thinking about it, I'm
quite sure that's a solvable implementation detail. Instead of storing a single bit for
each entry in the table, store a counter. Every time you create a new entry in the table,
increment the different locations; every time you remove an entry from the table,
decrement. Obviously a counter requires more bits than a bit, but it's a linear increase
of size, exponential increase of utility, and within the implementation limits of
available hardware. But there may be a more intelligent way of accomplishing the same
goal. (Like I said, I've only thought about this minimally).
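
A minimal sketch of that counter-per-slot variant in C (a counting Bloom filter) -- toy
code, not ZFS: each slot holds a small counter instead of a bit, so entries can be
decremented when a block is freed; the hash functions and the 8-bit counter width are
arbitrary assumptions.

#include <stdio.h>
#include <stdint.h>

#define NSLOTS  (1u << 20)
static uint8_t counters[NSLOTS];        /* one small counter per slot instead of one bit */

static uint32_t h1(uint64_t x) { return (uint32_t)(x * 0x9E3779B97F4A7C15ULL >> 44) % NSLOTS; }
static uint32_t h2(uint64_t x) { return (uint32_t)(x * 0xC2B2AE3D27D4EB4FULL >> 40) % NSLOTS; }

static void
cbf_insert(uint64_t h)          /* block written: bump both slots */
{
    if (counters[h1(h)] < UINT8_MAX)
        counters[h1(h)]++;
    if (counters[h2(h)] < UINT8_MAX)
        counters[h2(h)]++;
}

static void
cbf_remove(uint64_t h)          /* block freed: decrement, which plain bit filters can't do */
{
    if (counters[h1(h)] > 0)
        counters[h1(h)]--;
    if (counters[h2(h)] > 0)
        counters[h2(h)]--;
}

static int
cbf_test(uint64_t h)
{
    return (counters[h1(h)] > 0 && counters[h2(h)] > 0);
}

int
main(void)
{
    cbf_insert(0xAAAA);
    printf("after insert: %d\n", cbf_test(0xAAAA));     /* prints 1 */
    cbf_remove(0xAAAA);
    printf("after remove: %d\n", cbf_test(0xAAAA));     /* prints 0 */
    return (0);
}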

Meh, well. Thanks for the interesting thought. For whatever it's worth.
Edward Harvey
2013-01-20 16:16:59 UTC
So ... The way things presently are, ideally you would know in advance what stuff you were planning to write that has duplicate copies. You could enable dedup, then write all the stuff that's highly duplicated, then turn off dedup and write all the non-duplicate stuff. Obviously, however, this is a fairly implausible actual scenario.

In reality, while you're writing, you're going to have duplicate blocks mixed in with your non-duplicate blocks, which fundamentally means the system needs to be calculating the cksums and entering into DDT, even for the unique blocks... Just because the first time the system sees each duplicate block, it doesn't yet know that it's going to be duplicated later.

But as you said, after data is written, and sits around for a while, the probability of duplicating unique blocks diminishes over time. So they're just a burden.

I would think the ideal situation would be to take your idea of un-dedup for unique blocks, and take it a step further. Un-dedup unique blocks that are older than some configurable threshold. Maybe you could have a command for a sysadmin to run, to scan the whole pool performing this operation, but it's the kind of maintenance that really should be done upon access, too. Somebody goes back and reads a jpg from last year, the system reads it and consequently loads the DDT entry, discovers that it's unique and has been for a long time, and so throws out the DDT info.
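
As a tiny sketch (hypothetical field names, not the real DDT structures), the on-access
check could be just a predicate like this, with the actual eviction being the same
metadata rewrite discussed at the top of the thread:

#include <stdbool.h>
#include <time.h>

struct ddt_entry {      /* stand-in, not the real ZFS ddt_entry_t */
    int refcount;
    time_t birth;       /* when the entry was created */
};

/* unique and older than the configurable threshold: candidate for un-dedup */
static bool
should_undedup(const struct ddt_entry *dde, time_t now, time_t max_age)
{
    return (dde->refcount == 1 && now - dde->birth > max_age);
}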

But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...

finglonger
Richard Elling
2013-01-21 00:19:30 UTC
Post by Edward Harvey
But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...
I disagree the ZFS is developmentally challenged. There is more development
now than ever in every way: # of developers, companies, OSes, KLOCs, features.
Perhaps the level of maturity makes progress appear to be moving slower than
it seems in early life?

-- richard
Tim Cook
2013-01-21 00:51:16 UTC
Post by Richard Elling
Post by Edward Harvey
But, by talking about it, we're just smoking pipe dreams. Cuz we all
know zfs is developmentally challenged now. But one can dream...
I disagree the ZFS is developmentally challenged. There is more development
now than ever in every way: # of developers, companies, OSes, KLOCs, features.
Perhaps the level of maturity makes progress appear to be moving slower than
it seems in early life?
-- richard
Well, perhaps a part of it is marketing. Maturity isn't really an excuse
for not having a long-term feature roadmap. It seems as though "maturity"
in this case equals stagnation. What are the features being worked on we
aren't aware of? The big ones that come to mind that everyone else is
talking about for not just ZFS but openindiana as a whole and other storage
platforms would be:
1. SMB3 - hyper-v WILL be gaining market share over the next couple years,
not supporting it means giving up a sizeable portion of the market. Not to
mention finally being able to run SQL (again) and Exchange on a fileshare.
2. VAAI support.
3. the long-sought bp-rewrite.
4. full drive encryption support.
5. tiering (although I'd argue caching is superior, it's still a checkbox).

There's obviously more, but those are just ones off the top of my head that
others are supporting/working on. Again, it just feels like all the work
is going into fixing bugs and refining what is there, not adding new
features. Obviously Saso personally added features, but overall there
don't seem to be a ton of announcements to the list about features that
have been added or are being actively worked on. It feels like all these
companies are just adding niche functionality they need that may or may not
be getting pushed back to mainline.

/debbie-downer
Richard Elling
2013-01-21 03:51:15 UTC
Post by Tim Cook
Post by Richard Elling
Post by Edward Harvey
But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...
I disagree the ZFS is developmentally challenged. There is more development
now than ever in every way: # of developers, companies, OSes, KLOCs, features.
Perhaps the level of maturity makes progress appear to be moving slower than
it seems in early life?
-- richard
Well, perhaps a part of it is marketing.
A lot of it is marketing :-/
Post by Tim Cook
Maturity isn't really an excuse for not having a long-term feature roadmap. It seems as though "maturity" in this case equals stagnation. What are the features being worked on we aren't aware of?
Most of the illumos-centric discussion is on the developer's list. The ZFSonLinux
and BSD communities are also quite active. Almost none of the ZFS developers hang out here.
Post by Tim Cook
1. SMB3 - hyper-v WILL be gaining market share over the next couple years, not supporting it means giving up a sizeable portion of the market. Not to mention finally being able to run SQL (again) and Exchange on a fileshare.
I know of at least one illumos community company working on this. However, I do not
know their public plans.
Post by Tim Cook
2. VAAI support.
VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product,
but the CEO made a conscious (and unpopular) decision to keep that code from the
community. Over the summer, another developer picked up the work in the community,
but I've lost track of the progress and haven't seen an RTI yet.
Post by Tim Cook
3. the long-sought bp-rewrite.
Go for it!
Post by Tim Cook
4. full drive encryption support.
This is a key management issue mostly. Unfortunately, the open source code for
handling this (trousers) covers much more than keyed disks and can be unwieldy.
I'm not sure which distros picked up trousers, but it doesn't belong in the illumos-gate
and it doesn't expose itself to ZFS.
Post by Tim Cook
5. tiering (although I'd argue caching is superior, it's still a checkbox).
You want to add tiering to the OS? That has been available for a long time via the
(defunct?) SAM-QFS project that actually delivered code
http://hub.opensolaris.org/bin/view/Project+samqfs/

If you want to add it to ZFS, that is a different conversation.
-- richard
Post by Tim Cook
There's obviously more, but those are just ones off the top of my head that others are supporting/working on. Again, it just feels like all the work is going into fixing bugs and refining what is there, not adding new features. Obviously Saso personally added features, but overall there don't seem to be a ton of announcements to the list about features that have been added or are being actively worked on. It feels like all these companies are just adding niche functionality they need that may or may not be getting pushed back to mainline.
/debbie-downer
--

***@RichardElling.com
+1-760-896-4422
Robert Milkowski
2013-01-29 14:08:13 UTC
From: Richard Elling
Sent: 21 January 2013 03:51
VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product,
but the CEO made a conscious (and unpopular) decision to keep that code from the
community. Over the summer, another developer picked up the work in the community,
but I've lost track of the progress and haven't seen an RTI yet.
That is one thing that always bothered me... so it is ok for others, like
Nexenta, to keep stuff closed and not in open, while if Oracle does it they
are bad?

Isn't it at least a little bit being hypocritical? (bashing Oracle and doing
sort of the same)
--
Robert Milkowski
http://milek.blogspot.com
Sašo Kiselkov
2013-01-29 14:21:30 UTC
Post by Robert Milkowski
From: Richard Elling
Sent: 21 January 2013 03:51
VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product,
but the CEO made a conscious (and unpopular) decision to keep that code from the
community. Over the summer, another developer picked up the work in the community,
but I've lost track of the progress and haven't seen an RTI yet.
That is one thing that always bothered me... so it is ok for others, like
Nexenta, to keep stuff closed and not in open, while if Oracle does it they
are bad?
Isn't it at least a little bit being hypocritical? (bashing Oracle and doing
sort of the same)
Nexenta is a downstream repository that chooses to keep some of their
new developments in-house while making others open. Most importantly,
they participate and make a conscious effort to play nice.

Contrast this with Oracle. Oracle swoops in and buys up Sun, closes
*all* of the technologies it can turn a profit on, changes licensing
terms to extremely draconian and in the process takes a dump on all of
the open-source community and large numbers of their customers.

Now imagine which of these two is more popular in the community?

(Disclaimer: my company was formerly an almost exclusive Sun shop.)

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-29 15:03:34 UTC
Post by Robert Milkowski
That is one thing that always bothered me... so it is ok for others, like
Nexenta, to keep stuff closed and not in open, while if Oracle does it they
are bad?
Oracle, like Nexenta, and my own company CleverTrove, and Microsoft, and Netapp, has every
right to close source development, if they believe it's beneficial to their business. For
all we know, Oracle might not even have a choice about it - it might have been in the
terms of settlement with NetApp (because open source ZFS definitely hurt NetApp business.)
The real question is, in which situations, is it beneficial to your business to be closed
source, as opposed to open source? There's the whole redhat/centos dichotomy. At first
blush, it would seem redhat gets screwed by centos (or oracle linux) but then you realize
how many more redhat derived systems are out there, compared to suse, etc. By allowing
people to use it for free, it actually gains popularity, and then redhat actually has a
successful support business model as compared to suse, which tanked.

But it's useless to argue about whether oracle's making the right business choice, whether open or closed source is better for their business. Cuz it's their choice, regardless who agrees. Arguing about it here isn't going to do any good.

Those of us who gained something and no longer count on having that benefit moving forward have a tendency to say "You gave it to me for free before, now I'm pissed off because you're not giving it to me for free anymore." instead of "thanks for what you gave before."

The world moves on. There's plenty of time to figure out which solution is best for you, the consumer, in the future product offerings: commercial closed source product offering, open source product offering, or something completely different such as btrfs.
Richard Elling
2013-01-29 22:28:44 UTC
Post by Robert Milkowski
From: Richard Elling
Sent: 21 January 2013 03:51
VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product,
but the CEO made a conscious (and unpopular) decision to keep that code from the
community. Over the summer, another developer picked up the work in the community,
but I've lost track of the progress and haven't seen an RTI yet.
That is one thing that always bothered me... so it is ok for others, like
Nexenta, to keep stuff closed and not in open, while if Oracle does it they
are bad?
Nexenta is just as bad. For the record, the illumos-community folks who worked at
Nexenta at the time were overruled by executive management. Some of those folks
are now executive management elsewhere :-)
Post by Robert Milkowski
Isn't it at least a little bit being hypocritical? (bashing Oracle and doing
sort of the same)
No, not at all.
-- richard

--

***@RichardElling.com
+1-760-896-4422
Pasi Kärkkäinen
2013-02-03 16:20:36 UTC
Post by Richard Elling
Post by Tim Cook
2. VAAI support.
VAAI has 4 features, 3 of which have been in illumos for a long time. The remaining
feature (SCSI UNMAP) was done by Nexenta and exists in their NexentaStor product,
but the CEO made a conscious (and unpopular) decision to keep that code from the
community. Over the summer, another developer picked up the work in the community,
but I've lost track of the progress and haven't seen an RTI yet.
I assume SCSI UNMAP is implemented in Comstar in NexentaStor?
Isn't Comstar CDDL licensed?

There's also this:
https://www.illumos.org/issues/701

.. which says UNMAP support was added to Illumos Comstar 2 years ago.


-- Pasi

Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-21 13:28:16 UTC
Post by Richard Elling
I disagree the ZFS is developmentally challenged.
As an IT consultant, 8 years ago before I heard of ZFS, it was always easy to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I told customers about ZFS, it was always a quick easy sell. Nowadays, anybody who's heard of it says they don't want it, because they believe it's a dying product, and they're putting their bets on linux instead. I try to convince them otherwise, but I'm trying to buck the word on the street. They don't listen, however much sense I make. I can only sell ZFS to customers nowadays, who have still never heard of it.

"Developmentally challenged" doesn't mean there is no development taking place. It means the largest development effort is working closed-source, and not available for free (except some purposes), so some consumers are going to follow their path, while others are going to follow the open source branch illumos path, which means both disunity amongst developers and disunity amongst consumers, and incompatibility amongst products. So far, in the illumos branch, I've only seen bugfixes introduced since zpool 28, no significant introduction of new features. (Unlike the oracle branch, which is just as easy to sell as ontap).

Which presents a challenge. Hence the term, "challenged."

Right now, ZFS is the leading product as far as I'm concerned. Better than MS VSS, better than Ontap, better than BTRFS. It is my personal opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive strategy causing disparity and lowering consumer demand for zfs, but of course, that's just a personal opinion prediction for the future, which has yet to be seen. So far, every time I evaluate BTRFS, it fails spectacularly, but the last time I did, was about a year ago. I'm due for a BTRFS re-evaluation now.
Dan Swartzendruber
2013-01-21 13:35:58 UTC
ZFS on Linux (ZOL) has made some pretty impressive strides over the last
year or so...
Sašo Kiselkov
2013-01-21 17:03:34 UTC
On 01/21/2013 02:28 PM, Edward Ned Harvey
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Richard Elling
I disagree the ZFS is developmentally challenged.
As an IT consultant, 8 years ago before I heard of ZFS, it was always easy
to sell Ontap, as long as it fit into the budget. 5 years ago, whenever I
told customers about ZFS, it was always a quick easy sell. Nowadays,
anybody who's heard of it says they don't want it, because they believe
it's a dying product, and they're putting their bets on linux instead. I
try to convince them otherwise, but I'm trying to buck the word on the street.
They don't listen, however much sense I make. I can only sell ZFS to
customers nowadays, who have still never heard of it.
Yes, Oracle did some serious damage to ZFS' and its own reputation. My
former employer used to be an almost exclusive Sun-shop. The moment
Oracle took over and decided to tank the products aimed at our segment,
we waved our beloved Sun hardware goodbye. Larry has clearly delineated
his marketing strategy: either you're a Fortune500, or you can fuck
right off.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
"Developmentally challenged" doesn't mean there is no development taking place.
It means the largest development effort is working closed-source, and not
available for free (except some purposes), so some consumers are going to
follow their path,
I would contest that point. Besides encryption (which I think was
already well underway by the time Oracle took over), AFAIK nothing much
improved in Oracle ZFS. Oracle only considers Sun a vehicle to sell its
software products on (DB, ERP, CRM, etc.). Anything that doesn't fit
into that strategy (e.g. Thumper) got butchered and thrown to the side.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
while others are going to follow the open source branch illumos path, which
means both disunity amongst developers and disunity amongst consumers, and
incompatibility amongst products.
I can't talk about "disunity" among devs (how would that manifest
itself?), but as far as incompatibility among products, I've yet to come
across it. In fact, thanks to ZFS feature flags, different feature sets
can coexist peacefully and give admins unprecedented control over their
storage pools. Version control in ZFS used to be a "take it or leave it"
approach; now you can selectively enable and use only the features you want.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
So far, in the illumos branch, I've only seen bugfixes introduced since
zpool 28, no significant introduction of new features.
I've had #3035 LZ4 compression for ZFS and GRUB integrated just a few
days back and I've got #3137 L2ARC compression up for review as we
speak. Waiting for #3137 to integrate, I'm looking to focus on multi-MB
record sizes next, and then perhaps taking a long hard look at reducing
the in-memory DDT footprint.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
(Unlike the oracle branch, which is just as easy to sell as ontap).
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Which presents a challenge. Hence the term, "challenged."
Agreed, it is a challenge and needs to be taken seriously. We are up
against a lot of money and man-hours invested by big-name companies, so
I fully agree there. We need to rally ourselves as a community and hold
together tightly.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Right now, ZFS is the leading product as far as I'm concerned. Better
than MS VSS, better than Ontap, better than BTRFS. It is my personal
opinion that one day BTRFS will eclipse ZFS due to oracle's unsupportive
strategy causing disparity and lowering consumer demand for zfs, but of
course, that's just a personal opinion prediction for the future, which
has yet to be seen. So far, every time I evaluate BTRFS, it fails
spectacularly, but the last time I did, was about a year ago. I'm due
for a BTRFS re-evaluation now.
Let us know at ***@lists.illumos.org how that goes, perhaps write a blog
post about your observations. I'm sure the BTRFS folks came up with some
neat ideas which we might learn from.

Cheers,
--
Saso
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-22 02:56:31 UTC
Post by Sašo Kiselkov
as far as incompatibility among products, I've yet to come
across it
I was talking about ... install solaris 11, and it's using a new version of zfs that's incompatible with anything else out there. And vice-versa. (Not sure if feature flags is the default, or zpool 28 is the default, in various illumos-based distributions. But my understanding is that once you upgrade to feature flags, you can't go back to 28. Which means, mutually, anything >28 is incompatible with each other.) You have to typically make a conscious decision and plan ahead, and intentionally go to zpool 28 and no higher, if you want compatibility between systems.
Post by Sašo Kiselkov
post about your observations. I'm sure the BTRFS folks came up with some
neat ideas which we might learn from.
Actually - I've written about it before (but it'll be difficult to find, and nothing earth shattering, so not worth the search.) I don't think there's anything that zfs developers don't already know. Basic stuff like fsck, and ability to shrink and remove devices, those are the things btrfs has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs doesn't. Just making sure my previous comment isn't seen as a criticism of zfs, or a judgement in favor of btrfs.)

And even with a new evaluation, the conclusion can't be completely clear, nor immediate. Last evaluation started about 10 months ago, and we kept it in production for several weeks or a couple of months, because it appeared to be doing everything well. (Except for features that were known to be not-yet implemented, such as read-only snapshots (aka quotas) and btrfs-equivalent of "zfs send.") Problem was, the system was unstable, crashing about once a week. No clues why. We tried all sorts of things in kernel, hardware, drivers, with and without support, to diagnose and capture the cause of the crashes. Then one day, I took a blind stab in the dark (for the ninetieth time) and I reformatted the storage volume ext4 instead of btrfs. After that, no more crashes. That was approx 8 months ago.

I think the only thing I could learn upon a new evaluation is: #1 I hear "btrfs send" is implemented now. I'd like to see it with my own eyes before I believe it. #2 I hear quotas (read-only snapshots) are implemented now. Again, I'd like to see it before I believe it. #3 Proven stability. Never seen it yet with btrfs. Want to see it with my eyes and stand the test of time before it earns my trust.
Sašo Kiselkov
2013-01-22 07:01:57 UTC
On 01/22/2013 03:56 AM, Edward Ned Harvey
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Sašo Kiselkov
as far as incompatibility among products, I've yet to come
across it
I was talking about ... install solaris 11, and it's using a new version
of zfs that's incompatible with anything else out there. And vice-versa.
Wait, you're complaining about a closed-source vendor who made a
conscious effort to fuck the rest of the community over? I think you're
crying on the wrong shoulder - it wasn't the open ZFS community that
pulled this dick move. Yes, you can argue that the customer isn't
interested in politics, but unfortunately, there are some things that we
simply can't do anything about - the ball is in Oracle's court on this one.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
(Not sure if feature flags is the default, or zpool 28 is the default,
in various illumos-based distributions. But my understanding is that
once you upgrade to feature flags, you can't go back to 28. Which means,
mutually, anything >28 is incompatible with each other.) You have to
typically make a conscious decision and plan ahead, and intentionally go
to zpool 28 and no higher, if you want compatibility between systems.
Yes, feature flags is the default, simply because it is a way for open
ZFS vendors to interoperate. Oracle is an important player in ZFS for
sure, but we can't let their unwillingness to cooperate with others hold
the whole community in stasis - that is actually what they would have
wanted.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Sašo Kiselkov
post about your observations. I'm sure the BTRFS folks came up with some
neat ideas which we might learn from.
Actually - I've written about it before (but it'll be difficult to find,
and nothing earth shattering, so not worth the search.) I don't think
there's anything that zfs developers don't already know. Basic stuff like
fsck, and ability to shrink and remove devices, those are the things btrfs
has and zfs doesn't. (But there's lots more stuff that zfs has and btrfs
doesn't. Just making sure my previous comment isn't seen as a criticism
of zfs, or a judgement in favor of btrfs.)
Well, I learned of the LZ4 compression algorithm in a benchmark
comparison of ZFS, BTRFS and other filesystem compression. Seeing that
there were better things out there I decided to try and push the state
of ZFS compression ahead a little.
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
And even with a new evaluation, the conclusion can't be completely clear,
nor immediate. Last evaluation started about 10 months ago, and we kept
it in production for several weeks or a couple of months, because it
appeared to be doing everything well. (Except for features that were known
to be not-yet implemented, such as read-only snapshots (aka quotas) and
btrfs-equivalent of "zfs send.") Problem was, the system was unstable,
crashing about once a week. No clues why. We tried all sorts of things
in kernel, hardware, drivers, with and without support, to diagnose and
capture the cause of the crashes. Then one day, I took a blind stab in the
dark (for the ninetieth time) and I reformatted the storage volume ext4
instead of btrfs. After that, no more crashes. That was approx 8 months ago.
Even negative results are results. I'm sure the BTRFS devs would be
interested in your crash dumps. Not saying that you are in any way
obligated to provide them - just pointing out that perhaps you were
hitting some snag that could have been resolved (or not).
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
I think the only thing I could learn upon a new evaluation is: #1 I hear
"btrfs send" is implemented now. I'd like to see it with my own eyes before
I believe it. #2 I hear quotas (read-only snapshots) are implemented now.
Again, I'd like to see it before I believe it. #3 Proven stability. Never
seen it yet with btrfs. Want to see it with my eyes and stand the test of
time before it earns my trust.
Do not underestimate these guys. They could have come up with a cool new
feature that we haven't heard anything about at all. One of the things
knocking around in my head ever since it was mentioned a while back on
these mailing lists was a metadata-caching device, i.e. a small yet
super-fast device that would allow you to just store the pool
topology for very fast scrub/resilver. These are the sort of things that
I meant - they could have thought about filesystems in ways that haven't
been done widely before. While BTRFS may be developmentally behind ZFS,
one still has to have great respect for the intellect of its developers
- these guys are not dumb.

Cheers,
--
Saso
Darren J Moffat
2013-01-22 11:30:29 UTC
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Just a few examples:

Solaris ZFS already has support for 1MB block size.

Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.

It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.
--
Darren J Moffat
Tomas Forsman
2013-01-22 11:57:50 UTC
Post by Darren J Moffat
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Solaris ZFS already has support for 1MB block size.
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
Would this apply to say a SATA SSD used as ZIL? (which we have, a
vertex2ex with supercap)

/Tomas
--
Tomas Forsman, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Darren J Moffat
2013-01-22 13:18:13 UTC
Post by Tomas Forsman
Post by Darren J Moffat
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Solaris ZFS already has support for 1MB block size.
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
Would this apply to say a SATA SSD used as ZIL? (which we have, a
vertex2ex with supercap)
If the device advertises the UNMAP feature and you are running Solaris
11.1 it should attempt to use it.
--
Darren J Moffat
Sašo Kiselkov
2013-01-22 13:13:13 UTC
Post by Darren J Moffat
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Solaris ZFS already has support for 1MB block size.
Working on that as we speak.
I'll see your 1MB and raise you another 7 :P
Post by Darren J Moffat
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
AFAIK, the first isn't in Illumos' ZFS, while the latter one is (though
I might be mistaken). In any case, interesting features.
Post by Darren J Moffat
It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.
Performance improvements such as?

Cheers,
--
Saso
Robert Milkowski
2013-01-29 13:59:22 UTC
Post by Sašo Kiselkov
Post by Darren J Moffat
It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.
Performance improvements such as?
Dedup'ed ARC for one.
0 block automatically "dedup'ed" in-memory.
Improvements to ZIL performance.
Zero-copy zfs+nfs+iscsi
...
--
Robert Milkowski
http://milek.blogspot.com
Sašo Kiselkov
2013-01-29 14:06:48 UTC
Post by Robert Milkowski
Post by Sašo Kiselkov
Post by Darren J Moffat
It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.
Performance improvements such as?
Dedup'ed ARC for one.
0 block automatically "dedup'ed" in-memory.
Improvements to ZIL performance.
Zero-copy zfs+nfs+iscsi
...
Cool, thanks for the inspiration on my next work in Illumos' ZFS.

Cheers,
--
Saso
Michel Jansens
2013-01-22 13:20:26 UTC
Maybe 'shadow migration' ? (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)

Michel
Post by Darren J Moffat
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Solaris ZFS already has support for 1MB block size.
Support for SCSI UNMAP - both issuing it and honoring it when it is
the backing store of an iSCSI target.
It also has a lot of performance improvements and general bug fixes
in the Solaris 11.1 release.
--
Darren J Moffat
Michel Jansens
***@ulb.ac.be
Darren J Moffat
2013-01-22 13:29:01 UTC
Maybe 'shadow migration' ? (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)
That isn't really a ZFS feature, since it happens at the VFS layer. The
ZFS support there is really about getting the options passed through and
checking status but the core of the work happens at the VFS layer.

Shadow migration works with UFS as well!

Since I'm replying here are a few others that have been introduced in
Solaris 11 or 11.1.

There is also the new improved ZFS share syntax for NFS and CIFS in
Solaris 11.1 where you can much more easily inherit and also override
individual share properties.

There are improved diagnostics rules.

ZFS support for Immutable Zones (mostly a VFS feature) & Extended
(privilege) Policy and aliasing of datasets in Zones (so you don't see
the part of the dataset hierarchy above the bit delegated to the zone).

UEFI GPT label support for root pools with GRUB2 and on SPARC with OBP.

New "sensitive" per file flag.

Various ZIL and ARC performance improvements.

Preallocated ZVOLs - for swap/dump.
Michel
Post by Darren J Moffat
Post by Sašo Kiselkov
Again, what significant features did they add besides encryption? I'm
not saying they didn't, I'm just not aware of that many.
Solaris ZFS already has support for 1MB block size.
Support for SCSI UNMAP - both issuing it and honoring it when it is
the backing store of an iSCSI target.
It also has a lot of performance improvements and general bug fixes in
the Solaris 11.1 release.
--
Darren J Moffat
Michel Jansens
--
Darren J Moffat
Darren J Moffat
2013-01-22 13:39:26 UTC
Post by Darren J Moffat
Since I'm replying here are a few others that have been introduced in
Solaris 11 or 11.1.
and another one I can't believe I missed since I was one of the people
that helped design it and I did codereview...

Per-file sensitivity labels for TX configurations.

and I'm sure I'm still missing stuff that is in Solaris 11 and 11.1.
--
Darren J Moffat
Sašo Kiselkov
2013-01-22 14:12:32 UTC
Post by Darren J Moffat
Post by Darren J Moffat
Since I'm replying here are a few others that have been introduced in
Solaris 11 or 11.1.
and another one I can't believe I missed since I was one of the people
that helped design it and I did codereview...
Per-file sensitivity labels for TX configurations.
Can you give some details on that? Google searches are turning up pretty dry.

Cheers,
--
Saso
C***@oracle.com
2013-01-22 14:30:27 UTC
Post by Sašo Kiselkov
Post by Darren J Moffat
Post by Darren J Moffat
Since I'm replying here are a few others that have been introduced in
Solaris 11 or 11.1.
and another one I can't believe I missed since I was one of the people
that helped design it and I did codereview...
Per-file sensitivity labels for TX configurations.
Can you give some details on that? Google searches are turning up pretty dry.
Start here:

http://docs.oracle.com/cd/E26502_01/html/E29017/managefiles-1.html#scrolltoc


Look for "multilevel datasets".

Casper
Jim Klimov
2013-01-22 21:45:29 UTC
Post by Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Sounds like something I proposed on these lists, too ;)
Does this preallocation only mean filling an otherwise ordinary
ZVOL with zeroes (or some other pattern) - if so, to what effect?

Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?

Thanks,
//Jim
Sašo Kiselkov
2013-01-22 22:03:37 UTC
Post by Jim Klimov
Post by Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?
I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).

Cheers,
--
Saso
Jim Klimov
2013-01-22 22:22:56 UTC
Post by Sašo Kiselkov
Post by Jim Klimov
Post by Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?
I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).
Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.
Both have a lifetime span of a single system uptime - like L2ARC,
for example - and will be reused anew afterwards - after a reboot,
a power-surge, or a kernel panic.

So while metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on,
and jump around the disk along with rewrites of blocks, the ZVOL
userdata itself may as well occupy the same positions on the disk,
I think, rewriting older stuff. With mirroring likely in place as
well as checksums, there are other ways than COW to ensure that
the swap (at least some component thereof) contains what it should,
even with intermittent errors of some component devices.

Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)

Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?

If the latter, and we still intend to preallocate and guarantee
that the swap has its administratively predefined amount of
gigabytes, compressed blocks can be aligned on those starting
locations as if they were not compressed. In effect this would
just decrease the bandwidth requirements, maybe.

For dump this might be just a bulky compressed write from start
to however much it needs, within the preallocated psize limits...

//Jim
Nico Williams
2013-01-22 22:32:47 UTC
IIRC dump is special.

As for swap... really, you don't want to swap. If you're swapping you
have problems. Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM. There
*are* exceptions to this, such as Varnish. For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.

Nico
--
Jim Klimov
2013-01-22 23:27:17 UTC
Post by Nico Williams
IIRC dump is special.
As for swap... really, you don't want to swap. If you're swapping you
have problems. Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM. There
*are* exceptions to this, such as Varnish. For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.
I know of this stance, and in general you're right. But... ;)

Sometimes, there are once-in-a-longtime tasks that might require
enormous virtual memory that you wouldn't normally provision
proper hardware for (RAM, SSD) and/or cases when you have to run
similarly greedy tasks on hardware with limited specs (i.e. home
PC capped at 8GB RAM). As an example I might think of a ZDB walk
taking about 35-40GB VM on my box. This is not something I do
every month, but when I do, I need it to complete even though
I have 5 times less RAM on that box (and the kernel's equivalent
of that walk fails with scanrate hell because it can't swap, btw).

On the other hand, there are tasks like VirtualBox which "require"
swap to be configured in amounts equivalent to VM RAM size, but
don't really swap (most of the time). Setting aside SSDs for this
task might be too expensive, if they are never to be used in real
practice.

But this point is more of a task for swap device tiering (like
with Linux swap priorities), as I proposed earlier last year...

//Jim
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-22 23:54:53 UTC
Post by Nico Williams
As for swap... really, you don't want to swap. If you're swapping you
have problems.
For clarification, the above is true in Solaris and derivatives, but it's not universally true for all OSes. I'll cite linux as the example, because I know it. If you provide swap to a linux kernel, it considers this a degree of freedom when choosing to evict data from the cache, versus swapping out idle processes (or zombie processes.) As long as you swap out idle process memory that is colder than some cache memory, swap actually improves performance. But of course, if you have any active process starved of ram and consequently thrashing swap actively, of course, you're right. It's bad bad bad to use swap that way.

In solaris, I've never seen it swap out idle processes; I've only seen it use swap for the bad bad bad situation. I assume that's all it can do with swap.
Gary Mills
2013-01-23 03:50:15 UTC
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Nico Williams
As for swap... really, you don't want to swap. If you're swapping you
have problems.
In solaris, I've never seen it swap out idle processes; I've only
seen it use swap for the bad bad bad situation. I assume that's all
it can do with swap.
You would be wrong. Solaris uses swap space for paging. Paging out
unused portions of an executing process from real memory to the swap
device is certainly beneficial. Swapping out complete processes is a
desperation move, but paging out most of an idle process is a good
thing.
--
-Gary Mills- -refurb- -Winnipeg, Manitoba, Canada-
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-23 12:36:11 UTC
Post by Gary Mills
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
In solaris, I've never seen it swap out idle processes; I've only
seen it use swap for the bad bad bad situation. I assume that's all
it can do with swap.
You would be wrong. Solaris uses swap space for paging. Paging out
unused portions of an executing process from real memory to the swap
device is certainly beneficial. Swapping out complete processes is a
desperation move, but paging out most of an idle process is a good
thing.
You seem to be emphasizing the distinction between swapping and paging. My point though, is that I've never seen the swap usage (which is being used for paging) on any solaris derivative to be used nonzero, for the sake of keeping something in cache. It seems to me, that solaris will always evict all cache memory before it swaps (pages) out even the most idle process memory.
Ray Arachelian
2013-01-23 17:46:40 UTC
Paging out unused portions of an executing process from real memory to
the swap device is certainly beneficial. Swapping out complete
processes is a desperation move, but paging out most of an idle
process is a good thing.
It gets even better. Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.) There's lots of awesome fun optimizations in
UNIX. :)
C***@oracle.com
2013-01-23 19:48:21 UTC
Post by Ray Arachelian
On Tue, Jan 22, 2013 at 11:54:53PM +0000, Edward Ned Harvey (opensolarisisdeadlongliveopensolari
Paging out unused portions of an executing process from real memory to
the swap device is certainly beneficial. Swapping out complete
processes is a desperation move, but paging out most of an idle
process is a good thing.
It gets even better. Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.) There's lots of awesome fun optimizations in
UNIX. :)
The "sticky bit" has never been used in that form of SunOS for as long
as I remember (SunOS 3.x) and probably before that. It no longer makes
sense in demand-paged executables.

Casper
Joerg Schilling
2013-02-01 17:05:15 UTC
Post by C***@oracle.com
Post by Ray Arachelian
It gets even better. Executables become part of the swap space via
mmap, so that if you have a lot of copies of the same process running in
memory, the executable bits don't waste any more space (well, unless you
use the sticky bit, although that might be deprecated, or if you copy
the binary elsewhere.) There's lots of awesome fun optimizations in
UNIX. :)
The "sticky bit" has never been used in that form of SunOS for as long
as I remember (SunOS 3.x) and probably before that. It no longer makes
sense in demand-paged executables.
SunOS-3.0 introduced NFS-root and swap on NFS. For that reason, the meaning of
the sticky bit was changed to mean "do not write-cache this file".

Note that SunOS-3.0 appeared with the new Sun3 machines (first build on
24.12.1985).

Jörg
--
EMail:***@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
***@cs.tu-berlin.de (uni)
***@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
C***@oracle.com
2013-01-23 08:41:10 UTC
Post by Nico Williams
IIRC dump is special.
As for swap... really, you don't want to swap. If you're swapping you
have problems. Any swap space you have is to help you detect those
problems and correct them before apps start getting ENOMEM. There
*are* exceptions to this, such as Varnish. For Varnish and any other
apps like it I'd dedicate an entire flash drive to it, no ZFS, no
nothing.
Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations. Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.

But continuous swapping is clearly a sign of a system too small for its
job.

Of course, compressing and/or encrypting swap has interesting issues:
freeing memory by swapping pages out requires even more memory.

Casper
Jim Klimov
2013-01-23 08:47:04 UTC
Post by C***@oracle.com
Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations. Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.
I believe, by the symptoms, that this is what happens often
in particular to Java processes (app-servers and such) - I do
regularly see these have large "VM" sizes and much (3x) smaller
"RSS" sizes. One explanation I've seen is that JVM nominally
depends on a number of shared libraries which are loaded to
fulfill the runtime requirements, but aren't actively used and
thus go out into swap quickly. I chose to trust that statement ;)

//Jim
Ian Collins
2013-01-23 10:11:29 UTC
Post by Jim Klimov
Post by C***@oracle.com
Yes and no: the system reserves a lot of additional memory (Solaris
doesn't over-commit swap) and swap is needed to support those
reservations. Also, some pages are dirtied early on and never touched
again; those pages should not be kept in memory.
I believe, by the symptoms, that this is what happens often
in particular to Java processes (app-servers and such) - I do
regularly see these have large "VM" sizes and much (3x) smaller
"RSS" sizes.
Being swapped out is probably the best thing that can be done to most
Java processes :)
--
Ian.
Sašo Kiselkov
2013-01-22 23:31:08 UTC
Post by Jim Klimov
Post by Sašo Kiselkov
Post by Jim Klimov
Post by Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Or is it also supported to disable COW for such datasets, so that
the preallocated swap/dump zvols might remain contiguous on the
faster tracks of the drive (i.e. like a dedicated partition, but
with benefits of ZFS checksums and maybe compression)?
I highly doubt it, as it breaks one of the fundamental design principles
behind ZFS (always maintain transactional consistency). Also,
contiguousness and compression are fundamentally at odds (contiguousness
requires each block to remain the same length regardless of contents,
compression varies block length depending on the entropy of the contents).
Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.
I get your point, but I would argue that if you are willing to
preallocate storage for these, then putting dump/swap on an iSCSI LUN as
opposed to having it locally is kind of pointless anyway. Since they are
used rarely, having them "thin provisioned" is probably better in an
iSCSI environment than wasting valuable network-storage resources on
something you rarely need.
Post by Jim Klimov
Both have a lifetime span of a single system uptime - like L2ARC,
for example - and will be reused anew afterwards - after a reboot,
a power-surge, or a kernel panic.
For the record, the L2ARC is not transactionally consistent. It uses a
completely different allocation strategy from the main pool (essentially
a simple rotor). Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?
Post by Jim Klimov
So while metadata used to address the swap ZVOL contents may and
should be subject to common ZFS transactions and COW and so on,
and jump around the disk along with rewrites of blocks, the ZVOL
userdata itself may as well occupy the same positions on the disk,
I think, rewriting older stuff. With mirroring likely in place as
well as checksums, there are other ways than COW to ensure that
the swap (at least some component thereof) contains what it should,
even with intermittent errors of some component devices.
You don't understand, the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to
view this kind of fat-provisioned zvol as a simple contiguous
container block, but it is probably more hassle to implement than it's
worth.
Post by Jim Klimov
Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)
If you are talking about the standard opensolaris-style
boot-environments, then yes, this is taken into account. Your BE lives
under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
respectively (both thin-provisioned, since they are rarely needed).
Post by Jim Klimov
Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?
Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.
Post by Jim Klimov
If the latter, and we still intend to preallocate and guarantee
that the swap has its administratively predefined amount of
gigabytes, compressed blocks can be aligned on those starting
locations as if they were not compressed. In effect this would
just decrease the bandwidth requirements, maybe.
But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the
same size as before. What changes is how much space they occupy on the
underlying pool.
Post by Jim Klimov
For dump this might be just a bulky compressed write from start
to however much it needs, within the preallocated psize limits...
I hope you now understand the distinction between the logical size of a
zvol and its actual in-pool size. We can't tie one to the other, since it
would result in unpredictable behavior for the application (write one
set of data, get capacity X, write another set, get capacity Y - how to
determine in advance how much fits in? You can't).
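(To illustrate the distinction with the standard properties - using
rpool/swap as an example:

    # volsize is the logical capacity the consumer (e.g. swap) sees;
    # used/referenced and compressratio describe what the blocks
    # actually occupy in the pool after compression
    zfs get volsize,used,referenced,compressratio rpool/swap
)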

Cheers,
--
Saso
Jim Klimov
2013-01-23 01:17:32 UTC
Permalink
The discussion has suddenly become hot and interesting - albeit it has
diverged quite far from the original topic ;)

First of all, as a disclaimer, when I have earlier proposed such changes
to datasets for swap (and maybe dump) use, I've explicitly proposed that
this be a new dataset type - compared to zvol and fs and snapshot that
we have today. Granted, this distinction was lost in today's exchange
of words, but it is still an important one - especially since it means
that while basic ZFS (or rather ZPOOL) rules are maintained, the dataset
rules might be redefined ;)

I'll try to reply to a few points below, snipping a lot of older text.
Post by Sašo Kiselkov
Post by Jim Klimov
Well, dump and swap devices are kind of special in that they need
verifiable storage (i.e. detectable to have no bit-errors) but not
really consistency as in sudden-power-off transaction protection.
I get your point, but I would argue that if you are willing to
preallocate storage for these, then putting dump/swap on an iSCSI LUN as
opposed to having it locally is kind of pointless anyway. Since they are
used rarely, having them "thin provisioned" is probably better in an
iSCSI environment than wasting valuable network-storage resources on
something you rarely need.
I am not sure what in my post led you to think that I meant iSCSI
or otherwise networked storage to keep swap and dump. Some servers
have local disks, you know - and in networked storage environments
the local disks are only used to keep the OS image, swap and dump ;)
Post by Sašo Kiselkov
Besides, if you plan to shred your dump contents after
reboot anyway, why fat-provision them? I can understand swap, but dump?
To guarantee that the space is there... Given the recent mishaps
with dumping (i.e. the dump context is quite stripped-down compared
to general kernel work, so multithreading broke somehow), I guess
that pre-provisioned sequential areas might also reduce some risks...
though likely not - random metadata would still have to get into
the pool.
Post by Sašo Kiselkov
You don't understand, the transactional integrity in ZFS isn't just to
protect the data you put in, it's also meant to protect ZFS' internal
structure (i.e. the metadata). This includes the layout of your zvols
(which are also just another dataset). I understand that you want to
view this kind of fat-provisioned zvol as a simple contiguous
container block, but it is probably more hassle to implement than it's
worth.
I'd argue that transactional integrity in ZFS primarily protects
metadata, so that there is a tree of always-actual block pointers.
There is this octopus of a block-pointer tree whose leaf nodes
point to data blocks - but only as DVAs and checksums, basically.
For the data consumers (FS users, zvol users), nothing really
requires data to be - or not to be - COWed and stored at a different
location than the previous version of the block at the same logical
offset; all we want is that the data remains readable even after a
catastrophic pool close (system crash, poweroff, etc.).

We don't (AFAIK) have such a requirement for swap. If the pool
which contained swap kicked the bucket, we probably have a
larger problem whose solution will likely involve reboot and thus
recycling of all swap data.

And for single-device errors with (contiguous) preallocated
unrelocatable swap, we can protect with mirrors and checksums
(used upon read, within this same uptime that wrote the bits).
Post by Sašo Kiselkov
Post by Jim Klimov
Likewise, swap/dump breed of zvols shouldn't really have snapshots,
especially not automatic ones (and the installer should take care
of this at least for the two zvols it creates) ;)
If you are talking about the standard opensolaris-style
boot-environments, then yes, this is taken into account. Your BE lives
under rpool/ROOT, while swap and dump are rpool/swap and rpool/dump
respectively (both thin-provisioned, since they are rarely needed).
I meant the attribute for zfs-auto-snapshots service, i.e.:
rpool/swap com.sun:auto-snapshot false local
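(For reference, that local value would have been set with something
along these lines - a sketch using the standard user-property
mechanism:

    # tell the zfs-auto-snapshot/time-slider service to skip this zvol
    zfs set com.sun:auto-snapshot=false rpool/swap
    zfs get com.sun:auto-snapshot rpool/swap
)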

As I wrote, I'd argue that for "new" swap (and maybe dump) datasets
the snapshot action should not even be implemented.
Post by Sašo Kiselkov
Post by Jim Klimov
Compression for swap is an interesting matter... for example, how
should it be accounted? As dynamic expansion and/or shrinking of
available swap space (or just of space needed to store it)?
Since compression occurs way below the dataset layer, your zvol capacity
doesn't change with compression, even though how much space it actually
uses in the pool can. A zvol's capacity pertains to its logical
attributes, i.e. most importantly the maximum byte offset within it
accessible to an application (in this case, swap). How the underlying
blocks are actually stored and how much space they take up is up to the
lower layers.
...
Post by Sašo Kiselkov
But you forget that a compressed block's physical size fundamentally
depends on its contents. That's why compressed zvols still appear the
same size as before. What changes is how much space they occupy on the
underlying pool.
I won't argue with this, as it is perfectly correct for zvols and
undefined for the mythical new dataset type ;)

However, regarding dump and size prediction - when I created dump
zvols manually and fed them to dumpadm, it could complain that the
device was too small. Then at some point it accepted the given size,
even though that value did not resemble the system RAM size or
anything obvious. So I guess the system also does some guessing in
this case?.. If so, preallocating as many bytes as it thinks are
minimally required and then allowing compression to stuff more data
in might help to actually save the larger dumps in cases where the
system (dumpadm) made a wrong guess.
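(The manual procedure I mean is roughly the following - volume name
and size are hypothetical:

    # create a dump zvol by hand and offer it to dumpadm; dumpadm may
    # refuse it if it estimates the volume is too small
    zfs create -V 4g rpool/dump2
    dumpadm -d /dev/zvol/dsk/rpool/dump2
    dumpadm    # prints the resulting dump configuration
)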

//Jim
Matthew Ahrens
2013-01-24 00:04:23 UTC
Permalink
Post by Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Darren, good to hear about the cool stuff in S11.

Just to clarify, is this preallocated ZVOL different than the preallocated
dump which has been there for quite some time (and is in Illumos)? Can you
use it for other zvols besides swap and dump?

Some background: the zfs dump device has always been preallocated ("thick
provisioned"), so that we can reliably dump. By definition, something has
gone horribly wrong when we are dumping, so this code path needs to be as
small as possible to have any hope of getting a dump. So we preallocate
the space for dump, and store a simple linked list of disk segments where
it will be stored. The dump device is not COW, checksummed, deduped,
compressed, etc. by ZFS.

In Illumos (and S10), swap was treated more or less like a regular zvol.
This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes. I could see
advantages to the simplicity of a preallocated swap volume, using the same
code that already existed for preallocated dump. Of course, the loss of
checksumming and encryption is much more of a concern with swap (which is
critical for correct behavior) than with dump (which is nice to have for
debugging).

--matt
Darren J Moffat
2013-01-24 10:06:40 UTC
Permalink
On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Darren, good to hear about the cool stuff in S11.
Just to clarify, is this preallocated ZVOL different than the
preallocated dump which has been there for quite some time (and is in
Illumos)? Can you use it for other zvols besides swap and dump?
It is the same but we are using it for swap now too. It isn't available
for general use.
Some background: the zfs dump device has always been preallocated
("thick provisioned"), so that we can reliably dump. By definition,
something has gone horribly wrong when we are dumping, so this code path
needs to be as small as possible to have any hope of getting a dump. So
we preallocate the space for dump, and store a simple linked list of
disk segments where it will be stored. The dump device is not COW,
checksummed, deduped, compressed, etc. by ZFS.
For the sake of others (I know you know this, Matt): the dump system
does the compression itself, so ZFS didn't need to anyway.
In Illumos (and S10), swap was treated more or less like a regular zvol.
This leads to some tricky code paths because ZFS allocates memory from
many points in the code as it is writing out changes. I could see
advantages to the simplicity of a preallocated swap volume, using the
same code that already existed for preallocated dump. Of course, the
loss of checksumming and encryption is much more of a concern with swap
(which is critical for correct behavior) than with dump (which is nice
to have for debugging).
We have encryption for dump because it is hooked into the zvol code.

For encrypting swap Illumos could do the same as Solaris 11 does and use
lofi. I changed swapadd so that if "encryption" is specified in the
options field of the vfstab entry it creates a lofi shim over the swap
device using 'lofiadm -e'. This provides you with encrypted swap
regardless of what the underlying "disk" is (normal ZVOL, prealloc
ZVOL, real disk slice, SVM mirror etc).
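(A sketch of what such a vfstab entry could look like - the exact
field layout follows vfstab(4), with "encryption" in the options
column being the trigger described above:

    #device to mount           device to fsck  mount point  FS type  fsck pass  mount at boot  options
    /dev/zvol/dsk/rpool/swap   -               -            swap     -          no             encryption
)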
--
Darren J Moffat
Jim Klimov
2013-01-24 11:29:18 UTC
Permalink
Post by Darren J Moffat
On Tue, Jan 22, 2013 at 5:29 AM, Darren J Moffat
Preallocated ZVOLs - for swap/dump.
Darren, good to hear about the cool stuff in S11.
Yes, thanks, Darren :)
Post by Darren J Moffat
Just to clarify, is this preallocated ZVOL different than the
preallocated dump which has been there for quite some time (and is in
Illumos)? Can you use it for other zvols besides swap and dump?
It is the same but we are using it for swap now too. It isn't available
for general use.
Some background: the zfs dump device has always been preallocated
("thick provisioned"), so that we can reliably dump. By definition,
something has gone horribly wrong when we are dumping, so this code path
needs to be as small as possible to have any hope of getting a dump. So
we preallocate the space for dump, and store a simple linked list of
disk segments where it will be stored. The dump device is not COW,
checksummed, deduped, compressed, etc. by ZFS.
Comparing these two statements, can I say (and be correct) that the
preallocated swap devices would lack COW (as I proposed too) and thus
likely snapshots, but would also lack the checksums? (we might live
without compression, though that was once touted as a bonus for swap
over zfs, and certainly can do without dedup)

Basically, they are seemingly little different from preallocated
disk slices - and for those an admin might have better control over
the dedicated disk locations (i.e. faster tracks in a small-seek
stroke range), except that ZFS datasets are easier to resize...
right or wrong?

//Jim
Sašo Kiselkov
2013-01-22 13:29:42 UTC
Permalink
Post by Michel Jansens
Maybe 'shadow migration' ? (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)
Hm, interesting, so it works as a sort of replication system, except
that the data needs to be read-only and you can start accessing it on
the target before the initial sync. Did I get that right?

--
Saso
Darren J Moffat
2013-01-22 13:37:24 UTC
Permalink
Post by Sašo Kiselkov
Post by Michel Jansens
Maybe 'shadow migration' ? (eg: zfs create -o shadow=nfs://server/dir
pool/newfs)
Hm, interesting, so it works as a sort of replication system, except
that the data needs to be read-only and you can start accessing it on
the target before the initial sync. Did I get that right?
The source filesystem needs to be read-only. It works at the VFS layer
so it doesn't copy snapshots or clones over. Once mounted it appears
like all the original data is instantly there.

There is an (optional) shadowd that pushes the migration along, but it
will complete on its own anyway.

shadowstat(1M) gives information on the status of the migrations.
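(So a migration could look roughly like this - server and dataset
names are hypothetical:

    # create the new filesystem with the shadow property pointing at
    # the read-only source; data appears immediately and is migrated
    # in the background
    zfs create -o shadow=nfs://oldserver/export/data newpool/data
    shadowstat    # watch the migration progress
)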
--
Darren J Moffat
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-01-22 15:32:17 UTC
Permalink
Post by Darren J Moffat
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
When I search for scsi unmap, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.

Would you mind momentarily, describing what SCSI UNMAP is used for? If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...

Customer doesn't *necessarily* have to be as clueless as CEO/CFO. Perhaps just another IT person, or whatever.
Sašo Kiselkov
2013-01-22 15:48:07 UTC
Permalink
On 01/22/2013 04:32 PM, Edward Ned Harvey
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Darren J Moffat
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
When I search for scsi unmap, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.
Would you mind momentarily, describing what SCSI UNMAP is used for? If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...
Customer doesn't *necessarily* have to be as clueless as CEO/CFO. Perhaps just another IT person, or whatever.
SCSI UNMAP is a feature of the SCSI protocol that lets the host signal
to an SSD (or other storage device) that a given data block is no
longer in use by the filesystem and may be erased.

TL;DR:
It makes writing to flash faster. Flash write latency degrades over
time; UNMAP prevents that from happening. Keep in mind that this is only
important for sync-write workloads (e.g. Databases, NFS, etc.), not
async-write workloads (file servers, bulk storage). For ZFS this is a
win if you're using a flash-based slog (ZIL) device. You can entirely
side-step this issue (and performance-sensitive applications often do)
by placing the slog onto a device not based on flash, e.g. DDRDrive x1,
ZeusRAM, etc.

THE DETAILS:
As you may know, flash memory cells, by design, cannot be overwritten.
They can only be read (very fast), written when they are empty (called
"programmed", still quite fast) or erased (slow as hell). To implement
overwriting, when a flash controller detects an attempt to overwrite an
already programmed flash cell, it instead holds the write while it
erases the block first (which takes a lot of time), and only then
programs it with the new data.

Before SCSI Unmap (also called TRIM in SATA) filesystems had no way of
talking to the underlying flash memory to tell it that a given block of
data has been freed (e.g. due to a user deleting a file). So sooner or
later, a filesystem used up all empty blocks on the flash device and
essentially every write had to first erase some flash blocks to
complete. This impacts synchronous I/O write latency (e.g. ZIL, sync
database I/O, etc.).

With Unmap, a filesystem can preemptively tell the flash controller that
a given data block is no longer needed and the flash controller can, at
its leisure, pre-erase it. Thus, as long as you have free space on your
filesystem, most, if not all, of your writes will be direct program
writes, not erase-program.

Cheers,
--
Saso
Andrew Gabriel
2013-01-22 15:51:18 UTC
Permalink
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Darren J Moffat
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
When I search for scsi unmap, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.
Would you mind momentarily, describing what SCSI UNMAP is used for? If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...
Customer doesn't *necessarily* have to be as clueless as CEO/CFO. Perhaps just another IT person, or whatever.
SCSI UNMAP (or SATA TRIM) is a means of telling a storage device that
some blocks are no longer needed. (This might be because a file has been
deleted in the filesystem on the device.)

In the case of a Flash device, it can optimise usage by knowing this,
e.g. it can perform a background erase on the real blocks so they're
ready for reuse sooner, and/or do better wear leveling by having more
spare space to play with - on some devices this also improves the
device's lifetime. It can also help by avoiding some read-modify-write
operations, if the device knows that the data in the rest of the 4k
block is no longer needed.

In the case of an iSCSI LUN target, these blocks no longer need to be
archived, and if sparse space allocation is in use, the space they
occupied can be freed up. In the particular case of ZFS provisioning
the iSCSI LUN (COMSTAR), you might get performance improvements by
having more free space to play with during other write operations to
allow better storage layout optimisation.

So, bottom line is longer life of SSDs (maybe higher performance too if
there's less waiting for erases during writes), and better space
utilisation and performance for a ZFS COMSTAR target.
--
Andrew Gabriel
Darren J Moffat
2013-01-22 15:53:32 UTC
Permalink
On 01/22/13 15:32, Edward Ned Harvey
Post by Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
Post by Darren J Moffat
Support for SCSI UNMAP - both issuing it and honoring it when it is the
backing store of an iSCSI target.
When I search for scsi unmap, I come up with all sorts of documentation that ... is ... like reading a medical journal when all you want to know is the conversion from 98.6F to C.
Would you mind momentarily, describing what SCSI UNMAP is used for? If I were describing to a customer (CEO, CFO) I'm not going to tell them about SCSI UNMAP, I'm going to say the new system has a new feature that enables ... or solves the ___ problem...
Customer doesn't *necessarily* have to be as clueless as CEO/CFO. Perhaps just another IT person, or whatever.
It is a mechanism for part of the storage system above the "disk" (eg
ZFS) to inform the "disk" that it is no longer using a given set of blocks.

This is useful when using an SSD - see Saso's excellent response on that.

However it can also be very useful when your "disk" is an iSCSI LUN. It
allows the filesystem layer (eg ZFS or NTFS, etc), when on an iSCSI LUN
that advertises SCSI UNMAP, to tell the target that there are blocks in
that LUN it isn't using any more (eg it just deleted some blocks).

This means you can get more accurate space usage when using things like
iSCSI.

ZFS in Solaris 11.1 issues SCSI UNMAP to devices that support it, and
ZVOLs exported over COMSTAR advertise it too.

In the iSCSI case it is mostly about improved space accounting and
utilisation. This is particularly interesting with ZFS when snapshots
and clones of ZVOLs come into play.

Some vendors call this (and things like it) "Thin Provisioning", I'd say
it is more "accurate communication between 'disk' and filesystem" about
in use blocks.
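(For instance - names and sizes hypothetical - a thin-provisioned
zvol exported over COMSTAR would be created along these lines:

    # -s makes the zvol sparse (no reservation), so pool space is only
    # consumed as blocks are written - and can be given back when the
    # initiator's filesystem issues UNMAP
    zfs create -s -V 100g tank/lun0
    stmfadm create-lu /dev/zvol/rdsk/tank/lun0
)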
--
Darren J Moffat
C***@oracle.com
2013-01-22 16:00:52 UTC
Permalink
Post by Darren J Moffat
Some vendors call this (and things like it) "Thin Provisioning", I'd say
it is more "accurate communication between 'disk' and filesystem" about
in use blocks.
In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is
much less than your peak usage.

Thin provisioning can now be used for zpools as long as the underlying
LUNs have support for SCSI UNMAP


Casper
Sašo Kiselkov
2013-01-22 16:02:16 UTC
Permalink
Post by C***@oracle.com
Post by Darren J Moffat
Some vendors call this (and things like it) "Thin Provisioning", I'd say
it is more "accurate communication between 'disk' and filesystem" about
in use blocks.
In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is
much less than your peak usage.
Thin provisioning can now be used for zpools as long as the underlying
LUNs have support for SCSI UNMAP
Looks like an interesting technical solution to a political problem :D

Cheers,
--
Saso
Darren J Moffat
2013-01-22 16:34:58 UTC
Permalink
Post by Sašo Kiselkov
Post by C***@oracle.com
Post by Darren J Moffat
Some vendors call this (and things like it) "Thin Provisioning", I'd say
it is more "accurate communication between 'disk' and filesystem" about
in use blocks.
In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is
much less than your peak usage.
Thin provisioning can now be used for zpools as long as the underlying
LUNs have support for SCSI UNMAP
Looks like an interesting technical solution to a political problem :D
There is also a technical problem: if you can't inform the backing
store that you no longer need the blocks, it can't free them either,
so they get stuck in snapshots unnecessarily.
--
Darren J Moffat
Sašo Kiselkov
2013-01-22 17:10:43 UTC
Permalink
Post by Darren J Moffat
Post by Sašo Kiselkov
Post by C***@oracle.com
Post by Darren J Moffat
Some vendors call this (and things like it) "Thin Provisioning", I'd say
it is more "accurate communication between 'disk' and filesystem" about
in use blocks.
In some cases, users of disks are charged by bytes in use; when not using
SCSI UNMAP, a set of disks used for a zpool will in the end be charged for
the whole reservation; this becomes costly when your standard usage is
much less than your peak usage.
Thin provisioning can now be used for zpools as long as the underlying
LUNs have support for SCSI UNMAP
Looks like an interesting technical solution to a political problem :D
There is also a technical problem: if you can't inform the backing
store that you no longer need the blocks, it can't free them either,
so they get stuck in snapshots unnecessarily.
Yes, I understand the technical merit of the solution. I'm just amused
that a noticeable side-effect is lower licensing costs (by that I don't
of course mean that the issue is unimportant, just that I find it
interesting what the world has come to) - I'm not trying to ridicule.

Cheers,
--
Saso
Ian Collins
2013-01-22 21:58:14 UTC
Permalink
Post by Darren J Moffat
It is a mechanism for part of the storage system above the "disk" (eg
ZFS) to inform the "disk" that it is no longer using a given set of blocks.
This is useful when using an SSD - see Saso's excellent response on that.
However it can also be very useful when your "disk" is an iSCSI LUN. It
allows the filesystem layer (eg ZFS or NTFS, etc), when on an iSCSI LUN
that advertises SCSI UNMAP, to tell the target that there are blocks in
that LUN it isn't using any more (eg it just deleted some blocks).
That is something I have been waiting a long time for! I have to run a
periodic "fill the pool with zeros" cycle on a couple of iSCSI backed
pools to reclaim free space.
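(The workaround being, roughly - pool name hypothetical, and it only
helps if the zeros actually reach the backing store, i.e. with
compression off on that dataset:

    # fill the free space with zeros so the backing storage can
    # recognise and reclaim it, then delete the filler file
    dd if=/dev/zero of=/tank/zerofill bs=1024k
    sync
    rm /tank/zerofill
)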

I guess the big question is: do Oracle storage appliances advertise
SCSI UNMAP?
--
Ian.
Jim Klimov
2013-01-21 00:40:18 UTC
Permalink
Post by Edward Harvey
But, by talking about it, we're just smoking pipe dreams. Cuz we all know zfs is developmentally challenged now. But one can dream...
I beg to disagree. Most of my contribution so far has been about
learning stuff and sharing it with others, as well as planting some
new ideas and (hopefully constructively) doubting others - including
the implementation we have now - and I have yet to see someone pick
up my ideas and turn them into code (or prove why they are rubbish).
Still, overall I can't say that development has stagnated, by any
metric of activity.

Yes, maybe there were more "cool new things" per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.

As a loud example of present active development - take the LZ4
quests completed by Saso recently. From what I gather, this is a
single man's job done "on-line" in the view of fellow list members
over a few months, almost like a reality-show; and I guess anyone
with enough concentration, time and devotion could do likewise.

I suspect many of my proposals to the list might also take about
half a man-year to complete. Unfortunately for the community and for
part of myself, I now have some higher daily priorities, so I likely
won't sit down and code lots of stuff in the next few years (until
that Priority goes to school, or so). Maybe
that's why I'm eager to suggest quests for brilliant coders here
who can complete the job better and faster than I ever would ;)
So I'm doing the next best things I can do to help the progress :)

And I don't believe this is in vain, that the development ceased
and my writings are only destined to be "stuffed under the carpet".
Be it these RFEs or some others, better and more useful, I believe
they shall be coded and published in common ZFS code. Sometime...

//Jim
Bob Friesenhahn
2013-01-22 15:27:18 UTC
Permalink
Post by Jim Klimov
Yes, maybe there were more "cool new things" per year popping up
with Sun's concentrated engineering talent and financing, but now
it seems that most players - wherever they work now - took a pause
from the marathon, to refine what was done in the decade before.
And this is just as important as churning out innovations faster
than people can comprehend or audit or use them.
I am on most of the mailing lists where zfs is discussed and it is
clear that significant issues/bugs are continually being discovered
and fixed. Fixes come from both the Illumos community and from
outside it (e.g. from FreeBSD).

Zfs is already quite feature rich. Many of us would lobby for
bug fixes and performance improvements over "features".

Sašo Kiselkov's LZ4 compression additions may qualify as "features"
yet they also offer rather profound performance improvements.

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tomas Forsman
2013-01-20 18:55:30 UTC
Permalink
Post by Jim Klimov
Hello all,
While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact "unique" - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.
Another RFE would be 'zfs dedup mypool/somefs' and basically go through
and do a one-shot dedup. Would be useful in various scenarios. Possibly
go through the entire pool at once, to make dedups intra-datasets (like
"the real thing").

/Tomas
--
Tomas Forsman, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Jim Klimov
2013-01-20 21:58:24 UTC
Permalink
Post by Tomas Forsman
Post by Jim Klimov
Hello all,
While revising my home NAS which had dedup enabled before I gathered
that its RAM capacity was too puny for the task, I found that there is
some deduplication among the data bits I uploaded there (makes sense,
since it holds backups of many of the computers I've worked on - some
of my homedirs' contents were bound to intersect). However, a lot of
the blocks are in fact "unique" - have entries in the DDT with count=1
and the blkptr_t bit set. In fact they are not deduped, and with my
pouring of backups complete - they are unlikely to ever become deduped.
Another RFE would be 'zfs dedup mypool/somefs' and basically go through
and do a one-shot dedup. Would be useful in various scenarios. Possibly
go through the entire pool at once, to make dedups intra-datasets (like
"the real thing").
Yes, but that was asked before =)

Actually, the pool's metadata does contain all the needed bits (i.e.
checksum and size of blocks), such that a scrub-like procedure could
try to find identical blocks among the unique ones (perhaps with a
filter of "this" block being referenced from a dataset that currently
wants dedup), throw one out and add a DDT entry to the other.
Post by Tomas Forsman
So ... The way things presently are, ideally you would know in
advance what stuff you were planning to write that has duplicate
copies. You could enable dedup, then write all the stuff that's
highly duplicated, then turn off dedup and write all the
non-duplicate stuff. Obviously, however, this is a fairly
implausible actual scenario.
Well, I guess I could script a solution that uses ZDB to dump the
blockpointer tree (about 100Gb of text on my system), and some
perl or sort/uniq/grep parsing over this huge text to find blocks
that are the same but not deduped - as well as those single-copy
"deduped" ones, and toggle the dedup property while rewriting the
block inside its parent file with DD.

This would all be within current ZFS's capabilities and ultimately
reach the goals of deduping pre-existing data as well as dropping
unique blocks from the DDT. It would certainly not be a real-time
solution (likely might take months on my box - just fetching the
BP tree took a couple of days) and would require more resources
than needed otherwise (rewrites of same userdata, storing and
parsing of addresses as text instead of binaries, etc.)

But I do see how this is doable even today, even by a non-expert ;)
(Not sure I'd ever get around to actually doing it this way, though -
it is not a very "clean" solution, nor a performant one).
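(Very roughly, such a pipeline might look like the sketch below -
zdb's verbose output format differs between builds, and the pool,
dataset, file and block-number values are hypothetical, so treat this
purely as an illustration of the workflow, not a recipe:

    # dump the block-pointer tree, checksums included (huge output!)
    zdb -bbbbbb mypool > /var/tmp/bptree.txt

    # extract the checksum of every block and list checksums that
    # occur more than once - i.e. candidate duplicates
    sed -n 's/.*cksum=\([0-9a-fx:]*\).*/\1/p' /var/tmp/bptree.txt \
        | sort | uniq -d > /var/tmp/dup-cksums.txt

    # with dedup temporarily enabled on the dataset, rewrite the
    # affected block of the owning file in place (recordsize 128k
    # assumed) so the normal write path stores it as a deduped block
    zfs set dedup=on mypool/somefs
    N=1234    # hypothetical block index of the candidate within the file
    dd if=/mypool/somefs/somefile of=/mypool/somefs/somefile \
       bs=128k skip=$N seek=$N count=1 conv=notrunc
)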

As a bonus, however, this ZDB dump would also provide an answer
to a frequently-asked question: "which files on my system intersect
or are the same - and have some/all blocks in common via dedup?"
Knowledge of this answer might help admins with some policy
decisions, be it a witch-hunt for hoarders of duplicate files or some
pattern-making to determine which datasets should keep "dedup=on"...

My few cents,
//Jim