Discussion:
enterprise scale redundant Solaris 10/ZFS server providing NFSv4/CIFS
Paul B. Henson
2007-09-19 23:27:48 UTC
We are looking for a replacement enterprise file system to handle storage
needs for our campus. For the past 10 years, we have been happily using DFS
(the distributed file system component of DCE), but unfortunately IBM
killed off that product and we have been running without support for over a
year now. We have looked at a variety of possible options, none of which
have proven fruitful. We are currently investigating the possibility of a
Solaris 10/ZFS implementation. I have done a fair amount of reading and
perusal of the mailing list archives, but I apologize in advance if I ask
anything I should have already found in a FAQ or other repository.

Basically, we are looking to provide initially 5 TB of usable storage,
potentially scaling up to 25-30TB of usable storage after successful
initial deployment. We would have approximately 50,000 user home
directories and perhaps 1000 shared group storage directories. Access to
this storage would be via NFSv4 for our UNIX infrastructure, and CIFS for
those annoying Windows systems you just can't seem to get rid of ;).

I read that initial versions of ZFS had scalability issues with such a
large number of file systems, resulting in extremely long boot times and
other problems. Supposedly a lot of those problems have been fixed in the
latest versions of OpenSolaris, and many of the fixes have been backported
to the official Solaris 10 update 4? Will that version of Solaris
reasonably support 50 odd thousand ZFS file systems?

I saw a couple of threads in the mailing list archives regarding NFS not
transitioning file system boundaries, requiring each and every ZFS
filesystem (50 thousand-ish in my case) to be exported and mounted on the
client separately. While that might be feasible with an automounter, it
doesn't really seem desirable or efficient. It would be much nicer to
simply have one mount point on the client with all the home directories
available underneath it. I was wondering whether or not that would be
possible with the NFSv4 pseudo-root feature. I saw one posting that
indicated it might be, but it wasn't clear whether or not that was a
current feature or something yet to be implemented. I have no requirements
to support legacy NFSv2/3 systems, so a solution only available via NFSv4
would be acceptable.

I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.

Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
users, as well as providing an interface for helpdesk staff to modify
things such as quota. I'd be willing to implement some type of web service
on the actual server if there is no native remote management; in that case,
is there any way to directly configure ZFS via a programmatic API, as
opposed to running binaries and parsing the output? Some type of perl
module would be perfect.

We need high availability, so are looking at Sun Cluster. That seems to add
an extra layer of complexity <sigh>, but there's no way I'll get signoff on
a solution without redundancy. It would appear that ZFS failover is
supported with the latest version of Solaris/Sun Cluster? I was speaking
with a Sun SE who claimed that ZFS would actually operate active/active in
a cluster, simultaneously writable by both nodes. From what I had read, ZFS
is not a cluster file system, and would only operate in the active/passive
failover capacity. Any comments?

The SE also told me that Sun Cluster requires hardware raid, which
conflicts with the general recommendation to feed ZFS raw disk. It seems
such a configuration would either require configuring vdevs directly on the
raid LUNs, losing ZFS self-healing and checksum correction features, or
losing space to not only the hardware raid level, but a partially redundant
ZFS level as well. What is the general consensus on the best way to deploy
ZFS under a cluster using hardware raid?

Any other thoughts/comments on the feasibility or practicality of a
large-scale ZFS deployment like this?

Thanks much...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Richard Elling
2007-09-20 18:36:12 UTC
a few comments below...
Post by Paul B. Henson
We are looking for a replacement enterprise file system to handle storage
needs for our campus. For the past 10 years, we have been happily using DFS
(the distributed file system component of DCE), but unfortunately IBM
killed off that product and we have been running without support for over a
year now. We have looked at a variety of possible options, none of which
have proven fruitful. We are currently investigating the possibility of a
Solaris 10/ZFS implementation. I have done a fair amount of reading and
perusal of the mailing list archives, but I apologize in advance if I ask
anything I should have already found in a FAQ or other repository.
Basically, we are looking to provide initially 5 TB of usable storage,
potentially scaling up to 25-30TB of usable storage after successful
initial deployment. We would have approximately 50,000 user home
directories and perhaps 1000 shared group storage directories. Access to
this storage would be via NFSv4 for our UNIX infrastructure, and CIFS for
those annoying Windows systems you just can't seem to get rid of ;).
50,000 directories aren't a problem, unless you also need 50,000 quotas and
hence 50,000 file systems. Such a large, single storage pool system will
be an outlier... significantly beyond what we have real world experience
with.
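Just to be concrete, a quota per user in ZFS terms means a dataset per
user, along these lines (pool, user, and quota values purely illustrative):

  # one dataset per user, with the quota set on the dataset
  zfs create tank/home/jdoe
  zfs set quota=500m tank/home/jdoe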
Post by Paul B. Henson
I read that initial versions of ZFS had scalability issues with such a
large number of file systems, resulting in extremely long boot times and
other problems. Supposedly a lot of those problems have been fixed in the
latest versions of OpenSolaris, and many of the fixes have been backported
to the official Solaris 10 update 4? Will that version of Solaris
reasonably support 50 odd thousand ZFS file systems?
There have been improvements in performance and usability. Not all
performance problems were in ZFS, but large numbers of file systems exposed
other problems. However, I don't think that this has been characterized.
Post by Paul B. Henson
I saw a couple of threads in the mailing list archives regarding NFS not
transitioning file system boundaries, requiring each and every ZFS
filesystem (50 thousand-ish in my case) to be exported and mounted on the
client separately. While that might be feasible with an automounter, it
doesn't really seem desirable or efficient. It would be much nicer to
simply have one mount point on the client with all the home directories
available underneath it. I was wondering whether or not that would be
possible with the NFSv4 pseudo-root feature. I saw one posting that
indicated it might be, but it wasn't clear whether or not that was a
current feature or something yet to be implemented. I have no requirements
to support legacy NFSv2/3 systems, so a solution only available via NFSv4
would be acceptable.
I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.
This work is done and, AFAIK, has been integrated into S10 8/07.
Post by Paul B. Henson
Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
users, as well as providing an interface for helpdesk staff to modify
things such as quota. I'd be willing to implement some type of web service
on the actual server if there is no native remote management; in that case,
is there any way to directly configure ZFS via a programmatic API, as
opposed to running binaries and parsing the output? Some type of perl
module would be perfect.
This is a loaded question. There is a webconsole interface to ZFS which can
be run from most browsers. But I think you'll find that the CLI is easier
for remote management.
Post by Paul B. Henson
We need high availability, so are looking at Sun Cluster. That seems to add
an extra layer of complexity <sigh>, but there's no way I'll get signoff on
a solution without redundancy. It would appear that ZFS failover is
supported with the latest version of Solaris/Sun Cluster? I was speaking
with a Sun SE who claimed that ZFS would actually operate active/active in
a cluster, simultaneously writable by both nodes. From what I had read, ZFS
is not a cluster file system, and would only operate in the active/passive
failover capacity. Any comments?
Active/passive only. ZFS is not supported over pxfs and ZFS cannot be
mounted simultaneously from two different nodes.

For most large file servers, people will split the file systems across
servers such that under normal circumstances, both nodes are providing
file service. This implies two or more storage pools.
Post by Paul B. Henson
The SE also told me that Sun Cluster requires hardware raid, which
conflicts with the general recommendation to feed ZFS raw disk. It seems
such a configuration would either require configuring vdevs directly on the
raid LUNs, losing ZFS self-healing and checksum correction features, or
losing space to not only the hardware raid level, but a partially redundant
ZFS level as well. What is the general consensus on the best way to deploy
ZFS under a cluster using hardware raid?
The SE is mistaken. Sun^H^Holaris Cluster supports a wide variety of
JBOD and RAID array solutions. For ZFS, I recommend a configuration which
allows ZFS to repair corrupted data.
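For example (device names below are made up), presenting two LUNs from the
array and letting ZFS mirror them gives ZFS a redundant copy to repair from:

  # two hardware-RAID LUNs, mirrored again at the ZFS level
  zpool create tank mirror c2t0d0 c3t0d0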
Post by Paul B. Henson
Any other thoughts/comments on the feasibility or practicality of a
large-scale ZFS deployment like this?
For today, quotas would be the main hurdle.
I've read some blogs where people put UFS on ZFS zvols to overcome the
quota problem. However, that seems to be too complicated for me, especially
when high service availability is important.
-- richard
Paul B. Henson
2007-09-20 19:49:29 UTC
Post by Richard Elling
50,000 directories aren't a problem, unless you also need 50,000 quotas
and hence 50,000 file systems. Such a large, single storage pool system
will be an outlier... significantly beyond what we have real world
experience with.
Yes, considering that 45,000 of those users will be students, we definitely
need separate quotas for each one :).

Hmm, I get a bit of a shiver down my spine at the prospect of deploying a
critical central service in a relatively untested configuration 8-/. What
is the maximum number of file systems in a given pool that has undergone
some reasonable amount of real world deployment?

One issue I have is that our previous filesystem, DFS, completely spoiled
me with its global namespace and location transparency. We had three fairly
large servers, with the content evenly dispersed among them, but from the
perspective of the client any user's files were available at
/dfs/user/<username>, regardless of which physical server they resided on.
We could even move them around between servers transparently.

Unfortunately, there aren't really any filesystems available with similar
features and enterprise applicability. OpenAFS comes closest; we've been
prototyping it, but the lack of per-file ACLs bites, and as an add-on
product we've had issues with kernel compatibility across upgrades.

I was hoping to replicate a similar feel by just having one large file
server with all the data on it. If I split our user files across multiple
servers, we would have to worry about which server contained what files,
which would be rather annoying.

There are some features in NFSv4 that seem like they might someday help
resolve this problem, but I don't think they are readily available in
servers and definitely not in the common client.
Post by Richard Elling
Post by Paul B. Henson
I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.
This work is done and, AFAIK, has been integrated into S10 8/07.
Excellent. I did a little further research myself on the Samba mailing
lists, and it looks like ZFS ACL support was merged into the official
3.0.26 release. Unfortunately, the patch to improve shadow copy performance
on top of ZFS still appears to be floating around the technical mailing
list under discussion.
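So presumably a share would end up looking something like the sketch below,
assuming the module lands under the name zfsacl and using the stock
shadow_copy module until that patch settles (paths made up):

  [homes]
      path = /export/home/%u
      vfs objects = zfsacl shadow_copy
      read only = no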
Post by Richard Elling
Post by Paul B. Henson
Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
[...]
Post by Richard Elling
This is a loaded question. There is a webconsole interface to ZFS which can
be run from most browsers. But I think you'll find that the CLI is easier
for remote management.
Perhaps I should have been more clear -- a remote facility available via
programmatic access, not manual user direct access. If I wanted to do
something myself, I would absolutely login to the system and use the CLI.
However, the question was regarding an automated process. For example, our
Perl-based identity management system might create a user in the middle of
the night based on the appearance in our authoritative database of that
user's identity, and need to create a ZFS filesystem and quota for that
user. So, I need to be able to manipulate ZFS remotely via a programmatic
API.
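Right now the best I can see is having the provisioning host wrap the CLI
over ssh, along the lines of this sketch (hostname, pool, and quota values
made up):

  ssh filer zfs create tank/home/jdoe
  ssh filer zfs set quota=500m tank/home/jdoe
  # -H produces script-friendly output: no headers, tab-delimited
  ssh filer zfs get -H -o value quota tank/home/jdoe

That works, but shelling out and parsing CLI output is exactly what I was
hoping to avoid.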
Post by Richard Elling
Active/passive only. ZFS is not supported over pxfs and ZFS cannot be
mounted simultaneously from two different nodes.
That's what I thought, I'll have to get back to that SE. Makes me wonder as
to the reliability of his other answers :).
Post by Richard Elling
For most large file servers, people will split the file systems across
servers such that under normal circumstances, both nodes are providing
file service. This implies two or more storage pools.
Again though, that would imply two different storage locations visible to
the clients? I'd really rather avoid that. For example, with our current
Samba implementation, a user can just connect to
'\\files.csupomona.edu\<username>' to access their home directory or
'\\files.csupomona.edu\<groupname>' to access a shared group directory.
They don't need to worry on which physical server it resides or determine
what server name to connect to.
Post by Richard Elling
The SE is mistaken. Sun^H^Holaris Cluster supports a wide variety of
JBOD and RAID array solutions. For ZFS, I recommend a configuration
which allows ZFS to repair corrupted data.
That would also be my preference, but if I were forced to use hardware
RAID, the additional loss of storage for ZFS redundancy would be painful.

Would anyone happen to have any good recommendations for an enterprise
scale storage subsystem suitable for ZFS deployment? If I recall correctly,
the SE we spoke with recommended the StorageTek 6140 in a hardware raid
configuration, and evidently mistakenly claimed that Cluster would not work
with JBOD.

Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
James F. Hranicky
2007-09-20 20:05:33 UTC
Post by Paul B. Henson
One issue I have is that our previous filesystem, DFS, completely spoiled
me with its global namespace and location transparency. We had three fairly
large servers, with the content evenly dispersed among them, but from the
perspective of the client any user's files were available at
/dfs/user/<username>, regardless of which physical server they resided on.
We could even move them around between servers transparently.
This can be solved using an automounter as well. All home directories
are specified as

/nfs/home/user

in the passwd map, then have a homes map that maps

/nfs/home/user -> /nfs/homeXX/user

then have a map that maps

/nfs/homeXX -> serverXX:/export/homeXX

You can have any number of servers serving up any number of homes
filesystems. Moving users between servers means only changing the
mapping in the homes map. The user never knows the difference, only
seeing the homedir as

/nfs/home/user

(we used amd)
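Concretely, with the Solaris automounter the indirect map boils down to
something like this (server and user names made up; amd syntax differs a
bit, but the idea is the same):

  # /etc/auto_master
  /nfs/home    auto_home

  # auto_home map: one entry per user; moving a user between servers
  # means editing only that user's entry
  jdoe      server01:/export/home01/jdoe
  asmith    server02:/export/home02/asmith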
Post by Paul B. Henson
Again though, that would imply two different storage locations visible to
the clients? I'd really rather avoid that. For example, with our current
Samba implementation, a user can just connect to
'\\files.csupomona.edu\<username>' to access their home directory or
'\\files.csupomona.edu\<groupname>' to access a shared group directory.
They don't need to worry on which physical server it resides or determine
what server name to connect to.
Samba can be configured to map homes drives to /nfs/home/%u . Let samba use
the automounter setup and it's just as transparent on the CIFS side.
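i.e. something like this in smb.conf (a sketch; %u expands to the
connecting user):

  [homes]
      path = /nfs/home/%u
      browseable = no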

This is how we had things set up at my previous place of employment and
it worked extremely well. Unfortunately, due to lack of BSD-style quotas
and due to the fact that snapshots counted toward ZFS quota, I decided
against using ZFS for filesystem service -- the automounter setup cannot
mitigate the bunches-of-little-filesystems problem.

Jim
Paul B. Henson
2007-09-20 22:17:52 UTC
Post by James F. Hranicky
This can be solved using an automounter as well.
Well, I'd say more "kludged around" than "solved" ;), but again unless
you've used DFS it might not seem that way.

It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.

I might need to pursue a similar route though if I can't get one large
system to house everything in one place.
Post by James F. Hranicky
Samba can be configured to map homes drives to /nfs/home/%u . Let samba use
the automounter setup and it's just as transparent on the CIFS side.
I'm planning to use NFSv4 with strong authentication and authorization
throughout, and intend to run Samba directly on the file server itself,
accessing storage locally. I'm not sure that Samba would be able to acquire
local Kerberos credentials and switch between them for the users; without
that, access via NFSv4 isn't very doable.
Post by James F. Hranicky
and due to the fact that snapshots counted toward ZFS quota, I decided
Yes, that does seem to remove a bit of their value for backup purposes. I
think they're planning to rectify that at some point in the future.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Chris Kirby
2007-09-20 23:10:36 UTC
Post by Paul B. Henson
Post by James F. Hranicky
and due to the fact that snapshots counted toward ZFS quota, I decided
Yes, that does seem to remove a bit of their value for backup purposes. I
think they're planning to rectify that at some point in the future.
We're adding a style of quota that only includes the bytes
referenced by the active fs. Also, there will be a matching
style for reservations.
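Roughly, the difference will look like this (using refquota as a
placeholder name, since the final property names aren't set in stone yet):

  zfs set quota=1g tank/home/jdoe     # counts snapshots and descendants too
  zfs set refquota=1g tank/home/jdoe  # counts only data referenced by the active fs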

"some point in the future" is very soon (weeks). :-)

-Chris
Paul B. Henson
2007-09-20 23:33:23 UTC
Post by Chris Kirby
We're adding a style of quota that only includes the bytes referenced by
the active fs. Also, there will be a matching style for reservations.
"some point in the future" is very soon (weeks). :-)
I don't think my management will let me run Solaris Express on a production
server ;) -- how does that translate into availability in a
released/supported version? Would that be something released as a patch to
the just-made-available U4, or delayed until the next complete update
release?
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
James F. Hranicky
2007-09-21 16:26:16 UTC
Post by Paul B. Henson
Post by James F. Hranicky
This can be solved using an automounter as well.
Well, I'd say more "kludged around" than "solved" ;), but again unless
you've used DFS it might not seem that way.
Hey, I liked it :->
Post by Paul B. Henson
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
Well, that's why ZFS wouldn't work for us :-( .
Post by Paul B. Henson
I might need to pursue a similar route though if I can't get one large
system to house everything in one place.
Post by James F. Hranicky
Samba can be configured to map homes drives to /nfs/home/%u . Let samba use
the automounter setup and it's just as transparent on the CIFS side.
I'm planning to use NFSv4 with strong authentication and authorization
throughout, and intend to run Samba directly on the file server itself,
accessing storage locally. I'm not sure that Samba would be able to acquire
local Kerberos credentials and switch between them for the users; without
that, access via NFSv4 isn't very doable.
Makes sense -- in that case you would be looking at multiple SMB
servers, though.

Jim
Paul B. Henson
2007-09-22 00:25:37 UTC
Post by James F. Hranicky
Post by Paul B. Henson
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
Well, that's why ZFS wouldn't work for us :-( .
Although, I'm just saying that from my gut -- does anyone have any actual
experience with automounting thousands of file systems? Does it work? Is it
horribly inefficient? Poor performance? Resource intensive?
Post by James F. Hranicky
Makes sense -- in that case you would be looking at multiple SMB servers,
though.
Yes, with again the resultant problem of worrying about where a user's
files are when they want to access them :(.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Peter Tribble
2007-09-22 10:28:56 UTC
Post by Paul B. Henson
Post by James F. Hranicky
Post by Paul B. Henson
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
Well, that's why ZFS wouldn't work for us :-( .
Although, I'm just saying that from my gut -- does anyone have any actual
experience with automounting thousands of file systems? Does it work? Is it
horribly inefficient? Poor performance? Resource intensive?
Used to do this for years with 20,000 filesystems automounted - each user
home directory was automounted separately. Never caused any problems,
either with NIS+ or the automounter or the NFS clients and server. And much
of the time that was with hardware that would today be antique. So I wouldn't
expect any issues on the automounting part. [Except one - see later.]

That was with a relatively small number of ufs filesystems on the server
holding the data. When we first got hold of zfs I did try the exercise of
one zfs filesystem per user on the server, just to see how it would work.
While managing 20,000 filesystems with the automounter was trivial, the
attempt to manage 20,000 zfs filesystems wasn't entirely successful. In
fact, based on that experience I simply wouldn't go down the road of one
user per filesystem.

[There is one issue with automounting large number of filesystems on
a Solaris 10 client. Every mount or unmount triggers SMF activity, and
can drive SMF up the wall. We saw one of the svc daemons hog a whole
cpu on our mailserver (constantly checking for .forward files in user home
directories). This has been fixed, I believe, but only very recently in S10.]
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Paul B. Henson
2007-09-24 21:51:43 UTC
Post by Peter Tribble
filesystem per user on the server, just to see how it would work. While
managing 20,000 filesystems with the automounter was trivial, the attempt
to manage 20,000 zfs filesystems wasn't entirely successful. In fact,
based on that experience I simply wouldn't go down the road of one user
per filesystem.
Really? Could you provide further detail about what problems you
experienced? Our current filesystem based on DFS effectively utilizes a
separate filesystem per user (although in DFS terminology they are called
filesets), and we've never had a problem managing them.
Post by Peter Tribble
directories). This has been fixed, I believe, but only very recently in S10.]
As long as the fix has been included in U4 we should be good...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Peter Tribble
2007-09-25 21:27:56 UTC
Post by Paul B. Henson
Post by Peter Tribble
filesystem per user on the server, just to see how it would work. While
managing 20,000 filesystems with the automounter was trivial, the attempt
to manage 20,000 zfs filesystems wasn't entirely successful. In fact,
based on that experience I simply wouldn't go down the road of one user
per filesystem.
Really? Could you provide further detail about what problems you
experienced? Our current filesystem based on DFS effectively utilizes a
separate filesystem per user (although in DFS terminology they are called
filesets), and we've never had a problem managing them.
This was some time ago (a very long time ago, actually). There are two
fundamental problems:

1. Each zfs filesystem consumes kernel memory. Significant amounts, 64K
is what we worked out at the time. For normal numbers of filesystems that's
not a problem; multiply it by tens of thousands and you start to hit serious
resource usage.

2. The zfs utilities didn't scale well as the number of filesystems increased.

I just kept on issuing zfs create until the machine had had enough. It got
through the first 10,000 without too much difficulty (as I recall that took
several hours), but soon got bogged down after that, to the point where it
took a day to do anything. At that point (at about 15,000 filesystems on a
machine with 1 GB of memory) it ran out of kernel memory and died, and
wouldn't even boot.

I know that some work has gone into improving the performance of the
utilities, and things like in-kernel sharetab (we never even tried to
share all those filesystems) are there to improve scalability. Perhaps
I should find a spare machine and try repeating the experiment.
--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Paul B. Henson
2007-09-25 22:47:53 UTC
Post by Peter Tribble
This was some time ago (a very long time ago, actually). There are two
1. Each zfs filesystem consumes kernel memory. Significant amounts, 64K
is what we worked out at the time. For normal numbers of filesystems that's
not a problem; multiply it by tens of thousands and you start to hit serious
resource usage.
Every server we've bought for about the last year came with 4 GB of memory;
the servers we would deploy for this would have at least 8 GB, if not 16 GB.
At the 64K-per-filesystem figure you mention, 50,000 filesystems would only
be on the order of 3 GB, so given the downtrend in memory prices, hopefully
memory would not be an issue.
Post by Peter Tribble
2. The zfs utilities didn't scale well as the number of filesystems increased.
[...]
Post by Peter Tribble
share all those filesystems) are there to improve scalability. Perhaps
I should find a spare machine and try repeating the experiment.
There have supposedly been lots of improvements in scalability, based on my
review of mailing list archives. If you do find the time to experiment
again, I'd appreciate hearing what you find...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Jonathan Loran
2007-09-23 04:59:09 UTC
Paul,

My gut tells me that you won't have much trouble mounting 50K file
systems with ZFS. But who knows until you try. My question for you is:
can you lab this out? You could build a commodity server with a ZFS
pool on it. Heck, it could be a small pool, one disk, and then put your
50K file systems on that. Reboot, thrash about, and see what happens.
Then the next step would be fooling with the client side of things. If
you can get time on a chunk of your existing client systems, see if you
can mount a bunch of those 50K file systems smoothly. Off hours,
perhaps. The next problem, and to be honest this may be the killer, is
testing with your name service in the loop. You may need netgroups to
delineate permissions for your shares, and to define your automounter
maps. In my personal experience, with about 1-2% as many shares and
mount points as you need, the name servers get stressed out really
fast. There have been some issues around LDAP port reuse in Solaris
that can cause some headaches as well, but there are patches to help you
too. Also, as you may know, Linux doesn't play well with hundreds of
concurrent mount operations. If you use Linux NFS clients in your
environment, be sure to lab that out as well.
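A quick-and-dirty way to populate the test pool along those lines (pool
name and per-user quota purely illustrative):

  i=0
  while [ $i -lt 50000 ]; do
      zfs create testpool/home/user$i
      zfs set quota=100m testpool/home/user$i
      i=`expr $i + 1`
  done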

At any rate, you may indeed be an outlier with so many file systems and
NFS mounts, but I imagine many of us are waiting on the edge of our
seats to see if you can make it all work. Speaking for myself, I would
love to know how ZFS, NFS and LDAP scale up to such a huge system.

Regards,

Jon
Post by Paul B. Henson
Post by James F. Hranicky
Post by Paul B. Henson
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
Well, that's why ZFS wouldn't work for us :-( .
Although, I'm just saying that from my gut -- does anyone have any actual
experience with automounting thousands of file systems? Does it work? Is it
horribly inefficient? Poor performance? Resource intensive?
Post by James F. Hranicky
Makes sense -- in that case you would be looking at multiple SMB servers,
though.
Yes, with again the resultant problem of worrying about where a user's
files are when they want to access them :(.
--
- _____/ _____/ / - Jonathan Loran - -
- / / / IT Manager -
- _____ / _____ / / Space Sciences Laboratory, UC Berkeley
- / / / (510) 643-5146 ***@ssl.berkeley.edu
- ______/ ______/ ______/ AST:7731^29u18e3
Paul B. Henson
2007-09-24 21:58:49 UTC
Post by Jonathan Loran
My gut tells me that you won't have much trouble mounting 50K file
systems with ZFS. But who knows until you try. My question for you is:
can you lab this out?
Yeah, after this research phase has been completed, we're going to have to
go into a prototyping phase. I should be able to get funding for a half
dozen or so x4100 systems to play with. We standardized on those systems
for our Linux deployment.
Post by Jonathan Loran
test with your name service in the loop. You may need netgroups to
delineate permissions for your shares, and to define your automounter
maps.
We're planning to use NFSv4 with Kerberos authentication, so shouldn't need
netgroups. Tentatively I think I'd put automounter maps in LDAP, although
doing so for both Solaris and Linux at the same time based on a little
quick research seems possibly problematic.
Post by Jonathan Loran
Also, as you may know, Linux doesn't play well with hundreds of
concurrent mount operations. If you use Linux NFS clients in your
environment, be sure to lab that out as well.
I didn't know that -- we're currently using RHEL 4 and Gentoo distributions
for a number of services. I've done some initial testing of NFSv4, but
never tried lots of simultaneous mounts...
Post by Jonathan Loran
At any rate, you may indeed be an outlier with so many file systems and
NFS mounts, but I imagine many of us are waiting on the edge of our seats
to see if you can make it all work. Speaking for myself, I would love
to know how ZFS, NFS and LDAP scale up to such a huge system.
I don't necessarily mind being a pioneer, but not on this particular
project -- it has a rather high visibility and it would not be good for it
to blow chunks after deployment when use starts scaling up 8-/.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Jonathan Loran
2007-09-24 22:20:37 UTC
Post by Paul B. Henson
Post by Jonathan Loran
My gut tells me that you won't have much trouble mounting 50K file
systems with ZFS. But who knows until you try. My question for you is:
can you lab this out?
Yeah, after this research phase has been completed, we're going to have to
go into a prototyping phase. I should be able to get funding for a half
dozen or so x4100 systems to play with. We standardized on those systems
for our Linux deployment.
Post by Jonathan Loran
test with your name service in the loop. You may need netgroups to
delineate permissions for your shares, and to define your automounter
maps.
We're planning to use NFSv4 with Kerberos authentication, so shouldn't need
netgroups. Tentatively I think I'd put automounter maps in LDAP, although
doing so for both Solaris and Linux at the same time based on a little
quick research seems possibly problematic.
We finally got autofs maps via LDAP working smoothly with both Linux
(CentOS 4.x and 5.x) and Solaris (8,9,10). It took a lot of trial and
error. We settled on the Fedora Directory server, because that worked
across the board. I'm not the admin who did the leg work on that
though, so I can't really comment as to where we ran into problems. If
you want, I can find out more on that and respond off the list.
Post by Paul B. Henson
Post by Jonathan Loran
Also, as you may know, Linux doesn't play well with hundreds of
concurrent mount operations. If you use Linux NFS clients in your
environment, be sure to lab that out as well.
I didn't know that -- we're currently using RHEL 4 and Gentoo distributions
for a number of services. I've done some initial testing of NFSv4, but
never tried lots of simultaneous mounts...
Sort of an old problem, but using the insecure option in your
exports/shares and mount options helps. It may have been patched by now,
though. Too much Linux talk for this list ;)
Post by Paul B. Henson
Post by Jonathan Loran
At any rate, you may indeed be an outlier with so many file systems and
NFS mounts, but I imagine many of us are waiting on the edge of our seats
to see if you can make it all work. Speaking for myself, I would love
to know how ZFS, NFS and LDAP scale up to such a huge system.
I don't necessarily mind being a pioneer, but not on this particular
project -- it has a rather high visibility and it would not be good for it
to blow chunks after deployment when use starts scaling up 8-/.
Good luck.
--
- _____/ _____/ / - Jonathan Loran - -
- / / / IT Manager -
- _____ / _____ / / Space Sciences Laboratory, UC Berkeley
- / / / (510) 643-5146 ***@ssl.berkeley.edu
- ______/ ______/ ______/ AST:7731^29u18e3
Richard Elling
2007-09-24 17:35:14 UTC
Post by Paul B. Henson
Post by James F. Hranicky
Post by Paul B. Henson
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
Well, that's why ZFS wouldn't work for us :-( .
Although, I'm just saying that from my gut -- does anyone have any actual
experience with automounting thousands of file systems? Does it work? Is it
horribly inefficient? Poor performance? Resource intensive?
Yes. Sun currently has over 45,000 users with automounted home directories.
I do not know how many servers are involved, though, in part because home
directories are highly available services and thus their configuration is
abstracted away from the clients. Suffice it to say, there is more than one
server. Measured mount performance would vary based on where in the world
you were, so it probably isn't worth the effort.
-- richard
Paul B. Henson
2007-09-24 22:12:01 UTC
Post by Richard Elling
Yes. Sun currently has over 45,000 users with automounted home
directories. I do not know how many servers are involved, though, in part
because home directories are highly available services and thus their
configuration is abstracted away from the clients.
Hmm, highly available home directories -- that sounds like what I'm looking
for ;).

Any other Sun employees on the list that might be able to provide further
details of the internal Sun ZFS/NFS automounted home directory
implementation?
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Richard Elling
2007-09-24 17:21:33 UTC
Post by Paul B. Henson
Post by James F. Hranicky
This can be solved using an automounter as well.
Well, I'd say more "kludged around" than "solved" ;), but again unless
you've used DFS it might not seem that way.
It just seems rather involved, and relatively inefficient to continuously
be mounting/unmounting stuff all the time. One of the applications to be
deployed against the filesystem will be web service, I can't really
envision a web server with tens of thousands of NFS mounts coming and
going, seems like a lot of overhead.
I might need to pursue a similar route though if I can't get one large
system to house everything in one place.
I can't imagine a web server serving tens of thousands of pages. I think
you should put a more scalable architecture in place, if that is your goal.
BTW, there are many companies that do this: google, yahoo, etc. In no
case do they have a single file system or single server dishing out
thousands of sites.
-- richard
Paul B. Henson
2007-09-24 22:10:15 UTC
Post by Richard Elling
I can't imagine a web server serving tens of thousands of pages. I think
you should put a more scalable architecture in place, if that is your
goal. BTW, there are many companies that do this: google, yahoo, etc.
In no case do they have a single file system or single server dishing out
thousands of sites.
Our current implementation already serves tens of thousands of pages, and
it's for the most part running on 8-10 year old hardware. We have three
core DFS servers housing files, and three web servers serving content. The
only time we've ever had a problem was when we got Slashdot'd by a staff
member's personal project:

http://www.csupomona.edu/~jelerma/springfield/map/index.html


Other than that, it's been fine. I can't imagine that brand-new hardware
running a shiny new filesystem couldn't handle the same load 10-year-old
hardware has been handling. Although arguably, considering I can't find
anything equivalent feature-wise to DFS, perhaps the current offerings
aren't equivalent scalability-wise either :(...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Andy Lubel
2007-09-20 20:32:51 UTC
Post by Paul B. Henson
Post by Richard Elling
50,000 directories aren't a problem, unless you also need 50,000 quotas
and hence 50,000 file systems. Such a large, single storage pool system
will be an outlier... significantly beyond what we have real world
experience with.
Yes, considering that 45,000 of those users will be students, we definitely
need separate quotas for each one :).
Hmm, I get a bit of a shiver down my spine at the prospect of deploying a
critical central service in a relatively untested configuration 8-/. What
is the maximum number of file systems in a given pool that has undergone
some reasonable amount of real world deployment?
15,500 is the most I see in this article:

http://developers.sun.com/solaris/articles/nfs_zfs.html

Looks like it's completely scalable, but your boot time may suffer the more
you have. Just don't reboot :)
Post by Paul B. Henson
One issue I have is that our previous filesystem, DFS, completely spoiled
me with its global namespace and location transparency. We had three fairly
large servers, with the content evenly dispersed among them, but from the
perspective of the client any user's files were available at
/dfs/user/<username>, regardless of which physical server they resided on.
We could even move them around between servers transparently.
If it was so great why did IBM kill it? Did they have an alternative with
the same functionality?
Post by Paul B. Henson
Unfortunately, there aren't really any filesystems available with similar
features and enterprise applicability. OpenAFS comes closest; we've been
prototyping it, but the lack of per-file ACLs bites, and as an add-on
product we've had issues with kernel compatibility across upgrades.
I was hoping to replicate a similar feel by just having one large file
server with all the data on it. If I split our user files across multiple
servers, we would have to worry about which server contained what files,
which would be rather annoying.
There are some features in NFSv4 that seem like they might someday help
resolve this problem, but I don't think they are readily available in
servers and definitely not in the common client.
Post by Richard Elling
Post by Paul B. Henson
I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.
This work is done and, AFAIK, has been integrated into S10 8/07.
Excellent. I did a little further research myself on the Samba mailing
lists, and it looks like ZFS ACL support was merged into the official
3.0.26 release. Unfortunately, the patch to improve shadow copy performance
on top of ZFS still appears to be floating around the technical mailing
list under discussion.
Post by Richard Elling
Post by Paul B. Henson
Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
[...]
Post by Richard Elling
This is a loaded question. There is a webconsole interface to ZFS which can
be run from most browsers. But I think you'll find that the CLI is easier
for remote management.
Perhaps I should have been more clear -- a remote facility available via
programmatic access, not manual user direct access. If I wanted to do
something myself, I would absolutely login to the system and use the CLI.
However, the question was regarding an automated process. For example, our
Perl-based identity management system might create a user in the middle of
the night based on the appearance in our authoritative database of that
user's identity, and need to create a ZFS filesystem and quota for that
user. So, I need to be able to manipulate ZFS remotely via a programmatic
API.
Post by Richard Elling
Active/passive only. ZFS is not supported over pxfs and ZFS cannot be
mounted simultaneously from two different nodes.
That's what I thought, I'll have to get back to that SE. Makes me wonder as
to the reliability of his other answers :).
Post by Richard Elling
For most large file servers, people will split the file systems across
servers such that under normal circumstances, both nodes are providing
file service. This implies two or more storage pools.
Again though, that would imply two different storage locations visible to
the clients? I'd really rather avoid that. For example, with our current
Samba implementation, a user can just connect to
'\\files.csupomona.edu\<username>' to access their home directory or
'\\files.csupomona.edu\<groupname>' to access a shared group directory.
They don't need to worry on which physical server it resides or determine
what server name to connect to.
Post by Richard Elling
The SE is mistaken. Sun^H^Holaris Cluster supports a wide variety of
JBOD and RAID array solutions. For ZFS, I recommend a configuration
which allows ZFS to repair corrupted data.
That would also be my preference, but if I were forced to use hardware
RAID, the additional loss of storage for ZFS redundancy would be painful.
Would anyone happen to have any good recommendations for an enterprise
scale storage subsystem suitable for ZFS deployment? If I recall correctly,
the SE we spoke with recommended the StorageTek 6140 in a hardware raid
configuration, and evidently mistakenly claimed that Cluster would not work
with JBOD.
I really have to disagree; we have 6120s and 6130s, and if I had the option
to actually plan out some storage I would have just bought a thumper. You
could probably buy two for the cost of that 6140.
Post by Paul B. Henson
Thanks...
-Andy Lubel
--
Tim Spriggs
2007-09-20 20:41:10 UTC
Post by Andy Lubel
Post by Paul B. Henson
That would also be my preference, but if I were forced to use hardware
RAID, the additional loss of storage for ZFS redundancy would be painful.
Would anyone happen to have any good recommendations for an enterprise
scale storage subsystem suitable for ZFS deployment? If I recall correctly,
the SE we spoke with recommended the StorageTek 6140 in a hardware raid
configuration, and evidently mistakenly claimed that Cluster would not work
with JBOD.
I really have to disagree; we have 6120s and 6130s, and if I had the option
to actually plan out some storage I would have just bought a thumper. You
could probably buy two for the cost of that 6140.
We are in a similar situation. It turns out that buying two thumpers is
cheaper per TB than buying more shelves for an IBM N7600. I don't know
about power/cooling considerations yet though.
Paul B. Henson
2007-09-20 22:37:28 UTC
Post by Tim Spriggs
We are in a similar situation. It turns out that buying two thumpers is
cheaper per TB than buying more shelves for an IBM N7600. I don't know
about power/cooling considerations yet though.
It's really a completely different class of storage though, right? I don't
know offhand what an IBM N7600 is, but presumably some type of SAN device?
Which can be connected simultaneously to multiple servers for clustering?

An x4500 looks great if you only want a bunch of storage with the
reliability/availability provided by a relatively fault-tolerant server.
But if you want to be able to withstand server failure, or continue to
provide service while having one server down for maintenance/patching, it
doesn't seem appropriate.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Tim Spriggs
2007-09-20 22:54:50 UTC
Post by Paul B. Henson
Post by Tim Spriggs
We are in a similar situation. It turns out that buying two thumpers is
cheaper per TB than buying more shelves for an IBM N7600. I don't know
about power/cooling considerations yet though.
It's really a completely different class of storage though, right? I don't
know offhand what an IBM N7600 is, but presumably some type of SAN device?
Which can be connected simultaneously to multiple servers for clustering?
An x4500 looks great if you only want a bunch of storage with the
reliability/availability provided by a relatively fault-tolerant server.
But if you want to be able to withstand server failure, or continue to
provide service while having one server down for maintenance/patching, it
doesn't seem appropriate.
It's an IBM re-branded NetApp which we are using for NFS and iSCSI.
Paul B. Henson
2007-09-20 23:31:37 UTC
Post by Tim Spriggs
It's an IBM re-branded NetApp which we are using for NFS and iSCSI.
Ah, I see.

Is it comparable storage though? Does it use SATA drives similar to the
x4500, or more expensive/higher performance FC drives? Is it one of the
models that allows connecting dual clustered heads and failing over the
storage between them?

I agree the x4500 is a sweet looking box, but when making price comparisons
sometimes it's more than just the raw storage... I wish I could just drop
in a couple of x4500's and not have to worry about the complexity of
clustering <sigh>...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Tim Spriggs
2007-09-21 00:51:52 UTC
Post by Paul B. Henson
Is it comparable storage though? Does it use SATA drives similar to the
x4500, or more expensive/higher performance FC drives? Is it one of the
models that allows connecting dual clustered heads and failing over the
storage between them?
I agree the x4500 is a sweet looking box, but when making price comparisons
sometimes it's more than just the raw storage... I wish I could just drop
in a couple of x4500's and not have to worry about the complexity of
clustering <sigh>...
It is configured with SATA drives and does support failover for NFS.
iSCSI is another story at the moment.

The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.

The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.

-Tim
Gino
2007-09-21 16:45:17 UTC
Post by Tim Spriggs
The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.
The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.
Hello Tim,
did you try SNV60+ or S10U4 ?

Gino


Tim Spriggs
2007-09-21 16:55:13 UTC
Post by Gino
Post by Tim Spriggs
The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.
The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.
Hello Tim,
did you try SNV60+ or S10U4 ?
Gino
Hi Gino,

We need Solaris proper for these systems, and we will have to
schedule significant downtime to patch up to U4.

-Tim
Paul B. Henson
2007-09-21 19:11:52 UTC
Post by Tim Spriggs
The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.
The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.
Lost data doesn't give me a warm fuzzy 8-/. Were you running an officially
supported version of Solaris at the time? If so, what did Sun support have
to say about this issue?
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Tim Spriggs
2007-09-21 19:50:57 UTC
Post by Paul B. Henson
Post by Tim Spriggs
The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.
The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.
Lost data doesn't give me a warm fuzzy 8-/. Were you running an officially
supported version of Solaris at the time? If so, what did Sun support have
to say about this issue?
Sol 10 with just about all patches up to date.

I joined this list in hope of a good answer. After answering a few
questions over two days I had no hope of recovering the data. Don't
import/export (especially between systems) without serious cause, at
least not with U3. I haven't tried updating our servers yet and I don't
intend to for a while now. The filesystems contained databases that were
luckily redundant and could be rebuilt, but our DBA was not too happy to
have to do that at 3:00am.

I still have a pool that can not be mounted or exported. It shows up
with zpool list but nothing under zfs list. zpool export gives me an IO
error and does nothing. On the next downtime I am going to attempt to
yank the lun out from under its feet (as gently as I can) after I have
stopped all other services.

Still, we are using ZFS, but we are re-thinking how to deploy/manage
it. Our original model had us exporting/importing pools in order to move
zone data between machines. We had done the same with UFS on iSCSI
without a hitch. ZFS worked for about 8 zone moves and then killed 2
zones. The major operational difference between the moves involved a
reboot of the global zones. The initial import worked but after the
reboot the pools were in a bad state reporting errors on both drives in
the mirror. I exported one (bad choice) and attempted to gain access to
the other. Now attempting to import the first pool will panic a
Solaris/OpenSolaris box very reliably. The second is in the state I
described above. Also, the drive labels are intact according to zdb.

When we don't move pools around, ZFS seems to be stable on both Solaris
and OpenSolaris. I've done snapshots/rollbacks/sends/receives/clones/...
without problems. We even have zvols exported as mirrored LUNs from an
OpenSolaris box. It mirrors the LUNs that the IBM/NetApp box exports and
seems to be doing fine with that. There are a lot of other people who
seem to have the same opinion and use ZFS with direct-attached storage.

-Tim

PS: "when I have a lot of time" I might try to reproduce this by:

m2# zpool create test mirror iscsi_lun1 iscsi_lun2
m2# zpool export test
m1# zpool import -f test
m1# reboot
m2# reboot
eric kustarz
2007-09-21 20:03:37 UTC
Permalink
Post by Tim Spriggs
Post by Paul B. Henson
Post by Tim Spriggs
The x4500 is very sweet and the only thing stopping us from buying two
instead of another shelf is the fact that we have lost pools on Sol10u3
servers and there is no easy way of making two pools redundant (ie the
complexity of clustering.) Simply sending incremental snapshots is not a
viable option.
The pools we lost were pools on iSCSI (in a mirrored config) and they
were mostly lost on zpool import/export. The lack of a recovery
mechanism really limits how much faith we can put into our data on ZFS.
It's safe as long as the pool is safe... but we've lost multiple pools.
Lost data doesn't give me a warm fuzzy 8-/. Were you running an officially
supported version of Solaris at the time? If so, what did Sun support have
to say about this issue?
Sol 10 with just about all patches up to date.
I joined this list in hope of a good answer. After answering a few
questions over two days I had no hope of recovering the data. Don't
import/export (especially between systems) without serious cause, at
least not with U3. I haven't tried updating our servers yet and I don't
intend to for a while now. The filesystems contained databases that were
luckily redundant and could be rebuilt, but our DBA was not too happy to
have to do that at 3:00am.
I still have a pool that cannot be mounted or exported. It shows up
with zpool list but nothing under zfs list. zpool export gives me an I/O
error and does nothing. On the next downtime I am going to attempt to
yank the LUN out from under its feet (as gently as I can) after I have
stopped all other services.
Still, we are using ZFS but we are re-thinking how to deploy/manage
it. Our original model had us exporting/importing pools in order to move
zone data between machines. We had done the same with UFS on iSCSI
without a hitch. ZFS worked for about 8 zone moves and then killed 2
zones. The major operational difference between the moves involved a
reboot of the global zones. The initial import worked but after the
reboot the pools were in a bad state reporting errors on both drives in
the mirror. I exported one (bad choice) and attempted to gain access to
the other. Now attempting to import the first pool will panic a
Solaris/OpenSolaris box very reliably. The second is in the state I
described above. Also, the drive labels are intact according to zdb.
When we don't move pools around, ZFS seems to be stable on both Solaris
and OpenSolaris. I've done snapshots/rollbacks/sends/receives/clones/...
without problems. We even have zvols exported as mirrored LUNs from an
OpenSolaris box. It mirrors the LUNs that the IBM/NetApp box exports and
seems to be doing fine with that. There are a lot of other people who
seem to have the same opinion and use ZFS with direct-attached storage.
-Tim
m2# zpool create test mirror iscsi_lun1 iscsi_lun2
m2# zpool export test
m1# zpool import -f test
m1# reboot
m2# reboot
Since I haven't actually looked into what problem caused your pools
to become damaged/lost, I can only guess that it's possibly due to the
pool being actively imported on multiple machines (perhaps even
accidentally).

If it is that, you'll be happy to note that we specifically no longer
allow that to happen (unless you use the -f flag):
http://blogs.sun.com/erickustarz/entry/poor_man_s_cluster_end
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725

Looks like it just missed the s10u4 cut off, but should be in s10_u5.

In your above example, there should be no reason why you have to use
the '-f' flag on import (the pool was cleanly exported) - when you're
moving the pool from system to system, this can get you into trouble
if things don't go exactly how you planned.
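(In other words, a clean hand-off between hosts should need nothing more
than something like:

m1# zpool export test
m2# zpool import test

with no -f anywhere; if the import complains at that point, the export
didn't actually complete cleanly and it's worth stopping to find out why.)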

eric
Tim Spriggs
2007-09-21 20:20:02 UTC
Permalink
Post by Tim Spriggs
m2# zpool create test mirror iscsi_lun1 iscsi_lun2
m2# zpool export test
m1# zpool import -f test
m1# reboot
m2# reboot
Since I haven't actually looked into what problem caused your pools to
become damaged/lost, I can only guess that it's possibly due to the
pool being actively imported on multiple machines (perhaps even
accidentally).
If it is that, you'll be happy to note that we specifically no longer
allow that to happen (unless you use the -f flag):
http://blogs.sun.com/erickustarz/entry/poor_man_s_cluster_end
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6282725
Looks like it just missed the s10u4 cut off, but should be in s10_u5.
In your above example, there should be no reason why you have to use
the '-f' flag on import (the pool was cleanly exported) - when you're
moving the pool from system to system, this can get you into trouble
if things don't go exactly how you planned.
eric
That's a very possible diagnosis. Even when the pools are exported from
one system, they are still marked as attached (thus the -f was
necessary). Since I rebooted both systems at the same time, I guess it's
possible that they both laid claim to the pool and corrupted it.

I'm glad this will be fixed in the future.

-Tim
Paul B. Henson
2007-09-22 00:34:37 UTC
Permalink
Post by Tim Spriggs
Still, we are using ZFS but we are re-thinking on how to deploy/manage
it. Our original model had us exporting/importing pools in order to move
zone data between machines. We had done the same with UFS on iSCSI
[...]
Post by Tim Spriggs
When we don't move pools around, zfs seems to be stable on both Solaris
and OpenSolaris. I've done snapshots/rollbacks/sends/receives/clones/...
Sounds like your problems are in an area we probably wouldn't be delving
into... Thanks for the detail.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Andy Lubel
2007-09-21 15:55:01 UTC
Permalink
Post by Paul B. Henson
Post by Tim Spriggs
It's an IBM re-branded NetApp which we are using for NFS and
iSCSI.
Yeah, it's fun to see IBM compete with its OEM provider NetApp.
Post by Paul B. Henson
Ah, I see.
Is it comparable storage though? Does it use SATA drives similar to the
x4500, or more expensive/higher performance FC drives? Is it one of the
models that allows connecting dual clustered heads and failing over the
storage between them?
I agree the x4500 is a sweet looking box, but when making price comparisons
sometimes it's more than just the raw storage... I wish I could just drop
in a couple of x4500's and not have to worry about the complexity of
clustering <sigh>...
zfs send/receive.
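(Roughly speaking, and with made-up pool and host names, that amounts to
periodically pushing incremental snapshots to a standby box, e.g.:

server1# zfs snapshot tank/home@tuesday
server1# zfs send -i tank/home@monday tank/home@tuesday | \
         ssh server2 zfs receive tank/home

so server2 holds a slightly stale copy you can share out by hand if
server1 dies; it's replication, not automatic failover.)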


NetApp is great; we have about 6 varieties in production here. But for what I
pay in maintenance and up-front cost on just 2 filers, I can buy an x4500 a
year, and have a 3-year warranty each time I buy. It just depends on the
company you work for.

I haven't played too much with anything but NetApp and StorageTek, but once
I got started on ZFS I just knew it was the future; and I think NetApp
realizes that too. And if Apple does what I think it will, it will only get
better :)

Fast, Cheap, Easy - you only get 2. ZFS may change that.
Paul B. Henson
2007-09-22 00:19:12 UTC
Permalink
Post by Andy Lubel
Yeah, it's fun to see IBM compete with its OEM provider NetApp.
Yes, we had both IBM and Netapp out as well. I'm not sure what the point
was... We do have some IBM SAN equipment on site, I suppose if we had gone
with the IBM variant we could have consolidated support.
Post by Andy Lubel
Post by Paul B. Henson
sometimes it's more than just the raw storage... I wish I could just drop
in a couple of x4500's and not have to worry about the complexity of
clustering <sigh>...
zfs send/receive.
If I understand correctly, that would be sort of a poor man's replication?
So you would end up with a physical copy on server2 of all of the data on
server1? What would you do when server1 crashed and died? One of the
benefits of a real cluster would be the automatic failover, and failback
when the server recovered.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Paul B. Henson
2007-09-20 22:34:37 UTC
Permalink
Post by Andy Lubel
Looks like it's completely scalable but your boot time may suffer the more
you have. Just don't reboot :)
I'm not sure if it's accurate, but the SE we were meeting with claimed that
we could fail over all of the filesystems to one half of the cluster, reboot
the other half, fail them back, reboot the first half, and have rebooted
both cluster members with no downtime. I guess that holds as long as the
active cluster member does not fail during the potentially lengthy downtime
of the one rebooting.
Post by Andy Lubel
If it was so great why did IBM kill it?
I often daydreamed of a group of high-level IBM executives tied to chairs
next to a table filled with rubber hoses ;), for the sole purpose of
getting that answer.

I think they killed it because the market of technically knowledgeable and
capable people that were able to use it to its full capacity was relatively
limited, and the average IT shop was happy with Windoze :(.
Post by Andy Lubel
Did they have an alternative with the same functionality?
No, not really. Depending on your situation, they recommended
transitioning to GPFS or NFSv4, but neither really met the same needs as
DFS.
Post by Andy Lubel
I really have to disagree, we have 6120 and 6130's and if I had the option
to actually plan out some storage I would have just bought a thumper. You
could probably buy 2 for the cost of that 6140.
Thumper = x4500, right? You can't really cluster the internal storage of an
x4500, so assuming high reliability/availability is a requirement, that
sort of rules that box out.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Gary Mills
2007-09-20 21:22:45 UTC
Permalink
Post by Paul B. Henson
Post by Richard Elling
50,000 directories aren't a problem, unless you also need 50,000 quotas
and hence 50,000 file systems. Such a large, single storage pool system
will be an outlier... significantly beyond what we have real world
experience with.
Hmm, I get a bit of a shiver down my spine at the prospect of deploying a
critical central service in a relatively untested configuration 8-/. What
is the maximum number of file systems in a given pool that has undergone
some reasonable amount of real world deployment?
You should consider a Netapp filer. It will do both NFS and CIFS,
supports disk quotas, and is highly reliable. We use one for 30,000
students and 3000 employees. Ours has never failed us.
--
-Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Dickon Hood
2007-09-20 22:07:12 UTC
Permalink
On Thu, Sep 20, 2007 at 16:22:45 -0500, Gary Mills wrote:

: You should consider a Netapp filer. It will do both NFS and CIFS,
: supports disk quotas, and is highly reliable. We use one for 30,000
: students and 3000 employees. Ours has never failed us.

And they might only lightly sue you for contemplating zfs if you're
really, really lucky...
--
Dickon Hood

Due to digital rights management, my .sig is temporarily unavailable.
Normal service will be resumed as soon as possible. We apologise for the
inconvenience in the meantime.

No virus was found in this outgoing message as I didn't bother looking.
Paul B. Henson
2007-09-20 22:47:44 UTC
Permalink
Post by Dickon Hood
: You should consider a Netapp filer. It will do both NFS and CIFS,
: supports disk quotas, and is highly reliable. We use one for 30,000
: students and 3000 employees. Ours has never failed us.
And they might only lightly sue you for contemplating zfs if you're
really, really lucky...
Don't even get me started on the subject of software patents ;)...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Paul B. Henson
2007-09-20 22:46:37 UTC
Permalink
Post by Gary Mills
You should consider a Netapp filer. It will do both NFS and CIFS,
supports disk quotas, and is highly reliable. We use one for 30,000
students and 3000 employees. Ours has never failed us.
We had actually just finished evaluating Netapp before I started looking
into Solaris/ZFS. For a variety of reasons, it was not suitable to our
requirements.

One, for example, was that it did not support simultaneous operation in an
MIT Kerberos realm for NFS authentication while at the same time belonging
to an Active Directory domain for CIFS authentication. Their workaround was
to have the filer behave like an NT4 server rather than a Windows 2000+
server, which seemed pretty stupid. That also resulted in the filer
not supporting NTLMv2, which was unacceptable.

Another issue we had was with access control. Their approach to ACLs was
just flat out ridiculous. You had UNIX mode bits, NFSv4 ACLs, and CIFS
ACLs, all disjoint, and which one was actually being used and how they
interacted was extremely confusing and not even accurately documented. We
wanted to be able to have the exact same permissions applied whether via
NFSv4 or CIFS, and ideally allow changing permissions via either access
protocol. That simply wasn't going to happen with Netapp.

Their Kerberos implementation only supported DES, not 3DES or AES, and their
LDAP integration only supported the legacy posixGroup/memberUid attribute
as opposed to the more modern groupOfNames/member attribute for group
membership.

They have some type of remote management API, but it just wasn't very clean
IMHO.

As far as quotas, I was less than impressed with their implementation.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
eric kustarz
2007-09-21 04:11:23 UTC
Permalink
Post by Paul B. Henson
Post by Gary Mills
You should consider a Netapp filer. It will do both NFS and CIFS,
supports disk quotas, and is highly reliable. We use one for 30,000
students and 3000 employees. Ours has never failed us.
We had actually just finished evaluating Netapp before I started looking
into Solaris/ZFS. For a variety of reasons, it was not suitable to our
requirements.
<omitting some stuff>
Post by Paul B. Henson
As far as quotas, I was less than impressed with their implementation.
Would you mind going into more details here?

eric
Paul B. Henson
2007-09-22 00:15:25 UTC
Permalink
Post by eric kustarz
Post by Paul B. Henson
As far as quotas, I was less than impressed with their implementation.
Would you mind going into more details here?
The feature set was fairly extensive: they supported volume quotas for
users or groups, or "qtree" quotas, which, similar to the ZFS quota, would
limit space for a particular directory and all of its contents regardless
of user/group ownership.

But all quotas were set in a single flat text file. Anytime you added a new
quota, you needed to turn off quotas, then turn them back on, and quota
enforcement was disabled while it recalculated space utilization.

Like a lot of aspects of the filer, it seemed possibly functional but
rather kludgy. I hate kludgy :(. I'd have to go review the documentation to
recall the other issues I had with it; quotas were one of the last things
we reviewed and I'd about given up taking notes at that point.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
James F. Hranicky
2007-09-25 14:47:23 UTC
Permalink
Post by Paul B. Henson
But all quotas were set in a single flat text file. Anytime you added a new
quota, you needed to turn off quotas, then turn them back on, and quota
enforcement was disabled while it recalculated space utilization.
I believe in later versions of the OS 'quota resize' did this without
the massive recalculation.

Jim
Mike Gerdts
2007-09-21 17:32:24 UTC
Permalink
Post by Paul B. Henson
Again though, that would imply two different storage locations visible to
the clients? I'd really rather avoid that. For example, with our current
Samba implementation, a user can just connect to
'\\files.csupomona.edu\<username>' to access their home directory or
'\\files.csupomona.edu\<groupname>' to access a shared group directory.
They don't need to worry on which physical server it resides or determine
what server name to connect to.
MS-DFS could be helpful here. You could have a virtual samba instance
that generates MS-DFS redirects to the appropriate spot. At one point
in the past I wrote a script (long since lost - at a different job)
that would automatically convert automounter maps into the
appropriately formatted symbolic links used by the Samba MS-DFS
implementation. It worked quite well for giving one place to
administer the location mapping while providing transparency to the
end-users.
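(A rough sketch of that on the Samba side, with invented share and path
names: the redirect host runs with

   host msdfs = yes

in [global], plus a share such as

   [dfsroot]
      path = /export/dfsroot
      msdfs root = yes

and inside /export/dfsroot you create links of the form

   # ln -s 'msdfs:server1\henson' henson

so a client opening that name gets referred to whichever real server holds
the data. The automounter-map conversion Mike mentions just generates those
links.)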

Mike
--
Mike Gerdts
http://mgerdts.blogspot.com/
Paul B. Henson
2007-09-22 00:30:05 UTC
Permalink
Post by Mike Gerdts
MS-DFS could be helpful here. You could have a virtual samba instance
that generates MS-DFS redirects to the appropriate spot. At one point in
That's true, although I rather detest Microsoft DFS (they stole the acronym
from DCE/DFS, even though the initial versions in particular sucked
feature-wise in comparison). Also, the current release version of Mac OS X
does not support CIFS DFS referrals. I'm not sure if the upcoming version
is going to rectify that or not. Windows clients not belonging to the
domain also occasionally have problems accessing shares across different
servers.

Although it is definitely something to consider if I'm going to be unable
to achieve my single namespace by having one large server...

Thanks...
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Ed Plese
2007-09-22 01:50:46 UTC
Permalink
Post by Paul B. Henson
Post by Richard Elling
Post by Paul B. Henson
I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.
This work is done and, AFAIK, has been integrated into S10 8/07.
Excellent. I did a little further research myself on the Samba mailing
lists, and it looks like ZFS ACL support was merged into the official
3.0.26 release. Unfortunately, the patch to improve shadow copy performance
on top of ZFS still appears to be floating around the technical mailing
list under discussion.
ZFS ACL support was going to be merged into 3.0.26 but 3.0.26 ended up
being a security fix release and the merge got pushed back. The next
release will be 3.2.0 and ACL support will be in there.

As others have pointed out though, Samba is included in Solaris 10
Update 4 along with support for ZFS ACLs, Active Directory, and SMF.

The patches for the shadow copy module can be found here:

http://www.edplese.com/samba-with-zfs.html

There are hopefully only a few minor changes that I need to make to them
before submitting them again to the Samba team.

I recently compiled the module for someone to use with Samba as shipped
with U4 and he reported that it worked well. I've made the compiled
module available on this page as well if anyone is interested in testing
it.

The patch doesn't improve performance anymore, in order to preserve
backwards compatibility with the existing module, but it adds usability
enhancements for both admins and end-users. It allows shadow copy
functionality to "just work" with ZFS snapshots without having to create
symlinks to each snapshot in the root of each share. For end-users it
allows the "Previous Versions" list to be sorted chronologically to make
it easier to use. If performance is an issue, the patch can be
modified to improve performance like the original patch did, but this
only affects directory listings and is likely negligible in most cases.
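(For anyone trying this before the patch lands, the stock recipe is roughly,
with invented dataset and share names:

   [homes]
      vfs objects = shadow_copy

plus scheduled snapshots named the way the stock module expects, e.g.

   # zfs snapshot files/home@GMT-2007.09.22-01.50.46

and symlinks from each share root into .zfs/snapshot; the patched module is
what removes that naming and symlink busywork.)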
Post by Paul B. Henson
Post by Richard Elling
Post by Paul B. Henson
Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
[...]
Post by Richard Elling
This is a loaded question. There is a webconsole interface to ZFS which can
be run from most browsers. But I think you'll find that the CLI is easier
for remote management.
Perhaps I should have been more clear -- a remote facility available via
programmatic access, not manual user direct access. If I wanted to do
something myself, I would absolutely login to the system and use the CLI.
However, the question was regarding an automated process. For example, our
Perl-based identity management system might create a user in the middle of
the night based on the appearance in our authoritative database of that
user's identity, and need to create a ZFS filesystem and quota for that
user. So, I need to be able to manipulate ZFS remotely via a programmatic
API.
While it won't help you in your case since your users access the files
using protocols other than CIFS, if you use only CIFS it's possible to
configure Samba to automatically create a user's home directory the
first time the user connects to the server. This is done using the
"root preexec" share option in smb.conf and an example is provided at
the above URL.
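(A minimal sketch of that, with a hypothetical helper script name:

   [homes]
      root preexec = /usr/local/sbin/mk_zfs_home %U

where mk_zfs_home would do little more than a zfs create and a zfs set
quota for the connecting user if the dataset doesn't already exist.)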


Ed Plese
Paul B. Henson
2007-09-24 21:47:31 UTC
Permalink
Post by Ed Plese
ZFS ACL support was going to be merged into 3.0.26 but 3.0.26 ended up
being a security fix release and the merge got pushed back. The next
release will be 3.2.0 and ACL support will be in there.
Arg, you're right, I based that on the mailing list posting:

http://marc.info/?l=samba-technical&m=117918697907120&w=2

but checking the actual release notes shows no ZFS mention. 3.0.26 to
3.2.0? That seems an odd version bump...
Post by Ed Plese
As others have pointed out though, Samba is included in Solaris 10
Update 4 along with support for ZFS ACLs, Active Directory, and SMF.
I usually prefer to use the version directly from the source, but depending
on the timeliness of the release of 3.2.0 maybe I'll have to make an
exception. SMF I know is the new Solaris service management framework
replacing /etc/init.d scripts, but what additional Active Directory support
does the Sun-branded Samba include over stock?
Post by Ed Plese
http://www.edplese.com/samba-with-zfs.html
Ah, I thought I recognized your name :), I came across that page while
researching ZFS. Thanks for your work on that patch; hopefully it will be
accepted into mainstream Samba soon.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Mike Gerdts
2007-09-24 22:04:14 UTC
Permalink
Post by Paul B. Henson
but checking the actual release notes shows no ZFS mention. 3.0.26 to
3.2.0? That seems an odd version bump...
3.0.x and before are GPLv2. 3.2.0 and later are GPLv3.

http://news.samba.org/announcements/samba_gplv3/
--
Mike Gerdts
http://mgerdts.blogspot.com/
Richard Elling
2007-09-24 17:56:37 UTC
Permalink
Post by Paul B. Henson
Post by Richard Elling
50,000 directories aren't a problem, unless you also need 50,000 quotas
and hence 50,000 file systems. Such a large, single storage pool system
will be an outlier... significantly beyond what we have real world
experience with.
Yes, considering that 45,000 of those users will be students, we definitely
need separate quotas for each one :).
or groups... think long tail.
Post by Paul B. Henson
Hmm, I get a bit of a shiver down my spine at the prospect of deploying a
critical central service in a relatively untested configuration 8-/. What
is the maximum number of file systems in a given pool that has undergone
some reasonable amount of real world deployment?
good question. I might have some field data on this, but won't be able to
look at it for a month or three. Perhaps someone on the list will brag ;-)
Post by Paul B. Henson
One issue I have is that our previous filesystem, DFS, completely spoiled
me with its global namespace and location transparency. We had three fairly
large servers, with the content evenly dispersed among them, but from the
perspective of the client any user's files were available at
/dfs/user/<username>, regardless of which physical server they resided on.
We could even move them around between servers transparently.
Unfortunately, there aren't really any filesystems available with similar
features and enterprise applicability. OpenAFS comes closest, we've been
prototyping that but the lack of per file ACLs bites, and as an add-on
product we've had issues with kernel compatibility across upgrades.
I was hoping to replicate a similar feel by just having one large file
server with all the data on it. If I split our user files across multiple
servers, we would have to worry about which server contained what files,
which would be rather annoying.
There are some features in NFSv4 that seem like they might someday help
resolve this problem, but I don't think they are readily available in
servers and definitely not in the common client.
Post by Richard Elling
Post by Paul B. Henson
I was planning to provide CIFS services via Samba. I noticed a posting a
while back from a Sun engineer working on integrating NFSv4/ZFS ACL support
into Samba, but I'm not sure if that was ever completed and shipped either
in the Sun version or pending inclusion in the official version, does
anyone happen to have an update on that? Also, I saw a patch proposing a
different implementation of shadow copies that better supported ZFS
snapshots, any thoughts on that would also be appreciated.
This work is done and, AFAIK, has been integrated into S10 8/07.
Excellent. I did a little further research myself on the Samba mailing
lists, and it looks like ZFS ACL support was merged into the official
3.0.26 release. Unfortunately, the patch to improve shadow copy performance
on top of ZFS still appears to be floating around the technical mailing
list under discussion.
Post by Richard Elling
Post by Paul B. Henson
Is there any facility for managing ZFS remotely? We have a central identity
management system that automatically provisions resources as necessary for
[...]
Post by Richard Elling
This is a loaded question. There is a webconsole interface to ZFS which can
be run from most browsers. But I think you'll find that the CLI is easier
for remote management.
Perhaps I should have been more clear -- a remote facility available via
programmatic access, not manual user direct access. If I wanted to do
something myself, I would absolutely login to the system and use the CLI.
However, the question was regarding an automated process. For example, our
Perl-based identity management system might create a user in the middle of
the night based on the appearance in our authoritative database of that
user's identity, and need to create a ZFS filesystem and quota for that
user. So, I need to be able to manipulate ZFS remotely via a programmatic
API.
I'd argue that it isn't worth the trouble.
zfs create
zfs set
is all that would be required. If you are ok with inheritance, zfs create
will suffice.
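(i.e. per user something on the order of the following, with illustrative
pool and dataset names:

   # zfs create files/home/jdoe
   # zfs set quota=2g files/home/jdoe

with properties such as sharenfs or compression simply inherited from
files/home if they are set there.)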
Post by Paul B. Henson
Post by Richard Elling
Active/passive only. ZFS is not supported over pxfs and ZFS cannot be
mounted simultaneously from two different nodes.
That's what I thought, I'll have to get back to that SE. Makes me wonder as
to the reliability of his other answers :).
Post by Richard Elling
For most large file servers, people will split the file systems across
servers such that under normal circumstances, both nodes are providing
file service. This implies two or more storage pools.
Again though, that would imply two different storage locations visible to
the clients? I'd really rather avoid that. For example, with our current
Samba implementation, a user can just connect to
'\\files.csupomona.edu\<username>' to access their home directory or
'\\files.csupomona.edu\<groupname>' to access a shared group directory.
They don't need to worry on which physical server it resides or determine
what server name to connect to.
Yes, that sort of abstraction is achievable using several different
technologies. In general, such services aren't scalable for a single server.
Post by Paul B. Henson
Post by Richard Elling
The SE is mistaken. Sun^H^Holaris Cluster supports a wide variety of
JBOD and RAID array solutions. For ZFS, I recommend a configuration
which allows ZFS to repair corrupted data.
That would also be my preference, but if I were forced to use hardware
RAID, the additional loss of storage for ZFS redundancy would be painful.
Would anyone happen to have any good recommendations for an enterprise
scale storage subsystem suitable for ZFS deployment? If I recall correctly,
the SE we spoke with recommended the StorageTek 6140 in a hardware raid
configuration, and evidently mistakenly claimed that Cluster would not work
with JBOD.
Any. StorageTek products preferred, of course.
-- richard
Paul B. Henson
2007-09-24 22:15:11 UTC
Permalink
Post by Richard Elling
Post by Paul B. Henson
Perhaps I should have been more clear -- a remote facility available via
programmatic access, not manual user direct access. If I wanted to do
I'd argue that it isn't worth the trouble.
zfs create
zfs set
is all that would be required. If you are ok with inheritance, zfs create
will suffice.
Well, considering that some days we automatically create accounts for
thousands of students, I wouldn't want to be the one stuck typing 'zfs
create' a thousand times 8-/. And that still wouldn't resolve our
requirement for our help desk staff to be able to manage quotas through our
existing identity management system.
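(To be fair, the mechanical part is easy enough to script; with hypothetical
hostnames and paths it is just something like

   for u in $(cat new_students.txt); do
       ssh filer1 "zfs create files/home/$u && zfs set quota=1g files/home/$u"
   done

but the real issue is wiring that, plus later quota changes from the help
desk, into the identity management system cleanly.)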
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Dale Ghent
2007-09-25 04:12:14 UTC
Permalink
Post by Paul B. Henson
Well, considering that some days we automatically create accounts for
thousands of students, I wouldn't want to be the one stuck typing 'zfs
create' a thousand times 8-/. And that still wouldn't resolve our
requirement for our help desk staff to be able to manage quotas through our
existing identity management system.
Not to sway you away from ZFS/NFS considerations, but I'd like to add
that people who in the past used DFS typically went on to replace it
with AFS. Have you considered it?

/dale
Paul B. Henson
2007-09-25 22:43:03 UTC
Permalink
Post by Dale Ghent
Not to sway you away from ZFS/NFS considerations, but I'd like to add
that people who in the past used DFS typically went on to replace it with
AFS. Have you considered it?
You're right, AFS is the first choice coming to mind when replacing DFS. We
actually implemented an OpenAFS prototype last year and have been running
it for internal use only since then.

Unfortunately, like almost everything we've looked at, AFS is a step
backwards from DFS. As the precursor to DFS, AFS has enough similarities to
DFS to make the features it lacks almost more painful.

No per-file access control lists is a serious bummer. Integration with
Kerberos 5 rather than the internal kaserver is still at a bit of a duct
tape level, and only supports DES. Having to maintain an additional
repository of user/group information (pts) is a bit of a pain; while there
are long-term goals to replace that with some type of LDAP integration, I
don't see that happening anytime soon.

One of the most annoying things is that AFS requires integration at the
kernel level, yet is not maintained by the same people that maintain the
kernel. Frequently a Linux kernel upgrade will break AFS, and the
developers need to scramble to release a patch or update to resolve it.
While we are not currently using AFS under Solaris, based on mailing list
traffic similar issues arise. One of the benefits of NFSv4 is that it is a
core part of the operating system, unlikely to be lightly broken during
updates.
--
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | ***@csupomona.edu
California State Polytechnic University | Pomona CA 91768
Vincent Fox
2007-09-26 00:32:05 UTC
Permalink
Post by Paul B. Henson
The SE also told me that Sun Cluster requires hardware raid, which
conflicts with the general recommendation to feed ZFS raw disk. It seems
such a configuration would either require configuring zdevs directly on
the raid LUNs, losing ZFS self-healing and checksum correction features,
or losing space to not only the hardware raid level, but a partially
redundant ZFS level as well. What is the general consensus on the best
way to deploy ZFS under a cluster using hardware raid?
I have a pair of 3510FC units, each exporting 2 RAID-5 (5-disk) LUNs.

On the T2000 I map a LUN from each array into a mirror set, then add the
2nd set the same way into the ZFS pool. I guess it's RAID-5+1+0. Yes, we
have a multipath SAN setup too.

e.g.

{cyrus1:vf5:133} zpool status -v
  pool: ms1
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        ms1                                        ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600C0FF0000000000A73D97F16461700d0  ONLINE       0     0     0
            c4t600C0FF0000000000A719D7C1126E500d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600C0FF0000000000A73D94517C4A900d0  ONLINE       0     0     0
            c4t600C0FF0000000000A719D38B93FD200d0  ONLINE       0     0     0

errors: No known data errors

Works great. Nothing beats having an entire 3510FC down and never having
users notice there is a problem. I was replacing a controller in the 2nd
array and goofed up my cabling, taking the entire array offline. Not a
hiccup in service, although I could see the problem in zpool status. I
sorted everything out, plugged it up right, and everything was fine.

I like very much that the 3510 knows it has a global spare that is used for
that array, and having that level of things handled locally. In ZFS, AFAICT,
there is no way to specify what affinity a spare has, so if a spare from one
array went hot to replace a drive in the other array, it would become an
undesirable dependency.
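(For reference, the layout above boils down to something like the
following, with the long device names abbreviated:

   # zpool create ms1 mirror <arrayA_lun1> <arrayB_lun1> \
                      mirror <arrayA_lun2> <arrayB_lun2>

i.e. each ZFS mirror pairs one RAID-5 LUN from each 3510FC, so losing an
entire array still leaves every mirror with one healthy side.)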


Vincent Fox
2007-09-26 00:39:30 UTC
Permalink
Post by Paul B. Henson
We need high availability, so are looking at Sun
Cluster. That seems to add
an extra layer of complexity <sigh>, but there's no
way I'll get signoff on
a solution without redundancy. It would appear that
ZFS failover is
supported with the latest version of Solaris/Sun
Cluster? I was speaking
with a Sun SE who claimed that ZFS would actually
operate active/active in
a cluster, simultaneously writable by both nodes.
From what I had read, ZFS
is not a cluster file system, and would only operate
in the active/passive
failover capacity. Any comments?
The SE is not correct. There are relatively few applications in
Sun Cluster that run as scalable services; most of them are "failover".
ZFS is definitely not a global file system, so that's one problem.
And NFS is a failover service.

This can actually be an asset to you. Think of it this way, you
have a KNOWN capacity. You do not have to worry that a failure
of one node at peak leaves you crippled.

Also, have you ever had Sun patches break things? We
certainly have enough scars from that. You can patch the idle
node, do a cluster switch so it's now the active node, and verify
function for a few days before patching the other node. If there's a
problem that crops up due to some new patch, you switch
it back the other way until you sort that out.
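(With Sun Cluster 3.x that switch is essentially a one-liner; the resource
group name here is hypothetical:

   # scswitch -z -g nfs-rg -h nodeB

which moves the nfs-rg resource group, HAStoragePlus-managed zpool and all,
over to nodeB so nodeA can be patched and rebooted at leisure.)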


