Discussion:
Petabyte pool?
Marion Hakanson
2013-03-16 01:09:34 UTC
Greetings,

Has anyone out there built a 1-petabyte pool? I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.

We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".

I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.

So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?

Thanks and regards,

Marion
Ray Van Dolson
2013-03-16 01:17:46 UTC
Post by Marion Hakanson
Greetings,
Has anyone out there built a 1-petabyte pool? I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.
We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Thanks and regards,
Marion
We've come close:

***@mes-str-imgnx-p1:~$ zpool list
NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
datapool   978T   298T   680T  30%  1.00x  ONLINE  -
syspool    278G   104G   174G  37%  1.00x  ONLINE  -

Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual
pathed to a couple of LSI SAS switches.

Using Nexenta but no reason you couldn't do this w/ $whatever.

We did triple parity and our vdev membership is set up such that we can
lose up to three JBODs and still be functional (one vdev member disk
per JBOD).
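
To illustrate the layout (device names below are made up, not our actual
config):

  # one disk from each of eight JBODs per raidz3 vdev, so any three whole
  # JBODs can fail and every vdev still has enough members to stay online
  zpool create datapool \
    raidz3 j1d0 j2d0 j3d0 j4d0 j5d0 j6d0 j7d0 j8d0 \
    raidz3 j1d1 j2d1 j3d1 j4d1 j5d1 j6d1 j7d1 j8d1
  # ...and so on, one more raidz3 vdev per remaining disk slot in each JBOD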

This is with 3TB NL-SAS drives.

Ray
Kristoffer Sheather @ CloudCentral
2013-03-16 01:21:22 UTC
Well, off the top of my head:

2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
8 x 60-Bay JBODs with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 JBODs

That should fit within 1 rack comfortably and provide 1 PB of storage.

Regards,

Kristoffer Sheather
Cloud Central
Scale Your Data Center In The Cloud
Phone: 1300 144 007 | Mobile: +61 414 573 130 | Email: ***@cloudcentral.com.au
Skype: kristoffer.sheather | Twitter: http://twitter.com/kristofferjon

Bob Friesenhahn
2013-03-16 14:20:56 UTC
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
8 x 60-Bay JBODs with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 JBODs
That should fit within 1 rack comfortably and provide 1 PB of storage.
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between
JBOD chassis? Does the server need to be powered up last so that it
does not time out on the zfs import?

Bob
--
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2013-03-16 19:27:08 UTC
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
8 x 60-Bay JBODs with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 JBODs
That should fit within 1 rack comfortably and provide 1 PB of storage.
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
I guess you can use managed PDUs like those from APC (there are many models
for various socket types and counts); they can be scripted at an advanced
level, and at a basic level I think per-socket delays can simply be
configured to stagger startup once power is applied from the wall (UPS),
regardless of what the boxes' individual power supplies can do.
Conveniently, they also let you do a remote hard-reset of hung boxes
without walking to the server room ;)

My 2c,
//Jim Klimov
Tim Cook
2013-03-16 19:43:08 UTC
Post by Jim Klimov
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
8 x 60-Bay JBODs with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 JBODs
That should fit within 1 rack comfortably and provide 1 PB of storage.
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
I guess you can use managed PDUs like those from APC (there are many models
for various socket types and counts); they can be scripted at an advanced
level, and at a basic level I think per-socket delays can simply be
configured to stagger startup once power is applied from the wall (UPS),
regardless of what the boxes' individual power supplies can do.
Conveniently, they also let you do a remote hard-reset of hung boxes
without walking to the server room ;)
My 2c,
//Jim Klimov
Any modern JBOD should have the intelligence built in to stagger drive
spin-up. I wouldn't spend money on one that didn't. There's really no
need to stagger the JBOD power-up at the PDU.

As for the head, yes, it should have a delayed power-on, which you can
typically set in the BIOS.

--Tim
Jim Klimov
2013-03-16 19:41:20 UTC
Post by Bob Friesenhahn
Post by Kristoffer Sheather @ CloudCentral
2 x Storage Heads, 4 x 10G, 256G RAM, 2 x Intel E5 CPUs
8 x 60-Bay JBODs with 60 x 4TB SAS drives
RAIDZ2 stripe over the 8 JBODs
That should fit within 1 rack comfortably and provide 1 PB of storage.
What does one do for power? What are the power requirements when the
system is first powered on? Can drive spin-up be staggered between JBOD
chassis? Does the server need to be powered up last so that it does not
time out on the zfs import?
Giving this question a second thought, I think the JBODs should spin up
quickly (i.e. as soon as power is applied), while the server head(s) take
time to pass POST, initialize their HBAs and so on. Booting 8 JBODs, one
every 15 seconds to let the spin-up power draw settle, would take a couple
of minutes. It is likely that a server booted along with the first JBOD
won't get to importing the pool that quickly ;)

Anyhow, with such a system attention should be given to redundant power
and cooling, including redundant UPSes preferably fed from different
power lines going into the room.

This does not seem like a fantastic power sucker, however. 480 drives at
15W each would consume 7200W; add a bit for the processor/RAM heads
(perhaps a kW?) and this would still fit into 8-10kW, so a couple of 15kVA
UPSes (or several smaller ones) should suffice, including redundancy. The
whole thing might exceed one rack in size, though. But for power/cooling
this seems like a standard figure for a 42U rack, or just a bit more.
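
Rough shell math, with assumed per-drive figures (steady-state and spin-up
draw vary by drive model):

  # back-of-the-envelope power estimate; 15 W running and ~25 W during
  # spin-up per drive are assumptions, not measured numbers
  drives=480
  echo "steady state: $(( drives * 15 )) W, plus ~1000 W for the heads"   # ~8.2 kW
  echo "simultaneous spin-up: $(( drives * 25 )) W"                       # ~12 kW -- hence staggering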

//Jim
Kristoffer Sheather @ CloudCentral
2013-03-16 01:24:33 UTC
Actually, you could use 3TB drives and with a 6/8 RAIDZ2 stripe achieve
1080 TB usable.
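
Quick check of that figure, assuming 6 data + 2 parity disks per vdev:

  # 8 JBODs x 60 bays = 480 drives; 480 / 8 per vdev = 60 raidz2 vdevs
  echo "$(( 60 * 6 * 3 )) TB usable"   # 1080 TB with 3TB drives, before ZFS overhead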

You'll also need 8-16 SAS ports available on each storage head to provide
redundant multi-pathed SAS connectivity to the JBODs; I'd recommend LSI
9207-8e's for those and Intel X520-DA2's for the 10G NICs.

Jan Owoc
2013-03-16 01:29:47 UTC
Post by Marion Hakanson
Has anyone out there built a 1-petabyte pool?
I'm not advising against your building/configuring a system yourself,
but I suggest taking a look at the "Petarack":
http://www.aberdeeninc.com/abcatg/petarack.htm

It shows it's been done with ZFS :-).

Jan
Richard Elling
2013-03-16 04:57:10 UTC
Post by Marion Hakanson
Greetings,
Has anyone out there built a 1-petabyte pool?
Yes, I've done quite a few.
Post by Marion Hakanson
I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.
We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
Yes. NB, for the PHB, using N^2 is found 2B less effective than N^10.
Post by Marion Hakanson
I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.
NFS v4 or DFS (or even clever sysadmin + automount) offers a single
namespace without needing the complexity of NFSv4.1, lustre, glusterfs, etc.
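For example, a plain automount setup (server and path names below are
hypothetical) already gives clients one /archive tree spread across
several NFS servers:

  # /etc/auto_master entry:
  /archive    auto_archive
  # /etc/auto_archive -- one key per backend NFS server:
  run1    nfs1:/export/genomics/run1
  run2    nfs2:/export/genomics/run2
  run3    nfs3:/export/genomics/run3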
Post by Marion Hakanson
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Don't forget about backups :-)
-- richard


--

***@RichardElling.com
+1-760-896-4422
Richard Yao
2013-03-16 12:23:07 UTC
Post by Richard Elling
Post by Marion Hakanson
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Don't forget about backups :-)
-- richard
Transferring 1 PB over a 10 gigabit link will take at least 10 days once
overhead is taken into account. The backup system should have a dedicated
10 gigabit link at a minimum, and using incremental send/recv will be
extremely important.
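
Back-of-the-envelope, assuming a dead-flat 10 Gbit/s with no protocol
overhead:

  # 1 PB = 8e15 bits over a 1e10 bit/s link
  echo "$(( (8 * 10**15) / (10 * 10**9) )) seconds"   # 800,000 s, about 9.3 days
  # real-world NFS/TCP overhead easily pushes that past 10 days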




Schweiss, Chip
2013-03-17 01:05:22 UTC
I just recently built an OpenIndiana 151a7 system that is currently 1/2 PB
and will be expanded to 1 PB as we collect imaging data for the Human
Connectome Project at Washington University in St. Louis. It is very much
like your use case, as it is an offsite backup system that will be written
once and read rarely.

It has displaced a BlueArc DR system because their mechanisms for syncing
over distances could not keep up with our data generation rate. The fact
that the BlueArc cost 5x as much per TB as homebrew also helped the decision.

It is currently 180 4TB SAS Seagate Constellations in 4 Supermicro JBODs.
The JBODs are currently in two branches, cascading only once; when
expanded, 4 JBODs will be on each branch. The pool is configured as 9 vdevs
of 19 drives each in raidz3, with the remaining disks configured as hot
spares. Metadata only is cached in 128GB of RAM and 2 x 480GB Intel 520
SSDs for L2ARC. Sync (ZIL) is turned off, since the worst that could happen
is that we would need to rerun an rsync job.
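
For the curious, the rough math on the current half-petabyte, assuming the
layout above:

  # 9 raidz3 vdevs x 19 drives = 171 of the 180 drives, leaving 9 as spares
  echo "$(( 9 * (19 - 3) * 4 )) TB usable"   # 576 TB before ZFS overhead -- roughly 1/2 PB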

Two identical servers were built for a cold-standby configuration. Since
it is a DR system, a hot standby was ruled out, as even several hours of
downtime would not be an issue. Each server is fitted with 2 LSI 9207-8e
HBAs configured for redundant multipathing to the JBODs.

Before putting it into service I ran several iozone tests to benchmark the
pool. Even with really fat vdevs the performance is impressive. If you're
interested in that data, let me know. The system has many hours of idle
time each day, so additional performance tests are not out of the question
either.

Actually, I should say I designed and configured the system; it was
assembled by a colleague at UMINN. If you would like more details on the
hardware, I have a very detailed assembly doc I wrote and would be happy to
share it.

The system receives daily rsyncs from our production BlueArc system. The
rsyncs are split into 120 parallel rsync jobs, which overcomes the latency
slowdown TCP suffers from; we see total throughput between 500 and 700
Mb/s. The BlueArc has 120TB of 15k SAS tiered to NL-SAS, with all metadata
on the SAS pool. The ZFS system outpaces the BlueArc on metadata when
rsync does its tree walk.
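
A minimal sketch of the parallel-rsync idea (paths, host name, and
per-directory split here are hypothetical):

  # kick off one rsync per top-level directory, up to 120 at a time
  ls /bluearc/export | xargs -P120 -I{} \
    rsync -a /bluearc/export/{}/ zfsbackup:/datapool/export/{}/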

Given all the safeguards built into ZFS, I would not hesitate to build a
production system at the multi-petabyte scale. If a channel to the disks is
no longer available, it will simply stop writing and the data will be safe.
Given the redundant paths, power supplies, etc., that is very unlikely to
happen. The single points of failure left when running a single server are
the motherboard, CPU, and RAM. Build a hot standby server and human error
becomes the most likely failure.

-Chip
Post by Marion Hakanson
Greetings,
Has anyone out there built a 1-petabyte pool? I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data. Probably a single 10Gbit NIC for connectivity is sufficient.
We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".
I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-). Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.
So, has anyone done this? Or come close to it? Thoughts, even if you
haven't done it yourself?
Thanks and regards,
Marion