Discussion:
partitioned cache devices
Andrew Werchowiecki
2013-03-15 05:17:33 UTC
Permalink
Hi all,

I'm having some trouble adding cache drives to a zpool. Anyone got any ideas?

***@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
Password:
cannot open '/dev/dsk/c25t10d1p2': I/O error
***@Pyzee:~$

I have two SSDs in the system. I've created an 8 GB partition on each drive for use as a mirrored write cache (slog), and the remainder of each drive is partitioned for use as the read cache (L2ARC). However, when attempting to add the read cache partition I get the error above.

Here's a zpool status:

  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 21 21:13:45 2013
        1.13T scanned out of 20.0T at 106M/s, 51h52m to go
        74.2G resilvered, 5.65% done
config:

        NAME                         STATE     READ WRITE CKSUM
        aggr0                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c7t5000C50035CA68EDd0    ONLINE       0     0     0
            c7t5000C5003679D3E2d0    ONLINE       0     0     0
            c7t50014EE2B16BC08Bd0    ONLINE       0     0     0
            c7t50014EE2B174216Dd0    ONLINE       0     0     0
            c7t50014EE2B174366Bd0    ONLINE       0     0     0
            c7t50014EE25C1E7646d0    ONLINE       0     0     0
            c7t50014EE25C17A62Cd0    ONLINE       0     0     0
            c7t50014EE25C17720Ed0    ONLINE       0     0     0
            c7t50014EE206C2AFD1d0    ONLINE       0     0     0
            c7t50014EE206C8E09Fd0    ONLINE       0     0     0
            c7t50014EE602DFAACAd0    ONLINE       0     0     0
            c7t50014EE602DFE701d0    ONLINE       0     0     0
            c7t50014EE20677C1C1d0    ONLINE       0     0     0
            replacing-13             UNAVAIL      0     0     0
              c7t50014EE6031198C1d0  UNAVAIL      0     0     0  cannot open
              c7t50014EE0AE2AB006d0  ONLINE       0     0     0  (resilvering)
            c7t50014EE65835480Dd0    ONLINE       0     0     0
        logs
          mirror-1                   ONLINE       0     0     0
            c25t10d1p1               ONLINE       0     0     0
            c25t9d1p1                ONLINE       0     0     0

errors: No known data errors

As you can see, I've successfully added the 8 GB partitions as a mirrored write cache (log). Interestingly, when I do a zpool iostat -v it shows the total as 111 GB:

                               capacity     operations    bandwidth
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
aggr0                       20.0T  7.27T  1.33K    139  81.7M  4.19M
  raidz2                    20.0T  7.27T  1.33K    115  81.7M  2.70M
    c7t5000C50035CA68EDd0       -      -    566      9  6.91M   241K
    c7t5000C5003679D3E2d0       -      -    493      8  6.97M   242K
    c7t50014EE2B16BC08Bd0       -      -    544      9  7.02M   239K
    c7t50014EE2B174216Dd0       -      -    525      9  6.94M   241K
    c7t50014EE2B174366Bd0       -      -    540      9  6.95M   241K
    c7t50014EE25C1E7646d0       -      -    549      9  7.02M   239K
    c7t50014EE25C17A62Cd0       -      -    534      9  6.93M   241K
    c7t50014EE25C17720Ed0       -      -    542      9  6.95M   241K
    c7t50014EE206C2AFD1d0       -      -    549      9  7.02M   239K
    c7t50014EE206C8E09Fd0       -      -    526     10  6.94M   241K
    c7t50014EE602DFAACAd0       -      -    576     10  6.91M   241K
    c7t50014EE602DFE701d0       -      -    591     10  7.00M   239K
    c7t50014EE20677C1C1d0       -      -    530     10  6.95M   241K
    replacing                   -      -      0    922      0  7.11M
      c7t50014EE6031198C1d0     -      -      0      0      0      0
      c7t50014EE0AE2AB006d0     -      -      0    622      2  7.10M
    c7t50014EE65835480Dd0       -      -    595     10  6.98M   239K
logs                            -      -      -      -      -      -
  mirror                     740K   111G      0     43      0  2.75M
    c25t10d1p1                  -      -      0     43      3  2.75M
    c25t9d1p1                   -      -      0     43      3  2.75M
--------------------------  -----  -----  -----  -----  -----  -----
rpool                       7.32G  12.6G      2      4  41.9K  43.2K
  c4t0d0s0                  7.32G  12.6G      2      4  41.9K  43.2K
--------------------------  -----  -----  -----  -----  -----  -----

Something funky is going on here...

Wooks
Ian Collins
2013-03-15 08:53:11 UTC
Permalink
Andrew Werchowiecki wrote:
>
> Hi all,
>
> I'm having some trouble with adding cache drives to a zpool, anyone
> got any ideas?
>
> ***@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
>
> Password:
>
> cannot open '/dev/dsk/c25t10d1p2': I/O error
>
> ***@Pyzee:~$
>
> I have two SSDs in the system, I've created an 8gb partition on each
> drive for use as a mirrored write cache. I also have the remainder of
> the drive partitioned for use as the read only cache. However, when
> attempting to add it I get the error above.
>

Create one 100% Solaris partition and then use format to create two slices.
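
Something like this (a rough sketch only; the device names are the ones from your post, and the existing p1 log devices would have to be removed from the pool first):

  format -e c25t10d1
    fdisk       (accept one 100% "SOLARIS2" partition)
    partition   (define s0 as ~8GB for the log, s1 as the remainder for the cache, then label)

  zpool add aggr0 log mirror c25t9d1s0 c25t10d1s0
  zpool add aggr0 cache c25t9d1s1 c25t10d1s1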

--
Ian.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-03-15 12:44:49 UTC
Permalink
> From: zfs-discuss-***@opensolaris.org [mailto:zfs-discuss-
> ***@opensolaris.org] On Behalf Of Andrew Werchowiecki
>
> ***@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> ***@Pyzee:~$
>
> I have two SSDs in the system, I've created an 8gb partition on each drive for
> use as a mirrored write cache. I also have the remainder of the drive
> partitioned for use as the read only cache. However, when attempting to add
> it I get the error above.

Sounds like you're probably running into confusion about how to partition the drive. If you create fdisk partitions, they will be accessible as p0, p1, p2, but I think p0 unconditionally refers to the whole drive, so the first partition is p1, and the second is p2.

If you create one big Solaris fdisk partition and then slice it via "partition" (where s2 is typically the encompassing slice, and people usually use s0, s1, and s6 for the actual slices), then they will be accessible via s0, s1, s6.

Generally speaking, it's inadvisable to split the slog/cache devices anyway, because:

If you're splitting it, evidently you're focusing on the wasted space: you bought an expensive 128 GB device and couldn't possibly ever use more than 4 GB or 8 GB of it for the slog. But that's not what you should be focusing on. You should be focusing on the speed (that's why you bought it in the first place). The slog is write-only, and the cache is a mixture of reads and writes, hopefully doing more reads than writes. But regardless of your actual success with the cache device, your cache device will be busy most of the time, competing against the slog.

You say you have a mirror. You should probably drop both the cache and the log, and instead use one whole device for the cache and one whole device for the log. The only risk you'll run is this:

Since a slog is write-only (except during mount, typically at boot), it's possible to have a failure mode where you think you're writing to the log, but the first time you go back and read it, you discover an error and find the device has gone bad. In other words, without ever doing any reads, you might not notice when or if the device goes bad. Fortunately, there's an easy workaround: you could periodically (say, once a month) script the removal of your log device, create a junk pool on it, write a bunch of data to it, scrub it (thus verifying it was written correctly), and in the absence of any scrub errors, destroy the junk pool and re-add the device as a slog to the main pool.
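
In zpool terms, the monthly check could look something like this (a sketch only; "tank" and "c3t0d0" are hypothetical pool and device names):

  # zpool remove tank c3t0d0                  (pull the slog out of the main pool)
  # zpool create junk c3t0d0                  (build a throwaway pool on the same SSD)
  # dd if=/dev/urandom of=/junk/testfile bs=1024k count=2048
  # zpool scrub junk                          (read everything back and verify checksums)
  # zpool status -v junk                      (check for scrub errors)
  # zpool destroy junk
  # zpool add tank log c3t0d0                 (put it back as the slog)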

I've never heard of anyone actually being that paranoid, and I've never heard of anyone actually experiencing the aforementioned possible undetected device failure mode. So this is all mostly theoretical.

Mirroring the slog device really isn't necessary in the modern age.
Andrew Werchowiecki
2013-03-17 02:01:59 UTC
Permalink
It's a home setup; the performance penalty from splitting the cache devices is non-existent, and that workaround sounds like a pretty crazy amount of overhead when I could instead just have a mirrored slog.

I'm less concerned about wasted space, and more concerned about the number of SAS ports I have available.

I understand that p0 refers to the whole disk... in the logs I pasted in, I'm not attempting to add p0. I'm trying to work out why I'm getting an error when attempting to add p2, after p1 was added successfully. Further, this has been done before on other systems with the same hardware configuration in exactly the same fashion, and I've gone over the steps trying to make sure I haven't missed something, but I can't see a fault.

I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.
________________________________________
From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [***@nedharvey.com]
Sent: Friday, 15 March 2013 8:44 PM
To: Andrew Werchowiecki; zfs-***@opensolaris.org
Subject: RE: partitioned cache devices

> From: zfs-discuss-***@opensolaris.org [mailto:zfs-discuss-
> ***@opensolaris.org] On Behalf Of Andrew Werchowiecki
>
> ***@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> ***@Pyzee:~$
>
> I have two SSDs in the system, I've created an 8gb partition on each drive for
> use as a mirrored write cache. I also have the remainder of the drive
> partitioned for use as the read only cache. However, when attempting to add
> it I get the error above.

Sounds like you're probably running into confusion about how to partition the drive. If you create fdisk partitions, they will be accessible as p0, p1, p2, but I think p0 unconditionally refers to the whole drive, so the first partition is p1, and the second is p2.

If you create one big Solaris fdisk partition and then slice it via "partition" (where s2 is typically the encompassing slice, and people usually use s0, s1, and s6 for the actual slices), then they will be accessible via s0, s1, s6.

Generally speaking, it's inadvisable to split the slog/cache devices anyway, because:

If you're splitting it, evidently you're focusing on the wasted space. Buying an expensive 128G device where you couldn't possibly ever use more than 4G or 8G in the slog. But that's not what you should be focusing on. You should be focusing on the speed (that's why you bought it in the first place.) The slog is write-only, and the cache is a mixture of read/write, where it should be hopefully doing more reads than writes. But regardless of your actual success with the cache device, your cache device will be busy most of the time, and competing against the slog.

You have a mirror, you say. You should probably drop both the cache & log. Use one whole device for the cache, use one whole device for the log. The only risk you'll run is:

Since a slog is write-only (except during mount, typically at boot) it's possible to have a failure mode where you think you're writing to the log, but the first time you go back and read, you discover an error, and discover the device has gone bad. In other words, without ever doing any reads, you might not notice when/if the device goes bad. Fortunately, there's an easy workaround. You could periodically (say, once a month) script the removal of your log device, create a junk pool, write a bunch of data to it, scrub it (thus verifying it was written correctly) and in the absence of any scrub errors, destroy the junk pool and re-add the device as a slog to the main pool.

I've never heard of anyone actually being that paranoid, and I've never heard of anyone actually experiencing the aforementioned possible undetected device failure mode. So this is all mostly theoretical.

Mirroring the slog device really isn't necessary in the modern age.
Richard Elling
2013-03-17 03:46:44 UTC
Permalink
On Mar 16, 2013, at 7:01 PM, Andrew Werchowiecki <***@xpanse.com.au> wrote:

> It's a home setup; the performance penalty from splitting the cache devices is non-existent, and that workaround sounds like a pretty crazy amount of overhead when I could instead just have a mirrored slog.
>
> I'm less concerned about wasted space, more concerned about amount of SAS ports I have available.
>
> I understand that p0 refers to the whole disk... in the logs I pasted in I'm not attempting to mount p0. I'm trying to work out why I'm getting an error attempting to mount p2, after p1 has successfully mounted. Further, this has been done before on other systems in the same hardware configuration in the exact same fashion, and I've gone over the steps trying to make sure I haven't missed something but can't see a fault.

You can have only one Solaris partition at a time. Ian already shared the answer, "Create one 100%
Solaris partition and then use format to create two slices."
-- richard

>
> I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.
> ________________________________________
> From: Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) [***@nedharvey.com]
> Sent: Friday, 15 March 2013 8:44 PM
> To: Andrew Werchowiecki; zfs-***@opensolaris.org
> Subject: RE: partitioned cache devices
>
>> From: zfs-discuss-***@opensolaris.org [mailto:zfs-discuss-
>> ***@opensolaris.org] On Behalf Of Andrew Werchowiecki
>>
>> ***@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
>> Password:
>> cannot open '/dev/dsk/c25t10d1p2': I/O error
>> ***@Pyzee:~$
>>
>> I have two SSDs in the system, I've created an 8gb partition on each drive for
>> use as a mirrored write cache. I also have the remainder of the drive
>> partitioned for use as the read only cache. However, when attempting to add
>> it I get the error above.
>
> Sounds like you're probably running into confusion about how to partition the drive. If you create fdisk partitions, they will be accessible as p0, p1, p2, but I think p0 unconditionally refers to the whole drive, so the first partition is p1, and the second is p2.
>
> If you create one big Solaris fdisk partition and then slice it via "partition" (where s2 is typically the encompassing slice, and people usually use s0, s1, and s6 for the actual slices), then they will be accessible via s0, s1, s6.
>
> Generally speaking, it's inadvisable to split the slog/cache devices anyway, because:
>
> If you're splitting it, evidently you're focusing on the wasted space. Buying an expensive 128G device where you couldn't possibly ever use more than 4G or 8G in the slog. But that's not what you should be focusing on. You should be focusing on the speed (that's why you bought it in the first place.) The slog is write-only, and the cache is a mixture of read/write, where it should be hopefully doing more reads than writes. But regardless of your actual success with the cache device, your cache device will be busy most of the time, and competing against the slog.
>
> You have a mirror, you say. You should probably drop both the cache & log. Use one whole device for the cache, use one whole device for the log. The only risk you'll run is:
>
> Since a slog is write-only (except during mount, typically at boot) it's possible to have a failure mode where you think you're writing to the log, but the first time you go back and read, you discover an error, and discover the device has gone bad. In other words, without ever doing any reads, you might not notice when/if the device goes bad. Fortunately, there's an easy workaround. You could periodically (say, once a month) script the removal of your log device, create a junk pool, write a bunch of data to it, scrub it (thus verifying it was written correctly) and in the absence of any scrub errors, destroy the junk pool and re-add the device as a slog to the main pool.
>
> I've never heard of anyone actually being that paranoid, and I've never heard of anyone actually experiencing the aforementioned possible undetected device failure mode. So this is all mostly theoretical.
>
> Mirroring the slog device really isn't necessary in the modern age.

--

ZFS and performance consulting
http://www.RichardElling.com
Fajar A. Nugraha
2013-03-17 07:03:45 UTC
Permalink
On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki <
***@xpanse.com.au> wrote:

> I understand that p0 refers to the whole disk... in the logs I pasted in
> I'm not attempting to mount p0. I'm trying to work out why I'm getting an
> error attempting to mount p2, after p1 has successfully mounted. Further,
> this has been done before on other systems in the same hardware
> configuration in the exact same fashion, and I've gone over the steps
> trying to make sure I haven't missed something but can't see a fault.
>
>
How did you create the partitions? Are they marked as a Solaris partition, or
something else (e.g. fdisk on Linux uses type "83" by default)?
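
A quick way to check (command sketches only; the Solaris device name is the one from the earlier posts, and /dev/sdb is just a placeholder):

# fdisk -W - /dev/rdsk/c25t10d1p0     (Solaris: dump the fdisk table, including the type IDs)
# fdisk -l /dev/sdb                   (Linux: list the partitions and their type codes)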

> I'm not keen on using Solaris slices because I don't have an understanding
> of what that does to the pool's OS interoperability.


Linux can read Solaris slices and import Solaris-made pools just fine, as
long as you're using a compatible zpool version (e.g. zpool version 28).

--
Fajar
Andrew Werchowiecki
2013-03-19 05:23:28 UTC
Permalink
I did something like the following:

format -e /dev/rdsk/c5t0d0p0
fdisk
1 (create)
F (EFI)
6 (exit)
partition
label
1 (EFI label)
y (confirm)
0 (select slice 0)
usr (tag)
wm (flag)
64 (starting sector)
4194367e (ending sector)
1 (select slice 1)
usr
wm
4194368
117214990
label
1
y



             Total disk size is 9345 cylinders
             Cylinder size is 12544 (512 byte) blocks

                                                 Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  9345     9346    100

partition> print
Current partition table (original):
Total disk sectors available: 117214957 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector         Size         Last Sector
  0        usr    wm                64        2.00GB          4194367
  1        usr    wm           4194368       53.89GB          117214990
  2 unassigned    wm                 0           0                 0
  3 unassigned    wm                 0           0                 0
  4 unassigned    wm                 0           0                 0
  5 unassigned    wm                 0           0                 0
  6 unassigned    wm                 0           0                 0
  8   reserved    wm         117214991        8.00MB          117231374

This isn't the output from when I did it, but these are exactly the same steps that I followed.

Thanks for the info about slices, I may give that a go later on. I'm not keen on it because I have clear evidence (as in zpools set up this way, right now, working without issue) that GPT partitions of the style shown above work, and I want to see why it doesn't work in my setup rather than simply ignoring the problem and moving on.

From: Fajar A. Nugraha [mailto:***@fajar.net]
Sent: Sunday, 17 March 2013 3:04 PM
To: Andrew Werchowiecki
Cc: zfs-***@opensolaris.org
Subject: Re: [zfs-discuss] partitioned cache devices

On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki <***@xpanse.com.au> wrote:
I understand that p0 refers to the whole disk... in the logs I pasted in I'm not attempting to mount p0. I'm trying to work out why I'm getting an error attempting to mount p2, after p1 has successfully mounted. Further, this has been done before on other systems in the same hardware configuration in the exact same fashion, and I've gone over the steps trying to make sure I haven't missed something but can't see a fault.

How did you create the partition? Are those marked as solaris partition, or something else (e.g. fdisk on linux use type "83" by default).

I'm not keen on using Solaris slices because I don't have an understanding of what that does to the pool's OS interoperability.


Linux can read solaris slice and import solaris-made pools just fine, as long as you're using compatible zpool version (e.g. zpool version 28).

--
Fajar
Ian Collins
2013-03-19 06:14:00 UTC
Permalink
Andrew Werchowiecki wrote:
>
> Thanks for the info about slices, I may give that a go later on. I’m
> not keen on that because I have clear evidence (as in zpools set up
> this way, right now, working, without issue) that GPT partitions of
> the style shown above work and I want to see why it doesn’t work in my
> set up rather than simply ignoring and moving on.
>

Didn't you read Richard's post? "You can have only one Solaris partition
at a time."

Your original example failed when you tried to add a second.

--
Ian.
Cindy Swearingen
2013-03-19 19:38:51 UTC
Permalink
Hi Andrew,

Your original syntax was incorrect.

A p* device is a larger container for the d* device or s* devices.
In the case of a cache device, you need to specify a d* or s* device.
That you can add p* devices to a pool is a bug.

Adding different slices from c25t10d1 as both log and cache devices
would need the s* identifier, but you've already added the entire
c25t10d1 as the log device. A better configuration would be using
c25t10d1 for log and using c25t9d1 for cache or provide some spares
for this large pool.

After you remove the log devices, re-add like this:

# zpool add aggr0 log c25t10d1
# zpool add aggr0 cache c25t9d1
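
The removal step, assuming the mirrored log still shows up as mirror-1
in zpool status, would be something like:

# zpool remove aggr0 mirror-1

Log devices can be removed while the pool is online.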

You might review the ZFS recommendation practices section, here:

http://docs.oracle.com/cd/E26502_01/html/E29007/zfspools-4.html#storage-2

See example 3-4 for adding a cache device, here:

http://docs.oracle.com/cd/E26502_01/html/E29007/gayrd.html#gazgw

Always have good backups.

Thanks, Cindy



On 03/18/13 23:23, Andrew Werchowiecki wrote:
> I did something like the following:
>
> format -e /dev/rdsk/c5t0d0p0
> fdisk
> 1 (create)
> F (EFI)
> 6 (exit)
> partition
> label
> 1
> y
> 0
> usr
> wm
> 64
> 4194367e
> 1
> usr
> wm
> 4194368
> 117214990
> label
> 1
> y
>
>              Total disk size is 9345 cylinders
>              Cylinder size is 12544 (512 byte) blocks
>
>                                                  Cylinders
>       Partition   Status    Type          Start   End   Length    %
>       =========   ======    ============  =====   ===   ======   ===
>           1                 EFI               0  9345     9346    100
>
> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part      Tag    Flag     First Sector         Size         Last Sector
>   0        usr    wm                64        2.00GB          4194367
>   1        usr    wm           4194368       53.89GB          117214990
>   2 unassigned    wm                 0           0                 0
>   3 unassigned    wm                 0           0                 0
>   4 unassigned    wm                 0           0                 0
>   5 unassigned    wm                 0           0                 0
>   6 unassigned    wm                 0           0                 0
>   8   reserved    wm         117214991        8.00MB          117231374
>
> This isn’t the output from when I did it but it is exactly the same
> steps that I followed.
>
> Thanks for the info about slices, I may give that a go later on. I’m not
> keen on that because I have clear evidence (as in zpools set up this
> way, right now, working, without issue) that GPT partitions of the style
> shown above work and I want to see why it doesn’t work in my set up
> rather than simply ignoring and moving on.
>
> From: Fajar A. Nugraha [mailto:***@fajar.net]
> Sent: Sunday, 17 March 2013 3:04 PM
> To: Andrew Werchowiecki
> Cc: zfs-***@opensolaris.org
> Subject: Re: [zfs-discuss] partitioned cache devices
>
> On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki
> <***@xpanse.com.au> wrote:
>
> I understand that p0 refers to the whole disk... in the logs I
> pasted in I'm not attempting to mount p0. I'm trying to work out why
> I'm getting an error attempting to mount p2, after p1 has
> successfully mounted. Further, this has been done before on other
> systems in the same hardware configuration in the exact same
> fashion, and I've gone over the steps trying to make sure I haven't
> missed something but can't see a fault.
>
> How did you create the partition? Are those marked as solaris partition,
> or something else (e.g. fdisk on linux use type "83" by default).
>
> I'm not keen on using Solaris slices because I don't have an
> understanding of what that does to the pool's OS interoperability.
>
> Linux can read solaris slice and import solaris-made pools just fine, as
> long as you're using compatible zpool version (e.g. zpool version 28).
>
> --
>
> Fajar
>
Jim Klimov
2013-03-19 20:27:29 UTC
Permalink
On 2013-03-19 20:38, Cindy Swearingen wrote:
> Hi Andrew,
>
> Your original syntax was incorrect.
>
> A p* device is a larger container for the d* device or s* devices.
> In the case of a cache device, you need to specify a d* or s* device.
> That you can add p* devices to a pool is a bug.

I disagree; at least, I've always thought differently:
the "d" device is the whole disk denomination, with a
unique number for a particular controller link ("c+t").

The disk has some partitioning table, MBR or GPT/EFI.
In these tables, partition "p0" stands for the table
itself (i.e. to manage partitioning), and the rest kind
of "depends". In case of MBR tables, one partition may
be named as having a Solaris (or Solaris2) type, and
there it holds a SMI table of Solaris slices, and these
slices can hold legacy filesystems or components of ZFS
pools. In case of GPT, the GPT-partitions can be used
directly by ZFS. However, they are also denominated as
"slices" in ZFS and format utility.

I believe, Solaris-based OSes accessing a "p"-named
partition and an "s"-named slice of the same number
on a GPT disk should lead to the same range of bytes
on disk, but I am not really certain about this.

Also, if a "whole disk" is given to ZFS (and for OSes
other than the latest Solaris 11 this means non-rpool
disks), then ZFS labels the disk as GPT and defines a
partition for itself plus a small trailing partition
(likely to level out discrepancies with replacement
disks that might happen to be a few sectors too small).
In this case ZFS reports that it uses "cXtYdZ" as a
pool component, since it considers itself in charge
of the partitioning table and its inner contents, and
doesn't intend to share the disk with other usages
(dual-booting and other OSes' partitions, or SLOG and
L2ARC parts, etc). This also "allows" ZFS to influence
hardware-related choices, like caching and throttling,
and likely auto-expansion with the changed LUN sizes
by fixing up the partition table along the way, since
it assumes being 100% in charge of the disk.

I don't think there is a "crime" in trying to use the
partitions (of either kind) as ZFS leaf vdevs, even the
zpool(1M) manpage states that:

     ... The following virtual devices are supported:

     disk      A block device, typically located under /dev/dsk.
               ZFS can use individual slices or partitions, though
               the recommended mode of operation is to use whole
               disks. ...

This is orthogonal to the fact that there can only be
one Solaris slice table, inside one partition, on MBR.
AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

On my old home NAS with OpenSolaris I certainly did have
MBR partitions on the rpool intended initially for some
dual-booted OSes, but repurposed as L2ARC and ZIL devices
for the storage pool on other disks, when I played with
that technology. Didn't gain much with a single spindle ;)

HTH,
//Jim Klimov
Andrew Gabriel
2013-03-19 21:07:39 UTC
Permalink
On 03/19/13 20:27, Jim Klimov wrote:
> I disagree; at least, I've always thought differently:
> the "d" device is the whole disk denomination, with a
> unique number for a particular controller link ("c+t").
>
> The disk has some partitioning table, MBR or GPT/EFI.
> In these tables, partition "p0" stands for the table
> itself (i.e. to manage partitioning),

p0 is the whole disk regardless of any partitioning.
(Hence you can use p0 to access any type of partition table.)

> and the rest kind
> of "depends". In case of MBR tables, one partition may
> be named as having a Solaris (or Solaris2) type, and
> there it holds a SMI table of Solaris slices, and these
> slices can hold legacy filesystems or components of ZFS
> pools. In case of GPT, the GPT-partitions can be used
> directly by ZFS. However, they are also denominated as
> "slices" in ZFS and format utility.

The GPT partitioning spec requires the disk to be FDISK
partitioned with just one single FDISK partition of type EFI,
so that tools which predate GPT partitioning will still see
such a GPT disk as fully assigned to FDISK partitions, and
the disk is therefore less likely to be accidentally blown away.

> I believe, Solaris-based OSes accessing a "p"-named
> partition and an "s"-named slice of the same number
> on a GPT disk should lead to the same range of bytes
> on disk, but I am not really certain about this.

No, you'll see just p0 (whole disk), and p1 (whole disk
less space for the backwards compatible FDISK partitioning).

> Also, if a "whole disk" is given to ZFS (and for OSes
> other that the latest Solaris 11 this means non-rpool
> disks), then ZFS labels the disk as GPT and defines a
> partition for itself plus a small trailing partition
> (likely to level out discrepancies with replacement
> disks that might happen to be a few sectors too small).
> In this case ZFS reports that it uses "cXtYdZ" as a
> pool component,

For an EFI disk, the device name without a final p* or s*
component is the whole EFI partition. (It's actually the
s7 slice minor device node, but the s7 is dropped from
the device name to avoid the confusion we had with s2
on SMI labeled disks being the whole SMI partition.)

> since it considers itself in charge
> of the partitioning table and its inner contents, and
> doesn't intend to share the disk with other usages
> (dual-booting and other OSes' partitions, or SLOG and
> L2ARC parts, etc). This also "allows" ZFS to influence
> hardware-related choices, like caching and throttling,
> and likely auto-expansion with the changed LUN sizes
> by fixing up the partition table along the way, since
> it assumes being 100% in charge of the disk.
>
> I don't think there is a "crime" in trying to use the
> partitions (of either kind) as ZFS leaf vdevs, even the
> zpool(1M) manpage states that:
>
> ... The following virtual devices are supported:
> disk
> A block device, typically located under /dev/dsk.
> ZFS can use individual slices or partitions,
> though the recommended mode of operation is to use
> whole disks. ...

Right.

> This is orthogonal to the fact that there can only be
> one Solaris slice table, inside one partition, on MBR.
> AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86.
You always have FDISK partitioning (p1, p2, p3, p4).
You can then have SMI or GPT/EFI slices (both called s0, s1, ...)
in an FDISK partition of the appropriate type.
With SMI labeling, s2 is by convention the whole Solaris FDISK
partition (although this is not enforced).
With EFI labeling, s7 is enforced as the whole EFI FDISK partition,
and so the trailing s7 is dropped off the device name for
clarity.

This simplicity is brought about because the GPT spec requires
that backwards compatible FDISK partitioning is included, but
with just 1 partition assigned.
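
To put that in terms of the disk from the earlier format example (an
illustration of the naming above, not output from any real system):

  c5t0d0p0   the whole disk, regardless of labelling
  c5t0d0p1   the single backwards compatible FDISK partition of type EFI
  c5t0d0s0   EFI slice 0 (the 2.00GB usr slice in the print output)
  c5t0d0s1   EFI slice 1 (the 53.89GB usr slice)
  c5t0d0     the whole EFI partition (really the s7 minor node, with the
             trailing s7 dropped)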

--
Andrew
Jim Klimov
2013-03-19 21:42:56 UTC
Permalink
On 2013-03-19 22:07, Andrew Gabriel wrote:
> The GPT partitioning spec requires the disk to be FDISK
> partitioned with just one single FDISK partition of type EFI,
> so that tools which predate GPT partitioning will still see
> such a GPT disk as fully assigned to FDISK partitions, and
> therefore less likely to be accidentally blown away.

Okay, I guess I got entangled in terminology now ;)
Anyhow, your words are not all news to me, though my write-up
was likely misleading to unprepared readers... sigh... Thanks
for the clarifications and deeper details that I did not know!

So, we can concur that GPT does indeed include the fake MBR
header with one EFI partition which addresses the smaller of
2TB (MBR limit) or disk size, minus a few sectors for the GPT
housekeeping. Inside the EFI partition are defined the GPT,
um, partitions (represented as "s"lices in Solaris). This is
after all a GUID *Partition* Table, and that's how parted
refers to them too ;)

Notably, there are also unportable tricks to fool legacy OSes
and bootloaders into addressing the same byte ranges via both
MBR entries (forged manually and abusing the GPT/EFI spec) and
proper GPT entries, as partitions in the sense of each table.

//Jim
Andrew Gabriel
2013-03-19 20:17:43 UTC
Permalink
Andrew Werchowiecki wrote:

>              Total disk size is 9345 cylinders
>              Cylinder size is 12544 (512 byte) blocks
>
>                                                  Cylinders
>       Partition   Status    Type          Start   End   Length    %
>       =========   ======    ============  =====   ===   ======   ===
>           1                 EFI               0  9345     9346    100

You only have a p1 (and for a GPT/EFI labeled disk, you can only
have p1 - no other FDISK partitions are allowed).

> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part      Tag    Flag     First Sector         Size         Last Sector
>   0        usr    wm                64        2.00GB          4194367
>   1        usr    wm           4194368       53.89GB          117214990
>   2 unassigned    wm                 0           0                 0
>   3 unassigned    wm                 0           0                 0
>   4 unassigned    wm                 0           0                 0
>   5 unassigned    wm                 0           0                 0
>   6 unassigned    wm                 0           0                 0
>   8   reserved    wm         117214991        8.00MB          117231374

You have an s0 and s1.

> This isn’t the output from when I did it but it is exactly the same
> steps that I followed.
>
> Thanks for the info about slices, I may give that a go later on. I’m not
> keen on that because I have clear evidence (as in zpools set up this
> way, right now, working, without issue) that GPT partitions of the style
> shown above work and I want to see why it doesn’t work in my set up
> rather than simply ignoring and moving on.

You would have to blow away the partitioning you have, create an FDISK
partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't
use the 'partition' subcommand, which confusingly creates Solaris slices.)
Give the FDISK partitions a partition type which nothing will recognise,
such as 'other', so that nothing will try to interpret them as OS partitions.
Then you can use them as raw devices, and they should be portable between
OSes which can handle FDISK partitioned devices.
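
As a rough outline (device names are the ones from earlier in the thread;
the exact fdisk menu choices vary between releases, and the existing log
mirror would have to be removed from aggr0 first):

  format -e c25t10d1
    fdisk    (delete the EFI partition; create p1 of ~8GB and p2 from the
              remainder, both with a type such as "Other"; update and exit
              without entering the "partition" subcommand)

  zpool add aggr0 log mirror c25t9d1p1 c25t10d1p1
  zpool add aggr0 cache c25t9d1p2 c25t10d1p2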

--
Andrew