Discussion:
replace same sized disk fails with too small error
(too old to reply)
Antonius
2009-01-18 02:58:51 UTC
Permalink
I'm having an issue replacing a failed 500GB disk with another new one: the replace fails with an error that the disk is too small. The problem is that it isn't. Is there any help anyone can offer here?

I've tried adding it both set up as a spare and separate from the pool, and with different formats and configs, all to no avail. Is there a prerequisite I am not fulfilling?

[b]bash-3.2# format[/b]
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c0d0 <DEFAULT cyl 993 alt 2 hd 255 sec 63>
/***@0,0/pci8086,***@1c,3/pci-***@0/***@0/***@0,0
1. c1d0 <WDC WD50- WD-WCASY247370-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@0/***@0/***@0,0
2. c1d1 <SAMSUNG-S0MUJQSQA0223-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@0/***@0/***@1,0
3. c2d0 <WDC WD50- WD-WCASY244407-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@0/***@1/***@0,0
4. c3d0 <SAMSUNG-S0MUJQSQA0224-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@1/***@0/***@0,0
5. c3d1 <SAMSUNG-S0VVJ1CP30560-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@1/***@0/***@1,0
6. c4d0 <SAMSUNG-S0VVJ1CP30539-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@1/***@1/***@0,0
7. c4d1 <SAMSUNG-S0VVJ1CP30561-0001-465.76GB>
/***@0,0/pci8086,***@1e/pci-***@1/***@1/***@1,0
8. c5t0d0 <ATA-SAMSUNG HD501LJ-0-10-465.76GB>
/***@0,0/pci1458,***@1f,2/***@0,0
9. c5t1d0 <ATA-SAMSUNG HD501LJ-0-10-465.76GB>
/***@0,0/pci1458,***@1f,2/***@1,0
10. c5t2d0 <ATA-SAMSUNG HD501LJ-0-10-465.76GB>
/***@0,0/pci1458,***@1f,2/***@2,0
11. c5t3d0 <ATA-SAMSUNG HD501LJ-0-10-465.76GB>
/***@0,0/pci1458,***@1f,2/***@3,0
Specify disk (enter its number): 4
selecting c3d0
NO Alt slice
No defect list found
[disk formatted, no defect list found]
/dev/dsk/c3d0s0 is reserved as a hot spare for ZFS pool storage. Please see zpool(1M).


FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
fdisk - run the fdisk program
repair - repair a defective sector
show - translate a disk address
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
[b]format> verify[/b]

Volume name = < >
ascii name = <SAMSUNG-S0MUJQSQA0224-0001-465.76GB>
bytes/sector = 512
sectors = 976760063
accessible sectors = 976760030
Part      Tag    Flag     First Sector        Size        Last Sector
  0        usr    wm              256      465.75GB         976743646
  1 unassigned    wm                0             0                 0
  2 unassigned    wm                0             0                 0
  3 unassigned    wm                0             0                 0
  4 unassigned    wm                0             0                 0
  5 unassigned    wm                0             0                 0
  6 unassigned    wm                0             0                 0
  8   reserved    wm        976743647        8.00MB         976760030

[b]bash-3.2# zpool status[/b]
pool: storage
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        storage     DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            c2d0    ONLINE       0     0     0
            c2d1    UNAVAIL      0     0     0  cannot open
            c5t0d0  ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c3d1    ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c4d1    ONLINE       0     0     0
            c5t3d0  ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0
            c5t2d0  ONLINE       0     0     0
        spares
          c1d1      AVAIL
          c3d0      AVAIL

errors: No known data errors

[b]bash-3.2# zpool replace -f storage c2d1 c1d1[/b]
cannot replace c2d1 with c1d1: device is too small
--
This message posted from opensolaris.org
Eric D. Mudama
2009-01-18 04:42:11 UTC
Permalink
Post by Antonius
I'm having an issue replacing a failed 500GB disk with another new one with the error that the disk is too small. The problem is that it isn't. Is there any help anyone can offer here?
ascii name = <SAMSUNG-S0MUJQSQA0224-0001-465.76GB>
bytes/sector = 512
sectors = 976760063
accessible sectors = 976760030
Part Tag Flag First Sector Size Last Sector
0 usr wm 256 465.75GB 976743646
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 976743647 8.00MB 976760030
Are you 100% sure it has the exact same number of sectors? While more
recently, most vendors have settled on the IDEMA calculation for
number of available sectors, some drives sold into retail and/or some
models may not follow this, leaving you with a slightly different
number.

The "IDEMA" size for a 500GB disk is 976,773,168 sectors.

That Samsung drive would be about 13,000 sectors short, if your other
disks are IDEMA sized.

FYI, the formula is: sectors = 97696368 + (1953504 * ((Size in GB) - 50))
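
A quick way to check that arithmetic from a shell (plain integer math, using
the sector counts from the format output above):

$ echo $((97696368 + 1953504 * (500 - 50)))    # IDEMA sector count for a 500GB drive
976773168
$ echo $((976773168 - 976760063))              # how far short the Samsung above falls
13105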
--
Eric D. Mudama
***@mail.bounceswoosh.org
Tim
2009-01-18 04:43:14 UTC
Permalink
Post by Eric D. Mudama
Are you 100% sure it has the exact same number of sectors? While more
recently, most vendors have settled on the IDEMA calculation for
number of available sectors, some drives sold into retail and/or some
models may not follow this, leaving you with a slightly different
number.
The "IDEMA" size for a 500GB disk is 976,773,168 sectors.
That Samsung drive would be about 13,000 sectors short, if your other
disks are IDEMA sized.
FYI, the formula is: sectors = 97696368 + (1953504 * ((Size in GB) - 50))
So you're saying zfs does absolutely no right-sizing? That sounds like a
bad idea all around...


--Tim
Richard Elling
2009-01-18 04:50:22 UTC
Permalink
Post by Eric D. Mudama
Are you 100% sure it has the exact same number of sectors? While more
recently, most vendors have settled on the IDEMA calculation for
number of available sectors, some drives sold into retail and/or some
models may not follow this, leaving you with a slightly different
number.
The "IDEMA" size for a 500GB disk is 976,773,168 sectors.
That Samsung drive would be about 13,000 sectors short, if your other
disks are IDEMA sized.
FYI, the formula is: sectors = 97696368 + (1953504 * ((Size in GB) - 50))
So you're saying zfs does absolutely no right-sizing? That sounds like
a bad idea all around...
ZFS only looks at available sectors.
-- richard
C***@Sun.COM
2009-01-18 11:18:46 UTC
Permalink
Post by Tim
So you're saying zfs does absolutely no right-sizing? That sounds like a
bad idea all around...
You can use a bigger disk; NOT a smaller disk.

Casper
Tim
2009-01-18 15:29:08 UTC
Permalink
Post by C***@Sun.COM
Post by Tim
So you're saying zfs does absolutely no right-sizing? That sounds like a
bad idea all around...
You can use a bigger disk; NOT a smaller disk.
Casper
Right, which is an absolutely piss poor design decision and why every major
storage vendor right-sizes drives. What happens if I have an old maxtor
drive in my pool whose "500g" is just slightly larger than every other mfg
on the market? You know, the one who is no longer making their own drives
since being purchased by seagate. I can't replace the drive anymore?
*GREAT*.

--Tim
Adam Leventhal
2009-01-18 16:16:38 UTC
Permalink
Post by Tim
Right, which is an absolutely piss poor design decision and why
every major storage vendor right-sizes drives. What happens if I
have an old maxtor drive in my pool whose "500g" is just slightly
larger than every other mfg on the market? You know, the one who is
no longer making their own drives since being purchased by seagate.
I can't replace the drive anymore? *GREAT*.
Sun does "right size" our drives. Are we talking about replacing a
device bought from sun with another device bought from Sun? If these
are just drives that fell off the back of some truck, you may not have
that assurance.

Adam

--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Tim
2009-01-18 18:20:49 UTC
Permalink
Post by Tim
Right, which is an absolutely piss poor design decision and why every major
storage vendor right-sizes drives. What happens if I have an old maxtor
drive in my pool whose "500g" is just slightly larger than every other mfg
on the market? You know, the one who is no longer making their own drives
since being purchased by seagate. I can't replace the drive anymore?
*GREAT*.
Sun does "right size" our drives. Are we talking about replacing a device
bought from sun with another device bought from Sun? If these are just
drives that fell off the back of some truck, you may not have that
assurance.
Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Since it's done in software by HDS, NetApp, and EMC, that's complete
bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
identical to the seagate OEM version is also bullshit and a piss-poor
answer.

--Tim
Adam Leventhal
2009-01-19 17:05:16 UTC
Permalink
Post by Tim
Since it's done in software by HDS, NetApp, and EMC, that's complete
bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
identical to the seagate OEM version is also bullshit and a piss-poor
answer.
I didn't know that HDS, NetApp, and EMC all allow users to replace their
drives with stuff they've bought at Fry's. Is this still covered by their
service plan or would this only be in an unsupported config?

Thanks.

Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Tim
2009-01-19 18:34:05 UTC
Permalink
Post by Adam Leventhal
Post by Tim
Since it's done in software by HDS, NetApp, and EMC, that's complete
bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
identical to the seagate OEM version is also bullshit and a piss-poor
answer.
I didn't know that HDS, NetApp, and EMC all allow users to replace their
drives with stuff they've bought at Fry's. Is this still covered by their
service plan or would this only be in an unsupported config?
Thanks.
Adam
So because an enterprise vendor requires you to use their drives in their
array, suddenly zfs can't right-size? Vendor requirements have absolutely
nothing to do with their right-sizing, and everything to do with them
wanting your money.

Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?

--Tim
Adam Leventhal
2009-01-19 18:39:22 UTC
Permalink
Post by Tim
Post by Adam Leventhal
Post by Tim
Since it's done in software by HDS, NetApp, and EMC, that's complete
bullshit. Forcing people to spend 3x the money for a "Sun" drive that's
identical to the seagate OEM version is also bullshit and a piss-poor
answer.
I didn't know that HDS, NetApp, and EMC all allow users to replace their
drives with stuff they've bought at Fry's. Is this still covered by their
service plan or would this only be in an unsupported config?
So because an enterprise vendor requires you to use their drives in their
array, suddenly zfs can't right-size? Vendor requirements have absolutely
nothing to do with their right-sizing, and everything to do with them
wanting your money.
Sorry, I must have missed your point. I thought that you were saying that
HDS, NetApp, and EMC had a different model. Were you merely saying that the
software in those vendors' products operates differently than ZFS?
Post by Tim
Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?
How do these $15 sata raid adapters solve the problem? The more details you
could provide, the better, obviously.

Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Bob Friesenhahn
2009-01-19 19:12:25 UTC
Permalink
Post by Adam Leventhal
Post by Tim
Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?
How do there $15 sata raid adapters solve the problem? The more details you
could provide the better obviously.
It is really quite simple. If the disk is resilvered but the new
drive is a bit too small, then the RAID card might tell you that a bit
of data might have been lost in the last sectors, or it may just assume
that you didn't need that data, or maybe a bit of cryptic message text
scrolls off the screen a split second after it has been issued. Or if
you try to write at the end of the volume and one of the replacement
drives is a bit too short, then the RAID card may return a hard read
or write error. Most filesystems won't try to use that last bit of
space anyway since they run real slow when the disk is completely
full, or their flimsy formatting algorithm always wastes a bit of the
end of the disk. Only ZFS is rash enough to use all of the space
provided to it, and actually expect that the space continues to be
usable.

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tim
2009-01-19 19:32:43 UTC
Permalink
On Mon, Jan 19, 2009 at 1:12 PM, Bob Friesenhahn <
Post by Tim
Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?
How do there $15 sata raid adapters solve the problem? The more details you
could provide the better obviously.
It is really quite simple. If the disk is resilvered but the new drive is
a bit too small, then the RAID card might tell you that a bit of data might
have lost in the last sectors, or it may just assume that you didn't need
that data, or maybe a bit of cryptic message text scrolls off the screen a
split second after it has been issued. Or if you try to write at the end of
the volume and one of the replacement drives is a bit too short, then the
RAID card may return a hard read or write error. Most filesystems won't try
to use that last bit of space anyway since they run real slow when the disk
is completely full, or their flimsy formatting algorithm always wastes a bit
of the end of the disk. Only ZFS is rash enough to use all of the space
provided to it, and actually expect that the space continues to be usable.
It's a horribly *bad thing* to not use the entire disk and right-size it for
sanity's sake. That's why Sun currently sells arrays that do JUST THAT.

I'd wager fishworks does just that as well. Why don't you open source that
code and prove me wrong ;)

I'm wondering why they don't come right out with it and say "we want to
intentionally make this painful to our end users so that they buy our
packaged products". It'd be far more honest and productive than this
pissing match.


--Tim
Richard Elling
2009-01-19 21:29:56 UTC
Permalink
Post by Tim
On Mon, Jan 19, 2009 at 1:12 PM, Bob Friesenhahn
Are you telling me zfs is deficient to the point it can't
handle basic
right-sizing like a 15$ sata raid adapter?
How do there $15 sata raid adapters solve the problem? The more details you
could provide the better obviously.
Note that for the LSI RAID controllers Sun uses on many products,
if you take a disk that was JBOD and tell the controller to make
it RAIDed, then the controller will relabel the disk for you and
will cause you to lose the data. As best I can tell, ZFS is better
in that it will protect your data rather than just relabeling and
clobbering your data. AFAIK, NVidia and others do likewise.
Post by Tim
It is really quite simple. If the disk is resilvered but the new
drive is a bit too small, then the RAID card might tell you that a
bit of data might have lost in the last sectors, or it may just
assume that you didn't need that data, or maybe a bit of cryptic
message text scrolls off the screen a split second after it has been
issued. Or if you try to write at the end of the volume and one of
the replacement drives is a bit too short, then the RAID card may
return a hard read or write error. Most filesystems won't try to
use that last bit of space anyway since they run real slow when the
disk is completely full, or their flimsy formatting algorithm always
wastes a bit of the end of the disk. Only ZFS is rash enough to use
all of the space provided to it, and actually expect that the space
continues to be usable.
It's a horribly *bad thing* to not use the entire disk and right-size it
for sanity's sake. That's why Sun currently sells arrays that do JUST
THAT.
??
Post by Tim
I'd wager fishworks does just that as well. Why don't you open source
that code and prove me wrong ;)
I don't think so, because fishworks is an engineering team and I
don't think I can reserve space on a person... at least not legally
where I live :-)

But this is not a problem for the Sun Storage 7000 systems because
the supported disks are already "right-sized."
Post by Tim
I'm wondering why they don't come right out with it and say "we want to
intentionally make this painful to our end users so that they buy our
packaged products". It'd be far more honest and productive than this
pissing match.
I think that if there is enough real desire for this feature,
then someone would file an RFE on http://bugs.opensolaris.org
It would help to attach diffs to the bug and it would help to
reach a concensus of the amount of space to be reserved prior
to filing. This is not an intractable problem and easy workarounds
already exist, but if ease of use is more valuable than squeezing
every last block, then the RFE should fly.
-- richard
Antonius
2009-01-21 09:26:53 UTC
Permalink
You mentioned one, so what do you recommend as a workaround?
I've tried re-initializing the disks on another system's HW RAID controller, but I still get the same error.
--
This message posted from opensolaris.org
Tim
2009-01-19 19:35:22 UTC
Permalink
Post by Adam Leventhal
Sorry, I must have missed your point. I thought that you were saying that
HDS, NetApp, and EMC had a different model. Were you merely saying that the
software in those vendors' products operates differently than ZFS?
Gosh, was the point that hard to get? Let me state it a fourth time: They
all short stroke the disks to avoid the CF that results from all drives not
adhering to a strict sizing standard.
Post by Adam Leventhal
Post by Tim
Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?
How do there $15 sata raid adapters solve the problem? The more details you
could provide the better obviously.
They short stroke the disk so that when you buy a new 500GB drive that isn't
the exact same number of blocks you aren't screwed. It's a design choice to
be both sane, and to make the end-users life easier. You know, sort of like
you not letting people choose their raid layout...

--Tim
Adam Leventhal
2009-01-19 20:55:23 UTC
Permalink
Post by Tim
Post by Adam Leventhal
Post by Tim
Are you telling me zfs is deficient to the point it can't handle basic
right-sizing like a 15$ sata raid adapter?
How do there $15 sata raid adapters solve the problem? The more details you
could provide the better obviously.
They short stroke the disk so that when you buy a new 500GB drive that isn't
the exact same number of blocks you aren't screwed. It's a design choice to
be both sane, and to make the end-users life easier. You know, sort of like
you not letting people choose their raid layout...
Drive vendors, it would seem, have an incentive to make their "500GB" drives
as small as possible. Should ZFS then choose some amount of padding at the
end of each device and chop it off as insurance against a slightly smaller
drive? How much of the device should it chop off? Conversely, should users
have the option to use the full extent of the drives they've paid for, say,
if they're using a vendor that already provides that guarantee?
Post by Tim
You know, sort of like you not letting people choose their raid layout...
Yes, I'm not saying it shouldn't be done. I'm asking what the right answer
might be.

Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Tim
2009-01-19 23:23:17 UTC
Permalink
Post by Adam Leventhal
Drive vendors, it would seem, have an incentive to make their "500GB" drives
as small as possible. Should ZFS then choose some amount of padding at the
end of each device and chop it off as insurance against a slightly smaller
drive? How much of the device should it chop off? Conversely, should users
have the option to use the full extent of the drives they've paid for, say,
if they're using a vendor that already provides that guarantee?
Drive vendors, it would seem, have incentive to make their 500GB drives as
cheap as possible. The two are not necessarily one and the same.

And again, I say take a look at the market today, figure out a percentage,
and call it done. I don't think you'll find a lot of users crying foul over
losing 1% of their drive space when they don't already cry foul over the
false advertising that is drive sizes today.

In any case, you might as well can ZFS entirely because it's not really fair
that users are losing disk space to raid and metadata... see where this
argument is going?

I really, REALLY doubt you're going to have users screaming at you for
losing 1% (or whatever the figure ends up being) to a right-sizing
algorithm. In fact, I would bet the average user will NEVER notice if you
don't tell them ahead of time. Sort of like the average user had absolutely
no clue that 500GB drives were of slightly differing block numbers, and he'd
end up screwed six months down the road if he couldn't source an identical
drive.

I have two disks in one of my systems... both maxtor 500GB drives, purchased
at the same time shortly after the buyout. One is a rebadged Seagate, one
is a true, made in China Maxtor. Different block numbers... same model
drive, purchased at the same time.

Wasn't zfs supposed to be about using software to make up for deficiencies
in hardware? It would seem this request is exactly that...
Post by Adam Leventhal
Post by Tim
You know, sort of like you not letting people choose their raid layout...
Yes, I'm not saying it shouldn't be done. I'm asking what the right answer
might be.
The *right answer* in simplifying storage is not "manually slice up every
disk you insert into the system to avoid this issue".

The right answer is "right-size by default, give admins the option to skip
it if they really want". Sort of like I'd argue the right answer on the
7000 is to give users the raid options you do today by default, and allow
them to lay it out themselves from some sort of advanced *at your own risk*
mode, whether that be command line (the best place I'd argue) or something
else.

--Tim
Adam Leventhal
2009-01-19 23:39:19 UTC
Permalink
Post by Tim
And again, I say take a look at the market today, figure out a percentage,
and call it done. I don't think you'll find a lot of users crying foul over
losing 1% of their drive space when they don't already cry foul over the
false advertising that is drive sizes today.
Perhaps it's quaint, but 5GB still seems like a lot to me to throw away.
Post by Tim
In any case, you might as well can ZFS entirely because it's not really fair
that users are losing disk space to raid and metadata... see where this
argument is going?
Well, I see where this _specious_ argument is going.
Post by Tim
I have two disks in one of my systems... both maxtor 500GB drives, purchased
at the same time shortly after the buyout. One is a rebadged Seagate, one
is a true, made in China Maxtor. Different block numbers... same model
drive, purchased at the same time.
Wasn't zfs supposed to be about using software to make up for deficiencies
in hardware? It would seem this request is exactly that...
That's a fair point, and I do encourage you to file an RFE, but a) Sun has
already solved this problem in a different way as a company with our products
and b) users already have the ability to right-size drives.

Perhaps a better solution would be to handle the procedure of replacing a disk
with a slightly smaller one by migrating data and then treating the extant
disks as slightly smaller as well. This would have the advantage of being far
more dynamic and of only applying the space tax in situations where it actually
applies.

Adam
--
Adam Leventhal, Fishworks http://blogs.sun.com/ahl
Blake
2009-01-20 05:31:29 UTC
Permalink
So the place we are arriving is to push the RFE for shrinkable pools?

Warning the user about the difference in actual drive size, then
offering to shrink the pool to allow a smaller device seems like a
nice solution to this problem.

The ability to shrink pools might be very useful in other situations.
Say I built a server that once did a decent amount of iops using SATA
disks, and now that the workload's iops have greatly increased (busy
database?), I need SAS disks. If I'd originally bought 500gb SATA
(current sweet spot) disks, I might have a lot of empty space in my
pool. Shrinking the pool would allow me to migrate to smaller
(capacity) SAS disks with much better seek times, without being forced
to buy 2x as many disks due to the higher cost/gb of SAS.

I think I remember an RFE for shrinkable pools, but can't find it -
can someone post a link if they know where it is?

cheers,
Blake
Tim
2009-01-20 13:24:41 UTC
Permalink
Post by Tim
Post by Tim
And again, I say take a look at the market today, figure out a percentage,
and call it done. I don't think you'll find a lot of users crying foul over
losing 1% of their drive space when they don't already cry foul over the
false advertising that is drive sizes today.
Perhaps it's quaint, but 5GB still seems like a lot to me to throw away.
That wasn't a hard number, that was a hypothetical number. On 750GB drives
I'm only seeing them lose in the area of 300-500MB.
Post by Tim
Post by Tim
I have two disks in one of my systems... both maxtor 500GB drives, purchased
at the same time shortly after the buyout. One is a rebadged Seagate, one
is a true, made in China Maxtor. Different block numbers... same model
drive, purchased at the same time.
Wasn't zfs supposed to be about using software to make up for deficiencies
in hardware? It would seem this request is exactly that...
That's a fair point, and I do encourage you to file an RFE, but a) Sun has
already solved this problem in a different way as a company with our products
and b) users already have the ability to right-size drives.
Perhaps a better solution would be to handle the procedure of replacing a disk
with a slightly smaller one by migrating data and then treating the extant
disks as slightly smaller as well. This would have the advantage of being far
more dynamic and of only applying the space tax in situations where it actually
applies.
A) Should have a big bright * next to it referencing "our packaged storage
solutions". I've got plenty of 72G "Sun" drives still lying around that
aren't all identical block numbers ;) Yes, an RMA is great, but when I've
got spares sitting on the shelf and I lose a drive at 4:40pm on a Friday,
I'm going to stick the spare off the shelf in, call Sun, and put the
replacement back on the shelf on Monday. /horse beaten

B) I think we can both agree that having to pre-slice every disk that goes
into the system is not a viable long-term solution to this issue.

That being said, your conclusion sounds like a perfectly acceptable/good
idea to me for all of the technical people such as those on this list.

Joe User is another story, but much like adding a single drive to a
raid-z(2) vdev, I doubt that's a target market for Sun at this time.
C***@Sun.COM
2009-01-18 16:17:22 UTC
Permalink
Post by Tim
Right, which is an absolutely piss poor design decision and why every major
storage vendor right-sizes drives. What happens if I have an old maxtor
drive in my pool whose "500g" is just slightly larger than every other mfg
on the market? You know, the one who is no longer making their own drives
since being purchased by seagate. I can't replace the drive anymore?
*GREAT*.
With a larger drive.

Who can replace drives with smaller drives?

What exactly does "right size" drives mean? They don't use all of the
disk?

Casper
Tim
2009-01-18 18:18:12 UTC
Permalink
Post by C***@Sun.COM
Post by Tim
Right, which is an absolutely piss poor design decision and why every major
storage vendor right-sizes drives. What happens if I have an old maxtor
drive in my pool whose "500g" is just slightly larger than every other mfg
on the market? You know, the one who is no longer making their own drives
since being purchased by seagate. I can't replace the drive anymore?
*GREAT*.
With a larger drive.
Who can replace drives with smaller drives?
What exactly does "right size" drives mean? They don't use all of the
disk?
Casper
"right-sizing" is when the volume manager short strokes the drive
intentionally because not all vendors' 500GB is the same size. Hence the
OP's problem.

How aggressive the short-stroke is, depends on the OEM.

--Tim
Bob Friesenhahn
2009-01-18 16:51:55 UTC
Permalink
Post by Tim
Right, which is an absolutely piss poor design decision and why every major
storage vendor right-sizes drives. What happens if I have an old maxtor
drive in my pool whose "500g" is just slightly larger than every other mfg
on the market? You know, the one who is no longer making their own drives
since being purchased by seagate. I can't replace the drive anymore?
*GREAT*.
I appreciate that in these times of financial hardship you cannot
afford a 750GB drive to replace the oversized 500GB drive. Sorry
to hear about your situation.

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Will Murnane
2009-01-18 17:45:49 UTC
Permalink
On Sun, Jan 18, 2009 at 16:51, Bob Friesenhahn
Post by Bob Friesenhahn
I appreciate that in these times of financial hardship that you can
not afford a 750GB drive to replace the oversized 500GB drive. Sorry
to hear about your situation.
That's easy to say, but what if there were no larger alternative?
Suppose I have a pool composed of those 1.5TB Seagate disks, and
Hitachi puts out some of the "same" capacity that are actually
slightly smaller. A drive fails in my array, I buy a Hitachi disk to
replace it, and it doesn't work. If I can't get a large enough drive
to replace the missing disk with, it'd be a shame to have to destroy
and recreate the pool on smaller media.

Perhaps this is yet another problem that can be solved with BP
rewrite. If "zpool replace" detects that a disk is slightly smaller
but not so small that it can't hold all the data, warn the user first
but then allow them to replace the disk anyways.

Will
Bob Friesenhahn
2009-01-18 18:19:18 UTC
Permalink
Post by Will Murnane
That's easy to say, but what if there were no larger alternative?
Suppose I have a pool composed of those 1.5TB Seagate disks, and
Hitachi puts out some of the "same" capacity that are actually
slightly smaller. A drive fails in my array, I buy a Hitachi disk to
replace it, and it doesn't work. If I can't get a large enough drive
to replace the missing disk with, it'd be a shame to have to destroy
and recreate the pool on smaller media.
What do you propose that OpenSolaris should do about this? Should
OpenSolaris use some sort of a table of "common size" drives, or use
an algorithm which determines certain discrete usage values based on
declared drive sizes and a margin for error? What should OpenSolaris
of today do with the 20TB disk drives of tomorrow? What should the
margin for error of a 30TB disk drive be? Is it ok to arbitrarily
ignore 3/4TB of storage space?

If the "drive" is actually a huge 20TB LUN exported from a SAN RAID
array, how should the margin for error be handled in that case?

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tim
2009-01-18 18:23:52 UTC
Permalink
On Sun, Jan 18, 2009 at 12:19 PM, Bob Friesenhahn <
Post by Bob Friesenhahn
Post by Will Murnane
That's easy to say, but what if there were no larger alternative?
Suppose I have a pool composed of those 1.5TB Seagate disks, and
Hitachi puts out some of the "same" capacity that are actually
slightly smaller. A drive fails in my array, I buy a Hitachi disk to
replace it, and it doesn't work. If I can't get a large enough drive
to replace the missing disk with, it'd be a shame to have to destroy
and recreate the pool on smaller media.
What do you propose that OpenSolaris should do about this? Should
OpenSolaris use some sort of a table of "common size" drives, or use
an algorithm which determines certain discrete usage values based on
declared drive sizes and a margin for error? What should OpenSolaris
of today do with the 20TB disk drives of tomorrow? What should the
margin for error of a 30TB disk drive be? Is it ok to arbitrarily
ignore 3/4TB of storage space?
If the "drive" is actually a huge 20TB LUN exported from a SAN RAID
array, how should the margin for error be handled in that case?
Bob
======================================
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Take a look at drives on the market, figure out a percentage, and call it a
day. If there's a significant issue with "20TB" drives of the future, issue
a bug report and a fix, just like every other issue that comes up.


--Tim
Ellis, Mike
2009-01-18 18:26:38 UTC
Permalink
Does this all go away when BP-rewrite gets fully resolved/implemented?

Short of the pool being 100% full, it should allow a rebalancing
operation and possible LUN/device-size-shrink to match the new device
that is being inserted?

Thanks,

-- MikeE

-----Original Message-----
From: zfs-discuss-***@opensolaris.org
[mailto:zfs-discuss-***@opensolaris.org] On Behalf Of Bob
Friesenhahn
Sent: Sunday, January 18, 2009 1:19 PM
To: Will Murnane
Cc: zfs-***@opensolaris.org
Subject: Re: [zfs-discuss] replace same sized disk fails with too small
error
Post by Will Murnane
That's easy to say, but what if there were no larger alternative?
Suppose I have a pool composed of those 1.5TB Seagate disks, and
Hitachi puts out some of the "same" capacity that are actually
slightly smaller. A drive fails in my array, I buy a Hitachi disk to
replace it, and it doesn't work. If I can't get a large enough drive
to replace the missing disk with, it'd be a shame to have to destroy
and recreate the pool on smaller media.
What do you propose that OpenSolaris should do about this? Should
OpenSolaris use some sort of a table of "common size" drives, or use
an algorithm which determines certain discrete usage values based on
declared drive sizes and a margin for error? What should OpenSolaris
of today do with the 20TB disk drives of tomorrow? What should the
margin for error of a 30TB disk drive be? Is it ok to arbitrarily
ignore 3/4TB of storage space?

If the "drive" is actually a huge 20TB LUN exported from a SAN RAID
array, how should the margin for error be handled in that case?

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us,
http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Will Murnane
2009-01-18 18:38:39 UTC
Permalink
On Sun, Jan 18, 2009 at 18:19, Bob Friesenhahn
Post by Bob Friesenhahn
What do you propose that OpenSolaris should do about this?
Take drive size, divide by 100, round down to two significant digits.
Floor to a multiple of that size. This method wastes no more than 1%
of the disk space, and gives a reasonable (I think) number.

For example: I have a machine with a "250GB" disk that is 251000193024
bytes long.
$ python
>>> n=str(251000193024//100)
>>> int(n[:2] + "0" * (len(n)-2)) * 100
250000000000L
So treat this volume as being 250 billion bytes long, exactly.

Most drives are sold with two significant digits in the size: 320 GB,
400 GB, 640GB, 1.0 TB, etc. I don't see this changing any time
particularly soon; unless someone starts selling a 1.25 TB drive or
something, two digits will suffice. Even then, this formula would
give you 96% (1.2/1.25) of the disk's capacity.

Note that this method also works for small-capacity disks: suppose I
have a disk that's exactly 250 billion bytes long. This formula will
produce 250 billion as the size it is to be treated as. Thus,
replacing my 251 billion byte disk with a 250 billion byte one will
not be a problem.
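
The same rule as a small shell function, in case anyone wants to try other
capacities (plain ksh/bash arithmetic, nothing ZFS-specific; right_size is
just a name I made up):

right_size() {
    # floor ($1 / 100) to two significant digits, then scale back up by 100
    unit=$(( $1 / 100 ))
    scale=1 i=${#unit}
    while [ $i -gt 2 ]; do scale=$(( scale * 10 )); i=$(( i - 1 )); done
    echo $(( unit / scale * scale * 100 ))
}

right_size 251000193024    # prints 250000000000, matching the python above
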
Post by Bob Friesenhahn
Is it ok to arbitrarily ignore 3/4TB of storage
space?
If it's less than 1% of the disk space, I don't see a problem doing so.
Post by Bob Friesenhahn
If the "drive" is actually a huge 20TB LUN exported from a SAN RAID array,
how should the margin for error be handled in that case?
So make it configurable if you must. If no partition table exists
when "zpool create" is called, make it "right-size" the disks, but if
a pre-existing EFI label is there, use it instead. Or make a flag
that tells zpool create not to "right-size".

Will
Bob Friesenhahn
2009-01-18 19:30:45 UTC
Permalink
Post by Will Murnane
Most drives are sold with two significant digits in the size: 320 GB,
400 GB, 640GB, 1.0 TB, etc. I don't see this changing any time
particularly soon; unless someone starts selling a 1.25 TB drive or
something, two digits will suffice. Even then, this formula would
give you 96% (1.2/1.25) of the disk's capacity.
If the drive is attached to a RAID controller which steals part of its
capacity for its own purposes, how will you handle that?

These stated drive sizes are just marketing terms and do not have a
sound technical basis. Don't drive vendors provide actual sizing
information in their specification sheets so that knowledgeable
people can purchase the right sized drive?

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Tim
2009-01-18 19:43:28 UTC
Permalink
On Sun, Jan 18, 2009 at 1:30 PM, Bob Friesenhahn <
Post by Bob Friesenhahn
Post by Will Murnane
Most drives are sold with two significant digits in the size: 320 GB,
400 GB, 640GB, 1.0 TB, etc. I don't see this changing any time
particularly soon; unless someone starts selling a 1.25 TB drive or
something, two digits will suffice. Even then, this formula would
give you 96% (1.2/1.25) of the disk's capacity.
If the drive is attached to a RAID controller which steals part of its
capacity for its own purposes, how will you handle that?
These stated drive sizes are just marketing terms and do not have a
sound technical basis. Don't drive vendors provide actual sizing
information in their specification sheets so that knowledgeable
people can purchase the right sized drive?
Bob
======================================
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
You look at the size of the drive and you take a set percentage off... If
it's a "LUN" and it's so far off it still can't be added with the percentage
that works across the board for EVERYTHING ELSE, you change the size of the
LUN at the storage array or adapter.

I know it's fun to pretend this is rocket science and impossible, but the
fact remains the rest of the industry has managed to make it work. I have a
REAL tough time believing that Sun and/or zfs is so deficient it's an
insurmountable obstacle for them.

--Tim
Eric D. Mudama
2009-01-18 19:56:52 UTC
Permalink
Post by Tim
You look at the size of the drive and you take a set percentage off... If
it's a "LUN" and it's so far off it still can't be added with the
percentage that works across the board for EVERYTHING ELSE, you change the
size of the LUN at the storage array or adapter.
I know it's fun to pretend this is rocket science and impossible, but the
fact remains the rest of the industry has managed to make it work. I have
a REAL tough time believing that Sun and/or zfs is so deficient it's an
insurmountable obstacle for them.
If, instead of having ZFS manage these differences, a user simply
created slices that were, say, 98% as big as the average number of
sectors in a XXX GB drive... would ZFS enable write cache on that
device or not?

I thought I'd read that ZFS didn't use write cache on slices because
it couldn't guarantee that the other slices were used in a
write-cache-safe fashion, would that apply to cases where no other
slices were allocated?
--
Eric D. Mudama
***@mail.bounceswoosh.org
Tim
2009-01-18 19:54:01 UTC
Permalink
On Sun, Jan 18, 2009 at 1:56 PM, Eric D. Mudama
Post by Eric D. Mudama
Post by Tim
You look at the size of the drive and you take a set percentage off... If
it's a "LUN" and it's so far off it still can't be added with the
percentage that works across the board for EVERYTHING ELSE, you change the
size of the LUN at the storage array or adapter.
I know it's fun to pretend this is rocket science and impossible, but the
fact remains the rest of the industry has managed to make it work. I have
a REAL tough time believing that Sun and/or zfs is so deficient it's an
insurmountable obstacle for them.
If, instead of having ZFS manage these differences, a user simply
created slices that were, say, 98% as big as the average number of
sectors in a XXX GB drive... would ZFS enable write cache on that
device or not?
I thought I'd read that ZFS didn't use write cache on slices because
it couldn't guarantee that the other slices were used in a
write-cache-safe fashion, would that apply to cases where no other
slices were allocated?
It will disable it by default, but you can manually re-enable it. That's
not so much the point though. ZFS is supposed to be filesystem/volume
manager all-in-one. When I have to start going through format every time I
add a drive, it's a non-starter, not to mention it's a kludge.
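
For the record, the manual re-enable lives in format's expert mode; roughly
like this (the cache menus only show up for disks the sd driver handles, so
it may not be available on every controller, and details can vary by release):

# format -e
Specify disk (enter its number): 8
format> cache
cache> write_cache
write_cache> enable
write_cache> quit
format> quit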

--Tim
Richard Elling
2009-01-18 20:43:46 UTC
Permalink
comment at the bottom...
Post by Tim
On Sun, Jan 18, 2009 at 1:56 PM, Eric D. Mudama
You look at the size of the drive and you take a set percentage off... If
it's a "LUN" and it's so far off it still can't be added with the
percentage that works across the board for EVERYTHING ELSE, you change the
size of the LUN at the storage array or adapter.
I know it's fun to pretend this is rocket science and impossible, but the
fact remains the rest of the industry has managed to make it work. I have
a REAL tough time believing that Sun and/or zfs is so deficient it's an
insurmountable obstacle for them.
If, instead of having ZFS manage these differences, a user simply
created slices that were, say, 98% as big as the average number of
sectors in a XXX GB drive... would ZFS enable write cache on that
device or not?
I thought I'd read that ZFS didn't use write cache on slices because
it couldn't guarantee that the other slices were used in a
write-cache-safe fashion, would that apply to cases where no other
slices were allocated?
It will disable it by default, but you can manually re-enable it.
That's not so much the point though. ZFS is supposed to be
filesystem/volume manager all-in-one. When I have to start going
through format every time I add a drive, it's a non-starter, not to
mention it's a kludge.
DIY. Personally, I'd be more upset if ZFS reserved any sectors
for "some potential swap I might want to do later, but may never
need to do." If you want to reserve some space for swappage, DIY.

As others have noted, this is not a problem for systems vendors
because we try, and usually succeed, at ensuring that our multiple
sources of disk drives are compatible such that we can swap one
for another.
-- richard
Tim
2009-01-18 21:00:47 UTC
Permalink
Post by Richard Elling
comment at the bottom...
DIY. Personally, I'd be more upset if ZFS reserved any sectors
for "some potential swap I might want to do later, but may never
need to do." If you want to reserve some space for swappage, DIY.
As others have noted, this is not a problem for systems vendors
because we try, and usually succeed, at ensuring that our multiple
sources of disk drives are compatible such that we can swap one
for another.
-- richard
And again I call BS. I've pulled drives out of a USP-V, Clariion, DMX, and
FAS3040. Every single one had drives of slightly differing sizes. Every
single one is right-sized at format time.

Hell, here's a filer I have sitting in a lab right now:

RAID Disk  Device  HA  SHELF  BAY  CHAN  Pool  Type   RPM   Used (MB/blks)    Phys (MB/blks)
---------  ------  --  -----  ---  ----  ----  ----   ----  --------------    --------------
dparity    0b.32   0b    2     0   FC:B   -    FCAL  10000  68000/139264000   68444/140174232
parity     0b.33   0b    2     1   FC:B   -    FCAL  10000  68000/139264000   68444/140174232
data       0b.34   0b    2     2   FC:B   -    FCAL  10000  68000/139264000   68552/140395088

Notice lines 2 and 3 have different physical block counts, and those are BOTH
Seagate Cheetahs, just different generations. So they get short stroked to
68000 from 68552 or 68444.
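
That works out to well under one percent of the raw capacity being reserved:

$ echo "scale=2; (68552 - 68000) * 100 / 68552" | bc
.80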

And NO, the re-branded USP-Vs Sun sells don't do anything any differently,
so stop lying, it's getting old.

If you're so concerned with the storage *lying* or *hiding* space, I assume
you're leading the charge at Sun to properly advertise drive sizes, right?
Because the 1TB drive I can buy from Sun today is in no way, shape, or form
able to store 1TB of data. You use the same *fuzzy math* the rest of the
industry does.

--Tim
Richard Elling
2009-01-18 21:39:57 UTC
Permalink
Post by Richard Elling
comment at the bottom...
DIY. Personally, I'd be more upset if ZFS reserved any sectors
for "some potential swap I might want to do later, but may never
need to do." If you want to reserve some space for swappage, DIY.
As others have noted, this is not a problem for systems vendors
because we try, and usually succeed, at ensuring that our multiple
sources of disk drives are compatible such that we can swap one
for another.
-- richard
And again I call BS. I've pulled drives out of a USP-V, Clariion, DMX,
and FAS3040. Every single one had drives of slightly differing sizes.
Every single one is right-sized at format time.
It is naive to think that different storage array vendors
would care about people trying to use another array vendor's
disks in their arrays. In fact, you should get a flat,
impersonal, "not supported" response.

What vendors can do is make sure that if you get a disk
which is supported in a platform and replace it with another
disk which is also supported, and the same size, then it will
just work. In order for this method to succeed, a least
common size is used.
Post by Richard Elling
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used
(MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- -----
-------------- --------------
dparity 0b.32 0b 2 0 FC:B - FCAL 10000
68000/139264000 68444/140174232
parity 0b.33 0b 2 1 FC:B - FCAL 10000
68000/139264000 68444/140174232
data 0b.34 0b 2 2 FC:B - FCAL 10000
68000/139264000 68552/140395088
Notice line's 2 and 3 are different physical block size, and those are
BOTH seagate cheetah's, just different generation. So, it gets short
stroked to 68000 from 68552 or 68444.
And NO, the re-branded USP-V's Sun sell's don't do anything any
differently, so stop lying, it's getting old.
Vendors can change the default label, which is how it is
implemented. For example, if we source XYZ-GByte disks
from two different vendors intended for the same platform,
then we will ensure that the number of available sectors
is the same, otherwise the FRU costs would be very high.
No conspiracy here... just good planning.
Post by Richard Elling
If you're so concerned with the storage *lying* or *hiding* space, I
assume you're leading the charge at Sun to properly advertise drive
sizes, right? Because the 1TB drive I can buy from Sun today is in no
way, shape, or form able to store 1TB of data. You use the same *fuzzy
math* the rest of the industry does.
There is no fuzzy math. Disk vendors size by base 10.
They explicitly state this in their product documentation,
as business law would expect.
http://en.wikipedia.org/wiki/Mebibyte
-- richard
Tim
2009-01-18 21:58:18 UTC
Permalink
Post by Richard Elling
It is naive to think that different storage array vendors
would care about people trying to use another array vendors
disks in their arrays. In fact, you should get a flat,
impersonal, "not supported" response.
But we aren't talking about me trying to stick disks into Sun's arrays.
We're talking about how this open source, supposed all-in-one volume manager
and filesystem handles new disks. You know, the one that was supposed to
make all of our lives infinitely easier, and simplify managing lots, and
lots of disks. Whether they be inside of a official Sun array or just a
server running Solaris.
Post by Richard Elling
What vendors can do, is make sure that if you get a disk
which is supported in a platform and replace it with another
disk which is also supported, and the same size, then it will
just work. In order for this method to succeed, a least,
common size is used.
The ONLY reason vendors put special labels or firmware on disks is to force
you to buy them direct. Let's not pretend there's something magical about
an "HDS" 1TB drive or a "Sun" 1TB drive. They're rolling off the same line
as everyone else's. The way they ensure the disk works is by short stroking
them from the start...

It's *naive* to claim it's any sort of technical limitation.
Post by Richard Elling
Vendors can change the default label, which is how it is
implemented. For example, if we source XYZ-GByte disks
from two different vendors intended for the same platform,
then we will ensure that the number of available sectors
is the same, otherwise the FRU costs would be very high.
No conspiracy here... just good planning.
The number of blocks on the disks won't be the same. Which is why they're
right-sized per above. Do I really need to start pulling disks from my Sun
systems to prove this point? Sun does not require exact block counts any
more than HDS, EMC, or NetApp. So for the life of the server, I can call in
and get the exact same part that broke in the box from Sun, because they've
got contracts with the drive mfg's. What happens when I'm out of the
supported life of the system? Oh, I just buy a new one? Because having my
volume manager us a bit of intelligence and short stroke the disk like I
would expect from the start is a *bad idea*.

The sad part about all of this is that the $15 Promise RAID controller in my
desktop short-strokes by default, and you're telling me zfs can't, or won't.
Post by Richard Elling
There is no fuzzy math. Disk vendors size by base 10.
They explicitly state this in their product documentation,
as business law would expect.
http://en.wikipedia.org/wiki/Mebibyte
-- richard
If it's not fuzzy math, drive mfg's wouldn't lose in court over the false
advertising, would they?
http://apcmag.com/seagate_settles_class_action_cash_back_over_misleading_hard_drive_capacities.htm



At the end of the day, this back and forth changes nothing though. The
default behavior for zfs importing a new disk should be to right-size by a
fairly conservative amount if you're (you as in Sun, not you as in Richard)
going to continue to market it as you have in the past. It most definitely
does not eliminate the same old pains of managing disks with Solaris if I
have to start messing with labels and slices again. The whole point of
merging a volume manager/filesystem/etc is to take away that pain. That is
not even remotely manageable over the long term.

--Tim
Eric D. Mudama
2009-01-19 00:49:34 UTC
Permalink
Post by Tim
If you're so concerned with the storage *lying* or *hiding* space, I
assume you're leading the charge at Sun to properly advertise drive sizes,
right? Because the 1TB drive I can buy from Sun today is in no way,
shape, or form able to store 1TB of data. You use the same *fuzzy math*
the rest of the industry does.
While in general I'd like to see a combined FS/VM be smarter as you
do, on this point I disagree with you. Most drive vendors publish the
exact sector counts of each model that they ship, and this should be
sufficient for your purposes.

As an arbitrary example, seagate lists a number of "Guaranteed
Sectors" in their technical specifications for each unique model
number.

Their 7200.11 500GB drive ST3500320AS guarantees 976,773,168 sectors,
which happens to exactly equal the IDEMA amount for 500GB.

While rounding down to the next IDEMA multiple might make sense,
depending on the technique that could cost you 1GB per device, and
I'm sure a lot of people would rather not have that limitation.
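
To put a rough number on that, taking the Samsung from earlier in the thread
and rounding it down to the previous whole-GB point of the formula (my
reading of "IDEMA multiple") gives up close to a gigabyte:

$ sectors=976760063                                  # the Samsung's sector count
$ gb=$(( (sectors - 97696368) / 1953504 ))           # largest whole-GB size that fits: 449
$ idema=$(( 97696368 + 1953504 * gb ))               # 974819664 sectors
$ echo $(( (sectors - idema) * 512 / 1000000 )) MB
993 MB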
--
Eric D. Mudama
***@mail.bounceswoosh.org
Ross
2009-01-19 11:25:53 UTC
Permalink
The problem is they might publish these numbers, but we really have no way of controlling what number manufacturers will choose to use in the future.

If for some reason future 500GB drives all turn out to be slightly smaller than the current ones you're going to be stuck. Reserving 1-2% of space in exchange for greater flexibility in replacing drives sounds like a good idea to me. As others have said, RAID controllers have been doing this for long enough that even the very basic models do it now, and I don't understand why such simple features like this would be left out of ZFS.

Fair enough, for high end enterprise kit where you want to squeeze every byte out of the system (and know you'll be buying Sun drives), you might not want this, but it would have been trivial to turn this off for kit like that. It's certainly a lot easier to expand a pool than shrink it!
--
This message posted from opensolaris.org
Blake
2009-01-19 14:14:59 UTC
Permalink
I'm going waaay out on a limb here, as a non-programmer...but...

Since the source is open, maybe community members should organize and
work on some sort of sizing algorithm? I can certainly imagine Sun
deciding to do this in the future - I can also imagine that it's not
at the top of Sun's priority list (most of the devices they deal with
are their own, and perhaps not subject to the right-sizing issue). If
it matters to the community, why not, as a community, try to
fix/improve zfs in this way?

Again, I've not even looked at the code for block allocation or
whatever it might be called in this case, so I could be *way* off here
:)

Lastly, Antonius, you can try the zpool trick to get this disk
relabeled, I think. Try 'zpool create temp_pool [problem_disk]' then
'zpool destroy temp_pool' - this should relabel the disk in question
and set up the defaults that zfs uses. Can you also run format >
partition > print on one of the existing disks and send the output so
that we can see what the existing disk looks like? (Off-list directly
to me if you prefer).
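
In command form that would be something like the following (device names
taken from your original post; if the disk is still attached as a spare it
has to come out of the spares list first):

zpool remove storage c1d1        # only needed while it is still listed under spares
zpool create temp_pool c1d1      # lets ZFS write its default EFI label
zpool destroy temp_pool
zpool replace storage c2d1 c1d1  # then retry the replace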

cheers,
Blake
Richard Elling
2009-01-19 16:18:05 UTC
Permalink
Post by Ross
The problem is they might publish these numbers, but we really have no way of controlling what number manufacturers will choose to use in the future.
If for some reason future 500GB drives all turn out to be slightly smaller than the current ones you're going to be stuck. Reserving 1-2% of space in exchange for greater flexibility in replacing drives sounds like a good idea to me. As others have said, RAID controllers have been doing this for long enough that even the very basic models do it now, and I don't understand why such simple features like this would be left out of ZFS.
I have added the following text to the best practices guide:

* When a vdev is replaced, the size of the replacement vdev, measured by
usable sectors, must be the same or greater than the vdev being replaced.
This can be confusing when whole disks are used, because different models
of disks may provide a different number of usable sectors. For example, if
a pool was created with a "500 GByte" drive and you need to replace it with
another "500 GByte" drive, then you may not be able to do so if the drives
are not of the same make, model, and firmware revision. Consider planning
ahead and reserving some space by creating a slice which is smaller than
the whole disk, and using that slice instead of the whole disk.
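
As a concrete sketch of that workaround (hypothetical device names, and the
amount you hold back is up to you):

# In format, label each disk and size slice 0 a little under the full
# capacity (fmthard can script this), then build the pool on the slices
# rather than on the whole disks. This only helps if done up front, when
# the pool is created:
zpool create tank raidz c1t0d0s0 c1t1d0s0 c1t2d0s0
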
Post by Ross
Fair enough, for high end enterprise kit where you want to squeeze every byte out of the system (and know you'll be buying Sun drives), you might not want this, but it would have been trivial to turn this off for kit like that. It's certainly a lot easier to expand a pool than shrink it!
Actually, enterprise customers do not ever want to squeeze every byte, they
would rather have enough margin to avoid such issues entirely. This is what
I was referring to earlier in this thread wrt planning.
-- richard
Jim Dunham
2009-01-19 16:39:25 UTC
Permalink
Richard,
Post by Richard Elling
Post by Ross
The problem is they might publish these numbers, but we really have
no way of controlling what number manufacturers will choose to use
in the future.
If for some reason future 500GB drives all turn out to be slightly
smaller than the current ones you're going to be stuck. Reserving
1-2% of space in exchange for greater flexibility in replacing
drives sounds like a good idea to me. As others have said, RAID
controllers have been doing this for long enough that even the very
basic models do it now, and I don't understand why such simple
features like this would be left out of ZFS.
* When a vdev is replaced, the size of the replacement vdev, measured by
usable sectors, must be the same or greater than the vdev being replaced.
This can be confusing when whole disks are used because different models of
disks may provide a different number of usable sectors. For example, if a
pool was created with a "500 GByte" drive and you need to replace it with
another "500 GByte" drive, then you may not be able to do so if the drives
are not of the same make, model, and firmware revision. Consider planning
ahead and reserving some space by creating a slice which is smaller than
the whole disk instead of the whole disk.
Creating a slice, instead of using the whole disk, will cause ZFS to
not enable write-caching on the underlying device.
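(The cache can still be turned back on by hand afterwards; a sketch using
the expert-mode menus of format(1M), device name again only an example:)

format -e c3d0
# then at the prompts: cache -> write_cache -> enable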

- Jim
Post by Richard Elling
Post by Ross
Fair enough, for high end enterprise kit where you want to squeeze
every byte out of the system (and know you'll be buying Sun
drives), you might not want this, but it would have been trivial to
turn this off for kit like that. It's certainly a lot easier to
expand a pool than shrink it!
Actually, enterprise customers do not ever want to squeeze every byte, they
would rather have enough margin to avoid such issues entirely. This is what
I was referring to earlier in this thread wrt planning.
-- richard
Julien Gabel
2009-01-19 18:51:56 UTC
Permalink
Post by Jim Dunham
Creating a slice, instead of using the whole disk, will cause ZFS to
not enable write-caching on the underlying device.
Post by Richard Elling
Correct. Engineering trade-off. Since most folks don't read the manual,
or the best practices guide, until after they've hit a problem, it is really
just a CYA entry :-(
It seems this trade-off can now be mitigated, regarding Roch Bourbonnais'
comment on another thread on this list:
- http://mail.opensolaris.org/pipermail/zfs-discuss/2009-January/054587.html

In particular:
" If ZFS owns a disk it will enable the write cache on the drive but I'm
not positive this has a great performance impact today. It used to
but that was before we had a proper NCQ implementation. Today
I don't know that it helps much. That this is because we always
flush the cache when consistency requires it."
--
julien.
http://blog.thilelli.net/
Richard Elling
2009-01-19 18:19:44 UTC
Permalink
Post by Jim Dunham
Richard,
Post by Richard Elling
Post by Ross
The problem is they might publish these numbers, but we really have
no way of controlling what number manufacturers will choose to use
in the future.
If for some reason future 500GB drives all turn out to be slightly
smaller than the current ones you're going to be stuck. Reserving
1-2% of space in exchange for greater flexibility in replacing
drives sounds like a good idea to me. As others have said, RAID
controllers have been doing this for long enough that even the very
basic models do it now, and I don't understand why such simple
features like this would be left out of ZFS.
* When a vdev is replaced, the size of the replacement vdev, measured by
usable sectors, must be the same or greater than the vdev being replaced.
This can be confusing when whole disks are used because different models of
disks may provide a different number of usable sectors. For example, if a
pool was created with a "500 GByte" drive and you need to replace it with
another "500 GByte" drive, then you may not be able to do so if the drives
are not of the same make, model, and firmware revision. Consider planning
ahead and reserving some space by creating a slice which is smaller than
the whole disk instead of the whole disk.
Creating a slice, instead of using the whole disk, will cause ZFS to
not enable write-caching on the underlying device.
Correct. Engineering trade-off. Since most folks don't read the manual,
or the best practices guide, until after they've hit a problem, it is
really just a CYA entry :-(

BTW, I also added a quick link to CR 4852783 (reduce pool capacity), which
is the feature that has a good chance of making this point moot.
-- richard
Moore, Joe
2009-01-20 17:04:53 UTC
Permalink
Post by Ross
The problem is they might publish these numbers, but we really have no way
of controlling what number manufacturers will choose to use in the future.
If for some reason future 500GB drives all turn out to be slightly smaller
than the current ones you're going to be stuck. Reserving 1-2% of space in
exchange for greater flexibility in replacing drives sounds like a good
idea to me. As others have said, RAID controllers have been doing this for
long enough that even the very basic models do it now, and I don't
understand why such simple features like this would be left out of ZFS.
It would certainly be "terrible" to go back to the days where 5% of the filesystem space is inaccessible to users, forcing the sysadmin to manually change that percentage to 0 to get full use of the disk.

Oh wait, UFS still does that, and it's a configurable parameter at mkfs time (and can be tuned on the fly)
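(For reference, the UFS knob looks like this; the 5% figure and the device
name are only examples:)

newfs -m 5 /dev/rdsk/c1d1s0     # reserve 5% (minfree) at filesystem creation
tunefs -m 0 /dev/rdsk/c1d1s0    # or change the reserve later, on the fly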

For a ZFS pool (until block pointer rewrite capability), this would have to be a pool-create-time parameter. Perhaps a --usable-size=N[%] option which would either cut down the size of the EFI slices or fake the disk geometry so the EFI label ends early.

Or it would be a small matter of programming to build a perl wrapper for zpool create that would accomplish the same thing.
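A back-of-the-envelope sketch of what either approach boils down to (the
98% figure, device names and pool layout are only examples; the sector
count is the "sectors =" value from the format output earlier in this
thread):

SECTORS=976760063                  # size of the disk being added, in 512-byte sectors
USABLE=$(( SECTORS * 98 / 100 ))   # keep ~2% in reserve for a smaller future disk
echo "give slice 0 ${USABLE} sectors instead of the whole disk"
# ...then label the disk accordingly and build the pool on the slices:
# zpool create storage raidz c3d0s0 c4d0s0 ...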

--Joe
Miles Nordin
2009-01-20 18:46:56 UTC
Permalink
mj> For a ZFS pool, (until block pointer rewrite capability) this
mj> would have to be a pool-create-time parameter.

naw. You can just make ZFS do it all the time, like the other storage
vendors do. no parameters.

You can invent parameter-free ways of turning it off. For example,

1. label the disk with an EFI label taking up the whole disk

2. point ZFS at slice zero instead of the whole disk, like
/dev/dsk/c0t0d0s0 instead of /dev/dsk/c0t0d0

3. ZFS will then be written to know it's supposed to use the entire
disk instead of writing a new label, but will still behave as
though it owns the disk cache-wise.

-or-

1. label the disk any way you like

2. point ZFS at the whole disk, /dev/dsk/c0t0d0. And make that
whole-disk device name work for all disks no matter what
controller, whether or not they're ``removeable,'' or how they're
labeled, like the equivalent device name does in Linux, FreeBSD,
and Mac OS X.

3. ZFS should remove your label and write a one-slice EFI label that
doesn't use the entire disk, and rounds down to a
bucketed/quantized/whole-ish number. If the disk is a replacement
for a component of an existing vdev, the EFI labelsize it picks
will be the *larger* of:

a. The right-size ZFS would have picked if the disk weren't a
replacement

b. the smallest existing component in the vdev

Most people will not even notice the feature exists except by getting
errors less often. AIUI this is how it works with other RAID layers,
the cheap and expensive alike among ``hardware'' RAID, and this
common-practice is very ZFS-ish. except hardware RAID is proprietary
so you cannot determine their exact policy, while in ZFS you would be
able to RTFS and figure it out.

But there is still no need for parameters. There isn't even a need to
explain the feature to the user.

I guess this has by now become a case of the silly unimportant but
easy-to-understand feature dominating the mailing list because it's so
obvious that everyone's qualified to pipe up with his opinion, so
maybe I'm a bit late and should have let it die.
Moore, Joe
2009-01-20 20:26:23 UTC
Permalink
Post by Miles Nordin
mj> For a ZFS pool, (until block pointer rewrite capability) this
mj> would have to be a pool-create-time parameter.
naw. You can just make ZFS do it all the time, like the other storage
vendors do. no parameters.
Other storage vendors have specific compatibility requirements for the disks you are "allowed" to install in their chassis.

On the other hand, OpenSolaris is intended to work on commodity hardware.

And there is no way to change this after the pool has been created, since after that time, the disk size can't be changed. So whatever policy is used by default, it is very important to get it right.
(snip)
Post by Miles Nordin
Most people will not even notice the feature exists except by getting
errors less often. AIUI this is how it works with other RAID layers,
the cheap and expensive alike among ``hardware'' RAID, and this
common-practice is very ZFS-ish. except hardware RAID is proprietary
so you cannot determine their exact policy, while in ZFS you would be
able to RTFS and figure it out.
Sysadmins should not be required to RTFS. Behaviors should be documented in other places too.
Post by Miles Nordin
But there is still no need for parameters. There isn't even a need to
explain the feature to the user.
There isn't a need to explain the feature to the user? That's one of the most irresponsible responses I've heard lately. A user is expecting their 500GB disk to be 500000000 bytes, not 4999500 bytes, unless that feature is explained.

Parameters with reasonable defaults (and a reasonable way to change them) allow users who care about the parameter and understand the tradeoffs involved in changing from the default to make their system work better.

If I didn't want to be able to tune my system for performance, I would be running Windows. OpenSolaris is about transparency, not just Open Source.

--Joe
Richard Elling
2009-01-20 21:36:20 UTC
Permalink
[I hate to keep dragging this thread forward, but...]
Post by Moore, Joe
And there is no way to change this after the pool has been created,
since after that time, the disk size can't be changed. So whatever
policy is used by default, it is very important to get it right.
Today, vdev size can be grown, but not shrunk, on the fly, without
causing any copying of data. If you need to shrink today, you need
to copy the data. This is also true of many, but not all, file systems.
-- richard
Miles Nordin
2009-01-20 22:17:49 UTC
Permalink
jm> Sysadmins should not be required to RTFS.

I never said they were. The comparison was between hardware RAID and
ZFS, not between two ZFS alternatives. The point: other systems'
behavior is enitely secret. Therefore, secret opaque undiscussed
right-sizing is the baseline. The industry-wide baseline is not
guaranteeing to use the whole disk no matter what, nor is it building
a flag-ridden partitioning tool with bikeshed HOWTO documentation into
zpool full of multi-paragraph Windows ExPee-style CYA ``are you SURE
you want to use the whole disk, because blah bla blahblah blha
blaaagh'' modal dialog box warnings.

This overdiscussion feels like the way X.509 and IPsec grow and grow,
accommodating every feature dreamed up by people who don't have to
implement or live with the result because each feature is so important
that some day it'd be disastrous not to have it.

jm> There isn't a need to explain the feature to the user? That's
jm> one of the most irresponsible responses I've heard lately.

It's fine if you disagree, but the disastrous tone makes no sense.
Other filesystems and RAID layers consume similar amounts of space for
metadata, labels, bitmaps, whatever. The suggestion is neither
surprising nor harmful, especially compared to the current behavior.

anyway probably none of it matters because of the IDEMA sizes, and the
rewrite/evacuation feature that will hopefully be done a couple years
from now.
Tim
2009-01-21 00:37:32 UTC
Permalink
Post by Moore, Joe
Other storage vendors have specific compatibility requirements for the
disks you are "allowed" to install in their chassis.
And again, the reason for those requirements is 99% about making money, not
a technical one. If you go back far enough in time, nearly all of them at
some point allowed non-approved disks into the system, or there was firmware
available to flash unsupported drives to make them work. Heck, if you knew
the right people you could still do that today...
Post by Moore, Joe
There isn't a need to explain the feature to the user? That's one of the
most irresponsible responses I've heard lately. A user is expecting their
500GB disk to be 500000000 bytes, not 4999500 bytes, unless that feature is
explained.
The user DEFINITELY isn't expecting 500000000 bytes, or what you meant to
say 500000000000 bytes, they're expecting 500GB. You know, 536,870,912,000
bytes. But even if the drive mfg's calculated it correctly, they wouldn't
even be getting that due to filesystem overhead.

Funny I haven't seen any posts to the list from you demanding that Sun
release exact specifications for how much overhead is lost to metadata,
snapshots, and filesystem structure...
Post by Moore, Joe
Parameters with reasonable defaults (and a reasonable way to change them)
allow users who care about the parameter and understand the tradeoffs
involved in changing from the default to make their system work better.
If I didn't want to be able to tune my system for performance, I would be
running Windows. OpenSolaris is about transparency, not just Open Source.
If you fill the disks 100% full, you won't need to worry about performance.
In fact, I would wager if the only space you have left on the device is the
amount you lost to right-sizing, the pool will have already toppled over and
died.

Although I do agree with you, being able to change from the default
behavior, in general, is a good idea. Agreeing on what that default
behavior should be is probably another issue entirely ;)

I would imagine this could be something set perhaps with a flag in
bootenv.rc (or wherever deemed appropriate).
Anton B. Rang
2009-01-21 04:27:12 UTC
Permalink
Post by Tim
The user DEFINITELY isn't expecting 500000000 bytes, or what you meant to say 500000000000
bytes, they're expecting 500GB. You know, 536,870,912,000 bytes. But even if the drive mfg's
calculated it correctly, they wouldn't even be getting that due to filesystem overhead.
I doubt there are any users left in the world that would expect that -- the drive manufacturers have made it clear for the past 20 years that 500 GB = 500*10^9, not 500*2^30. Even the OS vendors have finally (for the most part) started displaying GB instead of GiB.
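(For the disks in this very thread the gap is easy to see; the arithmetic
below is just the GB-versus-GiB conversion, nothing more:)

echo $(( 976760063 * 512 ))    # 500101152256 bytes, i.e. roughly 500 x 10^9 ("500 GB")
# 500101152256 / 2^30 is about 465.76, which is the "465.76GB" format prints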
Post by Tim
And again, the reason for [certified devices] is 99% about making money, not a technical one.
Yes and no. From my experience at three storage vendors, it *is* about making money (aren't all corporate decisions supposed to be?) but it's less about making money by selling overpriced drives than by not *losing* money by trying to support hardware that doesn't quite work. It's a dirty little secret of the drive/controller/array industry (and networking, for that matter) that two arbitrary pieces of hardware which are supposed to conform to a standard will usually, mostly, work together -- but not always, and when they fail, it's very difficult to track down (usually impossible in a customer environment). By limiting which drives, controllers, firmware revisions, etc. are supported, we reduce the support burden immensely and are able to ensure that we can actually test what a customer is using.

A few specific examples I've seen personally:

* SCSI drives with caches that would corrupt data if the mode pages were set wrong.
* SATA adapters which couldn't always complete commands simultaneously on multiple channels (leading to timeouts or I/O errors).
* SATA controllers which couldn't quite deal with timing at one edge of the spec ... and drives which pushed the timing to that edge under the right conditions.
* Drive firmware which silently dropped commands when the queue depth got too large.

All of these would 'mostly work', especially in desktop use (few outstanding commands, no changes to default parameters, no use of task control messages), but would fail in other environments in ways that were almost impossible to track down with specialized hardware.

When I was in a software-only RAID company, we did support nearly arbitrary hardware -- but we had a "compatible" list of what we'd tested, and for everything else, the users were pretty much on their own. That's OK for home users, but for critical data, the greatly increased risk is not worth saving a few thousand (or even tens of thousands) dollars.
--
This message posted from opensolaris.org
C***@Sun.COM
2009-01-21 09:58:09 UTC
Permalink
Post by Tim
The user DEFINITELY isn't expecting 500000000 bytes, or what you meant to
say 500000000000 bytes, they're expecting 500GB. You know, 536,870,912,000
bytes. But even if the drive mfg's calculated it correctly, they wouldn't
even be getting that due to filesystem overhead.
Then you have a very stupid user who has been living in a cave.

The only reason we label memory incorrectly is that the systems are
binary. (Incorrectly, because there's one standard and it says that
"K", "M", "G" and "T" are powers of 10.)
The computer cannot efficiently address non-binary-sized memory.

IIRC, some stupid user did indeed sue WD and he won, but that is in
America (I'm sure that the km is 1024 meters in the US)

Since that lawsuit the vendors all make sure that the specification says
how many addressable sectors are in a disk.

You make the "right size" disk "a big issue". And perhaps it is; however,
ZFS has been out for a number of years and no one complained about it before.
Not only is it not a "big priority", it's not even on the list.

File a bug/rfe, if you want this fixed.

Casper
Miles Nordin
2009-01-19 18:20:55 UTC
Permalink
edm> If, instead of having ZFS manage these differences, a user
edm> simply created slices that were, say, 98%

if you're willing to manually create slices, you should be able to
manually enable the write cache, too, while you're in there, so I
wouldn't worry about that. I'd worry a little about the confusion
over this write cache bit in general---where the write cache setting
is stored and when it's enabled and when (if?) it's disabled, if the
rules differ on each type of disk attachment, and if you plug the disk
into Linux will Linux screw up the setting by auto-enabling at boot or
by auto-disabling at shutdown or does Linux use stateless versions
(analogous to sdparm without --save) when it prints that boot-time
message about enabling write caches? For example weirdness, on iSCSI
I get this, on a disk to which I've let ZFS write a GPT/EFI label:

write_cache> display
Write Cache is disabled
write_cache> enable
Write cache setting is not changeable

so is that a bug of my iSCSI target, and is there another implicit
write cache inside the iSCSI initiator or not? The Linux hdparm man
page says:

-W Disable/enable the IDE drive's write-caching feature (default
state is undeterminable; manufacturer/model specific).

so is the write_cache 'display' feature in 'format -e' actually
reliable? Or is it impossible to reliably read this setting on an ATA
drive, and 'format -e' is making stuff up?

With Linux I can get all kinds of crazy caching data from a SATA disk:

***@node0 ~ # sdparm --page=ca --long /dev/sda
/dev/sda: ATA WDC WD1000FYPS-0 02.0
Caching (SBC) [PS=0] mode page:
IC 0 Initiator control
ABPF 0 Abort pre-fetch
CAP 0 Caching analysis permitted
DISC 0 Discontinuity
SIZE 0 Size (1->CSS valid, 0->NCS valid)
WCE 1 Write cache enable
MF 0 Multiplication factor
RCD 0 Read cache disable
DRRP 0 Demand read retension priority
WRP 0 Write retension priority
DPTL 0 Disable pre-fetch transfer length
MIPF 0 Minimum pre-fetch
MAPF 0 Maximum pre-fetch
MAPFC 0 Maximum pre-fetch ceiling
FSW 0 Force sequential write
LBCSS 0 Logical block cache segment size
DRA 0 Disable read ahead
NV_DIS 0 Non-volatile cache disable
NCS 0 Number of cache segments
CSS 0 Cache segment size

but what's actually coming from the drive, and what's fabricated by
the SCSI-to-SATA translator built into Garzik's libata? Because I
think Solaris has such a translator, too, if it's attaching sd to SATA
disks. I'm guessing it's all a fantasy because:

***@node0 ~ # sdparm --clear=WCE /dev/sda
/dev/sda: ATA WDC WD1000FYPS-0 02.0
change_mode_page: failed setting page: Caching (SBC)

but neverminding the write cache, I'd be happy saying ``just round
down disk sizes using the labeling tool instead of giving ZFS the
whole disk, if you care,'' IF the following things were true:

* doing so were written up as a best-practice. because, I think it's
a best practice if the rest of the storage industry from EMC to $15
promise cards is doing it, though maybe it's not important any more
because of IDEMA. And right now very few people are likely to have
done it because of the way they've been guided into the setup process.

* it were possible to do this label-sizing to bootable mirrors in the
various traditional/IPS/flar/jumpstart installers

* there weren't a proliferation of >= 4 labeling tools in Solaris,
each riddled with assertion bailouts and slightly different
capabilities. Linux also has a mess of labeling tools, but they're
less assertion-riddled, and usually you can pick one and use it for
everything---you don't have to drag out a different tool for USB
sticks because they're considered ``removeable.'' Also it's always
possible to write to the unpartitioned block device with 'dd' on
Linux (and FreeBSD and Mac OS X), no matter what label is on the
disk, while Solaris doesn't seem to have an unpartitioned device.
And finally the Linux formatting tools work by writing to this
unpartitioned device, not by calling into a rat's nest of ioctl's,
so they're much easier for me to get along with.

Part of the attraction of ZFS should be avoiding this messy part of
Solaris, but we still have to use format/fmthard/fdisk/rmformat, to
swap label types because ZFS won't, to frob the write cache because
ZFS's user interface is too simple and does that semi-automatically
though I'm not sure all the rules it's using, to enumerate the
installed disks, to determine in which of the several states
working / connected-but-not-identified / disconnected /
disconnected-but-refcounted the iSCSI initiator is in.

And while ZFS will do special things to an UNlabeled disk, I'm not
sure there is a documented procedure for removing the label from a
disk---Sun seems to imagine all disks will ship with labels that
can never be removed, only ``converted,'' and removing a GPT/EFI
label is tricky because of that backup label at the end which some
tools respect and others don't.

I would prefer cleaning up the mess of labelers and removing bogus
assertions about ``removeable'' disks and similar cruft, over
adding the equivalent of a fifth also-extremely-limited labeling
tool to zpool.
JZ
2009-01-18 19:52:13 UTC
Permalink
Hi Bob, Will, Tim,
I also had some off-list comments on my irrelevant comments.
So I will try to make this post less irrelevant, though my thoughts on this
topic may be off the list discussion line of thoughts, as usual.
From the major storage vendors I know, network storage systems as integrated
products, are only offered with the same size/type of drives in a
traditional RAID set (not the V-RAID style). Mixing different drives in a
traditional RAID set is not recommanded by many vendors, and I think when
taking that as a policy, it will cut off much trouble in trying to mix
different drives in a RAID set.

And folks, the last time I really got into the largest database (by Winter)
data sets, their sizes were not really as huge as I thought.
http://www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenProgram.html

Again, I think, the exponential data growth we have been talking about for a
few years is more in file data. Databases use block-storage, very efficient
on capacity.

The kind of drives is not as important as how you use those drives, IMHO.

Best,
z


----- Original Message -----
From: "Bob Friesenhahn" <***@simple.dallas.tx.us>
To: "Will Murnane" <***@gmail.com>
Cc: <zfs-***@opensolaris.org>
Sent: Sunday, January 18, 2009 2:30 PM
Subject: Re: [zfs-discuss] replace same sized disk fails with too small
error
Post by Will Murnane
Most drives are sold with two significant digits in the size: 320 GB,
400 GB, 640GB, 1.0 TB, etc. I don't see this changing any time
particularly soon; unless someone starts selling a 1.25 TB drive or
something, two digits will suffice. Even then, this formula would
give you 96% (1.2/1.25) of the disk's capacity.
If the drive is attached to a RAID controller which steals part of its
capacity for its own purposes, how will you handle that?
These stated drive sizes are just marketing terms and do not have a
sound technical basis. Don't drive vendors provide actual sizing
information in their specification sheets so that knowledgeable
people can purchase the right sized drive?
Bob
======================================
Bob Friesenhahn
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Antonius
2009-01-21 06:55:39 UTC
Permalink
so you're suggesting I buy 750s to replace the 500s. then if a 750 fails buy another bigger drive again?

the drives are RMA replacements for the other disks that faulted in the array before. they are the same brand, model and model number, apparently not so under the label though, but no way I could tell that before.
--
This message posted from opensolaris.org
C***@Sun.COM
2009-01-21 10:27:14 UTC
Permalink
Post by Antonius
so you're suggesting I buy 750s to replace the 500s. then if a 750 fails buy another bigger drive again?
Have you filed a bug/rfe to fix this in ZFS in future?

Anyway, you only need to change the 750GB drives if:
- all 500GBs drives are replace by 750GB disks
- and they're all bigger than the newest 750GB
Post by Antonius
the drives are RMA replacements for the other disks that faulted in the
array before. they are the same brand, model and model number, apparently
not so under the label though, but no way I could tell that before.
That is really weird.

Or is this, perhaps, because you use a EFI label on the disks and
we now label the disks different? (I think we make sure that the
ZFS label starts at a 128K offset, now, before it did not)

Casper
Antonius
2009-01-19 09:39:39 UTC
Permalink
yes, it's the same make and model as most of the other disks in the zpool and reports the same number of sectors
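(Here is how the numbers line up, for anyone checking along; a sketch that
assumes the disks carry EFI labels, since that is when prtvtoc prints an
"accessible sectors" line, and uses the device names from this thread:)

for d in c1d1 c2d0 c3d0 c4d0; do
  echo "$d: $(prtvtoc /dev/rdsk/${d}s0 | awk '/accessible sectors/ {print $2}') accessible sectors"
done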
--
This message posted from opensolaris.org
Antonius
2009-01-18 07:18:35 UTC
Permalink
Volume name = < >
ascii name = <SAMSUNG-S0VVJ1CP30539-0001-465.76GB>
bytes/sector = 512
sectors = 976760063
accessible sectors = 976760030
Part Tag Flag First Sector Size Last Sector
0 usr wm 256 465.75GB 976743646
1 unassigned wm 0 0 0
2 unassigned wm 0 0 0
3 unassigned wm 0 0 0
4 unassigned wm 0 0 0
5 unassigned wm 0 0 0
6 unassigned wm 0 0 0
8 reserved wm 976743647 8.00MB 976760030

This is the readout from a disk it's meant to replace. Looks like the same number of sectors, as it should be, being the same model.
--
This message posted from opensolaris.org
dick hoogendijk
2009-01-18 10:25:29 UTC
Permalink
On Sat, 17 Jan 2009 23:18:35 PST
Antonius <***@gmail.com> wrote:

Maybe the other disk has an EFI label?
--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv105 ++
+ All that's really worth doing is what we do for others (Lewis Carrol)
Antonius
2009-01-18 11:54:52 UTC
Permalink
If so what should I do to remedy that? just reformat it?
--
This message posted from opensolaris.org
JZ
2009-01-18 11:55:34 UTC
Permalink
meh


----- Original Message -----
From: "Antonius" <***@gmail.com>
To: <zfs-***@opensolaris.org>
Sent: Sunday, January 18, 2009 6:54 AM
Subject: Re: [zfs-discuss] replace same sized disk fails with too small
error
Post by Antonius
If so what should I do to remedy that? just reformat it?
--
This message posted from opensolaris.org
Al Tobey
2009-01-18 20:09:37 UTC
Permalink
I ran into a bad label causing this once.

Usually the s2 slice is a good bet for your whole disk device, but if it's EFI labeled, you need to use p0 (somebody correct me if I'm wrong).

I like to zero the first few megs of a drive before doing any of this stuff. This will destroy any data.

Obviously, change "c7t1d0p0" to whatever your drive's device is.

dd if=/dev/zero of=/dev/rdsk/c7t1d0p0 bs=512 count=8192

For EFI you may also need to zero the end of the disk too because it writes the VTOC to both the beginning and end for redundancy. I'm not sure of the best way to get the drive size in blocks without using format(1M) so I'll leave that as an exercise for the reader. For my 500gb disks it was something like:

976533504 is $number_of_blocks (from format) - 8192 (4mb in 512 byte blocks).

dd if=/dev/zero of=/dev/rdsk/c7t1d0p0 bs=512 count=8192 seek=976533504

When you run format -> fdisk, it should prompt you to write a new Solaris label to the disk. Just accept all the defaults.

format -d c7t1d0

Remember to double-check your devices and wait a beat before pressing enter with those dd commands as they destroy without warning or checking.
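One way to script that subtraction, using the example numbers above (just a
sketch; substitute the block count format reports for your own disk):

BLOCKS=976541696    # number_of_blocks from format, for the disk above
dd if=/dev/zero of=/dev/rdsk/c7t1d0p0 bs=512 count=8192 seek=$(( BLOCKS - 8192 ))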
--
This message posted from opensolaris.org
JZ
2009-01-18 20:21:12 UTC
Permalink
Yes, I agree, the command-line interface is more efficient and more risky than a GUI.
You will have to be very careful when doing that.

Best,
z


----- Original Message -----
From: "Al Tobey" <***@gmail.com>
To: <zfs-***@opensolaris.org>
Sent: Sunday, January 18, 2009 3:09 PM
Subject: Re: [zfs-discuss] replace same sized disk fails with too small
error
Post by Al Tobey
(snip)
Antonius
2009-01-21 11:16:06 UTC
Permalink
I'll attach 2 files of output from 2 disks:

c4d0 is a current member of the zpool that is a "sibling" (as in a member of the same batch a couple of serial number increments different) of the faulted disk to replace and currently running without issue

and c3d0 is a new disk I got back as a replacement for a failed disk that's obviously different. It appears the EFI label needs fixing; I just can't get it to stick with any combination of commands I've tried.

e.g. removing and resetting all partitions with fdisk -e, and trying to
recreate the geometry as per the existing pool members, even after trying
to dd over the first section of all partitions:

bash-3.2# fdisk -A 238:0:0:1:0:254:63:1023:1:976773167 c3d0
fdisk: EFI partitions must encompass the entire disk
(input numsect: 976773167 - avail: 976760063)
--
This message posted from opensolaris.org
Richard Elling
2009-01-21 17:05:18 UTC
Permalink
I believe this is an fdisk issue. But I don't think any
of the fdisk engineers hang out on this forum.

You might try partitioning the disk on another OS.
-- richard
Post by Antonius
c4d0 is a current member of the zpool that is a "sibling" (as in a member of the same batch a couple of serial number increments different) of the faulted disk to replace and currently running without issue
and c3d0 is a new disk I got back as a replacement for a failed disk that's obviously different. It appears the EFI label needs fixing; I just can't get it to stick with any combination of commands I've tried.
e.g. removing and resetting all partitions with fdisk -e
bash-3.2# fdisk -A 238:0:0:1:0:254:63:1023:1:976773167 c3d0
fdisk: EFI partitions must encompass the entire disk
(input numsect: 976773167 - avail: 976760063)
Al Tobey
2009-01-21 17:39:13 UTC
Permalink
Grab the AOE driver and pull "aoelabinit" out of the package. They wrote it just for forcing EFI or Sun labels onto disks when the normal Solaris tools get in the way. Coraid's website looks like it's broken at the moment, so you may need to find it elsewhere on the web.
--
This message posted from opensolaris.org
Antonius
2009-01-22 02:53:10 UTC
Permalink
can you recommend a walk-through for this process, or a bit more of a description? I'm not quite sure how I'd use that utility to repair the EFI label
--
This message posted from opensolaris.org
Dale Sears
2009-01-22 08:32:07 UTC
Permalink
Would this work? (to get rid of an EFI label).

dd if=/dev/zero of=/dev/dsk/<thedisk> bs=1024k count=1

Then use

format

format might complain that the disk is not labeled. You
can then label the disk.

Dale
Post by Antonius
can you recommend a walk-through for this process, or a bit more of a description? I'm not quite sure how I'd use that utility to repair the EFI label
Antonius
2009-01-23 01:59:14 UTC
Permalink
yes, that's exactly what I did. the issue is that I can't get the "corrected" label to be written once I've zero'd the drive. I get an error from fdisk that apparently still sees the backup label
--
This message posted from opensolaris.org
Jonathan Edwards
2009-01-23 04:34:27 UTC
Permalink
not quite .. it's 16KB at the front and 8MB back of the disk (16384
sectors) for the Solaris EFI - so you need to zero out both of these

of course since these drives are <1TB i find it's easier to format
to SMI (vtoc) .. with format -e (choose SMI, label, save, validate -
then choose EFI)

but to Casper's point - you might want to make sure that fdisk is
using the whole disk .. you should probably reinitialize the fdisk
sectors either with the fdisk command or run fdisk from format (delete
the partition, create a new partition using 100% of the disk, blah,
blah) ..

finally - glancing at the format output - there appears to be a mix of
labels on these disks as you've got a mix c#d# entries and c#t#d#
entries so i might suspect fdisk might not be consistent across the
various disks here .. also noticed that you dumped the vtoc for c3d0
and c4d0, but you're replacing c2d1 (of unknown size/layout) with c1d1
(never dumped in your emails) .. so while this has been an animated
(slightly trollish) discussion on right-sizing (odd - I've typically
only seen that term as an ONTAPism) with some short-stroking digs ..
it's a little unclear what the c1d1s0 slice looks like here or what
the cylinder count is - i agree it should be the same - but it would
be nice to see from my armchair here
Post by Dale Sears
Would this work? (to get rid of an EFI label).
dd if=/dev/zero of=/dev/dsk/<thedisk> bs=1024k count=1
Then use
format
format might complain that the disk is not labeled. You
can then label the disk.
Dale
Post by Antonius
can you recommend a walk-through for this process, or a bit more of
a description? I'm not quite sure how I'd use that utility to
repair the EFI label
Paul Schlie
2009-01-23 04:12:30 UTC
Permalink
It also wouldn't be a bad idea for ZFS to verify that drives designated as
hot spares in fact have sufficient capacity to be compatible replacements
for particular configurations, prior to actually being critically required
(since drives otherwise appearing to have equivalent capacity may not, and
that wouldn't be a nice thing to first discover upon attempted replacement
of a failed drive).
Blake
2009-01-23 23:04:38 UTC
Permalink
+1
Post by Paul Schlie
It also wouldn't be a bad idea for ZFS to verify that drives designated as
hot spares in fact have sufficient capacity to be compatible replacements
for particular configurations, prior to actually being critically required
(since drives otherwise appearing to have equivalent capacity may not, and
that wouldn't be a nice thing to first discover upon attempted replacement
of a failed drive).