Discussion:
Limit ZFS Memory Utilization
Jason J. W. Williams
2007-01-08 06:01:08 UTC
Permalink
Hello,

Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where ZFS is sucking all the RAM out of the box,
and we think it's crashing MySQL as a result. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.

Best Regards,
Jason
Sanjeev Bagewadi
2007-01-08 06:41:38 UTC
Permalink
Jason,

There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure on the system.
However, in your case it is probably not adapting quickly enough.

One way of limiting the memory consumption would be to limit arc.c_max.
This (arc.c_max) is set to 3/4 of the available memory (or 1GB less than
the available memory).
This is done when the ZFS module is loaded (arc_init()).

You should be able to change the value of arc.c_max through mdb and set
it to the value you want. Exercise caution while setting it. Make sure
you don't have active zpools during this operation.

Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Jason J. W. Williams
2007-01-08 16:32:32 UTC
Permalink
Hi Sanjeev,

Thank you very much! I'm not very familiar with using mdb. Is there
anything to be aware of besides no active zpools?

Also, which takes precedence, 3/4 of the memory or 1GB less than memory?
Thank you in advance! Your help is greatly appreciated.

Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Sanjeev Bagewadi
2007-01-09 06:56:28 UTC
Permalink
Jason,
Post by Jason J. W. Williams
Thank you very much! I'm not very familiar with using mdb. Is there
anything to be aware of besides no active zpools?
you can do the following as root user:
-- snip --
# mdb -kw
> arc::print -a "struct arc" c_max
ffffffffc009a538 c_max = 0x2f9aa800
> ffffffffc009a538/W 0x20000000
arc+0x48: 0x2f9aa800 = 0x20000000
-- snip --
Here I have modified the value of c_max from 0x2f9aa800 to 0x20000000.
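One related caveat (an aside, not part of the procedure above): if the ARC's
current target size, arc.c, is already above the new ceiling, it may also need
to be lowered in the same session, since c_max only bounds future growth. The
same print-then-write pattern should apply:
-- snip --
> arc::print -a "struct arc" c
> <address printed above>/W 0x20000000
-- snip --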
Post by Jason J. W. Williams
Also, which takes precedence 3/4 of the memory or 1GB? Thank you in
advance! Your help is greatly appreciated.
Whichever is higher. Look at the routine arc_init() [Line 2544] at:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c

And I think your request has been answered :-)
Looking at the source, I see that they have introduced two new variables,
zfs_arc_max and zfs_arc_min, which seem to be tunables!

There is a detailed explanation at:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6505658

Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Jason J. W. Williams
2007-01-09 21:28:12 UTC
Permalink
Hi Sanjeev,

Thank you! I was not able to find anything as useful on the subject as
that! We are running build 54 on an X4500; would I be correct in reading
that article to mean that if I put "set zfs:zfs_arc_max =
0x100000000 #4GB" in my /etc/system, ZFS will consume no more than
4GB? Thank you in advance.

Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
Post by Jason J. W. Williams
Thank you very much! I'm not very familiar with using mdb. Is there
anything to be aware of besides no active zpools?
-- snip --
# mdb -kw
> arc::print -a "struct arc" c_max
ffffffffc009a538 c_max = 0x2f9aa800
> ffffffffc009a538/W 0x20000000
arc+0x48: 0x2f9aa800 = 0x20000000
-- snip --
Here I have modified the value of c_max from 0x2f9aa800 to 0x20000000.
Post by Jason J. W. Williams
Also, which takes precedence 3/4 of the memory or 1GB? Thank you in
advance! Your help is greatly appreciated.
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c
And I think your request has been answered :-)
Looking at the source I see that they have introduced two new variables
zfs_arc_max and zfs_arc_min which seem to be tunables !
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6505658
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Robert Milkowski
2007-01-10 10:41:53 UTC
Permalink
Hello Jason,

Tuesday, January 9, 2007, 10:28:12 PM, you wrote:

JJWW> Hi Sanjeev,

JJWW> Thank you! I was not able to find anything as useful on the subject as
JJWW> that! We are running build 54 on an X4500, would I be correct in my
JJWW> reading of that article that if I put "set zfs:zfs_arc_max =
JJWW> 0x100000000 #4GB" in my /etc/system, ZFS will consume no more than
JJWW> 4GB? Thank you in advance.

That's the idea; however, it's not working that way right now - under some
circumstances ZFS can still consume much more memory - see other recent
posts here.
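For reference, the /etc/system form being discussed would look roughly like
this (a sketch: the 4GB value is just the one Jason proposed, comment lines
in /etc/system traditionally start with '*' rather than a trailing '#', and
the setting only takes effect after a reboot):
-- snip --
* cap the ZFS ARC at 4GB (0x100000000 bytes)
set zfs:zfs_arc_max = 0x100000000
-- snip --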
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Sanjeev Bagewadi
2007-01-10 12:03:13 UTC
Permalink
Jason,

Robert is right...

The point is that the ARC is the caching module of ZFS, and the majority of
the memory is consumed through the ARC.
Hence, by limiting the ARC's c_max we limit the amount the ARC consumes.

However, other ZFS modules will also consume memory, but that may not be as
significant as the ARC.

Experts, please correct me if I am wrong here.

Thanks and regards,
Sanjeev.
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Sanjeev,
JJWW> Thank you! I was not able to find anything as useful on the subject as
JJWW> that! We are running build 54 on an X4500, would I be correct in my
JJWW> reading of that article that if I put "set zfs:zfs_arc_max =
JJWW> 0x100000000 #4GB" in my /etc/system, ZFS will consume no more than
JJWW> 4GB? Thank you in advance.
That's the idea however it's not working that way now - under some
circumstances ZFS could still consume much more memory - see other
posts lately here.
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Jason J. W. Williams
2007-01-10 20:45:05 UTC
Permalink
Sanjeev & Robert,

Thanks guys. We put that in place last night and it seems to be doing
a much better job of consuming less RAM. We set it to 4GB and each of
our 2 MySQL instances on the box to a max of 4GB, so hopefully a slush
of 4GB on the Thumper is enough. I would be interested in what the
other ZFS modules' memory behaviors are. I'll take a perusal through
the archives. In general, it seems to me that a max cap for ZFS, whether
set through a series of individual tunables or a single root tunable,
would be very helpful.

Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
Robert is right...
The point is ARC is the caching module of ZFS and majority of the memory
is consumed through ARC.
Hence by limiting the c_max of ARC we are limiting the amount ARC consumes.
However, other modules of ZFS would consume more but that may not be as
significant as ARC.
Expert, please correct me if I am wrong here.
Thanks and regards,
Sanjeev.
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Sanjeev,
JJWW> Thank you! I was not able to find anything as useful on the subject as
JJWW> that! We are running build 54 on an X4500, would I be correct in my
JJWW> reading of that article that if I put "set zfs:zfs_arc_max =
JJWW> 0x100000000 #4GB" in my /etc/system, ZFS will consume no more than
JJWW> 4GB? Thank you in advance.
That's the idea however it's not working that way now - under some
circumstances ZFS could still consume much more memory - see other
posts lately here.
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Jason J. W. Williams
2007-01-10 21:23:56 UTC
Permalink
Hi Guys,

After reading through the discussion on this regarding ZFS memory
fragmentation on snv_53 (and forward) and going through our
::kmastat, it looks like ZFS is sucking down about 544 MB of RAM in the
various caches. About 360 MB of that is in the zio_buf_65536 cache.
Next most notable is 55 MB in zio_buf_32768, and 36 MB in zio_buf_16384.
I don't think that's too bad, but it is worth keeping track of. At this
point our kernel memory growth seems to have slowed, with it hovering
around 5GB, and the anon column is mostly what's growing now (as
expected...MySQL).
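For reference, a minimal sketch of pulling those per-cache numbers with the
dcmds already mentioned in this thread:
-- snip --
# echo "::kmastat" | mdb -k | grep zio_buf
# echo "::memstat" | mdb -k
-- snip --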

Most of the problem in the discussion thread on this seemed to be
related to a lot of DNLC entries due to the workload of a file server.
How would this affect a database server with operations in only a
couple of very large files? Thank you in advance.

Best Regards,
Jason
Post by Jason J. W. Williams
Sanjeev & Robert,
Thanks guys. We put that in place last night and it seems to be doing
a lot better job of consuming less RAM. We set it to 4GB and each of
our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
of 4GB on the Thumper is enough. I would be interested in what the
other ZFS modules memory behaviors are. I'll take a perusal through
the archives. In general it seems to me that a max cap for ZFS whether
set through a series of individual tunables or a single root tunable
would be very helpful.
Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
Robert is right...
The point is ARC is the caching module of ZFS and majority of the memory
is consumed through ARC.
Hence by limiting the c_max of ARC we are limiting the amount ARC consumes.
However, other modules of ZFS would consume more but that may not be as
significant as ARC.
Expert, please correct me if I am wrong here.
Thanks and regards,
Sanjeev.
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Sanjeev,
JJWW> Thank you! I was not able to find anything as useful on the subject as
JJWW> that! We are running build 54 on an X4500, would I be correct in my
JJWW> reading of that article that if I put "set zfs:zfs_arc_max =
JJWW> 0x100000000 #4GB" in my /etc/system, ZFS will consume no more than
JJWW> 4GB? Thank you in advance.
That's the idea however it's not working that way now - under some
circumstances ZFS could still consume much more memory - see other
posts lately here.
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Robert Milkowski
2007-01-10 23:21:44 UTC
Permalink
Hello Jason,

Wednesday, January 10, 2007, 9:45:05 PM, you wrote:

JJWW> Sanjeev & Robert,

JJWW> Thanks guys. We put that in place last night and it seems to be doing
JJWW> a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW> our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW> of 4GB on the Thumper is enough. I would be interested in what the
JJWW> other ZFS modules memory behaviors are. I'll take a perusal through
JJWW> the archives. In general it seems to me that a max cap for ZFS whether
JJWW> set through a series of individual tunables or a single root tunable
JJWW> would be very helpful.

Yes it would. Better yet would be if the memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) behaved similarly to the page cache
with UFS, so applications would be able to get back almost all of the
memory used for ZFS caches if needed.

I guess (and it's really a guess, only based on some emails here) that
in a worst-case scenario ZFS caches would consume about:

arc_max + 3*arc_max + memory lost to fragmentation

So I guess with arc_max set to 1GB you can lose even 5GB (or more), and
currently only that first 1GB can be reclaimed automatically.
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Jason J. W. Williams
2007-01-10 23:36:46 UTC
Permalink
Hi Robert,

Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?

That kind of memory loss makes ZFS almost unusable for a database system.

I agree that a page cache similar to UFS would be much better. Linux
works similarly, freeing pages, and it has been effective enough in the
past. Though I'm equally unhappy about Linux's tendency to grab every
bit of free RAM available for filesystem caching, and then cause
massive memory thrashing as it frees it for applications.

Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Sanjeev & Robert,
JJWW> Thanks guys. We put that in place last night and it seems to be doing
JJWW> a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW> our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW> of 4GB on the Thumper is enough. I would be interested in what the
JJWW> other ZFS modules memory behaviors are. I'll take a perusal through
JJWW> the archives. In general it seems to me that a max cap for ZFS whether
JJWW> set through a series of individual tunables or a single root tunable
JJWW> would be very helpful.
Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.
I guess (and it's really a guess only based on some emails here) that
arc_max + 3*arc_max + memory lost for fragmentation
So I guess with arc_max set to 1GB you can lost even 5GB (or more) and
currently only that first 1GB can be get back automatically.
--
Best regards,
http://milek.blogspot.com
Robert Milkowski
2007-01-10 23:57:43 UTC
Permalink
Hello Jason,

Thursday, January 11, 2007, 12:36:46 AM, you wrote:

JJWW> Hi Robert,

JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a
JJWW> calculation my 4GB arc_max setting is still in the danger zone on a
JJWW> Thumper. I wonder if any of the ZFS developers could shed some light
JJWW> on the calculation?

JJWW> That kind of memory loss makes ZFS almost unusable for a database system.


If you leave ncsize at its default value then I believe it won't consume
that much memory.


JJWW> I agree that a page cache similar to UFS would be much better. Linux
JJWW> works similarly to free pages, and it has been effective enough in the
JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every
JJWW> bit of free RAM available for filesystem caching, and then cause
JJWW> massive memory thrashing as it frees it for applications.

A page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately, from time to time ZFS makes servers
page enormously :(
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Jason J. W. Williams
2007-01-11 00:10:10 UTC
Permalink
Hi Robert,

We've got the default ncsize. I didn't see any advantage to increasing
it outside of NFS serving...which this server is not. For speed, the
X4500 is showing itself to be a killer MySQL platform. Between the blazing
fast procs and the sheer number of spindles, its performance is
tremendous. If MySQL Cluster had full disk-based support, scale-out
with X4500s a la Greenplum would be a terrific solution.

At this point, the ZFS memory gobbling is the main roadblock to its being
a good database platform.

Regarding the paging activity, we too saw tremendous paging, with up to
24% of the X4500's CPU being used for it with the default arc_max.
After changing it to 4GB, we haven't seen anything much over 5-10%.
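For anyone wanting to watch the same thing, the standard tools suffice
(nothing ZFS-specific): vmstat's sr column is the page-scan rate, and
vmstat -p breaks paging out by type (the api/apo columns are anonymous
page-ins/outs, i.e. application memory being paged):
-- snip --
# vmstat 5
# vmstat -p 5
-- snip --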

Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Robert,
JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a
JJWW> calculation my 4GB arc_max setting is still in the danger zone on a
JJWW> Thumper. I wonder if any of the ZFS developers could shed some light
JJWW> on the calculation?
JJWW> That kind of memory loss makes ZFS almost unusable for a database system.
If you leave ncsize with default value then I belive it won't consume
that much memory.
JJWW> I agree that a page cache similar to UFS would be much better. Linux
JJWW> works similarly to free pages, and it has been effective enough in the
JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every
JJWW> bit of free RAM available for filesystem caching, and then cause
JJWW> massive memory thrashing as it frees it for applications.
Page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately from time to time ZFS makes servers
to page enormously :(
--
Best regards,
http://milek.blogspot.com
Robert Milkowski
2007-01-11 00:27:11 UTC
Permalink
Hello Jason,

Thursday, January 11, 2007, 1:10:10 AM, you wrote:

JJWW> Hi Robert,

JJWW> We've got the default ncsize. I didn't see any advantage to increasing
JJWW> it outside of NFS serving...which this server is not. For speed the
JJWW> X4500 is showing to be a killer MySQL platform. Between the blazing
JJWW> fast procs and the sheer number of spindles, its perfromance is

Have you got any numbers you can share?
--
Best regards,
Robert mailto:***@task.gda.pl
http://milek.blogspot.com
Erblichs
2007-01-11 00:39:38 UTC
Permalink
Hey guys,

Due to loooong URL lookups, the DNLC was pushed to variable-
sized entries. The hit rate was dropping because of
"name too long" misses. This was done long ago, while I
was at Sun, under a bug reported by me.

I don't know your usage, but you should attempt to
estimate the amount of memory used with the default size.

Yes, this is after you start tracking your DNLC hit rate
and making sure it doesn't significantly drop if ncsize
is decreased. You may also wish to increase the size and
again check the hit rate. Yes, it is possible that your
access is random enough that no changes will affect the
hit rate.
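A minimal sketch of tracking that hit rate with stock tools (the kstat below
is the standard Solaris DNLC counter set):
-- snip --
# vmstat -s | grep 'name lookups'
# kstat -p unix:0:dnlcstats | egrep 'hits|misses'
-- snip --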

2nd item: Bonwick's memory allocators, I think, still have the
ability to limit the size of each slab. The issue is that
some parts of the code expect allocations not to fail, using
KM_SLEEP. Limiting the caches can result in extended sleeps, but
it can be done.

If your company makes changes to your local source
and then rebuilds, it is possible to pre-allocate a
fixed number of objects per cache and then use KM_NOSLEEP,
with return values that indicate whether to retry or fail.

3rd, and this could be the most important: the kmem cache
allocators are lazy about freeing memory when it is not
needed by anyone else. Thus, unfreed memory is effectively
used as a cache to remove the latencies of on-demand
memory allocations. This artificially keeps memory
usage high, but should have minimal latency to re-allocate
when necessary.

Also, it is possible to make mods to increase the level
of memory garbage collection, once some watermark code
is added to minimize repeated allocs and frees.


Mitchell Erblich
----------------
Post by Jason J. W. Williams
Hi Robert,
We've got the default ncsize. I didn't see any advantage to increasing
it outside of NFS serving...which this server is not. For speed the
X4500 is showing to be a killer MySQL platform. Between the blazing
fast procs and the sheer number of spindles, its perfromance is
tremendous. If MySQL cluster had full disk-based support, scale-out
with X4500s a-la Greenplum would be terrific solution.
At this point, the ZFS memory gobbling is the main roadblock to being
a good database platform.
Regarding the paging activity, we too saw tremendous paging of up to
24% of the X4500s CPU being used for that with the default arc_max.
After changing it to 4GB, we haven't seen anything much over 5-10%.
Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Robert,
JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a
JJWW> calculation my 4GB arc_max setting is still in the danger zone on a
JJWW> Thumper. I wonder if any of the ZFS developers could shed some light
JJWW> on the calculation?
JJWW> That kind of memory loss makes ZFS almost unusable for a database system.
If you leave ncsize with default value then I belive it won't consume
that much memory.
JJWW> I agree that a page cache similar to UFS would be much better. Linux
JJWW> works similarly to free pages, and it has been effective enough in the
JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every
JJWW> bit of free RAM available for filesystem caching, and then cause
JJWW> massive memory thrashing as it frees it for applications.
Page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately from time to time ZFS makes servers
to page enormously :(
--
Best regards,
http://milek.blogspot.com
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Sanjeev Bagewadi
2007-01-11 09:47:48 UTC
Permalink
Jason,
Post by Jason J. W. Williams
Hi Robert,
We've got the default ncsize. I didn't see any advantage to increasing
it outside of NFS serving...which this server is not. For speed the
X4500 is showing to be a killer MySQL platform. Between the blazing
fast procs and the sheer number of spindles, its perfromance is
tremendous. If MySQL cluster had full disk-based support, scale-out
with X4500s a-la Greenplum would be terrific solution.
At this point, the ZFS memory gobbling is the main roadblock to being
a good database platform.
Regarding the paging activity, we too saw tremendous paging of up to
24% of the X4500s CPU being used for that with the default arc_max.
After changing it to 4GB, we haven't seen anything much over 5-10%.
Remember that ZFS does not use the standard Solaris paging architecture
for caching. Instead it uses the ARC for all its caching, and that is the
reason tuning the ARC should help in your case.

The zio_bufs that you referred to in the previous mail are the caches used
by the ARC for caching various things (including both metadata and data).

Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Hi Robert,
JJWW> Thank you! Holy mackerel! That's a lot of memory. With that type of a
JJWW> calculation my 4GB arc_max setting is still in the danger zone on a
JJWW> Thumper. I wonder if any of the ZFS developers could shed some light
JJWW> on the calculation?
JJWW> That kind of memory loss makes ZFS almost unusable for a database system.
If you leave ncsize with default value then I belive it won't consume
that much memory.
JJWW> I agree that a page cache similar to UFS would be much better. Linux
JJWW> works similarly to free pages, and it has been effective enough in the
JJWW> past. Though I'm equally unhappy about Linux's tendency to grab every
JJWW> bit of free RAM available for filesystem caching, and then cause
JJWW> massive memory thrashing as it frees it for applications.
Page cache won't be better - just better memory control for ZFS caches
is strongly desired. Unfortunately from time to time ZFS makes servers
to page enormously :(
--
Best regards,
http://milek.blogspot.com
Mark Maybee
2007-01-11 00:37:36 UTC
Permalink
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. It's not quite true that this data is not evictable;
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
do a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".

We are in the process of trying to improve this situation.
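A rough sketch of watching this split on a live system (the arc field names
are the ones used in arc.c of that era, and the kmem cache names may vary
slightly by build, so verify with arc::print "struct arc" and ::kmastat
first):
-- snip --
# echo 'arc::print -d "struct arc" size c c_max' | mdb -k
# echo "::kmastat" | mdb -k | egrep 'dnode_t|znode|dmu_buf'
-- snip --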
Post by Jason J. W. Williams
That kind of memory loss makes ZFS almost unusable for a database system.
Note that you are not going to experience these sorts of overheads
unless you are accessing *many* files. In a database system, there are
only going to be a few files => no significant overhead.
Post by Jason J. W. Williams
I agree that a page cache similar to UFS would be much better. Linux
works similarly to free pages, and it has been effective enough in the
past. Though I'm equally unhappy about Linux's tendency to grab every
bit of free RAM available for filesystem caching, and then cause
massive memory thrashing as it frees it for applications.
The page cache is "much better" in the respect that it is more tightly
integrated with the VM system, so you get more efficient response to
memory pressure. It is *much worse* than the ARC at caching data for
a file system. In the long-term we plan to integrate the ARC into the
Solaris VM system.
Post by Jason J. W. Williams
Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Sanjeev & Robert,
JJWW> Thanks guys. We put that in place last night and it seems to be doing
JJWW> a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW> our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW> of 4GB on the Thumper is enough. I would be interested in what the
JJWW> other ZFS modules memory behaviors are. I'll take a perusal through
JJWW> the archives. In general it seems to me that a max cap for ZFS whether
JJWW> set through a series of individual tunables or a single root tunable
JJWW> would be very helpful.
Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.
I guess (and it's really a guess only based on some emails here) that
arc_max + 3*arc_max + memory lost for fragmentation
So I guess with arc_max set to 1GB you can lost even 5GB (or more) and
currently only that first 1GB can be get back automatically.
--
Best regards,
http://milek.blogspot.com
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Jason J. W. Williams
2007-01-11 02:37:13 UTC
Permalink
Hi Mark,

Thank you. That makes a lot of sense. In our case we're talking about
around 10 multi-gigabyte files. The arc_max + 3*arc_max + fragmentation
figure was a bit worrisome. It sounds, then, like this is mostly an issue
on something like an NFS server with a ton of small files, where the
minimum_file_node_overhead * files was consuming the arc_max*3?

On a side note, it appears that most of our zio caches stay pretty
static. 99% of the ::memstat kernel memory increases are in
zio_buf_65536. It seems to increase by between 5-50MB/hr depending on the
database update load.

Is integrating the ARC into the Solaris VM system a Solaris Nevada
goal? Or would that be the next major release after Nevada?

Best Regards,
Jason
Post by Mark Maybee
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. Its not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
to a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".
We are in the process of trying to improve this situation.
Post by Jason J. W. Williams
That kind of memory loss makes ZFS almost unusable for a database system.
Note that you are not going to experience these sorts of overheads
unless you are accessing *many* files. In a database system, there are
only going to be a few files => no significant overhead.
Post by Jason J. W. Williams
I agree that a page cache similar to UFS would be much better. Linux
works similarly to free pages, and it has been effective enough in the
past. Though I'm equally unhappy about Linux's tendency to grab every
bit of free RAM available for filesystem caching, and then cause
massive memory thrashing as it frees it for applications.
The page cache is "much better" in the respect that it is more tightly
integrated with the VM system, so you get more efficient response to
memory pressure. It is *much worse* than the ARC at caching data for
a file system. In the long-term we plan to integrate the ARC into the
Solaris VM system.
Post by Jason J. W. Williams
Best Regards,
Jason
Post by Robert Milkowski
Hello Jason,
JJWW> Sanjeev & Robert,
JJWW> Thanks guys. We put that in place last night and it seems to be doing
JJWW> a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW> our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW> of 4GB on the Thumper is enough. I would be interested in what the
JJWW> other ZFS modules memory behaviors are. I'll take a perusal through
JJWW> the archives. In general it seems to me that a max cap for ZFS whether
JJWW> set through a series of individual tunables or a single root tunable
JJWW> would be very helpful.
Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.
I guess (and it's really a guess only based on some emails here) that
arc_max + 3*arc_max + memory lost for fragmentation
So I guess with arc_max set to 1GB you can lost even 5GB (or more) and
currently only that first 1GB can be get back automatically.
--
Best regards,
http://milek.blogspot.com
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Al Hopper
2007-01-11 02:55:23 UTC
Permalink
Post by Mark Maybee
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. Its not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
to a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".
We are in the process of trying to improve this situation.
.... snip .....

Understood (and many Thanks). In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given that the machine has
n GB of RAM and is used in the following broad workload scenarios:

a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem

It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??

I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into "swap hell".

PS: Interestingly, no one has mentioned the tunable maxpgio. I've
often found that increasing maxpgio is the only way to improve the odds of
a machine remaining usable when lots of swapping is taking place.

Regards,

Al Hopper Logical Approach Inc, Plano, TX. ***@logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
Jason J. W. Williams
2007-01-11 05:28:26 UTC
Permalink
Hello all,

I second Al's motion. Even a little script a-la the CoolTools for
tuning Solaris for the T2000 would be great.

-J
Post by Al Hopper
Post by Mark Maybee
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. Its not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
to a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".
We are in the process of trying to improve this situation.
.... snip .....
Understood (and many Thanks). In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given the that machine has
a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem
It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??
I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into "swap hell".
PS: Interesting is that no one has mentioned (the tunable) maxpgio. I've
often found that increasing maxpgio is the only way to improve the odds of
a machine remaining usable when lots of swapping is taking place.
Regards,
Voice: 972.379.2133 Fax: 972.379.2134 Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Mark Maybee
2007-01-11 15:38:13 UTC
Permalink
Post by Al Hopper
Post by Mark Maybee
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. Its not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
to a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".
We are in the process of trying to improve this situation.
.... snip .....
Understood (and many Thanks). In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given the that machine has
a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem
It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??
I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into "swap hell".
Ideally, there would be no need for any tunables; ZFS would always "do
the right thing". This is our grail. In the meantime, I can give some
recommendations, but there is no "rule of thumb" that is going to work
in all circumstances.

ncsize: As I have mentioned previously, there are overheads
associated with caching "vnode data" in ZFS. While
the physical on-disk data for a znode is only 512bytes,
the related in-core cost is significantly higher.
Roughly, you can expect that each ZFS vnode held in
the DNLC will cost about 3K of kernel memory.

So, you need to set ncsize appropriately for how much
memory you are willing to devote to it. 500,000 entries
is going to cost you 1.5GB of memory.

zfs_arc_max: This is the maximum amount of memory you want the
ARC to be able to use. Note that the ARC won't
necessarily use this much memory: if other applications
need memory, the ARC will shrink to accommodate.
Although, also note that the ARC *can't* shrink if all
of its memory is held. For example, data in the DNLC
cannot be evicted from the ARC, so this data must first
be evicted from the DNLC before the ARC can free up
space (this is why it is dangerous to turn off the ARC's
ability to evict vnodes from the DNLC).

Also keep in mind that the ARC size does not account for
many in-core data structures used by ZFS (znodes/dnodes/
dbufs/etc). Roughly, for every 1MB of cached file
pointers, you can expect another 3MB of memory used
outside of the ARC. So, in the example above, where
ncsize is 500,000, the ARC is only seeing about 400MB
of the 1.5GB consumed. As I have stated previously,
we consider this a bug in the current ARC accounting
that we will soon fix. This is only an issue in
environments where many files are being accessed. If
the number of files accessed is relatively low, then
the ARC size will be much closer to the actual memory
consumed by ZFS.

So, in general, you should not really need to "tune"
zfs_arc_max. However, in environments where you have
specific applications that consume known quantities of
memory (e.g. database), it will likely help to set the
ARC max size down, to guarantee that the necessary
kernel memory is available. There may be other times
when it will be beneficial to explicitly set the ARCs
maximum size, but at this time I can't offer any general
"rule of thumb".

I hope that helps.

-Mark
Tomas Ögren
2007-01-11 15:58:31 UTC
Permalink
Post by Mark Maybee
Post by Al Hopper
It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??
I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into "swap hell".
Ideally, there would be no need for any tunables; ZFS would always "do
the right thing". This is our grail. In the meantime, I can give some
recommendations, but there is no "rule of thumb" that is going to work
in all circumstances.
ncsize: As I have mentioned previously, there are overheads
associated with caching "vnode data" in ZFS. While
the physical on-disk data for a znode is only 512bytes,
the related in-core cost is significantly higher.
Roughly, you can expect that each ZFS vnode held in
the DNLC will cost about 3K of kernel memory.
So, you need to set ncsize appropriately for how much
memory you are willing to devote to it. 500,000 entries
is going to cost you 1.5GB of memory.
Due to fragmentation, 200k entries can eat over 1.5GB memory too.

[graph of DNLC-related buffer memory usage; image link not preserved in the archive]

This is only dnlc-related buffers on a 2GB machine.. the spike at the
end caused the machine to more or less stand still.
Post by Mark Maybee
zfs_arc_max: This is the maximum amount of memory you want the
ARC to be able to use. Note that the ARC won't
necessarily use this much memory: if other applications
need memory, the ARC will shrink to accommodate.
Although, also note that the ARC *can't* shrink if all
of its memory is held. For example, data in the DNLC
cannot be evicted from the ARC, so this data must first
be evicted from the DNLC before the ARC can free up
space (this is why it is dangerous to turn off the ARCs
ability to evict vnodes from the DNLC).
I've tried that.. didn't work out too great due to fragmentation.. Left
non-kernel with like 4MB to play with..

/Tomas
--
Tomas Ögren, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Jason J. W. Williams
2007-01-11 21:18:21 UTC
Permalink
Hi Mark,

That does help tremendously. How does ZFS decide which zio cache to
use? I apologize if this has already been addressed somewhere.

Best Regards,
Jason
Post by Mark Maybee
Post by Al Hopper
Post by Mark Maybee
Post by Jason J. W. Williams
Hi Robert,
Thank you! Holy mackerel! That's a lot of memory. With that type of a
calculation my 4GB arc_max setting is still in the danger zone on a
Thumper. I wonder if any of the ZFS developers could shed some light
on the calculation?
In a worst-case scenario, Robert's calculations are accurate to a
certain degree: If you have 1GB of dnode_phys data in your arc cache
(that would be about 1,200,000 files referenced), then this will result
in another 3GB of "related" data held in memory: vnodes/znodes/
dnodes/etc. This related data is the in-core data associated with
an accessed file. Its not quite true that this data is not evictable,
it *is* evictable, but the space is returned from these kmem caches
only after the arc has cleared its blocks and triggered the "free" of
the related data structures (and even then, the kernel will need to
to a kmem_reap to reclaim the memory from the caches). The
fragmentation that Robert mentions is an issue because, if we don't
free everything, the kmem_reap may not be able to reclaim all the
memory from these caches, as they are allocated in "slabs".
We are in the process of trying to improve this situation.
.... snip .....
Understood (and many Thanks). In the meantime, is there a rule-of-thumb
that you could share that would allow mere humans (like me) to calculate
the best values of zfs:zfs_arc_max and ncsize, given the that machine has
a) a busy NFS server
b) a general multiuser development server
c) a database server
d) an Apache/Tomcat/FTP server
e) a single user Gnome desktop running U3 with home dirs on a ZFS
filesystem
It would seem, from reading between the lines of previous emails,
particularly the ones you've (Mark M) written, that there is a rule of
thumb that would apply given a standard or modified ncsize tunable??
I'm primarily interested in a calculation that would allow settings that
would reduce the possibility of the machine descending into "swap hell".
Ideally, there would be no need for any tunables; ZFS would always "do
the right thing". This is our grail. In the meantime, I can give some
recommendations, but there is no "rule of thumb" that is going to work
in all circumstances.
ncsize: As I have mentioned previously, there are overheads
associated with caching "vnode data" in ZFS. While
the physical on-disk data for a znode is only 512bytes,
the related in-core cost is significantly higher.
Roughly, you can expect that each ZFS vnode held in
the DNLC will cost about 3K of kernel memory.
So, you need to set ncsize appropriately for how much
memory you are willing to devote to it. 500,000 entries
is going to cost you 1.5GB of memory.
zfs_arc_max: This is the maximum amount of memory you want the
ARC to be able to use. Note that the ARC won't
necessarily use this much memory: if other applications
need memory, the ARC will shrink to accommodate.
Although, also note that the ARC *can't* shrink if all
of its memory is held. For example, data in the DNLC
cannot be evicted from the ARC, so this data must first
be evicted from the DNLC before the ARC can free up
space (this is why it is dangerous to turn off the ARCs
ability to evict vnodes from the DNLC).
Also keep in mind that the ARC size does not account for
many in-core data structures used by ZFS (znodes/dnodes/
dbufs/etc). Roughly, for every 1MB of cached file
pointers, you can expect another 3MB of memory used
outside of the ARC. So, in the example above, where
ncsize is 500,000, the ARC is only seeing about 400MB
of the 1.5GB consumed. As I have stated previously,
we consider this a bug in the current ARC accounting
that we will soon fix. This is only an issue in
environments where many files are being accessed. If
the number of files accessed is relatively low, then
the ARC size will be much closer to the actual memory
consumed by ZFS.
So, in general, you should not really need to "tune"
zfs_arc_max. However, in environments where you have
specific applications that consume known quantities of
memory (e.g. database), it will likely help to set the
ARC max size down, to guarantee that the necessary
kernel memory is available. There may be other times
when it will be beneficial to explicitly set the ARCs
maximum size, but at this time I can't offer any general
"rule of thumb".
I hope that helps.
-Mark
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Mark Maybee
2007-01-11 21:36:50 UTC
Permalink
Post by Jason J. W. Williams
Hi Mark,
That does help tremendously. How does ZFS decide which zio cache to
use? I apologize if this has already been addressed somewhere.
The ARC caches data blocks in the zio_buf_xxx() cache that matches
the block size. For example, dnode data is stored on disk in 16K
blocks (32 dnodes/block), so zio_buf_16384() is used for those blocks.
Most file data blocks (in large files) are stored in 128K blocks, so
zio_buf_131072() is used, etc.
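As an aside tying this back to the database case discussed earlier: for
large files the block size is the dataset's recordsize, so the dominant
zio_buf cache will tend to track that property, which is tunable per dataset
(a sketch; "tank/mysql" is a made-up dataset name, and a recordsize change
only affects blocks written afterwards):
-- snip --
# zfs get recordsize tank/mysql
# zfs set recordsize=16k tank/mysql
-- snip --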

-Mark
Richard L. Hamilton
2007-02-07 00:44:39 UTC
Permalink
If I understand correctly, at least some systems claim not to guarantee
consistency between changes to a file via write(2) and changes via mmap(2).
But historically, at least in the case of regular files on local UFS, since Solaris
used the page cache for both cases, the results should have been consistent.

Since zfs uses somewhat different mechanisms, does it still have the same
consistency between write(2) and mmap(2) that was historically present
(whether or not "guaranteed") when using UFS on Solaris?


This message posted from opensolaris.org
Sanjeev Bagewadi
2007-02-07 03:40:51 UTC
Permalink
Richard,
Post by Richard L. Hamilton
If I understand correctly, at least some systems claim not to guarantee
consistency between changes to a file via write(2) and changes via mmap(2).
But historically, at least in the case of regular files on local UFS, since Solaris
used the page cache for both cases, the results should have been consistent.
Since zfs uses somewhat different mechanisms, does it still have the same
consistency between write(2) and mmap(2) that was historically present
(whether or not "guaranteed") when using UFS on Solaris?
Yes, it does have that consistency. There is specific code to keep
the page cache (needed in the case of mmapped files) and the ARC caches
consistent.

Thanks and regards,
Sanjeev.
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
j***@sun.com
2007-01-10 23:45:46 UTC
Permalink
Better yet would be if memory consumed by ZFS for caching (dnodes,
vnodes, data, ...) would behave similar to page cache like with UFS so
applications will be able to get back almost all memory used for ZFS
caches if needed.
I believe that a better response to memory pressure is a long-term goal
for ZFS. There's also an effort in progress to improve the caching
algorithms used in the ARC.

-j
Sanjeev Bagewadi
2007-01-11 09:42:18 UTC
Permalink
Robert,

Comments inline...
Post by Robert Milkowski
Hello Jason,
JJWW> Sanjeev & Robert,
JJWW> Thanks guys. We put that in place last night and it seems to be doing
JJWW> a lot better job of consuming less RAM. We set it to 4GB and each of
JJWW> our 2 MySQL instances on the box to a max of 4GB. So hopefully slush
JJWW> of 4GB on the Thumper is enough. I would be interested in what the
JJWW> other ZFS modules memory behaviors are. I'll take a perusal through
JJWW> the archives. In general it seems to me that a max cap for ZFS whether
JJWW> set through a series of individual tunables or a single root tunable
JJWW> would be very helpful.
Yes it would. Better yet would be if memory consumed by ZFS for
caching (dnodes, vnodes, data, ...) would behave similar to page cache
like with UFS so applications will be able to get back almost all
memory used for ZFS caches if needed.
I guess (and it's really a guess only based on some emails here) that
arc_max + 3*arc_max + memory lost for fragmentation
This is not true from what I know :-) How did you get to this number?

From my knowledge it uses:

c_max + (some memory for other caches)

NOTE: (some memory for other caches) is not as large as c_max. It is
probably just some percentage of it, not multiples of c_max.
Post by Robert Milkowski
So I guess with arc_max set to 1GB you can lost even 5GB (or more) and
currently only that first 1GB can be get back automatically.
This doesn't seem right based on my knowledge of ZFS.

Regards,
Sanjeev.
Jason J. W. Williams
2007-01-08 16:54:50 UTC
Permalink
Sanjeev,

Could you point me in the right direction as to how to convert the
following GCC compile flags to Studio 11 compile flags? Any help is
greatly appreciated. We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why it's
crashing...hopefully it will illuminate whether memory truly is the issue.
Thank you very much in advance!

-felide-constructors
-fno-exceptions -fno-rtti

Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Toby Thain
2007-01-08 18:54:52 UTC
Permalink
...We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why its
crashing...hopefully it will illuminate if memory truly is the issue.
If you're using the Enterprise release, can't you get MySQL's
assistance with this?

--Toby
Jason J. W. Williams
2007-01-08 21:01:00 UTC
Permalink
We're not using the Enterprise release, but we are working with them.
It looks like MySQL is crashing due to lack of memory.

-J
Post by Toby Thain
...We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why its
crashing...hopefully it will illuminate if memory truly is the issue.
If you're using the Enterprise release, can't you get MySQL's
assistance with this?
--Toby
Sanjeev Bagewadi
2007-01-10 03:54:23 UTC
Permalink
Jason,

Apologies.. I missed this mail yesterday...

I am not too familiar with the options. Someone else will have to answer
this.

Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Sanjeev,
Could you point me in the right direction as to how to convert the
following GCC compile flags to Studio 11 compile flags? Any help is
greatly appreciated. We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why its
crashing...hopefully it will illuminate if memory truly is the issue.
Thank you very much in advance!
-felide-constructors
-fno-exceptions -fno-rtti
Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
Matt Ingenthron
2007-01-10 04:53:09 UTC
Permalink
Hi Jason,

Depending on which hardware architecture you're working on, you may be
able to get Studio 11 compiled binaries through the CoolStack project:

http://cooltools.sunsource.net/coolstack/index.html

Regardless, optimized compiler flags for MySQL with Studio 11 are in the
source bundle listed there. If you need additional help, let me know.
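For what it's worth, the closest Studio 11 equivalents appear to be the
-features options of CC (a sketch based on the Sun Studio C++ documentation,
not on this thread; -felide-constructors has no direct switch, since CC
performs that elision on its own):
-- snip --
CC ... -features=no%except -features=no%rtti ...
-- snip --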

Separately, it's my understanding that ZFS reduces its memory usage as
Solaris needs to allocate more memory for applications. I've not seen
this problem, but I suspect it'd be better to try to come up with a
simpler test case that mimics MySQL (i.e. mmap() or malloc through
whichever memory library MySQL is using). I suspect it's mmap or DISM,
but if it's a *alloc problem, you may want to look at the man page for
umem_debug, as that may be able to help you find where the problem is
coming from.
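A concrete sketch of doing that with libumem (the environment variables are
the ones documented in umem_debug(3MALLOC); the mysqld invocation is only
illustrative):
-- snip --
# LD_PRELOAD=libumem.so.1 UMEM_DEBUG=default UMEM_LOGGING=transaction \
      /usr/local/mysql/bin/mysqld_safe &
-- snip --
Any resulting core can then be examined in mdb with the umem dcmds
(e.g. ::umem_status, ::findleaks).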

In fact, CoolStack may be a well-tested, stable build for you to use
alongside ZFS. You can email me directly with any issues you run into
with it and I'll get it into the right group of people.

Hope that helps,

- Matt
Post by Sanjeev Bagewadi
Jason,
Apologies.. I missed out this mail yesterday...
I am not too familiar with the options. Someoen else will have to
answer this.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Sanjeev,
Could you point me in the right direction as to how to convert the
following GCC compile flags to Studio 11 compile flags? Any help is
greatly appreciated. We're trying to recompile MySQL to give a
stacktrace and core file to track down exactly why its
crashing...hopefully it will illuminate if memory truly is the issue.
Thank you very much in advance!
-felide-constructors
-fno-exceptions -fno-rtti
Best Regards,
Jason
Post by Sanjeev Bagewadi
Jason,
There is no documented way of limiting the memory consumption.
The ARC section of ZFS tries to adapt to the memory pressure of the system.
However, in your case probably it is not quick enough I guess.
One way of limiting the memory consumption would be limit the arc.c_max
This (arc.c_max) is set to 3/4 of the memory available (or 1GB less than
memory available).
This is done when the ZFS is loaded (arc_init()).
You should be able to change the value of arc.c_max through mdb and set
it to the value
you want. Exercise caution while setting it. Make sure you don't have
active zpools during this operation.
Thanks and regards,
Sanjeev.
Post by Jason J. W. Williams
Hello,
Is there a way to set a max memory utilization for ZFS? We're trying
to debug an issue where the ZFS is sucking all the RAM out of the box,
and its crashing MySQL as a result we think. Will ZFS reduce its cache
size if it feels memory pressure? Any help is greatly appreciated.
Best Regards,
Jason
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Matt Ingenthron - Web Infrastructure Solutions Architect
Sun Microsystems, Inc. - Client Solutions, Systems Practice
http://blogs.sun.com/mingenthron/
email: ***@sun.com Phone: 310-242-6439