Thread: ZFS on-disk compression

Replies: 10 - Last Post: Dec 20, 2005 3:10 PM by: joerg
Pekka J Enberg
penberg@cs.Helsinki.FI
ZFS on-disk compression
Posted: Dec 16, 2005 6:40 AM

Hi,

The ZFS On-Disk Format Specification [1] isn't very clear on what is
compressed on disk and what's not. Could someone please clarify that? Am I
correct that, for example, the MOS pointed to by ub_rootbp is compressed
with the algorithm defined by the comp field of blkptr_t?

Pekka

1. http://opensolaris.org/os/community/zfs/docs/ondiskformatfinal.pdf
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris dot org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss



eschrock

Posts: 799
From: Menlo Park, CA

Registered: 3/9/05
Re: ZFS on-disk compression
Posted: Dec 16, 2005 8:00 AM   in response to: Pekka J Enberg

On Fri, Dec 16, 2005 at 04:40:27PM +0200, Pekka J Enberg wrote:
> Hi,
>
> The ZFS On-Disk Format Specification [1] isn't very clear on what is
> compressed on disk and what's not. Could someone please clarify that? Am I
> correct that, for example, the MOS pointed to by ub_rootbp is compressed
> with the algorithm defined by the comp field of blkptr_t?

Pekka -

Any block can be compressed or uncompressed as expressed in the relevant
blkptr_t (ub_rootbp, in your example). Which blocks are compressed is
entirely an implementation detail, and doesn't affect the on-disk
specification.

For example, the initial release of ZFS had all metadata compressed.
However, this made failure diagnosis and disaster recovery more
difficult, and was reversed in build 29:

6354299 Disable metadata compression, at least temporarily

However, this could be changed at a future date; it doesn't matter as
far as the on-disk spec goes.

- Eric

--
Eric Schrock, Solaris Kernel Development http://blogs.sun.com/eschrock

jake

Posts: 71
From:

Registered: 6/14/05
Re: ZFS on-disk compression
Posted: Dec 17, 2005 8:56 PM   in response to: eschrock

> Any block can be compressed or uncompressed as expressed in the relevant
> blkptr_t (ub_rootbp, in your example). Which blocks are compressed is
> entirely an implementation detail, and doesn't affect the on-disk
> specification.

Is there a mechanism in place to try to compress part of the block and only store it compressed if a reasonable ratio is achieved? reiser4 implements something similar, although at the file level, I think, which is probably preferable. I know ideally one would create a compressed filesystem for ASCII config files, source trees, etc. and an uncompressed filesystem for mp3s, compressed tarballs, divx video, etc., but in the case of building software from source it's a pain to keep the source trees and compressed tarballs on different filesystems.

billm

Posts: 91
From: Menlo Park, CA

Registered: 3/9/05
Re: Re: ZFS on-disk compression
Posted: Dec 17, 2005 10:07 PM   in response to: jake

On Sat, Dec 17, 2005 at 08:56:36PM -0800, Jake Maciejewski wrote:
> > Any block can be compressed or uncompressed as expressed in the relevant
> > blkptr_t (ub_rootbp, in your example). Which blocks are compressed is
> > entirely an implementation detail, and doesn't affect the on-disk
> > specification.
>
> Is there a mechanism in place to try to compress part of the block and
> only store it compressed if a reasonable ratio is achieved? reiser4
> implements something similar, although at the file level, I think,
> which is probably preferable. I know ideally one would create a
> compressed filesystem for ASCII config files, source trees, etc. and
> an uncompressed filesystem for mp3s, compressed tarballs, divx video,
> etc., but in the case of building software from source it's a pain to
> keep the source trees and compressed tarballs on different
> filesystems.

That is exactly what we do. We only store the block compressed if we
achieve at least 25% savings. If a block does not compress very well,
it gets stored uncompressed. It's an implementation detail, not an
on-disk format thing. Most probably, we will make it a tunable at some
point in the future as we implement other compression algorithms.
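The policy described above can be sketched in a few lines. This is only an illustration, not ZFS code: zlib stands in for whatever algorithm the filesystem uses, and the names `store_block`, `load_block`, and `MIN_SAVINGS` are hypothetical. The 25% figure is the one quoted in this post.

```python
import os
import zlib

# Threshold taken from the post: keep the compressed copy only if it saves
# at least 25% of the block. All names here are hypothetical, not ZFS APIs.
MIN_SAVINGS = 0.25


def store_block(data: bytes, min_savings: float = MIN_SAVINGS):
    """Return (is_compressed, payload) for one block."""
    packed = zlib.compress(data)
    if len(packed) <= len(data) * (1.0 - min_savings):
        return True, packed      # good ratio: store the compressed bytes
    return False, data           # poor ratio: store the block as-is


def load_block(is_compressed: bool, payload: bytes) -> bytes:
    """Undo store_block(); the flag plays the role of the block pointer field."""
    return zlib.decompress(payload) if is_compressed else payload


text_block = b"PATH=/usr/bin:/usr/sbin\n" * 512   # highly repetitive
random_block = os.urandom(len(text_block))        # stands in for an mp3

for name, block in (("text", text_block), ("random", random_block)):
    flag, payload = store_block(block)
    assert load_block(flag, payload) == block
    print(name, "compressed" if flag else "raw", len(block), "->", len(payload))
```

The reader never has to guess: the per-block flag says whether to decompress, which is why a poorly compressing block costs CPU only on the write.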

The Reiser method is totally untenable for larger files. Imagine having
to recompress, say, a large log file every time you appended to it.
That would really suck. ZFS, by comparison, only compresses the blocks
that are written; in this case, the last block. Compressing the file as
a whole also sucks if you want to change compression algorithms on the
fly. Imagine you had a large file, changed the compression algorithm,
then re-wrote one block of it. With ZFS, we would just store that one
block as being compressed with a different algorithm. With a full-file
method, you'd have to re-compress the entire file, which could take
quite some time.
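A toy model may make the per-block argument concrete. It assumes a file is a list of independently compressed blocks, each tagged with its algorithm; the `BlockFile` class and the `CODECS` table are hypothetical, and zlib and bz2 merely stand in for two different algorithms.

```python
import bz2
import zlib

# Hypothetical algorithm table; each entry is (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


class BlockFile:
    """Toy file made of independently compressed blocks."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []          # list of (codec_name, compressed_payload)
        self.compress_calls = 0   # how many blocks we had to compress

    def write_block(self, index, data, codec):
        """Write or overwrite one block; no other block is touched."""
        assert len(data) <= self.block_size
        assert 0 <= index <= len(self.blocks)
        self.compress_calls += 1
        entry = (codec, CODECS[codec][0](data))
        if index == len(self.blocks):
            self.blocks.append(entry)     # append at end of file
        else:
            self.blocks[index] = entry    # rewrite in place

    def read_block(self, index):
        codec, payload = self.blocks[index]
        return CODECS[codec][1](payload)


f = BlockFile()
for i in range(4):
    f.write_block(i, bytes([65 + i]) * 4096, "zlib")

before = list(f.blocks)
f.write_block(2, b"z" * 4096, "bz2")   # algorithm changed, one block rewritten

print("blocks compressed in total:", f.compress_calls)
print("codec per block:", [codec for codec, _ in f.blocks])
```

Rewriting block 2 costs one compression call, and the other three blocks keep their original algorithm and payload.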

There is no need to make separate filesystems. Since the incompressible
files you mentioned are large, write-once kinds of things, you only pay
a small CPU tax on the initial write to find out they are
incompressible. Subsequent reads are fast, since the block pointer
indicates that the blocks are stored uncompressed.


--Bill

jake

Posts: 71
From:

Registered: 6/14/05
Re: Re: ZFS on-disk compression
Posted: Dec 18, 2005 12:09 AM   in response to: billm

Cool. I hoped that was the way compression was implemented, but I didn't see anything in the docs explicitly stating so. Unless I overlooked it, something should probably be added to the man page and admin guide.

Regarding the reiser4 method, I speak with no authority on the issue and compression support hasn't been finalized, so don't be quick to judge. What I meant, though, wasn't that the file would be compressed as a single unit, but rather that if part of the file doesn't compress, chances are none of it's worth compressing, hence avoiding testing each block.

woocky

Posts: 21
From:

Registered: 10/19/05
Re: ZFS on-disk compression
Posted: Dec 19, 2005 7:45 AM   in response to: Pekka J Enberg

Apologies for hijacking the thread, but how do du(1) and ls know the compressed size of files in a directory?

sommerfe

Posts: 975
From: US

Registered: 3/9/05
Re: Re: ZFS on-disk compression
Posted: Dec 19, 2005 10:07 AM   in response to: woocky

On Mon, 2005-12-19 at 10:45, John Smith wrote:
> Apologies for hijacking the thread, but how do du(1) and ls know the compressed size of files in a directory?

The Unix stat(2) call and its variants return two size-relevant values:

    st_size    "file size in bytes"
    st_blocks  "number of 512 byte blocks allocated"

ls -l shows (among other things) st_size; du and ls -s show st_blocks.

It's the second field, st_blocks, which reflects the actual on-disk
footprint of the file rather than the apparent size.

On UFS, st_blocks includes overhead such as indirect blocks; below, a
1 MB file holds 2048 blocks of data but has 2064 blocks allocated:

: 1 %; mkfile 1m f
: 1 %; ls -l f
-rw------- 1 sommerfeld staff 1048576 Dec 19 13:02 f
: 1 %; ls -ls f
2064 -rw------- 1 sommerfeld staff 1048576 Dec 19 13:02 f
: 1 %; bc
1048576/512
2048

ZFS is similar, with the added wrinkle that, because of compression, you
can often store more than 512 bytes of file content in a single 512-byte
block.
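The same two fields can be read from a script. The sketch below assumes a Unix-like system where Python exposes st_blocks and where it counts 512-byte units; the unit is system-dependent, so `BLOCK_UNIT` is an assumption, and the helper name `sizes` is made up for this example.

```python
import os
import tempfile

# Assumption: st_blocks counts 512-byte units, as on Solaris and Linux.
# Other systems may use a different unit.
BLOCK_UNIT = 512


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * BLOCK_UNIT


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "f")
    with open(path, "wb") as fh:
        fh.write(b"\0" * (1024 * 1024))   # same size as "mkfile 1m f"
    apparent, allocated = sizes(path)

print("st_size (what ls -l shows):", apparent)
print("st_blocks (what du shows) :", allocated // BLOCK_UNIT)
```

On a filesystem with compression enabled, the second number can be far smaller than st_size / 512; on UFS it comes out slightly larger, as in the transcript above.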

- Bill




joerg

Posts: 3,783
From: DE

Registered: 4/27/05
Re: Re: ZFS on-disk compression
Posted: Dec 20, 2005 3:05 PM   in response to: sommerfe

Bill Sommerfeld <sommerfeld at sun dot com> wrote:

> On Mon, 2005-12-19 at 10:45, John Smith wrote:
> > Apologies for hijacking the thread, but how do du(1) and ls know the compressed size of files in a directory?
>
> The unix stat(2) call and its variants return two size-relevant values:
> st_size "file size in bytes"
> st_blocks "number of 512 byte blocks allocated".
>
> ls -l shows (among other things) st_size; du and ls -s show st_blocks

I should make an important note:

Although this is 100% correct, it will currently fool 'star -diff',
as I forgot to use the same (correct) sparse check in diff mode as
in create mode.

It will definitely fool GNU tar in any operation mode.

Jörg

--
EMail:joerg at schily dot isdn dot cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js@cs.tu-berlin.de (uni)
schilling at fokus dot fraunhofer dot de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily

Matthew Ahrens
ahrens@sun.com
Re: Re: ZFS on-disk compression
Posted: Dec 19, 2005 4:28 PM   in response to: woocky

On Mon, Dec 19, 2005 at 07:45:52AM -0800, John Smith wrote:
> Apologies for hijacking the thread, but how do du(1) and ls know the
> compressed size of files in a directory?

They look at the st_blocks field of the stat structure, as returned
by the stat(2) system call.

    st_blocks    The total number of physical blocks of size
                 512 bytes actually allocated on disk. This
                 field is not defined for block special or
                 character special files.

You may find it interesting that the compression ratio (as reported by
'zfs get ratio') is not calculated by comparing st_size ("the address of
the end of the file") to st_blocks. That would inflate the compression
ratio for sparse files. Rather, ZFS internally tracks the compressed
and uncompressed size of each block, and the sums of each of these for
each filesystem. One additional bit of trickiness is that any
compression applied to metadata is not counted, since metadata
compression is an implementation detail and not controlled by the
'compression' property.

For details, see the code in dsl_dataset_block_born() and
dsl_dataset_block_kill(), which use the macros BP_GET_PSIZE() and
BP_GET_UCSIZE() to determine the compressed and uncompressed sizes,
respectively. BP_GET_UCSIZE() is where we determine if it is metadata,
in which case we report the compressed size.
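To see why the two calculations diverge for a sparse file, here is a small model of the accounting. It is a sketch of the idea only: the `Block` type and the function names are hypothetical and are not taken from the ZFS source, and the block sizes are invented numbers.

```python
from dataclasses import dataclass


@dataclass
class Block:
    lsize: int                 # logical (uncompressed) size in bytes
    psize: int                 # physical (on-disk) size in bytes
    is_metadata: bool = False


def ucsize(b: Block) -> int:
    # Metadata is counted at its compressed size, so metadata compression
    # never shows up in the ratio (as described in the post).
    return b.psize if b.is_metadata else b.lsize


def tracked_ratio(blocks) -> float:
    """Ratio from per-block sums of uncompressed and compressed sizes."""
    return sum(ucsize(b) for b in blocks) / sum(b.psize for b in blocks)


def naive_ratio(end_of_file: int, blocks) -> float:
    """Ratio from file length versus allocated space; holes inflate it."""
    return end_of_file / sum(b.psize for b in blocks)


# A sparse file: 1 GiB long, but only two 128 KiB data blocks were written,
# each compressing 2:1, plus one metadata block stored compressed.
data = [Block(131072, 65536), Block(131072, 65536)]
meta = Block(16384, 4096, is_metadata=True)
blocks = data + [meta]
end_of_file = 1 << 30

print("tracked ratio:", round(tracked_ratio(blocks), 2))
print("naive ratio  :", round(naive_ratio(end_of_file, blocks), 2))
```

The tracked figure stays close to the real 2:1 ratio of the data blocks, while the naive figure is dominated by the hole in the file.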

--matt

joerg

Posts: 3,783
From: DE

Registered: 4/27/05
Re: Re: ZFS on-disk compression
Posted: Dec 20, 2005 3:10 PM   in response to: Matthew Ahrens

Matthew Ahrens <ahrens at sun dot com> wrote:

> On Mon, Dec 19, 2005 at 07:45:52AM -0800, John Smith wrote:
> > Apologies for hijacking the thread, but how do du(1) and ls know the
> > compressed size of files in a directory?
>
> They look at the st_blocks field of the stat structure, as returned
> by the stat(2) system call.
>
> st_blocks The total number of physical blocks of size
> 512 bytes actually allocated on disk. This
> field is not defined for block special or
> character special files.

As it seems that only a few people know this, it would be nice if this
text could mention that the value in st_blocks is in units of DEV_BSIZE.

This would help people to understand why e.g. things look hosed on HP-UX....



Jörg

--
EMail:joerg at schily dot isdn dot cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
js@cs.tu-berlin.de (uni)
schilling at fokus dot fraunhofer dot de (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
