Replies: 2 - Last Post: Aug 9, 2005 11:18 AM by iriomote
UFS Direct I/O
Posted: Aug 8, 2005 1:52 PM
Howdy,
While reading through Solaris Internals this weekend, I came to the section on UFS direct I/O. The book states that random and large sequential workloads benefit from direct I/O. Does anyone happen to know how big a "large sequential" I/O needs to be to benefit from direct I/O? Are there any advantages to using direct I/O with volumes devoted to Oracle redo/undo and archive logs? I have read that it is best to avoid direct I/O with redo/undo, since the file system will cluster small writes, and boost total throughput (especially during log switches). I have also read that due to the transient nature of redo/undo, the CPU and memory resources devoted to creating the pages would be wasted, since these pages would not be re-used for future reads/writes. Has anyone sat down and looked at direct I/O in depth? Any idea which workloads (if any) work best with redo/undo on UFS direct I/O file systems? If there is a set of documentation that explains this, please let me know.
Thanks, - Ryan
_______________________________________________ perf-discuss mailing list perf-discuss at opensolaris dot org
Jarod Jenson
jarod@aeysis.com
Re: UFS Direct I/O
Posted: Aug 9, 2005 6:59 AM, in response to: Matty
There is an easy way to think about this in the case of Oracle: use direct I/O anywhere Oracle uses O_DSYNC. This rule of thumb will hold 99% of the time. It means data, redo, and control files all get direct I/O and archive logs do not. The presence of O_DSYNC causes UFS to "break" all of the rules you are familiar with; for instance, UFS does no write clustering for O_DSYNC writes, even with buffered I/O. This is the configuration I use on everything from smallish systems all the way up to fully loaded 25Ks.
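Jarod's rule of thumb can be written down as a small decision function. This is only a sketch: the `ora_file_kind` enum and the function names are hypothetical, and the assumption that Oracle opens data, redo, and control files with O_DSYNC but writes archived logs without it is taken from the post itself. The returned strings are the UFS mount options `forcedirectio` and `noforcedirectio`, on the assumption that the policy is applied per filesystem rather than per file.

```c
#include <stdbool.h>

/* Hypothetical classification of the Oracle files discussed above. */
enum ora_file_kind {
    ORA_DATAFILE,
    ORA_REDO_LOG,
    ORA_CONTROL_FILE,
    ORA_ARCHIVE_LOG
};

/* Assumption from the post: Oracle opens data, redo and control files
 * with O_DSYNC; the archiver writes its output without O_DSYNC. */
static bool oracle_uses_dsync(enum ora_file_kind kind)
{
    return kind != ORA_ARCHIVE_LOG;
}

/* Rule of thumb: direct I/O wherever Oracle uses O_DSYNC.
 * Returns the UFS mount option for the filesystem holding this file type. */
static const char *ufs_directio_option(enum ora_file_kind kind)
{
    return oracle_uses_dsync(kind) ? "forcedirectio" : "noforcedirectio";
}
```

Under this sketch, redo logs, datafiles, and control files would live on filesystems mounted with `-o forcedirectio`, and the archive destination on a separate, normally buffered filesystem.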
Be on the lookout in the (hopefully) near future for a fix to direct I/O that will make it behave the way it really should ;) I'll give details later.
Thanks,
Jarod
Posts: 16; From: Washington, DC USA; Registered: 8/9/05
Re: UFS Direct I/O
Posted: Aug 9, 2005 11:18 AM, in response to: Matty
Ryan:
The goodness of UFS direct I/O is highly application-specific. The benefits arise for two fundamental reasons: avoiding the scaling limitations of the OS page cache, and removing the POSIX single-writer lock constraint, which allows multiple I/O operations to a file to proceed concurrently while a write is active. Of these two factors, the latter usually has the broader impact.
For Oracle online redo logs, the consensus is that UFS direct I/O is pretty much always a good thing. By default, Oracle's logging uses asynchronous writes (aio_write()) with data-synchronous completion criteria (because the logs are opened with O_DSYNC). Because these writes are synchronous, write coalescing at the filesystem level has no opportunity to help. Because the writes are asynchronous, they can gain throughput from removal of the single-writer lock. (Yes, it's confusing: 'synchronous' and 'asynchronous' are not plain-English opposites here, but rather different topics altogether. I/O that is not asynchronous is 'blocking' (e.g., pwrite()), and I/O that is not synchronous is 'deferred' (i.e., only flushed by fsync(), when a maxcontig cluster fills, or when moved along by fsflush).)
The only downside to using UFS direct I/O for online logs is that the archiver loses the performance advantage of UFS filesystem prefetching when reading these files. However, since the archiver uses larger I/O sizes, I'm not aware of this ever becoming anyone's constraining bottleneck. Therefore, the improved write throughput to the logs with UFS direct I/O is pretty much always a good tradeoff. There are also tradeoffs and limitations associated with the memory management overhead underlying the OS page cache, with and without filesystem buffering. At high throughput rates these factors can absolutely be limiting, but most folks are far more impacted by the single-writer lock than by the cost of memory management, so I consider these impacts secondary.
Note that you would *never* want UFS direct I/O on log archive destinations, since the archiver does *not* use O_DSYNC on its output files and expects to enjoy the performance benefit of deferred writes!
The size threshold at which UFS direct I/O becomes beneficial can depend on a great many factors, including UFS tunables, volume management, I/O multipathing, the actual APIs used by the application, whether or not space allocation is occurring, and backend configuration. For any given configuration, the best choice is determined by I/O microbenchmarking. Formulating an appropriate microbenchmark requires an accurate understanding of the actual APIs and tuning factors used by your application. For Oracle logging, a correct microbenchmark would use O_DSYNC on open() and aio_write() for writing, and the target files would be pre-allocated so that filesystem logging does not bias the results. Assuming a high transaction rate, Oracle itself will probably coalesce log writes into 8K operations, but for a single-stream workload of iterated single-row INSERT/COMMIT operations, log writes may be quite small. Unfortunately, errors in I/O microbenchmarking are quite common, due to inappropriate experiment design and incorrect interpretation of results, so be careful!
For each application and category of I/O, there are tradeoffs to consider in using UFS direct I/O. As a rule, high-end scaling requires some storage option with the essential characteristics of UFS direct I/O, which includes raw devices, QFS direct I/O with Q-writes, VxFS Quick I/O, or VxFS ODM. All of these should be expected to perform 'similarly', but the UFS option is free! When moving to one of these options from 'out-of-the-box' buffered I/O, it is typically necessary to do some Oracle tuning to make use of the system memory that is liberated when filesystem buffering is switched off. The impact of these options on backup and restore operations also typically needs to be properly evaluated.
The physics underlying these factors is all well understood. The problems come in making policy decisions around the associated tradeoffs. There is a load of misinformation available online; beware any posting that says "you should always use UFS direct I/O". There is a complex set of tradeoffs here, including operational constraints and the logistics of changing from other options. The best guidance I can offer in a small space is to "make well-informed decisions regarding these factors". To promote a better understanding of these factors, I wrote a paper a while ago called "Oracle I/O: Supply and Demand". That paper is due for an upgrade, and I hope to push it out this Fall, with the scope expanded to include RAC/Grid considerations and factors affecting 'direct path' and NOLOGGING write performance. Shucks, this posting is getting way too close to *being* a whitepaper! ;-)
Hope this helps, -- Bob Sneed