OpenSolaris

Thread: [zones-discuss] zones on shared storage proposal

Replies: 11 - Last Post: Sep 21, 2009 12:21 PM by: johnlev
edp

Posts: 605
From: US

Registered: 3/9/05
[zones-discuss] zones on shared storage proposal
Posted: May 21, 2009 1:55 AM

hey all,

i've created a proposal for my vision of how zones hosted on shared
storage should work. if anyone is interested in this functionality then
please give my proposal a read and let me know what you think. (fyi,
i'm leaving on vacation next week so if i don't reply to comments right
away please don't take offence, i'll get to it when i get back. ;)

ed
_______________________________________________
zones-discuss mailing list
zones-discuss at opensolaris dot org


mgerdts

Posts: 1,264
From: US

Registered: 8/5/05
Re: [zones-discuss] zones on shared storage proposal
Posted: May 21, 2009 9:59 AM   in response to: edp

On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
<edward dot pilatowicz at sun dot com> wrote:
> hey all,
>
> i've created a proposal for my vision of how zones hosted on shared
> storage should work.  if anyone is interested in this functionality then
> please give my proposal a read and let me know what you think.  (fyi,
> i'm leaving on vacation next week so if i don't reply to comments right
> away please don't take offence, i'll get to it when i get back.  ;)
>
> ed

I'm very happy to see this. Comments appear below.

> " please ensure that the vim modeline option is not disabled
> vim:textwidth=72
>
> -------------------------------------------------------------------------------
> Zones on shared storage (v1.0)
>
[snip]
> ----------
> C.1.i Zonecfg(1m)
>
> The zonecfg(1m) command will be enhanced with the following two new
> resources and associated properties:
>
>     rootzpool resource
>         src resource property
>         install-size resource property
>         zpool-preserve resource property
>         dataset resource property
>
>     zpool resource
>         src resource property
>         install-size resource property
>         zpool-preserve resource property
>         name resource property
>
> The new resources and properties will be defined as follows:
>
> "rootzpool"
>     - Description: Identifies a shared storage object (and its
>       associated parameters) which will be used to contain the root
>       zfs filesystem for a zone.
>
> "zpool"
>     - Description: Identifies a shared storage object (and its
>       associated parameters) which will be made available to the
>       zone as a delegated zfs dataset.

That is to say "put your OS stuff in rootzpool, put everything else in
zpool" - right?

>
> "src"
>     - Status: Required.
>     - Format: Storage object uri (so-uri).  (See definition below.)
>     - Description: Identifies the storage object associated with this
>       resource.
>
> "install-size"
>     - Status: Optional.
>     - Format: Integer.
>       Defaults to bytes, but can be flagged as
>       gigabytes, kilobytes, or megabytes, with a g, k, or m suffix,
>       respectively.
>     - Description: If the specified storage object doesn't exist at zone
>       install time it will be created with the specified size.  This
>       property has no effect for storage objects which already exist and
>       have a pre-defined size.
>
> "zpool-preserve"
>     - Status: Optional.
>     - Format: Boolean.  Defaults to false.
>     - Description: When doing an install, if this property is set to
>       true and a zpool already exists on the specified storage object
>       it will be used.  When doing a destroy, if this property is set
>       to true, the root zpool will not be destroyed.
>
> "dataset"
>     - Status: Optional
>     - Format: zfs filesystem name component (can't contain a '/')
>     - Description: Name of a dataset within the root zpool to delegate
>       to the zone.
>
> "name"
>     - Status: Required
>     - Format: zfs filesystem name component (can't contain a '/')
>     - Description: Used as part of the name for a zpool which will be
>       delegated to the zone.
>
> Zonecfg(1m) "verify" will verify the syntax of any "rootzpool" resource
> group (and its properties), but it will NOT verify the accessibility of
> any storage specified by a so-uri.  (This is because accessing the
> storage specified by an so-uri could require configuration changes to
> other subsystems.)
>
>
> ----------
> C.1.ii Storage object uri (so-uri) format
>
> The storage object uri (so-uri) syntax[03] will conform to the standard
> uri format defined in RFC 3986 [04].  The nfs URI scheme is defined in
> RFC 2224 [05].
> The so-uri syntax can be summarised as follows:
>
> File storage objects:
>
>     path:///<file-absolute>
>     nfs://<host>[:port]/<file-absolute>
>
> Vdisk storage objects:
>
>     vpath:///<file-absolute>
>     vnfs://<host>[:port]/<file-absolute>
>
> Device storage objects:
>
>     fc:///wwn[@<lun>]
>     iscsi:///alias=<alias>[@<lun>]
>     iscsi:///target=<target>[@<lun>]
>     iscsi://host[:port]/[tpgt=<tpgt>/]target=<target>[@<lun>]
>
> File storage objects point to plain files on local, nfs, or cifs
> filesystems.  These files are used to contain zpools which store zone
> datasets.  These are the simplest types of storage objects.  Once
> created, they have a fixed size, can't be grown, and don't support
> advanced features like snapshotting, etc.  Some example file so-uri's
> are:
>
>     path:///export/xvm/vm1.disk
>         - a local file
>     path:///net/heaped.sfbay/export/xvm/1.disk
>         - a nfs file accessible via autofs
>     nfs://heaped.sfbay/export/xvm/1.disk
>         - same file specified directly via a nfs so-uri
>
> Vdisk storage objects are similar to file storage objects in that they
> can live on local, nfs, or cifs filesystems, but they each have their
> own special data format and varying featuresets, with support for things
> like snapshotting, etc.  Some common vdisk formats are: VDI, VMDK and
> VHD.  Some example vdisk so-uri's are:
>
>     vpath:///export/xvm/vm1.vmdk
>         - a local vdisk image
>     vpath:///net/heaped.sfbay/export/xvm/1.vmdk
>         - a nfs vdisk image accessible via autofs
>     vnfs://heaped.sfbay/export/xvm/1.vmdk
>         - same vdisk image specified directly via a nfs so-uri
>
> Device storage objects specify block storage devices in a host
> independent fashion.  When configuring FC or iscsi storage on different
> hosts, the storage configuration normally lives outside of zonecfg, and
> the configured storage may have varying /dev/dsk/cXtXdX* names.
> The so-uri syntax provides a way to specify storage in a host independent
> fashion, and during zone management operations, the zones framework can
> map this storage to a host specific device path.  Some example device
> so-uri's are:
>
>     fc:///20000014c347492a@0
>         - lun 0 of a fc disk with the specified wwn
>     iscsi:///alias=oracle zone root@0
>         - lun 0 of an iscsi disk with the specified alias.
>     iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
>         - lun 0 of an iscsi disk with the specified target id.

What if there is already the necessary layer of abstraction that
provides a consistent namespace?  For example,
/dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
in the disk group zone1dg.  That may reside on a single disk or span
many disks and will have the same name regardless of which host the disk
group is imported on.  Since this VxVM volume may span many disks, it
would be inappropriate to refer to a single LUN that makes up that disk
group.  Perhaps the following is appropriate for such situations.

    dev:///dev/vx/dsk/zone1dg/rootvol

> ----------
> C.1.iii Zoneadm(1m) install
>
> When a zone is installed via the zoneadm(1m) "install" subcommand, the
> zones subsystem will first verify that any required so-uris exist and
> are accessible.
>
> If an so-uri points to a plain file, nfs file, or vdisk, and the object
> does not exist, the object will be created with the install-size that
> was specified via zonecfg(1m).  If the so-uri does not exist and an
> install-size was not specified via zonecfg(1m) an error will be
> generated and the install will fail.
>
> If an so-uri points to an explicit nfs server, the zones framework will
> need to mount the nfs filesystem containing the storage object.  The nfs
> server share containing the specified object will be auto-mounted at:
>
>     /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>

Just for clarity, I think you mean:

- "will be mounted at".
  I think "auto-mounted" conjures up the idea that there is
  integration with autofs.
- <host> is the NFS server
- <nfs-share-name> is the path on the NFS server.  Is this the exact
  same thing as <path-absolute> in the URI specification?  Is this the
  file that is mounted or the directory above the file?

My storage administrators give me grief if I create too many NFS mounts
(but I am not sure I've heard a convincing reason).  As I envision NFS
server layout, I think I would see something like:

    vol
      zones
        zone1
          rootzpool
          zpool
        zone2
          rootzpool
          zpool
        zone3
          rootzpool
          zpool

It seems as though if these three zones are all running on the same box
the box will have at least the following mounts:

    /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
    /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
    /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3

But maybe as many as:

    /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
    /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
    /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
    /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
    /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
    /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool

With a slightly different arrangement this could be reduced to one.
Change

> /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>

To:

    /var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>

I can see that this would complicate things a bit because it would be
hard to figure out how far up the path is the right place for the mount.

Perhaps if this is what I would like I would be better off adding a
global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
the path:/// uri instead.

Thoughts?

> If an so-uri points to a fibre channel lun, the zones subsystem will
> verify that the specified wwn corresponds to a global zone accessible
> fibre channel disk device.
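
As an aside, every one of these checks starts by splitting the so-uri
into its parts.  A rough Python sketch of that step, based only on the
syntax summary in C.1.ii, might look like the following.  The function
name parse_so_uri, the layout of the returned dictionary, and the choice
to default a missing lun to 0 are my own assumptions for illustration
(the default comes from the example above that describes a lun-less
iscsi target as "lun 0"); none of it is taken from an actual zones
implementation.

```python
FILE_SCHEMES = {"path", "nfs"}
VDISK_SCHEMES = {"vpath", "vnfs"}
DEVICE_SCHEMES = {"fc", "iscsi"}


def parse_so_uri(uri):
    """Split a so-uri into a dict of its parts (hypothetical helper)."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in FILE_SCHEMES | VDISK_SCHEMES | DEVICE_SCHEMES:
        raise ValueError("unsupported so-uri: %r" % uri)

    # Everything up to the first '/' is the optional <host>[:port] part.
    authority, slash, body = rest.partition("/")
    if not slash:
        raise ValueError("so-uri has no path component: %r" % uri)
    host, _, port = authority.partition(":")
    result = {
        "scheme": scheme,
        "host": host or None,
        "port": int(port) if port else None,
    }

    if scheme in FILE_SCHEMES | VDISK_SCHEMES:
        result["kind"] = "file" if scheme in FILE_SCHEMES else "vdisk"
        result["path"] = "/" + body
        return result

    # Device objects may carry an optional "@<lun>" suffix.
    result["kind"] = "device"
    if "@" in body:
        body, _, lun = body.rpartition("@")
        result["lun"] = int(lun)
    else:
        result["lun"] = 0  # assumption: no suffix means lun 0

    if scheme == "fc":
        result["wwn"] = body
        return result

    # iscsi: "alias=<alias>", "target=<target>" or "tpgt=<tpgt>/target=<target>"
    for part in body.split("/"):
        key, eq, value = part.partition("=")
        if not eq or key not in ("alias", "target", "tpgt"):
            raise ValueError("bad iscsi so-uri component: %r" % part)
        result[key] = value
    return result
```

With this sketch, parse_so_uri("nfs://heaped.sfbay/export/xvm/1.disk")
gives host "heaped.sfbay" and path "/export/xvm/1.disk", while a dev:///
uri like the one I suggested earlier is rejected until the scheme is
added to one of the sets.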
>
> If an so-uri points to an iSCSI target or alias, the zones subsystem
> will verify that the iSCSI device is accessible on the local system.  If
> an so-uri points to a static iSCSI target and that target is not
> already accessible on the local host, then the zones subsystem will
> enable static discovery for the local iSCSI initiator and attempt to
> apply the specified static iSCSI configuration.  If the iSCSI target
> device is not accessible then the install will fail.
>
> Once a zone install has verified that any required so-uri exists and is
> accessible, the zones subsystem will need to initialise the so-uri.  In
> the case of a path or nfs path, this will involve creating a zpool
> within the specified file.  In the case of a vdisk, fibre channel lun,
> or iSCSI lun, this will involve creating an EFI/GPT partition on the
> device which uses the entire disk, then a zpool will be created within
> this partition.  For data protection purposes, if a storage object
> contains any pre-existing partitions, zpools, or ufs filesystems, the
> install will fail will fail with an appropriate error message.  To

s/will fail will fail/will fail/

> continue the installation and overwrite any pre-existing data, the user
> will be able to specify a new '-f' option to zoneadm(1m) install.  (This
> option mimics the '-f' option used by zpool(1m) create.)
>
> If zpool-preserve is set to true, then before initialising any target
> storage objects, the zones subsystem will attempt to import a
> pre-existing zpool from those objects.  This will allow users to
> pre-create a zpool with custom creation time options, for use with
> zones.  To successfully import a pre-created zpool for a zone install,
> that zpool must not be attached.  (I.e., any pre-created zpool must be
> exported from the system where it was created before a zone can be
> installed on it.)  Once the zpool is imported the install process will
> check for the existence of a /ROOT filesystem within the zpool.
> If this filesystem exists the install will fail with an appropriate
> error message.  To continue the installation the user will need to
> specify the '-f' option to zoneadm(1m) install, which will cause the
> zones framework to delete the pre-existing /ROOT filesystem within the
> zpool.

Is this because the zone root will be installed at
<zonepath>/ROOT/<bename> rather than <zonepath>/root?

> The newly created or imported root zpool will be named after the zone to
> which it is associated, with the assigned name being "<zonename>_rpool".
> This zpool will then be mounted at the zone's rootpath and then the
> install process will continue normally[07].

This seems odd... why not have the root zpool mounted at zonepath rather
than zoneroot?  This way (e.g.) SUNWdetached.xml would follow the zone
during migrations.

> XXX: use altroot at zpool creation or just manually mount zpool?
>
> If the user has specified a "zpool" resource, then the zones framework
> will configure, initialize, and/or import it in a similar manner to a
> zpool specified by the "rootzpool" resource.  The key differences are
> that the name of the newly created or imported zpool will be
> "<zonename>_<name>".  The specified zpool will also have the zfs "zoned"
> property set to "on", hence it will not be mounted anywhere in the
> global zone.
>
> XXX: do we need "zpool import -O file-system-property=" to set the
> zoned property upon import.
>
> Once a zone configured with a so-uri is in the installed state, the
> zones framework needs a mechanism to mark that storage as in use to
> prevent it from being accessed by multiple hosts simultaneously.  The
> most likely situation where this could happen is via a zoneadm(1m)
> attach on a remote host.  The easiest way to achieve this is to keep the
> zpools associated with the storage imported and mounted at all times,
> and leverage the existing zpool support for detecting and preventing
> multi-host access.
>
> So whenever a global zone boots and the zones smf service runs, it will
> attempt to configure and import any shared storage objects associated
> with installed zones.  It will then continue to behave as it does today
> and boot any installed zones that have the autoboot property set.  If
> any shared storage objects fail to configure or import, then:
>
>     - the zones associated with the failed storage will be transitioned
>       to the "uninstalled" state.

Is "uninstalled" a real state?  Perhaps "configured" is more
appropriate, as this allows a transition to "installed" via "zoneadm
attach".

>     - an error message will be emitted to the zones smf log file.
>     - after booting any remaining installed zones that have autoboot set
>       to true, the zones smf service will enter the "maintenance" state,
>       thereby prompting the administrator to look at the zones smf log
>       file.
>
> After fixing any problems with shared storage accessibility, the
> admin should be able to simply re-attach the zone to the system.
>
> Currently the zones smf service is dependent upon multi-user-server, so
> all networking services required for access to shared storage should be
> properly configured well before we try to import any shared storage
> associated with zones.

May I propose a fix to the zones SMF service as part of this?  The
current integration with the global zone's SMF is rather weak in
reporting the real status of zones and allowing the use of SMF for
controlling the zones service.  In particular:

- If a zone fails to start, the state of svc:/system/zones:default does
  not reflect a maintenance or degraded state.
- If an admin wishes to start a zone the same way that the system would
  do it, "svcadm restart" and similar have the side effect of rebooting
  all zones on the system.
- There is no way to establish dependencies between zones or between a
  zone and something that needs to happen in the global zone.
- There isn't a good way to allow certain individuals within the global
  zone the ability to start/stop specific zones with RBAC or
  authorizations.

I propose that:

- zonecfg creates a new service instance svc:/system/zones:zonename
  when the zone is configured.  Its initial state is disabled.  If the
  service already exists sanity checking may be performed but it should
  not whack things like dependencies and authorizations.
- After zoneadm installs a zone, the general/enabled property of
  svc:/system/zones:zonename is set to match the zonecfg autoboot
  property.
- "zoneadm boot" is the equivalent of
  "svcadm enable -t svc:/system/zones:zonename"
- A new command "zoneadm shutdown" is the equivalent of
  "svcadm disable -t svc:/system/zones:zonename"
- "zoneadm halt" is the equivalent of
  "svcadm mark maintenance svc:/system/zones:zonename" followed by the
  traditional ungraceful teardown of the zone.
- Modification of the autoboot property with zonecfg (so long as the
  zone has been installed/attached) triggers the corresponding
  general/enabled property change in SMF.  This should set the property
  general/enabled without causing an immediate state change.
- zoneadm uninstall and zoneadm detach set the service to not autostart.
- zonecfg delete also deletes the service.
- A new property be added to zonecfg to disable SMF integration of this
  particular zone.  This will be important for people that have already
  worked around this problem (including ISV's providing clustering
  products) that don't want SMF getting in the way of their already
  working solution.

> On system shutdown, the zones system will NOT export zpools contained
> within storage objects used by the zone.  Zpools contained within storage
> objects assigned to installed zones will only be exported during zone
> detach.  More details about the behaviour of zone detach are provided
> below.
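
Pulling the install rules in C.1.iii together, the decision made for a
single storage object could be sketched roughly as below.  This is only
my reading of the text: the function names, the action strings, the use
of 1024 as the multiplier behind the k/m/g suffixes, and the fall-through
from "zpool-preserve is set but there is no zpool to import" to a normal
initialisation are all assumptions.  The inputs are plain booleans
rather than real probes of the storage, the create-on-missing branch
only applies to file and vdisk objects, and the EFI/GPT labelling step
is left out.

```python
_SUFFIX = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}  # assumed binary multiples


def parse_install_size(text):
    """Convert an install-size such as '512m' or '4096' into bytes."""
    text = text.strip().lower()
    if text and text[-1] in _SUFFIX:
        return int(text[:-1]) * _SUFFIX[text[-1]]
    return int(text)  # no suffix: the value is already in bytes


def plan_install(exists, install_size=None, zpool_preserve=False,
                 has_zpool=False, has_root_fs=False, has_other_data=False,
                 force=False):
    """Return the ordered steps for one storage object, or raise on error.

    exists          - the storage object is already present
    has_zpool       - an importable zpool was found on the object
    has_root_fs     - the imported zpool already contains /ROOT
    has_other_data  - pre-existing partitions or ufs filesystems were found
    force           - the proposed '-f' option to zoneadm install
    """
    steps = []
    if not exists:
        if install_size is None:
            raise RuntimeError("storage object missing and no install-size set")
        steps.append("create object of %d bytes" % parse_install_size(install_size))
    elif zpool_preserve and has_zpool:
        steps.append("import existing zpool")
        if has_root_fs:
            if not force:
                raise RuntimeError("/ROOT already exists; use -f to delete it")
            steps.append("delete existing /ROOT")
        return steps
    elif has_zpool or has_other_data:
        if not force:
            raise RuntimeError("storage object holds data; use -f to overwrite")
    steps.append("create zpool")
    return steps
```

For example, plan_install(False, "1m") yields a create step followed by
"create zpool", whereas calling it for an existing object that already
holds a zpool, without zpool-preserve or force, raises an error.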
>
>
> ----------
> C.1.iv Zoneadm(1m) attach
>
[snip]
>
> ----------
> C.1.v Zoneadm(1m) boot
>
[snip]
>
> ----------
> C.1.vi Zoneadm(1m) detach
>
[snip]
>
> ----------
> C.1.vii Zoneadm(1m) uninstall
>
[snip]
>
> ----------
> C.1.viii Zoneadm(1m) clone
>
> Normally when cloning a zone which lives on a zfs filesystem the zones
> framework will take a zfs(1m) snapshot of the source zone and then do a
> zfs(1m) clone operation to create a filesystem for the new zone which is
> being instantiated.  This works well when all the zones on a given
> system live on local storage in a single zfs filesystem, but this model
> doesn't work well for zones with encapsulated roots.  First, with
> encapsulated roots each zone has its own zpool, and zfs(1m) does not
> support cloning across zpools.  Second, zfs(1m) snapshotting/cloning
> within the source zpool and then mounting the resultant filesystem onto
> the target zone's zoneroot would introduce dependencies between zones,
> complicating things like zone migration.
>
> Hence, for cloning operations, if the source zone has an encapsulated
> root, zoneadm(1m) will not use zfs(1m) snapshot/clone.  Currently
> zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
> is unable to use zfs(1m) snapshot/clone.  We could just fall back to
> this default behaviour for encapsulated root zones, but find+cpio are
> not error free and can have problems with large files.  So we propose to
> update zoneadm(1m) clone to detect when both the source and target zones
> are using separate zfs filesystems, and in that case attempt to use zfs
> send/recv before falling back to find+cpio.

Can a provision be added for running an external command to produce the
clone?  I envision this being used to make a call to a storage device to
tell the storage device to create a clone of the storage.  (This implies
that the super-secret tool to re-write the GUID would need to become
available.)
The alternative seems to be to have everyone invent their own mechanism
with the same external commands and zoneadm attach.

> Today, the zoneadm(1m) clone operation ignores any additional storage
> (specified via the "fs", "device", or "dataset" resources) that may be
> associated with the zone.  Similarly, the clone operation will ignore
> additional storage associated with any "zpool" resources.
>
> Since zoneadm(1m) clone will be enhanced to support cloning between
> encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
> clone will be documented as the recommended migration mechanism for
> users who wish to migrate existing zones from one format to another.
>
>
> ----------
> C.2 Storage object uid/gid handling
>
> One issue faced by all VTs that support shared storage is dealing with
> file access permissions of storage objects accessible via NFS.  This
> issue doesn't affect device based shared storage, or local files and
> vdisks, since these types of storage are always accessible, regardless
> of the uid of the accessing process (as long as the accessing process has
> the necessary privileges).  But when accessing files and vdisks via NFS,
> the accessing process can not use privileges to circumvent restrictive
> file access permissions.  This issue is also complicated by the fact
> that by default most NFS servers will map all accesses by the remote root
> user to a different uid, usually "nobody" (a process known as "root
> squashing").
>
> In order to avoid root squashing, or requiring users to setup special
> configurations on their NFS servers, whenever the zone framework
> attempts to create a storage object file or vdisk, it will temporarily
> change its uid and gid to the "xvm" user and group, and then create the
> file with 0600 access permissions.
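
A minimal sketch of that create step, in Python, could look like the
following.  The helper name create_storage_file and its arguments are
made up for illustration, sizing the file with ftruncate is my own
choice since the proposal doesn't say how the file is populated, and
passing None for uid and gid skips the identity switch so the sketch
can be tried without root.  Nothing here reflects how the zones
framework would actually do it.

```python
import os


def create_storage_file(path, size, uid=None, gid=None):
    """Create a file of `size` bytes with mode 0600 (hypothetical helper).

    If uid/gid are given, the effective ids are switched for the
    duration of the create (this needs root) and restored afterwards.
    """
    saved_uid, saved_gid = os.geteuid(), os.getegid()
    try:
        if gid is not None:
            os.setegid(gid)  # change the group first, while still privileged
        if uid is not None:
            os.seteuid(uid)
        # O_EXCL: never clobber a storage object that already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        try:
            os.fchmod(fd, 0o600)    # make the mode independent of the umask
            os.ftruncate(fd, size)  # extend the empty file to the requested size
        finally:
            os.close(fd)
    finally:
        if uid is not None:
            os.seteuid(saved_uid)   # regain privilege before resetting the gid
        if gid is not None:
            os.setegid(saved_gid)
```

The ids are restored in a finally block so that a failed create (for
example because the object already exists) never leaves the caller
running with the switched identity.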
>
> Additionally, whenever the zones framework attempts to access a storage
> object file or vdisk it will temporarily switch its uid and gid to match
> the owner and group of the file/vdisk, ensure that the file is readable
> and writeable by its owner (updating the file/vdisk permissions if
> necessary), and finally setup the file/vdisk for access via a zpool
> import or lofiadm -a.  This should allow the zones framework to
> access storage object files/vdisks that were created by any user,
> regardless of their ownership, simplifying file ownership and management
> issues for administrators.

This implies that the xvm user is getting some additional privileges.
What are those privileges?

> ----------
> C.3 Taskq enhancements
>
> The integration of Duckhorn[08] greatly simplifies the management of cpu
> resources assigned to zones.  This management is partially implemented
> through the use of dynamic resource pools, where zones and their
> associated cpu resources can both be bound to a pool.
>
> Internally, zfs has worker threads associated with each zpool.  These
> are kernel taskq threads which can run on any cpu which has not been
> explicitly allocated to a cpu set/partition/pool.
>
> So today, for any zones living on zfs filesystems, and running in a
> dedicated cpu pool, any zfs disk processing associated with that zone is
> not done by the cpus bound to that zone's pool.  Essentially all the
> zone's zfs processing is done for "free" by the global zone.
>
> With the introduction of zpools encapsulated within storage objects,
> which are themselves associated with specific zones, it would be
> desirable to have the zpool worker threads bound to the cpus currently
> allocated to the zone.  Currently, zfs uses taskq threads for each
> zpool, so one way of doing this would be to introduce a mechanism that
> allows for the binding of taskqs to pools.
>
> Hence we propose the following new interfaces:
>     zfs_poolbind(char *, poolid_t);
>     taskq_poolbind(taskq_t, poolid_t);
>
> When a zone, which is bound to a pool, is booted, the zones framework
> will call zfs_poolbind() for each zpool associated with an encapsulated
> storage object bound to the zone being booted.
>
> Zfs will in turn use the new taskq pool binding interfaces to bind all
> its taskqs to the specified pools.  This mapping is transient and zfs
> will not record or persist this binding in any way.
>
> The taskq implementation will be enhanced to allow for binding worker
> threads to a specific pool.  If taskq threads are created for a taskq
> which is bound to a specific pool, those new threads will also inherit
> the same pool bindings.  The taskq to pool binding will remain in effect
> until the taskq is explicitly rebound or the pool to which it is bound
> is destroyed.

Any thoughts of doing something similar for dedicated NICs?  From
dladm(1M):

     cpus

         Bind the processing of packets for a given data link to
         a processor or a set of processors.  The value can be a
         comma-separated list of one or more processor ids.  If
         the list consists of more than one processor, the
         processing will spread out to all the processors.
         Connection to processor affinity and packet ordering for
         any individual connection will be maintained.

That is, the enhancement is already there, it's just a matter of making
use of it.

> ----------
> C.4 Zfs enhancements
>
> In addition to the zfs_poolbind() interface proposed above, the
> zpool(1m) "import" command will need to be enhanced.  Currently the
> zpool(1m) import by default scans all storage devices on the system
> looking for pools to import.  The caller can also use the '-d' option to
> specify a directory within which the zpool(1m) command will scan for
> zpools that may be imported.  This scanning involves sampling many
> objects.
> When dealing with zpools encapsulated in storage objects, this
> scanning is unnecessary since we already know the path to the objects
> which contain the zpool.  Hence, the '-d' option will be enhanced to
> allow for the specification of a file or device.  The user will also be
> able to specify this option multiple times, in case the zpool spans
> multiple objects.
>
>
> ----------
> C.5 Lofi and lofiadm(1m) enhancements
>
> Currently, there is no way for a global zone to access the contents of a
> vdisk.  Vdisk support was first introduced in VirtualBox.  xVM then
> adopted the VirtualBox code for vdisk support.  With both technologies,
> the only way to access the contents of a vdisk is to export it to a VM.
>
> To allow zones to use vdisk devices we propose to leverage the code
> introduced by xVM by incorporating it into lofi.  This will allow any
> solaris system to access the contents of vdisk devices.  The interface
> changes to lofi to allow for this are fairly straightforward.
>
> A new '-l' option will be added to the lofiadm(1m) "-a" device creation
> mode.  The '-l' option will indicate to lofi that the new device should
> have a label associated with it.  Normally lofi devices are named
> /dev/lofi/ and /dev/rlofi/, where is the lofi device number.
> When a disk device has a label associated with it, it exports many
> device nodes with different names.  Therefore lofi will need to be
> enhanced to support these new device names, with multiple nodes
> per device.  These new names will be:
>
>     /dev/lofi/dsk/p<j>  - block device partitions
>     /dev/lofi/dsk/s<j>  - block device slices
>     /dev/rlofi/dsk/p<j> - char device partitions
>     /dev/rlofi/dsk/s<j> - char device slices

One of the big weaknesses with lofi is that you can't count on the
device name being the same between boots.  Could -l take an argument to
be used instead of "dsk"?
That is:

    lofiadm -a -l coolgames /media/coolgames.iso

Creates:

    /dev/lofi/coolgames/p<j>
    /dev/lofi/coolgames/s<j>
    /dev/rlofi/coolgames/p<j>
    /dev/rlofi/coolgames/s<j>

For those cases where legacy behavior is desired, an optional %d can be
used to create the names you suggest above.

    lofiadm -a -l dsk%d /nfs/server/zone/stuff

[snip]

> ----------
> C.6 Performance considerations
>
> As previously mentioned, this proposal primarily simplifies the process
> of configuring zones on shared storage.  In most cases these proposed
> configurations can be created today, but no one has actually verified
> that these configurations perform acceptably.  Hence, in conjunction
> with providing functionality to simplify the setup of these configs,
> we also need to be quantifying their performance to make sure that
> none of the configurations suffer from gross performance problems.
>
> The most straightforward configurations, with the least possibilities for
> poor performance, are ones using local devices, fibre channel luns, and
> iSCSI luns.  These configurations should perform identically to the
> configurations where the global zone uses these objects to host zfs
> filesystems without zones.  Additionally, the performance of these
> configurations will mostly be dependent upon the hardware associated
> with the storage devices.  Hence the performance of these configurations
> is for the most part uninteresting and performance analysis of these
> configurations can be skipped.
>
> Looking at the performance of storage objects which are local files or
> nfs files is more interesting.  In these cases the zpool that hosts the
> zone will be accessing its storage via the zpool vdev_file vdev_ops_t
> interface.  Currently, this interface doesn't receive as much use and
> performance testing as some of the other zpool vdev_ops_t interfaces.
> Hence it will be worthwhile to measure the performance of a zpool backed
> by a file within another zfs filesystem.
> Likewise we will want to measure
> the performance of a zpool backed by a file on an NFS filesystem.
> Finally, we should compare these two performance points to a zone which
> is not encapsulated within a zpool, but is instead installed directly on
> a local zfs filesystem.  (These comparisons are not really that
> interesting when dealing with block device based storage objects.)

Reminder for when I am testing: is this a case where forcedirectio will
make a lot of sense?  That is, zfs is already buffering, don't make NFS
do it too.

> Currently, while it is very common to deploy large numbers of zfs
> filesystems, systems with large numbers of zpools are not very common.
> The solution proposed in this project will likely result in an increase
> of zpools on systems hosting zones.  Hence, we should evaluate the
> impact of an increasing number of zpools on performance scalability.
> This could be done by comparing the io performance drop-off of an
> increasing number of zones hosted on multiple zfs filesystems in a single
> zpool vs zones hosted in separate zpools.
>
> Finally, it will be important to do performance measurements for vdisk
> configurations.  These configurations are similar to the local file or
> nfs configurations, but they will be utilising the vdev_disk backend and
> they will have an additional layer of indirection through lofi.
>
> XXX: impact of multiple zpools on arc and l2 arc?  talk to mark maybee.
>
>
> ----------
> C.7 Phased delivery
>
> Customers have been asking for a simple mechanism to allow hosting of
> zones on NFS since the introduction of zones.  Hence we'd like to get
> this functionality into the hands of customers as quickly as possible.
> Also, the approach taken by this proposal to supporting zones on shared
> storage is different from what was originally anticipated, hence we'd
> like to get practical experience with this approach at customer sites
> asap to determine if there are situations where this approach may not
> meet their requirements.  To accelerate the delivery of the previously
> proposed features, we plan to deliver them in three phases:

Sounds quite reasonable.

[snip]
>
> -------------------------------------------------------------------------------

--
Mike Gerdts
http://mgerdts.blogspot.com/


edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: May 22, 2009 12:11 AM   in response to: mgerdts


hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow up emails which include the revised
proposal.

thanks again,
ed


On Thu, May 21, 2009 at 11:59:22AM -0500, Mike Gerdts wrote:
> On Thu, May 21, 2009 at 3:55 AM, Edward Pilatowicz
> <edward dot pilatowicz at sun dot com> wrote:
> > hey all,
> >
> > i've created a proposal for my vision of how zones hosted on shared
> > storage should work.  if anyone is interested in this functionality then
> > please give my proposal a read and let me know what you think.  (fyi,
> > i'm leaving on vacation next week so if i don't reply to comments right
> > away please don't take offence, i'll get to it when i get back.  ;)
> >
> > ed
>
> I'm very happy to see this. Comments appear below.
>
> > " please ensure that the vim modeline option is not disabled
> > vim:textwidth=72
> >
> > -------------------------------------------------------------------------------
> > Zones on shared storage (v1.0)
> >
> [snip]
> > ----------
> > C.1.i Zonecfg(1m)
> >
> > The zonecfg(1m) command will be enhanced with the following two new
> > resources and associated properties:
> >
> > rootzpool resource
> > src resource property
> > install-size resource property
> > zpool-preserve resource property
> > dataset resource property
> >
> > zpool resource
> > src resource property
> > install-size resource property
> > zpool-preserve resource property
> > name resource property
> >
> > The new resource and properties will be defined as follows:
> >
> > "rootzpool"
> > - Description: Identifies a shared storage object (and its
> > associated parameters) which will be used to contain the root
> > zfs filesystem for a zone.
> >
> > "zpool"
> > - Description: Identifies a shared storage object (and its
> > associated parameters) which will be made available to the
> > zone as a delegated zfs dataset.
>
> That is to say "put your OS stuff in rootzpool, put everything else in
> zpool" - right?
>

yes. as i see it, this proposal allows for multiple types of deployment
configurations.

- a zone with a single encapsulated "rootzpool" zpool.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will also reside in <zonename>_rpool/ROOT/zbeXXX

- a zone with a single encapsulated "rootzpool" zpool.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will reside in <zonename>_rpool/dataset/<dataset>

- a zone with multiple encapsulated zpools.
the OS will reside in <zonename>_rpool/ROOT/zbeXXX
everything else will reside in other encapsulated "zpool"s

i've added some text to this section of the proposal to explain these
different configuration scenarios.
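
as a rough illustration of the three layouts above, here is a small
python sketch that just builds the dataset/zpool names. the function
name and its parameters are made up for this example; be_name stands in
for the "zbeXXX" boot environment component, and the "<zonename>_<name>"
form for additional zpools is the naming used later in the proposal.

```python
def zone_storage_layout(zonename, be_name, dataset=None, zpool_names=()):
    """Return where the OS and 'everything else' live for a zone.

    Hypothetical helper: it only formats names, it does not touch zfs.
    """
    rpool = f"{zonename}_rpool"
    os_root = f"{rpool}/ROOT/{be_name}"

    data = []
    if dataset is not None:
        # layout 2: extra data in a dataset inside the root zpool
        data.append(f"{rpool}/dataset/{dataset}")
    # layout 3: extra data in additional encapsulated zpools
    data.extend(f"{zonename}_{name}" for name in zpool_names)
    if not data:
        # layout 1: everything lives alongside the OS
        data.append(os_root)

    return {"os": os_root, "data": data}


if __name__ == "__main__":
    print(zone_storage_layout("web", "zbe"))
    print(zone_storage_layout("web", "zbe", dataset="db"))
    print(zone_storage_layout("web", "zbe", zpool_names=("data",)))
```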

> > ----------
> > C.1.ii Storage object uri (so-uri) format
> >
> > The storage object uri (so-uri) syntax[03] will conform to the standard
> > uri format defined in RFC 3986 [04]. The nfs URI scheme is defined in
> > RFC 2224 [05]. The so-uri syntax can be summarised as follows:
> >
> > File storage objects:
> >
> > path:///<file-absolute>
> > nfs://<host>[:port]/<file-absolute>
> >
> > Vdisk storage objects:
> >
> > vpath:///<file-absolute>
> > vnfs://<host>[:port]/<file-absolute>
> >
> > Device storage objects:
> >
> > fc:///wwn[@<lun>]
> > iscsi:///alias=<alias>[@<lun>]
> > iscsi:///target=<target>[@<lun>]
> > iscsi://host[:port]/[tpgt=<tpgt>/]target=<target>[@<lun>]
> >
> > File storage objects point to plain files on local, nfs, or cifs
> > filesystems. These files are used to contain zpools which store zone
> > datasets. These are the simplest types of storage objects. Once
> > created, they have a fixed size, can't be grown, and don't support
> > advanced features like snapshotting, etc. Some example file so-uri's
> > are:
> >
> > path:///export/xvm/vm1.disk
> > - a local file
> > path:///net/heaped.sfbay/export/xvm/1.disk
> > - a nfs file accessible via autofs
> > nfs://heaped.sfbay/export/xvm/1.disk
> > - same file specified directly via a nfs so-uri
> >
> > Vdisk storage objects are similar to file storage objects in that they
> > can live on local, nfs, or cifs filesystems, but they each have their
> > own special data format and varying featuresets, with support for things
> > like snapshotting, etc. Some common vdisk formats are: VDI, VMDK and
> > VHD. Some example vdisk so-uri's are:
> >
> > vpath:///export/xvm/vm1.vmdk
> > - a local vdisk image
> > vpath:///net/heaped.sfbay/export/xvm/1.vmdk
> > - a nfs vdisk image accessible via autofs
> > vnfs://heaped.sfbay/export/xvm/1.vmdk
> > - same vdisk image specified directly via a nfs so-uri
> >
> > Device storage objects specify block storage devices in a host
> > independent fashion. When configuring FC or iscsi storage on different
> > hosts, the storage configuration normally lives outside of zonecfg, and
> > the configured storage may have varying /dev/dsk/cXtXdX* names. The
> > so-uri syntax provides a way to specify storage in a host independent
> > fashion, and during zone management operations, the zones framework can
> > map this storage to a host specific device path. Some example device
> > so-uri's are:
> >
> > fc:///20000014c347492a@0
> > - lun 0 of a fc disk with the specified wwn
> > iscsi:///alias=oracle zone root@0
> > - lun 0 of an iscsi disk with the specified alias.
> > iscsi:///target=iqn.1986-03.com.sun:02:38abfd16-78c5-c58e-e629-ea77a33c6740
> > - lun 0 of an iscsi disk with the specified target id.
>
> What about if there is already the necessary layer of abstraction that
> provides a consistent namespace? For example,
> /dev/vx/dsk/zone1dg/rootvol would refer to a block device named rootvol
> in the disk group zone1dg. That may reside on a single disk or span
> many disks and will have the same name regardless of which host the disk
> group is imported on. Since this VxVM volume may span many disks, it
> would be inappropriate to refer to a single LUN that makes up that disk
> group.
>
> Perhaps the following is appropriate for such situations.
>
> dev:///dev/vx/dsk/zone1dg/rootvol
>

good point. but rather than adding another URI type i'd rather just re-use
the "path:///" uri.

i've updated the doc to describe this use case and i've added an
example.
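
for what it's worth, the so-uri forms above are regular enough that a
generic uri parser can split them. the sketch below uses python's
standard urllib.parse; the classify_so_uri() name and the returned dict
are invented for this example and are not part of the proposal. note
that with "path:///" re-used for things like vxvm volumes, the scheme
alone no longer says whether the target is a plain file or a block
device, so a real implementation would presumably have to stat the path.

```python
from urllib.parse import urlsplit

# scheme -> storage object class, following the so-uri summary above
SO_URI_KINDS = {
    "path": "file", "nfs": "file",
    "vpath": "vdisk", "vnfs": "vdisk",
    "fc": "device", "iscsi": "device",
}


def classify_so_uri(uri):
    """Split a so-uri into its parts (illustrative helper only)."""
    parts = urlsplit(uri)
    if parts.scheme not in SO_URI_KINDS:
        raise ValueError(f"unsupported so-uri scheme: {parts.scheme!r}")
    return {
        "kind": SO_URI_KINDS[parts.scheme],
        "scheme": parts.scheme,
        "host": parts.hostname,   # None for the "scheme:///..." forms
        "port": parts.port,       # None when no port was given
        "path": parts.path,
    }


if __name__ == "__main__":
    for uri in ("path:///export/xvm/vm1.disk",
                "nfs://heaped.sfbay/export/xvm/1.disk",
                "path:///dev/vx/dsk/zone1dg/rootvol"):
        print(classify_so_uri(uri))
```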

>
> > ----------
> > C.1.iii Zoneadm(1m) install
> >
> > When a zone is installed via the zoneadm(1m) "install" subcommand, the
> > zones subsystem will first verify that any required so-uris exist and
> > are accessible.
> >
> > If an so-uri points to a plain file, nfs file, or vdisk, and the object
> > does not exist, the object will be created with the install-size that
> > was specified via zonecfg(1m). If the so-uri does not exist and an
> > install-size was not specified via zonecfg(1m) an error will be
> > generated and the install will fail.
> >
> > If an so-uri points to an explicit nfs server, the zones framework will
> > need to mount the nfs filesystem containing the storage object. The nfs
> > server share containing the specified object will be auto-mounted at:
> >
> > /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
>
> Just for clarity, I think you mean:
>
> - "will be mounted at". I think "auto-mounted" conjures up the idea
> that there is integration with autofs.
> - <host> is the NFS server
> - <nfs-share-name> is the path on the NFS server. Is this the exact
> same thing as <path-absolute> in the URI specification? Is this the
> file that is mounted or the directory above the file?
>
> My storage administrators give me grief if I create too many NFS mounts
> (but I am not sure I've heard a convincing reason). As I envision NFS
> server layout, I think I would see something like:
>
> vol
> zones
> zone1
> rootzpool
> zpool
> zone2
> rootzpool
> zpool
> zone3
> rootzpool
> zpool
>
> It seems as though if these three zones are all running on the same box
> the box will have at least the following mounts:
>
> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3
>

well, it all depends on what nfs shares are actually being exported.

if the nfs server has the following share(s) exported:
nfsserver:/vol
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol
/var/zones/nfsmount/zone2/nfsserver/vol
/var/zones/nfsmount/zone3/nfsserver/vol

if the nfs server has the following share(s) exported:
nfsserver:/vol/zones
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol/zones
/var/zones/nfsmount/zone2/nfsserver/vol/zones
/var/zones/nfsmount/zone3/nfsserver/vol/zones

if the nfs server has the following share(s) exported:
nfsserver:/vol/zones/zone1
nfsserver:/vol/zones/zone2
nfsserver:/vol/zones/zone3
then you would have the following mount(s):
/var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1
/var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2
/var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3

> But maybe as many as:
>
> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/rootzpool
> /var/zones/nfsmount/zone1/nfsserver/vol/zones/zone1/zpool
> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/rootzpool
> /var/zones/nfsmount/zone2/nfsserver/vol/zones/zone2/zpool
> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/rootzpool
> /var/zones/nfsmount/zone3/nfsserver/vol/zones/zone3/zpool
>

hm. afaik, you can only share directories via nfs, and i'm assuming
that "zpool" and "rootzpool" above are files (or volumes) which can
actually store data. in which case you would never mount them directly.


> With a slightly different arrangement this could be reduced to one.
> Change
>
> > /var/zones/nfsmount/<zonename>/<host>/<nfs-share-name>
>
> To:
>
> /var/zones/nfsmount/<host>/<nfs-share-name>/<zonename>/<file>
>

nice catch.

in early versions of my proposal, the nfs:// uri i was planning to
support allowed for the specification of mount options. this required
allowing for per-zone nfs mounts with potentially different mount
options. since then i've simplified things (realizing that most people
really don't need or want to specify mount options) and i've switched to
using the nfs uri defined in rfc 2224. this means we can do away
with the '<zonename>' path component as you suggest.

i've updated the doc.

> I can see that this would complicate things a bit because it would be
> hard to figure out how far up the path is the right place for the mount.
>

afaik, determining the mount point should be pretty straightforward.
i was planning to get a list of all the shares exported by the specified
nfs server, and then do a strncmp() of all the exported shares against
the specified path. the longest matching share name is the mount path.

for example. if we have:
nfs://jurassic/a/b/c/d/file

and jurassic is exporting:
jurassic:/a
jurassic:/a/b
jurassic:/a/b/c

then our mount path will be:
/var/zones/nfsmount/jurassic/a/b/c

and our encapsulated zpool file will be accessible at:
/var/zones/nfsmount/jurassic/a/b/c/d/file

afaik, this is actually the only way that this could be implemented.
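
a minimal python sketch of the longest-match lookup described above,
assuming the list of shares exported by the server has already been
fetched somehow. MOUNT_ROOT and find_mount() are names made up for this
example. one deliberate difference from a raw strncmp(): the match is
done on whole path components, so a share named /a/b does not
accidentally match a file under /a/bc.

```python
MOUNT_ROOT = "/var/zones/nfsmount"   # mount root used in this thread


def find_mount(host, file_path, exported_shares):
    """Return (mount_point, local_file_path) for an nfs so-uri.

    host            -- nfs server name from the so-uri
    file_path       -- absolute path from the so-uri, e.g. /a/b/c/d/file
    exported_shares -- share paths exported by the server (assumed input)
    """
    candidates = []
    for share in exported_shares:
        s = share.rstrip("/")   # "/" becomes "", which matches any path
        # only accept matches that end on a path component boundary
        if file_path == s or file_path.startswith(s + "/"):
            candidates.append(s)
    if not candidates:
        raise LookupError(f"{host} exports no share containing {file_path}")

    best = max(candidates, key=len)          # longest matching share wins
    mount_point = f"{MOUNT_ROOT}/{host}{best}"
    return mount_point, mount_point + file_path[len(best):]


if __name__ == "__main__":
    shares = ["/a", "/a/b", "/a/b/c"]
    print(find_mount("jurassic", "/a/b/c/d/file", shares))
```

run against the jurassic example it picks /a/b/c as the share, which
matches the mount path and file location given above.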

> Perhaps if this is what I would like I would be better off adding a
> global zone vfstab entry to mount nfsserver:/vol/zones somewhere and use
> the path:/// uri instead.
>
> Thoughts?
>

i'm not sure i understand how you would like to see this functionality
behave.

wrt vfstab, i'd rather you not use that since that moves configuration
outside of zonecfg. so later, if you want to migrate the zone, you'll
need to remember about that vfstab configuration and move it as well.
if at all possible i'd really like to keep all the configuration within
zonecfg(1m).

perhaps you could explain your issues with the currently planned
approach in a different way to help me understand it better?

> > If an so-uri points to a fibre channel lun, the zones subsystem will
> > verify that the specified wwn corresponds to a global zone accessible
> > fibre channel disk device.
> >
> > If an so-uri points to an iSCSI target or alias, the zones subsystem
> > will verify that the iSCSI device is accessible on the local system. If
> > an so-uri points to a static iSCSI target and that target is not
> > already accessible on the local host, then the zones subsystem will
> > enable static discovery for the local iSCSI initiator and attempt to
> > apply the specified static iSCSI configuration. If the iSCSI target
> > device is not accessible then the install will fail.
> >
> > Once a zones install has verified that any required so-uri exists and is
> > accessible, the zones subsystem will need to initialise the so-uri. In
> > the case of a path or nfs path, this will involve creating a zpool
> > within the specified file. In the case of a vdisk, fibre channel lun,
> > or iSCSI lun, this will involve creating a EFI/GPT partition on the
> > device which uses the entire disk, then a zpool will be created within
> > this partition. For data protection purposes, if a storage object
> > contains any pre-existing partitions, zpools, or ufs filesystems, the
> > install will fail will fail with an appropriate error message. To
>
> s/will fail will fail/will fail/
>

oops. thanks. ;)

> > continue the installation and overwrite any pre-existing data, the user
> > will be able to specify a new '-f' option to zoneadm(1m) install. (This
> > option mimics the '-f' option used by zpool(1m) create.)
> >
> > If zpool-preserve is set to true, then before initialising any target
> > storage objects, the zones subsystem will attempt to import a
> > pre-existing zpool from those objects. This will allow users to
> > pre-create a zpool with custom creation time options, for use with
> > zones. To successfully import a pre-created zpool for a zone install,
> > that zpool must not be attached. (Ie, any pre-created zpool must be
> > exported from the system where it was created before a zone can be
> > installed on it.) Once the zpool is imported the install process will
> > check for the existence of a /ROOT filesystem within the zpool. If this
> > filesystem exists the install will fail with an appropriate error
> > message. To continue the installation the user will need to specify the
> > '-f' option to zoneadm(1m) install, which will cause the zones framework
> > to delete the pre-existing /ROOT filesystem within the zpool.
>
> Is this because the zone root will be installed <zonepath>/ROOT/<bename>
> rather than <zonepath>/root?
>

yes.

the current zones zfs filesystem layout and management for
opensolaris is documented here:
http://www.opensolaris.org/jive/thread.jspa?messageID=272726

i've mentioned this and referred the user to '[07]'. (which references
the link above.)

> > The newly created or imported root zpool will be named after the zone to
> > which it is associated, with the assigned name being "<zonename>_rpool".
> > This zpool will then be mounted at the zones rootpath and then the
> > install process will continue normally[07].
>
> This seems odd... why not have the root zpool mounted at zonepath rather
> than zoneroot? This way (e.g.) SUNWdetached.xml would follow the zone
> during migrations.
>

oops. that's a mistake. it will be mounted on the zonepath.
i've fixed this.

> > XXX: use altroot at zpool creation or just manually mount zpool?
> >
> > If the user has specified a "zpool" resource, then the zones framework
> > will configure, initialize, and/or import it in a similar manner to a
> > zpool specified by the "rootzpool" resource. The key differences are
> > that the name of the newly created or imported zpool will be
> > "<zonename>_<name>". The specified zpool will also have the zfs "zoned"
> > property set to "on", hence it will not be mounted anywhere in the
> > global zone.
> >
> > XXX: do we need "zpool import -O file-system-property=" to set the
> > zoned property upon import.
> >
> > Once a zone configured with a so-uri is in the installed state, the
> > zones framework needs a mechanism to mark that storage as in use to
> > prevent it from being accessed by multiple hosts simultaneously. The
> > most likely situation where this could happen is via a zoneadm(1m)
> > attach on a remote host. The easiest way to achieve this is to keep the
> > zpools associated with the storage imported and mounted at all times,
> > and leverage the existing zpool support for detecting and preventing
> > multi-host access.
> >
> > So whenever a global zone boots and the zones smf service runs, it will
> > attempt to configure and import any shared storage objects associated
> > with installed zones. It will then continue to behave as it does today
> > and boot any installed zones that have the autoboot property set. If
> > any shared storage objects fail to configure or import, then:
> >
> > - the zones associated with the failed storage will be transitioned
> >   to the "uninstalled" state.
>
> Is "uninstalled" a real state? Perhaps "configured" is more
> appropriate, as this allows a transition to "installed" via "zoneadm
> attach".
>

oops. another bug. fixed.

> > - an error message will be emitted to the zones smf log file.
> > - after booting any remaining installed zones that have autoboot set
> >   to true, the zones smf service will enter the "maintenance" state,
> >   thereby prompting the administrator to look at the zones smf log
> >   file.
> >
> > After fixing any problems with shared storage accessibility, the
> > admin should be able to simply re-attach the zone to the system.
> >
> > Currently the zones smf service is dependent upon multi-user-server, so
> > all networking services required for access to shared storage should be
> > properly configured well before we try to import any shared storage
> > associated with zones.
>
> May I propose a fix to the zones SMF service as part of this? The
> current integration with the global zone's SMF is rather weak in
> reporting the real status of zones and allowing the use of SMF for
> controlling the zones service. In particular:
>
> - If a zone fails to start, the state of svc:/system/zones:default does
>   not reflect a maintenance or degraded state.
> - If an admin wishes to start a zone the same way that the system would
>   do it, "svcadm restart" and similar have the side effect of rebooting
>   all zones on the system.
> - There is no way to establish dependencies between zones or between a
>   zone and something that needs to happen in the global zone.
> - There isn't a good way to allow certain individuals within the global
>   zone the ability to start/stop specific zones with RBAC or
>   authorizations.
>
> I propose that:
>
> - zonecfg creates a new services instance svc:/system/zones:zonename
>   when the zone is configured. Its initial state is disabled. If the
>   service already exists sanity checking may be performed but it should
>   not whack things like dependencies and authorizations.
> - After zoneadm installs a zone, the general/enabled property of
>   svc:/system/zones:zonename is set to match the zonecfg autoboot
>   property.
> - "zoneadm boot" is the equivalent of
>   "svcadm enable -t svc:/system/zones:zonename"
> - A new command "zoneadm shutdown" is the equivalent of
>   "svcadm disable -t svc:/system/zones:zonename"
> - "zoneadm halt" is the equivalent of "svcadm mark maintenance
>   svc:/system/zones:zonename:" followed by the traditional ungraceful
>   teardown of the zone.
> - Modification of the autoboot property with zonecfg (so long as the
>   zone has been installed/attached) triggers the corresponding
>   general/enabled property change in SMF. This should set the property
>   general/enabled without causing an immediate state change.
> - zoneadm uninstall and zoneadm detach set the service to not autostart.
> - zonecfg delete also deletes the service.
> - A new property be added to zonecfg to disable SMF integration of this
>   particular zone. This will be important for people that have already
>   worked around this problem (including ISV's providing clustering
>   products) that don't want SMF getting in the way of their already
>   working solution.
>

yeah. the zones team is well aware that our current smf integration
story is pretty poor. :( we really want to improve our smf integration
by moving all our configuration into smf, adding per-zone smf services,
etc.
so while this project proposes some minor changes to the behavior of our
existing smf service, i think that an overhaul of our smf integration is
really a project in and of itself, and out of scope for this proposal.
(this proposal already has plenty of scope that could take a while to
deliver. ;)

> > ----------
> > C.1.viii Zoneadm(1m) clone
> >
> > Normally when cloning a zone which lives on a zfs filesystem the zones
> > framework will take a zfs(1m) snapshot of the source zone and then do a
> > zfs(1m) clone operation to create a filesystem for the new zone which is
> > being instantiated. This works well when all the zones on a given
> > system live on local storage in a single zfs filesystem, but this model
> > doesn't work well for zones with encapsulated roots. First, with
> > encapsulated roots each zone has its own zpool, and zfs(1m) does not
> > support cloning across zpools. Second, zfs(1m) snapshotting/cloning
> > within the source zpool and then mounting the resultant filesystem onto
> > the target zones zoneroot would introduce dependencies between zones,
> > complicating things like zone migration.
> >
> > Hence, for cloning operations, if the source zone has an encapsulated
> > root, zoneadm(1m) will not use zfs(1m) snapshot/clone. Currently
> > zoneadm(1m) will fall back to the use of find+cpio to clone zones if it
> > is unable to use zfs(1m) snapshot/clone. We could just fall back to
> > this default behaviour for encapsulated root zones, but find+cpio are
> > not error free and can have problems with large files. So we propose to
> > update zoneadm(1m) clone to detect when both the source and target zones
> > are using separate zfs filesystems, and in that case attempt to use zfs
> > send/recv before falling back to find+cpio.
>
> Can a provision be added for running an external command to produce the
> clone? I envision this being used to make a call to a storage device to
> tell the storage device to create a clone of the storage.
> (This implies
> that the super-secret tool to re-write the GUID would need to become
> available.)
>
> The alternative seems to be to have everyone invent their own mechanism
> with the same external commands and zoneadm attach.
>

hm. currently there are internal brand hooks which are run during a
clone operation, but i don't think it would be appropriate to expose
these.

a "zoneadm clone" is basically a copy + sys-unconfig. if you have a
storage device that can be used to do the copy for you, perhaps you
could simply do the copy on the storage device, and then do a "zoneadm
attach" of the new zone image? if you want, i think it would be a
pretty trivial RFE to add a sys-unconfig option to "zoneadm attach".
that should let you get the same essential functionality as clone,
without having to add any new callbacks.

thoughts?

> > Today, the zoneadm(1m) clone operation ignores any additional storage
> > (specified via the "fs", "device", or "dataset" resources) that may be
> > associated with the zone. Similarly, the clone operation will ignore
> > additional storage associated with any "zpool" resources.
> >
> > Since zoneadm(1m) clone will be enhanced to support cloning between
> > encapsulated root zones and un-encapsulated root zones, zoneadm(1m)
> > clone will be documented as the recommended migration mechanism for
> > users who wish to migrate existing zones from one format to another.
> >
> >
> > ----------
> > C.2 Storage object uid/gid handling
> >
> > One issue faced by all VTs that support shared storage is dealing with
> > file access permissions of storage objects accessible via NFS. This
> > issue doesn't affect device based shared storage, or local files and
> > vdisks, since these types of storage are always accessible, regardless
> > of the uid of the access process (as long as the accessing process has
> > the necessary privileges).
> > But when accessing files and vdisk via NFS,
> > the accessing process can not use privileges to circumvent restrictive
> > file access permissions. This issue is also complicated by the fact
> > that by default most NFS servers will map all accesses by remote root
> > user to a different uid, usually "nobody". (a process known as "root
> > squashing".)
> >
> > In order to avoid root squashing, or requiring users to setup special
> > configurations on their NFS servers, whenever the zone framework
> > attempts to create a storage object file or vdisk, it will temporarily
> > change its uid and gid to the "xvm" user and group, and then create the
> > file with 0600 access permissions.
> >
> > Additionally, whenever the zones framework attempts to access a storage
> > object file or vdisk it will temporarily switch its uid and gid to match
> > the owner and group of the file/vdisk, ensure that the file is readable
> > and writeable by its owner (updating the file/vdisk permissions if
> > necessary), and finally setup the file/vdisk for access via a zpool
> > import or lofiadm -a. This should allow the zones framework to
> > access storage object files/vdisks that were created by any user,
> > regardless of their ownership, simplifying file ownership and management
> > issues for administrators.
>
> This implies that the xvm user is getting some additional privileges.
> What are those privileges?
>

hm. afaik, the xvm user isn't defined as having any particular
privileges. (/etc/user_attr doesn't have an xvm entry.) i wasn't
planning on defining any privilege requirements for the xvm user.
zoneadmd currently runs as root with all privs. so zoneadmd will be
able to switch to the xvm user to create encapsulated zpool
files/vdisks. similarly, zoneadmd will also be able to switch uid to
the owner of any other objects it may need to access.
> > ----------
> > C.3 Taskq enhancements
> >
> > The integration of Duckhorn[08] greatly simplifies the management of cpu
> > resources assigned to zone. This management is partially implemented
> > through the use of dynamic resource pools, where zones and their
> > associated cpu resources can both be bound to a pool.
> >
> > Internally, zfs has worker threads associated with each zpool. These
> > are kernel taskq threads which can run on any cpu which has not been
> > explicitly allocated to a cpu set/partition/pool.
> >
> > So today, for any zones living on zfs filesystems, and running in a
> > dedicated cpu pool, any zfs disk processing associated with that zone is
> > not done by the cpu's bound to that zones pool. Essentially all the
> > zones zfs processing is done for "free" by the global zone.
> >
> > With the introduction of zpools encapsulated within storage objects,
> > which are themselves associated with specific zones, it would be
> > desirable to have the zpool worker threads bound to the cpus currently
> > allocated to the zone. Currently, zfs uses taskq threads for each
> > zpool, so one way of doing this would be to introduce a mechanism that
> > allows for the binding of taskqs to pools.
> >
> > Hence we propose the following new interfaces:
> > zfs_poolbind(char *, poolid_t);
> > taskq_poolbind(taskq_t, poolid_t);
> >
> > When a zone, which is bound to a pool, is booted, the zones framework
> > will call zfs_poolbind() for each zpool associated with an encapsulated
> > storage object bound to the zone being booted.
> >
> > Zfs will in turn use the new taskq pool binding interfaces to bind all
> > its taskqs to the specified pools. This mapping is transient and zfs
> > will not record or persist this binding in any way.
> >
> > The taskq implementation will be enhanced to allow for binding worker
> > threads to a specific pool.
> > If taskqs threads are created for a taskq
> > which is bound to a specific pool, those new threads will also inherit
> > the same pool bindings. The taskq to pool binding will remain in effect
> > until the taskq is explicitly rebound or the pool to which it is bound
> > is destroyed.
>
> Any thoughts of doing something similar for dedicated NICs? From
> dladm(1M):
>
>     cpus
>
>         Bind the processing of packets for a given data link to
>         a processor or a set of processors. The value can be a
>         comma-separated list of one or more processor ids. If
>         the list consists of more than one processor, the
>         processing will spread out to all the processors.
>         Connection to processor affinity and packet ordering for any
>         individual connection will be maintained.
>
> That is, the enhancement is already there, it's just a matter of making
> use of it.
>

i'm currently engaged with someone on the crossbow team who is working
on a proposal to allow for binding datalinks to pools. but once again,
that's a separate project. ;)

> > ----------
> > C.4 Zfs enhancements
> >
> > In addition to the zfs_poolbind() interface proposed above, the
> > zpool(1m) "import" command will need to be enhanced. Currently the
> > zpool(1m) import by default scans all storage devices on the system
> > looking for pools to import. The caller can also use the '-d' option to
> > specify a directory within which the zpool(1m) command will scan for
> > zpools that may be imported. This scanning involves sampling many
> > objects. When dealing with zpools encapsulated in storage objects, this
> > scanning is unnecessary since we already know the path to the objects
> > which contains the zpool. Hence, the '-d' option will be enhanced to
> > allow for the specification of a file or device. The user will also be
> > able to specify this option multiple times, in case the zpool spans
> > multiple objects.
> >
> >
> > ----------
> > C.5 Lofi and lofiadm(1m) enhancements
> >
> > Currently, there is no way for a global zone to access the contents of a
> > vdisk. Vdisk support was first introduced in VirtualBox. xVM then
> > adopted the VirtualBox code for vdisk support. With both technologies,
> > the only way to access the contents of a vdisk is to export it to a VM.
> >
> > To allow zones to use vdisk devices we propose to leverage the code
> > introduced by xVM by incorporating it into lofi. This will allow any
> > solaris system to access the contents of vdisk devices. The interface
> > changes to lofi to allow for this are fairly straightforward.
> >
> > A new '-l' option will be added to the lofiadm(1m) "-a" device creation
> > mode. The '-l' option will indicate to lofi that the new device should
> > have a label associated with it. Normally lofi device are named
> > /dev/lofi/ and /dev/rlofi/, where is the lofi device number.
> > When a disk device has a label associated with it, it exports many
> > device nodes with different names. Therefore lofi will need to be
> > enhanced to support these new device names, with multiple nodes
> > per device. These new names will be:
> >
> > /dev/lofi/dsk/p<j> - block device partitions
> > /dev/lofi/dsk/s<j> - block device slices
> > /dev/rlofi/dsk/p<j> - char device partitions
> > /dev/rlofi/dsk/s<j> - char device slices
>
> One of the big weaknesses with lofi is that you can't count on the
> device name being the same between boots. Could -l take an argument
> to be used instead of "dsk"? That is:
>
> lofiadm -a -l coolgames /media/coolgames.iso
>
> Creates:
>
> /dev/lofi/coolgames/p<j>
> /dev/lofi/coolgames/s<j>
> /dev/rlofi/coolgames/p<j>
> /dev/rlofi/coolgames/s<j>
>
> For those cases where legacy behavior is desired, an optional %d can be
> used to create the names you suggest above.
>
> lofiadm -a -l dsk%d /nfs/server/zone/stuff
>

so there are a lot of improvements that could be done to lofi.
one improvement that i think we should do is to allow for persistent lofi devices that come back after reboots. custom device naming is another. but once again, i think that is outside the scope of this project. (this project will facilitate these other changes because it is creating an smf service for lofi, where persistent configuration could be stored, but adding that functionality will have to be another project.) > > ---------- > > C.6 Performance considerations > > > > As previously mentioned, this proposal primarily simplifies the process > > of configuring zones on shared storage. In most cases these proposed > > configurations can be created today, but no one has actually verified > > that these configurations perform acceptably. Hence, in conjunction > > with providing functionality to simplify the setup of these configs, > > we also need to be quantifying their performance to make sure that > > none of the configurations suffer from gross performance problems. > > > > The most straitforward configurations, with the least possibilities for > > poor performance, are ones using local devices, fibre channel luns, and > > iSCSI luns. These configuration should perform identically to the > > configurations where the global zone uses these objects to host zfs > > filesystems without zones. Additionally, the performance of these > > configurations will mostly be dependent upon the hardware associated > > with the storage devices. Hence the performance of these configuration > > is for the most part uninteresting and performance analysis of these > > configuration can by skipped. > > > > Looking at the performance of storage objects which are local files or > > nfs files is more interesting. In these cases the zpool that hosts the > > zone will be accessing it's storage via the zpool vdev_file vdev_ops_t > > interface. Currently, this interface doesn't receive as much use and > > performance testing as some of the other zpool vdev_ops_t interfaces. 
> > Hence it will worthwhile to measure the performance of a zpool backed by > > a file within another zfs filesystem. Likewise we will want to measure > > the performance of a zpool backed by a file on an NFS filesystem. > > Finally, we should compare these two performance points to a zone which > > is not encapsulated within a zpool, but is instead installed directly on > > a local zfs filesystem. (These comparisons are not really that > > interesting when dealing with block device based storage objects.) > > Reminder for when I am testing: is this a case where forcedirectio will > make a lot of sense? That is, zfs is already buffering, don't make NFS > do it too. > this is a great question, and i don't know the answer. i'll have to ask some nfs folks and do some perf testing to determine what should be done here. i've added a not about forcedirectio to the doc. _______________________________________________ zones-discuss mailing list zones-discuss at opensolaris dot org

edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: May 22, 2009 12:12 AM   in response to: mgerdts

[ second reply, includes revised proposal ]

hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow up emails which include the revised
proposal.

thanks again,
ed


edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: May 22, 2009 12:13 AM   in response to: mgerdts

[ third reply, includes revised proposal + change bars from previous
version ]

hey mike,

thanks for all the great feedback.
my replies to your individual comments are inline below.

i've updated my proposal to include your feedback, but i'm unable to
attach it to this reply because of mail size restrictions imposed by
this alias. i'll send some follow up emails which include the revised
proposal.

thanks again,
ed


johnlev

Posts: 852
From: GB

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 5, 2009 3:07 PM   in response to: edp

On Thu, May 21, 2009 at 04:55:15PM +0800, Edward Pilatowicz wrote:

> File storage objects:
>
> path:///<file-absolute>
> nfs://<host>[:port]/<file-absolute>
>
> Vdisk storage objects:
>
> vpath:///<file-absolute>
> vnfs://<host>[:port]/<file-absolute>

This makes me uncomfortable. The fact it's a vdisk is derivable except
in one case: creation. And when creating, we will already want some way
to specify the underlying format of the vdisk, so we could easily hook
the "make it a vdisk" option there.

That is, I think vdisks should just use path:/// and nfs:// not have
their own special schemes.
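
If only the two plain schemes are kept, parsing a storage object uri could look roughly like the sketch below. This is an illustration only: it assumes Python's standard urllib.parse, and the StorageObject type and parse_so_uri name are hypothetical, not part of the proposal.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class StorageObject(NamedTuple):
    """Hypothetical parsed form of a storage object uri (so-uri)."""
    scheme: str            # "path" or "nfs"
    host: Optional[str]    # only set for nfs://
    port: Optional[int]    # optional, only for nfs://
    path: str              # absolute file path


def parse_so_uri(uri: str) -> StorageObject:
    """Parse path:///<file-absolute> or nfs://<host>[:port]/<file-absolute>."""
    parts = urlsplit(uri)
    if parts.scheme not in ("path", "nfs"):
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    if not parts.path.startswith("/"):
        raise ValueError("file path must be absolute: %r" % uri)
    if parts.scheme == "path" and parts.netloc:
        raise ValueError("path:/// must not name a host: %r" % uri)
    if parts.scheme == "nfs" and not parts.hostname:
        raise ValueError("nfs:// requires a host: %r" % uri)
    return StorageObject(parts.scheme, parts.hostname, parts.port, parts.path)
```

With this shape, whether the object is a vdisk is not encoded in the uri at all; it would be decided separately, at creation time or by inspecting the object.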

> In order to avoid root squashing, or requiring users to setup special
> configurations on their NFS servers, whenever the zone framework
> attempts to create a storage object file or vdisk, it will temporarily
> change it's uid and gid to the "xvm" user and group, and then create the
> file with 0600 access permissions.

Hmmph. I really don't want the 'xvm' user to be exposed any more than it
is. It was always intended as an internal detail of the Xen least
privilege implementation. Encoding it as the official UID to access
shared storage seems very problematic to me. Not least, it means xend,
qemu-dm, etc. can suddenly write to all the shared storage even if it's
nothing to do with Xen.

Please make this be a 'user' option that the user can specify (with a
default of root or whatever). I'm pretty sure we'd agreed on that last
time we talked about this?

> Additionally, whenever the zones framework attempts to access an storage
> object file or vdisk it will temporarily switch its uid and gid to match
> the owner and group of the file/vdisk, ensure that the file is readable
> and writeable by it's owner (updating the file/vdisk permissions if
> necessary), and finally setup the file/vdisk for access via a zpool
> import or lofiadm -a. This should will allow the zones framework to
> access storage object files/vdisks that we created by any user,
> regardless of their ownership, simplifying file ownership and management
> issues for administrators.

+1 on this bit, for sure.
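
The permission step of that access path can be sketched as below, assuming Python on a POSIX system. The ensure_owner_rw name is hypothetical, and the temporary uid/gid switch is deliberately left out because it needs privilege; the returned uid is what a privileged caller would switch to before the zpool import or lofiadm -a.

```python
import os
import stat


def ensure_owner_rw(path: str) -> int:
    """Make sure the file's owner can read and write it.

    Returns the owner's uid so a privileged caller can switch to it.
    Only the owner read/write bits are added; other bits are untouched.
    """
    st = os.stat(path)
    wanted = stat.S_IRUSR | stat.S_IWUSR
    if st.st_mode & wanted != wanted:
        # Update the permissions only if they are actually insufficient.
        os.chmod(path, stat.S_IMODE(st.st_mode) | wanted)
    return st.st_uid
```

On a root-squashed NFS share the chmod itself would have to run after switching to the owner's uid, which is one more reason to do the switch first in a real implementation.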

> For RAS purposes, we will need to ensure that this vdisk utility is
> always running. Hence we will introduce a new lofi smf service
> svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
> daemon, which will manage the starting, stopping, monitoring, and
> possible re-start of the vdisk utility. Re-starts of vdisk utility

I'm confused by this bit: isn't startd what manages "starting, stopping,
monitoring, and possible re-start" of daemons? Why isn't this
svc:/system/vdisk:default ? What is lofid actually doing?

regards
john


edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 16, 2009 4:34 PM   in response to: johnlev

thanks for taking the time to look at this and sorry for the delay in
replying. my comments are inline below.
ed

On Sat, Sep 05, 2009 at 11:13:07PM +0100, John Levon wrote:
> On Thu, May 21, 2009 at 04:55:15PM +0800, Edward Pilatowicz wrote:
>
> > File storage objects:
> >
> > path:///<file-absolute>
> > nfs://<host>[:port]/<file-absolute>
> >
> > Vdisk storage objects:
> >
> > vpath:///<file-absolute>
> > vnfs://<host>[:port]/<file-absolute>
>
> This makes me uncomfortable. The fact it's a vdisk is derivable except
> in one case: creation. And when creating, we will already want some way
> to specify the underlying format of the vdisk, so we could easily hook
> the "make it a vdisk" option there.
>
> That is, I think vdisks should just use path:/// and nfs:// not have
> their own special schemes.
>

this is easy enough to change.

but would you mind explaining what the detection techniques are for
the different vdisk formats? are they files with well known extensions?
all directories with well known extensions? directories with certain
contents?

> > In order to avoid root squashing, or requiring users to setup special
> > configurations on their NFS servers, whenever the zone framework
> > attempts to create a storage object file or vdisk, it will temporarily
> > change it's uid and gid to the "xvm" user and group, and then create the
> > file with 0600 access permissions.
>
> Hmmph. I really don't want the 'xvm' user to be exposed any more than it
> is. It was always intended as an internal detail of the Xen least
> privilege implementation. Encoding it as the official UID to access
> shared storage seems very problematic to me. Not least, it means xend,
> qemu-dm, etc. can suddenly write to all the shared storage even if it's
> nothing to do with Xen.
>
> Please make this be a 'user' option that the user can specify (with a
> default of root or whatever). I'm pretty sure we'd agreed on that last
> time we talked about this?
>

i have no objections to adding a 'user' option.

but i'd still like to avoid defaulting to root and being subject to
root-squashing. the xvm user seems like a good way to do this. but if
you don't like this then i could always introduce a new user just for
this purpose, say the zonesnfs user.

> > For RAS purposes, we will need to ensure that this vdisk utility is
> > always running. Hence we will introduce a new lofi smf service
> > svc:/system/lofi:default, which will start a new /usr/lib/lofi/lofid
> > daemon, which will manage the starting, stopping, monitoring, and
> > possible re-start of the vdisk utility. Re-starts of vdisk utility
>
> I'm confused by this bit: isn't startd what manages "starting, stopping,
> monitoring, and possible re-start" of daemons? Why isn't this
> svc:/system/vdisk:default ? What is lofid actually doing?
>

well, as specified in the proposal, the administrative interface for
accessing vdisks is via lofi:

---8<---
Here's some examples of how this lofi functionality could be used
(outside of the zone framework). If there are no lofi devices on
the system, and an admin runs the following command:
lofiadm -a -l /e

johnlev

Posts: 852
From: GB

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 16, 2009 5:07 PM   in response to: edp

On Wed, Sep 16, 2009 at 04:34:06PM -0700, Edward Pilatowicz wrote:

> thanks for taking the time to look at this and sorry for the delay in
> replying.

Compared to /my/ delay...

> > That is, I think vdisks should just use path:/// and nfs:// not have
> > their own special schemes.
>
> this is easy enough to change.
>
> but would you mind explaning what is the detection techniques are for
> the different vdisk formats? are they files with well known extensions?
> all directories with well known extensions? directories with certain
> contents?

Well, the format comes from the XML property file present in the vdisk.
At import time, it's a combination of sniffing the type from the file,
and some static checks on file name (namely .raw and .iso suffixes).

> > Hmmph. I really don't want the 'xvm' user to be exposed any more than it
> > is. It was always intended as an internal detail of the Xen least
> > privilege implementation. Encoding it as the official UID to access
> > shared storage seems very problematic to me. Not least, it means xend,
> > qemu-dm, etc. can suddenly write to all the shared storage even if it's
> > nothing to do with Xen.
> >
> > Please make this be a 'user' option that the user can specify (with a
> > default of root or whatever). I'm pretty sure we'd agreed on that last
> > time we talked about this?
>
> i have no objections to adding a 'user' option.
>
> but i'd still like to avoid defaulting to root and being subject to
> root-squashing.

How about defaulting to the owner of the containing directory? If it's
root, you won't be able to write if you're root-squashed (or not root
user) anyway.

Failing that, I'd indeed prefer a different user, especially one that's
configurable in terms of uid/gid.

> there wouldn't really be any problem which changing this from a lofi
> service to be a vdisk service. both services would do the same thing.
> each would have a daemon that keeps track of the current vdisks on the
> system and ensures that a vdisk utility remains running for each one.
>
> if you want smf to manage the vdisk utility processes directly, then
> we'll have to create a new smf service each time a vdisk is accessed
> and destroy that smf service each time the vdisk is taken down.

Ah, right, I see now. Yes, out of the two options, I'd prefer each vdisk
to have its own fault container (SMF service). You avoid the need for
another hierarchy of fault management process (lofid), and get the
benefit of enhanced visibility:

# svcs
...
online 15:33:19 svc:/system/lofi:dsk0
online 15:33:19 svc:/system/lofi:dsk1
maintenance 15:33:19 svc:/system/lofi:dsk2

Heck, if we ever do represent zones or domains as SMF instances, we
could even build dependencies on the lofi instances. (Presuming we
somehow rewhack xVM to start a service instead of an isolated vdisk
process.)

regards
john


edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 16, 2009 9:33 PM   in response to: johnlev

On Thu, Sep 17, 2009 at 01:13:53AM +0100, John Levon wrote:
> On Wed, Sep 16, 2009 at 04:34:06PM -0700, Edward Pilatowicz wrote:
>
> > thanks for taking the time to look at this and sorry for the delay in
> > replying.
>
> Compared to /my/ delay...
>
> > > That is, I think vdisks should just use path:/// and nfs:// not have
> > > their own special schemes.
> >
> > this is easy enough to change.
> >
> > but would you mind explaning what is the detection techniques are for
> > the different vdisk formats? are they files with well known extensions?
> > all directories with well known extensions? directories with certain
> > contents?
>
> Well, the format comes from the XML property file present in the vdisk.

thereby implying that the vdisk path is a directory. ok. that's easy
enough to detect.

> At import time, it's a combination of sniffing the type from the file,
> and some static checks on file name (namely .raw and .iso suffixes).
>

well, as long as the suffixes above apply to directories and not to
files then i think we'd be ok. if the extensions above will apply to
files then we have a problem.

in the xvm world, you don't have any issues with accessing the files
above since you know that every object exported to a domain contains
a virtual disk, and therefore contains a label.

but with zones this isn't the case. in my proposal there are two access
modes for files. raw file mode, where a zpool is created directly
inside a file. and vdisk mode, where we first create a label within the
device and then create a zpool inside one of the partitions.

so previously if the user specified:
file:///.../foo.raw
then we would create a zpool directly within the file, no label.

and if the user specified:
vfile:///.../foo.raw

then we would use lofi with the newly proposed -l option to access the
file, then we'd put a label on it (via the lofi device), and then create
a zpool in one of the partitions (and once again, zfs would access the
file through the lofi device).

so in the two cases, how can we make the access mode determination
without having the separate uri syntax?

> > > Hmmph. I really don't want the 'xvm' user to be exposed any more than it
> > > is. It was always intended as an internal detail of the Xen least
> > > privilege implementation. Encoding it as the official UID to access
> > > shared storage seems very problematic to me. Not least, it means xend,
> > > qemu-dm, etc. can suddenly write to all the shared storage even if it's
> > > nothing to do with Xen.
> > >
> > > Please make this be a 'user' option that the user can specify (with a
> > > default of root or whatever). I'm pretty sure we'd agreed on that last
> > > time we talked about this?
> >
> > i have no objections to adding a 'user' option.
> >
> > but i'd still like to avoid defaulting to root and being subject to
> > root-squashing.
>
> How about defaulting to the owner of the containing directory? If it's
> root, you won't be able to write if you're root-squashed (or not root
> user) anyway.
>
> Failing that, I'd indeed prefer a different user, especially one that's
> configurable in terms of uid/gid.
>

if a directory is owned by a non-root user and i want to create a file
there, i think it's a great idea to switch to the uid of the directory
owner to do my file operations. i'll add that to the proposal.

but, say i'm on a host that is not subject to root squashing and i need
to create a file on a share that is only writable by root. in that
case, should i go ahead and create a file owned by root? imho, no.
instead, i'd rather create the file as some other user. why? because
if the administrator then tries to migrate that zone to a host that is
subject to root squashing from the server, then i'd lose access to that
file. eliminating all file accesses as root allows us to avoid
root-squashing and just helps eliminate potential failure modes.

this would be my argument for adding a new non-root user that could be
used as a fallback for remote file access in cases that would otherwise
default to the root user.

> > there wouldn't really be any problem which changing this from a lofi
> > service to be a vdisk service. both services would do the same thing.
> > each would have a daemon that keeps track of the current vdisks on the
> > system and ensures that a vdisk utility remains running for each one.
> >
> > if you want smf to manage the vdisk utility processes directly, then
> > we'll have to create a new smf service each time a vdisk is accessed
> > and destroy that smf service each time the vdisk is taken down.
>
> Ah, right, I see now. Yes, out of the two options, I'd prefer each vdisk
> to have its own fault container (SMF service). You avoid the need for
> another hierarchy of fault management process (lofid), and get the
> benefit of enhanced visibility:
>
> # svcs
> ...
> online 15:33:19 svc:/system/lofi:dsk0
> online 15:33:19 svc:/system/lofi:dsk1
> maintenance 15:33:19 svc:/system/lofi:dsk2
>
> Heck, if we ever do represent zones or domains as SMF instances, we
> could even build dependencies on the lofi instances. (Presuming we
> somehow rewhack xVM to start a service instead of an isolated vdisk
> process.)
>

it's a little fine grained for my tastes, but ok.

one other thing to consider is that all the services above will be
running the vdisk utility which will be shuffling data between a lofi
device node and a vdisk file. and since lofi nodes don't persist across
reboots, the services above shouldn't persist across a reboot either. i
guess that the method script for the services above could delete the
service if it noticed that the corresponding device node associated with
the vdisk was missing.
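
the cleanup check could be sketched like this. it is only an illustration in python: the stale_vdisk_instances name, the instance-to-node mapping, and the example device paths are hypothetical, and actually deleting the smf instance is left to the caller.

```python
import os
from typing import Callable, Dict, List


def stale_vdisk_instances(
    instance_nodes: Dict[str, str],
    node_exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return the service instances whose lofi device node is gone.

    instance_nodes maps an smf instance name to the device node it
    serves. After a reboot the nodes no longer exist, so every
    instance left over from before the reboot is reported as stale.
    """
    return sorted(
        fmri for fmri, node in instance_nodes.items()
        if not node_exists(node)
    )
```

a method script built on this would remove each reported instance instead of starting the vdisk utility for it.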

i can write this into the proposal as well.

ed


johnlev

Posts: 852
From: GB

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 21, 2009 8:22 AM   in response to: edp

On Wed, Sep 16, 2009 at 09:33:11PM -0700, Edward Pilatowicz wrote:

> there by implying that the vdisk path is a directory. ok. that's easy

Right.

> enough to detect.

It's probably safer to directly use vdiskadm to sniff the directory, if
you can.

> > At import time, it's a combination of sniffing the type from the file,
> > and some static checks on file name (namely .raw and .iso suffixes).
>
> well, as long as the suffixes above apply to directories and not to
> files then i think we'd be ok. if the extensions above will apply to
> files then we have a problem.

Once imported, the contents of the vdisk directory are private to vdisk.
The name of the containing directory can be anything.

That is, an import consists of taking the foo.raw file, and putting it
into a directory along with an XML properties file.

> so previously if the user specified:
> file:///.../foo.raw
> then we would create a zpool directly within the file, no label.
>
> and if the user specified:
> vfile:///.../foo.raw
>
> then we would use lofi with the newly proposed -l option to access the
> file, then we'd put a label on it (via the lofi device), and then create
> a zpool in one of the partitions (and once again, zfs would access the
> file through the lofi device).
>
> so in the two cases, how can we make the access mode determination
> without having the seperate uri syntax?

In the creation case, which I think we're talking about above, we create
the vdisk directory (rather than direct file access, which vdiskadm
can't do, even though vdisk itself can) and the container format is
clear.

If we want to configure access to a pre-existing raw file, I'm not sure
we'd be doing the labelling ourselves. Perhaps I don't quite understand
the use cases for what you're suggesting.

> > How about defaulting to the owner of the containing directory? If it's
> > root, you won't be able to write if you're root-squashed (or not root
> > user) anyway.
> >
> > Failing that, I'd indeed prefer a different user, especially one that's
> > configurable in terms of uid/gid.
>
> if a directory is owned by a non-root user and i want to create a file
> there, i think it's a great idea to switch to the uid of the directory
> owner todo my file operations. i'll add that to the proposal.
>
> but, say i'm on a host that is not subject to root squashing and i need
> to create a file on a share that is only writable by root. in that
> case, should i go ahead and create a file owned by root? imho, no.
> instead, i'd rather create the file as some other user.

I don't agree that second-guessing the user's intentions when they've
explicitly disabled root-squashing is a useful behaviour.

> if the administrator then tries to migrate that zone to a host that is
> subject to root squashing from the server, then i'd lose access to that
> file. eliminating all file accesses as root allows us to avoid
> root-squashing and just help eliminate potential failure modes.

Replacing it with a potentially more subtle issue: what's the zonenfs
user ID, is it configured on the server, and is it unique and reserved
across the organisation, and across all OSes?

Having access fail with a clear message is an understandable failure
mode, with a clear remedy: use a suitable uid /chosen by the
administrator/. NFS users are surely comfortable and familiar with root
squashing by now.

Having a MySQL security hole allow access to all your virtual shared
storage is a much more subtle problem (yes, I discovered despite my
initial research that UID 60 is used by some Linux machines as mysqld).

regards
john


edp

Posts: 605
From: US

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 21, 2009 10:23 AM   in response to: johnlev

On Mon, Sep 21, 2009 at 04:25:30PM +0100, John Levon wrote:
> On Wed, Sep 16, 2009 at 09:33:11PM -0700, Edward Pilatowicz wrote:
>
> > there by implying that the vdisk path is a directory. ok. that's easy
>
> Right.
>
> > enough to detect.
>
> It's probably safer to directly use vdiskadm to sniff the directory, if
> you can.
>

sure.

> > > At import time, it's a combination of sniffing the type from the file,
> > > and some static checks on file name (namely .raw and .iso suffixes).
> >
> > well, as long as the suffixes above apply to directories and not to
> > files then i think we'd be ok. if the extensions above will apply to
> > files then we have a problem.
>
> Once imported, the contents of the vdisk directory are private to vdisk.
> The name of the containing directory can be anything.
>
> That is, an import consists of taking the foo.raw file, and putting it
> into a directory along with an XML properties file.
>

so in this context, an import is one method for creating a vdisk.

> > so previously if the user specified:
> > file:///.../foo.raw
> > then we would create a zpool directly within the file, no label.
> >
> > and if the user specified:
> > vfile:///.../foo.raw
> >
> > then we would use lofi with the newly proposed -l option to access the
> > file, then we'd put a label on it (via the lofi device), and then create
> > a zpool in one of the partitions (and once again, zfs would access the
> > file through the lofi device).
> >
> > so in the two cases, how can we make the access mode determination
> > without having the seperate uri syntax?
>
> In the creation case, which I think we're talking about above, we create
> the vdisk directory (rather than direct file access, which vdiskadm
> can't do, even though vdisk itself can) and the container format is
> clear.
>
> If we want to configure access to a pre-existing raw file, I'm not sure
> we'd be doing the labelling ourselves. Perhaps I don't quite understand
> the use cases for what you're suggesting.
>

the two use cases above were creation use cases.

i think part of the confusion here is that in the raw case, i thought a
vdisk would just have a file, not a directory with an xml file and the
disk file. (when i was using xvm that was the format of all the vdisks
i created.)

the other part of the confusion is that i was trying to support
automatic creation for raw vdisks.

if we only support vdisks created via vdiskadm(1m), then we'll always
have a directory and we can always use vdiskadm(1m) to sniff out if it's
a valid vdisk and access it as such.

then for the implicit creation case we'll just support files containing
a zpool.
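
as a rough sketch of that rule (python, illustration only; the access_mode name and its return values are hypothetical, and the vdiskadm(1m) validation of a directory is not shown):

```python
import os


def access_mode(path: str) -> str:
    """Decide how a storage object path would be accessed.

    "vdisk"       - a directory: candidate vdisk, still to be
                    validated with vdiskadm(1m)
    "file"        - an existing regular file holding a zpool directly
    "create-file" - nothing there yet: implicit creation, which only
                    ever produces a plain file containing a zpool
    """
    if os.path.isdir(path):
        return "vdisk"
    if os.path.isfile(path):
        return "file"
    if not os.path.exists(path):
        return "create-file"
    raise ValueError("unsupported storage object type: %r" % path)
```

with this split there is no case left where the uri scheme is needed to tell the two access modes apart.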

sound good?

> > > How about defaulting to the owner of the containing directory? If it's
> > > root, you won't be able to write if you're root-squashed (or not root
> > > user) anyway.
> > >
> > > Failing that, I'd indeed prefer a different user, especially one that's
> > > configurable in terms of uid/gid.
> >
> > if a directory is owned by a non-root user and i want to create a file
> > there, i think it's a great idea to switch to the uid of the directory
> > owner todo my file operations. i'll add that to the proposal.
> >
> > but, say i'm on a host that is not subject to root squashing and i need
> > to create a file on a share that is only writable by root. in that
> > case, should i go ahead and create a file owned by root? imho, no.
> > instead, i'd rather create the file as some other user.
>
> I don't agree that second-guessing the user's intentions when they've
> explicitly disabled root-squashing is a useful behaviour.
>
> > if the administrator then tries to migrate that zone to a host that is
> > subject to root squashing from the server, then i'd lose access to that
> > file. eliminating all file accesses as root allows us to avoid
> > root-squashing and just help eliminate potential failure modes.
>
> Replacing it with a potentially more subtle issue: what's the zonenfs
> user ID, is it configured on the server, and is it unique and reserved
> across the organisation, and across all OSes?
>
> Having access fail with a clear message is an understandable failure
> mode, with a clear remedy: use a suitable uid /chosen by the
> administrator/. NFS users are surely comfortable and familiar with root
> squashing by now.
>
> Having a MySQL security hole allow access to all your virtual shared
> storage is a much more subtle problem (yes, I discovered despite my
> initial research that UID 60 is used by some Linux machines as mysqld).
>

ok. so how about we just generate an error if we need to create a file,
and an explicit user id has not been specified, and root squashing is
enabled. (because under these conditions we'd generate a file owned by
nobody.)
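
putting the pieces from this thread together, the choice of uid for file creation could be sketched as follows. this is one reading of the discussion, not settled design: the creation_uid and CreationError names are hypothetical, and how root squashing gets detected is left open here.

```python
from typing import Optional


class CreationError(Exception):
    """Raised when no safe uid exists for creating the storage object."""


def creation_uid(
    explicit_uid: Optional[int],
    dir_owner_uid: int,
    root_squashed: bool,
) -> int:
    """Pick the uid used to create a storage object file.

    1. An explicit 'user' setting always wins.
    2. Otherwise use the owner of the containing directory.
    3. If that owner is root and root is squashed, fail with a clear
       error instead of creating a file owned by nobody.
    """
    if explicit_uid is not None:
        return explicit_uid
    if dir_owner_uid != 0:
        return dir_owner_uid
    if root_squashed:
        raise CreationError(
            "root squashing is enabled and no user was specified"
        )
    return 0
```

the error in step 3 is the clear failure mode john asked for: the remedy is for the administrator to choose a uid.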

ed


johnlev

Posts: 852
From: GB

Registered: 3/9/05
Re: [zones-discuss] zones on shared storage proposal
Posted: Sep 21, 2009 12:21 PM   in response to: edp

On Mon, Sep 21, 2009 at 10:23:21AM -0700, Edward Pilatowicz wrote:

> if we only support vdisks created via vdiskadm(1m), then we'll always
> have a directory and we can always use vdiskadm(1m) to sniff out if it's
> a valid vdisk and access it as such.
>
> then for the implicit creation case we'll just support files containing
> a zpool.
>
> sound good?

Yes.

> ok. so how about we just generate an error if we need to create a file,
> and an explicity user id has not been specified, and root squashing is
> enabled. (because under these conditions we'd generate a file owned by
> nobody.)

Sounds good to me.

regards
john




