509 lines
22 KiB
Plaintext
509 lines
22 KiB
Plaintext
|
|
configfs - Userspace-driven kernel object configuration.
|
|
|
|
Joel Becker <joel.becker@oracle.com>
|
|
|
|
Updated: 31 March 2005
|
|
|
|
Copyright (c) 2005 Oracle Corporation,
|
|
Joel Becker <joel.becker@oracle.com>
|
|
|
|
|
|
[What is configfs?]
|
|
|
|
configfs is a ram-based filesystem that provides the converse of
|
|
sysfs's functionality. Where sysfs is a filesystem-based view of
|
|
kernel objects, configfs is a filesystem-based manager of kernel
|
|
objects, or config_items.
|
|
|
|
With sysfs, an object is created in kernel (for example, when a device
|
|
is discovered) and it is registered with sysfs. Its attributes then
|
|
appear in sysfs, allowing userspace to read the attributes via
|
|
readdir(3)/read(2). It may allow some attributes to be modified via
|
|
write(2). The important point is that the object is created and
|
|
destroyed in kernel, the kernel controls the lifecycle of the sysfs
|
|
representation, and sysfs is merely a window on all this.
|
|
|
|
A configfs config_item is created via an explicit userspace operation:
|
|
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
|
|
mkdir(2) time, and can be read or modified via read(2) and write(2).
|
|
As with sysfs, readdir(3) queries the list of items and/or attributes.
|
|
symlink(2) can be used to group items together. Unlike sysfs, the
|
|
lifetime of the representation is completely driven by userspace. The
|
|
kernel modules backing the items must respond to this.
|
|
|
|
Both sysfs and configfs can and should exist together on the same
|
|
system. One is not a replacement for the other.
|
|
|
|
[Using configfs]
|
|
|
|
configfs can be compiled as a module or into the kernel. You can access
|
|
it by doing
|
|
|
|
mount -t configfs none /config
|
|
|
|
The configfs tree will be empty unless client modules are also loaded.
|
|
These are modules that register their item types with configfs as
|
|
subsystems. Once a client subsystem is loaded, it will appear as a
|
|
subdirectory (or more than one) under /config. Like sysfs, the
|
|
configfs tree is always there, whether mounted on /config or not.
|
|
|
|
An item is created via mkdir(2). The item's attributes will also
|
|
appear at this time. readdir(3) can determine what the attributes are,
|
|
read(2) can query their default values, and write(2) can store new
|
|
values. Don't mix more than one attribute in one attribute file.
|
|
|
|
There are two types of configfs attributes:
|
|
|
|
* Normal attributes, which similar to sysfs attributes, are small ASCII text
|
|
files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
|
|
only one value per file should be used, and the same caveats from sysfs apply.
|
|
Configfs expects write(2) to store the entire buffer at once. When writing to
|
|
normal configfs attributes, userspace processes should first read the entire
|
|
file, modify the portions they wish to change, and then write the entire
|
|
buffer back.
|
|
|
|
* Binary attributes, which are somewhat similar to sysfs binary attributes,
|
|
but with a few slight changes to semantics. The PAGE_SIZE limitation does not
|
|
apply, but the whole binary item must fit in single kernel vmalloc'ed buffer.
|
|
The write(2) calls from user space are buffered, and the attributes'
|
|
write_bin_attribute method will be invoked on the final close, therefore it is
|
|
imperative for user-space to check the return code of close(2) in order to
|
|
verify that the operation finished successfully.
|
|
To avoid a malicious user OOMing the kernel, there's a per-binary attribute
|
|
maximum buffer value.
|
|
|
|
When an item needs to be destroyed, remove it with rmdir(2). An
|
|
item cannot be destroyed if any other item has a link to it (via
|
|
symlink(2)). Links can be removed via unlink(2).
|
|
|
|
[Configuring FakeNBD: an Example]
|
|
|
|
Imagine there's a Network Block Device (NBD) driver that allows you to
|
|
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
|
|
for its configuration. Obviously, there will be a nice program that
|
|
sysadmins use to configure FakeNBD, but somehow that program has to tell
|
|
the driver about it. Here's where configfs comes in.
|
|
|
|
When the FakeNBD driver is loaded, it registers itself with configfs.
|
|
readdir(3) sees this just fine:
|
|
|
|
# ls /config
|
|
fakenbd
|
|
|
|
A fakenbd connection can be created with mkdir(2). The name is
|
|
arbitrary, but likely the tool will make some use of the name. Perhaps
|
|
it is a uuid or a disk name:
|
|
|
|
# mkdir /config/fakenbd/disk1
|
|
# ls /config/fakenbd/disk1
|
|
target device rw
|
|
|
|
The target attribute contains the IP address of the server FakeNBD will
|
|
connect to. The device attribute is the device on the server.
|
|
Predictably, the rw attribute determines whether the connection is
|
|
read-only or read-write.
|
|
|
|
# echo 10.0.0.1 > /config/fakenbd/disk1/target
|
|
# echo /dev/sda1 > /config/fakenbd/disk1/device
|
|
# echo 1 > /config/fakenbd/disk1/rw
|
|
|
|
That's it. That's all there is. Now the device is configured, via the
|
|
shell no less.
|
|
|
|
[Coding With configfs]
|
|
|
|
Every object in configfs is a config_item. A config_item reflects an
|
|
object in the subsystem. It has attributes that match values on that
|
|
object. configfs handles the filesystem representation of that object
|
|
and its attributes, allowing the subsystem to ignore all but the
|
|
basic show/store interaction.
|
|
|
|
Items are created and destroyed inside a config_group. A group is a
|
|
collection of items that share the same attributes and operations.
|
|
Items are created by mkdir(2) and removed by rmdir(2), but configfs
|
|
handles that. The group has a set of operations to perform these tasks
|
|
|
|
A subsystem is the top level of a client module. During initialization,
|
|
the client module registers the subsystem with configfs, the subsystem
|
|
appears as a directory at the top of the configfs filesystem. A
|
|
subsystem is also a config_group, and can do everything a config_group
|
|
can.
|
|
|
|
[struct config_item]
|
|
|
|
struct config_item {
|
|
char *ci_name;
|
|
char ci_namebuf[UOBJ_NAME_LEN];
|
|
struct kref ci_kref;
|
|
struct list_head ci_entry;
|
|
struct config_item *ci_parent;
|
|
struct config_group *ci_group;
|
|
struct config_item_type *ci_type;
|
|
struct dentry *ci_dentry;
|
|
};
|
|
|
|
void config_item_init(struct config_item *);
|
|
void config_item_init_type_name(struct config_item *,
|
|
const char *name,
|
|
struct config_item_type *type);
|
|
struct config_item *config_item_get(struct config_item *);
|
|
void config_item_put(struct config_item *);
|
|
|
|
Generally, struct config_item is embedded in a container structure, a
|
|
structure that actually represents what the subsystem is doing. The
|
|
config_item portion of that structure is how the object interacts with
|
|
configfs.
|
|
|
|
Whether statically defined in a source file or created by a parent
|
|
config_group, a config_item must have one of the _init() functions
|
|
called on it. This initializes the reference count and sets up the
|
|
appropriate fields.
|
|
|
|
All users of a config_item should have a reference on it via
|
|
config_item_get(), and drop the reference when they are done via
|
|
config_item_put().
|
|
|
|
By itself, a config_item cannot do much more than appear in configfs.
|
|
Usually a subsystem wants the item to display and/or store attributes,
|
|
among other things. For that, it needs a type.
|
|
|
|
[struct config_item_type]
|
|
|
|
struct configfs_item_operations {
|
|
void (*release)(struct config_item *);
|
|
int (*allow_link)(struct config_item *src,
|
|
struct config_item *target);
|
|
void (*drop_link)(struct config_item *src,
|
|
struct config_item *target);
|
|
};
|
|
|
|
struct config_item_type {
|
|
struct module *ct_owner;
|
|
struct configfs_item_operations *ct_item_ops;
|
|
struct configfs_group_operations *ct_group_ops;
|
|
struct configfs_attribute **ct_attrs;
|
|
struct configfs_bin_attribute **ct_bin_attrs;
|
|
};
|
|
|
|
The most basic function of a config_item_type is to define what
|
|
operations can be performed on a config_item. All items that have been
|
|
allocated dynamically will need to provide the ct_item_ops->release()
|
|
method. This method is called when the config_item's reference count
|
|
reaches zero.
|
|
|
|
[struct configfs_attribute]
|
|
|
|
struct configfs_attribute {
|
|
char *ca_name;
|
|
struct module *ca_owner;
|
|
umode_t ca_mode;
|
|
ssize_t (*show)(struct config_item *, char *);
|
|
ssize_t (*store)(struct config_item *, const char *, size_t);
|
|
};
|
|
|
|
When a config_item wants an attribute to appear as a file in the item's
|
|
configfs directory, it must define a configfs_attribute describing it.
|
|
It then adds the attribute to the NULL-terminated array
|
|
config_item_type->ct_attrs. When the item appears in configfs, the
|
|
attribute file will appear with the configfs_attribute->ca_name
|
|
filename. configfs_attribute->ca_mode specifies the file permissions.
|
|
|
|
If an attribute is readable and provides a ->show method, that method will
|
|
be called whenever userspace asks for a read(2) on the attribute. If an
|
|
attribute is writable and provides a ->store method, that method will be
|
|
be called whenever userspace asks for a write(2) on the attribute.
|
|
|
|
[struct configfs_bin_attribute]
|
|
|
|
struct configfs_attribute {
|
|
struct configfs_attribute cb_attr;
|
|
void *cb_private;
|
|
size_t cb_max_size;
|
|
};
|
|
|
|
The binary attribute is used when the one needs to use binary blob to
|
|
appear as the contents of a file in the item's configfs directory.
|
|
To do so add the binary attribute to the NULL-terminated array
|
|
config_item_type->ct_bin_attrs, and the item appears in configfs, the
|
|
attribute file will appear with the configfs_bin_attribute->cb_attr.ca_name
|
|
filename. configfs_bin_attribute->cb_attr.ca_mode specifies the file
|
|
permissions.
|
|
The cb_private member is provided for use by the driver, while the
|
|
cb_max_size member specifies the maximum amount of vmalloc buffer
|
|
to be used.
|
|
|
|
If binary attribute is readable and the config_item provides a
|
|
ct_item_ops->read_bin_attribute() method, that method will be called
|
|
whenever userspace asks for a read(2) on the attribute. The converse
|
|
will happen for write(2). The reads/writes are bufferred so only a
|
|
single read/write will occur; the attributes' need not concern itself
|
|
with it.
|
|
|
|
[struct config_group]
|
|
|
|
A config_item cannot live in a vacuum. The only way one can be created
|
|
is via mkdir(2) on a config_group. This will trigger creation of a
|
|
child item.
|
|
|
|
struct config_group {
|
|
struct config_item cg_item;
|
|
struct list_head cg_children;
|
|
struct configfs_subsystem *cg_subsys;
|
|
struct list_head default_groups;
|
|
struct list_head group_entry;
|
|
};
|
|
|
|
void config_group_init(struct config_group *group);
|
|
void config_group_init_type_name(struct config_group *group,
|
|
const char *name,
|
|
struct config_item_type *type);
|
|
|
|
|
|
The config_group structure contains a config_item. Properly configuring
|
|
that item means that a group can behave as an item in its own right.
|
|
However, it can do more: it can create child items or groups. This is
|
|
accomplished via the group operations specified on the group's
|
|
config_item_type.
|
|
|
|
struct configfs_group_operations {
|
|
struct config_item *(*make_item)(struct config_group *group,
|
|
const char *name);
|
|
struct config_group *(*make_group)(struct config_group *group,
|
|
const char *name);
|
|
int (*commit_item)(struct config_item *item);
|
|
void (*disconnect_notify)(struct config_group *group,
|
|
struct config_item *item);
|
|
void (*drop_item)(struct config_group *group,
|
|
struct config_item *item);
|
|
};
|
|
|
|
A group creates child items by providing the
|
|
ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new
|
|
config_item (or more likely, its container structure), initializes it,
|
|
and returns it to configfs. Configfs will then populate the filesystem
|
|
tree to reflect the new item.
|
|
|
|
If the subsystem wants the child to be a group itself, the subsystem
|
|
provides ct_group_ops->make_group(). Everything else behaves the same,
|
|
using the group _init() functions on the group.
|
|
|
|
Finally, when userspace calls rmdir(2) on the item or group,
|
|
ct_group_ops->drop_item() is called. As a config_group is also a
|
|
config_item, it is not necessary for a separate drop_group() method.
|
|
The subsystem must config_item_put() the reference that was initialized
|
|
upon item allocation. If a subsystem has no work to do, it may omit
|
|
the ct_group_ops->drop_item() method, and configfs will call
|
|
config_item_put() on the item on behalf of the subsystem.
|
|
|
|
IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2)
|
|
is called, configfs WILL remove the item from the filesystem tree
|
|
(assuming that it has no children to keep it busy). The subsystem is
|
|
responsible for responding to this. If the subsystem has references to
|
|
the item in other threads, the memory is safe. It may take some time
|
|
for the item to actually disappear from the subsystem's usage. But it
|
|
is gone from configfs.
|
|
|
|
When drop_item() is called, the item's linkage has already been torn
|
|
down. It no longer has a reference on its parent and has no place in
|
|
the item hierarchy. If a client needs to do some cleanup before this
|
|
teardown happens, the subsystem can implement the
|
|
ct_group_ops->disconnect_notify() method. The method is called after
|
|
configfs has removed the item from the filesystem view but before the
|
|
item is removed from its parent group. Like drop_item(),
|
|
disconnect_notify() is void and cannot fail. Client subsystems should
|
|
not drop any references here, as they still must do it in drop_item().
|
|
|
|
A config_group cannot be removed while it still has child items. This
|
|
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
|
|
called, as the item has not been dropped. rmdir(2) will fail, as the
|
|
directory is not empty.
|
|
|
|
[struct configfs_subsystem]
|
|
|
|
A subsystem must register itself, usually at module_init time. This
|
|
tells configfs to make the subsystem appear in the file tree.
|
|
|
|
struct configfs_subsystem {
|
|
struct config_group su_group;
|
|
struct mutex su_mutex;
|
|
};
|
|
|
|
int configfs_register_subsystem(struct configfs_subsystem *subsys);
|
|
void configfs_unregister_subsystem(struct configfs_subsystem *subsys);
|
|
|
|
A subsystem consists of a toplevel config_group and a mutex.
|
|
The group is where child config_items are created. For a subsystem,
|
|
this group is usually defined statically. Before calling
|
|
configfs_register_subsystem(), the subsystem must have initialized the
|
|
group via the usual group _init() functions, and it must also have
|
|
initialized the mutex.
|
|
When the register call returns, the subsystem is live, and it
|
|
will be visible via configfs. At that point, mkdir(2) can be called and
|
|
the subsystem must be ready for it.
|
|
|
|
[An Example]
|
|
|
|
The best example of these basic concepts is the simple_children
|
|
subsystem/group and the simple_child item in
|
|
samples/configfs/configfs_sample.c. It shows a trivial object displaying
|
|
and storing an attribute, and a simple group creating and destroying
|
|
these children.
|
|
|
|
[Hierarchy Navigation and the Subsystem Mutex]
|
|
|
|
There is an extra bonus that configfs provides. The config_groups and
|
|
config_items are arranged in a hierarchy due to the fact that they
|
|
appear in a filesystem. A subsystem is NEVER to touch the filesystem
|
|
parts, but the subsystem might be interested in this hierarchy. For
|
|
this reason, the hierarchy is mirrored via the config_group->cg_children
|
|
and config_item->ci_parent structure members.
|
|
|
|
A subsystem can navigate the cg_children list and the ci_parent pointer
|
|
to see the tree created by the subsystem. This can race with configfs'
|
|
management of the hierarchy, so configfs uses the subsystem mutex to
|
|
protect modifications. Whenever a subsystem wants to navigate the
|
|
hierarchy, it must do so under the protection of the subsystem
|
|
mutex.
|
|
|
|
A subsystem will be prevented from acquiring the mutex while a newly
|
|
allocated item has not been linked into this hierarchy. Similarly, it
|
|
will not be able to acquire the mutex while a dropping item has not
|
|
yet been unlinked. This means that an item's ci_parent pointer will
|
|
never be NULL while the item is in configfs, and that an item will only
|
|
be in its parent's cg_children list for the same duration. This allows
|
|
a subsystem to trust ci_parent and cg_children while they hold the
|
|
mutex.
|
|
|
|
[Item Aggregation Via symlink(2)]
|
|
|
|
configfs provides a simple group via the group->item parent/child
|
|
relationship. Often, however, a larger environment requires aggregation
|
|
outside of the parent/child connection. This is implemented via
|
|
symlink(2).
|
|
|
|
A config_item may provide the ct_item_ops->allow_link() and
|
|
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
|
|
symlink(2) may be called with the config_item as the source of the link.
|
|
These links are only allowed between configfs config_items. Any
|
|
symlink(2) attempt outside the configfs filesystem will be denied.
|
|
|
|
When symlink(2) is called, the source config_item's ->allow_link()
|
|
method is called with itself and a target item. If the source item
|
|
allows linking to target item, it returns 0. A source item may wish to
|
|
reject a link if it only wants links to a certain type of object (say,
|
|
in its own subsystem).
|
|
|
|
When unlink(2) is called on the symbolic link, the source item is
|
|
notified via the ->drop_link() method. Like the ->drop_item() method,
|
|
this is a void function and cannot return failure. The subsystem is
|
|
responsible for responding to the change.
|
|
|
|
A config_item cannot be removed while it links to any other item, nor
|
|
can it be removed while an item links to it. Dangling symlinks are not
|
|
allowed in configfs.
|
|
|
|
[Automatically Created Subgroups]
|
|
|
|
A new config_group may want to have two types of child config_items.
|
|
While this could be codified by magic names in ->make_item(), it is much
|
|
more explicit to have a method whereby userspace sees this divergence.
|
|
|
|
Rather than have a group where some items behave differently than
|
|
others, configfs provides a method whereby one or many subgroups are
|
|
automatically created inside the parent at its creation. Thus,
|
|
mkdir("parent") results in "parent", "parent/subgroup1", up through
|
|
"parent/subgroupN". Items of type 1 can now be created in
|
|
"parent/subgroup1", and items of type N can be created in
|
|
"parent/subgroupN".
|
|
|
|
These automatic subgroups, or default groups, do not preclude other
|
|
children of the parent group. If ct_group_ops->make_group() exists,
|
|
other child groups can be created on the parent group directly.
|
|
|
|
A configfs subsystem specifies default groups by adding them using the
|
|
configfs_add_default_group() function to the parent config_group
|
|
structure. Each added group is populated in the configfs tree at the same
|
|
time as the parent group. Similarly, they are removed at the same time
|
|
as the parent. No extra notification is provided. When a ->drop_item()
|
|
method call notifies the subsystem the parent group is going away, it
|
|
also means every default group child associated with that parent group.
|
|
|
|
As a consequence of this, default groups cannot be removed directly via
|
|
rmdir(2). They also are not considered when rmdir(2) on the parent
|
|
group is checking for children.
|
|
|
|
[Dependent Subsystems]
|
|
|
|
Sometimes other drivers depend on particular configfs items. For
|
|
example, ocfs2 mounts depend on a heartbeat region item. If that
|
|
region item is removed with rmdir(2), the ocfs2 mount must BUG or go
|
|
readonly. Not happy.
|
|
|
|
configfs provides two additional API calls: configfs_depend_item() and
|
|
configfs_undepend_item(). A client driver can call
|
|
configfs_depend_item() on an existing item to tell configfs that it is
|
|
depended on. configfs will then return -EBUSY from rmdir(2) for that
|
|
item. When the item is no longer depended on, the client driver calls
|
|
configfs_undepend_item() on it.
|
|
|
|
These API cannot be called underneath any configfs callbacks, as
|
|
they will conflict. They can block and allocate. A client driver
|
|
probably shouldn't calling them of its own gumption. Rather it should
|
|
be providing an API that external subsystems call.
|
|
|
|
How does this work? Imagine the ocfs2 mount process. When it mounts,
|
|
it asks for a heartbeat region item. This is done via a call into the
|
|
heartbeat code. Inside the heartbeat code, the region item is looked
|
|
up. Here, the heartbeat code calls configfs_depend_item(). If it
|
|
succeeds, then heartbeat knows the region is safe to give to ocfs2.
|
|
If it fails, it was being torn down anyway, and heartbeat can gracefully
|
|
pass up an error.
|
|
|
|
[Committable Items]
|
|
|
|
NOTE: Committable items are currently unimplemented.
|
|
|
|
Some config_items cannot have a valid initial state. That is, no
|
|
default values can be specified for the item's attributes such that the
|
|
item can do its work. Userspace must configure one or more attributes,
|
|
after which the subsystem can start whatever entity this item
|
|
represents.
|
|
|
|
Consider the FakeNBD device from above. Without a target address *and*
|
|
a target device, the subsystem has no idea what block device to import.
|
|
The simple example assumes that the subsystem merely waits until all the
|
|
appropriate attributes are configured, and then connects. This will,
|
|
indeed, work, but now every attribute store must check if the attributes
|
|
are initialized. Every attribute store must fire off the connection if
|
|
that condition is met.
|
|
|
|
Far better would be an explicit action notifying the subsystem that the
|
|
config_item is ready to go. More importantly, an explicit action allows
|
|
the subsystem to provide feedback as to whether the attributes are
|
|
initialized in a way that makes sense. configfs provides this as
|
|
committable items.
|
|
|
|
configfs still uses only normal filesystem operations. An item is
|
|
committed via rename(2). The item is moved from a directory where it
|
|
can be modified to a directory where it cannot.
|
|
|
|
Any group that provides the ct_group_ops->commit_item() method has
|
|
committable items. When this group appears in configfs, mkdir(2) will
|
|
not work directly in the group. Instead, the group will have two
|
|
subdirectories: "live" and "pending". The "live" directory does not
|
|
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
|
|
"pending" directory does allow mkdir(2) and rmdir(2). An item is
|
|
created in the "pending" directory. Its attributes can be modified at
|
|
will. Userspace commits the item by renaming it into the "live"
|
|
directory. At this point, the subsystem receives the ->commit_item()
|
|
callback. If all required attributes are filled to satisfaction, the
|
|
method returns zero and the item is moved to the "live" directory.
|
|
|
|
As rmdir(2) does not work in the "live" directory, an item must be
|
|
shutdown, or "uncommitted". Again, this is done via rename(2), this
|
|
time from the "live" directory back to the "pending" one. The subsystem
|
|
is notified by the ct_group_ops->uncommit_object() method.
|
|
|
|
|