=============
dm-log-writes
=============

This target takes two devices: one that all IO passes through normally, and one
that all of the write operations are logged to. It is intended for file system
developers who wish to verify the integrity of metadata or data as the file
system is written to. A log_write_entry is written for every WRITE request, and
the target can take arbitrary data from userspace to insert into the log. The
data in the WRITE requests is copied into the log so that the replay happens
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This makes it easier for userspace to replay the log
in a way that correlates to what is on disk rather than what is in cache, which
in turn makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures. Consider the following example (W means write, C means complete)::

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
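
The ordering rule above can be illustrated with a small shell model. This is only a sketch of the described behaviour, not the kernel implementation: the function name ``simulate_log`` and the event names are assumptions made for illustration, and REQ_FUA handling is not modelled.

```shell
# Toy model of the logging order described above; not the kernel code.
# Events: W<n> = write issued, C<n> = write completed,
#         Wflush = REQ_PREFLUSH issued, Cflush = flush completed.
simulate_log() (
    pending=""    # completed writes waiting for the next flush
    attached=""   # writes spliced onto the in-flight flush
    log=""        # resulting log order, comma separated
    for ev in "$@"; do
        case "$ev" in
            Wflush) attached=$pending; pending="" ;;
            Cflush) log="${log}${attached}flush,"; attached="" ;;
            C*)     pending="${pending}W${ev#C}," ;;
            W*)     ;;  # issuing a write logs nothing by itself
        esac
    done
    # Writes still pending here would only be logged after a later flush.
    printf '%s\n' "${log%,}"
)

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
```

For the example sequence this prints ``W3,W2,flush``. W1 completed after the flush was issued, so in this model it stays pending until a later flush, which matches the trailing ``W1....`` in the log shown above.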

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
log all the DISCARD requests, then the WRITE requests, and then the FLUSH
request. Consider the following example::

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you are fsck'ing something reasonable. You would do something
        like this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check at the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".
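
As an illustration of the status fields, the following sketch splits one status line into its two values. It assumes the usual ``dmsetup status <name>`` layout of ``<start> <length> <target type>`` followed by the target's status fields. The helper name ``parse_log_writes_status`` and the numbers in the sample line are made up for the example.

```shell
# Hypothetical helper: split one `dmsetup status` line for a log-writes target.
# Assumes the layout "<start> <length> <target type> <status fields...>".
parse_log_writes_status() {
    # Re-split the single argument on whitespace into $1..$5.
    set -- $1
    [ "$3" = "log-writes" ] || return 1
    printf 'entries=%s highest_sector=%s\n' "$4" "$5"
}

# Sample line with made-up numbers; on a live system you would pass
# "$(dmsetup status log)" instead.
parse_log_writes_status "0 2097152 log-writes 1532 409599"
```

The helper returns a non-zero status for lines that belong to other target types, so it can be used in a loop over several devices.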

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify the md5sums match>
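
The final verification step could be scripted roughly as follows. This is a sketch only: the helper name ``same_md5`` is hypothetical, and the two sample lines stand in for the md5sum output captured before the unmount and after the replay.

```shell
# Hypothetical helper: compare two checksum lines as printed by md5sum.
# Only the first field (the digest) is compared; file names are ignored.
same_md5() {
    [ "${1%% *}" = "${2%% *}" ]
}

# In the workflow above you would capture the real values, for example:
#   before=$(md5sum /mnt/btrfs-test/foo)   # taken before umount
#   after=$(md5sum /mnt/btrfs-test/foo)    # taken after replay and mount
# The digests below are placeholders for illustration.
before="d41d8cd98f00b204e9800998ecf8427e  /mnt/btrfs-test/foo"
after="d41d8cd98f00b204e9800998ecf8427e  /mnt/btrfs-test/foo"

if same_md5 "$before" "$after"; then
    echo "fsync'ed data survived the replay"
else
    echo "checksum mismatch after replay" >&2
fi
```

A mismatch here would suggest that data reported as fsync'ed was not fully on disk at the fsync mark.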

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
    --fsck "btrfsck /dev/sdb" --check fua

This will replay the log until it sees a FUA request and then run the fsck
command. If the fsck passes, it will replay to the next FUA, and so on until
the replay is completed or the fsck command exits abnormally.