IO handling of the md module
md_alloc() sets the md_make_request function as the bio handler:
md_alloc():
blk_queue_make_request(mddev->queue, md_make_request);
So bio handling of the md driver is done by md_make_request().
What md_make_request does, briefly:
- call mddev->pers->make_request, which is raid1.c:make_request for our RAID-1 array
- update the IO statistics of the md device
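A condensed sketch of md_make_request, based on the v4.x kernel sources (the checks for a suspended array and empty bios are omitted here), shows both steps:

static blk_qc_t md_make_request(struct request_queue *q, struct bio *bio)
{
    const int rw = bio_data_dir(bio);
    struct mddev *mddev = q->queuedata;
    unsigned int sectors = bio_sectors(bio);
    int cpu;

    /* delegate the bio to the personality, raid1.c:make_request here */
    mddev->pers->make_request(mddev, bio);

    /* account the IO in the statistics of the md device */
    cpu = part_stat_lock();
    part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
    part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
    part_stat_unlock();

    return BLK_QC_T_NONE;
}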
raid1.c:make_request
First it creates an r1bio object:
r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
The comment for r1bio says:
/*
 * this is our 'private' RAID1 bio.
 *
 * it contains information about what kind of IO operations were started
 * for this RAID1 operation, and about their status:
 */
struct r1bio {
We created a RAID-1 device, so a bio must go to both of the two raid-devices. Only one bio is sent to the md device, so make_request creates two bios and sends one to each raid-device. Those two bios are managed in the bios field of the r1bio structure.
The following is the definition of the bios field of the r1bio structure.
/*
 * if the IO is in WRITE direction, then multiple bios are used.
 * We choose the number when they are allocated.
 */
struct bio *bios[0];
/* DO NOT PUT ANY NEW FIELDS HERE - bios array is contiguously alloced*/
};
Actually a READ operation reads from only one raid-device, because the two raid-devices hold the same data. But a WRITE operation must be done on all raid-devices.
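For reference, these are the fields of struct r1bio that this document uses, abridged from raid1.h (the remaining fields are omitted):

struct r1bio {
    atomic_t        remaining;   /* completion counter for WRITE */
    sector_t        sector;      /* start sector of this IO */
    int             sectors;
    unsigned long   state;       /* bit flags, e.g. R1BIO_Uptodate */
    struct mddev    *mddev;
    struct bio      *master_bio; /* the original bio */
    ...
    struct bio      *bios[0];    /* one bio per raid-device */
};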
Let's go ahead.
r1_bio->master_bio = bio;
r1_bio->sectors = bio_sectors(bio);
r1_bio->state = 0;
r1_bio->mddev = mddev;
r1_bio->sector = bio->bi_iter.bi_sector;
Now r1_bio carries the information about the original bio.
The next code is separated into READ and WRITE paths.
READ operation
First it finds the disk that will handle the READ operation:
rdisk = read_balance(conf, r1_bio, &max_sectors);
if (rdisk < 0) {
    /* couldn't find anywhere to read from */
    raid_end_bio_io(r1_bio);
    return;
}
mirror = conf->mirrors + rdisk;
conf->mirrors holds all raid-devices, so mirror points at the entry of the selected disk and gives us the pointer to its rdev object.
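read_balance itself is rather long, so the following is only a rough sketch of its core idea, not the real function (read_balance_sketch is a made-up name): skip mirrors we cannot read from, and prefer the disk whose last head position is closest to the requested sector.

/* simplified illustration of the read_balance idea */
static int read_balance_sketch(struct r1conf *conf, struct r1bio *r1_bio)
{
    int i, best_disk = -1;
    sector_t best_dist = MaxSector;

    for (i = 0; i < conf->raid_disks * 2; i++) {
        struct md_rdev *rdev = conf->mirrors[i].rdev;
        sector_t dist;

        if (!rdev || test_bit(Faulty, &rdev->flags))
            continue; /* this mirror cannot serve the READ */

        /* prefer the disk whose head is nearest to the target sector */
        dist = abs(r1_bio->sector - conf->mirrors[i].head_position);
        if (dist < best_dist) {
            best_dist = dist;
            best_disk = i;
        }
    }
    return best_disk; /* -1 means no readable mirror was found */
}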
read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
A new bio is cloned from the original bio using the bio set of mddev.
read_bio->bi_iter.bi_sector = r1_bio->sector +
    mirror->rdev->data_offset;
read_bio->bi_bdev = mirror->rdev->bdev;
read_bio->bi_end_io = raid1_end_read_request;
read_bio->bi_rw = READ | do_sync;
read_bio->bi_private = r1_bio;
After the bio completes, raid1_end_read_request will be called.
generic_make_request(read_bio);
Now the bio is sent to the request queue of vda or vdb. I checked with gdb which function processes the bio.
(gdb) p read_bio->bi_bdev->bd_disk->queue->make_request_fn
$31 = (make_request_fn *) 0xffffffff813a2130 <blk_sq_make_request>
blk_sq_make_request gets the bio and sends it into the virtio driver, because QEMU provides the block disks via the virtio block driver. You can check the details in the blk_mq_init_allocated_queue and virtblk_probe functions.
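At the bottom of this path, generic_make_request simply finds the request queue of the target device and calls its make_request_fn. Very condensed (the recursion control via current->bio_list is omitted):

blk_qc_t generic_make_request(struct bio *bio)
{
    struct request_queue *q = bdev_get_queue(bio->bi_bdev);
    ...
    return q->make_request_fn(q, bio); /* blk_sq_make_request for vda/vdb */
}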
raid1_end_read_request
Let's check what the raid1 module does to complete the bio processing.
int uptodate = !bio->bi_error;
...
if (uptodate)
    set_bit(R1BIO_Uptodate, &r1_bio->state);
...
if (uptodate) {
    raid_end_bio_io(r1_bio);
The code above shows that raid1_end_read_request calls raid_end_bio_io if there was no error from the vda/vdb device. raid_end_bio_io then completes the original bio and frees the r1_bio object.
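A condensed raid_end_bio_io (the sync-IO accounting is omitted) looks like this:

static void raid_end_bio_io(struct r1bio *r1_bio)
{
    /* complete the master bio that was submitted to the md device */
    if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state))
        call_bio_endio(r1_bio); /* ends up in bio_endio() */

    free_r1bio(r1_bio); /* mempool_free back into conf->r1bio_pool */
}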
WRITE operation
The first loop stores the bio pointer in each slot of bios[]:
disks = conf->raid_disks * 2;
retry_write:
rcu_read_lock();
max_sectors = r1_bio->sectors;
for (i = 0; i < disks; i++) {
    struct md_rdev *rdev = rcu_dereference(conf->mirrors[i].rdev);
    r1_bio->bios[i] = NULL;
    atomic_inc(&rdev->nr_pending);
    r1_bio->bios[i] = bio;
}
rcu_read_unlock();
The bios[] array of the r1_bio object stores the bios that will be sent to the disks, vda/vdb.
We have two raid-disks, so the bios array has four elements: two for the disks and two for replacement disks.
If there is no error on the disks, the first loop amounts to bios[0] = bios[1] = bio.
The following shows the values of the bios array:
(gdb) p r1_bio->bios[0]
$38 = (struct bio *) 0xffff880005ef4200
(gdb) p r1_bio->bios[1]
$42 = (struct bio *) 0xffff880005ef4200
(gdb) p r1_bio->bios[2]
$43 = (struct bio *) 0x0 <irq_stack_union>
(gdb) p r1_bio->bios[3]
$44 = (struct bio *) 0x0 <irq_stack_union>
But the first loop does not start any IO. The second loop actually does the IO processing:
for (i = 0; i < disks; i++) {
    struct bio *mbio;
    if (!r1_bio->bios[i])
        continue;

    mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
    ......
    r1_bio->bios[i] = mbio;

    mbio->bi_iter.bi_sector = (r1_bio->sector +
                               conf->mirrors[i].rdev->data_offset);
    mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
    mbio->bi_end_io = raid1_end_write_request;
    mbio->bi_rw =
        WRITE | do_flush_fua | do_sync | do_discard | do_same;
    mbio->bi_private = r1_bio;

    atomic_inc(&r1_bio->remaining);

    cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
    if (cb)
        plug = container_of(cb, struct raid1_plug_cb, cb);
    else
        plug = NULL;
    spin_lock_irqsave(&conf->device_lock, flags);
    if (plug) {
        bio_list_add(&plug->pending, mbio);
        plug->pending_cnt++;
    } else {
        bio_list_add(&conf->pending_bio_list, mbio);
        conf->pending_count++;
    }
    spin_unlock_irqrestore(&conf->device_lock, flags);
    if (!plug)
        md_wakeup_thread(mddev->thread);
The loop creates a new bio object for each raid-disk, and each new bio is added either to the plug's pending list or to conf's pending list.
If the task is plugged (meaning IO is being batched rather than submitted immediately), the bio is added to the plug's list and later sent to the raid-disks via the raid1_unplug function, which is called by blk_finish_plug. You can see the call to generic_make_request in raid1_unplug.
If the task is not plugged (IO can be processed right now), the bio is added to conf->pending_bio_list and the raid1 thread is woken up. The raid1d thread calls flush_pending_writes, which calls generic_make_request for each bio in conf->pending_bio_list.
Please refer to other documents for the details of plugging.
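For reference, a condensed flush_pending_writes (the bitmap handling is omitted) shows how the pending list is drained:

static void flush_pending_writes(struct r1conf *conf)
{
    spin_lock_irq(&conf->device_lock);
    if (conf->pending_bio_list.head) {
        /* take the whole list at once and submit outside the lock */
        struct bio *bio = bio_list_get(&conf->pending_bio_list);
        conf->pending_count = 0;
        spin_unlock_irq(&conf->device_lock);
        ...
        while (bio) {
            struct bio *next = bio->bi_next;
            bio->bi_next = NULL;
            generic_make_request(bio);
            bio = next;
        }
    } else
        spin_unlock_irq(&conf->device_lock);
}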
r1_bio_write_done(r1_bio);
r1_bio_write_done is called here, but r1_bio is not freed until every bio in r1_bio has completed. The initial value of r1_bio->remaining is 1, and it is incremented whenever a bio is added, so with two bios r1_bio->remaining becomes three. r1_bio_write_done decrements r1_bio->remaining, bringing it down to two. When each bio completes, raid1_end_write_request is called, and it calls r1_bio_write_done in turn. So when all bios have completed, r1_bio->remaining becomes zero and the r1_bio object is freed.
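The decrement-and-test itself lives in r1_bio_write_done; condensed (the write-error retry path is omitted):

static void r1_bio_write_done(struct r1bio *r1_bio)
{
    /* drop one reference; only the last caller finishes the IO */
    if (!atomic_dec_and_test(&r1_bio->remaining))
        return;
    ...
    raid_end_bio_io(r1_bio); /* complete the master bio and free r1_bio */
}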