Balancing I/O response time and disk rebuild time in a RAID5 disk array
Abstract
When a disk in the RAID5 disk array architecture has failed, requests to that disk can only be serviced by reading data from all surviving disks and rebuilding the lost data. This may cut disk performance in half. To avoid this degradation, all of the lost data must be rebuilt and written to a spare disk. The faster the data are rebuilt, the sooner the disk array returns to normal operation. Giving high priority to the rebuild process, however, can increase response times for incoming application requests, which compete for disk service. A balance must be found between acceptable application response times and disk rebuild times. Simulation was used to evaluate the effect of the rebuild unit size on response time and rebuild time. The authors have found this tradeoff to be embodied in the choice of the rebuild unit and the amount of rebuild data which is atomically read from each surviving disk. They find that a single-track rebuild unit provides faster rebuild times than a one-sector rebuild unit. Rebuilding one track at a time provides better application request response times than rebuilding one cylinder at a time.
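The reconstruction step described above can be illustrated with a minimal sketch. It is not the authors' simulator: it assumes the usual RAID5 XOR parity, models each disk as an in-memory byte string, and uses hypothetical names (`xor_blocks`, `rebuild_disk`, `unit_size`) that do not come from the paper. The `unit_size` parameter stands in for the rebuild unit, the amount of data read from each surviving disk in one step.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_disk(surviving, unit_size):
    """Yield the failed disk's contents one rebuild unit at a time.

    `surviving` holds the contents of every surviving disk (data and
    parity alike). Each step reads `unit_size` bytes at the same offset
    from every surviving disk and XORs them to recover the lost bytes.
    """
    total = len(surviving[0])
    for offset in range(0, total, unit_size):
        chunks = [disk[offset:offset + unit_size] for disk in surviving]
        yield xor_blocks(chunks)


# Illustrative data only: three data disks plus their XOR parity.
d0, d1, d2 = b"ABCDEFGH", b"12345678", b"abcdefgh"
parity = xor_blocks([d0, d1, d2])

# Suppose the disk holding d1 fails; rebuild it 4 bytes at a time.
rebuilt = b"".join(rebuild_disk([d0, d2, parity], unit_size=4))
```

In a real array, application requests could be serviced between successive units yielded by the generator, so a smaller `unit_size` gives them more frequent opportunities to run while a larger one amortizes positioning overhead across more rebuilt data. The sketch does not model that timing.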