Now we'll copy a large file (4.7GB) and time the operation from start to finish, while the RAID6 is resyncing the last disk and with compression set to ON on the ZFS pool:
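The measurement itself is just a timed cp; a minimal sketch below, using a small stand-in file under /tmp instead of the 4.7GB binary (all paths here are hypothetical):

```shell
# Build a small stand-in file (the post used a 4.7GB binary)
dd if=/dev/urandom of=/tmp/sample.dat bs=1M count=16 2>/dev/null

# Time the copy; on the real pool the target would be under /mnt/MBPCBackupz
start=$(date +%s)
cp /tmp/sample.dat /tmp/sample.copy
sync
elapsed=$(( $(date +%s) - start ))
echo "copied in ${elapsed}s"

# Sanity-check that the copy is byte-identical
cmp -s /tmp/sample.dat /tmp/sample.copy && echo "copy OK"
```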
Still 1.00x for compression, but no errors. Not surprising; the file was already compressed. 🙂 Just for fun, and though there are no errors, let's scrub the pool:
pool: MBPCBackupz
state: ONLINE
see: http://www.sun.com/msg/ZFS-8000-EY
scrub: scrub in progress for 0h0m, 23.54% done, 0h0m to go
config:
NAME STATE READ WRITE CKSUM
MBPCBackupz ONLINE 0 0 0
MBPCStorage/MBPCBackup ONLINE 0 0 0
errors: No known data errors
#
So the overall speed of the copy, with compression, of about 4.7GB, even though little was actually compressed, was 23MB per second. Degraded, but still not bad. Checking with du -ah, we can see that compression was only lightly effective on the large 4.7GB file but significant (from 127MB to 27MB) on the text file:
# du -ah
4.4G ./SampleBinary.dat
27M ./wpa_supplicant_watch.log-20111226
4.5G .
# du -a
4606003 ./SampleBinary.dat
27639 ./wpa_supplicant_watch.log-20111226
4633651 .
#
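The 23MB/s figure works out as a back-of-envelope division, assuming roughly 200 seconds of elapsed time (the exact duration isn't given at this point in the post):

```shell
# ~4.7GB copied in about 200s; integer shell arithmetic is close enough
size_mb=4700
elapsed_s=200
echo "$(( size_mb / elapsed_s )) MB/s"
```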
A very nice and successful setup, but only from an ease-of-configuration perspective:
RAID6 (mdadm)
LVM2
ZFS
gzip
Some static iostat statistics on our configuration:
# iostat
Linux 2.6.32-131.12.1.el6.x86_64 (mbpc) 02/26/2012 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
10.70 0.05 24.86 0.73 0.00 63.67
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 739.42 64390.64 66.14 2673766075 2746394
sdb 735.83 64390.75 66.15 2673770703 2746802
sdd 836.45 67273.85 66.13 2793488937 2746106
sde 722.54 64390.85 66.11 2673774628 2745154
sdg 7.87 337.86 340.54 14029176 14140448
dm-0 43.79 62.07 338.18 2577228 14042688
dm-1 0.01 0.09 0.00 3914 0
dm-2 0.01 0.09 0.00 3650 0
md127 28.93 223.94 258.45 9298765 10731930
dm-3 1.09 267.77 0.06 11118820 2632
dm-4 0.79 6.36 0.00 263956 64
dm-5 0.13 0.79 0.22 32660 9192
sdh 580.92 17345.18 47111.61 720243498 1956269152
sdi 105.80 38.88 17364.42 1614442 721042636
dm-6 28.85 223.27 258.45 9271035 10731850
#
# iostat -x -k -d
Linux 2.6.32-131.12.1.el6.x86_64 (mbpc) 02/26/2012 _x86_64_ (2 CPU)
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 7309.77 7.59 739.35 0.45 32206.22 33.05 87.16 0.86 1.17 0.26 19.47
sdb 7313.38 7.59 735.75 0.46 32206.26 33.05 87.58 0.77 1.04 0.25 18.46
sdd 7573.79 7.60 836.72 0.43 33652.00 33.05 80.47 1.76 2.11 0.30 25.10
sde 7326.68 7.59 722.47 0.45 32206.31 33.03 89.19 0.84 1.16 0.27 19.41
sdg 0.90 37.46 2.55 5.32 168.83 170.18 86.15 0.16 20.75 2.93 2.30
dm-0 0.00 0.00 1.44 42.33 31.01 169.00 9.14 3.58 81.86 0.46 2.02
dm-1 0.00 0.00 0.01 0.00 0.05 0.00 7.91 0.00 4.54 2.65 0.00
dm-2 0.00 0.00 0.01 0.00 0.04 0.00 7.90 0.00 1.20 1.18 0.00
md127 0.00 0.00 13.40 15.51 111.90 129.15 16.67 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 1.08 0.01 133.80 0.03 245.79 0.01 4.82 2.63 0.29
dm-4 0.00 0.00 0.79 0.00 3.18 0.00 8.00 0.00 4.28 0.40 0.03
dm-5 0.00 0.00 0.10 0.03 0.39 0.11 7.99 0.00 14.05 0.52 0.01
sdh 1943.95 5507.00 228.05 353.33 8697.72 23541.54 110.90 3.15 5.41 0.80 46.23
sdi 2.70 2063.57 2.04 104.07 19.43 8707.34 164.47 0.56 5.25 1.13 11.98
dm-6 0.00 0.00 13.32 15.51 111.57 129.15 16.70 0.41 14.12 0.15 0.44
#
INTERESTING NOTE: After we copied both files above, the system appears to still be compressing them, as the ratio went from 1.01x to 1.02x after the copy finished. This would appear to be a nice feature; however, I'm not sure I'd want lingering processes on the system when production jobs need the CPU:
# zfs get compressratio MBPCBackupz
NAME PROPERTY VALUE SOURCE
MBPCBackupz compressratio 1.02x -
#
Full read/write from sdg to md127 (sda, sdb, sdd, sde, sdh, sdi) (nothing at 100%?):
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 15.33 1414.90 21.17 77.53 153.33 6109.10 126.90 0.43 4.13 2.40 23.69
sdb 15.03 1413.50 22.03 84.20 155.33 6124.30 118.22 0.45 4.10 2.25 23.91
sdd 15.87 1417.23 23.03 77.13 161.60 6116.30 125.35 0.46 4.38 2.47 24.75
sde 15.87 1415.67 21.47 76.00 155.47 6107.23 128.51 0.40 3.95 2.28 22.22
sdg 0.50 6.43 184.87 1.47 23531.07 31.20 252.90 0.79 4.24 2.28 42.47
dm-0 0.00 0.00 1.77 7.80 26.00 31.20 11.96 1.61 168.64 10.36 9.91
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.73 2871.47 27.73 23940.53 16.68 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 183.57 0.00 23496.53 0.00 256.00 0.66 3.62 2.28 41.76
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 15.57 1416.67 23.63 74.33 163.87 6102.43 127.93 0.45 4.35 2.50 24.53
sdi 15.67 1414.50 18.80 82.03 144.40 6121.50 124.28 0.42 3.99 2.38 24.02
dm-6 0.00 0.00 1.73 2871.47 27.73 23940.53 16.68 68.81 23.95 0.20 57.54
and took 197 seconds. Time to check the CPU% to see if the process is bound by a single execution core:
top - 08:49:51 up 16:49, 8 users, load average: 0.51, 0.21, 0.12
Tasks: 293 total, 2 running, 290 sleeping, 0 stopped, 1 zombie
Cpu0 : 14.6%us, 17.5%sy, 0.0%ni, 47.4%id, 20.2%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 30.6%us, 13.7%sy, 0.0%ni, 22.5%id, 32.6%wa, 0.3%hi, 0.3%si, 0.0%st
Mem: 3920884k total, 3769812k used, 151072k free, 64652k buffers
Swap: 4194296k total, 480k used, 4193816k free, 2403000k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3222 root 20 0 1417m 173m 1516 S 35 4.5 3:29.04 zfs-fuse
3243 root 20 0 1104m 68m 15m R 13 1.8 123:47.75 npviewer.bin
32530 root 20 0 111m 972 708 D 9 0.0 0:06.75 cp
893 root 20 0 0 0 0 S 7 0.0 332:40.52 md127_raid6
3127 root 20 0 1099m 291m 21m S 4 7.6 62:17.18 firefox
2244 root 20 0 187m 64m 9936 S 3 1.7 19:34.16 Xorg
38 root 20 0 0 0 0 S 2 0.0 0:06.01 kswapd0
2660 root 20 0 138m 3004 2380 S 1 0.1 0:04.70 gvfsd-trash
18330 videouse 20 0 138m 2856 2352 S 1 0.1 0:02.66 gvfsd-trash
3169 root 20 0 292m 13m 9108 S 1 0.4 1:39.88 gnome-terminal
11967 root 20 0 15220 1352 904 R 1 0.0 1:41.24 top
22 root 20 0 0 0 0 S 0 0.0 2:30.00 kblockd/0
which, at first glance, doesn't appear to be the case, because zfs-fuse is well under 50% in Irix mode; however, when checking all the CPU counters, the usage is in fact substantial (the ZFS daemon?). Copying a file back yields this:
top - 08:54:26 up 16:53, 8 users, load average: 1.60, 0.64, 0.30
Tasks: 294 total, 4 running, 289 sleeping, 0 stopped, 1 zombie
Cpu0 : 17.0%us, 28.3%sy, 0.0%ni, 6.7%id, 35.7%wa, 0.0%hi, 12.3%si, 0.0%st
Cpu1 : 15.8%us, 34.5%sy, 0.0%ni, 0.0%id, 47.7%wa, 0.3%hi, 1.6%si, 0.0%st
Mem: 3920884k total, 3776932k used, 143952k free, 64988k buffers
Swap: 4194296k total, 480k used, 4193816k free, 2371892k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3222 root 20 0 1417m 177m 1516 S 46 4.6 4:27.26 zfs-fuse
893 root 20 0 0 0 0 S 16 0.0 332:53.63 md127_raid6
770 root 20 0 111m 880 712 R 13 0.0 0:04.74 cp
3243 root 20 0 1104m 68m 15m R 13 1.8 124:23.04 npviewer.bin
771 root 20 0 0 0 0 D 5 0.0 0:01.43 flush-253:3
3127 root 20 0 1099m 292m 21m S 4 7.6 62:28.77 firefox
2244 root 20 0 187m 64m 9936 S 4 1.7 19:42.01 Xorg
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 81.83 0.00 1308.63 0.00 9825.65 0.00 15.02 2.57 1.96 0.17 21.97
sdb 85.83 0.00 1304.10 0.00 9831.33 0.00 15.08 2.96 2.28 0.17 22.61
sdd 79.73 0.00 1309.20 0.00 9824.80 0.00 15.01 1.26 0.97 0.14 17.78
sde 82.33 0.00 1307.93 0.00 9823.03 0.00 15.02 1.17 0.90 0.14 17.76
sdg 0.00 11392.93 3.13 91.50 60.53 46054.93 974.61 142.69 1522.58 10.57 100.00
dm-0 0.00 0.00 3.10 5.03 60.40 20.00 19.77 2.79 342.85 40.18 32.68
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 5803.60 0.00 48809.82 0.00 16.82 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.03 11478.83 0.13 45915.33 8.00 18153.20 1596.93 0.09 100.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 84.97 0.00 1304.03 0.00 9823.73 0.00 15.07 3.09 2.37 0.18 23.29
sdi 84.03 0.00 1306.27 0.00 9825.83 0.00 15.04 1.61 1.23 0.14 18.81
dm-6 0.00 0.00 5803.60 0.00 48809.82 0.00 16.82 20.97 3.62 0.07 42.71
This means reads are much faster out of the RAID6 /dev/raidmd0. The bottleneck is clearly the target here, but it's not so clear where the bottleneck is on a single-drive-to-RAID6 copy. So reads could theoretically go up to 115MB/s, but writes suffer at no higher than 25MB/s. (This is very slow.)
Tweaking time:
cat /sys/block/md127/md/stripe_cache_size
OR
cat /sys/block/$(awk 'BEGIN { "ls -al /dev/raidmd0" | getline; print $NF }')/md/stripe_cache_size
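Parsing ls -al output works, but readlink resolves the symlink more directly; demonstrated here with a stand-in link under /tmp rather than the real /dev/raidmd0:

```shell
# /dev/raidmd0 is a symlink to the real md node (md127 on this box);
# readlink follows it without the awk gymnastics
ln -sf md127 /tmp/raidmd0
dev=$(readlink /tmp/raidmd0)
echo "/sys/block/${dev}/md/stripe_cache_size"
```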
and so doing this action:
# echo "8192" > /sys/block/md127/md/stripe_cache_size
had absolutely no effect:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 20.60 1510.37 22.10 92.07 177.47 6580.90 118.39 0.51 4.23 2.21 25.28
sdb 18.70 1507.87 22.80 96.10 172.53 6580.37 113.59 0.45 3.66 2.04 24.28
sdd 17.90 1515.40 22.97 91.13 170.13 6595.83 118.60 0.51 4.21 2.31 26.34
sde 21.27 1513.87 22.27 93.63 180.93 6588.90 116.82 0.50 4.09 2.13 24.64
sdg 0.03 122.07 192.90 4.83 24229.33 506.93 250.20 0.77 3.89 2.23 44.09
dm-0 0.00 0.00 4.00 126.73 50.93 506.93 8.53 6.45 49.31 0.36 4.72
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.47 3193.80 23.47 25799.33 16.16 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 188.90 0.00 24174.13 0.00 255.95 0.62 3.28 2.12 39.95
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 19.70 1513.63 22.13 94.57 173.60 6601.83 116.12 0.51 4.13 2.24 26.19
sdi 19.77 1513.33 23.23 88.30 179.33 6577.30 121.16 0.50 4.21 2.35 26.21
dm-6 0.00 0.00 1.47 3193.80 23.47 25799.33 16.16 75.74 23.67 0.19 59.21
# ./tune.bash
check /tmp/tune_raid.log for messages in case of error.
suggested read ahead size per device: 768 blocks (384kb)
suggested read ahead size of array: 4608 blocks (2304kb)
RUN blockdev --setra 768 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
your current value for readahead is 256 256 256 256 256 256
RUN blockdev --setra 4608 /dev/md127
your current value for readahead is 256
suggested stripe cache size of devices: 96 pages (384kb)
RUN echo 96 > /sys/block/md127/md/stripe_cache_size
current value of /sys/block/md127/md/stripe_cache_size is 8192
setting max sectors kb to match chunk size
RUN echo 16 > /sys/block/sdi/queue/max_sectors_kb
current value of /sys/block/sdi/queue/max_sectors_kb is 512
RUN echo 16 > /sys/block/sdh/queue/max_sectors_kb
current value of /sys/block/sdh/queue/max_sectors_kb is 512
RUN echo 16 > /sys/block/sde/queue/max_sectors_kb
current value of /sys/block/sde/queue/max_sectors_kb is 512
RUN echo 16 > /sys/block/sdd/queue/max_sectors_kb
current value of /sys/block/sdd/queue/max_sectors_kb is 512
RUN echo 16 > /sys/block/sdb/queue/max_sectors_kb
current value of /sys/block/sdb/queue/max_sectors_kb is 512
RUN echo 16 > /sys/block/sda/queue/max_sectors_kb
current value of /sys/block/sda/queue/max_sectors_kb is 512
setting NCQ queue depth to 1
RUN echo 1 > /sys/block/sdi/device/queue_depth
current value of /sys/block/sdi/device/queue_depth is 31
RUN echo 1 > /sys/block/sdh/device/queue_depth
current value of /sys/block/sdh/device/queue_depth is 31
RUN echo 1 > /sys/block/sde/device/queue_depth
current value of /sys/block/sde/device/queue_depth is 31
RUN echo 1 > /sys/block/sdd/device/queue_depth
current value of /sys/block/sdd/device/queue_depth is 31
RUN echo 1 > /sys/block/sdb/device/queue_depth
current value of /sys/block/sdb/device/queue_depth is 31
RUN echo 1 > /sys/block/sda/device/queue_depth
current value of /sys/block/sda/device/queue_depth is 31
After the above, the write took 261 seconds, up from the earlier 197 seconds: degraded performance. The difference is most visible in iostat:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 64.17 756.40 54.37 362.40 490.53 4502.53 23.96 4.27 10.23 1.08 45.19
sdb 56.87 756.23 53.23 364.53 458.40 4508.40 23.78 4.03 9.63 1.05 43.95
sdd 56.13 751.40 53.43 365.60 458.93 4493.87 23.64 3.63 8.65 0.99 41.37
sde 66.20 755.03 55.67 361.23 505.87 4490.80 23.97 4.06 9.70 1.05 43.74
sdg 1.03 6.57 139.27 2.40 17570.27 35.07 248.55 0.61 4.33 2.34 33.20
dm-0 0.00 0.00 3.37 8.83 42.80 35.07 12.77 1.59 130.33 8.47 10.33
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.33 2189.10 21.33 17552.62 16.05 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 137.00 0.00 17536.00 0.00 256.00 0.48 3.52 2.24 30.62
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 58.40 760.20 51.70 355.87 461.87 4498.53 24.34 4.41 10.82 1.13 46.16
sdi 63.23 755.70 55.20 362.87 489.20 4500.40 23.87 3.62 8.64 1.05 43.80
dm-6 0.00 0.00 1.33 2189.10 21.33 17552.62 16.05 79.42 36.30 0.31 68.95
So I try to use my own numbers instead.
MY NUMBERS:
echo 4096 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 1024 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 16484 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 4096 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 256 > /sys/block/$qdepth/device/queue_depth; done
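Since the same knobs get retried repeatedly below, a small helper keeps the loops in one place. This is only a sketch, reusing the device names from the post; it must run as root on the actual box and the values themselves are unvalidated:

```shell
# tune_md STRIPE_CACHE DEV_READAHEAD MD_READAHEAD MAX_SECTORS_KB QUEUE_DEPTH
tune_md() {
  local scs=$1 dev_ra=$2 md_ra=$3 mskb=$4 qd=$5
  echo "$scs" > /sys/block/md127/md/stripe_cache_size
  blockdev --setra "$dev_ra" /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
  blockdev --setra "$md_ra" /dev/md127
  for d in sda sdb sdd sde sdh sdi; do
    echo "$mskb" > "/sys/block/$d/queue/max_sectors_kb"
    echo "$qd"   > "/sys/block/$d/device/queue_depth"
  done
}
# The settings above would then be: tune_md 4096 1024 16484 4096 256
```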
This brought it back to 196 seconds for the 4.7GB file. Looking at the iostat -x -k -d 30 numbers, the individual RAID6 disks are nearly half as busy with the higher values. This is a good sign:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 14.03 1415.77 18.70 78.37 135.60 6117.80 128.85 0.43 4.17 2.38 23.14
sdb 12.77 1417.63 19.03 80.47 131.33 6130.73 125.87 0.41 3.90 2.33 23.22
sdd 13.03 1416.23 19.70 79.77 137.07 6125.93 125.93 0.47 4.55 2.64 26.27
sde 13.37 1415.53 18.70 80.67 135.33 6123.53 125.98 0.41 3.99 2.33 23.14
sdg 0.17 5.73 187.47 1.40 23914.80 28.13 253.54 0.75 3.98 2.20 41.58
dm-0 0.00 0.00 0.97 7.03 17.20 28.13 11.33 0.56 70.47 3.61 2.89
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.47 2852.20 23.47 23960.73 16.81 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 186.70 0.00 23897.60 0.00 256.00 0.64 3.44 2.18 40.73
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 14.30 1414.70 20.00 80.23 143.07 6117.67 124.92 0.44 4.22 2.40 24.02
sdi 12.60 1416.63 19.00 81.60 131.87 6133.67 124.56 0.49 4.62 2.61 26.27
dm-6 0.00 0.00 1.47 2852.20 23.47 23960.73 16.81 67.94 23.80 0.21 58.66
So let's try some higher numbers by doubling them:
echo 8192 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 2048 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 32768 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 8192 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 256 > /sys/block/$qdepth/device/queue_depth; done
NOTE: setting 256 failed this time; queue_depth couldn't be set higher than 31, which is interesting. The above gave me the fastest result so far, at 192 seconds for 4.7GB:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 17.37 1416.83 21.73 84.60 163.48 6180.37 119.32 0.47 4.22 2.28 24.25
sdb 16.40 1414.93 20.87 84.57 156.53 6175.30 120.11 0.44 3.92 2.19 23.08
sdd 15.73 1416.70 20.10 85.73 148.80 6188.37 119.76 0.45 4.11 2.31 24.44
sde 17.00 1416.33 21.20 87.83 159.47 6191.03 116.49 0.44 3.88 2.11 23.01
sdg 0.17 63.83 191.50 2.80 24238.80 266.00 252.24 0.77 3.99 2.25 43.64
dm-0 0.00 0.00 2.70 66.50 51.07 266.00 9.16 4.78 69.05 0.69 4.76
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.67 2953.93 25.68 24204.58 16.40 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 188.97 0.00 24187.73 0.00 256.00 0.65 3.44 2.19 41.39
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 17.20 1419.53 19.60 81.17 153.07 6184.63 125.79 0.48 4.47 2.38 23.98
sdi 15.60 1415.27 21.30 87.73 155.07 6173.70 116.09 0.45 3.95 2.27 24.76
dm-6 0.00 0.00 1.67 2953.93 25.68 24204.58 16.40 71.91 24.30 0.20 57.94
Ok. Let's try with these:
echo 8192 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 8192 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 32768 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 8192 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 512 > /sys/block/$qdepth/device/queue_depth; done
This resulted in a sustained improvement from around 24MB/s to 24.8MB/s. Not a big improvement (ok, an abysmal one), but an improvement, and the write time was 191s:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 15.43 1466.03 20.00 80.93 149.33 6342.73 128.64 0.42 4.05 2.39 24.08
sdb 13.67 1462.10 20.17 87.33 141.33 6346.60 120.71 0.39 3.54 2.16 23.26
sdd 14.47 1465.80 21.13 79.03 149.20 6343.27 129.63 0.43 4.13 2.42 24.20
sde 15.37 1463.27 20.23 80.87 148.40 6327.67 128.11 0.41 3.92 2.23 22.59
sdg 0.37 8.10 188.40 2.47 23924.13 41.60 251.13 0.82 4.28 2.25 43.04
dm-0 0.00 0.00 2.13 10.40 35.07 41.60 12.23 1.24 98.82 4.91 6.16
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.60 3027.00 25.60 24818.23 16.41 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 186.63 0.00 23889.07 0.00 256.00 0.65 3.47 2.21 41.20
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 13.80 1463.50 20.00 81.77 142.00 6331.80 127.23 0.43 4.09 2.40 24.42
sdi 16.37 1460.53 20.67 84.03 156.40 6337.27 124.04 0.43 3.97 2.37 24.84
dm-6 0.00 0.00 1.60 3027.17 25.60 24820.50 16.41 72.43 23.92 0.19 58.04
Next I will tune the stripe_cache_size to a higher number and see:
echo 32768 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 32768 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 32768 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 16384 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 512 > /sys/block/$qdepth/device/queue_depth; done
The results were slightly worse. So let's try these numbers:
echo 8192 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 4096 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 32768 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 8192 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 512 > /sys/block/$qdepth/device/queue_depth; done
A degradation from 24.8MB/s, though an improvement over earlier runs, so I set things back to the values that gave me 24.8MB/s:
echo 8192 > /sys/block/md127/md/stripe_cache_size
blockdev --setra 8192 /dev/sda /dev/sdb /dev/sdd /dev/sde /dev/sdh /dev/sdi
blockdev --setra 32768 /dev/md127
for mskb in sdi sdh sde sdd sdb sda; do echo 8192 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in sdi sdh sde sdd sdb sda; do echo 31 > /sys/block/$qdepth/device/queue_depth; done
And we're back to where we started:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 14.17 1465.27 20.43 81.00 146.53 6348.47 128.06 0.41 3.93 2.31 23.42
sdb 16.63 1464.10 20.40 84.40 154.80 6356.73 124.27 0.43 3.95 2.23 23.33
sdd 15.17 1467.00 18.67 84.03 142.27 6368.20 126.79 0.42 3.91 2.39 24.52
sde 13.27 1467.40 19.03 84.90 136.13 6371.67 125.23 0.40 3.74 2.15 22.38
sdg 0.00 1.40 187.20 1.27 23895.73 10.40 253.69 0.85 4.51 2.14 40.42
dm-0 0.00 0.00 0.57 2.60 6.67 10.40 10.78 0.24 74.99 7.93 2.51
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md127 0.00 0.00 1.60 3120.27 25.60 24887.83 15.96 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 186.63 0.00 23889.07 0.00 256.00 0.62 3.32 2.12 39.64
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 15.77 1467.90 20.43 80.70 151.60 6360.33 128.78 0.45 4.29 2.40 24.28
sdi 14.13 1465.30 19.60 82.77 143.07 6367.27 127.20 0.44 4.14 2.51 25.70
dm-6 0.00 0.00 1.60 3120.27 25.60 24887.83 15.96 75.88 24.29 0.19 59.85
So the above peaked at 24.8MB/s. Next I'll try to raise the chunk size from 16 to 128 (the man page recommends 512):
# mdadm --grow /dev/raidmd0 --chunk-size=128
# mdadm --grow /dev/raidmd0 --chunk=32K
mdadm: New chunk size does not divide component size
#
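For context on why chunk size matters here: with 6 disks in RAID6, two disks' worth of each stripe hold parity, so the full-stripe write size is (6 - 2) x chunk. A quick table:

```shell
# Full-stripe size for a 6-disk RAID6 at various chunk sizes
disks=6; parity=2
for chunk_kb in 16 64 128 512; do
  echo "chunk ${chunk_kb}KB -> full stripe $(( (disks - parity) * chunk_kb ))KB"
done
```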
# zpool destroy MBPCBackupz
#
# zpool list
no pools available
# zfs list
no datasets available
#
# lvm lvremove /dev/MBPCStorage/MBPCBackup
Do you really want to remove active logical volume MBPCBackup? [y/n]: y
Logical volume "MBPCBackup" successfully removed
#
# lvm lvs
LV VG Attr LSize Origin Snap% Move Log Copy% Convert
oLogVol02 VGEntertain -wi-ao 151.00g
olv_root VGEntertain -wi-ao 32.00g
olv_swap VGEntertain -wi-a- 4.00g
fmlv mbpcvg -wi-ao 1.15t
rootlv mbpcvg -wi-ao 31.25g
swaplv mbpcvg -wi-ao 4.00g
# lvm vgs
VG #PV #LV #SN Attr VSize VFree
MBPCStorage 1 0 0 wz--n- 3.64t 3.64t
VGEntertain 1 3 0 wz--n- 187.00g 0
mbpcvg 1 3 0 wz--n- 1.18t 0
#
# lvm vgremove MBPCStorage
Volume group "MBPCStorage" successfully removed
# lvm vgs
VG #PV #LV #SN Attr VSize VFree
VGEntertain 1 3 0 wz--n- 187.00g 0
mbpcvg 1 3 0 wz--n- 1.18t 0
#
# lvm pvremove /dev/raidmd0
Labels on physical volume "/dev/raidmd0" successfully wiped
[root@mbpc mnt]# lvm pvs
PV VG Fmt Attr PSize PFree
/dev/sdg2 mbpcvg lvm2 a- 1.18t 0
/dev/sdg3 VGEntertain lvm2 a- 187.00g 0
#
Next we stop our array:
# mdadm --detail /dev/raidmd0
/dev/raidmd0:
Version : 1.2
Creation Time : Mon Jan 30 00:22:17 2012
Raid Level : raid6
Array Size : 3907045696 (3726.05 GiB 4000.81 GB)
Used Dev Size : 976761424 (931.51 GiB 1000.20 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Sun Feb 26 14:11:14 2012
State : clean
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 16K
Name : mbpc:0 (local to host mbpc)
UUID : b9c13d43:a7a1d949:f20dd93a:cb41cc00
Events : 312
Number Major Minor RaidDevice State
0 8 112 0 active sync /dev/sdh
1 8 64 1 active sync /dev/sde
2 8 48 2 active sync /dev/sdd
3 8 128 3 active sync /dev/sdi
4 8 16 4 active sync /dev/sdb
5 8 0 5 active sync /dev/sda
#
# mdadm --stop /dev/raidmd0
# mdadm --detail /dev/raidmd0
mdadm: cannot open /dev/raidmd0: No such file or directory
# cat /proc/mdadm
cat: /proc/mdadm: No such file or directory
#
And now we recreate our array:
mdadm --create --verbose /dev/md0 --level=raid6 --chunk=64K --auto=p --raid-devices=6 --spare-devices=0 /dev/rsd{a,b,c,d,e,f}
lvm pvcreate /dev/raidmd0
lvm vgcreate MBPCStorage /dev/raidmd0
lvm lvcreate -L3906254360S -n MBPCBackup MBPCStorage
zpool create MBPCBackupz /dev/MBPCStorage/MBPCBackup -m /mnt/MBPCBackupz
zfs set compression=on MBPCBackupz
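A quick verification pass after recreating everything would look something like this (a sketch; these need root on the actual box, zfs-fuse running, and the output will obviously vary):

```shell
# Confirm the new chunk size, the LV, the pool, and the compression setting
mdadm --detail /dev/md0 | grep 'Chunk Size'
lvm lvs MBPCStorage
zpool list MBPCBackupz
zfs get compression,compressratio MBPCBackupz
```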
Speed was still abysmal, however, at 191 seconds for 4.7GB:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.80 191.30 1.17 24418.93 7.60 253.83 1.04 5.39 2.17 41.79
dm-0 0.00 0.00 0.63 1.90 13.60 7.60 16.74 0.43 169.89 16.88 4.28
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 190.73 0.00 24413.87 0.00 256.00 0.64 3.37 2.14 40.74
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 429.73 1891.23 39.80 93.30 1890.53 7978.63 148.30 0.62 4.53 2.37 31.57
md0 0.00 0.00 0.43 479.03 27.73 24730.58 103.27 0.00 0.00 0.00 0.00
sdc 435.37 1883.03 40.37 91.07 1918.40 7942.77 150.06 0.55 4.05 2.15 28.25
sdd 456.23 1903.57 43.97 94.23 2010.53 8031.43 145.33 0.50 3.46 1.92 26.50
sde 449.67 1891.33 37.67 89.87 1962.80 7966.37 155.71 0.63 4.84 2.62 33.36
sdf 436.43 1896.80 39.03 96.00 1915.20 8008.23 146.98 0.53 3.82 2.09 28.25
sdg 425.77 1921.77 41.03 96.93 1882.67 8115.97 144.94 0.55 3.88 2.10 29.00
dm-6 0.00 0.00 0.43 479.03 27.73 24730.58 103.27 11.11 23.18 1.23 58.75
Time to tweak again:
cat /sys/block/md0/md/stripe_cache_size
blockdev --getra $(echo $(ls -al /dev/rsd*|awk '{ print "/dev/"$NF }'))
blockdev --getra /dev/md0 /dev/raidmd0
for mskb in $(ls -al /dev/rsd*|awk '{ print $NF }'); do cat /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in $(ls -al /dev/rsd*|awk '{ print $NF }'); do cat /sys/block/$qdepth/device/queue_depth; done
Verify our mappings for changing parameters (UDEV Rules):
# ls -al /dev/rsd*
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsda -> sdb
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsdb -> sdc
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsdc -> sdd
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsdd -> sde
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsde -> sdf
lrwxrwxrwx. 1 root root 3 Feb 27 13:41 /dev/rsdf -> sdg
#
And let's try with yet another combination of numbers:
echo 8192 > /sys/block/md0/md/stripe_cache_size
blockdev --setra 8192 $(echo $(ls -al /dev/rsd*|awk '{ print "/dev/"$NF }'))
blockdev --setra 32768 /dev/md127
for mskb in $(ls -al /dev/rsd*|awk '{ print $NF }'); do echo 8192 > /sys/block/$mskb/queue/max_sectors_kb; done
for qdepth in $(ls -al /dev/rsd*|awk '{ print $NF }'); do echo 31 > /sys/block/$qdepth/device/queue_depth; done
Verify that things are actually set:
# cat /sys/block/md0/md/stripe_cache_size
256
# blockdev --getra $(echo $(ls -al /dev/rsd*|awk '{ print "/dev/"$NF }'))
256
256
256
256
256
256
# blockdev --getra /dev/md0 /dev/raidmd0
4096
4096
# ls -al /dev/rsd*|awk '{ print $NF }'
sdb
sdc
sdd
sde
sdf
sdg
# for mskb in $(ls -al /dev/rsd*|awk '{ print $NF }'); do cat /sys/block/$mskb/queue/max_sectors_kb; done
512
512
512
512
512
512
# for qdepth in $(ls -al /dev/rsd*|awk '{ print $NF }'); do cat /sys/block/$qdepth/device/queue_depth; done
31
31
31
31
31
31
#
So now we try to reset the chunk size again; this time, having started from a larger value, we get past the earlier divisibility error:
# mdadm --grow /dev/raidmd0 --chunk=128K
mdadm: /dev/raidmd0: Cannot grow – need backup-file
#
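For the record, mdadm will perform this reshape if given a scratch file on a filesystem outside the array to checkpoint into. Not attempted here, and the path below is purely illustrative:

```shell
# Hypothetical retry with a backup file outside the array
mdadm --grow /dev/raidmd0 --chunk=128 --backup-file=/root/md-grow-backup
```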
Hmm, no luck, but that's OK. Let's continue performance testing:
top - 04:19:08 up 5:36, 6 users, load average: 0.52, 0.24, 0.09
Tasks: 221 total, 3 running, 217 sleeping, 0 stopped, 1 zombie
Cpu0 : 4.0%us, 14.3%sy, 0.0%ni, 25.6%id, 55.8%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 8.2%us, 14.1%sy, 0.0%ni, 67.0%id, 10.1%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 3920768k total, 3776328k used, 144440k free, 59900k buffers
Swap: 4194296k total, 524k used, 4193772k free, 2872196k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 91.97 1576.17 20.47 88.93 462.93 6727.83 131.46 0.42 3.73 2.10 22.97
sdb 88.00 1573.27 21.50 85.60 445.33 6702.37 133.48 0.42 3.79 2.19 23.46
sdc 86.57 1578.97 20.73 86.33 434.80 6729.03 133.82 0.40 3.62 2.14 22.96
sdd 84.93 1570.67 21.93 80.53 434.13 6669.30 138.65 0.42 4.01 2.30 23.59
sde 85.47 1571.33 20.97 86.17 438.80 6688.37 133.05 0.41 3.69 2.19 23.46
sdf 89.47 1571.10 22.07 84.27 454.13 6677.83 134.14 0.40 3.67 2.21 23.45
sdg 0.33 4.20 195.33 1.03 24820.80 20.53 253.01 0.75 3.83 2.27 44.49
dm-0 0.00 0.00 2.03 5.13 35.73 20.53 15.70 0.21 29.92 5.68 4.07
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 193.57 0.00 24776.53 0.00 256.00 0.70 3.60 2.24 43.42
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 0.37 928.20 23.47 25264.65 54.47 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.37 928.20 23.47 25264.65 54.47 21.50 23.14 0.61 56.25
The second copy took 188 seconds, so roughly 25.2MB/s. We then decided to add a bitmap, as apparently the array didn't have one earlier. A bitmap is generally a good thing for recovering an array:
# mdadm --grow /dev/md127 --bitmap=internal
#
We can reset to none after with this command:
# mdadm --grow /dev/md127 --bitmap=none
#
# mdadm --detail /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Sun Mar 4 23:11:42 2012
Raid Level : raid6
Array Size : 3907045632 (3726.05 GiB 4000.81 GB)
Used Dev Size : 976761408 (931.51 GiB 1000.20 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Mar 18 18:30:21 2012
State : active
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : mbpc:0 (local to host mbpc)
UUID : f1c5626d:cfd9d49e:41347e87:7b949c44
Events : 20
Number Major Minor RaidDevice State
0 8 32 0 active sync /dev/sdc
1 8 64 1 active sync /dev/sde
2 8 48 2 active sync /dev/sdd
3 8 80 3 active sync /dev/sdf
4 8 16 4 active sync /dev/sdb
5 8 0 5 active sync /dev/sda
#
# echo 50000 > /proc/sys/dev/raid/speed_limit_min
but little write-speed difference. (This sysctl bounds md resync/rebuild throughput rather than normal array writes, so little effect on copies is expected.)
Hi,
Great post. You don’t need to specify the parameters when creating the XFS file system, see http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E and http://www.spinics.net/lists/raid/msg38074.html . Of course, YMMV.
Did you run those benchmarks while the array was resyncing?
Hey Mathias,
Thanks for posting. Just added the testing numbers so feel free to have a look and judge yourself.
> logbsize and delaylog
I ran another test with logbsize=128k (couldn’t find anything for delaylog in my mkfs.xfs man page so I’m not sure if that’ll do anything). Little to no difference in this case on first glance. Watch out for the results at some point for a closer look.
One consideration here is that eventually I'll grow the LVM and XFS to fill up to 4TB. I'll be doing this soon. Potentially, in the future, I may also try to grow this array to well over 8TB (yet to see how to do that). I'm not sure whether XFS would auto-adjust to optimal values at those capacities, and the link didn't touch on that topic.
All in all, I can still run tests on this thing, recreating the FS if I need to, so feel free to suggest numbers you'd be interested to see. I might leave this topic open for a week or two in case I think of anything else or am missing anything. For my setup, anything > 125MB/s is a bonus, as the network is only 1Gb/s, with roughly 125MB/s as its theoretical max.
Cheers!
TK
Thank you for posting this blog. I was getting desperate: I could not figure out why I could not stop the RAID1 device, even from Ubuntu Rescue Remix. The LVM group was being assembled from the failed RAID. I removed the volume group and was finally able to gain exclusive access to the array to stop it, put in the new disk, and rebuild the array.
Nice job.
Best,
Dave.