Purpose

Mainly documenting a few things I don't want to forget. Perhaps it's useful to others as well.


Tuesday, 3 April 2012

RAMdisk with spillover onto SSD

The normal way to do this is to just mount a tmpfs, then create swap on SSD.
I can't do that, because tmpfs doesn't support O_DIRECT.
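To see whether a given mount accepts O_DIRECT, one option is to attempt a single direct write with GNU dd. This is only a sketch: `supports_odirect` is a made-up helper name, and the result depends on the kernel and filesystem in use, so newer kernels may behave differently from the one used here.

```shell
# Probe whether a directory's filesystem accepts O_DIRECT writes.
# supports_odirect is a hypothetical helper name, not from any package.
supports_odirect() {
    dir=$1
    probe="$dir/.odirect_probe.$$"
    # oflag=direct makes GNU dd open the output file with O_DIRECT.
    if dd if=/dev/zero of="$probe" bs=4096 count=1 oflag=direct 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    # Remove the probe file whether or not the write succeeded.
    rm -f "$probe"
    return $rc
}

# Example: check the tmpfs mount used later in this post.
if supports_odirect /dev/shm; then
    echo "/dev/shm accepts O_DIRECT"
else
    echo "/dev/shm rejects O_DIRECT"
fi
```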

PURPOSE
=======
Create a small to medium area for fast temporary storage with the ability to grow "onto" slower media. In my case, it's for MySQL temporary tables.


REQUIREMENTS
============
* Some RAM to spare
* 2*160G SSDs (or whatever you can get your hands on, or even regular rotating drives).

USAGE
=====

First we'll create a 40G file on tmpfs and associate a loop device with that file:

1. Create the ramdisk-backed loopback file (40960 x 1M = 40 GiB, matching the 83886080 sectors gdisk reports below):
# dd if=/dev/zero of=/dev/shm/ramdisk bs=1M count=40960

2. Create the loop device:
# losetup -f /dev/shm/ramdisk

3. (Optional) Create a device-mapper volume on top of the loop device (this is only to get disk statistics from iostat and /proc/partitions):

# gdisk -l /dev/loop0
Disk /dev/loop0: 83886080 sectors, 40.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 0E9B6683-EB36-4EAF-8A54-2C04B3486E7E
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 83886046
Partitions will be aligned on 2048-sector boundaries

mysql> select 83886046-33;
+-------------+
| 83886046-33 |
+-------------+
|    83886013 |
+-------------+
1 row in set (0.00 sec)

# echo "0 83886013 linear /dev/loop0 34" | dmsetup -u $(uuidgen) create ramdisk
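The sector arithmetic can also be done in the shell instead of a MySQL prompt. The following is a rough sketch, assuming gdisk prints the "First usable sector is ..." line in exactly the format shown above; `linear_table` is an illustrative name and not part of gdisk or dmsetup.

```shell
# linear_table DEVICE: read `gdisk -l` output on stdin, print a dm "linear" table.
# The function name is illustrative; it is not part of gdisk or dmsetup.
linear_table() {
    dev=$1
    # Pull both numbers from "First usable sector is A, last usable sector is B".
    set -- $(sed -n 's/^First usable sector is \([0-9]*\), last usable sector is \([0-9]*\).*/\1 \2/p')
    first=$1
    last=$2
    # The range is inclusive, so its length is last - first + 1.
    echo "0 $((last - first + 1)) linear $dev $first"
}

# Example with the line gdisk printed for /dev/loop0 above:
echo "First usable sector is 34, last usable sector is 83886046" | linear_table /dev/loop0
```

With real hardware the input would come from `gdisk -l /dev/loop0`, and the printed table could be piped into `dmsetup create` as in the command above.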



Then we'll need to configure the SSD:

1. Find the sector info from the SSD:
# gdisk -l /dev/sdc
Disk /dev/sdc: 311427072 sectors, 148.5 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B9A305D6-4536-42DB-A0F9-8861168F4061
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 311427038

2. Use the numbers above to carve out 80% of the drive's usable sectors (note the parentheses: the subtraction must happen before the multiplication):
mysql> select ceil((311427038-33)*0.8);
+--------------------------+
| ceil((311427038-33)*0.8) |
+--------------------------+
|                249141604 |
+--------------------------+
1 row in set (0.00 sec)

# echo "0 249141604 linear /dev/sdc 34" | dmsetup -u $(uuidgen) create ssd
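Taking a percentage of the usable range is easy to get wrong because of operator precedence: the range length has to be computed before the multiplication. Here is a small sketch in shell integer arithmetic, using the sector numbers gdisk reported for /dev/sdc; `percent_of_range` is a hypothetical helper.

```shell
# percent_of_range FIRST LAST PERCENT: sectors covering PERCENT% of FIRST..LAST inclusive.
# Hypothetical helper; integer arithmetic, rounded up.
percent_of_range() {
    first=$1
    last=$2
    pct=$3
    total=$((last - first + 1))
    # Adding 99 before dividing by 100 rounds the result up.
    echo $(( (total * pct + 99) / 100 ))
}

# 80% of the usable sectors on /dev/sdc (first usable 34, last usable 311427038):
percent_of_range 34 311427038 80
```

For these numbers the helper prints 249141604, which is the length to use in the dmsetup table for an 80% mapping.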


Result:

  ls -l /dev/mapper/
  total 0
  crw------- 1 root root  10, 63 Feb 25 21:48 control
  brw-rw---- 1 root disk 253,  1 Apr  1 09:03 ramdisk
  brw-rw---- 1 root disk 253,  2 Apr  1 09:03 ssd

Next: Create a linear RAID device on these two volumes, with the ramdisk listed first so that it holds the lowest block addresses of the array:
  mdadm --create /dev/md0 --level=linear --raid-devices=2 --name rambacked /dev/mapper/ramdisk /dev/mapper/ssd

Remember that the device-mapper volumes were optional? You could just as well build the array directly on the loop device and an SSD partition:
  mdadm --create /dev/md0 --level=linear --raid-devices=2 --name rambacked /dev/loop0 /dev/sdc1
 

Create a filesystem:
  mkfs.xfs -f -l lazy-count=1 -L rambacked /dev/md0

Mount it:
  mkdir /rambacked
  mount -o rw,noauto,noatime,nodiratime,logbufs=8,nobarrier /dev/md0 /rambacked

Now write 100G to it; the first 40G will be blazingly fast. vmstat output during the write:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 143745408  22648 53620368    0    0     0     7    0    0  0  0 100  0  0
 0  0      0 143745792  22648 53620380    0    0     0     0 1028  620  0  0 100  0  0
 0  0      0 143745904  22648 53620380    0    0     0     0 1007  514  0  0 100  0  0
 0  0      0 143745680  22648 53620380    0    0     0     0 1607  540  0  0 100  0  0
 0  0      0 143745680  22648 53620380    0    0     0     0 1096  530  0  0 100  0  0
 0  0      0 143746288  22656 53620372    0    0     0    68 1016  585  0  0 100  0  0
 0  0      0 143746288  22656 53620372    0    0     0     0 1008  525  0  0 100  0  0
 0  0      0 143746400  22656 53620380    0    0     0     0 1033  541  0  0 100  0  0
 1  1      0 143746272  22656 53620380    0    0    12 909984 1018 5932  0  1 98  1  0 <- starting dd write.
 2  0      0 143746528  22656 53620392    0    0     0 2995456 1027 18285  0  4 92  4  0
 1  1      0 143746528  22664 53620384    0    0     0 2945324 1013 17996  0  4 92  3  0
 2  0      0 143746656  22664 53620392    0    0     0 2930912 1028 17906  0  4 92  3  0
 1  0      0 143746528  22664 53620392    0    0    12 2932992 1006 17920  0  4 92  3  0
 1  1      0 143746208  22664 53620404    0    0     0 3002720 1608 18326  0  4 92  4  0 <- 3GB/s direct io
 1  0      0 143746096  22664 53620404    0    0     0 2982210 1088 18202  0  4 92  4  0
 2  0      0 143745968  22672 53620396    0    0     0 2954616 1012 18076  0  5 92  3  0
 1  1      0 143746080  22672 53620404    0    0    12 2890080 1005 17667  0  5 92  3  0
 1  1      0 143746208  22672 53620416    0    0     0 2911584 1028 17802  0  4 92  4  0
 2  0      0 143746336  22672 53620416    0    0     0 2915680 1008 17807  0  5 92  2  0
 1  1      0 143746336  22672 53620416    0    0     0 2963936 1024 18124  0  5 92  3  0
 1  0      0 143746336  22680 53620408    0    0    12 2957816 1012 18104  0  4 92  3  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1      0 143746464  22680 53620428    0    0     0 2944480 1019 18004  0  4 92  3  0
 2  0      0 143746448  22680 53620428    0    0     0 2908608 1013 17783  0  4 92  3  0
 1  1      0 143746576  22680 53620428    0    0    12 2943520 1642 18019  0  5 92  3  0
 1  1      0 143746576  22680 53620428    0    0     0 2912813 1077 17815  0  5 92  3  0
 1  1      0 143746688  22688 53620432    0    0     0 2882072 1010 17744  0  5 93  3  0
 0  1      0 143746688  22688 53620432    0    0     0 2149280 5857 12183  0  3 94  3  0 <- spills over on SSD.
 0  1      0 143746816  22688 53620444    0    0     0 265665 3368 1591  0  0 96  4  0
 0  1      0 143746816  22688 53620444    0    0     0 180416 2596 1223  0  0 96  4  0 <- now 100% on SSD device (see iostat output below)
 0  1      0 143746944  22688 53620444    0    0     0 170176 2522 1215  0  0 96  4  0
 0  1      0 143746816  22696 53620436    0    0     0 173240 2534 1228  0  0 96  4  0
 0  1      0 143746816  22696 53620444    0    0     0 171208 2536 1224  0  0 96  4  0
 0  1      0 143746816  22696 53620444    0    0     0 172236 2543 1208  0  0 96  4  0
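To put the `bo` column into more familiar units: vmstat reports blocks written per second, and its man page states that Linux blocks are 1024 bytes. A minimal conversion sketch (`bo_to_mib` is an illustrative name):

```shell
# bo_to_mib BLOCKS_PER_SEC: convert a vmstat "bo" value (1 KiB blocks) to MiB/s.
# Illustrative helper, not part of vmstat.
bo_to_mib() {
    awk -v bo="$1" 'BEGIN { printf "%.1f\n", bo / 1024 }'
}

bo_to_mib 3002720   # a sample from the RAM-backed phase
bo_to_mib 172236    # a sample from the SSD phase
```

For the two samples above this works out to roughly 2932 MiB/s while writing to RAM and roughly 168 MiB/s once the writes land on the SSD.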

iostat output:

Linux 2.6.18-274.el5 (staging-db-lv-2.staging.marinsw.net)     03/31/2012

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     1.69  0.00  0.41     0.00    88.68   214.24     0.01   23.96   0.60   0.03
md0               0.00     0.00  0.00  1.23     0.00   282.16   229.89     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     8.48     0.00    0.02   0.02   0.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.60 9592.20     4.80 2182758.40   227.54     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.60 9592.20     4.80 2182758.40   227.54     1.25    0.13   0.03  25.5    <- 100% in RAM

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.60 25900.60     4.79 5893825.15   227.55     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.60 25901.20     4.79 5893973.65   227.55     3.11    0.12   0.03  67.54  <- 26000 writes per second

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  1.20 25693.20     9.60 5846630.40   227.55     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  1.20 25692.60     9.60 5846481.60   227.54     3.78    0.15   0.03  74.26

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
md0               0.00     0.00  0.60 25683.40     4.80 5844178.00   227.54     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.60 25683.40     4.80 5844178.00   227.54     3.20    0.12   0.03  66.98

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00 2230.20     0.00 507516.80   227.57     4.15    1.86   0.34  76.84   <- spilling over on SSD.
md0               0.00     0.00  0.00 7463.40     0.00 1698205.00   227.54     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00 5231.80     0.00 1190377.80   227.53     0.80    0.15   0.03  15.66

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00 1506.60     0.00 342835.20   227.56     5.13    3.41   0.65  98.64   <- almost exclusively on SSD.
md0               0.00     0.00  0.00 1507.80     0.00 342842.00   227.38     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  1.20     0.00     6.80     5.67     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.00 1508.00     0.00 343145.60   227.55     5.20    3.45   0.65  98.56   <- exclusively SSD.
md0               0.00     0.00  0.00 1508.40     0.00 343244.80   227.56     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0.00     0.00  0.60 1506.00     4.80 342723.20   227.48     5.12    3.40   0.65  98.34   <- 1500 writes per second.
md0               0.00     0.00  0.60 1505.60     4.80 342432.20   227.35     0.00    0.00   0.00   0.00
dm-1              0.00     0.00  0.00  0.80     0.00     6.60     8.25     0.00    0.00   0.00   0.00
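The `wsec/s` column counts 512-byte sectors, so the iostat figures can be cross-checked against vmstat. Below is a sketch that assumes the 12-column `iostat -x` layout shown above, where `wsec/s` is the seventh field; `iostat_write_mib` is a hypothetical helper and would need adjusting for other sysstat versions.

```shell
# iostat_write_mib: read iostat -x device lines on stdin, print "device MiB/s".
# Assumes wsec/s is field 7 (512-byte sectors), as in the output above.
iostat_write_mib() {
    awk '$1 != "Device:" && NF >= 12 { printf "%s %.1f\n", $1, $7 * 512 / 1048576 }'
}

# Example with one sdc line from the SSD-only phase:
echo "sdc 0.00 0.00 0.00 1508.00 0.00 343145.60 227.55 5.20 3.45 0.65 98.56" | iostat_write_mib
```

The sample line comes out at about 167.6 MiB/s, in line with the vmstat numbers for the SSD phase.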

Useful? Maybe. If you need a filesystem that is fast as long as you stay under a certain size (40G here), but that can spill over onto slower storage when you exceed it, this is one way to get it.