These pages contain instructions, links and other information related to persistent memory enabling in Linux.
One interesting use of the PMEM driver is to allow users to begin developing software using DAX, which was upstreamed in v4.0. On a non-NFIT system this can be done by using PMEM's memmap kernel command line parameter to manually create a type 12 memory region.
Here are the additions I made for my system with 32 GiB of RAM:
1) Reserve 16 GiB of memory via the “memmap” kernel parameter in grub's menu.lst, using PMEM's new “!” specifier:
memmap=16G!16G
The documentation for this parameter can be found here: https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
Also see: How to choose the correct memmap kernel parameter for PMEM on your system.
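To sanity-check a memmap value before rebooting, it can help to work out which physical address range it covers. The kernel documentation describes the form as memmap=nn[KMG]!ss[KMG], marking the region from ss to ss+nn. The sketch below is a simplified illustration in Python: the helper names parse_size and pmem_range are made up for this example, and it only handles plain decimal numbers with K, M, or G suffixes, which is narrower than what the kernel's own parser accepts.

```python
import re

# Binary multipliers for the suffixes used in the memmap documentation.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as '16G' into bytes (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def pmem_range(param):
    """Return (start, end) in bytes for a 'memmap=nn!ss' parameter.

    'end' is exclusive: the reserved region is [start, end).
    """
    value = param.split("=", 1)[1]
    size_text, start_text = value.split("!", 1)
    start = parse_size(start_text)
    return start, start + parse_size(size_text)


start, end = pmem_range("memmap=16G!16G")
print(f"reserved: {start:#x}-{end - 1:#x} ({(end - start) >> 30} GiB)")
```

For the 32 GiB system above, the reserved range is the upper 16 GiB. Whether that range is actually usable RAM on a given machine still has to be checked against the firmware memory map.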
2) Set up the correct kernel configuration options for PMEM and DAX in .config.
Options in make menuconfig:
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_FS_DAX=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_ARCH_HAS_PMEM_API=y
This configuration gave me one pmem device with 16 GiB of space:
$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
lsblk shows the block devices, including pmem devices. Examples:
$ lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
pmem0     259:0    0   16G  0 disk
├─pmem0p1 259:6    0    4G  0 part /mnt/ext4-pmem0
└─pmem0p2 259:7    0 11.9G  0 part /mnt/btrfs-pmem0
pmem1     259:1    0   16G  0 disk /mnt/xfs-pmem1
pmem2     259:2    0   16G  0 disk /mnt/xfs-pmem2
pmem3     259:3    0   16G  0 disk /mnt/xfs-pmem3

$ lsblk -t
NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
pmem0         0   4096      0    4096     512    0           128 128    0B
pmem1         0   4096      0    4096     512    0           128 128    0B
pmem2         0   4096      0    4096     512    0           128 128    0B
pmem3         0   4096      0    4096     512    0           128 128    0B
You can divide persistent memory address ranges into namespaces with ndctl. This stores namespace label metadata at the beginning of the persistent memory address range.
ndctl supports four modes: raw, sector, memory, and dax.
Example commands on an 8 GiB NVDIMM with output showing the resulting sizes and /dev/ device names:
$ ndctl create-namespace --mode raw -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":8589934592,        # this is exactly 8 GiB
  "blockdev":"pmem0"
}

$ ndctl create-namespace --mode sector -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"sector",
  "size":8580472832,        # this is 9240 KiB less than 8 GiB
  "uuid":"52b53e55-eccd-40bf-a2aa-9f03ebf30e6b",
  "sector_size":4096,
  "blockdev":"pmem0s"
}

$ ndctl create-namespace --mode memory --map mem -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"memory",
  "size":8587837440,        # this is 2 MiB less than 8 GiB
  "uuid":"349b7e53-dfbb-4b90-89ed-db80cfdaab0f",
  "blockdev":"pmem0"
}

$ ndctl create-namespace --mode memory --map dev -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"memory",
  "size":8453619712,        # this is 130 MiB less than 8 GiB
  "uuid":"03faeca5-226c-48d9-bb47-f71cbc6d322e",
  "blockdev":"pmem0"
}

$ sudo ndctl create-namespace --mode dax -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"dax",
  "size":8453619712,        # this is 130 MiB less than 8 GiB
  "uuid":"252d7895-91f3-42b7-9eeb-27ffc03e354c",
  "daxdevs":[
    {
      "chardev":"dax0.0",
      "size":8453619712     # this is 130 MiB less than 8 GiB
    }
  ]
}
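The size differences noted in the ndctl output above can be checked with a few lines of arithmetic. The Python sketch below recomputes the overhead of each mode from the reported "size" values. The last part is an assumption rather than something stated by ndctl: it estimates how much space a per-page structure array would need if each 4 KiB page required a 64-byte entry, which would account for 128 MiB of the 130 MiB lost with --map dev and in dax mode.

```python
GIB = 1 << 30
MIB = 1 << 20
KIB = 1 << 10

RAW_SIZE = 8 * GIB  # capacity of the example NVDIMM

# "size" values copied from the ndctl create-namespace output.
reported = {
    "raw": 8589934592,
    "sector": 8580472832,
    "memory (--map mem)": 8587837440,
    "memory (--map dev)": 8453619712,
    "dax": 8453619712,
}


def overhead(size, raw=RAW_SIZE):
    """Bytes of the raw capacity that are not available in this mode."""
    return raw - size


for mode, size in reported.items():
    lost = overhead(size)
    print(f"{mode:20s} {lost // KIB:>8d} KiB ({lost / MIB:g} MiB)")

# Assumption: one 64-byte page structure per 4 KiB page, stored on the device.
PAGE_SIZE = 4 * KIB
STRUCT_PAGE_SIZE = 64
memmap_bytes = RAW_SIZE // PAGE_SIZE * STRUCT_PAGE_SIZE
print(f"estimated page-structure array: {memmap_bytes // MIB} MiB")
```

Under that assumption, the remaining 2 MiB matches the overhead seen with --map mem, where the page structures live in regular RAM instead of on the device.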
You can divide raw, sector, and memory devices (/dev/pmemN and /dev/pmemNs) into partitions. In parted, the mkpart subcommand has this syntax:
mkpart [part-type fs-type name] start end
Although mkpart defaults to 1 MiB alignment, you may want to use 2 MiB alignment to support more efficient page mappings - see https://nvdimm.wiki.kernel.org/2mib_fs_dax.
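If you want 2 MiB-aligned partitions, one way to pick the start and end offsets is to round them to 2 MiB boundaries before handing them to mkpart. The Python sketch below illustrates the rounding. The helper names are made up for this example, and the 16 GiB device size and the 1 MiB margins are taken from the example on this page.

```python
MIB = 1 << 20
ALIGN_2M = 2 * MIB


def align_up(offset, alignment=ALIGN_2M):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def align_down(offset, alignment=ALIGN_2M):
    """Round offset down to the previous multiple of alignment."""
    return offset // alignment * alignment


def is_aligned(offset, alignment=ALIGN_2M):
    """True if offset is an exact multiple of alignment."""
    return offset % alignment == 0


device_bytes = 16 * 1024 * MIB            # 16 GiB device
start = align_up(1 * MIB)                 # leave room for the GPT header
end = align_down(device_bytes - 1 * MIB)  # leave room for the backup GPT

# Print a mkpart line that could be pasted into a parted invocation.
print(f"mkpart primary ext4 {start // MIB}MiB {end // MIB}MiB")
```

Alignment of the partition is only one of the conditions for large page mappings; the linked page covers the filesystem side.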
Example carving a 16 GiB /dev/pmem0 into 4 GiB, 8 GiB, and 4 GiB partitions, constrained by 1 MiB alignment at the beginning and end. Note that parted displays its output in SI decimal units, while lsblk uses binary units:
$ parted -s -a optimal /dev/pmem0 \
        mklabel gpt -- \
        mkpart primary ext4 1MiB 4GiB \
        mkpart primary xfs 4GiB 12GiB \
        mkpart primary btrfs 12GiB -1MiB \
        print
Model: Unknown (unknown)
Disk /dev/pmem0: 17.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4295MB  4294MB  ext4         primary
 2      4295MB  12.9GB  8590MB  xfs          primary
 3      12.9GB  17.2GB  4294MB  btrfs        primary

$ lsblk
NAME      MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0     259:0    0  16G  0 disk
├─pmem0p1 259:4    0   4G  0 part
├─pmem0p2 259:5    0   8G  0 part
└─pmem0p3 259:8    0   4G  0 part

$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B334CBC6-1C56-47DF-8981-770C866CEABE

Device           Start      End  Sectors Size Type
/dev/pmem0p1      2048  8388607  8386560   4G Linux filesystem
/dev/pmem0p2   8388608 25165823 16777216   8G Linux filesystem
/dev/pmem0p3  25165824 33552383  8386560   4G Linux filesystem
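Because parted, lsblk, and fdisk report the same partitions in different units, it can be useful to convert the fdisk sector numbers by hand. The Python sketch below takes the start and end sectors from the fdisk output above and prints each partition's size in both unit systems, along with whether its start falls on a 2 MiB boundary. The function names are illustrative only.

```python
SECTOR = 512
MIB = 1 << 20
GIB = 1 << 30


def part_bytes(first_sector, last_sector):
    """Size in bytes of a partition given its inclusive sector range."""
    return (last_sector - first_sector + 1) * SECTOR


def gb(nbytes):
    """Bytes to SI gigabytes (powers of 1000), as parted prints them."""
    return nbytes / 10**9


def gib(nbytes):
    """Bytes to binary gibibytes (powers of 1024), as lsblk prints them."""
    return nbytes / GIB


disk = 17179869184
print(f"disk: {gb(disk):.1f} GB == {gib(disk):.0f} GiB")

# (start, end) sectors copied from the fdisk listing.
partitions = {
    "pmem0p1": (2048, 8388607),
    "pmem0p2": (8388608, 25165823),
    "pmem0p3": (25165824, 33552383),
}

for name, (first, last) in partitions.items():
    size = part_bytes(first, last)
    aligned = (first * SECTOR) % (2 * MIB) == 0
    print(f"{name}: {gb(size):.3f} GB, {size // MIB} MiB, "
          f"2 MiB-aligned start: {aligned}")
```

The first and last partitions come out at 4095 MiB because of the 1 MiB margins, which lsblk rounds to 4G, and the first partition starts at 1 MiB rather than on a 2 MiB boundary.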
You may place any filesystem (e.g., ext4, xfs, btrfs) on a raw or memory device (e.g., /dev/pmem0), a partition on a raw or memory device (e.g., /dev/pmem0p1), a sector device (e.g., /dev/pmem0s), or a partition on a sector device (e.g., /dev/pmem0sp1).
ext4 and xfs support DAX, which allows applications to access persistent memory directly with mmap(). You may use DAX on raw devices and memory devices, but not on sector devices.
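From the application side, DAX access looks like ordinary mmap() use: map a file with MAP_SHARED, store to it, and flush. The Python sketch below shows that pattern with the standard mmap module. The function names and the file name are made up for this example, and it writes to the system temporary directory so that it runs anywhere. To exercise DAX, point it at a file on a filesystem mounted with -o dax, such as the /mnt/ext4-pmem0 mount used on this page. On a non-DAX filesystem the same code runs but goes through the page cache.

```python
import mmap
import os
import tempfile


def write_via_mmap(path, payload, length=4096):
    """Map `length` bytes of `path` shared, store `payload`, and flush."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, length)  # the file must be as large as the mapping
        with mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as region:
            region[:len(payload)] = payload  # plain stores into the mapping
            region.flush()                   # msync() the mapped range
    finally:
        os.close(fd)


def read_via_mmap(path, count):
    """Return the first `count` bytes of `path` through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
            return bytes(region[:count])


# Hypothetical file name; replace the directory with a DAX mount point
# (for example /mnt/ext4-pmem0) to test against persistent memory.
path = os.path.join(tempfile.gettempdir(), "pmem-example.dat")
write_via_mmap(path, b"hello pmem")
print(read_via_mmap(path, 10))
```

This only demonstrates the mapping pattern. It makes no claim about which cache-flushing instructions the kernel or a persistent memory library would use underneath.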
Example creating ext4, xfs, and btrfs filesystems on three partitions and mounting ext4 and xfs with DAX (note: df -h displays sizes in IEC binary units; df -H uses SI decimal units):
$ mkfs.ext4 -F /dev/pmem0p1
$ mkfs.xfs -f /dev/pmem0p2
$ mkfs.btrfs -f /dev/pmem0p3
$ mount -o dax /dev/pmem0p1 /mnt/ext4-pmem0
$ mount -o dax /dev/pmem0p2 /mnt/xfs-pmem0
$ mount /dev/pmem0p3 /mnt/btrfs-pmem0

$ lsblk
NAME      MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0     259:0    0  16G  0 disk
├─pmem0p1 259:4    0   4G  0 part /mnt/ext4-pmem0
├─pmem0p2 259:5    0   8G  0 part /mnt/xfs-pmem0
└─pmem0p3 259:8    0   4G  0 part /mnt/btrfs-pmem0

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/pmem0p1    3.9G  8.0M  3.7G   1% /mnt/ext4-pmem0
/dev/pmem0p2    8.0G   33M  8.0G   1% /mnt/xfs-pmem0
/dev/pmem0p3    4.0G   17M  3.8G   1% /mnt/btrfs-pmem0

$ df -H
Filesystem      Size  Used Avail Use% Mounted on
/dev/pmem0p1    4.2G  8.4M  4.0G   1% /mnt/ext4-pmem0
/dev/pmem0p2    8.6G   34M  8.6G   1% /mnt/xfs-pmem0
/dev/pmem0p3    4.3G   17M  4.1G   1% /mnt/btrfs-pmem0
iostats are disabled by default due to performance overhead (e.g., 12M IOPS dropping 25% to 9M IOPS). However, they can be enabled in sysfs if desired.
As of kernel 4.5, iostats are only collected for the base pmem device, not per-partition. Also, I/Os that go through DAX paths (the rw_page, rw_bytes, and direct_access functions) are not counted, so no statistics are collected for accesses that take those paths.
$ echo 1 > /sys/block/pmem0/queue/iostats
$ echo 1 > /sys/block/pmem1/queue/iostats
$ echo 1 > /sys/block/pmem2/queue/iostats
$ echo 1 > /sys/block/pmem3/queue/iostats

$ iostat -mxy 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.53    0.00   78.47    0.00    0.00    0.00

Device:  rrqm/s  wrqm/s         r/s   w/s     rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
pmem0      0.00    0.00  4706551.00  0.00  18384.95   0.00     8.00     6.00   0.00    0.00    0.00   0.00 113.90
pmem1      0.00    0.00  4701492.00  0.00  18365.20   0.00     8.00     6.01   0.00    0.00    0.00   0.00 119.30
pmem2      0.00    0.00  4701851.00  0.00  18366.60   0.00     8.00     6.37   0.00    0.00    0.00   0.00 108.90
pmem3      0.00    0.00  4688767.00  0.00  18315.50   0.00     8.00     6.43   0.00    0.00    0.00   0.00 117.40
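The iostat columns above are easy to cross-check. An avgrq-sz of 8.00 sectors is 4 KiB per request, so the read rate multiplied by 4 KiB should reproduce the rMB/s column. The Python sketch below does that and totals the IOPS across the four devices. It assumes that iostat's "MB" here means 2^20 bytes, which is an interpretation based on how well the figures agree.

```python
SECTOR = 512
AVGRQ_SECTORS = 8                 # avgrq-sz column
REQUEST_BYTES = AVGRQ_SECTORS * SECTOR

# r/s and rMB/s columns copied from the iostat output.
reads_per_sec = {
    "pmem0": 4706551.00,
    "pmem1": 4701492.00,
    "pmem2": 4701851.00,
    "pmem3": 4688767.00,
}
reported_mb_per_sec = {
    "pmem0": 18384.95,
    "pmem1": 18365.20,
    "pmem2": 18366.60,
    "pmem3": 18315.50,
}


def derived_mib_per_sec(iops, request_bytes=REQUEST_BYTES):
    """Throughput implied by an IOPS figure, in units of 2**20 bytes/s."""
    return iops * request_bytes / 2**20


for dev, iops in reads_per_sec.items():
    derived = derived_mib_per_sec(iops)
    print(f"{dev}: {iops / 1e6:.2f} M IOPS, derived {derived:.2f}, "
          f"reported {reported_mb_per_sec[dev]:.2f}")

total_iops = sum(reads_per_sec.values())
print(f"total: {total_iops / 1e6:.1f} M IOPS")
```

The derived and reported throughput agree to within a few hundredths, and the four devices together sustain roughly 18.8 million reads per second in this sample.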
Example fio script to perform 4 KiB random reads to four pmem devices:
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=256k # for bandwidth
bs=4k # for IOPS and latency
iodepth=1
runtime=30
time_based=1
group_reporting
thread
gtod_reduce=0 # for latency
gtod_reduce=1 # IOPS and bandwidth
zero_buffers

## local CPU
numjobs=9 # for bandwidth
numjobs=1 # for latency
numjobs=18 # for IOPS
cpus_allowed_policy=split
rw=randwrite
rw=randread

# CPU affinity based on two 18-core CPUs with QPI snoop configuration of cluster-on-die
[drive_0]
filename=/dev/pmem0
cpus_allowed=0-8,36-44

[drive_1]
filename=/dev/pmem1
cpus_allowed=9-17,45-53

[drive_2]
filename=/dev/pmem2
cpus_allowed=18-26,54-62

[drive_3]
filename=/dev/pmem3
cpus_allowed=27-35,63-71
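The cpus_allowed values in the script above follow a regular pattern: the 36 physical cores are split evenly across the four devices, and each range is paired with a second range offset by 36. The Python sketch below regenerates those strings for a different core or device count. It assumes that the hyperthread sibling of CPU n is CPU n + 36, which is inferred from the ranges in the script. CPU numbering differs between systems, so check the real topology (for example with lscpu) before reusing it. The function name is made up for this example.

```python
def cpu_ranges(devices, cores_per_socket=18, sockets=2):
    """Build one cpus_allowed string per device.

    Assumes logical CPUs 0..N-1 are the physical cores and that the
    hyperthread sibling of CPU n is CPU n + N (N = total physical cores).
    """
    physical = cores_per_socket * sockets
    share = physical // devices           # physical cores per device
    ranges = []
    for index in range(devices):
        first = index * share
        last = first + share - 1
        ranges.append(
            f"{first}-{last},{first + physical}-{last + physical}"
        )
    return ranges


# Print fio job sections for four pmem devices.
for index, cpus in enumerate(cpu_ranges(4)):
    print(f"[drive_{index}]")
    print(f"filename=/dev/pmem{index}")
    print(f"cpus_allowed={cpus}")
    print()
```

With the defaults this reproduces the four cpus_allowed lines used in the script.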
When using /dev/dax character devices, you must specify the size, because character devices do not have a size.
Example fio script to perform 4 KiB random reads to four /dev/dax character devices:
[global]
ioengine=mmap
pre_read=1
norandommap
randrepeat=0
bs=4k
iodepth=1
runtime=60000
time_based=1
group_reporting
thread
gtod_reduce=1 # reduce=1 except for latency test
zero_buffers
size=2G
numjobs=36
cpus_allowed=0-17,36-53
cpus_allowed_policy=split

[drive_0]
filename=/dev/dax0.0
rw=randread

[drive_1]
filename=/dev/dax1.0
rw=randread

[drive_2]
filename=/dev/dax2.0
rw=randread

[drive_3]
filename=/dev/dax3.0
rw=randread