
Persistent Memory


These pages contain instructions, links, and other information related to persistent memory enabling in Linux, grouped under:

  • Blogs
  • Industry specifications
  • Vendor-specific tools and specifications
  • Subtopics

Quick Setup Guide


One interesting use of the PMEM driver is to allow users to begin developing software using DAX, which was upstreamed in v4.0. On a non-NFIT system this can be done by using the memmap kernel command-line parameter to manually create a type 12 memory region for the pmem driver.

Here are the additions I made for my system with 32 GiB of RAM:

1) Reserve 16 GiB of memory via the “memmap” kernel parameter in grub's menu.lst, using PMEM's new “!” specifier:

memmap=16G!16G

The documentation for this parameter can be found here: https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt

Also see: How to choose the correct memmap kernel parameter for PMEM on your system.
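
For example, on a distribution that uses GRUB 2 rather than a legacy menu.lst, the parameter is typically added to GRUB_CMDLINE_LINUX and the boot configuration regenerated. This is only a sketch; file locations and the mkconfig command vary by distribution:

$ sudo vi /etc/default/grub                      # append memmap=16G!16G to GRUB_CMDLINE_LINUX
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # on other distributions: grub-mkconfig or update-grub
$ sudo reboot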

2) Set up the correct kernel configuration options for PMEM and DAX in .config.

Options in make menuconfig:

  • Device Drivers - NVDIMM (Non-Volatile Memory Device) Support
    • PMEM: Persistent memory block device support
    • BLK: Block data window (aperture) device support
    • BTT: Block Translation Table (atomic sector updates)
  • Enable the block layer
    • Block device DAX support <not available in kernel-4.5 due to page cache issues>
  • File systems
    • Direct Access (DAX) support
  • Processor type and features
    • Support non-standard NVDIMMs and ADR protected memory <if using the memmap kernel parameter>

The corresponding options in .config:

CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_FS_DAX=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_ARCH_HAS_PMEM_API=y
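
If you are unsure whether a kernel already has these options enabled, the installed config can usually be checked; a quick sketch (the config file location varies by distribution, and some kernels expose /proc/config.gz instead):

$ grep -E 'DAX|PMEM|LIBNVDIMM|BTT' /boot/config-$(uname -r)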

This configuration gave me one pmem device with 16 GiB of space:

$ fdisk -l /dev/pmem0

Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

lsblk shows the block devices, including pmem devices. Examples:

$ lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
pmem0                  259:0    0    16G  0 disk
├─pmem0p1              259:6    0     4G  0 part /mnt/ext4-pmem0
└─pmem0p2              259:7    0  11.9G  0 part /mnt/btrfs-pmem0
pmem1                  259:1    0    16G  0 disk /mnt/xfs-pmem1
pmem2                  259:2    0    16G  0 disk /mnt/xfs-pmem2
pmem3                  259:3    0    16G  0 disk /mnt/xfs-pmem3
$ lsblk -t
NAME                   ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
pmem0                          0   4096      0    4096     512    0           128 128    0B
pmem1                          0   4096      0    4096     512    0           128 128    0B
pmem2                          0   4096      0    4096     512    0           128 128    0B
pmem3                          0   4096      0    4096     512    0           128 128    0B

Namespaces


You can divide persistent memory address ranges into namespaces with ndctl. This stores namespace label metadata at the beginning of the persistent memory address range.
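
To see the regions available for namespaces, and any namespaces already defined, ndctl list can be used; a quick sketch (the --human option just prints sizes in readable units):

$ ndctl list --regions --namespaces --human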

ndctl create-namespace ties a namespace to a block device or character device:

mode     description       device path    label metadata   atomicity   filesystems   DAX   PFN metadata   former name
raw      raw               /dev/pmemN     no               no          yes           no    no             -
sector   sector atomic     /dev/pmemNs    yes              yes         yes           no    no             -
fsdax    filesystem DAX    /dev/pmemN     yes              no          yes           yes   yes            memory
devdax   device DAX        /dev/daxN.M    yes              no          no            yes   yes            dax

For modes with PFN metadata (“struct page” metadata), the overhead is 64 bytes per 4 KiB of persistent memory, i.e., 1/64 of the capacity (see the worked example after this list):

  • e.g., 128 MiB for 8 GiB persistent memory
  • e.g., 16 GiB for 1 TiB persistent memory
  • --map=mem = store the PFN metadata in regular system memory
    • adequate for small persistent memory capacities
  • --map=dev = store the PFN metadata in the persistent memory itself
    • intended for large persistent memory capacities (there might not be enough regular memory in the system!)
    • persistence of the PFN metadata is not important; placing it in persistent memory is simply convenient because that space scales with the persistent memory capacity
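
As a quick check of the 1/64 ratio, the PFN metadata size for a hypothetical 256 GiB namespace can be worked out in the shell:

$ echo "$(( 256 * 1024 / 64 )) MiB"     # 256 GiB expressed in MiB, divided by 64
4096 MiB                                # i.e., 4 GiB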

Sector mode uses a Block Translation Table (BTT) to provide atomic sector updates, helping software that does not expect a sector to contain a mix of old and new data after a power loss that interrupts a write.

Filesystem DAX mode lets a filesystem (e.g., ext4 or xfs) give applications direct access to persistent memory via mmap().

Device DAX mode creates a character device instead of a block device, and is intended for applications that mmap() the entire capacity. It does not support filesystems or interact with the kernel page cache.
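
After creating a devdax namespace (see the examples below), the resulting character device can be inspected with ls and, if the daxctl utility from the ndctl project is installed, with daxctl; a sketch:

$ ls -l /dev/dax0.0
$ daxctl list          # shows the chardev and size of each device DAX instance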

Example commands on an 8 GiB NVDIMM with output showing the resulting sizes and /dev/ device names:

$ sudo ndctl/ndctl create-namespace --mode raw -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"raw",
  "size":"8.00 GiB (8.59 GB)",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

$ sudo ndctl/ndctl create-namespace --mode sector -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"sector",
  "size":"7.99 GiB (8.58 GB)",
  "uuid":"30868a48-9763-4d4d-a6b7-e43dbb165b16",
  "sector_size":4096,
  "blockdev":"pmem0s",
  "numa_node":0
}

$ sudo ndctl/ndctl create-namespace --mode fsdax --map mem -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"mem",
  "size":"8.00 GiB (8.59 GB)",
  "uuid":"f0ab3a91-c5bc-42b2-805f-4fa6c6075a50",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

$ sudo ndctl/ndctl create-namespace --mode fsdax --map dev -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"7.87 GiB (8.45 GB)",
  "uuid":"64f617f3-b79a-4c92-8ca7-c02d05572d3c",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

$ sudo ndctl/ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"mem",
  "size":"8.00 GiB (8.59 GB)",
  "uuid":"7fc2ecfb-edb2-4370-b9e1-09ecbdf7df16",
  "daxregion":{
    "id":0,
    "size":"8.00 GiB (8.59 GB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"8.00 GiB (8.59 GB)"
      }
    ]
  },
  "numa_node":0
}

$ sudo ndctl/ndctl create-namespace --mode devdax --map dev -e namespace0.0 -f
{
  "dev":"namespace0.0",
  "mode":"devdax",
  "map":"dev",
  "size":"7.87 GiB (8.45 GB)",
  "uuid":"47343804-46f5-49d8-a76e-76cc240d8fc7",
  "daxregion":{
    "id":0,
    "size":"7.87 GiB (8.45 GB)",
    "align":2097152,
    "devices":[
      {
        "chardev":"dax0.0",
        "size":"7.87 GiB (8.45 GB)"
      }
    ]
  },
  "numa_node":0
}

When using QEMU (see the Simulating persistent memory configurations using QEMU page), your namespaces will be in raw mode by default. You can use the following bash script to convert all of your raw mode namespaces to fsdax mode:

#!/usr/bin/bash -ex

# Find every raw-mode namespace and reconfigure it as fsdax
# (--mode=memory is the former name for --mode=fsdax).
namespaces=$(ndctl list | jq -r '((. | arrays | .[]), . | objects) | select(.mode == "raw") | .dev')
for n in $namespaces; do
	ndctl create-namespace -f -e "$n" --mode=fsdax
done

This script highlights a tricky thing about ndctl and json. If you have a single namespace, it is returned by ndctl list as a single json object:

# ndctl list
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "size":17834180608,
  "uuid":"830d3440-df00-4e5a-9f89-a951dfb962cd",
  "raw_uuid":"2dbddec6-44cc-41a4-bafd-a4cc3e345e50",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

If you have two or more namespaces, though, they are returned as an array of json objects:

# ndctl list
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "size":17834180608,
    "uuid":"ce92c90c-1707-4a39-abd8-1dd12788d137",
    "raw_uuid":"f8130943-5867-4e84-b2e5-6c685434ef81",
    "sector_size":512,
    "blockdev":"pmem1",
    "numa_node":0
  },
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "size":17834180608,
    "uuid":"33d46163-095a-4bf8-acf0-6dbc5dc8a738",
    "raw_uuid":"8f44ccd3-50f3-4dec-9817-554e9d1a5c5f",
    "sector_size":512,
    "blockdev":"pmem0",
    "numa_node":0
  }
]

Note the outer [ and ] brackets surrounding the objects, which turn the output into an array. The difficulty is that a given jq filter expects to operate either on objects or on an array, but not both, so the exact command you need would otherwise depend on how many namespaces you have.

The script above works around this by first converting the multiple-namespace output from an array of objects into a stream of individual objects:

# ndctl list | jq -r '((. | arrays | .[]), . | objects)'
{
  "dev": "namespace1.0",
  "mode": "fsdax",
  "size": 17834180608,
  "uuid": "ce92c90c-1707-4a39-abd8-1dd12788d137",
  "raw_uuid": "f8130943-5867-4e84-b2e5-6c685434ef81",
  "sector_size": 512,
  "blockdev": "pmem1",
  "numa_node": 0
}
{
  "dev": "namespace0.0",
  "mode": "fsdax",
  "size": 17834180608,
  "uuid": "33d46163-095a-4bf8-acf0-6dbc5dc8a738",
  "raw_uuid": "8f44ccd3-50f3-4dec-9817-554e9d1a5c5f",
  "sector_size": 512,
  "blockdev": "pmem0",
  "numa_node": 0
}

The rest of the jq filter then operates on plain objects, so it works whether there is one namespace or many.
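
An equivalent way to normalize the output is to test the type of the top-level value explicitly; this sketch behaves the same whether ndctl list returns a single object or an array:

# ndctl list | jq -r 'if type == "array" then .[] else . end | select(.mode == "raw") | .dev'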

Partitions


You can divide raw, sector, and fsdax devices (/dev/pmemN and /dev/pmemNs) into partitions. In parted, the mkpart subcommand has this syntax:

mkpart [part-type fs-type name] start end

Although mkpart defaults to 1 MiB alignment, you may want to use 2 MiB alignment to support more efficient page mappings - see https://nvdimm.wiki.kernel.org/2mib_fs_dax.
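
For example, the same style of parted command can keep every boundary on a 2 MiB multiple by starting the first partition at 2MiB and ending the last one at -2MiB; a sketch, not verified on real hardware:

$ parted -s -a optimal /dev/pmem0 \
        mklabel gpt -- \
        mkpart primary ext4 2MiB 4GiB \
        mkpart primary xfs 4GiB 12GiB \
        mkpart primary btrfs 12GiB -2MiB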

Example carving a 16 GiB /dev/pmem0 into 4 GiB, 8 GiB, and 4 GiB partitions (constrained by 1 MiB alignment at the beginning and end) (note: parted displays its outputs using SI decimal units; lsblk uses binary units):

$ parted -s -a optimal /dev/pmem0 \
        mklabel gpt -- \
        mkpart primary ext4 1MiB 4GiB \
        mkpart primary xfs 4GiB 12GiB \
        mkpart primary btrfs 12GiB -1MiB \
        print

Model: Unknown (unknown)
Disk /dev/pmem0: 17.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4295MB  4294MB  ext4         primary
 2      4295MB  12.9GB  8590MB  xfs          primary
 3      12.9GB  17.2GB  4294MB  btrfs        primary

$ lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
pmem0                  259:0    0    16G  0 disk
├─pmem0p1              259:4    0     4G  0 part
├─pmem0p2              259:5    0     8G  0 part
└─pmem0p3              259:8    0     4G  0 part

$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B334CBC6-1C56-47DF-8981-770C866CEABE

Device          Start      End  Sectors Size Type
/dev/pmem0p1     2048  8388607  8386560   4G Linux filesystem
/dev/pmem0p2  8388608 25165823 16777216   8G Linux filesystem
/dev/pmem0p3 25165824 33552383  8386560   4G Linux filesystem

Filesystems


You may place any filesystem (e.g., ext4, xfs, btrfs) on a raw or fsdax device (e.g., /dev/pmem0), a partition on a raw or fsdax device (e.g. /dev/pmem0p1), a sector device (e.g., /dev/pmem0s), or a partition on a sector device (e.g., /dev/pmem0sp1).

ext4 and xfs support DAX, which allows applications to access persistent memory directly with mmap(). You may use DAX on raw devices and fsdax devices, but not on sector devices.

Example creating ext4, xfs, and btrfs filesystems on three partitions and mounting ext4 and xfs with DAX (note: df -h displays sizes in IEC binary units; df -H uses SI decimal units):

$ mkfs.ext4 -F /dev/pmem0p1
$ mkfs.xfs -f /dev/pmem0p2
$ mkfs.btrfs -f /dev/pmem0p3
$ mount -o dax /dev/pmem0p1 /mnt/ext4-pmem0
$ mount -o dax /dev/pmem0p2 /mnt/xfs-pmem0
$ mount /dev/pmem0p3 /mnt/btrfs-pmem0

$ lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
pmem0                  259:0    0    16G  0 disk
├─pmem0p1              259:4    0     4G  0 part /mnt/ext4-pmem0
├─pmem0p2              259:5    0     8G  0 part /mnt/xfs-pmem0
└─pmem0p3              259:8    0     4G  0 part /mnt/btrfs-pmem0

$ df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/pmem0p1                    3.9G  8.0M  3.7G   1% /mnt/ext4-pmem0
/dev/pmem0p2                    8.0G   33M  8.0G   1% /mnt/xfs-pmem0
/dev/pmem0p3                    4.0G   17M  3.8G   1% /mnt/btrfs-pmem0

$ df -H
Filesystem                      Size  Used Avail Use% Mounted on
/dev/pmem0p1                    4.2G  8.4M  4.0G   1% /mnt/ext4-pmem0
/dev/pmem0p2                    8.6G   34M  8.6G   1% /mnt/xfs-pmem0
/dev/pmem0p3                    4.3G   17M  4.1G   1% /mnt/btrfs-pmem0
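
To confirm that a filesystem really was mounted with DAX, the active mount options can be checked; a quick sketch (kernels of this era also print a DAX-related message at mount time that shows up in dmesg):

$ mount | grep pmem0p          # mounts done with -o dax show dax in the option list
$ dmesg | grep -i dax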

iostats


iostats are disabled by default due to performance overhead (e.g., 12M IOPS dropping 25% to 9M IOPS). However, they can be enabled in sysfs if desired.

As of kernel 4.5, iostats are only collected for the base pmem device, not per-partition. Also, I/Os that go through DAX paths (rw_page, rw_bytes, and direct_access functions) are not counted, so nothing is collected for:

  • I/O to files in filesystems mounted with -o dax
  • I/O to raw block devices if CONFIG_BLOCK_DAX is enabled

Example enabling iostats in sysfs for four pmem devices:

$ echo 1 > /sys/block/pmem0/queue/iostats
$ echo 1 > /sys/block/pmem1/queue/iostats
$ echo 1 > /sys/block/pmem2/queue/iostats
$ echo 1 > /sys/block/pmem3/queue/iostats

$ iostat -mxy 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          21.53    0.00   78.47    0.00    0.00    0.00

Device:         rrqm/s   wrqm/s        r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
pmem0             0.00     0.00 4706551.00    0.00 18384.95     0.00     8.00     6.00    0.00    0.00    0.00   0.00 113.90
pmem1             0.00     0.00 4701492.00    0.00 18365.20     0.00     8.00     6.01    0.00    0.00    0.00   0.00 119.30
pmem2             0.00     0.00 4701851.00    0.00 18366.60     0.00     8.00     6.37    0.00    0.00    0.00   0.00 108.90
pmem3             0.00     0.00 4688767.00    0.00 18315.50     0.00     8.00     6.43    0.00    0.00    0.00   0.00 117.40

fio


Example fio script to perform 4 KiB random reads to four pmem devices:

[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=256k         # for bandwidth
bs=4k           # for IOPS and latency
iodepth=1
runtime=30
time_based=1
group_reporting
thread
gtod_reduce=0   # for latency
gtod_reduce=1   # for IOPS and bandwidth
zero_buffers

## local CPU
numjobs=9       # for bandwidth
numjobs=1       # for latency
numjobs=18      # for IOPS
cpus_allowed_policy=split

rw=randwrite
rw=randread

# CPU affinity based on two 18-core CPUs with QPI snoop configuration of cluster-on-die

[drive_0]
filename=/dev/pmem0
cpus_allowed=0-8,36-44

[drive_1]
filename=/dev/pmem1
cpus_allowed=9-17,45-53

[drive_2]
filename=/dev/pmem2
cpus_allowed=18-26,54-62

[drive_3]
filename=/dev/pmem3
cpus_allowed=27-35,63-71

When using /dev/dax character devices, you must specify the size in the fio job file, because character devices do not report a size.

Example fio script to perform 4 KiB random reads to four /dev/dax character devices:

[global]
ioengine=mmap
pre_read=1
norandommap
randrepeat=0
bs=4k
iodepth=1
runtime=60000
time_based=1
group_reporting
thread
gtod_reduce=1   # reduce=1 except for latency test
zero_buffers
size=2G

numjobs=36

cpus_allowed=0-17,36-53
cpus_allowed_policy=split

[drive_0]
filename=/dev/dax0.0
rw=randread

[drive_1]
filename=/dev/dax1.0
rw=randread

[drive_2]
filename=/dev/dax2.0
rw=randread

[drive_3]
filename=/dev/dax3.0
rw=randread
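
Either job file can then be run with fio directly; the file names below are just placeholders for wherever you saved the scripts:

$ fio pmem-randread.fio
$ fio devdax-randread.fio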