2mib_fs_dax

Overview

In recent Linux kernels, filesystem DAX supports 2 MiB hugepage faults in addition to the standard 4 KiB page faults. This means that each filesystem DAX page fault can map either 4 KiB or 2 MiB of persistent memory into userspace.

Servicing page faults with 2 MiB hugepage mappings instead of 4 KiB mappings has several advantages. It will result in fewer page faults (a single 2 MiB hugepage fault instead of 512 page faults at 4 KiB), smaller page tables and less TLB contention. The end result of using filesystem DAX hugepages is reduced memory usage and increased performance.
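
To make the fault-count arithmetic above concrete, here is a minimal sketch. The helper name faults_needed is hypothetical, and it assumes one fault populates exactly one 4 KiB page or one 2 MiB hugepage.

```python
PAGE_SIZE = 4 * 1024          # standard 4 KiB page
PMD_SIZE = 2 * 1024 * 1024    # 2 MiB hugepage


def faults_needed(mapping_bytes: int, fault_bytes: int) -> int:
    """Faults needed to populate a mapping, one fault per unit, rounded up.

    Hypothetical helper for illustration only.
    """
    return -(-mapping_bytes // fault_bytes)


if __name__ == "__main__":
    one_gib = 1 << 30
    print(faults_needed(PMD_SIZE, PAGE_SIZE))   # 4 KiB faults per 2 MiB
    print(faults_needed(one_gib, PAGE_SIZE))    # 4 KiB faults per 1 GiB
    print(faults_needed(one_gib, PMD_SIZE))     # 2 MiB faults per 1 GiB
```

For a 1 GiB file this works out to 262144 faults at 4 KiB versus 512 faults at 2 MiB.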

However, for filesystem DAX to be able to use 2 MiB hugepages several things have to happen:

  1. Our mmap() mapping has to be at least 2 MiB in size.
  2. Our filesystem block allocation has to be at least 2 MiB in size.
  3. Our filesystem block allocation has to have the same alignment as our mmap().

The first of these, the size of our mmap() region, is the most easily controlled. The filesystem block allocations are trickier. Luckily, the two filesystems that support filesystem DAX, ext4 and XFS, both support requesting specific filesystem block allocation alignments and sizes. This feature was introduced in support of RAID, but we can use it equally well for filesystem DAX.
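
The three requirements above can be written down as a simple check. This is only a simplified model of the list, not the kernel's actual fault path; the function and parameter names are hypothetical.

```python
PMD_SIZE = 2 * 1024 * 1024  # 2 MiB


def pmd_requirements(map_start, map_len, alloc_start, alloc_len):
    """Evaluate the three listed requirements for one mapping.

    map_start / map_len:     the mmap() region (virtual address, bytes)
    alloc_start / alloc_len: the backing block allocation
                             (physical address, bytes)
    All names here are assumptions made for this sketch.
    """
    return {
        # 1. the mmap() mapping is at least 2 MiB in size
        "mmap_size": map_len >= PMD_SIZE,
        # 2. the block allocation is at least 2 MiB in size
        "alloc_size": alloc_len >= PMD_SIZE,
        # 3. both sit at the same offset within a 2 MiB boundary
        "same_alignment": map_start % PMD_SIZE == alloc_start % PMD_SIZE,
    }


if __name__ == "__main__":
    print(pmd_requirements(0x10200000, 4 * PMD_SIZE,
                           0x140000000, 4 * PMD_SIZE))
```

If any of the three entries is False, we should expect 4 KiB faults for that mapping.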

System Configuration

Here are the steps that I've used to successfully get filesystem DAX PMDs:

1. First, make sure that your namespace is in 'fsdax' mode.

# ndctl list --human
{
  "dev":"namespace0.0",
  "mode":"fsdax",
  "size":"16.73 GiB (17.96 GB)",
  "uuid":"179e5b98-96ee-4988-ba9f-ed9383d11598",
  "sector_size":512,
  "blockdev":"pmem0",
  "numa_node":0
}

2. Next, make sure that our persistent memory block device starts at a 2 MiB aligned physical address.

This is important because when we ask the filesystem for 2 MiB aligned and sized block allocations it will provide those block allocations relative to the beginning of its block device. If the filesystem is built on top of a namespace whose data starts at a 1 MiB aligned offset, for example, a block allocation that is 2 MiB aligned from the point of view of the filesystem will still be only 1 MiB aligned from DAX's point of view. This will cause DAX to fall back to 4 KiB page faults.

We can find the alignment of the persistent memory namespaces by looking at /proc/iomem, among other places:

# cat /proc/iomem
...
140000000-57fdfffff : Persistent Memory
  140000000-57fdfffff : namespace0.0

Our namespace in this case begins at 5 GiB (0x1 4000 0000), which is 2 MiB (0x20 0000) aligned.
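
The same alignment check can be scripted. The sketch below parses a line in the /proc/iomem format shown above and tests the start address against 2 MiB; the helper names are hypothetical.

```python
from typing import Tuple

PMD_SIZE = 2 * 1024 * 1024  # 2 MiB


def iomem_range(line: str) -> Tuple[int, int]:
    """Parse 'START-END : name' (hex, as in /proc/iomem) into integers."""
    span = line.split(":", 1)[0].strip()
    start_hex, end_hex = span.split("-")
    return int(start_hex, 16), int(end_hex, 16)


def is_pmd_aligned(addr: int) -> bool:
    """True if addr is a multiple of 2 MiB."""
    return addr % PMD_SIZE == 0


if __name__ == "__main__":
    start, _end = iomem_range("  140000000-57fdfffff : namespace0.0")
    print(hex(start), start / (1 << 30), is_pmd_aligned(start))
```

A namespace that started at 0x140100000 instead would fail this check, since that address is only 1 MiB aligned.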

It is recommended to use raw devices and create multiple namespaces if the system configuration calls for persistent memory to be provisioned into smaller volumes. This is because namespace alignment is enforced at namespace creation time, whereas partitions need to be created by tooling that is careful to align both the start of the namespace and the start of each partition. Long term, pmem device partitioning is scheduled for deprecation in favor of requiring namespaces for all provisioning.

If we do create partitions on top of our PMEM namespace, we must ensure that those partitions are likewise 2 MiB aligned. By default fdisk will create partitions that are 1 MiB (2048 sector) aligned from the start of the parent block device:

# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16.7 GiB, 17964204032 bytes, 35086336 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xfd17c8f9

Device       Boot Start      End  Sectors  Size Id Type
/dev/pmem0p1       2048 35086335 35084288 16.7G 83 Linux

A filesystem built on top of this partition won't be able to provide DAX with 2 MiB aligned block allocations. We instead need to have our partition begin at a 2 MiB aligned boundary:

# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16.7 GiB, 17964204032 bytes, 35086336 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xfd17c8f9

Device       Boot Start      End  Sectors  Size Id Type
/dev/pmem0p1       4096 35086335 35082240 16.7G 83 Linux
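
To see why a start sector of 2048 fails and 4096 works, here is a small sketch of the sector arithmetic. It assumes 512-byte sectors, as in the fdisk output above, and a namespace that is itself 2 MiB aligned; the function names are hypothetical.

```python
SECTOR_SIZE = 512            # bytes, as reported by fdisk above
PMD_SIZE = 2 * 1024 * 1024   # 2 MiB


def partition_start_aligned(start_sector: int,
                            sector_size: int = SECTOR_SIZE) -> bool:
    """True if the partition's byte offset is a multiple of 2 MiB."""
    return (start_sector * sector_size) % PMD_SIZE == 0


def round_up_sector(start_sector: int,
                    sector_size: int = SECTOR_SIZE) -> int:
    """Smallest 2 MiB-aligned start sector >= start_sector."""
    sectors_per_pmd = PMD_SIZE // sector_size
    return -(-start_sector // sectors_per_pmd) * sectors_per_pmd


if __name__ == "__main__":
    for sector in (2048, 4096):
        print(sector, partition_start_aligned(sector),
              round_up_sector(sector))
```

With 512-byte sectors, any start sector that is a multiple of 4096 satisfies the check.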

3. Once we have a block device that starts at a 2 MiB aligned persistent memory address, we then need to create a filesystem on top of it that will give us 2 MiB aligned and sized block allocations. Here are the commands to do that with either ext4 or XFS:

ext4:

# mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0

XFS:

# mkfs.xfs -f -d su=2m,sw=1 -m reflink=0 /dev/pmem0
# mount /dev/pmem0 /mnt/dax
# xfs_io -c "extsize 2m" /mnt/dax

Please refer to the man pages for mkfs.ext4(8), mkfs.xfs(8) and xfs_io(8) for more details.
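
The stride=512 value in the ext4 command follows from the block size: stride is expressed in filesystem blocks, and 512 blocks of 4096 bytes is 2 MiB. A minimal sketch of that arithmetic, with hypothetical helper names, in case a different block size is used:

```python
PMD_SIZE = 2 * 1024 * 1024  # 2 MiB


def ext4_stride(block_size: int = 4096) -> int:
    """Stride (in filesystem blocks) that covers exactly 2 MiB."""
    if block_size <= 0 or PMD_SIZE % block_size:
        raise ValueError("block size must evenly divide 2 MiB")
    return PMD_SIZE // block_size


def xfs_stripe_bytes(su_bytes: int, sw: int) -> int:
    """Stripe width in bytes: stripe unit (su) times stripe count (sw)."""
    return su_bytes * sw


if __name__ == "__main__":
    print(ext4_stride(4096))                   # value passed to -E stride=
    print(xfs_stripe_bytes(PMD_SIZE, 1))       # su=2m,sw=1 in bytes
```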

4. Now that we have a filesystem that can give us 2 MiB sized and aligned block allocations we just need to create a file that will receive those allocations. To do this we need to begin with a file that is at least 2 MiB in size. We can do this with truncate(1), ftruncate(2), fallocate(1), posix_fallocate(3), etc. For example:

# fallocate --length 1G /mnt/dax/data

or

# truncate --size 1G /mnt/dax/data
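
The same thing can be done from a program. Below is a minimal sketch that sizes a file with ftruncate(2), maps it shared, and writes one byte into each 2 MiB chunk so that every chunk takes a fault. The function name is hypothetical, and nothing in it is DAX specific: on a non-DAX filesystem it still runs, just with ordinary page cache mappings.

```python
import mmap
import os

PMD_SIZE = 2 * 1024 * 1024  # 2 MiB


def create_and_touch(path: str, size: int) -> int:
    """Size a file, mmap() it shared, and write one byte per 2 MiB chunk.

    Returns the number of chunks touched. Hypothetical helper.
    """
    if size < PMD_SIZE or size % PMD_SIZE:
        raise ValueError("size should be a non-zero multiple of 2 MiB")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, size)          # same effect as truncate(1)
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            touched = 0
            for off in range(0, size, PMD_SIZE):
                m[off] = 1              # first write into this chunk faults
                touched += 1
            m.flush()                   # msync() the mapping
    finally:
        os.close(fd)
    return touched
```

Calling create_and_touch("/mnt/dax/data", 1 << 30) on the filesystem set up above would touch 512 chunks. Whether those faults are actually serviced as 2 MiB faults still has to be verified separately.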

Verifying Results

Once we have a system that is capable of giving us 2 MiB filesystem DAX faults, we probably want to verify that we are actually succeeding in using faults of that size.

The way that I normally do this is by looking at the filesystem DAX tracepoints:

# cd /sys/kernel/debug/tracing
# echo 1 > events/fs_dax/dax_pmd_fault_done/enable 
<run test which faults in filesystem DAX mappings>

We can then look at the dax_pmd_fault_done events in

/sys/kernel/debug/tracing/trace

and see whether they were successful. An event that successfully faulted in a filesystem DAX PMD looks like this:

big-1434  [008] ....  1502.341229: dax_pmd_fault_done: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x305 max_pgoff 0x1400 NOPAGE

The first thing to look at is the NOPAGE return value at the end of the line. This means that the fault succeeded and didn't return a page cache page, which is expected for DAX. A 2 MiB fault that failed and fell back to 4 KiB DAX faults will instead look like this:

small-1431  [008] ....  1499.402672: dax_pmd_fault_done: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x3ffff FALLBACK

You can see that this fault resulted in a fallback to 4 KiB faults via the FALLBACK return code at the end of the line. The rest of the data in this line can help you determine why the fallback happened. In this case it was because I intentionally created an mmap() area that was smaller than 2 MiB.
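
As a rough aid for reading these events, the sketch below pulls the numeric fields out of a dax_pmd_fault_done line and checks one thing: whether the 2 MiB-aligned range around the faulting address lies entirely inside the mapping (vm_start to vm_end). This is only one of the possible reasons for a fallback, and the function names are hypothetical. The two event strings are the ones shown above.

```python
import re

PMD_SIZE = 2 * 1024 * 1024  # 2 MiB

FIELD_RE = re.compile(
    r"\b(address|vm_start|vm_end|pgoff|max_pgoff) (0x[0-9a-fA-F]+)")

SUCCESS_EVENT = (
    "big-1434  [008] ....  1502.341229: dax_pmd_fault_done: dev 259:0 "
    "ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 "
    "vm_start 0x10200000 vm_end 0x10700000 pgoff 0x305 max_pgoff 0x1400 "
    "NOPAGE")
FALLBACK_EVENT = (
    "small-1431  [008] ....  1499.402672: dax_pmd_fault_done: dev 259:0 "
    "ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 "
    "vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x3ffff "
    "FALLBACK")


def parse_event(line: str) -> dict:
    """Extract the hex fields and the trailing result from one event."""
    text = " ".join(line.split())      # tolerate wrapped trace lines
    fields = {name: int(value, 16) for name, value in FIELD_RE.findall(text)}
    fields["result"] = text.split(" ")[-1]
    return fields


def pmd_range_fits(ev: dict) -> bool:
    """True if the 2 MiB chunk around the fault lies within the mapping."""
    pmd_addr = ev["address"] & ~(PMD_SIZE - 1)
    return ev["vm_start"] <= pmd_addr and pmd_addr + PMD_SIZE <= ev["vm_end"]


if __name__ == "__main__":
    for line in (SUCCESS_EVENT, FALLBACK_EVENT):
        ev = parse_event(line)
        print(ev["result"], hex(ev["address"]), pmd_range_fits(ev))
```

For the FALLBACK event, the 2 MiB chunk containing the faulting address runs from 0x10400000 to 0x10600000, which extends past vm_end (0x10500000), so the check fails. For the NOPAGE event the chunk fits within the mapping.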

2mib_fs_dax.1588379576.txt.gz · Last modified: 2020/05/02 00:32 by Dan Williams