In recent Linux kernels filesystem DAX supports 2 MiB hugepage faults in addition to the standard 4 KiB page faults. This means that for each filesystem DAX page fault we can map either 4 KiB or 2 MiB worth of persistent memory into userspace.
Servicing page faults with 2 MiB hugepage mappings instead of 4 KiB mappings has several advantages. It will result in fewer page faults (a single 2 MiB hugepage fault instead of 512 page faults at 4 KiB), smaller page tables and less TLB contention. The end result of using filesystem DAX hugepages is reduced memory usage and increased performance.
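The 512:1 ratio above is just the size ratio between the two mapping granularities, which we can confirm with a bit of shell arithmetic:

```shell
pmd_size=$(( 2 * 1024 * 1024 ))   # one PMD-mapped 2 MiB hugepage
pte_size=4096                     # one standard 4 KiB page
echo "4 KiB faults replaced by a single 2 MiB fault: $(( pmd_size / pte_size ))"
```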
However, for filesystem DAX to be able to use 2 MiB hugepages, several things have to happen:

1. The mmap() region in our application must be at least 2 MiB in size.

2. The filesystem must allocate blocks for our file that are 2 MiB aligned and at least 2 MiB in size.
The first of these, the size of our mmap() region, is the most easily controlled. The filesystem block allocations, though, are a bit more tricky. Luckily the two filesystems that support filesystem DAX, ext4 and XFS, each have support for requesting specific filesystem block allocation alignments and sizes. This feature was introduced in support of RAID, but we can use it equally well for filesystem DAX.
Here are the steps that I've used to successfully get filesystem DAX PMDs:
1. First, make sure that our persistent memory block device starts at a 2 MiB aligned physical address.
This is important because when we ask the filesystem for 2 MiB aligned and sized block allocations it will provide those block allocations relative to the beginning of its block device. If the filesystem is built on top of a namespace whose data starts at a 1 MiB aligned offset, for example, a block allocation that is 2 MiB aligned from the point of view of the filesystem will still be only 1 MiB aligned from DAX's point of view. This will cause DAX to fall back to 4 KiB page faults.
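This offset arithmetic is easy to demonstrate. Assuming a namespace whose data starts 1 MiB into physical memory, even a perfectly 2 MiB aligned block offset within the filesystem lands at a physical address that is only 1 MiB aligned:

```shell
ns_offset=$(( 1024 * 1024 ))        # namespace data starts 1 MiB into physical memory
fs_block=$(( 4 * 2 * 1024 * 1024 )) # a block offset that is 2 MiB aligned within the filesystem
phys=$(( ns_offset + fs_block ))
echo "physical offset modulo 2 MiB: $(( phys % (2 * 1024 * 1024) ))"
```

A nonzero remainder means the block is not 2 MiB aligned in physical memory, so DAX falls back to 4 KiB faults.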
We can find the alignment of the persistent memory namespaces by looking at /proc/iomem, among other places:
# cat /proc/iomem
...
140000000-57fffffff : Persistent Memory
  140000000-57fffffff : namespace0.0
Our namespace in this case begins at 5 GiB (0x1 4000 0000), which is 2 MiB (0x20 0000) aligned.
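We can verify the alignment claim with shell arithmetic on the start address from /proc/iomem:

```shell
start=$(( 0x140000000 ))   # namespace0.0 start address from /proc/iomem above
align=$(( 0x200000 ))      # 2 MiB
if [ $(( start % align )) -eq 0 ]; then
    echo "2 MiB aligned"
else
    echo "offset into hugepage: $(( start % align )) bytes"
fi
```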
If we create any partitions on top of our PMEM namespace, we must ensure that those partitions are likewise 2 MiB aligned. By default fdisk will create partitions that are 1 MiB (2048 sector) aligned from the start of the parent block device:
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16.8 GiB, 17966301184 bytes, 35090432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x5af75158

Device       Boot Start      End  Sectors  Size Id Type
/dev/pmem0p1       2048 35090431 35088384 16.7G 83 Linux
A filesystem built on top of this partition won't be able to provide DAX with 2 MiB aligned block allocations. We instead need to have our partition begin at a 2 MiB aligned boundary:
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16.8 GiB, 17966301184 bytes, 35090432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0x276da416

Device       Boot Start      End  Sectors  Size Id Type
/dev/pmem0p1       4096 35090431 35086336 16.7G 83 Linux
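The only difference between the two listings is the start sector, which you can set by hand when fdisk prompts for the first sector. A quick sanity check of why 4096 sectors is the right value with 512-byte sectors:

```shell
sector_size=512
for start in 2048 4096; do
    echo "start sector $start -> $(( start * sector_size / 1024 / 1024 )) MiB offset"
done
```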
2. Once we have a block device that starts at a 2 MiB aligned persistent memory address, we then need to create a filesystem on top of it that will give us 2 MiB aligned and sized block allocations. Here are the commands to do that with either ext4 or XFS:
# mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0
# mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
# mount /dev/pmem0 /mnt/dax
# xfs_io -c "extsize 2m" /mnt/dax
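The ext4 stride and the XFS stripe unit express the same 2 MiB target in different units: stride counts filesystem blocks (4096 bytes each, from -b 4096), while su is given directly in bytes. A quick check that stride=512 really corresponds to the 2 MiB stripe unit:

```shell
block=4096    # -b 4096 from the mkfs.ext4 command above
stride=512    # -E stride=512
echo "$(( block * stride )) bytes ($(( block * stride / 1024 / 1024 )) MiB)"
```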
3. Now that we have a filesystem that can give us 2 MiB sized and aligned block allocations we just need to create a file that will receive those allocations. To do this we need to begin with a file that is at least 2 MiB in size. We can do this with truncate(1), ftruncate(2), fallocate(1), posix_fallocate(3), etc. For example:
# fallocate --length 1G /mnt/dax/data
# truncate --size 1G /mnt/dax/data
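After creating the file, it is worth confirming that its size is a multiple of 2 MiB so that every aligned region of the mapping can be PMD-mapped. A minimal sketch of that check (using mktemp so it runs anywhere; this only demonstrates the size arithmetic, since real PMD faults require the file to live on the DAX-mounted filesystem):

```shell
f=$(mktemp)
truncate -s 1G "$f"                 # same effect as the truncate(1) call above
size=$(( $(wc -c < "$f") ))
echo "size: $size bytes, multiple of 2 MiB: $(( size % (2 * 1024 * 1024) == 0 ))"
rm "$f"
```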
Once we have a system that is capable of giving us 2 MiB filesystem DAX faults, we probably want to verify that we are actually succeeding in using faults of that size.
The way that I normally do this is by looking at the filesystem DAX tracepoints:
# cd /sys/kernel/debug/tracing
# echo 1 > events/fs_dax/dax_pmd_fault_done/enable

<run test which faults in filesystem DAX mappings>
We can then look at the dax_pmd_fault_done events in the trace output and see whether they were successful. An event that successfully faulted in a filesystem DAX PMD looks like this:
big-1434  .... 1502.341229: dax_pmd_fault_done: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x305 max_pgoff 0x1400 NOPAGE
The first thing to look at is the NOPAGE return value at the end of the line. This means that the fault succeeded and didn't return a page cache page, which is expected for DAX. A 2 MiB fault that failed and fell back to 4 KiB DAX faults will instead look like this:
small-1431  .... 1499.402672: dax_pmd_fault_done: dev 259:0 ino 0xc shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x3ffff FALLBACK
You can see via the FALLBACK return code at the end of the line that this fault fell back to 4 KiB faults. The rest of the data in the line can help you determine why the fallback happened. In this case it was because I intentionally created an mmap() area that couldn't hold a full 2 MiB mapping at the faulting address: the PMD-sized region containing 0x10420000 spans 0x10400000 through 0x10600000, which extends past vm_end.
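To see how the fields in the FALLBACK line point at the cause, we can align the faulting address down to a 2 MiB boundary and compare the resulting PMD region against vm_end. This is just arithmetic on the values copied from the trace line above, not part of the tracing interface:

```shell
address=$(( 0x10420000 ))   # faulting address from the trace line
vm_end=$(( 0x10500000 ))    # end of the mmap() area
pmd_size=$(( 0x200000 ))    # 2 MiB

pmd_start=$(( address & ~(pmd_size - 1) ))   # align down to a 2 MiB boundary
pmd_end=$(( pmd_start + pmd_size ))
printf 'PMD region: 0x%x - 0x%x\n' "$pmd_start" "$pmd_end"
if [ "$pmd_end" -gt "$vm_end" ]; then
    echo "PMD region extends past vm_end -> FALLBACK"
fi
```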