start [2018/06/29 22:13]
Robert Elliott
start [2020/06/12 15:58]
Ira Weiny [Filesystems]
  * NDCTL: https://github.com/pmem/ndctl.git
  * NDCTL man pages online: http://pmem.io/ndctl/
  * linux-nvdimm Mailing List: https://lists.01.org/postorius/lists/linux-nvdimm.lists.01.org
  * linux-nvdimm Patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
  
Also see: [[How to choose the correct memmap kernel parameter for PMEM on your system|How to choose the correct memmap kernel parameter for PMEM on your system]].
  
2) Set up the correct kernel configuration options for PMEM and DAX in .config. To use huge pages for mmapped files, you need CONFIG_FS_DAX_PMD, which is selected automatically when the prerequisites marked below are enabled.
  
Options in make menuconfig:
  * Processor type and features
    * Support non-standard NVDIMMs and ADR protected memory <if using the memmap kernel parameter>
    * Transparent Hugepage Support <needed for huge pages>
    * Allow for memory hot-add <needed for huge pages>
      * Allow for memory hot remove <needed for huge pages>
    * Device memory (pmem, HMM, etc...) hotplug support <needed for huge pages>
  
<code>
CONFIG_ZONE_DEVICE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_ACPI_NFIT=m
CONFIG_X86_PMEM_LEGACY=m
CONFIG_OF_PMEM=m
CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_BTT=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_FS_DAX=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=m
</code>
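
To double-check an existing .config against the list above, a small script can help. This is a minimal sketch, not part of the kernel tooling: the function names are hypothetical, it only looks at ''CONFIG_*=value'' lines, and the option list simply mirrors the block above, so trim it for your platform (for example, not every architecture can enable ''CONFIG_OF_PMEM'' or ''CONFIG_X86_PMEM_LEGACY'').

<code python>
import re

# Hypothetical list that mirrors the .config block above; adjust for your platform.
REQUIRED_OPTIONS = [
    "CONFIG_ZONE_DEVICE", "CONFIG_MEMORY_HOTPLUG", "CONFIG_MEMORY_HOTREMOVE",
    "CONFIG_TRANSPARENT_HUGEPAGE", "CONFIG_ACPI_NFIT", "CONFIG_X86_PMEM_LEGACY",
    "CONFIG_OF_PMEM", "CONFIG_LIBNVDIMM", "CONFIG_BLK_DEV_PMEM", "CONFIG_BTT",
    "CONFIG_NVDIMM_PFN", "CONFIG_NVDIMM_DAX", "CONFIG_FS_DAX", "CONFIG_DAX",
    "CONFIG_DEV_DAX", "CONFIG_DEV_DAX_PMEM", "CONFIG_DEV_DAX_KMEM",
]

# Matches lines such as "CONFIG_FS_DAX=y"; "# CONFIG_X is not set" lines are skipped.
_SETTING = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(\S+)")


def parse_config(text):
    """Return {option: value} for every CONFIG_*=value line in .config text."""
    settings = {}
    for line in text.splitlines():
        match = _SETTING.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def missing_options(text, required=REQUIRED_OPTIONS):
    """Return the required options that are neither built in (y) nor modular (m)."""
    settings = parse_config(text)
    return [opt for opt in required if settings.get(opt) not in ("y", "m")]
</code>

For example, ''missing_options(open(".config").read())'' lists the options from the block above that are not enabled. ''CONFIG_FS_DAX_PMD'' is left out because it is selected automatically; ''parse_config()'' can still be used to confirm it ended up set. The script does not evaluate Kconfig dependencies.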
  
ndctl create-namespace ties a namespace to a block device or character device:
  
^ mode   ^ description    ^ device path ^ device type ^ label metadata ^ atomicity ^ filesystems ^ DAX ^ PFN metadata ^ former name ^
| raw    | raw            | /dev/pmemN  | block       | no             | no        | yes         | no  | no           |        |
| sector | sector atomic  | /dev/pmemNs | block       | yes            | yes       | yes         | no  | no           |        |
| fsdax  | filesystem DAX | /dev/pmemN  | block       | yes            | no        | yes         | yes | yes          | memory |
| devdax | device DAX     | /dev/daxN.M | character   | yes            | no        | no          | yes | yes          | dax    |
  
-For modes with PFN metadata ("​struct page" metadata), overhead is 64 bytes per 4 KiB of persistent memory. +There are two places to store PFN metadata ("​struct page" metadata):
-  * e.g., 128 MiB for 8 GiB persistent memory +
-  * e.g., 7.45 GiB for 1 TB persistent memory+
   * --map=mem = regular system memory   * --map=mem = regular system memory
     * adequate for small persistent memory capacities     * adequate for small persistent memory capacities
  * --map=dev = persistent memory
    * intended for large persistent memory capacities (there might not be enough regular memory in the system!)
    * persistence of the PFN metadata is not important; storing it in persistent memory is simply convenient because the space scales with the persistent memory capacity
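
As a rough illustration of how the mode table and the --map choice fit together, here is a hypothetical helper that only assembles an ''ndctl create-namespace'' command line from the ''--mode'' and ''--map'' options. It does not run ndctl, and the rule that ''--map'' only applies to fsdax and devdax is a reading of the "PFN metadata" column above, not ndctl's own validation logic.

<code python>
# Mode properties taken from the namespace mode table above.
ALL_MODES = {"raw", "sector", "fsdax", "devdax"}
MODES_WITH_PFN_METADATA = {"fsdax", "devdax"}
MAP_LOCATIONS = {"mem", "dev"}  # mem = regular system memory, dev = persistent memory


def create_namespace_argv(mode, map_location=None):
    """Build (but do not run) an ndctl create-namespace argument list."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode}")
    argv = ["ndctl", "create-namespace", f"--mode={mode}"]
    if map_location is not None:
        # Only modes that carry PFN metadata have a map location to choose.
        if mode not in MODES_WITH_PFN_METADATA:
            raise ValueError(f"mode {mode} has no PFN metadata, so --map does not apply")
        if map_location not in MAP_LOCATIONS:
            raise ValueError(f"unknown map location: {map_location}")
        argv.append(f"--map={map_location}")
    return argv
</code>

''create_namespace_argv("fsdax", "dev")'' yields ''["ndctl", "create-namespace", "--mode=fsdax", "--map=dev"]'', which could be passed to ''subprocess.run()''. See the ndctl man pages for the remaining options (size, region, alignment, and so on).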

The PFN metadata size is 64 bytes per 4 KiB of persistent memory (1.5625%). For some common persistent memory capacities:
^  persistent memory capacity  ^  PFN metadata size  ^ example ^
|  8 GiB    |  128 MiB  | |
|  16 GiB   |  256 MiB  | |
|  96 GiB   |  1.5 GiB  | Six 16 GiB NVDIMMs |
|  128 GiB  |  2 GiB    | |
|  192 GiB  |  3 GiB    | Twelve 16 GiB NVDIMMs |
|  256 GiB  |  4 GiB    | |
|  512 GiB  |  8 GiB    | |
|  768 GiB  |  12 GiB   | Six 128 GiB NVDIMMs |
|  1 TiB    |  16 GiB   | |
|  1.5 TiB  |  24 GiB   | Six 256 GiB NVDIMMs |
|  3 TiB    |  48 GiB   | Six 512 GiB NVDIMMs |
|  6 TiB    |  96 GiB   | Six 1 TiB NVDIMMs, or twelve 512 GiB NVDIMMs |

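The table entries follow directly from the 64-byte-per-4-KiB ratio. The sketch below only reproduces that arithmetic; it assumes capacities are given in bytes and ignores any rounding the kernel may apply for alignment, so treat the results as estimates.

<code python>
KiB = 1024
GiB = 1024 ** 3
TiB = 1024 ** 4

# Each 4 KiB page of persistent memory needs a 64-byte "struct page".
STRUCT_PAGE_BYTES = 64
PAGE_BYTES = 4 * KiB


def pfn_metadata_bytes(capacity_bytes):
    """Return the PFN ("struct page") metadata size for a pmem capacity in bytes."""
    pages = capacity_bytes // PAGE_BYTES
    return pages * STRUCT_PAGE_BYTES


def overhead_percent():
    """Metadata overhead as a percentage of capacity: 100 * 64 / 4096."""
    return 100 * STRUCT_PAGE_BYTES / PAGE_BYTES
</code>

For instance, ''pfn_metadata_bytes(768 * GiB)'' comes to 12 GiB, matching the six 128 GiB NVDIMMs row.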
  
Sector Atomic mode uses a Block Translation Table (BTT) to protect software that does not expect a sector to end up with a mix of old and new data if power is lost while writes are underway.
  
We then structure the rest of the ''jq'' command to operate on normal objects, and it works whether we have one namespace or many.
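
The same normalization can be done outside jq. This Python sketch assumes, as the text above does, that ''ndctl list'' prints a single JSON object for one namespace and an array for several; treating empty output as "no namespaces" is an additional assumption. The ''dev'' and ''blockdev'' field names are what ndctl typically reports for namespaces, but check them against your ndctl version.

<code python>
import json


def as_namespace_list(ndctl_output):
    """Return ndctl list output as a list of namespace objects.

    One namespace gives a single JSON object and several give a JSON array,
    so a lone object is wrapped to let callers always iterate over a list.
    """
    if not ndctl_output.strip():
        return []  # assumption: no output means no namespaces
    data = json.loads(ndctl_output)
    if isinstance(data, dict):
        return [data]
    return list(data)


def namespace_block_devices(ndctl_output):
    """Return (namespace, block device) pairs, skipping entries without "blockdev" (e.g. devdax)."""
    return [(ns.get("dev"), ns["blockdev"])
            for ns in as_namespace_list(ndctl_output)
            if "blockdev" in ns]
</code>

On a live system the input could come from ''subprocess.run(["ndctl", "list", "--namespaces"], capture_output=True, text=True).stdout''.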

==== Persistent Naming ====
----
The device names chosen by the kernel depend on creation order and discovery order, so environments cannot rely on a kernel name being the same from one boot to the next. The names usually do not change while the configuration stays static, but if a permanent name is needed, use /dev/disk/by-id. Recent versions of udev ship the following rule in 60-persistent-storage.rules:
<code>
# PMEM devices
KERNEL=="pmem*", ENV{DEVTYPE}=="disk", ATTRS{uuid}=="?*", SYMLINK+="disk/by-id/pmem-$attr{uuid}"
</code>

This rule creates symlinks like the following for namespaces defined by labels:

<code>
ls -l /dev/disk/by-id/*
lrwxrwxrwx 1 root root 13 Jul  9 15:24 pmem-206dcdfe-69b7-4e86-a01b-f540621ce62e -> ../../pmem1.2
lrwxrwxrwx 1 root root 13 Jul  9 15:24 pmem-73840bf1-4e74-4ba4-a9c8-8248934c07c8 -> ../../pmem1.1
lrwxrwxrwx 1 root root 13 Jul  9 15:24 pmem-8137bdfd-3c4d-4b26-b326-21da3d4cd4e5 -> ../../pmem1.4
lrwxrwxrwx 1 root root 13 Jul  9 15:24 pmem-f43d1b6e-3300-46cb-8afc-06d66a7c16f6 -> ../../pmem1.3
</code>
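
To see which kernel name each persistent name currently resolves to, you can read the symlinks directly. This is a minimal sketch that assumes the ''pmem-<uuid>'' naming produced by the udev rule above; the function name is hypothetical, and the directory is a parameter so it can be pointed at a test directory.

<code python>
import os


def pmem_by_id(by_id_dir="/dev/disk/by-id"):
    """Map each pmem namespace UUID to the kernel device name its symlink points at."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        if not entry.startswith("pmem-"):
            continue  # ignore non-pmem disks in the same directory
        target = os.readlink(os.path.join(by_id_dir, entry))  # e.g. "../../pmem1.2"
        uuid = entry[len("pmem-"):]
        mapping[uuid] = os.path.basename(target)
    return mapping
</code>

On the system listed above, ''pmem_by_id()'' would map ''206dcdfe-69b7-4e86-a01b-f540621ce62e'' to ''pmem1.2''.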

The persistent name of a pmem namespace can then be used in /etc/fstab, for example:
<code>
/dev/disk/by-id/pmem-206dcdfe-69b7-4e86-a01b-f540621ce62e /mnt/pmem xfs       defaults,dax        1 2
</code>
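
If you generate fstab entries from a script, building the line from the namespace UUID keeps it tied to the persistent name. A minimal sketch, assuming the ''/dev/disk/by-id/pmem-<uuid>'' layout above; the helper name is hypothetical, the defaults simply copy the example entry, and the ''dax'' option should be replaced with whatever [[fs_mount_options|fs_mount_options]] describes for your kernel.

<code python>
import uuid


def fstab_entry(ns_uuid, mountpoint, fstype="xfs", options="defaults,dax", dump=1, passno=2):
    """Return an /etc/fstab line that refers to a pmem namespace by its by-id symlink."""
    canonical = str(uuid.UUID(ns_uuid))  # raises ValueError for a malformed UUID
    device = f"/dev/disk/by-id/pmem-{canonical}"
    return "\t".join([device, mountpoint, fstype, options, str(dump), str(passno)])
</code>

''fstab_entry("206dcdfe-69b7-4e86-a01b-f540621ce62e", "/mnt/pmem")'' produces the same fields as the example line, separated by tabs.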


==== Partitions ====
<code>
$ mkfs.ext4 -F /dev/pmem0p1
$ mkfs.xfs -f -m reflink=0 /dev/pmem0p2
$ mkfs.btrfs -f /dev/pmem0p3
$ mount [dax_mount_options] /dev/pmem0p1 /mnt/ext4-pmem0
$ mount [dax_mount_options] /dev/pmem0p2 /mnt/xfs-pmem0
$ mount /dev/pmem0p3 /mnt/btrfs-pmem0

/dev/pmem0p3                    4.3G   17M  4.1G   1% /mnt/btrfs-pmem0
</code>

Here **[dax_mount_options]** depends on the DAX support in your kernel and the desired behavior. See [[fs_mount_options|fs_mount_options]] for details.
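
Besides the kernel log, you can look at the options the kernel reports for a mount in /proc/mounts. The sketch below only extracts ''dax'' or ''dax=...'' options for a mount point; it assumes mount points without spaces (which /proc/mounts escapes) and uses the last matching entry. It is a convenience, not a replacement for the kernel log check below: it cannot tell whether DAX is actually in effect.

<code python>
def dax_mount_options(mounts_text, mountpoint):
    """Return the dax-related options listed for mountpoint, or None if it is not mounted.

    mounts_text is the content of /proc/mounts: device, mount point, type, options, ...
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = fields[3].split(",")
            # Keep the last match: a later mount on the same path hides earlier ones.
            found = [opt for opt in options if opt == "dax" or opt.startswith("dax=")]
    return found
</code>

Usage: ''dax_mount_options(open("/proc/mounts").read(), "/mnt/xfs-pmem0")''. An empty list means no DAX option is shown for that mount.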
  
Check the kernel log to ensure the DAX mount option was honored; mount does not print this information. Example failures:
==== iostats ====
----
iostats are disabled by default due to performance overhead (e.g., 12M IOPS dropping 25% to 9M IOPS). However, they can be enabled in sysfs if desired.
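
Enabling them comes down to writing 1 to the device's ''queue/iostats'' attribute (the shell equivalent is ''echo 1 > /sys/block/pmem0/queue/iostats'', run as root). A minimal Python sketch follows; the function names are hypothetical, and the sysfs root is a parameter so the code can be exercised without real hardware.

<code python>
import os


def _iostats_path(device, sysfs_root):
    # e.g. /sys/block/pmem0/queue/iostats
    return os.path.join(sysfs_root, "block", device, "queue", "iostats")


def set_iostats(device, enabled=True, sysfs_root="/sys"):
    """Write 1 (enable) or 0 (disable) to a block device's queue/iostats attribute."""
    path = _iostats_path(device, sysfs_root)
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path


def iostats_enabled(device, sysfs_root="/sys"):
    """Return True if the device's queue/iostats attribute reads 1."""
    with open(_iostats_path(device, sysfs_root)) as f:
        return f.read().strip() == "1"
</code>

Calling ''set_iostats("pmem0", False)'' turns collection off again.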
  
As of kernel 4.5, iostats are only collected for the base pmem device, not per-partition. Also, I/Os that go through DAX paths (rw_page, rw_bytes, and direct_access functions) are not counted, so nothing is collected for:
start.txt · Last modified: 2020/06/12 15:58 by Ira Weiny