[IEEE '25] Accelerating Page Migrations in Operating Systems With Intel DSA

2025. 3. 25. 16:31·[ Laboratory ]/Paper review

IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 24, NO. 1, JANUARY-JUNE 2025

Accelerating Page Migrations in Operating Systems With Intel DSA 

Jongho Baik, Jonghyeon Kim, Chang Hyun Park, Member, IEEE, and Jeongseob Ahn

https://ieeexplore.ieee.org/document/10841986

 

- Index terms: data migration, hardware accelerator, Linux

 

The PDF of this article is available only to subscribers, so this blog post will not include the original text.

 

Modern server-class CPUs are introducing special-purpose accelerators on the same chip to improve the performance of data-intensive applications.

This paper proposes a method to accelerate data migrations within the operating system using Intel’s Data Streaming Accelerator (DSA).

To the best of our knowledge, this is the first study that exploits a hardware-assisted data migration scheme in the operating system.

 

Linux kernel components that can benefit from hardware acceleration.

>> kernel subsystems that move data through the kernel function migrate_pages()

 

a hardware accelerator is not suitable for transferring a small amount of data because of the HW setup overhead

>> accelerating migrate_pages() with DSA (dynamically determines the offloading decision based on the volume of pages)

>> real-world page compaction (kcompactd) and promotion (kdamond) scenarios

 

kcompactd : rearranges physical pages → mitigates memory fragmentation and creates larger contiguous regions of free memory

DAMON's kdamond : kernel daemon thread ... that continuously runs in the background within the Linux kernel to perform specific tasks ... promote frequently accessed pages to faster memory, or demote infrequently accessed pages to slower memory

 

+) DAMON = Data Access MONitor ... monitoring memory access pattern

 

>> the revised migration function processes pages in a batch, which is more efficient for DSA and outperforms the CPU-based migration

 

-- 

 

heterogeneous memory (DRAM+NVM or fast+slow) / tiered memory (HBM, DDR, PMEM ...)

accelerating the migrate_pages() function is crucial for enhancing the efficiency and performance of the Linux kernel’s memory management system

 

some prior studies on accelerating data migrations in the OS (related work)

Nimble page management for tiered memory systems : parallelization technique to achieve speedup for transferring pages between different types of memory 

HeMem : DMA engine to offload the data migration

vector instructions : process multiple data elements in parallel with a single instruction, but add the overhead of saving and restoring the AVX registers

 

--

 

Intel DSA

 

 

the DSA HW unit includes work queues (WQs)

WQs hold DSA descriptors

Each descriptor specifies an operation type, src address, dest address, size 

 

Intel DSA exposes portals

developers can submit DSA descriptors through a portal

MMIO = memory-mapped I/O = hardware device registers or control ports are mapped into the memory address space

... the execution unit inside the DSA that performs the actual operations.

portal = the entry in the MMIO address space used to submit DSA descriptor

... since it is a specialized hardware device that cannot be used like regular instructions, commands are issued by writing to memory-mapped addresses

 

PE = processing engine = executes the operation by fetching a descriptor from the work queue
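
The relationship between descriptors, work queues, and PEs can be pictured with a toy user-space model. This is only an illustrative sketch: the class names (`Descriptor`, `WorkQueue`, `ProcessingEngine`), the `"memmove"` opcode string, and the `bytearray` standing in for physical memory are assumptions made for this post, not the real DSA descriptor layout or driver API.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Toy stand-in for a DSA descriptor (not the real hardware layout)."""
    op: str    # operation type, e.g. "memmove"
    src: int   # source offset into the toy memory
    dst: int   # destination offset into the toy memory
    size: int  # number of bytes to move


class WorkQueue:
    """Holds descriptors until a processing engine fetches them."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._slots = deque()

    def submit(self, desc: Descriptor) -> bool:
        # Models a write to the portal; rejected when the queue is full.
        if len(self._slots) >= self.capacity:
            return False
        self._slots.append(desc)
        return True

    def fetch(self):
        return self._slots.popleft() if self._slots else None


class ProcessingEngine:
    """Fetches descriptors from a work queue and executes them."""

    def __init__(self, memory: bytearray):
        self.memory = memory

    def run(self, wq: WorkQueue) -> int:
        done = 0
        while (desc := wq.fetch()) is not None:
            if desc.op != "memmove":
                raise ValueError(f"unsupported op: {desc.op}")
            data = self.memory[desc.src:desc.src + desc.size]
            self.memory[desc.dst:desc.dst + desc.size] = data
            done += 1
        return done


memory = bytearray(b"ABCD" + bytes(4))
wq = WorkQueue(capacity=1)
accepted = wq.submit(Descriptor("memmove", src=0, dst=4, size=4))
rejected = wq.submit(Descriptor("memmove", src=0, dst=4, size=4))  # queue full
completed = ProcessingEngine(memory).run(wq)
```

In this model the CPU only builds and submits the descriptor; the copy itself happens inside `ProcessingEngine.run()`, which mirrors the division of labor described above.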

 

---

 

migrating pages requires TLB shootdown operations

(when page table changes)

 

to reduce the migration overhead, Linux v6.3 added features to migrate pages in a batch through migrate_pages_batch(), which is also utilized in DAMON with a list of pages (folios)
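
A rough way to see why batching helps: if every migrated page triggered its own TLB shootdown, N pages would cost N shootdowns, whereas a batched path can flush once per batch. The counting model below is a deliberately simplified sketch. The function name and the one-flush-per-batch assumption are made up for illustration and do not describe how migrate_pages_batch() is actually implemented.

```python
import math


def tlb_shootdowns(num_pages: int, batch_size: int) -> int:
    """Shootdown count under the simplifying assumption of one flush per batch."""
    if num_pages < 0 or batch_size < 1:
        raise ValueError("num_pages must be >= 0 and batch_size >= 1")
    return math.ceil(num_pages / batch_size)


per_page = tlb_shootdowns(512, batch_size=1)   # one flush per page
batched = tlb_shootdowns(512, batch_size=64)   # one flush per 64-page batch
```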

 

folio : a recent page management abstraction in the Linux kernel. A folio is an object that groups one or more contiguous pages (e.g., 2 or 4) into a larger unit.

DAMON tracks memory access patterns based on folios and processes them in batches during migration.

 

 

---

accelerating data migration with dsa

the memcpy() function is the core primitive for copying data ... widely used!

... transferring data from the kernel to the user in a read() system call

... device drivers to copy data from the device-specific memory to the system memory

 

>> first-order consideration is to study and amortize the HW offloading cost !!

 

 

 

time for copying 4 KB pages >> DSA is preferred when the number of pages exceeds 32

in the CPU configuration, we measured the time spent copying pages through copy_highpage() in /include/linux/highmem.h

 

if the amount of data is not sufficiently large, it is more efficient to use the conventional CPU copying method due to the offloading cost

costs are non-negligible when the amount of data to be moved is small

>> it is important to make an offloading decision based on the amount of data movement
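
The trade-off can be written as a simple linear cost model: CPU copy time grows as `n * c_cpu`, while DSA copy time is `setup + n * c_dsa`, so DSA wins once `n` exceeds `setup / (c_cpu - c_dsa)`. The sketch below encodes that arithmetic. The microsecond values are hypothetical numbers picked so that the crossover lands at 32 pages; they are not measurements from the paper.

```python
def cpu_copy_time(n_pages: int, per_page_cpu_us: float) -> float:
    """CPU path: no setup cost, every page costs the same."""
    return n_pages * per_page_cpu_us


def dsa_copy_time(n_pages: int, setup_us: float, per_page_dsa_us: float) -> float:
    """DSA path: fixed offloading cost plus a cheaper per-page cost."""
    return setup_us + n_pages * per_page_dsa_us


def break_even_pages(setup_us: float, per_page_cpu_us: float,
                     per_page_dsa_us: float):
    """Page count at which both paths take equal time (None if DSA never wins)."""
    if per_page_dsa_us >= per_page_cpu_us:
        return None
    return setup_us / (per_page_cpu_us - per_page_dsa_us)


# Hypothetical costs, chosen only to illustrate a 32-page crossover.
SETUP_US, CPU_US, DSA_US = 16.0, 1.0, 0.5
crossover = break_even_pages(SETUP_US, CPU_US, DSA_US)
```

Below the crossover the fixed setup cost dominates and the CPU path is faster; above it the cheaper per-page cost of the accelerator pays the setup back.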

 

offloading steps

create SG table → DMA mapping config → submit DSA descriptor → unmap

 

revised migrate_pages() function designed to exploit the performance advantages of DSA

 

an additional path (right-hand side of the figure) migrates pages with DSA, while the CPU migration path remains as a fallback that is used when the number of pages is less than the threshold (e.g., 32)
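
The dispatch between the two paths can be sketched as a small selection function. This is an assumption-heavy illustration: `choose_path`, `migrate`, the `dsa_available` flag, and the copy callbacks are hypothetical names, and the comparison follows the note above (CPU when the page count is less than the threshold), not the actual kernel patch.

```python
DSA_THRESHOLD_PAGES = 32  # example threshold taken from the notes above


def choose_path(num_pages: int, dsa_available: bool = True,
                threshold: int = DSA_THRESHOLD_PAGES) -> str:
    """Return "dsa" for large batches, "cpu" as the fallback."""
    if dsa_available and num_pages >= threshold:
        return "dsa"
    return "cpu"


def migrate(pages, copy_cpu, copy_dsa, threshold: int = DSA_THRESHOLD_PAGES):
    """Copy `pages` through whichever path choose_path() selects."""
    path = choose_path(len(pages), threshold=threshold)
    copier = copy_dsa if path == "dsa" else copy_cpu
    return path, copier(pages)


# Both toy callbacks just duplicate the page contents.
small_path, _ = migrate([b"x"] * 8, copy_cpu=list, copy_dsa=list)
large_path, copied = migrate([b"x"] * 64, copy_cpu=list, copy_dsa=list)
```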

 

the DSA path performs the migrations in a batch to minimize the offloading overhead

 

- build the scatter-gather tables for all the pages to be migrated and then generate DMA-capable addresses at once

- reserve a DSA engine at boot time so that the operating system exclusively uses one work queue and one PE unit

- Once a DSA descriptor is submitted to the queue, the corresponding PE unit fetches an item and starts the data movement without the CPU’s involvement

- migrate_page operations were run synchronously, where the kernel thread waited for the completion of the transfer through busy waiting

- an asynchronous approach that switches to other kernel tasks during the transfer is left for future work
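
The batched, synchronous DSA path listed above can be mimicked in user space to make the ordering of the steps concrete. Everything here is a toy: `ToyEngine`, `migrate_batch`, the `polls_until_done` latency, and the use of `memoryview` objects as stand-ins for DMA-capable addresses are assumptions for illustration, not kernel or DSA driver interfaces.

```python
PAGE_SIZE = 4096


class ToyEngine:
    """Stands in for the reserved work queue + PE; finishes after a few polls."""

    def __init__(self, polls_until_done: int = 3):
        self._latency = polls_until_done
        self._remaining = 0
        self._desc = None

    def submit(self, sg_entries):
        self._desc = sg_entries
        self._remaining = self._latency

    def completed(self) -> bool:
        """One status poll; the copy is performed when the latency expires."""
        if self._desc is None:
            return True
        self._remaining -= 1
        if self._remaining > 0:
            return False
        for src, dst in self._desc:
            dst[:] = src  # the "hardware" moves the data, not the caller
        self._desc = None
        return True


def migrate_batch(src_pages, dst_pages, engine) -> int:
    # 1. Build one scatter-gather table covering every page in the batch.
    sg_table = list(zip(src_pages, dst_pages))
    # 2. "DMA-map" all entries at once (memoryviews play the DMA addresses).
    mapped = [(memoryview(s), memoryview(d)) for s, d in sg_table]
    # 3. Submit a single request for the whole batch.
    engine.submit(mapped)
    # 4. Synchronous completion: busy-wait by polling the engine.
    polls = 0
    while True:
        polls += 1
        if engine.completed():
            break
    # 5. Unmap.
    for s, d in mapped:
        s.release()
        d.release()
    return polls


src = [bytes([i]) * PAGE_SIZE for i in range(40)]
dst = [bytearray(PAGE_SIZE) for _ in range(40)]
polls = migrate_batch(src, dst, ToyEngine())
```

The busy-wait loop in step 4 is where the calling thread burns CPU time while the transfer is in flight, which is the cost an asynchronous design would try to reclaim.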

 

https://github.com/Sys-KU/DSA-Linux

 

 

---

select two scenarios: memory compaction and memory promotion

- kcompactd 

- how many pages are processed per second

 

- fragmentation score during a compaction period

 

 

- page promotion in tiered memory environments

 

decompose the migration time into the setup cost and actual data transfer

 

- the latency changes when a user application thread and a kernel thread utilize DSA at the same time

 

 
