CRIU - User contributions [en]

Google Summer of Code Ideas

2025-03-03T07:14:40Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three-month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions, feel free to send an email to the [mailto:criu@lists.linux.dev mailing list] or write on [https://gitter.im/save-restore/criu Gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like to support compression of memory page images
in CRIU using one of the fastest available algorithms (which one
to choose is a matter of discussion!).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time it is
complex enough, as we need to touch the memory dump/restore code path
in CRIU and handle many corner cases, such as the page server.
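
As a rough illustration of the idea (not CRIU's image format), the sketch below compresses pages one by one and keeps a page raw when compression does not shrink it. Everything in it is an assumption made for the example: the 4096-byte page size, the <code>(is_compressed, payload)</code> record layout, and the use of zlib, which is chosen only because it ships with Python; the project leaves the choice of algorithm open.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def compress_pages(pages):
    """Compress each page on its own; return (is_compressed, payload) records."""
    records = []
    for page in pages:
        if len(page) != PAGE_SIZE:
            raise ValueError("expected a full page")
        packed = zlib.compress(page, 1)  # level 1 favours speed over ratio
        if len(packed) < PAGE_SIZE:
            records.append((True, packed))
        else:
            # Incompressible data: storing the raw page is cheaper.
            records.append((False, page))
    return records


def decompress_pages(records):
    """Inverse of compress_pages()."""
    return [zlib.decompress(payload) if packed else payload
            for packed, payload in records]
```

Compressing pages independently keeps random access to a single page possible, which matters for consumers that fetch individual pages, at the cost of a lower compression ratio than compressing the whole file as one stream.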

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring, CRIU locks the network to make sure that no network packets are accepted by the network stack while the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach, which avoids calling out to the external iptables-restore binary, would be to inject eBPF rules directly. Users have reported failures of iptables-restore, and eBPF could remove this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows exactly which mount the file was originally opened from. This way criu can restore the fd by opening the same file from the topologically equivalent mount in the restored mount tree.

Restoring the fd from the right mount can matter in several cases. For instance, the process may later resolve paths relative to the fd, and resolving from the same file on a different mount can lead to different resolved paths. The process may also check the path to the file via /proc/<pid>/fd/<fd>.

However, if we only know the mnt_id and cannot find it in /proc/<pid>/mountinfo, we cannot tell on which mount to reopen the file at restore.

The mountinfo file shows the mount tree topology of the current mntns: parent-child relations, sharing group information, mountpoint and fs root information. If the mnt_id is not listed there, we know nothing about this mount.
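
The lookup described above can be sketched as follows. This is a minimal illustration that works on the text of the two proc files, not a copy of what criu does internally; the function names are made up for the example.

```python
def parse_mnt_id(fdinfo_text):
    """Return the mnt_id field from /proc/<pid>/fdinfo/<fd> content, or None."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key == "mnt_id":
            return int(value)
    return None


def mount_ids(mountinfo_text):
    """Return the set of mount IDs (first field) listed in mountinfo content."""
    return {int(line.split()[0])
            for line in mountinfo_text.splitlines() if line.strip()}


def is_mount_visible(fdinfo_text, mountinfo_text):
    """True if the fd's mount is listed in the given mountinfo content."""
    mnt_id = parse_mnt_id(fdinfo_text)
    return mnt_id is not None and mnt_id in mount_ids(mountinfo_text)
```

On a live system the inputs would come from reading /proc/<pid>/fdinfo/<fd> and /proc/<pid>/mountinfo; a <code>False</code> result corresponds to the problematic situation described here.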

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount is not visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case 1) criu has options that help it handle external dependencies.

In case 2), or when no such options are provided, criu cannot resolve the mnt_id in mountinfo and fails.

'''Solution:'''
We can handle 2) by resolving major/minor via fstat and then using name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount, so we can dump fd2 as usual instead and mark it as "detached" in the image. On restore criu then knows where to find the file, but instead of simply opening fd2 from the restored mount, it creates a temporary bind mount, opens the file there, and lazily unmounts the bind mount right after the open, so that the file appears as a file on a detached mount.

Known problems with this approach:

* stat on btrfs gives the wrong major/minor
* file handles do not work everywhere
* file handles can return fd2 on a deleted file or on another hardlink; this needs special handling.
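
The first step of the approach, finding other mounts of the same superblock, can be sketched as below. This is only an illustration under simplifying assumptions: it matches the device number from fstat against the third field of mountinfo, which is exactly the comparison that the btrfs problem above breaks, and it leaves out the file handle step because name_to_handle_at and open_by_handle_at are not exposed by Python's standard library. The function names are hypothetical.

```python
import os


def device_of_fd(fd):
    """Return (major, minor) of the device holding the file behind fd."""
    st = os.fstat(fd)
    return os.major(st.st_dev), os.minor(st.st_dev)


def mounts_on_device(mountinfo_text, major, minor):
    """List (mount_id, mountpoint) for mountinfo entries on the given device."""
    wanted = f"{major}:{minor}"
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Fields: mount ID, parent ID, major:minor, root, mount point, ...
        if len(fields) >= 5 and fields[2] == wanted:
            found.append((int(fields[0]), fields[4]))
    return found
```

Any mount returned by <code>mounts_on_device()</code> is a candidate on which the file could be reopened through a file handle.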

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts. With such an interface we would not need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Add support for arm64 Guarded Control Stack (GCS) ===

'''Summary:''' Support arm64 Guarded Control Stack (GCS)

The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling (taken from [1]).
We would like to support arm64 Guarded Control Stack (GCS) in CRIU, which means
that CRIU should be able to Checkpoint/Restore applications using GCS.

This task should not require any Linux kernel modifications
but will require a lot of effort to understand Linux kernel and
glibc support patches. We have a good example of support for
x86 shadow stack [4] thanks to Mike.

'''Links:'''
* [1] kernel support https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
* [2] libc support https://inbox.sourceware.org/libc-alpha/20250117174119.3254972-1-yury.khrustalev@arm.com
* [3] libc tests https://inbox.sourceware.org/libc-alpha/20250210114538.1723249-1-yury.khrustalev@arm.com
* [4] x86 support (a great reference!) https://github.com/checkpoint-restore/criu/pull/2306

'''Details:'''
* Skill level: expert (a lot of moving parts: Linux kernel / libc / CRIU)
* Language: C
* Expected size: 350 hours
* Suggested by: Mike Rapoport <rppt@kernel.org>
* Mentors: Mike Rapoport <rppt@kernel.org>, Andrei Vagin <avagin@gmail.com>, Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Coordinated checkpointing of distributed applications ===

'''Summary:''' Enable coordinated container checkpointing with Kubernetes.

Checkpointing support has been recently introduced in Kubernetes, where the
smallest deployable unit is a Pod (a group of containers). Kubernetes is often
used to deploy applications that are distributed across multiple nodes.
However, checkpointing such distributed applications requires a coordination
mechanism to synchronize the checkpoint and restore operations. To address this
challenge, we have developed a new tool called <code>criu-coordinator</code>
that relies on the action-script functionality of CRIU to enable synchronization
in distributed environments. This project aims to extend this tool to enable
seamless integration with the checkpointing functionality of Kubernetes.

'''Links:'''
* https://github.com/checkpoint-restore/criu-coordinator
* https://lpc.events/event/18/contributions/1803/
* https://sched.co/1YeT4
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

'''Details:'''
* Skill level: intermediate
* Language: Rust / Go / C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC but currently have nobody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with a simple fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time, the printf family of functions is known to take some time to work: they need to scan the format string for % specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The difficulty with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
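
A toy version of this scheme is sketched below to make the record layout concrete. It is not a proposal for the actual format: the format table, the record layout (16-bit format ID, 8-bit argument count, 64-bit signed integers) and the restriction to integer arguments are all assumptions made for the example, and a real implementation would live in C with the table generated by the pre-compilation pass.

```python
import struct

# Hypothetical table that a pre-compilation pass would generate.
FORMATS = ["Dumping task %d\n", "Mapped %d pages at 0x%x\n"]
FMT_ID = {fmt: index for index, fmt in enumerate(FORMATS)}


def binlog(buf, fmt, *args):
    """Append one record: format ID, argument count, raw integer arguments."""
    buf.extend(struct.pack("<HB", FMT_ID[fmt], len(args)))
    buf.extend(struct.pack(f"<{len(args)}q", *args))


def decode(buf):
    """Turn the binary log back into formatted text lines."""
    lines = []
    offset = 0
    while offset < len(buf):
        fmt_id, nargs = struct.unpack_from("<HB", buf, offset)
        offset += 3
        args = struct.unpack_from(f"<{nargs}q", buf, offset)
        offset += 8 * nargs
        # The expensive string formatting happens only here, in the decoder.
        lines.append(FORMATS[fmt_id] % args)
    return lines
```

The writer never scans the format string, which is where the saving would come from. String arguments would need extra handling, for example copying the bytes with a length prefix.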

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, and it would be very beneficial to port it to mainline CRIU. An important part of this is integrating the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. CRIT needs to be taught to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all memory contents can simply be removed. The sizes of the pages*.img files alone are enough.
* Paths to files. Here we should keep the relations between paths. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure that this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
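
The 1:1 path mapping from the list above could look roughly like this. It is a sketch of the idea only and is not tied to CRIT's real image-handling code; the class name and the <code>n0</code>, <code>n1</code>, ... naming scheme are invented for the example.

```python
class PathAnonymizer:
    """Replace path components with sequential names, keeping a 1:1 mapping."""

    def __init__(self):
        self._names = {}

    def _component(self, name):
        # The same original name always maps to the same replacement,
        # and distinct names never collide.
        if name not in self._names:
            self._names[name] = f"n{len(self._names)}"
        return self._names[name]

    def anonymize(self, path):
        # Empty components (leading or doubled slashes) are kept as is.
        return "/".join(self._component(part) if part else ""
                        for part in path.split("/"))
```

Because the mapping works per component, two files in the same directory still share the same anonymised prefix, so relations between paths survive. One <code>PathAnonymizer</code> instance would have to be shared across all images of a dump, including sk-unix.img, to keep the mapping consistent.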

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>
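
The behaviour quoted above can be observed with a small Linux-only experiment, sketched here. It assumes a working loopback interface; <code>UDP_CORK</code> falls back to the value 1 from <code>linux/udp.h</code> in case Python's socket module does not export the constant, and the function name is made up for the example.

```python
import socket

# Fall back to the value from <linux/udp.h> if the module lacks the constant.
UDP_CORK = getattr(socket, "UDP_CORK", 1)


def corked_roundtrip(parts, timeout=2.0):
    """Send parts through a corked UDP socket; return the datagram received."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        sender.connect(receiver.getsockname())
        sender.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 1)  # cork
        for part in parts:
            sender.send(part)  # data accumulates in the socket's write queue
        sender.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 0)  # uncork: flush
        data, _ = receiver.recvfrom(65535)
        return data
    finally:
        sender.close()
        receiver.close()
```

Between the cork and the uncork call the pending data exists only in the kernel's write queue of the socket, which is the state a checkpoint would have to capture.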

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and restored into the socket (with the CORK
bit set too).

'''Notes:'''

We have already had several attempts (3) at this problem:

* UDP_REPAIR approach didn't succeed: https://lore.kernel.org/netdev/721a2e32-c930-ad6b-5055-631b502ed11b@gmail.com/, https://lore.kernel.org/netdev/?q=udp_repair
* eBPF (CRIB) approach, socket queue iterator was not merged: https://lore.kernel.org/netdev/AM6PR03MB5848EDA002E3D7EACA7C6BDA99A52@AM6PR03MB5848.eurprd03.prod.outlook.com/, and we have general objections to CRIB approach https://lore.kernel.org/bpf/CAHk-=wjLWFa3i6+Tab67gnNumTYipj_HuheXr2RCq4zn0tCTzA@mail.gmail.com/

One idea remains untried: since UDP allows packets to be lost in transit, on restore we could somehow mark the socket to drop all data sent before UNCORK. This way we would not need to restore the contents of the send queue of a CORK-ed UDP socket.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2025-03-03T07:11:12Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code Ideas

2025-03-03T07:06:06Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@lists.linux.dev mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like CRIU to support compression of memory page
images using one of the fastest available algorithms (which
one to choose is a matter of discussion).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time, it is
complex enough, as we need to touch the memory dump/restore code paths
in CRIU and handle many corner cases, such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening the same exact file from topologically the same mount in restored mount tree.

Restoring fd from the right mount can be important in different cases, for instance if the process would later want to resolve paths relative to the fd, and obviously resolving from the same file on different mount can lead to different resolved paths, or if the process wants to check path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding on which mount we need to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

Mountinfo file shows the mount tree topology of current mntns: parent - child relations, sharing group information, mountpoint and fs root information. And if we don't see mnt_id in it we don't know anything about this mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2) or no options provided criu can't resolve mnt_id in mountinfo and criu fails.

'''Solution:'''
We can handle 2) as follows. At dump, criu resolves the major/minor of the file via fstat, then uses name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives fd2, which refers to the same file as fd but sits on an existing mount, so criu can dump fd2 as usual and mark it as "detached" in the image. At restore, criu then knows where to find the file, but instead of just opening it from the restored mount, it creates a temporary bind mount, opens the file there, and lazily unmounts the bind mount right after the open, so that the file appears as a file on a detached mount.
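
A rough sketch of the dump-side step, under stated assumptions: <code>reopen_via_handle</code> is a hypothetical helper, <code>mount_fd</code> is assumed to be an fd on some visible mount of the same superblock, and the caller is assumed to hold CAP_DAC_READ_SEARCH, which open_by_handle_at requires. It does not address the known problems listed below.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Reopen the file behind fd through another mount of the same superblock.
 * mount_fd must refer to any object on a visible mount with the same st_dev.
 * Returns a new fd (fd2 in the text above) or -1 on error.
 * reopen_via_handle() is a hypothetical name, not a CRIU function.
 */
int reopen_via_handle(int fd, int mount_fd)
{
	struct stat st_fd, st_mnt;
	struct file_handle *fh;
	int mount_id, fd2;

	if (fstat(fd, &st_fd) || fstat(mount_fd, &st_mnt))
		return -1;
	/* Same major/minor is used as the "same superblock" check. */
	if (st_fd.st_dev != st_mnt.st_dev)
		return -1;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* An empty path with AT_EMPTY_PATH means "the file fd itself refers to". */
	if (name_to_handle_at(fd, "", fh, &mount_id, AT_EMPTY_PATH)) {
		free(fh);
		return -1;
	}

	fd2 = open_by_handle_at(mount_fd, fh, O_RDONLY);
	free(fh);
	return fd2;
}
```

The open flags are fixed to O_RDONLY here only to keep the sketch short; a real implementation would presumably carry over the flags of the original fd.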

Known problems with this approach:

* Stat on btrfs gives wrong major/minor
* file handles do not work everywhere
* file handles can return fd2 on deleted file or on other hardlink, this needs special handling.

Additionally (optional part):
We can export real major/minor in fdinfo (kernel).
We can think of new kernel interface to get mount's major/minor and root (shift from fsroot) for detached mounts, if we have it we don't need file handle hack to find file on other mount (see fsinfo or getvalues kernel patches in LKML, can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Add support for arm64 Guarded Control Stack (GCS) ===

'''Summary:''' Support arm64 Guarded Control Stack (GCS)

The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling (taken from [1]).
We would like to support arm64 Guarded Control Stack (GCS) in CRIU, which means
that CRIU should be able to Checkpoint/Restore applications using GCS.

This task should not require any Linux kernel modifications
but will require a lot of effort to understand Linux kernel and
glibc support patches. We have a good example of support for
x86 shadow stack [4] thanks to Mike.

'''Links:'''
* [1] kernel support https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
* [2] libc support https://inbox.sourceware.org/libc-alpha/20250117174119.3254972-1-yury.khrustalev@arm.com
* [3] libc tests https://inbox.sourceware.org/libc-alpha/20250210114538.1723249-1-yury.khrustalev@arm.com
* [4] x86 support (a great reference!) https://github.com/checkpoint-restore/criu/pull/2306

'''Details:'''
* Skill level: expert (a lot of moving parts: Linux kernel / libc / CRIU)
* Language: C
* Expected size: 350 hours
* Suggested by: Mike Rapoport <rppt@kernel.org>
* Mentors: Mike Rapoport <rppt@kernel.org>, Andrei Vagin <avagin@gmail.com>, Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Coordinated checkpointing of distributed applications ===

'''Summary:''' Enable coordinated container checkpointing with Kubernetes.

Checkpointing support has been recently introduced in Kubernetes, where the
smallest deployable unit is a Pod (a group of containers). Kubernetes is often
used to deploy applications that are distributed across multiple nodes.
However, checkpointing such distributed applications requires a coordination
mechanism to synchronize the checkpoint and restore operations. To address this
challenge, we have developed a new tool called <code>criu-coordinator</code>
that relies on the action-script functionality of CRIU to enable synchronization
in distributed environments. This project aims to extend this tool to enable
seamless integration with the checkpointing functionality of Kubernetes.

'''Links:'''
* https://github.com/checkpoint-restore/criu-coordinator
* https://lpc.events/event/18/contributions/1803/
* https://sched.co/1YeT4
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

'''Details:'''
* Skill level: intermediate
* Language: Rust / Go / C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC but currently do not have anybody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs when doing its job. Logging is done with the simple fprintf function. The logs are typically useless, but ''if'' some operation fails -- the logs are the only way to find out the reason for the failure.

At the same time, the printf family of functions is known to take some time to work -- they need to scan the format string for % conversion specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One of the solutions to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
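
To make the idea concrete, here is a small sketch of what such a record format and decoder could look like. Everything in it is an assumption made for illustration: the names <code>blog_write</code> and <code>blog_decode</code>, the record layout (format ID, argument count, raw arguments) and the hand-written format table, which the pre-compilation pass would generate in a real implementation. Only integer arguments are handled; <code>%s</code> arguments would need their contents copied into the record.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOG_MAX_ARGS 4

/* Hypothetical format table; a pre-compilation pass would generate it. */
static const char *blog_fmts[] = {
	"Dumping task %ld\n",
	"Mapped %ld pages at 0x%lx\n",
};

/* Write one record: format ID, argument count, then the raw arguments. */
int blog_write(FILE *out, uint32_t fmt_id, uint32_t nargs, const long *args)
{
	if (nargs > BLOG_MAX_ARGS)
		return -1;
	if (fwrite(&fmt_id, sizeof(fmt_id), 1, out) != 1 ||
	    fwrite(&nargs, sizeof(nargs), 1, out) != 1)
		return -1;
	if (nargs && fwrite(args, sizeof(*args), nargs, out) != nargs)
		return -1;
	return 0;
}

/* Decode one record into buf; returns 0 on success, -1 on EOF or error. */
int blog_decode(FILE *in, char *buf, size_t len)
{
	uint32_t fmt_id, nargs;
	long a[BLOG_MAX_ARGS] = { 0 };

	if (fread(&fmt_id, sizeof(fmt_id), 1, in) != 1 ||
	    fread(&nargs, sizeof(nargs), 1, in) != 1)
		return -1;
	if (fmt_id >= sizeof(blog_fmts) / sizeof(blog_fmts[0]) ||
	    nargs > BLOG_MAX_ARGS)
		return -1;
	if (nargs && fread(a, sizeof(a[0]), nargs, in) != nargs)
		return -1;

	/* Arguments beyond what the format consumes are ignored by snprintf. */
	snprintf(buf, len, blog_fmts[fmt_id], a[0], a[1], a[2], a[3]);
	return 0;
}
```

The point of the split is that the expensive string formatting moves out of the dump path and into the decoder, which only runs when somebody actually reads the log.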

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part is implementing the integration of the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the paths' relations to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for UDP sockets. As the udp(7) man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump a socket in this state, which is effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).
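
For readers unfamiliar with corking, the following sketch shows the state in question from the application's side. It is only a userspace illustration over loopback, and <code>corked_send_demo</code> is a hypothetical name: between the two <code>send()</code> calls and the uncorking, the data sits in the socket's write queue, which is exactly what criu has no way to read back today.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send two chunks through a corked UDP socket over loopback and return the
 * size of the first datagram received, or -1 on error.
 */
ssize_t corked_send_demo(void)
{
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	char buf[64];
	ssize_t ret = -1;
	int rx, tx, on = 1, off = 0;

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0; /* let the kernel pick a free port */
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) ||
	    connect(tx, (struct sockaddr *)&addr, sizeof(addr)))
		goto out;

	if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &on, sizeof(on)))
		goto out;
	/* Both writes are accumulated in the write queue while corked. */
	if (send(tx, "hello ", 6, 0) != 6 || send(tx, "world", 5, 0) != 5)
		goto out;
	/* Uncorking flushes the queue as a single datagram. */
	if (setsockopt(tx, IPPROTO_UDP, UDP_CORK, &off, sizeof(off)))
		goto out;

	ret = recv(rx, buf, sizeof(buf), 0);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

If a checkpoint were taken after the two <code>send()</code> calls but before the second <code>setsockopt()</code>, the queued bytes would have to be saved in the image and re-queued at restore with the cork still set.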

'''Notes:'''

There have already been several attempts to solve this problem:

* UDP_REPAIR approach didn't succeed: https://lore.kernel.org/netdev/721a2e32-c930-ad6b-5055-631b502ed11b@gmail.com/, https://lore.kernel.org/netdev/?q=udp_repair
* eBPF (CRIB) approach, socket queue iterator was not merged: https://lore.kernel.org/netdev/AM6PR03MB5848EDA002E3D7EACA7C6BDA99A52@AM6PR03MB5848.eurprd03.prod.outlook.com/

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2025-03-03T07:05:22Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three-month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@lists.linux.dev mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like CRIU to support compression of memory page
images using one of the fastest available algorithms (which
one to choose is a matter of discussion).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time, it is
complex enough, as we need to touch the memory dump/restore code paths
in CRIU and handle many corner cases, such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening the same exact file from topologically the same mount in restored mount tree.

Restoring fd from the right mount can be important in different cases, for instance if the process would later want to resolve paths relative to the fd, and obviously resolving from the same file on different mount can lead to different resolved paths, or if the process wants to check path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding on which mount we need to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

Mountinfo file shows the mount tree topology of current mntns: parent - child relations, sharing group information, mountpoint and fs root information. And if we don't see mnt_id in it we don't know anything about this mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2) or no options provided criu can't resolve mnt_id in mountinfo and criu fails.

'''Solution:'''
We can handle 2) as follows. At dump, criu resolves the major/minor of the file via fstat, then uses name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives fd2, which refers to the same file as fd but sits on an existing mount, so criu can dump fd2 as usual and mark it as "detached" in the image. At restore, criu then knows where to find the file, but instead of just opening it from the restored mount, it creates a temporary bind mount, opens the file there, and lazily unmounts the bind mount right after the open, so that the file appears as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives wrong major/minor
* file handles do not work everywhere
* file handles can return fd2 on deleted file or on other hardlink, this needs special handling.

Additionally (optional part):
We can export real major/minor in fdinfo (kernel).
We can think of new kernel interface to get mount's major/minor and root (shift from fsroot) for detached mounts, if we have it we don't need file handle hack to find file on other mount (see fsinfo or getvalues kernel patches in LKML, can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Add support for arm64 Guarded Control Stack (GCS) ===

'''Summary:''' Support arm64 Guarded Control Stack (GCS)

The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling (taken from [1]).
We would like to support arm64 Guarded Control Stack (GCS) in CRIU, which means
that CRIU should be able to Checkpoint/Restore applications using GCS.

This task should not require any Linux kernel modifications
but will require a lot of effort to understand Linux kernel and
glibc support patches. We have a good example of support for
x86 shadow stack [4] thanks to Mike.

'''Links:'''
* [1] kernel support https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
* [2] libc support https://inbox.sourceware.org/libc-alpha/20250117174119.3254972-1-yury.khrustalev@arm.com
* [3] libc tests https://inbox.sourceware.org/libc-alpha/20250210114538.1723249-1-yury.khrustalev@arm.com
* [4] x86 support (a great reference!) https://github.com/checkpoint-restore/criu/pull/2306

'''Details:'''
* Skill level: expert (a lot of moving parts: Linux kernel / libc / CRIU)
* Language: C
* Expected size: 350 hours
* Suggested by: Mike Rapoport <rppt@kernel.org>
* Mentors: Mike Rapoport <rppt@kernel.org>, Andrei Vagin <avagin@gmail.com>, Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Coordinated checkpointing of distributed applications ===

'''Summary:''' Enable coordinated container checkpointing with Kubernetes.

Checkpointing support has been recently introduced in Kubernetes, where the
smallest deployable unit is a Pod (a group of containers). Kubernetes is often
used to deploy applications that are distributed across multiple nodes.
However, checkpointing such distributed applications requires a coordination
mechanism to synchronize the checkpoint and restore operations. To address this
challenge, we have developed a new tool called <code>criu-coordinator</code>
that relies on the action-script functionality of CRIU to enable synchronization
in distributed environments. This project aims to extend this tool to enable
seamless integration with the checkpointing functionality of Kubernetes.

'''Links:'''
* https://github.com/checkpoint-restore/criu-coordinator
* https://lpc.events/event/18/contributions/1803/
* https://sched.co/1YeT4
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

'''Details:'''
* Skill level: intermediate
* Language: Rust / Go / C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC but currently do not have anybody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs when doing its job. Logging is done with the simple fprintf function. The logs are typically useless, but ''if'' some operation fails -- the logs are the only way to find out the reason for the failure.

At the same time, the printf family of functions is known to take some time to work -- they need to scan the format string for % conversion specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One of the solutions to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part is implementing the integration of the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the paths' relations to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for UDP sockets. As the udp(7) man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump a socket in this state, which is effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Notes:'''
There have already been several attempts to solve this problem:

* UDP_REPAIR approach didn't succeed: https://lore.kernel.org/netdev/721a2e32-c930-ad6b-5055-631b502ed11b@gmail.com/, https://lore.kernel.org/netdev/?q=udp_repair
* eBPF (CRIB) approach, socket queue iterator was not merged: https://lore.kernel.org/netdev/AM6PR03MB5848EDA002E3D7EACA7C6BDA99A52@AM6PR03MB5848.eurprd03.prod.outlook.com/

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2025-03-03T07:04:57Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three-month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@lists.linux.dev mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like CRIU to support compression of memory page
images using one of the fastest available algorithms (which
one to choose is a matter of discussion).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time, it is
complex enough, as we need to touch the memory dump/restore code paths
in CRIU and handle many corner cases, such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows exactly which mount the file was originally opened from. This way criu can restore the fd by opening the same file from the topologically equivalent mount in the restored mount tree.

Restoring the fd from the right mount can matter in several cases: for instance, if the process later resolves paths relative to the fd (resolving from the same file on a different mount can lead to different resolved paths), or if the process checks the path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding the mount on which to reopen the file at restore if we only know the mnt_id but can't find it in /proc/<pid>/mountinfo.

The mountinfo file shows the mount tree topology of the current mntns: parent-child relations, sharing group information, mountpoint and fs root information. If we don't see an mnt_id in it, we know nothing about that mount.
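
The lookup described above can be sketched as follows. This is only an illustration in Python, not CRIU code: the <code>Mount</code> type and both function names are hypothetical. The field positions follow the mountinfo layout documented in proc(5), and the <code>mnt_id:</code> line is the one found in fdinfo.

```python
from typing import Dict, NamedTuple, Optional


class Mount(NamedTuple):
    mnt_id: int
    parent_id: int
    dev: str               # "major:minor" of the superblock
    root: str              # root of the mount within the filesystem
    mountpoint: str
    shared: Optional[int]  # peer group from an optional "shared:N" field


def parse_mountinfo(text: str) -> Dict[int, Mount]:
    """Index the lines of a mountinfo file by mount ID."""
    mounts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Everything before the " - " separator holds the per-mount fields.
        left, _, _ = line.partition(" - ")
        fields = left.split()
        shared = None
        for opt in fields[6:]:  # optional fields, if any
            if opt.startswith("shared:"):
                shared = int(opt.split(":")[1])
        mnt_id = int(fields[0])
        mounts[mnt_id] = Mount(mnt_id, int(fields[1]), fields[2],
                               fields[3], fields[4], shared)
    return mounts


def fd_mnt_id(fdinfo_text: str) -> Optional[int]:
    """Extract mnt_id from the contents of /proc/<pid>/fdinfo/<fd>."""
    for line in fdinfo_text.splitlines():
        if line.startswith("mnt_id:"):
            return int(line.split(":", 1)[1])
    return None
```

If <code>fd_mnt_id(...)</code> returns an ID that is not a key of the parsed table, the file sits on a mount that is invisible from this mount namespace, which is exactly the problematic situation.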

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case 1) criu has options that help it handle external dependencies.

In case 2), or if no options are provided, criu can't resolve the mnt_id in mountinfo and fails.

'''Solution:'''
We can handle 2) by resolving major/minor via fstat and using name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount. We can dump fd2 as usual instead and mark it as "detached" in the image, so that criu knows on restore where to find the file. On restore, instead of just opening fd2 from the actual restored mount, we create a temporary bind mount which is lazily unmounted right after the open, making the file appear as a file on a detached mount.

Known problems with this approach:

* stat on btrfs gives wrong major/minor
* file handles do not work everywhere
* file handles can return fd2 on a deleted file or on another hardlink; this needs special handling
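
The first step of the proposed solution, picking a visible mount of the same superblock, could look roughly like the sketch below. It is a Python illustration under stated assumptions: <code>candidate_mounts</code> and the tuple layout are hypothetical, <code>st_dev</code> would come from <code>os.fstat(fd).st_dev</code>, and the file-handle step (name_to_handle_at / open_by_handle_at) is not shown. As noted above, the device number comparison is unreliable on btrfs.

```python
import os
from typing import Iterable, List, Tuple

# (mnt_id, "major:minor", root, mountpoint), as read from mountinfo
MountEntry = Tuple[int, str, str, str]


def candidate_mounts(st_dev: int, mounts: Iterable[MountEntry]) -> List[MountEntry]:
    """Return visible mounts that share the file's superblock (same major:minor)."""
    wanted = f"{os.major(st_dev)}:{os.minor(st_dev)}"
    found = [m for m in mounts if m[1] == wanted]
    # Prefer mounts whose root is "/": they expose the whole filesystem, while
    # a bind mount of a subdirectory may not contain the file at all.
    return sorted(found, key=lambda m: (m[2] != "/", m[0]))
```

An empty result means no other mount of that filesystem is visible, so this approach cannot help for that file.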

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts. If we had it, we would not need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Add support for arm64 Guarded Control Stack (GCS) ===

'''Summary:''' Support arm64 Guarded Control Stack (GCS)

The arm64 Guarded Control Stack (GCS) feature provides support for
hardware protected stacks of return addresses, intended to provide
hardening against return oriented programming (ROP) attacks and to make
it easier to gather call stacks for applications such as profiling (taken from [1]).
We would like to support arm64 Guarded Control Stack (GCS) in CRIU, which means
that CRIU should be able to checkpoint/restore applications using GCS.

This task should not require any Linux kernel modifications
but will require a lot of effort to understand the Linux kernel and
glibc support patches. We have a good example in the support for
x86 shadow stack [4], thanks to Mike.

'''Links:'''
* [1] kernel support https://lore.kernel.org/all/20241001-arm64-gcs-v13-0-222b78d87eee@kernel.org
* [2] libc support https://inbox.sourceware.org/libc-alpha/20250117174119.3254972-1-yury.khrustalev@arm.com
* [3] libc tests https://inbox.sourceware.org/libc-alpha/20250210114538.1723249-1-yury.khrustalev@arm.com
* [4] x86 support (a great reference!) https://github.com/checkpoint-restore/criu/pull/2306

'''Details:'''
* Skill level: expert (a lot of moving parts: Linux kernel / libc / CRIU)
* Language: C
* Expected size: 350 hours
* Suggested by: Mike Rapoport <rppt@kernel.org>
* Mentors: Mike Rapoport <rppt@kernel.org>, Andrei Vagin <avagin@gmail.com>, Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Coordinated checkpointing of distributed applications ===

'''Summary:''' Enable coordinated container checkpointing with Kubernetes.

Checkpointing support has been recently introduced in Kubernetes, where the
smallest deployable unit is a Pod (a group of containers). Kubernetes is often
used to deploy applications that are distributed across multiple nodes.
However, checkpointing such distributed applications requires a coordination
mechanism to synchronize the checkpoint and restore operations. To address this
challenge, we have developed a new tool called <code>criu-coordinator</code>
that relies on the action-script functionality of CRIU to enable synchronization
in distributed environments. This project aims to extend this tool to enable
seamless integration with the checkpointing functionality of Kubernetes.

'''Links:'''
* https://github.com/checkpoint-restore/criu-coordinator
* https://lpc.events/event/18/contributions/1803/
* https://sched.co/1YeT4
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

'''Details:'''
* Skill level: intermediate
* Language: Rust / Go / C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC but currently have nobody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with the plain fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time, the printf family of functions is known to take some time to work: they need to scan the format string for %-s and then convert the arguments into strings. Comparing criu dump with and without logs, the time difference is notable (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or has some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
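
The encode/decode split described above can be sketched like this. It is a simplified Python model, not a proposal for the actual format: the <code>FORMATS</code> table stands in for what a pre-compilation pass might generate, the record layout is hypothetical, and only integer arguments are handled (strings would need extra treatment).

```python
import struct
from typing import Dict, Iterator

# Hypothetical table that a pre-compilation pass could generate: ID -> format.
FORMATS: Dict[int, str] = {
    1: "Dumping task %d with %d fds",
    2: "Mapped vma at 0x%x",
}


def log_binary(fmt_id: int, *args: int) -> bytes:
    """Encode one record: u16 format ID, u8 argument count, raw 64-bit args."""
    return struct.pack(f"<HB{len(args)}q", fmt_id, len(args), *args)


def decode_log(data: bytes) -> Iterator[str]:
    """Decoder utility: look up each format by ID and render the message."""
    pos = 0
    while pos < len(data):
        fmt_id, nargs = struct.unpack_from("<HB", data, pos)
        pos += 3
        args = struct.unpack_from(f"<{nargs}q", data, pos)
        pos += 8 * nargs
        yield FORMATS[fmt_id] % args
```

The point of the split is that the writer only copies fixed-size values, while the expensive format-string scanning happens later in the decoder, and only if somebody actually reads the log.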

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, and it would be very beneficial to port it to mainline CRIU. An important part of this is integrating the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can simply be removed. Only the sizes of the pages*.img files are needed.
* Paths to files. Here we should keep the relations between paths. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) taking care that this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
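
For the path item in the list above, a 1:1 mapping that preserves the relations between paths could be sketched as follows. This is an illustrative helper only; <code>anonymise_path</code> and the token scheme are hypothetical and not part of CRIT.

```python
from typing import Dict


def anonymise_path(path: str, names: Dict[str, str]) -> str:
    """Replace every path component with a sequential token, reusing tokens."""
    parts = []
    for comp in path.split("/"):
        if comp in ("", ".", ".."):
            parts.append(comp)  # keep separators and relative markers as-is
            continue
        if comp not in names:
            names[comp] = f"n{len(names)}"  # new component, next token
        parts.append(names[comp])
    return "/".join(parts)
```

Sharing one <code>names</code> dictionary across all images (including sk-unix.img) keeps the mapping consistent, so two files in the same directory still share a prefix after anonymisation.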

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will need extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).
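
To make the corked state concrete, here is a small Linux-only sketch of what an application does with UDP_CORK, written in Python for brevity. The helper name <code>corked_send</code> is hypothetical; the constant value is taken from <code><linux/udp.h></code>. Between the two <code>setsockopt</code> calls the data sits in the socket's write queue, which is the state criu currently cannot dump.

```python
import socket

UDP_CORK = 1  # value from <linux/udp.h>, defined here explicitly


def corked_send(chunks, timeout: float = 2.0) -> bytes:
    """Send chunks through a corked UDP socket over loopback; return what arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as rx, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as tx:
        rx.bind(("127.0.0.1", 0))  # let the kernel pick a free port
        rx.settimeout(timeout)
        tx.connect(rx.getsockname())
        tx.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 1)  # cork: start accumulating
        for chunk in chunks:
            tx.send(chunk)  # data stays in the socket's write queue
        tx.setsockopt(socket.IPPROTO_UDP, UDP_CORK, 0)  # uncork: flush one datagram
        return rx.recv(65535)
```

Calling <code>corked_send([b"hello ", b"world"])</code> should deliver both chunks to the receiver as a single datagram.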

'''Notes:'''
There have already been a few (3) attempts to solve this problem:
* UDP_REPAIR approach didn't succeed: https://lore.kernel.org/netdev/721a2e32-c930-ad6b-5055-631b502ed11b@gmail.com/, https://lore.kernel.org/netdev/?q=udp_repair
* eBPF (CRIB) approach, socket queue iterator was not merged: https://lore.kernel.org/netdev/AM6PR03MB5848EDA002E3D7EACA7C6BDA99A52@AM6PR03MB5848.eurprd03.prod.outlook.com/

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

[[Category:GSoC]]
[[Category:Development]]

LXC

2024-07-01T04:33:37Z

Ptikhomirov: /* Checkpointing and restoring a container */

== Requirements ==

You should have built and installed a recent (>= 1.3.1) version of CRIU.

== Checkpointing and restoring a container ==

LXC upstream has begun to integrate checkpoint/restore support through the lxc-checkpoint tool. This functionality is available in the recently released LXC 1.1.0. You can install LXC 1.1.0, or check out the development version on Ubuntu by doing:
<source lang="bash">
sudo add-apt-repository ppa:ubuntu-lxc/daily
sudo apt-get update
sudo apt-get install lxc
</source>

Next, create a container:

<source lang="bash">
sudo lxc-create -t ubuntu -n u1 -- -r trusty -a amd64
</source>

Then add the following lines to its config:

<source lang="bash">
sudo tee -a /var/lib/lxc/u1/config << EOF
# hax for criu
lxc.console.path = none
lxc.tty.max = 0
lxc.cgroup.devices.deny = c 5:1 rwm
# on older lxc comment the above and uncomment the below
# lxc.console = none
# lxc.tty = 0
# lxc.cgroup.devices.deny = c 5:1 rwm
EOF
</source>

Finally, start and checkpoint the container:

<source lang="bash">
sudo lxc-start -n u1
sleep 5s # let the container get to a more interesting state
sudo lxc-checkpoint -s -D /tmp/checkpoint -n u1
</source>

At this point, the container's state is stored in /tmp/checkpoint, and the filesystem is in /var/lib/lxc/u1/rootfs. You can restore the container by doing:

<source lang="bash">
sudo lxc-checkpoint -r -D /tmp/checkpoint -n u1
</source>

And then, get your container's IP and ssh in:

<source lang="bash">
ssh ubuntu@$(sudo lxc-info -i -H -n u1)
</source>

== Troubleshooting ==

=== Error (mount.c:805): fusectl isn't empty: 8388625 ===

Dumping of fuse filesystems is currently not supported. Empty the container's <code>/sys/fs/fuse/connections</code> and try again.

=== Error (mount.c:517): Mount 58 (master_id: 12 shared_id: 0) has unreachable sharing ===

CRIU doesn't yet support shared mountpoints as LXC does; make sure your rootfs is on a non-shared mount.

== External links ==

* [https://www.youtube.com/watch?v=a9T2gcnQg2k&feature=youtu.be&t=18m8s The New New Thing: Turning Docker Tech into a Full Speed Hypervisor] - a talk by Tycho Andersen with a demo of migrating an LXC container running Doom
* [https://github.com/tych0/presentations/blob/master/ods2014.md Demo script]

[[Category: HOWTO]]
[[Category: Live migration]]

Google Summer of Code Ideas

2024-03-26T03:11:25Z

Ptikhomirov: /* Add support for checkpoint/restore of CORK-ed UDP socket */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like to support compression of memory page files
in CRIU using one of the fastest algorithms (which one to
choose is a matter of discussion!).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time it is
complex enough, as we need to touch the memory dump/restore code path
in CRIU and also handle many corner cases such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will need extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is a pidfd_open syscall which allows opening
a special PID file descriptor. Through it a user can send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have pidfds open.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows exactly which mount the file was originally opened from. This way criu can restore the fd by opening the same file from the topologically equivalent mount in the restored mount tree.

Restoring the fd from the right mount can matter in several cases: for instance, if the process later resolves paths relative to the fd (resolving from the same file on a different mount can lead to different resolved paths), or if the process checks the path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding the mount on which to reopen the file at restore if we only know the mnt_id but can't find it in /proc/<pid>/mountinfo.

The mountinfo file shows the mount tree topology of the current mntns: parent-child relations, sharing group information, mountpoint and fs root information. If we don't see an mnt_id in it, we know nothing about that mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case 1) criu has options that help it handle external dependencies.

In case 2), or if no options are provided, criu can't resolve the mnt_id in mountinfo and fails.

'''Solution:'''
We can handle 2) by resolving major/minor via fstat and using name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount. We can dump fd2 as usual instead and mark it as "detached" in the image, so that criu knows on restore where to find the file. On restore, instead of just opening fd2 from the actual restored mount, we create a temporary bind mount which is lazily unmounted right after the open, making the file appear as a file on a detached mount.

Known problems with this approach:

* stat on btrfs gives wrong major/minor
* file handles do not work everywhere
* file handles can return fd2 on a deleted file or on another hardlink; this needs special handling

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts. If we had it, we would not need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Kubernetes operator for managing container checkpoints ===

'''Summary:''' Develop a Kubernetes operator that automates the management of container checkpoints

Container checkpointing has recently been introduced as an alpha feature in Kubernetes.
To enable this feature, the kubelet API was extended with an endpoint that enables the
creation of checkpoints for individual containers. By default, all container checkpoints
are stored as tar archives in <code>/var/lib/kubelet/checkpoints</code> using the following
file name format: <code>checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar</code>.
However, the current implementation does not provide a mechanism for limiting the number
of checkpoints, which may lead to filling up all existing disk space. This project aims to
develop a Kubernetes operator that automates the management of checkpoints and provides
a garbage collection mechanism to discard obsolete checkpoints.
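
A retention policy of the kind described above could be sketched like this. It is written in Python only for brevity (the project itself lists Go), and the function name is hypothetical. It also assumes that the trailing <code><timestamp></code> in the file name is RFC 3339-like, e.g. <code>2024-03-26T03:11:25Z</code>, which the text above does not specify.

```python
import re
from collections import defaultdict
from typing import Iterable, List

# Assumption: the trailing <timestamp> looks like 2024-03-26T03:11:25Z.
_NAME = re.compile(
    r"^(checkpoint-.+)-(\d{4}-\d{2}-\d{2}T[0-9:.]+(?:Z|[+-]\d{2}:\d{2}))\.tar$"
)


def obsolete_checkpoints(names: Iterable[str], keep: int) -> List[str]:
    """Return archive names to delete, keeping the newest `keep` per container."""
    groups = defaultdict(list)
    for name in names:
        m = _NAME.match(name)
        if m:  # ignore files that do not look like checkpoints
            groups[m.group(1)].append((m.group(2), name))
    doomed = []
    for entries in groups.values():
        entries.sort()  # same-zone timestamps of this shape sort chronologically
        cut = max(len(entries) - keep, 0)
        doomed += [name for _, name in entries[:cut]]
    return sorted(doomed)
```

Grouping by everything before the timestamp treats each pod/namespace/container combination separately, so a busy container cannot evict the only checkpoint of a quiet one.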

'''Links:'''
* https://github.com/checkpoint-restore/checkpoint-restore-operator
* https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
* https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/
* https://github.com/kubernetes/kubernetes/pull/115888
* https://github.com/kubernetes/enhancements/issues/2008

'''Details:'''
* Skill level: intermediate
* Language: Go
* Expected size: 350 hours
* Mentors: Adrian Reber <areber@redhat.com>, Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC but currently have nobody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with the plain fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time, the printf family of functions is known to take some time to work: they need to scan the format string for %-s and then convert the arguments into strings. Comparing criu dump with and without logs, the time difference is notable (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or has some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, and it would be very beneficial to port it to mainline CRIU. An important part of this is integrating the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can simply be removed. Only the sizes of the pages*.img files are needed.
* Paths to files. Here we should keep the relations between paths. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) taking care that this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2024-03-26T03:11:00Z

Ptikhomirov: /* Add support for memory compression */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like to support compression of memory page files
in CRIU using one of the fastest algorithms (which one to
choose is a matter of discussion!).

This task does not require any Linux kernel modifications,
and its scope is limited to CRIU itself. At the same time it is
complex enough, as we need to touch the memory dump/restore code path
in CRIU and also handle many corner cases such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will need extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is a pidfd_open syscall which allows opening
a special PID file descriptor. A user can then send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening the same exact file from topologically the same mount in restored mount tree.

Restoring fd from the right mount can be important in different cases, for instance if the process would later want to resolve paths relative to the fd, and obviously resolving from the same file on different mount can lead to different resolved paths, or if the process wants to check path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding on which mount we need to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

Mountinfo file shows the mount tree topology of current mntns: parent - child relations, sharing group information, mountpoint and fs root information. And if we don't see mnt_id in it we don't know anything about this mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2), or if no options are provided, criu can't resolve mnt_id in mountinfo and fails.

'''Solution:'''
We can handle 2) as follows: resolve the major/minor via fstat, then use name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount, so we can dump fd2 as usual and mark it as "detached" in the image. On restore, criu then knows where to find the file, but instead of just opening it from the actually restored mount, it creates a temporary bind mount that is lazily unmounted right after the open, which makes the file appear as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives the wrong major/minor
* File handles do not work everywhere
* File handles can return fd2 on a deleted file or on another hardlink; this needs special handling.

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts; if we have it, we don't need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Kubernetes operator for managing container checkpoints ===

'''Summary:''' Develop a Kubernetes operator that automates the management of container checkpoints

Container checkpointing has recently been introduced as an alpha feature in Kubernetes.
To enable this feature, the kubelet API was extended with an endpoint that enables the
creation of checkpoints for individual containers. By default, all container checkpoints
are stored as tar archives in <code>/var/lib/kubelet/checkpoints</code> using the following
file name format: <code>checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar</code>.
However, the current implementation does not provide a mechanism for limiting the number
of checkpoints, which may lead to filling up all existing disk space. This project aims to
develop a Kubernetes operator that automates the management of checkpoints and provides
a garbage collection mechanism to discard obsolete checkpoints.

'''Links:'''
* https://github.com/checkpoint-restore/checkpoint-restore-operator
* https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
* https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/
* https://github.com/kubernetes/kubernetes/pull/115888
* https://github.com/kubernetes/enhancements/issues/2008

'''Details:'''
* Skill level: intermediate
* Language: Go
* Expected size: 350 hours
* Mentors: Adrian Reber <areber@redhat.com>, Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU produces a lot of logs when doing its job. Logging is done with the simple fprintf function. The logs are typically useless, but ''if'' some operation fails -- the logs are the only way to find the reason for the failure.

At the same time the printf family of functions is known to take some time to work -- they need to scan the format string for %-s and then convert the arguments into strings. If comparing criu dump with and without logs the time difference is notable (15%-20%), so speeding the logs up will help improve criu performance.

One of the solutions to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or has some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
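The binary logging idea can be illustrated with a minimal Python sketch. It is not CRIU code: the <code>FORMATS</code> table, the record layout (a 16-bit format ID, an argument count, then 64-bit integers) and the names <code>log_binary</code> and <code>decode</code> are assumptions made for illustration, and only integer arguments are handled.

```python
import io
import struct

# Hypothetical format table that a pre-compilation pass would generate.
# The writer and the decoder must share the same table.
FORMATS = [
    "Dumping task %d\n",
    "Mapped %d pages at 0x%x\n",
]
FMT_IDS = {fmt: fmt_id for fmt_id, fmt in enumerate(FORMATS)}


def log_binary(out, fmt, *args):
    """Write one record: format ID, argument count, raw integer arguments."""
    out.write(struct.pack("<HB", FMT_IDS[fmt], len(args)))
    out.write(struct.pack("<%dq" % len(args), *args))


def decode(stream):
    """Turn a binary log back into the text messages it stands for."""
    messages = []
    while True:
        head = stream.read(3)
        if len(head) < 3:
            break
        fmt_id, nargs = struct.unpack("<HB", head)
        args = struct.unpack("<%dq" % nargs, stream.read(8 * nargs))
        # The expensive string formatting happens only here, in the decoder.
        messages.append(FORMATS[fmt_id] % args)
    return messages
```

The point of the design is that the hot path only copies a few fixed-size integers, while scanning the format string is deferred to the decoding utility.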

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part of it is the need to implement the integration of Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a BUG it may be not acceptable for the reporter to send us raw images, as they may contain sensitive data. Need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the paths' relations to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) keeping an eye on making this mapping 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
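As a rough illustration of the 1:1 path mapping mentioned in the list above, here is a minimal Python sketch. The <code>PathAnonymizer</code> class and its sequential <code>n1</code>, <code>n2</code>, ... tokens are hypothetical and not part of CRIT; the sketch replaces each path component separately so that paths sharing a directory still share a prefix after anonymisation.

```python
import itertools


class PathAnonymizer:
    """Replace path components with sequential tokens, consistently."""

    def __init__(self):
        self._tokens = {}                     # original name -> token
        self._counter = itertools.count(1)

    def _component(self, name):
        # The same name always gets the same token, and every new name
        # gets a fresh token, so the mapping stays 1:1.
        if name not in self._tokens:
            self._tokens[name] = "n%d" % next(self._counter)
        return self._tokens[name]

    def anonymize(self, path):
        parts = []
        for part in path.split("/"):
            # Keep empty parts (leading or doubled slashes) and the
            # special entries, which carry no sensitive information.
            if part in ("", ".", ".."):
                parts.append(part)
            else:
                parts.append(self._component(part))
        return "/".join(parts)
```

One instance would have to be shared across all images of a dump (including sk-unix.img) for the relations between paths to survive.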

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2024-03-26T03:10:26Z

Ptikhomirov: /* Add support for memory compression */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Add support for memory compression ===

'''Summary:''' Support compression for page images

We would like to support compression of memory page files
in CRIU using one of the fastest algorithms (which one to
choose is a matter of discussion!).

This task does not require any Linux kernel modifications
and its scope is limited to CRIU itself. At the same time, it is
complex enough, as we need to touch the memory dump/restore code path
in CRIU and also handle many corner cases such as the page server.

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There's a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Andrei Vagin <avagin@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is a pidfd_open syscall which allows opening
a special PID file descriptor. A user can then send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening the same exact file from topologically the same mount in restored mount tree.

Restoring fd from the right mount can be important in different cases, for instance if the process would later want to resolve paths relative to the fd, and obviously resolving from the same file on different mount can lead to different resolved paths, or if the process wants to check path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding on which mount we need to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

Mountinfo file shows the mount tree topology of current mntns: parent - child relations, sharing group information, mountpoint and fs root information. And if we don't see mnt_id in it we don't know anything about this mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2), or if no options are provided, criu can't resolve mnt_id in mountinfo and fails.

'''Solution:'''
We can handle 2) as follows: resolve the major/minor via fstat, then use name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount, so we can dump fd2 as usual and mark it as "detached" in the image. On restore, criu then knows where to find the file, but instead of just opening it from the actually restored mount, it creates a temporary bind mount that is lazily unmounted right after the open, which makes the file appear as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives the wrong major/minor
* File handles do not work everywhere
* File handles can return fd2 on a deleted file or on another hardlink; this needs special handling.

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts; if we have it, we don't need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Checkpointing of POSIX message queues ===

'''Summary:''' Add support for checkpoint/restore of POSIX message queues

POSIX message queues are a widely used inter-process communication mechanism. Message queues are implemented as files on a virtual filesystem (mqueue), where a file descriptor (message queue descriptor) is used to perform operations such as sending or receiving messages. To support checkpoint/restore of POSIX message queues, we need a kernel interface (similar to [https://github.com/checkpoint-restore/criu/commit/8ce9e947051e43430eb2ff06b96dddeba467b4fd MSG_PEEK]) that would enable the retrieval of messages from a queue without removing them. This project aims to implement such an interface that allows retrieving all messages and their priorities from a POSIX message queue.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/2285
* https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/mqueue.c
* https://www.man7.org/tlpi/download/TLPI-52-POSIX_Message_Queues.pdf

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Pavel Tikhomirov <ptikhomirov@virtuozzo.com>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

=== Kubernetes operator for managing container checkpoints ===

'''Summary:''' Develop a Kubernetes operator that automates the management of container checkpoints

Container checkpointing has recently been introduced as an alpha feature in Kubernetes.
To enable this feature, the kubelet API was extended with an endpoint that enables the
creation of checkpoints for individual containers. By default, all container checkpoints
are stored as tar archives in <code>/var/lib/kubelet/checkpoints</code> using the following
file name format: <code>checkpoint-<pod-name>_<namespace-name>-<container-name>-<timestamp>.tar</code>.
However, the current implementation does not provide a mechanism for limiting the number
of checkpoints, which may lead to filling up all existing disk space. This project aims to
develop a Kubernetes operator that automates the management of checkpoints and provides
a garbage collection mechanism to discard obsolete checkpoints.

'''Links:'''
* https://github.com/checkpoint-restore/checkpoint-restore-operator
* https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
* https://kubernetes.io/blog/2023/03/10/forensic-container-analysis/
* https://github.com/kubernetes/kubernetes/pull/115888
* https://github.com/kubernetes/enhancements/issues/2008

'''Details:'''
* Skill level: intermediate
* Language: Go
* Expected size: 350 hours
* Mentors: Adrian Reber <areber@redhat.com>, Radostin Stoyanov <rstoyanov@fedoraproject.org>, Prajwal S N <prajwalnadig21@gmail.com>
* Suggested by: Adrian Reber

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== Optimize logging engine ===

'''Summary:''' CRIU produces a lot of logs when doing its job. Logging is done with the simple fprintf function. The logs are typically useless, but ''if'' some operation fails -- the logs are the only way to find the reason for the failure.

At the same time the printf family of functions is known to take some time to work -- they need to scan the format string for %-s and then convert the arguments into strings. If comparing criu dump with and without logs the time difference is notable (15%-20%), so speeding the logs up will help improve criu performance.

One of the solutions to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or has some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part of it is the need to implement the integration of Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a BUG it may be not acceptable for the reporter to send us raw images, as they may contain sensitive data. Need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the paths' relations to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) keeping an eye on making this mapping 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

[[Category:GSoC]]
[[Category:Development]]

Validate files on restore

2023-08-28T05:33:39Z

Ptikhomirov:

This article describes what CRIU does to make sure it restores the correct set of files and how this file validation is implemented in CRIU. This project was completed under the [https://summerofcode.withgoogle.com/projects/#5773537320632320 GSoC 2020 program].

Note: This is NOT merged to CRIU, see https://github.com/checkpoint-restore/criu/pull/1148

== The previous implementation ==
Since CRIU doesn’t carry the contents of files into images while dumping (except for ghost files), files that are being restored must be validated to make sure they are the “same” as they were during the dumping process (this is especially important for ELF files, since there is a risk of restoring executables or libraries of a different version). Previously, this was done only by storing and comparing the size of the file. By itself, this isn’t a very strong check.

== The current implementation ==
The file size method is used as a preliminary check: if it fails, there’s no need to run any of the more intensive checks, and CRIU immediately reports an error and stops restoring. Stronger checks are used only if it passes.

The simplest and strongest check is to calculate the checksum for the entire file but this would be very intensive for large files and therefore not always feasible. A reasonable compromise would be to calculate the checksum only for certain parts of the file. This is the checksum method and it is one of the two methods that have been implemented in CRIU.

The other method is the build-ID method. The build-ID is a "strongly unique embedded identifier" that (if present) is stored in a particular note section of ELF files.

== Build-ID ==
The build-ID (if present) is stored in a note of type <code>NT_GNU_BUILD_ID</code> in the ELF file. Notes live in a note segment, which is described by a program header of type <code>PT_NOTE</code>. After the file has been mmap-ed, the first thing that needs to be done is to check whether the file is an ELF file or not. This is done by checking for the magic number. The next step is to identify whether the file is a 32-bit or a 64-bit ELF file, since the data types used to parse the file depend on its bitness (there are specific 32-bit and 64-bit variants of the data structures in <code>elf.h</code>), but the procedure remains the same.

The position of the program headers is stored as an offset in the <code>e_phoff</code> field of the ELF header. Since the program headers are stored in an arbitrary order, each program header needs to be checked. If a program header of type <code>PT_NOTE</code> is found, the position of its note segment is stored as an offset in <code>p_offset</code>. The notes are stored in an arbitrary order as well, so each note needs to be checked. If a note of type <code>NT_GNU_BUILD_ID</code> is found, the build-ID is present in its description.
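The lookup described above can be sketched in Python. This is an illustration and not the CRIU implementation (which is written in C): the function name <code>find_build_id</code> is hypothetical, field offsets follow the standard <code>elf.h</code> layouts, and notes are assumed to use 4-byte alignment and the owner name <code>GNU</code>.

```python
import struct

PT_NOTE = 4            # program header type for note segments
NT_GNU_BUILD_ID = 3    # note type that carries the build-ID


def find_build_id(image):
    """Return the build-ID bytes of an ELF image, or None if not found."""
    if len(image) < 6 or image[:4] != b"\x7fELF":
        return None                              # magic number check
    ei_class, ei_data = image[4], image[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None
    is64 = ei_class == 2                         # ELFCLASS64, else ELFCLASS32
    end = "<" if ei_data == 1 else ">"           # byte order of the file
    try:
        if is64:
            (e_phoff,) = struct.unpack_from(end + "Q", image, 0x20)
            e_phentsize, e_phnum = struct.unpack_from(end + "HH", image, 0x36)
        else:
            (e_phoff,) = struct.unpack_from(end + "I", image, 0x1C)
            e_phentsize, e_phnum = struct.unpack_from(end + "HH", image, 0x2A)
        for index in range(e_phnum):             # headers are in any order
            ph = e_phoff + index * e_phentsize
            (p_type,) = struct.unpack_from(end + "I", image, ph)
            if p_type != PT_NOTE:
                continue
            if is64:
                (p_offset,) = struct.unpack_from(end + "Q", image, ph + 0x08)
                (p_filesz,) = struct.unpack_from(end + "Q", image, ph + 0x20)
            else:
                (p_offset,) = struct.unpack_from(end + "I", image, ph + 0x04)
                (p_filesz,) = struct.unpack_from(end + "I", image, ph + 0x10)
            pos = p_offset
            stop = min(p_offset + p_filesz, len(image))
            while pos + 12 <= stop:              # notes are in any order too
                namesz, descsz, n_type = struct.unpack_from(
                    end + "III", image, pos)
                name_off = pos + 12
                desc_off = name_off + ((namesz + 3) & ~3)
                name = image[name_off:name_off + namesz]
                if n_type == NT_GNU_BUILD_ID and name == b"GNU\0":
                    desc = image[desc_off:desc_off + descsz]
                    return bytes(desc) if len(desc) == descsz else None
                pos = desc_off + ((descsz + 3) & ~3)
    except struct.error:
        return None                              # truncated or malformed file
    return None
```

The <code>image</code> argument can be a <code>bytes</code> object or an <code>mmap</code> of the file, which mirrors the mmap-based approach described above.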

== Checksum ==
CRC32C is used to calculate the checksum. The only difference between CRC32C and CRC32 is the polynomial being used. CRC32C uses the Castagnoli polynomial (0x82F63B78 in reflected, least-significant-bit-first notation) and CRC32 uses 0xEDB88320 (in the same notation).
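To make the role of the polynomial concrete, here is a small table-driven sketch in Python. It is not taken from CRIU: the helper names are hypothetical, and the initial value and final XOR of 0xFFFFFFFF are the common CRC-32 conventions, assumed here rather than confirmed for CRIU's code.

```python
def make_crc_table(reflected_poly):
    """Build the 256-entry lookup table for a reflected 32-bit CRC."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ reflected_poly
            else:
                crc >>= 1
        table.append(crc)
    return table


# The two checksums differ only in the polynomial used to build the table.
CRC32C_TABLE = make_crc_table(0x82F63B78)   # Castagnoli
CRC32_TABLE = make_crc_table(0xEDB88320)    # classic CRC32


def crc_update(table, data, crc=0):
    """Feed data into a running CRC; pass the previous result to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = table[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c(data, crc=0):
    return crc_update(CRC32C_TABLE, data, crc)
```

Because the running value can be passed back in, the checksum can be accumulated chunk by chunk, which fits mapping the file a region at a time.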

The file is mapped 10 MB at a time and the checksum is calculated on the required bytes (depending on the configuration set: the entire file, the first N bytes of the file, or every Nth byte of the file). N is the checksum parameter.

Adding a new configuration is quite simple and only requires the iterator to be moved to the necessary bytes (0 refers to the first byte of the file and 1 refers to the second byte and so on):
# Input handling for the new configuration
# The <code>checksum_iterator_init</code> function in ''"criu/files-reg.c"'' sets the initial iterator position (the first byte to calculate the checksum)
# The <code>checksum_iterator_next</code> function in ''"criu/files-reg.c"'' moves the iterator to the next position (the next byte to calculate the checksum)
# The <code>checksum_iterator_stop</code> function in ''"criu/files-reg.c"'' returns true when the iterator has reached its final position
There is a separate check in the <code>calculate_checksum</code> function to make sure the iterator refers to a valid byte (Not negative and smaller than the total number of bytes in the file). If the iterator is outside the mapped region but still valid, the required region of the file will be mapped.
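The three configurations can be pictured as three ways of walking byte offsets. The Python sketch below is illustrative and does not mirror the C iterator functions: <code>checksum_offsets</code> and <code>selected_bytes</code> are hypothetical names, and the method strings are the values accepted by <code>--file-validation</code> as listed in this article.

```python
def checksum_offsets(file_size, method, n=1024):
    """Return the byte offsets that a checksum configuration covers."""
    if n <= 0:
        raise ValueError("checksum parameter must be positive")
    if method == "checksum-full":
        return range(0, file_size)             # every byte of the file
    if method == "checksum":
        return range(0, min(n, file_size))     # first N bytes only
    if method == "checksum-period":
        return range(0, file_size, n)          # bytes 0, N, 2N, ...
    raise ValueError("unknown checksum method: %s" % method)


def selected_bytes(data, method, n=1024):
    """Collect the bytes a configuration would feed into the checksum."""
    return bytes(data[i] for i in checksum_offsets(len(data), method, n))
```

Every offset produced this way is non-negative and smaller than the file size, which is the validity condition described above.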

== Using different validation methods and parameters ==
The build-ID method is much less intensive compared to the checksum method while still being a much stronger check than simply comparing the file size and is therefore the default. In other words, <code>--file-validation buildid</code> will not make a difference as this is the default method.

CRIU can also be configured to use the checksum method by using the <code>--file-validation</code> option:
* <code>--file-validation checksum-full</code> to calculate and use the checksum on the entire file.
* <code>--file-validation checksum</code> to calculate and use the checksum on the first N bytes of the file. The parameter N is set by using the <code>--checksum-parameter</code> option.
* <code>--file-validation checksum-period</code> to calculate and use the checksum on every Nth byte of the file (including the first byte of the file). The parameter N is set by using the <code>--checksum-parameter</code> option.
By default, the checksum parameter N is set to 1024. If a method that doesn’t require the checksum parameter is being used, then the checksum parameter is simply ignored.
For example, to use the checksum method on the first 2048 bytes of the file: <code>--file-validation checksum --checksum-parameter 2048</code>

If the build-ID method is being used and is inconclusive (for example because the ELF file doesn’t contain a build-ID), the checksum method on the first 1024 bytes of the file is used as a fallback. If the checksum method is being used and is inconclusive, the build-ID method is used as a fallback. If both are inconclusive, only the file size check is used, and a warning informs the user that only a weak check has been used for that particular file.

To explicitly use only the file size check all the time, the following command-line option can be used: <code>--file-validation filesize</code> (this is the fastest and least intensive check).
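The fallback order described above can be summarised in a short Python sketch. The function <code>validation_cookie</code> and the two callbacks are hypothetical stand-ins for the real build-ID and checksum routines; each callback is assumed to return <code>None</code> when its method is inconclusive.

```python
import sys
from typing import Callable, Optional, Tuple

Check = Callable[[str], Optional[bytes]]  # returns None when inconclusive


def validation_cookie(path: str, method: str,
                      build_id: Check, checksum: Check) -> Tuple[str, Optional[bytes]]:
    """Return (method actually used, cookie); the cookie is None for size-only."""
    if method == "filesize":
        return "filesize", None
    order = [("buildid", build_id), ("checksum", checksum)]
    if method.startswith("checksum"):
        order.reverse()  # checksum first, build-ID as the fallback
    for name, check in order:
        cookie = check(path)
        if cookie is not None:
            return name, cookie
    # Both methods were inconclusive: fall back to the weak check and warn.
    print(f"warning: only the file size is checked for {path}", file=sys.stderr)
    return "filesize", None
```

For instance, calling it with <code>method="buildid"</code> and a build-ID callback that returns <code>None</code> falls through to the checksum callback.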

== Performance impact ==
The values shown below are the average times it took to finish the ZDTM tests over multiple runs, and are meant only to indicate the impact each method has in general.

Each test has 3 files (of sizes approximately 0.09 MB, 2 MB and 0.17 MB) and each test is run 3 times (in Host, Namespace and User Namespace). For each file the checksum/build-ID is obtained twice (during dump and restore), so the function that finds the checksum/build-ID is called 18 times overall per test.

For reference, these tests were run on tmpfs (to remove any disk latency) and on an undervolted i5 4800H.

{| class="wikitable" style="text-align: center;"
|+zdtm/transition/shmem:
|-
!Validation method
!Average time
|-
|File Size
|3.782s
|-
|Build-ID
|4.153s (~10% increase)
|-
|Checksum (First 1024 bytes)
|4.465s (~18% increase)
|-
|Checksum (Entire File)
|4.722s (~25% increase)
|-
|Checksum (Every 1024th byte)
|4.498s (~19% increase)
|}

{| class="wikitable" style="text-align: center;"
|+zdtm/static/maps04:
|-
!Validation method
!Average time
|-
|File Size
|35.317s
|-
|Build-ID
|35.720s (~1% increase)
|-
|Checksum (First 1024 bytes)
|35.919s (~2% increase)
|-
|Checksum (Entire File)
|36.679s (~4% increase)
|-
|Checksum (Every 1024th byte)
|36.476s (~3% increase)
|}

== Scope for improvement and future work ==
* Calculating the checksum can be made faster by using a lookup table.

[[Category:Under the hood]]

Google Summer of Code Ideas

2023-03-13T10:55:10Z

Ptikhomirov: /* Files on detached mounts */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three-month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with a simple fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time the printf family of functions is known to take some time to work: they need to scan the format string for %-s and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding the logs up will help improve criu performance.

One of the solutions to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or has some automation to convert them.

The option to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
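The idea of storing a format ID plus raw arguments, and formatting only at decode time, can be sketched in Python as follows. Everything here is an assumption made for illustration: the record layout (16-bit ID, 8-bit argument count, 64-bit integer arguments), the <code>FORMATS</code> table that a pre-compilation pass would generate, and the restriction to integer arguments.

```python
import struct

# Hypothetical table that a pre-compilation pass would build from the sources.
FORMATS = ["Dumping task %d", "Mapped %d pages at 0x%x"]


def log_binary(out: bytearray, fmt_id: int, *args: int) -> None:
    """Append one record: format ID, argument count, raw 64-bit arguments."""
    out += struct.pack("<HB", fmt_id, len(args))
    out += struct.pack(f"<{len(args)}q", *args)


def decode(log: bytes):
    """Yield the formatted messages from a buffer written by log_binary."""
    pos = 0
    while pos < len(log):
        fmt_id, argc = struct.unpack_from("<HB", log, pos)
        pos += 3  # size of the "<HB" header
        args = struct.unpack_from(f"<{argc}q", log, pos)
        pos += 8 * argc
        yield FORMATS[fmt_id] % args  # the costly formatting happens only here


buf = bytearray()
log_binary(buf, 0, 42)
log_binary(buf, 1, 3, 4096)
print(list(decode(bytes(buf))))
```

The writer never touches the format string, which is the part of printf-style logging that the paragraph above identifies as expensive. String arguments would need an extra encoding step that this sketch leaves out.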

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Suggested by: Andrei Vagin
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There is a UDP_CORK option for sockets. As the udp(7) man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will need extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* https://github.com/criupatchwork/criu/commit/a532312
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is a pidfd_open syscall which allows opening
a special PID file descriptor. With it, a user can send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Add support for memfd_secret file descriptors ===

'''Summary:''' Support C/R of memfd_secret descriptors

There is a memfd_secret syscall which allows a user to open
a special memfd backed by a special memory range which
is inaccessible to other processes (and to the kernel too!).

At the moment CRIU can't dump processes that have open memfd_secret descriptors.

'''Links:'''
* https://lwn.net/Articles/865256/

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Mike Rapoport <mike.rapoport@gmail.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Forensic analysis of container checkpoints ===

'''Summary:''' Extending go-crit with capabilities for forensic analysis

The go-crit tool was created during GSoC 2022 to enable analysis of CRIU [[images]] with tools written in Go. It allows container management tools such as [https://github.com/checkpoint-restore/checkpointctl checkpointctl] and Podman to provide capabilities similar to CRIT. The goal of this project is to extend go-crit with functionality for forensic analysis of container checkpoints to provide a better user experience.

The go-crit tool is still in its early stages of development. To effectively utilise this new feature, the checkpointctl tool would be extended to display information about the processes included in a container checkpoint and their runtime state (e.g., memory, open files, sockets, etc).

'''Links:'''
* https://criu.org/CRIT_(Go_library)
* https://github.com/checkpoint-restore/go-criu/tree/master/crit
* https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/

'''Details:'''
* Skill level: intermediate
* Language: Go
* Expected size: 350 hours
* Mentors: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Adrian Reber <areber@redhat.com>
* Suggested by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening exactly the same file from topologically the same mount in the restored mount tree.

Restoring the fd from the right mount can be important in different cases, for instance if the process later wants to resolve paths relative to the fd (resolving from the same file on a different mount can lead to different resolved paths), or if the process wants to check the path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding the mount on which to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

The mountinfo file shows the mount tree topology of the current mntns: parent-child relations, sharing group information, mountpoint and fs root information. If we don't see mnt_id in it, we don't know anything about this mount.
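The lookup that fails in this situation can be sketched in Python. The helpers below are hypothetical and work on the text of the two proc files rather than reading them, so the logic is easy to try out; they rely only on the <code>mnt_id:</code> field of fdinfo and on the mount ID being the first field of each mountinfo line.

```python
def fdinfo_mnt_id(fdinfo_text: str) -> int:
    """Extract mnt_id from the contents of /proc/<pid>/fdinfo/<fd>."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key == "mnt_id":
            return int(value)
    raise ValueError("no mnt_id field")


def mountinfo_ids(mountinfo_text: str) -> set:
    """Collect mount IDs (first field) from /proc/<pid>/mountinfo contents."""
    return {int(line.split()[0])
            for line in mountinfo_text.splitlines() if line.strip()}


def mount_is_invisible(fdinfo_text: str, mountinfo_text: str) -> bool:
    """True when the fd's mount does not appear in the mount namespace."""
    return fdinfo_mnt_id(fdinfo_text) not in mountinfo_ids(mountinfo_text)


fdinfo = "pos:\t0\nflags:\t02100000\nmnt_id:\t29\n"
mountinfo = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue\n"
print(mount_is_invisible(fdinfo, mountinfo))  # True: mount 29 is not listed
```

A <code>True</code> result corresponds to the problematic situation: the fd reports a mount that the mount namespace no longer (or never) shows.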

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container's mountinfo
* 2) the mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2), or if no options are provided, criu can't resolve mnt_id in mountinfo and criu fails.

'''Solution:'''
We can handle 2) by resolving the major/minor via fstat and using name_to_handle_at and open_by_handle_at to open the same file on any other available mount from the same superblock (same major/minor) in the container. Now we have fd2 of the same file as fd, but on an existing mount, so we can dump it as usual instead and mark it as "detached" in the image. On restore criu then knows where to find this file, but instead of just opening fd2 from the actually restored mount, it creates a temporary bind mount that is lazily unmounted right after the open, making the file appear as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives the wrong major/minor
* File handles do not work everywhere
* File handles can return fd2 on a deleted file or on another hardlink; this needs special handling.

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts; if we have it, we don't need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part of it is the need to implement the integration of the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a BUG it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. The sizes of the pages*.img files alone are enough.
* Paths to files. Here we should keep the relations of the paths to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) keeping an eye on making this mapping 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
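For the "paths to files" item, a 1:1 mapping that preserves path relations could look like the Python sketch below. The class <code>PathAnonymiser</code> is hypothetical and is not part of CRIT; it replaces each path component with a sequential token and reuses the token whenever the same component is seen again.

```python
import itertools


class PathAnonymiser:
    """Replace path components with sequential tokens, keeping the mapping 1:1."""

    def __init__(self):
        self._names = {}
        self._counter = itertools.count(1)

    def _token(self, component: str) -> str:
        # Empty components, "." and ".." carry structure, not data: keep them.
        if component in ("", ".", ".."):
            return component
        if component not in self._names:
            self._names[component] = f"f{next(self._counter)}"
        return self._names[component]

    def anonymise(self, path: str) -> str:
        return "/".join(self._token(c) for c in path.split("/"))


anon = PathAnonymiser()
print(anon.anonymise("/home/alice/lib.so"))   # /f1/f2/f3
print(anon.anonymise("/home/alice/data/db"))  # /f1/f2/f4/f5
```

Two paths that shared a directory before anonymisation still share it afterwards, and the same instance has to be used for every image (including sk-unix.img) so that the mapping stays consistent.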

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python

[[Category:GSoC]]
[[Category:Development]]

GSoC completed projects

2023-01-25T07:28:44Z

Ptikhomirov: /* Support sparse ghosts */

=== Restrict checks for open/mmaped files ===

'''Summary:''' Make sure the file opened (for fd or mapping) at restore is "the same" as it was on dump

'''Merged:''' https://github.com/checkpoint-restore/criu/pull/1123

CRIU doesn't carry file contents (except for ghost files) into images. Thus on dump it saves some "meta" for each file to validate that it's "the same" on restore. Currently this meta includes only the file size. The task is to add some cookie value that's somehow affected by the file's contents. This is primarily needed to reduce the possibility of restoring with the wrong libraries.

'''Links:'''
* https://www.criu.org/Category:Files
* https://en.wikipedia.org/wiki/File_verification

=== Optimize the pre-dump algorithm ===

'''Summary:''' Optimize the pre-dump algorithm to avoid pinning too much memory in RAM

'''Merged:''' https://github.com/checkpoint-restore/criu/commit/98608b90de0f853b1c8a6e15b312320e1441c359

The current [[CLI/cmd/pre-dump|pre-dump]] mode is used to write task memory contents into image
files without stopping the task for too long. It does this by stopping the task, infecting it and
draining all the memory into a set of pipes. Then the task is cured and resumed, and the pipes'
contents are written into images (maybe via a [[page server]]). Unfortunately, this approach puts
a lot of stress on the memory subsystem: keeping all memory in pipes creates a lot of unreclaimable
memory (pages in pipes are not swappable), and the number of pipes can itself be huge, as
one pipe doesn't store more than a fixed amount of data (see the pipe(7) man page).

A solution to this problem is to use the process_vm_readv() syscall, which mitigates
all of the above. To do this we need to allocate a temporary buffer in criu, then walk the
target process's memory, copying it piece by piece into the buffer, then flush the data into the image
(or page server), and repeat.
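The "copy a piece, flush, repeat" loop can be sketched in Python. This is a simplified model, not CRIU's code: <code>read_mem</code> stands in for the cross-process read syscall (see process_vm_readv(2) in the links below), <code>write_out</code> for the image or page-server writer, and all names are made up for the example.

```python
def walk_in_chunks(vmas, buf_size):
    """Yield (address, length) pieces of at most buf_size bytes for each (start, end) range."""
    for start, end in vmas:
        addr = start
        while addr < end:
            length = min(buf_size, end - addr)
            yield addr, length
            addr += length


def drain(vmas, read_mem, write_out, buf_size=4 << 20):
    """Copy the ranges through a bounded buffer instead of holding them all at once."""
    for addr, length in walk_in_chunks(vmas, buf_size):
        write_out(read_mem(addr, length))  # copy one piece, flush it, repeat


# Toy "address space" to try the loop without touching another process.
memory = bytes(range(64))
image = bytearray()
drain([(0, 10), (20, 25)], lambda a, n: memory[a:a + n], image.extend, buf_size=4)
print(len(image))  # 15 bytes copied, never more than 4 at a time
```

Only <code>buf_size</code> bytes are held at any time, in contrast to the pipe-based approach where all drained pages stay pinned until they are written out.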

Ideally there would be a sys_splice_process_vm() syscall in the kernel that does the same as
process_vm_readv() does, but vmsplices the data.

'''Links:'''
* [[Memory pre dump]]
* https://github.com/checkpoint-restore/criu/issues/351
* [[Memory dumping and restoring]], [[Memory changes tracking]]
* [http://man7.org/linux/man-pages/man2/process_vm_readv.2.html process_vm_readv(2)] [http://man7.org/linux/man-pages/man2/vmsplice.2.html vmsplice(2)] [https://lkml.org/lkml/2018/1/9/32 RFC for splice_process_vm syscall]

=== Porting crit functionalities in GO ===

'''Summary:''' Implement image view and manipulation in Go

'''Merged:''' https://github.com/checkpoint-restore/go-criu/pull/66

CRIU's checkpoint images are stored on disk using protobuf. For easier analysis of checkpoint files CRIU has a tool called [[CRIT|CRiu Image Tool (CRIT)]]. It can display/decode CRIU image files from binary protobuf to JSON as well as encode JSON files back to the binary format. With closer integration of CRIU in container runtimes it becomes important to be able to view the CRIU output files, either for manipulation before restoring or for reading checkpoint statistics (memory pages written to disk, memory pages skipped, process downtime).

Currently CRIT is implemented in Python. For easier integration in other Go projects it is important to have image manipulation and analysis available from Go. This means we need a Go-based library to read/modify/write/encode/decode CRIU's image files. Based on this library, a Go-based implementation of CRIT would be useful.

'''Links:'''
* [[CRIT (Go library)]]
* https://github.com/snprajwal/gsoc-2022

=== Support sparse ghosts ===

'''Summary:''' While sparse ghost files were partly supported for quite some time, we were still not able to handle big sparse ghost files and highly fragmented sparse ghost files effectively.

'''Merged:''' https://github.com/checkpoint-restore/criu/pull/1944 https://github.com/checkpoint-restore/criu/pull/1963

When criu dumps processes it also dumps files that are opened by them. It does this by saving the file names by which the files are accessible. But sometimes files can have no names. This may happen if a task opened a file and then removed it. To dump such a file criu cannot save its name (because the name doesn't exist). Instead criu saves the whole file. This is called a "ghost file". Since saving the whole file is very expensive (copying lots of data on disk), criu limits the maximum size of a ghost file. This limit is also not ideal, because there are "sparse" files that are large in size but may be small in terms of real disk usage. The goal of the task is to support sparse ghost files, i.e. to limit the size of the ghost not by its length but by its disk usage and, when copying the data, to detect the used blocks and save only those.
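Detecting the used blocks of a sparse file can be illustrated with <code>lseek()</code> and <code>SEEK_DATA</code>/<code>SEEK_HOLE</code>. This Python sketch is an assumption chosen for brevity and does not claim to be the mechanism CRIU uses (the links below mention the fiemap ioctl); the function name <code>data_extents</code> is made up.

```python
import errno
import os
import tempfile


def data_extents(fd: int):
    """Return [(offset, length)] for the regions of fd that hold data."""
    size = os.fstat(fd).st_size
    extents, pos = [], 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only a hole remains up to EOF
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as a hole
        extents.append((start, end - start))
        pos = end
    return extents


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.truncate(1 << 20)        # 1 MiB file, initially one big hole
        f.seek(512 * 1024)
        f.write(b"payload")
        f.flush()
        print(data_extents(f.fileno()))
```

On filesystems without hole support the kernel reports the whole file as data, so the result degrades to a single extent covering the file. Summing the extent lengths gives a disk-usage-based size that could be compared against the ghost file limit.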

'''Links:'''

*[https://en.wikipedia.org/wiki/Sparse_file Sparse files]
*[[Dumping files]]
*[[Invisible files]]
*[https://www.kernel.org/doc/html/latest/filesystems/fiemap.html Fiemap ioctl]

[[Category:GSoC]]

GSoC completed projects

2023-01-25T07:28:14Z

Ptikhomirov:


GSoC completed projects

2023-01-25T07:24:37Z

Ptikhomirov: remove details for sparce ghost files


GSoC completed projects

2023-01-25T07:22:34Z

Ptikhomirov: move support spare ghosts to done

=== Restrict checks for open/mmaped files ===

'''Summary:''' Make sure the file opened (for fd or mapping) at restore is "the same" as it was on dump

'''Merged:''' https://github.com/checkpoint-restore/criu/pull/1123

CRIU doesn't carry file contents (except for ghost files) into images. Thus on dump it saves some "meta" for each file to validate that it's "the same" on restore. Currently this meta includes only the file size. The task is to add some cookie value that's somehow affected by the file's contents. This is primarily needed to reduce the possibility of restoring with the wrong libraries.

'''Links:'''
* https://www.criu.org/Category:Files
* https://en.wikipedia.org/wiki/File_verification

=== Optimize the pre-dump algorithm ===

'''Summary:''' Optimize the pre-dump algorithm to avoid pinning too much memory in RAM

'''Merged:''' https://github.com/checkpoint-restore/criu/commit/98608b90de0f853b1c8a6e15b312320e1441c359

The current [[CLI/cmd/pre-dump|pre-dump]] mode is used to write task memory contents into image
files without stopping the task for too long. It does this by stopping the task, infecting it and
draining all the memory into a set of pipes. Then the task is cured and resumed, and the pipes'
contents are written into images (maybe via a [[page server]]). Unfortunately, this approach puts
a lot of stress on the memory subsystem: keeping all memory in pipes creates a lot of unreclaimable
memory (pages in pipes are not swappable), and the number of pipes can itself be huge, as
one pipe doesn't store more than a fixed amount of data (see the pipe(7) man page).

A solution to this problem is to use the process_vm_readv() syscall, which mitigates
all of the above. To do this we need to allocate a temporary buffer in criu, then walk the
target process's memory, copying it piece by piece into the buffer, then flush the data into the image
(or page server), and repeat.

Ideally there would be a sys_splice_process_vm() syscall in the kernel that does the same as
process_vm_readv() does, but vmsplices the data.

'''Links:'''
* [[Memory pre dump]]
* https://github.com/checkpoint-restore/criu/issues/351
* [[Memory dumping and restoring]], [[Memory changes tracking]]
* [http://man7.org/linux/man-pages/man2/process_vm_readv.2.html process_vm_readv(2)] [http://man7.org/linux/man-pages/man2/vmsplice.2.html vmsplice(2)] [https://lkml.org/lkml/2018/1/9/32 RFC for splice_process_vm syscall]

=== Porting crit functionalities in GO ===

'''Summary:''' Implement image view and manipulation in Go

'''Merged:''' https://github.com/checkpoint-restore/go-criu/pull/66

CRIU's checkpoint images are stored on disk using protobuf. For easier analysis of checkpoint files CRIU has a tool called [[CRIT|CRiu Image Tool (CRIT)]]. It can display/decode CRIU image files from binary protobuf to JSON as well as encode JSON files back to the binary format. With closer integration of CRIU in container runtimes it becomes important to be able to view the CRIU output files, either for manipulation before restoring or for reading checkpoint statistics (memory pages written to disk, memory pages skipped, process downtime).

Currently CRIT is implemented in Python. For easier integration in other Go projects it is important to have image manipulation and analysis available from Go. This means we need a Go-based library to read/modify/write/encode/decode CRIU's image files. A Go-based implementation of CRIT built on this library would also be useful.

'''Links:'''
* [[CRIT (Go library)]]
* https://github.com/snprajwal/gsoc-2022

=== Support sparse ghosts ===

'''Merged:''' https://github.com/checkpoint-restore/criu/pull/1944 https://github.com/checkpoint-restore/criu/pull/1963

When criu dumps processes it also dumps the files they have open. It does this by saving the names by which the files are accessible. But sometimes a file has no name, for example when a task opened a file and then removed it. To dump such a file criu cannot save its name (because the name no longer exists), so it saves the whole file instead. This is called a "ghost file". Since saving the whole file is very expensive (copying lots of data on disk), criu limits the maximum size of a ghost file. This limit is also a problem, because there are "sparse" files that are large in size but small in terms of real disk usage. The goal of the task is to support sparse ghost files, i.e. to limit the size of the ghost not by its length but by its disk usage and, when copying the data, to detect the used blocks and save only those.
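
Detecting the used blocks can be sketched with <code>lseek()</code> and <code>SEEK_DATA</code>/<code>SEEK_HOLE</code>. This is one possible technique chosen for the example (the fiemap ioctl linked below is another) and is not necessarily what CRIU does. The helper names are made up for the example.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return a list of (offset, length) for the data regions of fd."""
    size = os.fstat(fd).st_size
    extents = []
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only a hole remains up to EOF
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end - start))
        pos = end
    return extents


def make_sparse_file(size, payload, offset):
    """Create a temporary file of `size` bytes with `payload` at `offset`."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)
    os.pwrite(fd, payload, offset)
    return fd, path
```

Summing the extent lengths gives an estimate of the data that actually has to be copied. On filesystems without hole support the whole file is reported as a single data extent, so the result degrades to the current behaviour.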

'''Links:'''

*[https://en.wikipedia.org/wiki/Sparse_file Sparse files]
*[[Dumping files]]
*[[Invisible files]]
*[https://www.kernel.org/doc/html/latest/filesystems/fiemap.html Fiemap ioctl]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>

[[Category:GSoC]]

Google Summer of Code Ideas

2023-01-25T07:17:45Z

Ptikhomirov: move sparce ghost files project to done

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with the plain fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find out the reason for the failure.

At the same time the printf family of functions is known to take some time to work: they need to scan the format string for % specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

One way to keep log() calls intact is a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
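
A minimal sketch of the record format and decoder idea, assuming integer-only arguments and a format table produced by the pre-compilation pass. The table contents, the record layout and the names <code>blog</code> and <code>decode</code> are invented for illustration.

```python
import io
import struct

# Format table that a pre-compilation pass might generate (hypothetical).
FORMATS = {
    1: "Dumping task %d",
    2: "Mapped %d pages at 0x%x",
}


def blog(out, fmt_id, *args):
    """Write one binary record: format id, arg count, raw 64-bit args."""
    out.write(struct.pack("<HB", fmt_id, len(args)))
    out.write(struct.pack("<%dq" % len(args), *args))


def decode(data):
    """Turn binary records back into text lines using FORMATS."""
    lines = []
    pos = 0
    while pos < len(data):
        fmt_id, nargs = struct.unpack_from("<HB", data, pos)
        pos += 3
        args = struct.unpack_from("<%dq" % nargs, data, pos)
        pos += 8 * nargs
        lines.append(FORMATS[fmt_id] % args)
    return lines
```

The writer never touches the format string, so the cost of formatting moves from the dump path to the decoder. String arguments would need their own encoding, which this sketch leaves out.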

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There's a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelianov <ovzxemul@gmail.com>
* Suggested by: Pavel Emelianov <ovzxemul@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is a pidfd_open syscall which allows opening
a special PID file descriptor. Using it, a user can send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Add support for memfd_secret file descriptors ===

'''Summary:''' Support C/R of memfd_secret descriptors

There is a memfd_secret syscall which allows a user to open
a special memfd backed by a memory range that
is inaccessible to other processes (and to the kernel too!).

At the moment CRIU can't dump processes that have open memfd_secret descriptors.

'''Links:'''
* https://lwn.net/Articles/865256/

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Mike Rapoport <mike.rapoport@gmail.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* Suggested by: Adrian Reber <areber@redhat.com>

=== CGroup-v2 support ===

'''Summary:''' cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. cgroup v2 is a new version of the cgroup file system. Unlike v1, cgroup v2 has only a single hierarchy. CRIU has to dump/restore a container cgroup hierarchy along with all per-cgroup options. The cgroup v2 support in CRIU has to be compatible with Docker, containerd and cri-o.

'''Links:'''
* [[CGroups]]
* https://github.com/checkpoint-restore/criu/issues/252
* https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Andrei Vagin <avagin@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Dump shmem in user-mode (unprivileged-mode) ===

CRIU uses /proc/pid/map_files to dump and restore anonymous shared memory regions, but map_files is restricted to the global CAP_SYS_ADMIN capability. In most cases, it is possible to dump/restore a shared memory region without map_files, and we need to implement this in CRIU.

'''Links:'''
* [[User-mode]]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows exactly which mount the file was initially opened from. This way criu can restore this fd by opening the same file from topologically the same mount in the restored mount tree.

Restoring the fd from the right mount can be important in different cases, for instance if the process later wants to resolve paths relative to the fd (resolving from the same file on a different mount can lead to different resolved paths), or if the process wants to check the path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding the mount on which to reopen the file at restore if we only know the mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

The mountinfo file shows the mount tree topology of the current mntns: parent-child relations, sharing group information, mountpoint and fs root information. If we don't see the mnt_id in it, we don't know anything about this mount.

This can happen in two cases:

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container mountinfo
* 2) the mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2), or if no options are provided, criu can't resolve the mnt_id in mountinfo and fails.
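
The check that fails in these cases can be modelled as below. The functions work on the text of <code>/proc/<pid>/fdinfo/<fd></code> and <code>/proc/<pid>/mountinfo</code>. Their names are chosen for this example and do not come from the CRIU sources.

```python
def mnt_id_from_fdinfo(fdinfo_text):
    """Extract the mnt_id field from /proc/<pid>/fdinfo/<fd> contents."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "mnt_id":
            return int(value)
    return None


def mount_ids(mountinfo_text):
    """Return the set of mount IDs (first column) listed in mountinfo."""
    return {
        int(line.split()[0])
        for line in mountinfo_text.splitlines()
        if line.strip()
    }


def is_detached_or_external(fdinfo_text, mountinfo_text):
    """True when the fd's mount is not visible in the given mountinfo."""
    mnt_id = mnt_id_from_fdinfo(fdinfo_text)
    return mnt_id is not None and mnt_id not in mount_ids(mountinfo_text)
```

When this returns true, mountinfo alone gives no parent, root or sharing information for the mount.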

'''Solution:'''
We can handle 2) by resolving major/minor via fstat, then using name_to_handle_at and open_by_handle_at to open the same file on any other available mount from the same superblock (same major/minor) in the container. This gives us fd2, which refers to the same file as fd but on an existing mount, so we can dump fd2 as usual and mark it as "detached" in the image. On restore criu then knows where to find this file, but instead of just opening fd2 from the restored mount, it creates a temporary bindmount, opens the file on it, and lazily unmounts it just after the open, making the file appear as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives wrong major/minor
* File handles do not work everywhere
* File handles can return fd2 on a deleted file or on another hardlink; this needs special handling.

Additionally (optional part):
We can export the real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts; if we have it, we don't need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Add support for SPFS ===

'''Summary:''' SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. An important part of this is integrating the Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com> / <alexander.mikhalitsyn@virtuozzo.com>
* Suggested by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can simply be removed. The sizes of the pages*.img files are enough.
* Paths to files. Here we should keep the relations between paths. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
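
The 1:1 path mapping from the list above could look like the sketch below, which replaces each path component with a sequential token. The class name and token scheme are illustrative assumptions, not existing CRIT code.

```python
class PathAnonymizer:
    """Replace path components with sequential tokens, one-to-one."""

    def __init__(self):
        self._names = {}

    def _token(self, component):
        # The same component always maps to the same token, and two
        # different components never share one.
        if component not in self._names:
            self._names[component] = "n%d" % len(self._names)
        return self._names[component]

    def anonymize(self, path):
        parts = path.split("/")
        # Empty parts (leading "/" or "//") are kept so the shape survives.
        return "/".join(self._token(p) if p else p for p in parts)
```

Because the mapping works per component, two files in the same directory still share an anonymised prefix, which keeps the relations between paths intact. The same anonymizer instance would have to be used across all image files that contain paths.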

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python
* Mentor: Pavel Emelianov <xemul@virtuozzo.com>
* Suggested by: Pavel Emelianov <xemul@virtuozzo.com>

[[Category:GSoC]]
[[Category:Development]]

GSoC Students Recommendations

2022-05-23T21:28:02Z

Ptikhomirov: /* Takeoff */

[[Category:GSoC]]

== Contacts ==

The entry points for the community are the [https://github.com/checkpoint-restore GitHub checkpoint-restore project], [https://gitter.im/save-restore/CRIU Gitter] and the <code>criu@openvz.org</code> mailing list. Also, the [[Google Summer of Code Ideas|ideas]] page contains mentors' personal e-mails for each sub-project.

== Takeoff ==

Getting started with CRIU is as simple as one, two, three:

# Get the sources from [https://github.com/checkpoint-restore/criu]
# Build them with <code>make</code>
# Do your first C/R by running a simple test with <code>zdtm.py run -t zdtm/static/env00</code>

Here are links for further reading:

* [[Installation]]
* [[CLI]]
* [[Simple loop]]
* [[ZDTM test suite]]

== Contributing ==

When a new patch is ready, it can be submitted for merging either [[How_to_submit_patches|via the CRIU mailing list]] or via GitHub PR.

External bind mounts

2022-05-16T09:41:25Z

Ptikhomirov:

__TOC__

One typical external resource when dumping a container (especially LXC/Docker) is a mount point whose root sits outside of the container's root. This situation was originally meant to be handled by [[plugins]], but it turned out to be common enough to warrant a built-in way of handling it.

== What is external bind mount ==

Creating one is as simple as

mkdir /root
mount --bind /foo /root/bar
chroot /root

That's it. From now on, /bar is a mountpoint whose root (the source) is not accessible directly.

If you look at the /proc/$pid/mountinfo file of a task that sees such a mount, you would see something like

23 11 8:3 /root / ... - ext4 /dev/sda1 ...
34 23 8:3 /foo /bar ... - ext4 /dev/sda1 ...

Columns 4 and 5 are the root and the mountpoint respectively. You can see that / is the /root directory from the /dev/sda1 device, and /bar is a mountpoint whose root is /foo from the same device.

== How to teach CRIU to dump them ==

By default CRIU doesn't dump such mountpoints, because there's no way CRIU would be able to restore them: the root of these mounts is outside the scope of what CRIU dumped. In the logs you would see a message like

34:/bar doesn't have a proper root mount

which means the mountpoint /bar has an inaccessible root.

To dump and restore them there's the <code>--external mnt[KEY]:VAL</code> option that sets up the root mapping for external mounts.

On dump, KEY is a mountpoint inside the container, and the corresponding VAL is a string that will be written into the image as the mountpoint's root value.

On restore, KEY is the value from the image (VAL from dump), and VAL is the path on the host that will be bind-mounted into the container (to the mountpoint path from the image).

For example, if we want to dump the task above we should call

criu dump ... --external mnt[/bar]:barmount

The word <code>barmount</code> is an arbitrary identifier that will be put in the image file instead of the original root path:

criu show -f mountpoints.img -F mnt_id,root,mountpoint
mnt_id: 0x22 root: barmount mountpoint: /bar

On restore we should tell CRIU where to bind mount <code>barmount</code> from, like this:

criu restore ... --external mnt[barmount]:/foo

With this CRIU will bind mount /foo into the proper mountpoint.

Note: Mounts from the same superblock should remain mounts from the same superblock after migration. The <code>--external mnt[smth]:/smth</code> option forces criu to bind mount from the provided source. As a result, mounts that were from the same superblock before dump can appear to be from different superblocks after restore, which is wrong. This option should therefore be used carefully, as it can break the restore of sharing groups.
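
The way KEY and VAL swap roles between dump and restore can be modelled with a few lines of Python. This is only a model of the option semantics described above, not CRIU's own option parser, and all function names are made up for the example.

```python
import re

_MNT_RE = re.compile(r"^mnt\[(?P<key>[^\]]*)\](?::(?P<val>.*))?$")


def parse_external_mnt(arg):
    """Split 'mnt[KEY]:VAL' into (KEY, VAL); VAL is None when absent."""
    m = _MNT_RE.match(arg)
    if m is None:
        raise ValueError("not an mnt[...] external: %r" % arg)
    return m.group("key"), m.group("val")


def root_for_image(mountpoint, dump_args):
    """On dump: the string stored as the mount's root, or None."""
    mapping = dict(parse_external_mnt(a) for a in dump_args)
    return mapping.get(mountpoint)


def source_for_restore(image_root, restore_args):
    """On restore: the host path to bind-mount for a stored root."""
    mapping = dict(parse_external_mnt(a) for a in restore_args)
    return mapping.get(image_root)
```

With the example from this page, <code>root_for_image("/bar", ["mnt[/bar]:barmount"])</code> yields <code>barmount</code>, and <code>source_for_restore("barmount", ["mnt[barmount]:/foo"])</code> yields <code>/foo</code>.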

== Auto detection ==

In case one wants CRIU to autodetect and dump all the external bind mounts, and there is no need to change host mount points on restore, one can use a special syntax:

criu dump ... --external mnt[]:''flags''

Note that there is nothing inside the square brackets, and the optional <code>:''flags''</code> argument can contain the following characters:

; <code>m</code>
: Also enable dumping of external master mounts (as in <code>mount --make-slave</code>)
; <code>s</code>
: Also enable dumping of external shared mounts (as in <code>mount --make-shared</code>)

By default, neither master nor shared external mounts are dumped (if found, dump is aborted). Note that if <code>''flags''</code> are not given, the colon is optional.

=== Examples ===

criu dump ... --external 'mnt[]'

Auto-detect and dump all external bind mounts.

criu dump ... --external 'mnt[]:s'

Auto-detect and dump all external bind mounts, including the shared ones.

criu dump ... --external 'mnt[]:sm'

Auto-detect and dump all external bind mounts, including the shared and the master ones.

== Sharing ==

External bind mounts can have both internal and external sharing. Please see the example:

# Preparation
unshare -m --propagation private
mkdir /external_mount_sharing_test
mount -t tmpfs tmpfs /external_mount_sharing_test/
mount --make-private /external_mount_sharing_test/
cd /external_mount_sharing_test
# Source of external mount
mkdir external_mount
mount -t tmpfs tmpfs-external external_mount/
mount --make-shared external_mount/
cat /proc/$$/mountinfo | grep external
# 811 755 0:60 / /external_mount_sharing_test rw,relatime - tmpfs tmpfs rw
# 812 811 0:62 / /external_mount_sharing_test/external_mount rw,relatime shared:290 - tmpfs tmpfs-external rw

# Switch to CT mntns
unshare -m --propagation unchanged sh
mkdir root
mount -t tmpfs tmpfs-root root/
mkdir root/external_sharing root/internal_sharing root/proc

# Create external mount
mount --bind external_mount/ root/external_sharing
mount --bind external_mount/ root/internal_sharing
mount --make-private root/internal_sharing
mount --make-shared root/internal_sharing

# More preparations
mount --bind /proc root/proc
cd root
mkdir bin lib64
SH=$(which sh)
cp $SH bin
cp $(ldd $SH | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
CAT=$(which cat)
cp $CAT bin
cp $(ldd $CAT | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
PATH=$PATH:/bin
chroot . sh
cat /proc/$$/mountinfo
# 843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
# 861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
# 898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
# 899 843 0:5 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw

Mounts 812 (on the host) and 861 (in the container) belong to the same shared group: this is external sharing. Mount 898 has its own local shared group: this is internal sharing. The same applies to master_ids: if we convert these mounts into slaves, the external/internal shared_id becomes an external/internal master_id.
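
The grouping can be read straight off the optional fields of mountinfo. The sketch below collects mounts by their <code>shared:N</code> tag, using the lines from the example above as input. The function name is chosen for this example.

```python
from collections import defaultdict


def peer_groups(mountinfo_text):
    """Map shared peer-group id -> list of (mount id, mountpoint)."""
    groups = defaultdict(list)
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        sep = fields.index("-")    # optional fields end at the "-" separator
        for opt in fields[6:sep]:  # e.g. "shared:290" or "master:12"
            tag, _, value = opt.partition(":")
            if tag == "shared":
                groups[int(value)].append((int(fields[0]), fields[4]))
    return dict(groups)


HOST = (
    "812 811 0:62 / /external_mount_sharing_test/external_mount "
    "rw,relatime shared:290 - tmpfs tmpfs-external rw"
)
CONTAINER = """\
861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
"""
```

Feeding both the host and the container lines to <code>peer_groups</code> puts 812 and 861 into group 290 and leaves 898 alone in group 349.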

[https://criu.org/Mount-v2 Mount-v2] introduces better support for external sharing:

* External sharing is not supported (it is converted to internal sharing after c/r), as reasonable container environments should not allow it for security reasons, and implementing its lookup would lead to bad performance (host mountinfo reading).

* External slavery is supported for mountpoint external mounts and the root mount. It is detected when criu can't find the master_id of the mount among the shared_ids in the container mount namespaces. CRIU relies on the external source of the mountpoint to provide the right shared/slave mount to copy sharing from. Everything else is considered internal sharing/slavery.

== Old days ==

For now the same behavior is configured with the <code>--ext-mount-map KEY:VAL</code> option. Soon this option will be [[deprecation|deprecated]].

[[Category:HOWTO]]
[[Category:External]]

External bind mounts

2022-04-28T11:17:22Z

Ptikhomirov: /* Sharing */

__TOC__

One typical external resource when dumping a container (especially LXC/Docker) is a mount point whose root sits outside of the container's root. This situation was originally meant to be handled by [[plugins]], but it turned out to be common enough to warrant a built-in way of handling it.

== What is external bind mount ==

Creating one is as simple as

mkdir /root
mount --bind /foo /root/bar
chroot /root

That's it. From now on, /bar is a mountpoint whose root (the source) is not accessible directly.

If you look at the /proc/$pid/mountinfo file of a task that sees such a mount, you would see something like

23 11 8:3 /root / ... - ext4 /dev/sda1 ...
34 23 8:3 /foo /bar ... - ext4 /dev/sda1 ...

Columns 4 and 5 are the root and the mountpoint respectively. You can see that / is the /root directory from the /dev/sda1 device, and /bar is a mountpoint whose root is /foo from the same device.

== How to teach CRIU to dump them ==

By default CRIU doesn't dump such mountpoints, because there's no way CRIU would be able to restore them: the root of these mounts is outside the scope of what CRIU dumped. In the logs you would see a message like

34:/bar doesn't have a proper root mount

which means the mountpoint /bar has an inaccessible root.

To dump and restore them there's the <code>--external mnt[KEY]:VAL</code> option that sets up the root mapping for external mounts.

On dump, KEY is a mountpoint inside the container, and the corresponding VAL is a string that will be written into the image as the mountpoint's root value.

On restore, KEY is the value from the image (VAL from dump), and VAL is the path on the host that will be bind-mounted into the container (to the mountpoint path from the image).

For example, if we want to dump the task above we should call

criu dump ... --external mnt[/bar]:barmount

The word <code>barmount</code> is an arbitrary identifier that will be put in the image file instead of the original root path:

criu show -f mountpoints.img -F mnt_id,root,mountpoint
mnt_id: 0x22 root: barmount mountpoint: /bar

On restore we should tell CRIU where to bind mount <code>barmount</code> from, like this:

criu restore ... --external mnt[barmount]:/foo

With this CRIU will bind mount /foo into the proper mountpoint.

== Auto detection ==

In case one wants CRIU to autodetect and dump all the external bind mounts, and there is no need to change host mount points on restore, one can use a special syntax:

criu dump ... --external mnt[]:''flags''

Note that there is nothing inside the square brackets, and the optional <code>:''flags''</code> argument can contain the following characters:

; <code>m</code>
: Also enable dumping of external master mounts (as in <code>mount --make-slave</code>)
; <code>s</code>
: Also enable dumping of external shared mounts (as in <code>mount --make-shared</code>)

By default, neither master nor shared external mounts are dumped (if found, dump is aborted). Note that if <code>''flags''</code> are not given, the colon is optional.

=== Examples ===

criu dump ... --external 'mnt[]'

Auto-detect and dump all external bind mounts.

criu dump ... --external 'mnt[]:s'

Auto-detect and dump all external bind mounts, including the shared ones.

criu dump ... --external 'mnt[]:sm'

Auto-detect and dump all external bind mounts, including the shared and the master ones.

== Sharing ==

External bind mounts can have both internal and external sharing. Please see the example:

# Preparation
unshare -m --propagation private
mkdir /external_mount_sharing_test
mount -t tmpfs tmpfs /external_mount_sharing_test/
mount --make-private /external_mount_sharing_test/
cd /external_mount_sharing_test
# Source of external mount
mkdir external_mount
mount -t tmpfs tmpfs-external external_mount/
mount --make-shared external_mount/
cat /proc/$$/mountinfo | grep external
# 811 755 0:60 / /external_mount_sharing_test rw,relatime - tmpfs tmpfs rw
# 812 811 0:62 / /external_mount_sharing_test/external_mount rw,relatime shared:290 - tmpfs tmpfs-external rw

# Switch to CT mntns
unshare -m --propagation unchanged sh
mkdir root
mount -t tmpfs tmpfs-root root/
mkdir root/external_sharing root/internal_sharing root/proc

# Create external mount
mount --bind external_mount/ root/external_sharing
mount --bind external_mount/ root/internal_sharing
mount --make-private root/internal_sharing
mount --make-shared root/internal_sharing

# More preparations
mount --bind /proc root/proc
cd root
mkdir bin lib64
SH=$(which sh)
cp $SH bin
cp $(ldd $SH | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
CAT=$(which cat)
cp $CAT bin
cp $(ldd $CAT | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
PATH=$PATH:/bin
chroot . sh
cat /proc/$$/mountinfo
# 843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
# 861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
# 898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
# 899 843 0:5 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw

Mounts 812 (on the host) and 861 (in the container) belong to the same shared group: this is external sharing. Mount 898 has its own local shared group: this is internal sharing. The same applies to master_ids: if we convert these mounts into slaves, the external/internal shared_id becomes an external/internal master_id.

[https://criu.org/Mount-v2 Mount-v2] introduces better support for external sharing:

* External sharing is not supported (it is converted to internal sharing after c/r), as reasonable container environments should not allow it for security reasons, and implementing its lookup would lead to bad performance (host mountinfo reading).

* External slavery is supported for mountpoint external mounts and the root mount. It is detected when criu can't find the master_id of the mount among the shared_ids in the container mount namespaces. CRIU relies on the external source of the mountpoint to provide the right shared/slave mount to copy sharing from. Everything else is considered internal sharing/slavery.

== Old days ==

For now the same behavior is configured with the <code>--ext-mount-map KEY:VAL</code> option. Soon this option will be [[deprecation|deprecated]].

[[Category:HOWTO]]
[[Category:External]]

External bind mounts

2022-04-28T11:16:38Z

Ptikhomirov: /* Sharing */

__TOC__

One typical external resource when dumping a container (especially LXC/Docker) is a mount point whose root sits outside of the container's root. This situation was originally meant to be handled by [[plugins]], but it turned out to be common enough to warrant a built-in way of handling it.

== What is external bind mount ==

Creating one is as simple as

mkdir /root
mount --bind /foo /root/bar
chroot /root

That's it. From now on, /bar is a mountpoint whose root (the source) is not accessible directly.

If you look at the /proc/$pid/mountinfo file of a task that sees such a mount, you would see something like

23 11 8:3 /root / ... - ext4 /dev/sda1 ...
34 23 8:3 /foo /bar ... - ext4 /dev/sda1 ...

Columns 4 and 5 are the root and the mountpoint respectively. You can see that / is the /root directory from the /dev/sda1 device, and /bar is a mountpoint whose root is /foo from the same device.

== How to teach CRIU to dump them ==

By default CRIU doesn't dump such mountpoints, because there's no way CRIU would be able to restore them: the root of these mounts is outside the scope of what CRIU dumped. In the logs you would see a message like

34:/bar doesn't have a proper root mount

which means the mountpoint /bar has an inaccessible root.

To dump and restore them there's the <code>--external mnt[KEY]:VAL</code> option that sets up the root mapping for external mounts.

On dump, KEY is a mountpoint inside the container, and the corresponding VAL is a string that will be written into the image as the mountpoint's root value.

On restore, KEY is the value from the image (VAL from dump), and VAL is the path on the host that will be bind-mounted into the container (to the mountpoint path from the image).

For example, if we want to dump the task above we should call

criu dump ... --external mnt[/bar]:barmount

The word <code>barmount</code> is an arbitrary identifier that will be put in the image file instead of the original root path:

criu show -f mountpoints.img -F mnt_id,root,mountpoint
mnt_id: 0x22 root: barmount mountpoint: /bar

On restore we should tell CRIU where to bind mount <code>barmount</code> from, like this:

criu restore ... --external mnt[barmount]:/foo

With this CRIU will bind mount /foo into the proper mountpoint.

== Auto detection ==

In case one wants CRIU to autodetect and dump all the external bind mounts, and there is no need to change host mount points on restore, one can use a special syntax:

criu dump ... --external mnt[]:''flags''

Note that there is nothing inside the square brackets, and the optional <code>:''flags''</code> argument can contain the following characters:

; <code>m</code>
: Also enable dumping of external master mounts (as in <code>mount --make-slave</code>)
; <code>s</code>
: Also enable dumping of external shared mounts (as in <code>mount --make-shared</code>)

By default, neither master nor shared external mounts are dumped (if found, the dump is aborted). Note that if <code>''flags''</code> are not given, the colon is optional.

=== Examples ===

criu dump ... --external 'mnt[]'

Auto-detect and dump all external bind mounts.

criu dump ... --external 'mnt[]:s'

Auto-detect and dump all external bind mounts, including the shared ones.

criu dump ... --external 'mnt[]:sm'

Auto-detect and dump all external bind mounts, including the shared and the master ones.
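
To make the syntax of the auto-detection argument explicit, here is a small Python model of it. It is a sketch of the rules stated in this section, not CRIU's actual option parser, and <code>parse_auto_external</code> is a hypothetical name.

```python
def parse_auto_external(value):
    """Interpret 'mnt[]' optionally followed by ':' and the flags m and/or s."""
    prefix = "mnt[]"
    if not value.startswith(prefix):
        raise ValueError("not an auto-detection argument: %r" % value)
    rest = value[len(prefix):]
    if rest == "":
        flags = ""            # the colon may be omitted when no flags are given
    elif rest.startswith(":"):
        flags = rest[1:]
    else:
        raise ValueError("unexpected text after 'mnt[]': %r" % rest)
    unknown = set(flags) - {"m", "s"}
    if unknown:
        raise ValueError("unknown flags: %s" % "".join(sorted(unknown)))
    return {"master": "m" in flags, "shared": "s" in flags}


print(parse_auto_external("mnt[]:sm"))
```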

== Sharing ==

External bind mounts can have either internal or external sharing. Consider the following example:

# Preparation
unshare -m --propagation private
mkdir /external_mount_sharing_test
mount -t tmpfs tmpfs /external_mount_sharing_test/
mount --make-private /external_mount_sharing_test/
cd /external_mount_sharing_test
# Source of external mount
mkdir external_mount
mount -t tmpfs tmpfs-external external_mount/
mount --make-shared external_mount/
cat /proc/$$/mountinfo | grep external
# 811 755 0:60 / /external_mount_sharing_test rw,relatime - tmpfs tmpfs rw
# 812 811 0:62 / /external_mount_sharing_test/external_mount rw,relatime shared:290 - tmpfs tmpfs-external rw

# Switch to CT mntns
unshare -m --propagation unchanged sh
mkdir root
mount -t tmpfs tmpfs-root root/
mkdir root/external_sharing root/internal_sharing root/proc

# Create external mount
mount --bind external_mount/ root/external_sharing
mount --bind external_mount/ root/internal_sharing
mount --make-private root/internal_sharing
mount --make-shared root/internal_sharing

# More preparations
mount --bind /proc root/proc
cd root
mkdir bin lib64
SH=$(which sh)
cp $SH bin
cp $(ldd $SH | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
CAT=$(which cat)
cp $CAT bin
cp $(ldd $CAT | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
PATH=$PATH:/bin
chroot . sh
cat /proc/$$/mountinfo
# 843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
# 861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
# 898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
# 899 843 0:5 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw

Mounts 812 (on the host) and 861 (in the container) belong to the same shared group: this is external sharing. Mount 898 has its own local shared group: this is internal sharing. The same applies to master_ids: if we convert these mounts into slaves, the external/internal shared_id becomes an external/internal master_id.
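
The distinction can be checked mechanically by comparing the <code>shared:N</code> optional fields of the two mountinfo listings above. The Python sketch below does that for the sample output; it only illustrates the terminology (the helper names are made up), and it is not how CRIU detects sharing.

```python
def shared_ids(mountinfo_text):
    """Map mount ID -> shared group ID for every mount that has one."""
    result = {}
    for line in mountinfo_text.splitlines():
        fields = line.split(" - ", 1)[0].split()
        # Fields 0..5 are positional; optional fields start at index 6.
        for field in fields[6:]:
            if field.startswith("shared:"):
                result[int(fields[0])] = int(field[len("shared:"):])
    return result


def classify_sharing(host_text, container_text):
    host_groups = set(shared_ids(host_text).values())
    return {
        mnt_id: "external" if group in host_groups else "internal"
        for mnt_id, group in shared_ids(container_text).items()
    }


HOST = """\
811 755 0:60 / /external_mount_sharing_test rw,relatime - tmpfs tmpfs rw
812 811 0:62 / /external_mount_sharing_test/external_mount rw,relatime shared:290 - tmpfs tmpfs-external rw
"""
CONTAINER = """\
843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
899 843 0:5 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
"""
sharing = classify_sharing(HOST, CONTAINER)
print(sharing)
```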

[https://criu.org/Mount-v2 Mount-v2] introduces better support for external sharing:

- External sharing is not supported (it is converted to internal sharing after c/r), as reasonable container environments should not allow it for security reasons, and implementing its lookup would hurt performance (it requires reading the host mountinfo).
- External slavery is supported for mountpoint external mounts and the root mount. It is detected when criu can't find the master_id of the mount among the shared_ids in the container mount namespaces. CRIU relies on the external source of the mountpoint providing the right shared/slave mount to copy sharing from. Everything else is considered internal sharing/slavery.

== Old days ==

For now the same behavior is configured with the <code>--ext-mount-map KEY:VAL</code> option. Soon this option will be [[deprecation|deprecated]].

[[Category:HOWTO]]
[[Category:External]]

Google Summer of Code Ideas

2022-04-11T12:00:36Z

Ptikhomirov: Add detached mounts idea

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Support sparse ghosts ===

When criu dumps processes it also dumps files that are opened by them. It does this by saving file names by which the files are accessible. But sometimes files can have no names. It may happen if a task opened a file and then removed it. To dump this file criu cannot save its name (because the name doesn't exist). Instead criu saves the whole file. This is called "ghost file". Since saving the whole file is very expensive (copying lots of data on disk) criu limits the maximum size of a ghost file. The latter is also not good, because there are "sparse" files, that are large in size, but may be small from the real disk usage perspective. The goal of the task is to support sparse ghost files, i.e. limit the size of the ghost not by its length but by disk usage and when copying the data detect the used blocks and save only those.
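
The difference between file length and disk usage, and the idea of copying only the used regions, can be tried out from Python. Note that this sketch swaps in <code>lseek()</code> with <code>SEEK_DATA</code>/<code>SEEK_HOLE</code> rather than the fiemap ioctl linked below, and that <code>data_extents</code> and <code>demo</code> are hypothetical names; it is an illustration of the concept, not a proposal for the CRIU implementation.

```python
import errno
import os
import tempfile


def data_extents(fd):
    """Return (offset, length) pairs for the regions of fd that hold data."""
    size = os.fstat(fd).st_size
    extents = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as err:
            if err.errno == errno.ENXIO:  # only a hole is left until EOF
                break
            raise
        # There is always an implicit hole at end of file.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end - start))
        offset = end
    return extents


def demo():
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        # 4 KiB of data placed 1 MiB into an otherwise empty file.
        os.pwrite(fd, b"x" * 4096, 1 << 20)
        st = os.fstat(fd)
        # st_blocks is counted in 512-byte units.
        return st.st_size, st.st_blocks * 512, data_extents(fd)


size, used, extents = demo()
print("length:", size, "disk usage:", used, "data extents:", extents)
```

On a filesystem without hole support the whole file is reported as a single data extent, so the exact numbers depend on where the temporary file lives.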

'''Links:'''

*[https://en.wikipedia.org/wiki/Sparse_file Sparse files]
*[[Dumping files]]
*[[Invisible files]]
*[https://www.kernel.org/doc/html/latest/filesystems/fiemap.html Fiemap ioctl]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with the plain fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time, the printf family of functions is known to be slow: they need to scan the format string for %-specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

One way to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.
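
To make the encode/decode split concrete, here is a toy model of such a log in Python. Everything in it is an assumption made for illustration: the record layout (format ID, argument count, tagged arguments), the hand-written <code>FORMATS</code> table that a pre-compilation pass would generate, and the function names. The real task targets C and would copy raw argument bytes instead of tagging them.

```python
import struct

# Hypothetical table that a pre-compilation pass would generate.
FORMATS = ["Dumping task %d\n", "Opened %s with fd %d\n"]
FORMAT_IDS = {fmt: i for i, fmt in enumerate(FORMATS)}


def encode(fmt, *args):
    """Store the format ID and the raw arguments; no string formatting here."""
    out = [struct.pack("<HB", FORMAT_IDS[fmt], len(args))]
    for arg in args:
        if isinstance(arg, int):
            out.append(b"i" + struct.pack("<q", arg))
        else:
            raw = str(arg).encode()
            out.append(b"s" + struct.pack("<H", len(raw)) + raw)
    return b"".join(out)


def decode(buf):
    """Offline decoder: look the format up by ID and format the message."""
    messages = []
    pos = 0
    while pos < len(buf):
        fmt_id, nargs = struct.unpack_from("<HB", buf, pos)
        pos += 3
        args = []
        for _ in range(nargs):
            tag = buf[pos:pos + 1]
            pos += 1
            if tag == b"i":
                (value,) = struct.unpack_from("<q", buf, pos)
                pos += 8
            else:
                (length,) = struct.unpack_from("<H", buf, pos)
                pos += 2
                value = buf[pos:pos + length].decode()
                pos += length
            args.append(value)
        messages.append(FORMATS[fmt_id] % tuple(args))
    return messages


log = encode(FORMATS[0], 42) + encode(FORMATS[1], "/etc/passwd", 3)
print(decode(log))
```

The expensive step (scanning the format string and converting arguments to text) happens only in <code>decode</code>, which runs after the fact and only when somebody actually reads the log.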

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There's a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelianov <ovzxemul@gmail.com>
* Suggested by: Pavel Emelianov <ovzxemul@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is the pidfd_open syscall, which allows opening
a special PID file descriptor. A user can send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Add support for memfd_secret file descriptors ===

'''Summary:''' Support C/R of memfd_secret descriptors

There is the memfd_secret syscall, which allows a user to open
a special memfd backed by a special memory range that
is inaccessible to other processes (and to the kernel too!).

At the moment CRIU can't dump processes that have open memfd_secret descriptors.

'''Links:'''
* https://lwn.net/Articles/865256/

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Mike Rapoport <mike.rapoport@gmail.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* Suggested by: Adrian Reber <areber@redhat.com>

=== CGroup-v2 support ===

'''Summary:''' cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. cgroup v2 is a new version of the cgroup file system. Unlike v1, cgroup v2 has only a single hierarchy. CRIU has to dump/restore a container cgroup hierarchy along with all per-cgroup options. The cgroup v2 support in CRIU has to be compatible with Docker, containerd and cri-o.

'''Links:'''
* [[CGroups]]
* https://github.com/checkpoint-restore/criu/issues/252
* https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Andrei Vagin <avagin@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Dump shmem in user-mode (unprivileged-mode) ===

CRIU uses /proc/pid/map_files to dump and restore anonymous shared memory regions, but map_files is restricted to the global CAP_SYS_ADMIN capability. In most cases it is possible to dump/restore a shared memory region without map_files, and we need to implement this in CRIU.

'''Links:'''
* [[User-mode]]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Porting crit functionalities in GO ===

'''Summary:''' Implement image view and manipulation in Go

CRIU's checkpoint images are stored on disk using protobuf. For easier analysis of checkpoint files CRIU has a tool called [[CRIT|CRiu Image Tool (CRIT)]]. It can display/decode CRIU image files from binary protobuf to JSON as well as encode JSON files back to the binary format. With closer integration of CRIU in container runtimes it becomes important to be able to view the CRIU output files. Either for manipulation before restoring or for reading checkpoint statistics (memory pages written to disk, memory pages skipped, process downtime).

Currently CRIT is implemented in Python. For easier integration in other Go projects, it is important to have image manipulation and analysis available from Go. This means we need a Go-based library to read/modify/write/encode/decode CRIU's image files. Based on this library, a Go-based implementation of CRIT would be useful.

'''Links:'''
* [[CRIT]]
* Possible use case see LXD: https://github.com/lxc/lxd/blob/cb55b1c5a484a43e0c21c6ae8c4a2e30b4d45be3/lxd/migrate_container.go#L179
* https://github.com/lxc/lxd/pull/4072
* https://github.com/checkpoint-restore/go-criu/tree/master/stats
* https://github.com/checkpoint-restore/go-criu/pull/28

'''Details:'''
* Skill level: beginner
* Language: Go
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
* Suggested by: Adrian Reber <areber@redhat.com>

=== Files on detached mounts ===

'''Summary:''' Initial support of open files on "detached" mounts

When criu dumps a process with an open fd on a file, it gets the mount identifier (mnt_id) via /proc/<pid>/fdinfo/<fd>, so that criu knows from which exact mount the file was initially opened. This way criu can restore this fd by opening the same exact file from topologically the same mount in restored mount tree.

Restoring fd from the right mount can be important in different cases, for instance if the process would later want to resolve paths relative to the fd, and obviously resolving from the same file on different mount can lead to different resolved paths, or if the process wants to check path to the file via /proc/<pid>/fd/<fd>.

But we have a problem finding on which mount we need to reopen the file at restore if we only know mnt_id but can't find this mnt_id in /proc/<pid>/mountinfo.

Mountinfo file shows the mount tree topology of current mntns: parent - child relations, sharing group information, mountpoint and fs root information. And if we don't see mnt_id in it we don't know anything about this mount.

This can happen in two cases

* 1) external mount or file - if the file was opened from e.g. the host, its mount would not be visible in the container mountinfo
* 2) mount was lazily unmounted

In case of 1) we have criu options to help criu handle external dependencies.

In case of 2) or no options provided criu can't resolve mnt_id in mountinfo and criu fails.
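
The failing lookup described above can be reproduced on plain text. The Python sketch below parses the <code>mnt_id</code> field of an fdinfo file and checks whether that ID shows up in a mountinfo listing; the function names and the sample contents are made up for illustration, and on a live system the two strings would come from /proc/<pid>/fdinfo/<fd> and /proc/<pid>/mountinfo.

```python
def fdinfo_mnt_id(fdinfo_text):
    """Return the mnt_id recorded in an fdinfo file, or None if it is absent."""
    for line in fdinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key == "mnt_id":
            return int(value)
    return None


def visible_mount_ids(mountinfo_text):
    """Mount IDs listed in mountinfo (first column of every line)."""
    return {int(line.split()[0])
            for line in mountinfo_text.splitlines() if line.strip()}


def mount_is_invisible(fdinfo_text, mountinfo_text):
    """True when the file's mount cannot be found in the mount namespace."""
    mnt_id = fdinfo_mnt_id(fdinfo_text)
    return mnt_id is not None and mnt_id not in visible_mount_ids(mountinfo_text)


# Sample data, assumed for illustration.
MOUNTINFO = """\
843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
"""
FD_ON_VISIBLE_MOUNT = "pos:\t0\nflags:\t02100000\nmnt_id:\t861\n"
FD_ON_DETACHED_MOUNT = "pos:\t0\nflags:\t02100000\nmnt_id:\t900\n"

print(mount_is_invisible(FD_ON_VISIBLE_MOUNT, MOUNTINFO),
      mount_is_invisible(FD_ON_DETACHED_MOUNT, MOUNTINFO))
```

The check alone cannot tell case 1) from case 2); that distinction needs the external-resource options or further probing.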

'''Solution:'''
We can handle 2) as follows: resolve the major/minor via fstat, then use name_to_handle_at and open_by_handle_at to open the same file on any other available mount of the same superblock (same major/minor) in the container. Now we have fd2, which refers to the same file as fd but sits on an existing mount, so we can dump it as usual and mark it as "detached" in the image. On restore criu then knows where to find this file, but instead of just opening fd2 from the actual restored mount, it creates a temporary bind mount that is lazily unmounted right after the open, making the file appear as a file on a detached mount.

Known problems with this approach:

* Stat on btrfs gives wrong major/minor
* File handles do not work everywhere
* File handles can return fd2 on a deleted file or on another hardlink; this needs special handling.
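
The first step of the proposed solution, finding other mounts of the same superblock, amounts to matching the device numbers against column 3 of mountinfo. A rough Python sketch follows; <code>mounts_on_device</code> is a hypothetical helper and the sample listing is assumed. The file-handle step is not shown, since Python's standard library has no wrapper for name_to_handle_at.

```python
import os


def mounts_on_device(mountinfo_text, major, minor):
    """Return (mnt_id, mountpoint) for every mount of the given major:minor."""
    wanted = "%d:%d" % (major, minor)
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) > 4 and fields[2] == wanted:
            found.append((int(fields[0]), fields[4]))
    return found


# Sample listing, assumed for illustration.
MOUNTINFO = """\
843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
"""
candidates = mounts_on_device(MOUNTINFO, 0, 62)
print(candidates)

# For a real descriptor the numbers would come from fstat, for example:
#   st = os.fstat(fd)
#   mounts_on_device(text, os.major(st.st_dev), os.minor(st.st_dev))
```

As the list of known problems notes, the device numbers reported by stat are not reliable on btrfs, so a match found this way is only a candidate.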

Additionally (optional part):
We can export real major/minor in fdinfo (kernel).
We can think of a new kernel interface to get a mount's major/minor and root (shift from fsroot) for detached mounts; if we have it, we don't need the file handle hack to find the file on another mount (see the fsinfo or getvalues kernel patches on LKML; can we add this info there?).

'''Details:'''
* Skill level: intermediate
* Language: C
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part of it is the need to implement the integration of Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com> / <alexander.mikhalitsyn@virtuozzo.com>
* Suggested by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the relations of the paths to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure that this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
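
For the file path item, one way to keep the relations between paths while hiding the names is to replace each path component with a sequential token and to remember the mapping. The Python sketch below shows the idea; it is not based on CRIT's code, and the <code>PathAnonymiser</code> class is a hypothetical name.

```python
class PathAnonymiser:
    """Replace path components with sequential tokens, keeping the mapping 1:1."""

    def __init__(self):
        self._tokens = {}

    def _token(self, component):
        # The same component always gets the same token, and two different
        # components never share one, because tokens are handed out in order.
        if component not in self._tokens:
            self._tokens[component] = "f%d" % len(self._tokens)
        return self._tokens[component]

    def anonymise(self, path):
        # Empty components (leading or doubled slashes) are kept as they are,
        # so absolute paths stay absolute.
        return "/".join(self._token(c) if c else "" for c in path.split("/"))


anon = PathAnonymiser()
first = anon.anonymise("/etc/passwd")
second = anon.anonymise("/etc/shadow")
again = anon.anonymise("/etc/passwd")
print(first, second, again)
```

Because the mapping works per component, two files in the same directory still share a prefix after anonymisation, which is the relation worth preserving for debugging.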

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python
* Mentor: Pavel Emelianov <xemul@virtuozzo.com>
* Suggested by: Pavel Emelianov <xemul@virtuozzo.com>

[[Category:GSoC]]
[[Category:Development]]

Google Summer of Code Ideas

2022-04-11T09:34:51Z

Ptikhomirov: /* Support sparse ghosts */

Google Summer of Code (GSoC) is a global program that offers post-secondary students an opportunity to be paid for contributing to an open source project over a three month period.

This page contains project ideas for upcoming Google Summer of Code.

== Contacts ==

Please contact the respective mentor for the idea you are interested in. For general questions feel free to send an email to the [mailto:criu@openvz.org mailing list] or write in [https://gitter.im/save-restore/criu gitter].

== Project ideas ==

=== Support sparse ghosts ===

When criu dumps processes it also dumps files that are opened by them. It does this by saving file names by which the files are accessible. But sometimes files can have no names. It may happen if a task opened a file and then removed it. To dump this file criu cannot save its name (because the name doesn't exist). Instead criu saves the whole file. This is called "ghost file". Since saving the whole file is very expensive (copying lots of data on disk) criu limits the maximum size of a ghost file. The latter is also not good, because there are "sparse" files, that are large in size, but may be small from the real disk usage perspective. The goal of the task is to support sparse ghost files, i.e. limit the size of the ghost not by its length but by disk usage and when copying the data detect the used blocks and save only those.

'''Links:'''

*[https://en.wikipedia.org/wiki/Sparse_file Sparse files]
*[[Dumping files]]
*[[Invisible files]]
*[https://www.kernel.org/doc/html/latest/filesystems/fiemap.html Fiemap ioctl]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>

=== Optimize logging engine ===

'''Summary:''' CRIU writes a lot of logs while doing its job. Logging is done with the plain fprintf function. The logs are typically useless, but ''if'' some operation fails, they are the only way to find the reason for the failure.

At the same time, the printf family of functions is known to be slow: they need to scan the format string for %-specifiers and then convert the arguments into strings. Comparing criu dump with and without logs shows a notable time difference (15%-20%), so speeding up logging will help improve criu performance.

One solution to the problem might be binary logging. The problem with binary logs is the amount of effort needed to convert existing logs to binary form. Preferably, the switch to binary logging either keeps existing log() calls intact or provides some automation to convert them.

One way to keep log() calls intact might be a pre-compilation pass over the sources. In this pass each <code>log(fmt, ...)</code> call gets translated into a call to a binary log function that saves the <code>fmt</code> identifier and copies all the args ''as is'' into the log file. The binary log decode utility, required in this case, should then find the fmt string by its ID in the log file and print the resulting message.

'''Links:'''
* [[Better logging]]

'''Details:'''
* Skill level: intermediate
* Language: C, though decoder/preprocessor can be in any language
* Expected size: 350 hours
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Add support for checkpoint/restore of CORK-ed UDP socket ===

'''Summary:''' Support C/R of corked UDP socket

There's a UDP_CORK option for sockets. As the man page says:
<pre>
If this option is enabled, then all data output on this socket
is accumulated into a single datagram that is transmitted when
the option is disabled. This option should not be used in
code intended to be portable.
</pre>

Currently criu refuses to dump this case, so it's effectively a bug. Supporting
this will require extending the kernel API to allow criu to read back the write queue
of the socket (see [[TCP connection|how it's done]] for TCP sockets, for example). Then
the queue is written into the image and is restored into the socket (with the CORK
bit set too).

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/409
* [[Sockets]], [[TCP connection]]
* [https://groups.google.com/forum/#!topic/comp.os.linux.networking/Uz8PYiTCZSg UDP cork explained]

'''Details:'''
* Skill level: intermediate (+linux kernel)
* Language: C
* Expected size: 350 hours
* Mentor: Pavel Emelianov <ovzxemul@gmail.com>
* Suggested by: Pavel Emelianov <ovzxemul@gmail.com>

=== Add support for pidfd file descriptors ===

'''Summary:''' Support C/R of pidfd descriptors

There is the pidfd_open syscall, which allows opening
a special PID file descriptor. A user can send a signal to
the process (pidfd_send_signal syscall) or wait for the process
(poll() on the pidfd).

At the moment CRIU can't dump processes that have open pidfds.

'''Links:'''
* https://lwn.net/Articles/801319/
* https://lwn.net/Articles/794707/
* https://github.com/torvalds/linux/blob/v5.16/kernel/fork.c#L1877

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Christian Brauner <christian@brauner.io>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Add support for memfd_secret file descriptors ===

'''Summary:''' Support C/R of memfd_secret descriptors

There is the memfd_secret syscall, which allows a user to open
a special memfd backed by a special memory range that
is inaccessible to other processes (and to the kernel too!).

At the moment CRIU can't dump processes that have open memfd_secret descriptors.

'''Links:'''
* https://lwn.net/Articles/865256/

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentors: Alexander Mikhalitsyn <alexander@mihalicyn.com>, Mike Rapoport <mike.rapoport@gmail.com>
* Suggested by: Alexander Mikhalitsyn <alexander@mihalicyn.com>

=== Use eBPF to lock and unlock the network ===

'''Summary:''' Use eBPF instead of external iptables-restore tool for network lock and unlock.

During checkpointing and restoring CRIU locks the network to make sure no network packets are accepted by the network stack during the time the process is checkpointed. Currently CRIU calls out to iptables-restore to create and delete the corresponding iptables rules. Another approach which avoids calling out to the external binary iptables-restore would be to directly inject eBPF rules. There have been reports from users that iptables-restore fails in some way and eBPF could avoid this external dependency.

'''Links:'''
* https://www.criu.org/TCP_connection#Checkpoint_and_restore_TCP_connection
* https://github.com/systemd/systemd/blob/master/src/core/bpf-firewall.c
* https://blog.zeyady.com/2021-08-16/gsoc-criu

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>
* Suggested by: Adrian Reber <areber@redhat.com>

=== CGroup-v2 support ===

'''Summary:''' cgroup is a mechanism to organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. cgroup v2 is a new version of the cgroup file system. Unlike v1, cgroup v2 has only a single hierarchy. CRIU has to dump/restore a container cgroup hierarchy along with all per-cgroup options. The cgroup v2 support in CRIU has to be compatible with Docker, containerd and cri-o.

'''Links:'''
* [[CGroups]]
* https://github.com/checkpoint-restore/criu/issues/252
* https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Mentor: Andrei Vagin <avagin@gmail.com>
* Suggested by: Andrei Vagin <avagin@gmail.com>

=== Dump shmem in user-mode (unprivileged-mode) ===

CRIU uses /proc/pid/map_files to dump and restore anonymous shared memory regions, but map_files is restricted to the global CAP_SYS_ADMIN capability. In most cases it is possible to dump/restore a shared memory region without map_files, and we need to implement this in CRIU.

'''Links:'''
* [[User-mode]]

'''Details:'''
* Skill level: intermediate
* Language: C
* Expected size: 350 hours
* Suggested by: Andrei Vagin <avagin@gmail.com>
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Porting crit functionalities in GO ===

'''Summary:''' Implement image view and manipulation in Go

CRIU's checkpoint images are stored on disk using protobuf. For easier analysis of checkpoint files CRIU has a tool called [[CRIT|CRiu Image Tool (CRIT)]]. It can display/decode CRIU image files from binary protobuf to JSON as well as encode JSON files back to the binary format. With closer integration of CRIU in container runtimes it becomes important to be able to view the CRIU output files. Either for manipulation before restoring or for reading checkpoint statistics (memory pages written to disk, memory pages skipped, process downtime).

Currently CRIT is implemented in Python. For easier integration in other Go projects, it is important to have image manipulation and analysis available from Go. This means we need a Go-based library to read/modify/write/encode/decode CRIU's image files. Based on this library, a Go-based implementation of CRIT would be useful.

'''Links:'''
* [[CRIT]]
* Possible use case see LXD: https://github.com/lxc/lxd/blob/cb55b1c5a484a43e0c21c6ae8c4a2e30b4d45be3/lxd/migrate_container.go#L179
* https://github.com/lxc/lxd/pull/4072
* https://github.com/checkpoint-restore/go-criu/tree/master/stats
* https://github.com/checkpoint-restore/go-criu/pull/28

'''Details:'''
* Skill level: beginner
* Language: Go
* Expected size: 350 hours
* Mentor: Radostin Stoyanov <rstoyanov@fedoraproject.org>, Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
* Suggested by: Adrian Reber <areber@redhat.com>

== Suspended project ideas ==

Listed here are tasks that seem suitable for GSoC, but currently do not have anybody to mentor them.

=== IOUring support ===
The io_uring Asynchronous I/O (AIO) framework is a new Linux I/O interface, first introduced in upstream Linux kernel version 5.1 (March 2019). It provides a low-latency and feature-rich interface for applications that require AIO functionality.

'''Links:'''
* https://blogs.oracle.com/linux/an-introduction-to-the-io_uring-asynchronous-io-framework
* https://github.com/axboe/liburing

'''Details:'''
* Skill level: expert (+linux kernel)
* Expected size: 350 hours
* Suggested by: Pavel Emelyanov <ovzxemul@gmail.com>
* Mentor: Pavel Emelyanov <ovzxemul@gmail.com>

=== Add support for SPFS ===

'''Summary:''' The SPFS is a special filesystem that allows checkpoint and restore of such things as NFS and FUSE

NFS support is already implemented in Virtuozzo CRIU, but it would be very beneficial to port it to mainline CRIU. The important part of it is the need to implement the integration of Stub-Proxy File System (SPFS) with LXC/yet_another_containers_environment.

'''Links:'''
* https://github.com/checkpoint-restore/criu/issues/60
* https://github.com/checkpoint-restore/criu/issues/53
* https://github.com/skinsbursky/spfs
* https://patchwork.criu.org/series/137/

'''Details:'''
* Skill level: expert
* Language: C
* Mentor: Alexander Mikhalitsyn <alexander@mihalicyn.com> / <alexander.mikhalitsyn@virtuozzo.com>
* Suggested by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

=== Anonymise image files ===

'''Summary:''' Teach [[CRIT]] to remove sensitive information from images

When reporting a bug, it may not be acceptable for the reporter to send us raw images, as they may contain sensitive data. We need to teach CRIT to "anonymise" images for publication.

List of data to shred:

* Memory contents. For the sake of investigation, all the memory contents can be just removed. Only the sizes of pages*.img files are enough.
* Paths to files. Here we should keep the relations of the paths to each other. The simplest way seems to be replacing file names with "random" (or sequential) strings, BUT (!) making sure that this mapping is 1:1. Note that file paths may also sit in sk-unix.img.
* Registers.
* Process names. (But relations should be kept).
* Contents of streams, i.e. pipe/fifo data, sk-queue, tcp-stream, tty data.
* Ghost files.
* Tarballs with tmpfs-s.
* IP addresses in sk-inet-s, ip tool dumps and net*.img.
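
The 1:1 path mapping from the list above could be sketched as follows. This is only an illustration, not existing CRIT code: the <code>PathAnonymiser</code> class and the <code>f1</code>, <code>f2</code>, ... token format are assumptions made for the example. Each path component is replaced by a sequential token, and the same component always gets the same token, so relations between paths survive.

```python
import itertools


class PathAnonymiser:
    """Hypothetical helper: replace path components with sequential tokens."""

    def __init__(self):
        self._tokens = {}                  # original component -> token
        self._counter = itertools.count(1)

    def _token(self, component):
        # "", "." and ".." carry structure rather than data, keep them as-is.
        if component in ("", ".", ".."):
            return component
        # The same component always maps to the same token and every new
        # component gets a fresh one, so the mapping stays one-to-one.
        if component not in self._tokens:
            self._tokens[component] = "f%d" % next(self._counter)
        return self._tokens[component]

    def anonymise(self, path):
        return "/".join(self._token(c) for c in path.split("/"))


anon = PathAnonymiser()
a = anon.anonymise("/var/lib/app/data.db")
b = anon.anonymise("/var/lib/app/cache/tmp.sock")
print(a)  # /f1/f2/f3/f4
print(b)  # /f1/f2/f3/f5/f6
```

The same anonymiser instance would have to be shared across all image files (including sk-unix.img), otherwise identical paths in different images would end up with different replacements.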

'''Links:'''
* [[Anonymize image files]]
* https://github.com/checkpoint-restore/criu/issues/360
* [[CRIT]], [[Images]]

'''Details:'''
* Skill level: beginner
* Language: Python
* Mentor: Pavel Emelianov <xemul@virtuozzo.com>
* Suggested by: Pavel Emelianov <xemul@virtuozzo.com>

[[Category:GSoC]]
[[Category:Development]]

Mount-v2

2022-01-31T09:57:49Z

Ptikhomirov:

Mount-v2 CRIU algorithm

== Introduction ==

Now that the MOVE_MOUNT_SET_GROUP feature has been merged into mainline Linux v5.15 [https://github.com/torvalds/linux/commit/9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 torvalds/linux@9ffb14e], we can use it to restore sharing groups of mounts without having to care about inheriting those groups when creating mounts. We can first construct the mount trees with private mounts and set the sharing groups at a later stage.

Restoring propagation correctly with the conservative approach, where mounts are created and propagation groups are inherited at the same time, looks like an impossible task due to many problems:

* CRIU knows nothing about the initial history or order of mount tree creation;
* Propagation can create tons of mounts;
* Propagation may change parent mounts for an existing mount tree;
* "Mount trap" - propagation may cover the initial mount;
* "Non-uniform" propagation - there are different tricks with mount order and temporary children-"lock" mounts, which create mount trees that can't be restored without those tricks;
* "Cross-namespace" sharing group creation needs to be ordered correctly with mount namespace creation;
* Sharing groups vs mount tree order inversion can be very complex to restore and can require multiple auxiliary mounts (see the example below).

See my talks about it at the Linux Plumbers Conference:
* [https://www.linuxplumbersconf.org/event/7/contributions/640/ CRIU mounts migration: problems and solutions]
* [https://linuxplumbersconf.org/event/11/contributions/923/ Mount-v2 CRIU migration engine: status update]

And here is an example of order inversion where multiple temporary mounts are needed to achieve the result:
[[File:Mounts-inverse-order-example.gif|none|link=|Mounts-inverse-order-example.gif]]

== Mount-v2 description ==

The new mount-v2 algorithm is deeply integrated into the original one, so mounts are dumped in exactly the same way for both the original mount engine and the new one. The mount-v2 series therefore includes preparatory steps related to bindmount detection, external mount detection and helper mount handling, which make the original mount code more robust and easier to reuse in mount-v2.

==== Plain mountpoints ====

One of the main differences between mount-v2 and the original engine is that mounts are initially created "plain". For instance, suppose we have '''MOUNT''' with '''mnt_id=1000''' and '''ns_mountpoint="/mount/point/path"'''. The original mount engine would mount this '''MOUNT''' in the mount tree at '''<criu_root_yard>/<mntns>/mount/point/path'''. If this mount had a '''PARENT''' mount with '''mnt_id=999''' and '''ns_mountpoint="/mount/point"''', the corresponding mount for '''PARENT''' would be created at '''<criu_root_yard>/<mntns>/mount/point''', thus restoring the parent-child relationship between them from the start. With mount-v2, '''MOUNT''' is first mounted at '''<criu_root_yard>/mnt-1000''' and '''PARENT''' at '''<criu_root_yard>/mnt-999''', so that the first stage only creates mounts and a separate second stage assembles the tree. This allows useful heuristics: on the second stage we can create overmounts after the mounts they overmount, and on the first stage we can create external mounts before their bindmounts, and these two orderings do not conflict with each other.

It is not quite that simple, though, because we do not want to rewrite all the code, for instance the code for restoring mount content or restoring ghost and remap files, which used mountpoint paths in "tree" format. So in all places where it does not matter (where we do not access <criu_root_yard>/<mntns>/... paths) we switched from mount_info->mountpoint to mount_info->ns_mountpoint, and in all places where we actually need "tree" format paths we replaced them with the service_mountpoint() helper, which returns "tree" paths for the original mount engine and "plain" paths for mount-v2. This way we can safely switch from one engine to the other.
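
The difference between the two path layouts can be modelled with a few lines of Python. This is a simplified sketch, not the C implementation from CRIU: the <code>MountInfo</code> class, the <code>mntns</code> field and the function signature are assumptions made for illustration, and only the resulting path shapes follow the description above.

```python
import os
from dataclasses import dataclass


@dataclass
class MountInfo:
    # Simplified stand-in for CRIU's struct mount_info (hypothetical fields).
    mnt_id: int
    mntns: str           # name of the per-namespace subdirectory
    ns_mountpoint: str   # path as seen inside the mount namespace


def service_mountpoint(mi, root_yard, mount_v2):
    """Return the path at which the mount lives during restore."""
    if mount_v2:
        # "plain" layout: one flat directory per mount
        return os.path.join(root_yard, "mnt-%d" % mi.mnt_id)
    # "tree" layout: namespace subdirectory plus the in-namespace path
    return os.path.join(root_yard, mi.mntns, mi.ns_mountpoint.lstrip("/"))


parent = MountInfo(999, "mntns-1", "/mount/point")
child = MountInfo(1000, "mntns-1", "/mount/point/path")
yard = "/criu_root_yard"

print(service_mountpoint(child, yard, mount_v2=False))
print(service_mountpoint(child, yard, mount_v2=True))
```

In the "tree" layout the child path is nested under the parent path, so the parent must exist first; in the "plain" layout the two paths are independent of each other.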

==== Resolving sharing groups ====

Just after reading mounts from images in read_mnt_ns_img(), when mount-v2 is enabled, we have an additional step that collects sharing group information from the mounts and turns it into a forest of sharing groups (resolve_shared_mounts_v2). First, we walk over all mounts and create a sharing group for each unique shared_id + master_id pair, attaching every mount to the sharing group with the same id pair. Second, we walk over all sharing groups that have a non-zero master_id, look up the corresponding parent sharing groups and connect them into a tree.

There is also the case where master_id is non-zero but there is no corresponding parent sharing group. This means that there is a mount with the matching shared_id outside of the dumped container: external slavery is detected. In this case we just collect the sibling sharing groups in a list with an empty parent link. We also detect the source path from which the master_id will be inherited: either some mountpoint-external mount or the container root mount.
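
The two passes can be sketched in Python as below. This is a model of the idea only, not the code of resolve_shared_mounts_v2: the <code>Mount</code> and <code>SharingGroup</code> types and the returned <code>(roots, external)</code> pair are assumptions made for the example, as is the choice to skip fully private mounts.

```python
from collections import namedtuple

Mount = namedtuple("Mount", "mnt_id shared_id master_id")


class SharingGroup:
    def __init__(self, shared_id, master_id):
        self.shared_id = shared_id
        self.master_id = master_id
        self.mounts = []     # mounts with this (shared_id, master_id) pair
        self.parent = None   # group whose shared_id equals our master_id
        self.children = []


def resolve_sharing_groups(mounts):
    # Pass 1: one group per unique (shared_id, master_id) pair.
    groups = {}
    for m in mounts:
        if not m.shared_id and not m.master_id:
            continue  # assumption: private mounts need no sharing group
        key = (m.shared_id, m.master_id)
        if key not in groups:
            groups[key] = SharingGroup(*key)
        groups[key].mounts.append(m)

    by_shared_id = {g.shared_id: g for g in groups.values() if g.shared_id}

    # Pass 2: link slave groups to their parents, or record them as
    # externally enslaved siblings when no parent exists in the dump.
    roots, external = [], {}
    for g in groups.values():
        if not g.master_id:
            roots.append(g)
            continue
        parent = by_shared_id.get(g.master_id)
        if parent is not None:
            g.parent = parent
            parent.children.append(g)
        else:
            external.setdefault(g.master_id, []).append(g)
    return roots, external


mounts = [Mount(1, 10, 0), Mount(2, 10, 0), Mount(3, 11, 10),
          Mount(4, 0, 99), Mount(5, 12, 99), Mount(6, 0, 0)]
roots, external = resolve_sharing_groups(mounts)
```

Here mounts 1 and 2 form a root group, mount 3 is in a child group of it, and mounts 4 and 5 end up in two sibling groups enslaved to an external group with id 99.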

==== Actual restore of mounts ====

In the original mount engine the actual restore of mounts starts with the prepare_mnt_ns() function; when mount-v2 is enabled, it passes control to prepare_mnt_ns_v2() instead. The latter consists of several stages:

1) We pre-create a mount namespace for each restored mount namespace in pre_create_mount_namespaces(). These namespaces are almost empty: they contain a tmpfs as their root, the root yard path created in it with another tmpfs mounted on it, and a "namespace" path, created in the corresponding subdirectory of the root yard mount, for assembling the tree of mounts. We also save an nsfs fd for each mount namespace to be able to re-enter it later.

2) In populate_mnt_ns_v2() we reuse the mnt_tree_for_each() walk over the mount tree from the original mount engine, so we walk mounts in tree order. In addition, can_mount_now_v2() temporarily skips mounts (and their descendants) that depend on other mounts, and the walk is restarted for them later. Basically, can_mount_now_v2() skips mounts that should be restored as bindmounts but whose source is not ready yet; this is the case for bindmounts of root, external or plugin mounts, or non-fsroot mounts.

3) During this walk over the mount forest, do_mount_one_v2() determines whether the new mount is a directory mount or a file mount in detect_is_dir(): we open its mountpoint path relative to the parent's "plain" mountpoint and call stat. That is why it is important to use mnt_tree_for_each(), as it ensures that the parent is already mounted "plain".

4) During the same walk, do_mount_one_v2() creates the "plain" mountpoint for the new mount: an empty file or a directory, depending on the previous step.

5) During the same walk, do_mount_one_v2() actually creates the new mount. We either create a completely new mount or a device-external one in do_new_mount_v2() if it is supported, or bind the container root mount in do_mount_root_v2() from the still visible host mount tree, or bind a mountpoint-external mount in do_bind_mount_v2(). Similarly, any mount whose superblock has already been created by another mount is simply bound in do_bind_mount_v2(). These functions act similarly to the ones in the original mount engine but are simpler, as they don't need to care about inheriting sharing groups.

6) do_bind_mount_v2() is improved to do the bindmount via open_tree() + move_mount() with flags that avoid traversing symlinks or autofs mounts.

7) We also cross-namespace bindmount the newly created mount into the restored mount namespace at the same "plain" mountpoint in do_mount_in_right_mntns(). This way we have, from the start, a mount that will be visible after restore; this will be required in the future to restore bindmounted unix sockets on the right mount.

8) After the walk we don't plan to do any more bindmounts, so we set the unbindable flag on the mounts.

9) Next we assemble the mount trees in each restored mount namespace in assemble_mount_namespaces(), again reusing move_mount_to_tree() so that mounts are moved into their proper places in the mount tree in tree order. We also open fds on the mountpoint, one (mp_fd_id) before moving and another (mnt_fd_id) after, so that we can later access files on each mount from the final mntns via those fds.

10) Finally, we restore sharing groups on the assembled mount forest in restore_mount_sharing_options(). It walks each root sharing group and its descendants with a DFS tree walk. It creates the sharing for the first mount in a sharing group and then sets the same sharing on all other mounts in that group.

Sharing creation for the first mount takes two steps:

a) If the mount has a master_id, we copy the shared_id either from the parent sharing group or from the external source and then make the mount a slave, thus converting it to the right master_id.
b) Next, if the mount has a shared_id, we make it shared, creating the right shared_id.

We need to use userns_call() for MOVE_MOUNT_SET_GROUP to have all the right permissions for copying sharing (move_mount_set_group()). We also need to resolve external paths given by the user to their actual mountpoints; we do so with openat2(RESOLVE_NO_XDEV) in resolve_mountpoint, which also only works from userns_call().

11) We remove the sources of deleted mounts, making them actually deleted (from the "service" mount namespace). Moving deleted mounts is not allowed, so to keep things simple we do this as the last step.
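
The order of operations in stage 10 can be illustrated with a small planner that only records what would be done, without touching any real mounts. It is a sketch under assumptions: the dictionary layout of a sharing group, the operation names (<code>set_group</code>, <code>make_slave</code>, <code>make_shared</code>) and the <code>external_source</code> parameter are made up for the example; in CRIU the corresponding work is done in C with MOVE_MOUNT_SET_GROUP and mount propagation flags.

```python
def plan_sharing_restore(group, external_source=None, ops=None):
    """Append the operations for one sharing group and its descendants (DFS)."""
    ops = [] if ops is None else ops
    first, rest = group["mounts"][0], group["mounts"][1:]

    if group["master_id"]:
        # a) copy sharing from the parent group (or an external source),
        #    then turn the mount into a slave of that group
        parent = group.get("parent")
        source = parent["mounts"][0] if parent else external_source
        ops.append(("set_group", source, first))
        ops.append(("make_slave", first))
    if group["shared_id"]:
        # b) make the mount shared, which creates its own shared_id
        ops.append(("make_shared", first))

    # all other mounts of the group copy the sharing from the first one
    for mnt in rest:
        ops.append(("set_group", first, mnt))

    # parents are always handled before their children
    for child in group.get("children", []):
        plan_sharing_restore(child, external_source, ops)
    return ops


root = {"shared_id": 10, "master_id": 0, "mounts": ["mnt-1", "mnt-2"], "children": []}
child = {"shared_id": 11, "master_id": 10, "mounts": ["mnt-3"],
         "parent": root, "children": []}
root["children"].append(child)

for op in plan_sharing_restore(root):
    print(op)
```

Because sharing is set only after the whole tree is assembled, the plan depends solely on the sharing group forest and not on the order in which the mounts were created.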

==== Links ====

"Virtuozzo" (original) version (using a non-mainstream kernel interface): [[Mounts-v2-Virtuozzo|Mounts-v2-Virtuozzo]]. It actually has cool features that mainstream does not have yet, for instance nested pidns proc handling, which requires nested pidns support beforehand.

MOVE_MOUNT_SET_GROUP kernel feature: [https://github.com/torvalds/linux/commit/9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 torvalds/linux@9ffb14e]

Mount-v2 PR to criu: [https://github.com/checkpoint-restore/criu/pull/1721 #1721]

[[Category: Under the hood]]

Mount-v2

2022-01-31T09:56:34Z

Ptikhomirov:


Mount-v2

2022-01-27T15:03:27Z

Ptikhomirov:

Mount-v2

2022-01-27T11:34:57Z

Ptikhomirov:

Mount-v2

2022-01-27T11:33:19Z

Ptikhomirov:

Mount-v2

2022-01-26T08:24:32Z

Ptikhomirov: /* Introduction */

Mount-v2

2022-01-26T07:17:45Z

Ptikhomirov:

Mount-v2

2022-01-26T07:09:33Z

Ptikhomirov:


Mount-v2

2022-01-25T16:26:25Z

Ptikhomirov: Mount-v2 CRIU algorithm which uses new MOVE_MOUNT_SET_GROUP kernel feature


File:Mounts-inverse-order-example.gif

2022-01-25T16:16:40Z

Ptikhomirov: Ptikhomirov uploaded a new version of File:Mounts-inverse-order-example.gif

== Summary ==
This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

File:Mounts-inverse-order-example-2.gif

2022-01-25T16:15:22Z

Ptikhomirov: This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

== Summary ==
This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

File:Mounts-inverse-order-example.gif

2022-01-25T16:05:16Z

Ptikhomirov: Ptikhomirov uploaded a new version of File:Mounts-inverse-order-example.gif

== Summary ==
This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

File:Mounts-inverse-order-example.gif

2022-01-25T15:26:26Z

Ptikhomirov: This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

== Summary ==
This is a GIF illustrating how complex inverse mnt_id via sharing_id chains can be created.

Mounts-v2

2022-01-25T15:19:00Z

Ptikhomirov: Ptikhomirov moved page Mounts-v2 to Mounts-v2-Virtuozzo: Need a place for a Mount-v2 description for a changed ported version.

#REDIRECT [[Mounts-v2-Virtuozzo]]

Mounts-v2-Virtuozzo

2022-01-25T15:19:00Z

Ptikhomirov: Ptikhomirov moved page Mounts-v2 to Mounts-v2-Virtuozzo: Need a place for a Mount-v2 description for a changed ported version.

Mounts v2 CRIU algorithm

This algorithm is designed to overcome problems with sharing group restore, overmounted files, mounts with namespace tags and several smaller problems.

(assume single userns for now)

* Mounts image read stage (read_mnt_ns_img + read_mnt_ns_img_v2)
** Read mount_infos from images for each mount namespace to lists (collect_mnt_from_image)
** Put mounts to trees for each mount namespace (mnt_build_tree)
** Group mounts by superblock equality into "bind" lists (search_bindmounts)
** Prepare sharing groups
*** Group mounts into shared group by equality of (master_id + shared_id)
*** Put shared groups in tree where parent->shared_id == child->master_id
*** If two groups have the same master_id, make them siblings (even if no parent)
** Prepare "internal yard" mount_info aside (setup_internal_yards)
*** ns mountpoint "/internal-yard-XXXXXX"
*** will require writable namespace root mount
*** needed for mount stage after forking tasks
** Prepare nested pidns procfses
*** Copy namespace tag across "bind" list (search_nested_pidns_proc)
*** Create helpers for descendants of nested pidns procfses in "internal yard" (handle_nested_pidns_proc)
**** These helpers get root "/" for simplicity (deleted, file/dir)
***** no nsfs bind support
**** ns mountpoint "/internal-yard-XXXXXX/hlp-[mnt_id]"
** Prepare "root yard"
*** helper mount with mountpoint "/tmp/.criu.mntns.XXXXXX/"
*** Merge mount trees of all mount namespaces as subdirectories of "root yard" (merge_mount_trees)
**** mountpoint "/tmp/.criu.mntns.XXXXXX/[nns_id]"

* Mounting, first stage (before forking processes) from init task in "service" mntns (prepare_mnt_ns_v2)
** Actually create and mount "root yard" (populate_mnt_ns_v2 -> populate_roots_yard_v2)
** Replace mounts for after forking tasks stage (insert_internal_yards)
*** Delete nested pidns procfses from tree
*** Insert internal yards with helpers to tree
** Walk the merged mount tree (mnt_tree_for_each) parents before children
*** Mount all mounts "plain" (do_mount_one_v2)
**** check mount can be mounted (can_mount_now_v2) e.g. for overlay, root, external, bind or nsfs
**** create mountpoint "/tmp/.criu.mntns.XXXXXX/mnt-[mnt_id]" (create_plain_mountpoint)
***** dir/file detected by stat on mountpoint
**** Mount all mounts private and "plain"
***** just mount a new mount (do_new_mount_v2)
****** setup as bind source for other mounts of this super block (propagate_mount_v2)
***** bind if superblock is already mounted or external or root (do_bind_mount_v2, do_mount_root_v2)
****** create sources for "deleted" bind mounts and leave it for now
***** Handle internal yard (do_internal_yard_mount_v2)
****** mount tmpfs
****** create mountpoints for children
****** mount host's proc helper mount inside
**** Except for "plain", "private" and helpers from "internal yard", we restore each mount as it should be in the final mountns (all flags and options applied)
*** Mounting all the mounts from all final mount namespaces in a single service mount namespace allows us to do "cross-namespace" bindmounts

* Mounting, second stage ("plain" to "tree" mount) (prepare_mnt_ns_v2)
** Walk across all mount namespaces
*** unshare(CLONE_NEWNS)
*** Walk all mounts belonging to this mntns (tree order) (assemble_tree_from_plain_mounts)
**** mountpoint "/tmp/.criu.mntns.XXXXXX/[nns_id]/[ns_mountpoint]"
**** Open mountpoint fd before moving mount to it and save (mp_fd)
**** Move (MS_MOVE) mount to the tree
**** Open mount fd (root dentry on a mount) (mnt_fd)
*** Pivot root to "/tmp/.criu.mntns.XXXXXX/[nns_id]"
**** leaving only mounts which should be in this mntns
** Extract "internal yard"s from the tree and put back procfses and their ancestors (extract_internal_yards)
** Remove sources of deleted mounts making them really "deleted" from "service" mntns (remove_sources_of_deleted_mounts)

* Forking stage: fork all processes (tree order)
** Inits also create pid namespaces
** Enter mount namespace
** Mmap files from mounted filesystem to restore COW mappings
*** We assume here that we don't have file mappings on delayed mounts else we can't handle it
*** Ghost/Link remaps may be created here
** Fork children

* Mounting, third stage (after forking processes) (from main criu task) (__fini_restore_mntns_v2)
** Enter CT userns (fini_restore_mntns_v2)
** For each mount namespace
*** For each procfs of this mntns (fixup_nested_pidns_proc)
**** Enter tagged pidns
**** Mount procfs from it in "internal yard"
*** Walk the mount tree of each mntns and mount all not yet mounted mounts to the tree
**** Find the mountpoint for the mount via mnt_fd of parent and mp_fds of sibling overmounts
**** Bind the mount to it from the internal yard helper or procfs helper
***** via /proc/self/fd/<id> on host's proc in "internal yard"
**** Also open mnt_fd and mp_fd for a new mount (before and after bind)
*** Umount and rmdir "internal yard"

* And finally
** Restore sharing groups for each mount (use mnt_fd to access mounts) (restore_mount_sharing_options)
*** Walk sharing group trees (parents before children)
**** Setup first (any) mount in a group
***** Is slave
****** Find any mount from parent sg or find external mount source
****** Copy sharing from it with MS_SET_GROUP
****** Make slave
***** Is shared - make it also shared
**** Setup other mounts - copy sharing from the first one

* Done

Here are links to mounts-v2 implementation in Virtuozzo criu:
* Main part: https://src.openvz.org/projects/OVZ/repos/criu/commits?until=v3.12.3.12
* Delayed proc part: https://src.openvz.org/projects/OVZ/repos/criu/commits?until=v3.12.5.13
* Kernel patch for MS_SET_GROUP: https://lore.kernel.org/lkml/1485214628-23812-1-git-send-email-avagin@openvz.org/

[[Category: Under the hood]]

Mounts-v2-Virtuozzo

2020-08-21T08:49:39Z

Ptikhomirov: Add category.

Mounts-v2-Virtuozzo

2020-08-21T08:28:12Z

Ptikhomirov: Design of mounts-v2 engine

RPC

2020-03-25T12:28:21Z

Ptikhomirov: remove systemd service part as unsupported

CRIU-RPC is a remote procedure call (RPC) protocol which uses Google Protocol Buffers to encode its calls. The requests are served by CRIU either when it is launched in the so-called "swrk" mode or by a service started with the <code>criu service</code> command. It uses a <code>SEQPACKET</code> Unix domain socket for transport. A standalone service listens at <code>/var/run/criu-service.socket</code>.

<code>criu_req</code> and <code>criu_resp</code> are the two main messages, for requests and responses respectively. They wrap the transferred messages and provide compatibility with older versions of the RPC protocol. The <code>type</code> field in them ''must'' be set according to the type of request/response that is stored. Request/response types are defined in <code>enum criu_req_type</code>. See the [[API compliance]] page for information on what each option might mean.

== Protocol ==

The protocol is simple: the client sends a <code>criu_req</code> message to the server, and the server responds with a <code>criu_resp</code>. In most cases the socket is then closed, but there are three exceptions to this rule (notifications, pre-dumps and multi-request mode), see below.

== Request ==

This is the header of the request. It defines the operation requested and options.

<source lang="c">
message criu_req {
required criu_req_type type = 1;
optional criu_opts opts = 2;
optional bool notify_success = 3; /* see Notifications below */
optional bool keep_open = 4; /* for multi-req, below */
}
</source>

Currently, there are a few request/response types:

<source lang="c">
enum criu_req_type {
EMPTY = 0;
DUMP = 1; /* criu dump */
RESTORE = 2; /* criu restore */
CHECK = 3; /* criu check */
PRE_DUMP = 4; /* criu pre-dump */
PAGE_SERVER = 5; /* criu page-server */
NOTIFY = 6; /* see Notifications below */
CPUINFO_DUMP = 7; /* criu cpuinfo dump */
CPUINFO_CHECK = 8; /* criu cpuinfo check */
}
</source>

The following options are available:

<source lang="c">
message criu_opts {
required int32 images_dir_fd = 1;
optional int32 pid = 2; /* if not set on dump, will dump requesting process */

optional bool leave_running = 3;
optional bool ext_unix_sk = 4;
optional bool tcp_established = 5;
optional bool evasive_devices = 6;
optional bool shell_job = 7;
optional bool file_locks = 8;
optional int32 log_level = 9 [default = 2];
optional string log_file = 10; /* No subdirs are allowed. Consider using work-dir */

optional criu_page_server_info ps = 11;

optional bool notify_scripts = 12;

optional string root = 13;
optional string parent_img = 14;
optional bool track_mem = 15;
optional bool auto_dedup = 16;

optional int32 work_dir_fd = 17;
optional bool link_remap = 18;
repeated criu_veth_pair veths = 19;

optional uint32 cpu_cap = 20 [default = 0xffffffff];
optional bool force_irmap = 21;
repeated string exec_cmd = 22;

repeated ext_mount_map ext_mnt = 23;
optional bool manage_cgroups = 24;
repeated cgroup_root cg_root = 25;

optional bool rst_sibling = 26; /* swrk only */
}
</source>

=== Comments and examples ===

* If no <code>pid</code> is set and the type is <code>DUMP</code>, CRIU will dump the client process by default.
* All processes in the subtree starting with <code>''pid''</code> must have the same uid as the client, or the client's uid must be root (uid == 0); otherwise CRIU will return an error.
* Only <code>images_dir_fd</code> is required; all other fields are optional. The client must open the directory for/with images by itself and set <code>images_dir_fd</code> to the opened <code>fd</code>. CRIU will open <code>/proc/''client_pid''/fd/''images_dir_fd''</code>.

Request fields are set following the same logic as options on the command line.

Here is an example:

# criu restore -D /path/to/imgs_dir -v4 -o restore.log

This is equal to:

<source lang="c">
request.type = RESTORE;

request.opts.images_dir_fd = open("/path/to/imgs_dir", O_DIRECTORY);
request.opts.log_level = 4;
request.opts.log_file = "restore.log";
</source>

=== Sub-messages for options ===

==== Info about page-server ====

<source lang="c">
message criu_page_server_info {
optional string address = 1; /* bind address -- if not set 0.0.0.0 is used */
optional int32 port = 2; /* bind port -- if not set on request, autobind is used and port is returned in response */
optional int32 pid = 3; /* page-server pid -- returned in response */
optional int32 fd = 4; /* could be used to inherit fd by page-server */
}
</source>

==== Info about veth pairs (<code>--veth-pair</code> analogue) ====

<source lang="c">
message criu_veth_pair {
required string if_in = 1; /* inside veth device name */
required string if_out = 2; /* outside veth device name */
};
</source>

==== Info about external mount mappings (<code>--ext-mount-map</code> analogue) ====
<source lang="c">
message ext_mount_map {
required string key = 1;
required string val = 2;
};
</source>

==== Specifying where cgroup root should be (<code>--cgroup-root</code> analogue) ====
<source lang="c">
message cgroup_root {
optional string ctrl = 1;
required string path = 2;
};
</source>

== Response ==

This message is sent after (un)successful execution of the request.

<source lang="c">
message criu_resp {
required criu_req_type type = 1;
required bool success = 2;

optional criu_dump_resp dump = 3;
optional criu_restore_resp restore = 4;
optional criu_notify notify = 5;
optional criu_page_server_info ps = 6;

optional int32 cr_errno = 7;
}
</source>

The <code>success</code> field reports the result of processing the request, while the <code>criu_***_resp</code> fields store request-specific information. The response type is set to the corresponding request type or to <code>EMPTY</code> to report a "generic" error. If <code>success == false</code>, one should check the <code>cr_errno</code> field to get a more detailed error code (see [https://github.com/xemul/criu/blob/master/include/cr-errno.h#L8 include/cr-errno.h]).

==== The criu_dump_resp is used to store response from DUMP request ====

<source lang="c">
message criu_dump_resp {
optional bool restored = 1;
}
</source>

This message can be sent twice: once for the process that calls DUMP, and once more for the same process, in case it requested a self-dump. In the latter case the ''restored'' field is true.

==== The response on RESTORE request ====

<source lang="c">
message criu_restore_resp {
required int32 pid = 1;
}
</source>

The <code>pid</code> field is set to the PID of the newly restored process.

==== Info about page server ====

The <code>criu_page_server_info</code> from requests will be sent back on <code>PAGE_SERVER</code> request. The <code>port</code> field will contain the port to which the server is bound.

=== Notifications ===

If <code>opts.notify_scripts</code> in the request is set to <code>true</code>, CRIU reports back <code>criu_resp</code> messages with the type set to <code>NOTIFY</code> and the <code>notify</code> field present. Notifications are the way [[action scripts]] work in RPC mode.

<source lang="c">
message criu_notify {
optional string script = 1;
optional int32 pid = 2;
}
</source>

After handling the notification the client must respond with another request with the type set to <code>NOTIFY</code> and <code>notify_success</code> set to whether the notification was handled successfully. If the notification is acknowledged as successful, the server doesn't close the socket and continues to work.
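
The notification exchange can be sketched as a client-side loop. This is a minimal illustration, not the CRIU client: the <code>Req</code> and <code>Resp</code> dataclasses are hypothetical stand-ins for the protobuf <code>criu_req</code> and <code>criu_resp</code> messages, and <code>recv</code>, <code>send</code> and <code>on_notify</code> are caller-supplied callbacks assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

NOTIFY = 6  # value of NOTIFY in enum criu_req_type


@dataclass
class Resp:
    """Stand-in for criu_resp (only the fields used here)."""
    type: int
    success: bool
    script: Optional[str] = None  # corresponds to criu_notify.script


@dataclass
class Req:
    """Stand-in for criu_req (only the fields used here)."""
    type: int
    notify_success: Optional[bool] = None


def wait_for_result(recv: Callable[[], Resp],
                    send: Callable[[Req], None],
                    on_notify: Callable[[str], bool]) -> Resp:
    """Answer NOTIFY responses until the final response arrives."""
    while True:
        resp = recv()
        if resp.type != NOTIFY:
            return resp  # the actual result of the original request
        ok = on_notify(resp.script or "")
        # Acknowledge with a NOTIFY request carrying notify_success.
        send(Req(type=NOTIFY, notify_success=ok))
```

What the server sends after a failed acknowledgement is not described here, so the sketch simply keeps reading until a non-<code>NOTIFY</code> response shows up.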

== Pre-dumps ==

Before issuing a <code>DUMP</code> request the client may send one or more <code>PRE_DUMP</code> requests. Once a <code>PRE_DUMP</code> is sent and the response is received, the client may send one more <code>PRE_DUMP</code> or a <code>DUMP</code> request. The server closes the socket only after the <code>DUMP</code> one.

== Multi-request mode ==

If the <code>req.keep_open</code> flag is set to true, the server will not close the socket after the response, but will wait for more requests. This mode is supported only for the following request types:

* <code>PRE_DUMP</code> (automatically)
* <code>PAGE_SERVER</code>
* <code>CPUINFO_DUMP</code> and <code>CPUINFO_CHECK</code>
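
The rule above can be written down as a small predicate. This is a sketch of the rule as stated on this page, not CRIU code: the numeric values come from <code>enum criu_req_type</code>, the function name is hypothetical, and treating <code>keep_open</code> on other request types as having no effect is an assumption.

```python
# Values from enum criu_req_type
DUMP, PRE_DUMP, PAGE_SERVER = 1, 4, 5
CPUINFO_DUMP, CPUINFO_CHECK = 7, 8

# Request types that honour req.keep_open
KEEP_OPEN_TYPES = {PAGE_SERVER, CPUINFO_DUMP, CPUINFO_CHECK}


def socket_stays_open(req_type: int, keep_open: bool = False) -> bool:
    """Tell whether the server waits for more requests after responding."""
    if req_type == PRE_DUMP:
        return True  # multi-request mode is automatic for pre-dumps
    # Assumption: keep_open is ignored for unsupported request types.
    return keep_open and req_type in KEEP_OPEN_TYPES
```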

== Run ==

=== SWRK mode ===

This mode turns on when one <code>fork() + exec()</code>s CRIU with the <code>swrk</code> action and one more argument specifying the number of a descriptor with a <code>SOCK_SEQPACKET</code> Unix socket. In this mode CRIU works as a service worker task, accepting standard RPC requests via the mentioned socket and using it for action script notifications and result reporting.

=== Server ===

On the server side, CRIU creates a <code>SOCK_SEQPACKET</code> Unix socket and listens for connections on it. After receiving a <code>criu_req</code>, CRIU processes it, does what was requested and sends back a <code>criu_resp</code> with the request-specific <code>criu_***_resp</code> field set.
If CRIU gets an unknown type of request, it returns a <code>criu_resp</code> with <code>type == EMPTY</code> and <code>success == false</code>.

To launch the service, run:

# criu service [options]

Options accepted by service are

; --address <path>
: where to put listening socket

; --pid-file <path>
: where to write pid of service process

; --daemon
: tells service to daemonize

; -o <file>
: says where to write logs

; -v[N]
: sets the log level

=== Client ===

The client, in turn, must connect to the service socket, send a <code>criu_req</code> with the request in it, and wait for a <code>criu_resp</code> with the response.
You can find examples of client programs in C and Python in test/rpc/.
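
As a rough illustration of the transport only, here is a Python sketch of the connect/send/receive sequence. The helper names and <code>MAX_MSG_SIZE</code> are hypothetical, and the payloads are opaque bytes: in a real client they would be the serialized <code>criu_req</code> and <code>criu_resp</code> protobuf messages, which this sketch does not build.

```python
import socket

MAX_MSG_SIZE = 1024  # assumed upper bound for one response message


def connect_to_service(path: str = "/var/run/criu-service.socket") -> socket.socket:
    """Connect to the service socket; SOCK_SEQPACKET preserves message boundaries."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    sock.connect(path)
    return sock


def exchange(sock: socket.socket, payload: bytes) -> bytes:
    """Send one serialized request and return one serialized response."""
    sock.send(payload)
    return sock.recv(MAX_MSG_SIZE)
```

Since each <code>send()</code> on a <code>SOCK_SEQPACKET</code> socket is delivered as one message, no extra length framing is needed around the protobuf payload.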

With RPC facilities one can perform a [[self dump]].

There's a [[C API|library]] that implements simple wrappers on top of RPC.

== See also ==
* [[CLI]]
* [[C API]]

[[Category: API]]

External bind mounts

2020-03-25T10:07:59Z

Ptikhomirov: External mounts, external/internal sharing/slavery.

__TOC__

One typical external resource encountered when dumping a container (especially LXC/Docker) is a mount point whose root sits outside of the container's root. This situation was intended to be resolved using [[plugins]] but turned out to be common enough to introduce a built-in way of handling it.

== What is external bind mount ==

Creating one is as simple as

mkdir /root
mount --bind /foo /root/bar
chroot /root

That's it. From now on, /bar is a mountpoint whose root (the source) is not accessible directly.

If you look at the /proc/$pid/mountinfo file of a task that sees such a mount, you would see something like

11 23 8:3 /root / ... - ext4 /dev/sda1 ...
23 34 8:3 /foo /bar ... - ext4 /dev/sda1 ...

Columns 4 and 5 are the root and the mountpoint respectively. You can see that / is /root from the /dev/sda1 device, and /bar is a mountpoint whose root is /foo from the same device.
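
For illustration, the column layout can be read programmatically. The sketch below is a plain-Python parser written for this page (the <code>MountInfo</code> type and the function name are hypothetical); it expects complete mountinfo lines, that is, lines without the <code>...</code> elisions used in the listing above.

```python
from typing import NamedTuple, Tuple


class MountInfo(NamedTuple):
    mnt_id: int
    parent_id: int
    root: str              # column 4
    mountpoint: str        # column 5
    tags: Tuple[str, ...]  # optional fields such as "shared:290"
    fstype: str
    source: str


def parse_mountinfo_line(line: str) -> MountInfo:
    """Parse one line of /proc/<pid>/mountinfo."""
    # The " - " separator ends the variable-length optional fields.
    left, right = line.split(" - ", 1)
    f = left.split()
    fstype, source = right.split()[:2]
    return MountInfo(int(f[0]), int(f[1]), f[3], f[4],
                     tuple(f[6:]), fstype, source)
```

A mount is a candidate external bind mount when its <code>root</code> is not something CRIU can reach from the dumped tree.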

== How to teach CRIU to dump them ==

By default CRIU doesn't dump such mountpoints, because it would not be able to restore them: the root of these mounts is outside the scope of what CRIU dumped. In the logs you would see a message like

34:/bar doesn't have a proper root mount

which means the mountpoint /bar has an inaccessible root.

To dump and restore such mounts, use the <code>--external mnt[KEY]:VAL</code> option, which sets up the root mapping for external mounts.

On dump, KEY is a mountpoint inside the container, and the corresponding VAL is a string that will be written into the image as the mountpoint's root value.

On restore, KEY is the value from the image (VAL from dump), and VAL is the path on the host that will be bind-mounted into the container (to the mountpoint path from the image).

For example, if we want to dump the task above we should call

criu dump ... --external mnt[/bar]:barmount

The word <code>barmount</code> is an arbitrary identifier that will be put in the image file instead of the original root path:

criu show -f mountpoints.img -F mnt_id,root,mountpoint
mnt_id: 0x22 root: barmount mountpoint: /bar

On restore we should tell CRIU where to bind mount <code>barmount</code> from, like this

criu restore ... --external mnt[barmount]:/foo

With this, CRIU will bind mount /foo onto the proper mountpoint.
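
The two-step mapping (mountpoint to identifier on dump, identifier to host path on restore) can be modelled in a few lines. This is only a sketch of the lookup logic described above, covering the explicit <code>mnt[KEY]:VAL</code> form; the function names are hypothetical and this is not how CRIU parses its options.

```python
import re
from typing import List, Tuple

# Explicit form only: mnt[KEY]:VAL with a non-empty KEY
_EXT_MNT = re.compile(r"^mnt\[(?P<key>[^\]]+)\]:(?P<val>.*)$")


def parse_external_mnt(arg: str) -> Tuple[str, str]:
    """Split 'mnt[KEY]:VAL' into (KEY, VAL)."""
    m = _EXT_MNT.match(arg)
    if m is None:
        raise ValueError(f"not an explicit external mount mapping: {arg!r}")
    return m.group("key"), m.group("val")


def root_written_on_dump(mountpoint: str, original_root: str,
                         dump_args: List[str]) -> str:
    """On dump KEY is the mountpoint; VAL replaces the root in the image."""
    mapping = dict(parse_external_mnt(a) for a in dump_args)
    return mapping.get(mountpoint, original_root)


def source_used_on_restore(image_root: str, restore_args: List[str]) -> str:
    """On restore KEY is the value from the image; VAL is the host path."""
    mapping = dict(parse_external_mnt(a) for a in restore_args)
    return mapping[image_root]
```

Keeping an identifier in the image rather than the host path is what allows the host-side source to differ between dump and restore.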

== Auto detection ==

In case one wants CRIU to autodetect and dump all the external bind mounts, and there is no need to change host mount points on restore, one can use a special syntax:

criu dump ... --external mnt[]:''flags''

Note that there is nothing inside the square brackets, and the optional <code>:''flags''</code> argument can contain the following characters:

; <code>m</code>
: Also enable dumping of external master mounts (as in <code>mount --make-slave</code>)
; <code>s</code>
: Also enable dumping of external shared mounts (as in <code>mount --make-shared</code>)

By default, neither master nor shared external mounts are dumped (if found, the dump is aborted). Note that if <code>''flags''</code> are not given, the colon is optional.

=== Examples ===

criu dump ... --external 'mnt[]'

Auto detect and dump all external bind mounts.

criu dump ... --external 'mnt[]:s'

Auto detect and dump all external bind mounts, including the shared ones.

criu dump ... --external 'mnt[]:sm'

Auto detect and dump all external bind mounts, including the shared and the master ones.
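
To make the syntax concrete, here is a small sketch that interprets the auto-detection form. It mirrors the description in this section only; the function name and the returned dictionary are hypothetical, and rejecting unknown flag characters is an assumption.

```python
from typing import Dict


def parse_auto_external(arg: str) -> Dict[str, bool]:
    """Interpret 'mnt[]' or 'mnt[]:flags' as described above."""
    prefix = "mnt[]"
    if not arg.startswith(prefix):
        raise ValueError(f"not an auto-detection request: {arg!r}")
    rest = arg[len(prefix):]
    if rest == "":
        flags = ""           # the colon may be omitted without flags
    elif rest.startswith(":"):
        flags = rest[1:]
    else:
        raise ValueError(f"unexpected text after brackets: {rest!r}")
    unknown = set(flags) - {"m", "s"}
    if unknown:
        # Assumption: anything other than 'm' and 's' is an error.
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return {"master": "m" in flags, "shared": "s" in flags}
```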

== Old days ==

For now the same behavior is configured with the <code>--ext-mount-map KEY:VAL</code> option. Soon this option will be [[deprecation|deprecated]].

[[Category:HOWTO]]
[[Category:External]]

=== Sharing for external bindmounts ===

External bindmounts can have either internal or external sharing. Please see the example:

# Preparation
unshare -m --propagation private
mkdir /external_mount_sharing_test
mount -t tmpfs tmpfs /external_mount_sharing_test/
mount --make-private /external_mount_sharing_test/
cd /external_mount_sharing_test
# Source of external mount
mkdir external_mount
mount -t tmpfs tmpfs-external external_mount/
mount --make-shared external_mount/
cat /proc/$$/mountinfo | grep external
# 811 755 0:60 / /external_mount_sharing_test rw,relatime - tmpfs tmpfs rw
# 812 811 0:62 / /external_mount_sharing_test/external_mount rw,relatime shared:290 - tmpfs tmpfs-external rw

# Switch to CT mntns
unshare -m --propagation unchanged sh
mkdir root
mount -t tmpfs tmpfs-root root/
mkdir root/external_sharing root/internal_sharing root/proc

# Create external mount
mount --bind external_mount/ root/external_sharing
mount --bind external_mount/ root/internal_sharing
mount --make-private root/internal_sharing
mount --make-shared root/internal_sharing

# More preparations
mount --bind /proc root/proc
cd root
mkdir bin lib64
SH=$(which sh)
cp $SH bin
cp $(ldd $SH | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
CAT=$(which cat)
cp $CAT bin
cp $(ldd $CAT | grep "/lib64" | sed 's/^.*\(\/lib64\S*\)\s.*$/\1/') lib64
PATH=$PATH:/bin
chroot . sh
cat /proc/$$/mountinfo
# 843 841 0:63 / / rw,relatime - tmpfs tmpfs-root rw
# 861 843 0:62 / /external_sharing rw,relatime shared:290 - tmpfs tmpfs-external rw
# 898 843 0:62 / /internal_sharing rw,relatime shared:349 - tmpfs tmpfs-external rw
# 899 843 0:5 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw

Mounts 812 (on host) and 861 (in container) have the same sharing (shared group), which is external sharing, while mount 898 has its own local shared group, which is internal sharing.

Before [https://github.com/checkpoint-restore/criu/pull/906 #906] we were detecting this external/internal sharing state for auto-detected external mounts only, but we need it for manual external mounts too. Moreover, this also applies to manual external slave mounts: they can be external or internal slaves too.

So we detect that a mount has external sharing if there are mounts of the same shared group in the mount namespace of CRIU. We also detect that a mount has external slavery if there is no master mount for it in the CT mount namespaces.
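
The two detection rules can be sketched over the optional fields of mountinfo (<code>shared:N</code>, <code>master:N</code>). This is an illustration of the rules as worded above, not the CRIU implementation; the function names and the tuple-of-tags input format are assumptions made for the example.

```python
from typing import Iterable, Optional, Sequence, Tuple


def group_id(tags: Iterable[str], kind: str) -> Optional[int]:
    """Return N from a 'shared:N' or 'master:N' field, or None."""
    for tag in tags:
        name, _, value = tag.partition(":")
        if name == kind:
            return int(value)
    return None


def classify(mount_tags: Sequence[str],
             criu_ns_mounts: Iterable[Sequence[str]],
             ct_ns_mounts: Iterable[Sequence[str]]) -> Tuple[bool, bool]:
    """Return (external_sharing, external_slavery) for one CT mount."""
    shared = group_id(mount_tags, "shared")
    master = group_id(mount_tags, "master")
    criu_groups = {group_id(t, "shared") for t in criu_ns_mounts}
    ct_groups = {group_id(t, "shared") for t in ct_ns_mounts}
    # External sharing: CRIU's own namespace has a mount in the same group.
    external_sharing = shared is not None and shared in criu_groups
    # External slavery: no mount in the CT namespaces is the master.
    external_slavery = master is not None and master not in ct_groups
    return external_sharing, external_slavery
```

With the example above, mount 861 (<code>shared:290</code>) is classified as externally shared because host mount 812 is in group 290, while mount 898 (<code>shared:349</code>) is not.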

Mount points

2020-03-25T07:17:28Z

Ptikhomirov: /* TODO */

This page describes what we do with mount points trees.
== Introduction ==
When we are thinking about restoring a mount tree, we need to remember a few things:
* shared and slave groups
* how mounts are propagated inside one group
* bind mounts (rw, ro)

The algorithm described here is not able to cover all the cases, so this solution is a temporary one.

== Dump ==

There is nothing interesting here. We just dump information about mounts and validate them to be sure that we are able to restore them.

== Restore ==

Mounts are restored in several iterations. On each iteration we enumerate all mounts and mount everything we can. On the next iteration we mount a bit more and continue to do so step by step. The idea is that we will be able to mount something new on each iteration. If we can't mount anything, we stop and report an error saying that we can't restore this configuration.
For example, a mount can't be mounted if its parent isn't mounted yet. Or a more interesting example, a mount can't be mounted if not all the mounts from its parent shared group are mounted.
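
The iteration scheme can be sketched as a fixed-point loop. This is a simplified model that only tracks the "parent must be mounted first" condition and leaves out the shared-group condition; the function name and the <code>parents</code> mapping are hypothetical and no real mounting is done.

```python
from typing import Dict, List, Optional


def restore_order(parents: Dict[int, Optional[int]]) -> List[int]:
    """parents maps a mount id to its parent id (None for the root mount).

    Returns the order in which mounts get mounted, pass by pass.
    """
    mounted = set()
    order: List[int] = []
    pending = set(parents)
    while pending:
        # One iteration: pick every mount whose parent is already mounted.
        ready = sorted(m for m in pending
                       if parents[m] is None or parents[m] in mounted)
        if not ready:
            # No progress in this iteration: give up with an error.
            raise RuntimeError("can't restore this mount configuration")
        for m in ready:
            mounted.add(m)
            order.append(m)
        pending -= set(ready)
    return order
```

The loop ends either when nothing is left to mount or when a whole pass makes no progress, which matches the error case described above.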

== Known issues ==
CRIU doesn't support configurations where two mounts of one shared group have different sets of mounts. This is not a feature, this is a bug and you are welcome to fix it.

(done and checked by non_uniform_share_propagation in zdtm)

== TODO ==
* Read-only bind mounts
(not sure what was meant here, but e.g. ghost files on read-only mounts are handled and checked by the ghost_on_rofs zdtm test)
* Skipping mountpoints
* Enabling FS runtime

== See also ==
[[External bind mounts]]

[[Category: Under the hood]]
[[Category: Fly in the ointment]]

Mount points

2020-03-25T07:15:53Z

Ptikhomirov: /* Known issues */

This page describes what we do with mount points trees.
== Introduction ==
When we are thinking about restoring a mount tree, we need to remember a few things:
* shared and slave groups
* how mounts are propagated inside one group
* bind mounts (rw, ro)

The algorithm described here is not able to cover all the cases, so this solution is a temporary one.

== Dump ==

There is nothing interesting here. We just dump information about mounts and validate them to be sure that we are able to restore them.

== Restore ==

Mounts are restored in several iterations. On each iteration we enumerate all mounts and mount everything we can. On the next iteration we mount a bit more and continue to do so step by step. The idea is that we will be able to mount something new on each iteration. If we can't mount anything, we stop and report an error saying that we can't restore this configuration.
For example, a mount can't be mounted if its parent isn't mounted yet. Or a more interesting example, a mount can't be mounted if not all the mounts from its parent shared group are mounted.

== Known issues ==
CRIU doesn't support configurations where two mounts of one shared group have different sets of mounts. This is not a feature, this is a bug and you are welcome to fix it.

(done and checked by non_uniform_share_propagation in zdtm)

== TODO ==
* Read-only bind mounts
* Skipping mountpoints
* Enabling FS runtime

== See also ==
[[External bind mounts]]

[[Category: Under the hood]]
[[Category: Fly in the ointment]]

GSoC20 Students Requests

2020-03-04T07:45:06Z

Ptikhomirov: fixup

# Manas Mangaonkar
#* Subproject: [[Google_Summer_of_Code_Ideas#Porting_crit_functionalities_in_GO|Porting crit functionalities in GO]]
# Anmol
#* Subproject: [[Google_Summer_of_Code_Ideas#Anonymise_image_files|Anonymise image files]]
# Aashim Garg
#* Subproject: [[Google_Summer_of_Code_Ideas#Restrict_checks_for_open.2Fmmaped_files|Restrict checks for open/mmaped files]]
# Zeyad Yasser
#* Subproject: [[Google_Summer_of_Code_Ideas#Anonymise_image_files|Anonymise image files]]
# Kaushlendra Pratap
#* Subproject: [[Google_Summer_of_Code_Ideas#Anonymise_image_files|Anonymise image files]]
# Adhitya Mahajan
#* Subproject: [[Google_Summer_of_Code_Ideas#Restrict_checks_for_open.2Fmmaped_files|Restrict checks for open/mmaped files]]
# Nishchay Agrawal
#* Subproject: [[Google_Summer_of_Code_Ideas#Use_eBPF_to_lock_and_unlock_the_network|Use eBPF to lock and unlock the network]]
# Shivamani Patil
#* Subproject: [[Google_Summer_of_Code_Ideas#Porting_crit_functionalities_in_GO|Porting crit functionalities in GO]]
# Yannis Thomopoulos
#* Subproject: [[Google_Summer_of_Code_Ideas#Add_support_for_SPFS|Add support for SPFS]]
# Sahil Kumar Sahu
#* Subproject: [[Google_Summer_of_Code_Ideas#Use_eBPF_to_lock_and_unlock_the_network|Use eBPF to lock and unlock the network]]
# Puranjay Mohan
#* Subproject: [[Google_Summer_of_Code_Ideas#Optimize_logging_engine|Optimize logging engine]]
# Ajay Bharadwa
#* Subproject: [[Google_Summer_of_Code_Ideas#Restrict_checks_for_open.2Fmmaped_files|Restrict checks for open/mmaped files]]
# Subham Pandey
#* Subproject: [[Google_Summer_of_Code_Ideas#Memory_changes_tracking_with_userfaultfd-WP|Memory changes tracker]]
# Vineet Jain
#* Subproject: [[Google_Summer_of_Code_Ideas#Use_eBPF_to_lock_and_unlock_the_network|Use eBPF to lock and unlock the network]] or [[Google_Summer_of_Code_Ideas#Restrict_checks_for_open.2Fmmaped_files|Restrict checks for open/mmaped files]]

[[Category:GSoC]]

GSoC20 Students Requests

2020-03-02T07:54:06Z

Ptikhomirov:

GSoC20 Students Requests

2020-02-28T20:52:52Z

Ptikhomirov: Created page with " # Manas Mangaonkar #* Subproject: Porting crit functionalities in GO # Anmol #* Subproject: Google_Summer..."