[MDBF-414] bb-rhel8-docker create routine to clean podman system Created: 2022-05-13  Updated: 2022-09-19  Resolved: 2022-09-19

Status: Closed
Project: MariaDB Foundation Development
Component/s: Buildbot
Affects Version/s: None
Fix Version/s: N/A

Type: Task Priority: Minor
Reporter: Faustin Lammler Assignee: Faustin Lammler
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: 0d
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Find an automated way of cleaning podman system on this machine before /data is full.



 Comments   
Comment by Faustin Lammler [ 2022-05-13 ]

/data was almost full, danblack maybe you have an idea...

Comment by Daniel Black [ 2022-05-13 ]

An error condition on the last docker library test.
Likely relates to:

++ docker run -d --rm --name mariadb-container-10510-6676 -e MASTER_HOST=mariadbcontainer3441 -e MARIADB_RANDOM_ROOT_PASSWORD=1 -e MARIADB_MYSQL_LOCALHOST_USER=1 -e 'MARIADB_MYSQL_LOCALHOST_GRANTS=REPLICATION CLIENT /*!100509 ,REPLICA MONITOR */' -v /home/buildbot/amd64-rhel8-dockerlibrary/build/mariadb-docker/.test/replica-initdb.d/:/docker-entrypoint-initdb.d:Z --network=container:mariadbcontainer3441 '--health-cmd=healthcheck.sh --su-mysql --replication_io --replication_sql --replication_seconds_behind_master=0 --replication' --health-interval=3s 2a8319acb3d1d4e247f0af2a40809f1b11e60d23957702a3066d8f809044faaa --server-id=2 --port 3307
time="2022-05-13T22:05:43+02:00" level=error msg="unable to get systemd connection to add healthchecks: dial unix /tmp/podman-run-1003/systemd/private: connect: no such file or directory"
time="2022-05-13T22:05:43+02:00" level=error msg="unable to get systemd connection to start healthchecks: dial unix /tmp/podman-run-1003/systemd/private: connect: no such file or directory"

Frequent $arch-ubuntu-$lts-deb-autobake failures, due to required dependencies not being installed on the builders (fix), aren't a cause, as pmem (10.7+) doesn't use ubuntu-18.04.

As the delta of builds across arches has reduced, maybe the week-long gap can be reduced, but some more measurement is needed (ppc64le focal: 2 days).

Builds terminated by the hard SIGKILL of buildbot shutdown aren't cleaned up.

Comment by Daniel Black [ 2022-05-16 ]

Receiving a trigger from a $arch-ubuntu-$lts-deb-autobake builder with haltOnFailure=False set, on the failure path, could enable the running containers and the containers from the other architectures to be cleared.

Comment by Daniel Black [ 2022-05-16 ]

Noting implementation.

Note: the build on L87 builds without a tag and runs a test (L118) (most of the bb time) before adding it to a manifest on L188.

A cron pruning of untagged images is dangerous: it risks removing good images whose tests are still running.

What is needed in the script is:

  • cleanup of the manifest on test failures; ensure test cleanup kills off running containers (most are started with --rm, so we just need to ensure they are killed)
  • L145-150 can be removed
  • If we are building 10.5-X and a 10.5-Y manifest is there, we don't have a good resolution strategy: the first completed gets pushed.
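
A minimal sketch of what that failure cleanup could look like (the function name, the `mariadb` container-name prefix, and the manifest argument are illustrative assumptions, not the actual script's names):

```shell
#!/usr/bin/env bash
# Hypothetical cleanup step for the docker-library build script.

cleanup_on_failure() {
  local manifest="$1"   # e.g. mariadb-devel-10.8-<commit>

  # Kill leftover test containers; most were started with --rm,
  # so killing them also removes them.
  for c in $(podman ps --format '{{.Names}}' | grep '^mariadb' || true); do
    podman kill "$c"
  done

  # Drop the partially built manifest so a retry starts clean.
  buildah manifest rm "$manifest" 2>/dev/null || true
}
```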
Comment by Faustin Lammler [ 2022-05-16 ]

Ok, that makes sense. We can prune the podman system on a weekly or daily basis via a cronjob; this is not clean, but it will do the job. What do you think?

Ok, the jira task was updated and I did not refresh the page...
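
The cron idea could be sketched like this (the file path and schedule are assumptions; the `--filter until=` age guard is one way to reduce the risk of pruning images whose tests are still running):

```shell
# Hypothetical /etc/cron.d/podman-prune fragment, run daily as buildbot:
# remove only dangling images older than 48h, leaving in-flight builds alone.
# m  h  dom mon dow user     command
30   4  *   *   *   buildbot podman image prune --force --filter until=48h
```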

Comment by Faustin Lammler [ 2022-05-16 ]

> What is needed in script is:
>
> cleanup of the manifest on test failures; ensure test cleanup kills off running containers (most are started with --rm, so we just need to ensure they are killed)
> L145-150 can be removed
> If we are building 10.5-X and a 10.5-Y manifest is there, we don't have a good resolution strategy: the first completed gets pushed.

Sounds good, how do you want to proceed?

Comment by Daniel Black [ 2022-05-16 ]

Disable the cron pruning. It's breaking things like:

https://buildbot.mariadb.org/#/builders/311/builds/9109/steps/2/logs/stdio

 
[buildbot@bb-rhel8-docker ~]$ buildah manifest push --all --rm mariadb-devel-10.8-5bfd9e51b35b5e538254860c425e9759c2e1f5fa docker://quay.io/mariadb-foundation/mariadb-devel:10.8
initializing source containers-storage:[overlay@/data/buildbot/containers/storage+/var/tmp/containers-user-1003/containers:overlay.mount_program=/bin/fuse-overlayfs]@5961a90fcc8dac44d0fc7cc66c50f604d34c3f92c4400cd499e0ea0c638004ab: error opening "containers-storage:[overlay@/data/buildbot/containers/storage+/var/tmp/containers-user-1003/containers:overlay.mount_program=/bin/fuse-overlayfs]@a48f4178d9efc88ca4b0459ff40264755ee47c2e6a0c086fc8a1ad140aa94042" as image source: reading image "a48f4178d9efc88ca4b0459ff40264755ee47c2e6a0c086fc8a1ad140aa94042": error locating image with ID "a48f4178d9efc88ca4b0459ff40264755ee47c2e6a0c086fc8a1ad140aa94042": image not known
ERRO[0000] exit status 125 

Then let's see the kinds of leftovers we get.

As in the test script, a trap "killoff" EXIT is needed in the bb script, called on the failure condition to clean the environment depending on the stage: image build, image test, or manifest push.
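
A minimal sketch of that trap pattern (the stage-tracking variable is an assumed illustration, not the actual script's mechanism):

```shell
#!/usr/bin/env bash
# Sketch only: track the current stage so a single EXIT trap can
# clean up appropriately whichever step fails.

STAGE="image-build"

killoff() {
  echo "cleaning up after stage: $STAGE"
  # stage-dependent cleanup would go here, e.g. killing test
  # containers or removing a half-built manifest
}
trap killoff EXIT

# ... build image ...
STAGE="image-test"
# ... run tests ...
STAGE="manifest-push"
# ... push manifest ...
```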

If you want to try writing a PR for this please do so.

Comment by Faustin Lammler [ 2022-05-16 ]

> Disable the cron pruning

I did not activate any pruning routine so far.

> If you want to try writing a PR for this please do so.

Ok will try that in the next days.

Comment by Faustin Lammler [ 2022-05-18 ]

First attempt in https://github.com/MariaDB/mariadb.org-tools/pull/156, need testing and polishing.
Also, as a reminder, the command needs to be run as the buildbot user on the bb-rhel8 machine to verify the presence of images/containers.

Comment by Faustin Lammler [ 2022-06-15 ]

The routine is probably missing something or needs improvements as /data is almost full:

[buildbot@bb-rhel8-docker ~]$ podman images | grep none | wc -l
209
[buildbot@bb-rhel8-docker ~]$ df -h
Filesystem             Size  Used Avail Use% Mounted on
devtmpfs               1.9G     0  1.9G   0% /dev
tmpfs                  2.0G  336K  2.0G   1% /dev/shm
tmpfs                  2.0G  193M  1.8G  10% /run
tmpfs                  2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/mapper/rhel-root   26G  5.5G   21G  21% /
/dev/vdb                98G   93G  5.2G  95% /data
/dev/vda1             1014M  267M  748M  27% /boot
tmpfs                  393M   12K  393M   1% /run/user/1000

If you, danblack, want to take a look before I do some manual cleaning, that would be good.
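
For reference, the dangling-image count above could be wrapped and acted on like this (a hypothetical sketch; `count_dangling` is not an existing helper):

```shell
# Count dangling (<none>) images, matching the `podman images | grep none`
# check above, but via podman's own filter.
count_dangling() {
  podman images --filter dangling=true --format '{{.ID}}' | wc -l
}

# Removing them (the same set podman reports as <none>) would be:
#   podman image prune --force
```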

Comment by Daniel Black [ 2022-06-15 ]

The answer is obvious enough from bb status

For each branch there should be 3 (< 10.5) or 4 (>= 10.5) builds on the amd64-rhel8-docker.

There are so many failures it's filling up quicker than the purging can handle.

Cleanups only happen after a successful push, so 2 days ago.

7 day backlog of amd64-2004-deb-autobake isn't helping.

10.3, and soon all the others, are failing because of https://github.com/MariaDB/buildbot/blob/main/scripts/docker-library-build-and-test.sh#L90 and https://github.com/MariaDB/mariadb-docker/blob/next/update.sh#L31, caused by https://github.com/MariaDB/server/pull/2141 / MDEV-28628.

Also https://github.com/MariaDB/mariadb.org-tools/pull/81 for cleaning when there is a failure in the dependency.

Comment by Faustin Lammler [ 2022-08-09 ]

Disk was increased by 50GB.

Comment by Faustin Lammler [ 2022-09-19 ]

Problem fixed with:

  • cleaning routine
  • increasing size

The machine was also moved to a stronger (and more reliable) builder.

Generated at Thu Feb 08 03:37:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.