  MariaDB Server / MDEV-9264

Implement Linux kernel Out Of Memory configuration (oom_score_adj) as a my.cnf option and a mysqld global variable

    Details

    • Type: Task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Hi guys,
      We have a big problem when the system runs out of memory.
      Running an Apache + MySQL server, I would prefer the kernel to kill the Apache server instead of mysqld.
      Although mysqld can repair all tables (MyISAM/InnoDB/Aria/TokuDB/etc.) when it restarts, a crash of the database process (a kill signal when out of memory, something like killall -9 mysqld) is more critical (repair takes more time and important data can be lost) than a crash of the application (which only holds static/cached/transaction data, something like killall -9 httpd).

      In this task we will not check the oom_score_adj of other processes; we will only deal with the mysqld oom_score_adj value.


      The idea is to add an option to my.cnf / mysqld / mysqld_multi / mysqld_safe, and a global variable (visible in SHOW VARIABLES), to configure the server's oom_score_adj.
      The /proc kernel interface may provide the oom_adj and/or the oom_score_adj file (the second has a bigger range, -1000 to 1000 instead of -17 to +15; the first is older and exists in more kernels, maybe including kernels before 2.6).

      An article from someone who had the same problem with an Oracle server: http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html

      In the my.cnf file we could include:
      linux-oom_adj=xxx (a number from -17 to +15; the user MUST know whether we are using oom_adj or oom_score_adj)
      linux-oom_score_adj=xxx (a number from -1000 to 1000; the user MUST know whether we are using oom_adj or oom_score_adj)
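      A minimal sketch of how the two proposed options could be parsed and range-checked before anything is written to /proc. The helper names (parse_oom_value, valid_linux_oom_adj, valid_linux_oom_score_adj) are hypothetical, not existing MariaDB functions; the limits are the ones from include/uapi/linux/oom.h, with OOM_DISABLE (-17) used as the lower bound for oom_adj.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Limits as defined in include/uapi/linux/oom.h. */
#define OOM_DISABLE       (-17)
#define OOM_ADJUST_MAX    15
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/* Hypothetical helper: parse a decimal option value such as "-1000" and
 * accept it only if it lies in [min, max]. On success stores it in *out. */
static bool parse_oom_value(const char *text, long min, long max, long *out)
{
    char *end;

    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return false;               /* empty, trailing junk, or overflow */
    if (v < min || v > max)
        return false;               /* outside the range the kernel accepts */
    *out = v;
    return true;
}

/* linux-oom_adj: -17 (OOM_DISABLE) .. +15 */
static bool valid_linux_oom_adj(const char *text, long *out)
{
    return parse_oom_value(text, OOM_DISABLE, OOM_ADJUST_MAX, out);
}

/* linux-oom_score_adj: -1000 (OOM_SCORE_ADJ_MIN) .. +1000 */
static bool valid_linux_oom_score_adj(const char *text, long *out)
{
    return parse_oom_value(text, OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX, out);
}
```

      Rejecting bad values at option-parsing time lets mysqld refuse to start with a clear message instead of discovering the problem when the kernel rejects the write.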

      Tested with kernel 4.xxx: we should use oom_score_adj, since writing oom_adj logs:
      [179443.053386] bash (19390): /proc/2398/oom_adj is deprecated, please use /proc/2398/oom_score_adj instead.
      https://www.kernel.org/doc/Documentation/filesystems/proc.txt
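      A sketch of picking the interface at startup, assuming the server inspects /proc/self (the running mysqld): prefer oom_score_adj and fall back to oom_adj only when the newer file is missing, the same order used by the Chromium AdjustOOMScore() code quoted in this issue. pick_oom_interface is a hypothetical name.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: return the name of the per-process OOM file to use.
 * Prefer oom_score_adj (oom_adj is reported as deprecated on 4.x kernels)
 * and fall back to oom_adj on kernels that only provide the old file.
 * Returns NULL if neither exists (e.g. /proc not mounted, non-Linux). */
static const char *pick_oom_interface(const char *proc_dir)
{
    static const char *const names[] = { "oom_score_adj", "oom_adj" };
    char path[256];

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        snprintf(path, sizeof path, "%s/%s", proc_dir, names[i]);
        if (access(path, F_OK) == 0)
            return names[i];
    }
    return NULL;
}
```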

      important notes:

      (from 2.4 kernel docs):

      It is the job of the linux 'oom killer' to sacrifice one or more processes in order to free up memory for the system when all else fails. It will also kill any process sharing the same mm_struct as the selected process, for obvious reasons. *Any particular process leader may be immunized against the oom killer* if the value of its /proc/<pid>/oomadj is set to the constant OOM_DISABLE (currently defined as *-17*).

      (from 4.xxx kernel docs):

      The badness heuristic assigns a value to each candidate task ranging from *0 (never kill) to 1000 (always kill)* to determine which process is targeted.  
      The units are roughly a proportion along that range of allowed memory the process may allocate from based on an estimation of its current memory and swap use.
      For example, if a task is using all allowed memory, its badness score will be 1000.  If it is using half of its allowed memory, its score will be 500.
       
      The value of /proc/<pid>/oom_score_adj is added to the badness score before it is used to determine which task to kill.  
      Acceptable values range from *-1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX)*.  
      This allows userspace to polarize the preference for oom killing either by always preferring a certain task or completely disabling it.  
      The lowest possible value, *-1000, is equivalent to disabling oom killing entirely for that task* since it will always report a badness score of 0.
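      To make the quoted heuristic concrete, here is a simplified model of it (an illustration of the documentation, not the kernel's real code): the score is the badness plus oom_score_adj, clamped to 0..1000. For example, if mysqld uses half of its allowed memory (badness 500) with oom_score_adj=-200 it scores 300, while an httpd with badness 300 and oom_score_adj=+200 scores 500 and would be chosen first.

```c
/* Simplified model of the documented heuristic, not the kernel's code:
 * badness is 0..1000 (share of allowed memory in use), oom_score_adj is
 * -1000..1000, and the result is clamped back to 0..1000. With
 * oom_score_adj == -1000 the result is always 0 ("never kill"). */
static int effective_oom_score(int badness, int oom_score_adj)
{
    int score = badness + oom_score_adj;

    if (score < 0)
        score = 0;
    if (score > 1000)
        score = 1000;
    return score;
}
```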
       
      Caveat: when a parent task is selected, the oom killer will sacrifice any first generation children with separate address spaces instead, if possible.  
      This avoids servers and important system daemons from being killed and loses the minimal amount of work.

      --------------------------------------------------
      When linux-oom_adj is set in my.cnf, the server MUST write this value to the /proc/<pid>/task/*/oom_adj files at mysqld startup.
      When linux-oom_score_adj is set in my.cnf, the server MUST write this value to the /proc/<pid>/task/*/oom_score_adj files at mysqld startup.
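      A minimal sketch of the startup write for a single /proc file, using plain open()/write() so that the kernel's answer is seen immediately. write_oom_file is a hypothetical name; the errno values in the comment are what I expect from the kernel's oom_score_adj handler and should be confirmed. This simplified variant writes one file rather than every task file as specified above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical startup hook: write `value` to one /proc file, for example
 * "/proc/self/oom_score_adj". Returns 0 on success or an errno value. */
static int write_oom_file(const char *path, int value)
{
    char buf[16];
    int len = snprintf(buf, sizeof buf, "%d", value);
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return errno;               /* e.g. ENOENT if the file does not exist */
    /* The kernel validates the number inside write(): an out-of-range value
     * gives EINVAL, and lowering the value below the permitted minimum
     * without CAP_SYS_RESOURCE gives EACCES (the "error 13" in this issue). */
    ssize_t n = write(fd, buf, (size_t)len);
    int err = (n == (ssize_t)len) ? 0 : (n < 0 ? errno : EIO);
    close(fd);
    return err;
}
```

      Because oom_score_adj is inherited by threads and processes created afterwards, calling something like write_oom_file("/proc/self/oom_score_adj", -500) early in startup should also cover tasks created later; this is an assumption to verify with the per-task reads proposed in this issue.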

      --------------------------------------------------
      GLOBAL VARIABLE

      READING GLOBAL VARIABLE VALUE (SELECT @@global.linux_oom_adj)
      We should report ALL DISTINCT values from ALL TASKS of the main mysqld process <pid> (I tested with a child task and it works too):
      linux-oom_adj will report the DISTINCT values read (FREAD) from /proc/<pid>/task/*/oom_adj; if the file does not exist or we don't have permission to read it, it MUST report NULL.
      linux-oom_score_adj will report the DISTINCT values read (FREAD) from /proc/<pid>/task/*/oom_score_adj; if the file does not exist or we don't have permission to read it, it MUST report NULL.
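      A sketch of the read path, assuming the server scans /proc/self/task. read_oom_distinct is a hypothetical name; a return of -1 stands for the NULL result. A task that vanishes between readdir() and fopen() (a thread that exited) is skipped rather than treated as "file does not exist"; only when no file can be read at all is NULL reported. On recent kernels the value appears to be stored once per process, so in practice the set will usually contain a single value.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: read <proc_dir>/task/<tid>/<name> for every task and
 * store the distinct values in out[] (at most `cap`, extras are dropped).
 * Returns the number of distinct values, or -1 for NULL (no task directory,
 * permission denied, unparsable content, or no readable file at all). */
static int read_oom_distinct(const char *proc_dir, const char *name,
                             int *out, int cap)
{
    char path[512];
    int n = 0;

    snprintf(path, sizeof path, "%s/task", proc_dir);
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                        /* skip "." and ".." */
        snprintf(path, sizeof path, "%s/task/%s/%s", proc_dir, de->d_name, name);
        FILE *f = fopen(path, "r");
        if (f == NULL) {
            if (errno == ENOENT)
                continue;                    /* thread exited, or no such file */
            closedir(dir);
            return -1;                       /* e.g. EACCES: report NULL */
        }
        int v;
        int ok = fscanf(f, "%d", &v) == 1;
        fclose(f);
        if (!ok) {
            closedir(dir);
            return -1;
        }
        int seen = 0;
        for (int i = 0; i < n; i++)
            if (out[i] == v) {
                seen = 1;
                break;
            }
        if (!seen && n < cap)
            out[n++] = v;
    }
    closedir(dir);
    return n > 0 ? n : -1;                   /* nothing readable: NULL */
}
```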

      SETTING GLOBAL VARIABLE VALUE (SET @@global.linux_oom_adj=-17)
      linux-oom_adj will FWRITE to all /proc/<pid>/task/*/oom_adj files; if a file does not exist or we don't have permission, it MUST report a WARNING message (no permission - error 13).
      linux-oom_score_adj will FWRITE to all /proc/<pid>/task/*/oom_score_adj files; if a file does not exist or we don't have permission, it MUST report a WARNING message (no permission - error 13).
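      A sketch of the SET path as specified: write the value to every task file and hand the last errno back so the caller can raise the WARNING. write_oom_all_tasks is a hypothetical name and /proc/self is assumed to be the mysqld process. With stdio the kernel usually sees the data only when the stream is flushed, so the error check is done on fclose() as well as fprintf().

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: write `value` to <proc_dir>/task/<tid>/<name> for every
 * task (thread) of the process. `name` is "oom_score_adj" or "oom_adj".
 * Returns 0 if every write succeeded, otherwise the errno of the last
 * failure (e.g. EACCES = 13), which the caller turns into a WARNING. */
static int write_oom_all_tasks(const char *proc_dir, const char *name, int value)
{
    char path[512];
    int last_err = 0;

    snprintf(path, sizeof path, "%s/task", proc_dir);
    DIR *dir = opendir(path);
    if (dir == NULL)
        return errno;

    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (de->d_name[0] == '.')
            continue;                        /* skip "." and ".." */
        snprintf(path, sizeof path, "%s/task/%s/%s", proc_dir, de->d_name, name);
        FILE *f = fopen(path, "w");
        if (f == NULL) {
            last_err = errno;                /* missing file or no permission */
            continue;
        }
        if (fprintf(f, "%d", value) < 0)
            last_err = errno;
        if (fclose(f) != 0)                  /* buffered write reaches the kernel here */
            last_err = errno;
    }
    closedir(dir);
    return last_err;
}
```

      A caller could then emit something like "linux_oom_score_adj: error 13 (Permission denied)" using strerror() on the returned value.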

      TEST NOTES:
      Writing oom_adj sometimes doesn't change oom_score_adj as the docs describe: in my tests it only changed for values > 1 or < -1; writing 0, 1 or -1 all resulted in 0.
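      One possible explanation for this note is the integer scaling between the two files. The sketch below reproduces that scaling as I understand it from the kernel's fs/proc/base.c (please verify against the kernel source; the function names are hypothetical): writing oom_adj=1 stores oom_score_adj=1000/17=58, and reading oom_adj back gives 58*17/1000=0, so small values may look unchanged when read back through oom_adj.

```c
/* Constants from include/uapi/linux/oom.h. */
#define OOM_DISABLE       (-17)
#define OOM_ADJUST_MAX    15
#define OOM_SCORE_ADJ_MAX 1000

/* Legacy oom_adj value -> oom_score_adj, integer division truncates. */
static int oom_adj_to_score_adj(int oom_adj)
{
    if (oom_adj == OOM_ADJUST_MAX)
        return OOM_SCORE_ADJ_MAX;            /* keep the maximum attainable */
    return oom_adj * OOM_SCORE_ADJ_MAX / -OOM_DISABLE;
}

/* oom_score_adj -> value shown when reading the legacy oom_adj file. */
static int score_adj_to_oom_adj(int score_adj)
{
    if (score_adj == OOM_SCORE_ADJ_MAX)
        return OOM_ADJUST_MAX;
    return score_adj * -OOM_DISABLE / OOM_SCORE_ADJ_MAX;
}
```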

      Changing the main process sometimes changes the child tasks too, which is a bit confusing. Maybe changing all mysqld tasks is better; the problem is knowing which value to report in the MySQL variable when there are many threads (that's why we will report ALL DISTINCT VALUES FROM ALL TASKS).

      root@rspadim-Latitude:/proc/2398/task# ls
      *2398*  *_2400_*  2402  2403  2404  2406  2413  2414  2415  2416  2417  2418  2419  2420  2423

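      Here every oom_score_adj was initially 0. In this report I'm checking only 2398 (main) and 2400 (child); all other tasks (children) matched 2400. In each command below, the first number printed is oom_score of 2398 and the second is oom_score_adj of 2400.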

      root@rspadim-Latitude:/proc/2398# echo -n 200 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      200
      200
      root@rspadim-Latitude:/proc/2398# echo -n -200 > task/2400/oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -200
      root@rspadim-Latitude:/proc/2398# echo -n 200 > task/2400/oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      200
      200
      root@rspadim-Latitude:/proc/2398# echo -n -200 > task/2400/oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -200
      root@rspadim-Latitude:/proc/2398# echo -n 200 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      200
      200
      root@rspadim-Latitude:/proc/2398# echo -n -100 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -100
      root@rspadim-Latitude:/proc/2398# echo -n -150 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -150
      root@rspadim-Latitude:/proc/2398# echo -n -200 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -200
      root@rspadim-Latitude:/proc/2398# echo -n -500 > oom_score_adj; cat oom_score; cat task/2400/oom_score_adj 
      0
      -500

      Tested as the root user; we must check what happens (permissions) when the mysqld code uses fopen/fread/fwrite/fclose as a non-root user.


      Other open-source projects that implement it:

      Linux oom.h
      http://lxr.free-electrons.com/source/include/uapi/linux/oom.h#L8

      /*
       * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
       * pid.
       */
      #define OOM_SCORE_ADJ_MIN       (-1000)
      #define OOM_SCORE_ADJ_MAX       1000
      /*
       * /proc/<pid>/oom_adj set to -17 protects from the oom killer for legacy
       * purposes.
       */
      #define OOM_DISABLE (-17)
      /* inclusive */
      #define OOM_ADJUST_MIN (-16)
      #define OOM_ADJUST_MAX 15

      Chromium (the open-source project behind Google Chrome)

      http://src.chromium.org/chrome/trunk/src/base/process/memory_linux.cc
      Related include files:
      http://src.chromium.org/chrome/trunk/src/base/process/internal_linux.cc
      http://src.chromium.org/chrome/trunk/src/base/process/internal_linux.h

      // NOTE: This is not the only version of this function in the source:
      // the setuid sandbox (in process_util_linux.c, in the sandbox source)
      // also has its own C version.
      bool AdjustOOMScore(ProcessId process, int score) {
        if (score < 0 || score > kMaxOomScore)
          return false;
       
        FilePath oom_path(internal::GetProcPidDir(process));
       
        // Attempt to write the newer oom_score_adj file first.
        FilePath oom_file = oom_path.AppendASCII("oom_score_adj");
        if (PathExists(oom_file)) {
          std::string score_str = IntToString(score);
          DVLOG(1) << "Adjusting oom_score_adj of " << process << " to "
                   << score_str;
          int score_len = static_cast<int>(score_str.length());
          return (score_len == WriteFile(oom_file, score_str.c_str(), score_len));
        }
       
        // If the oom_score_adj file doesn't exist, then we write the old
        // style file and translate the oom_adj score to the range 0-15.
        oom_file = oom_path.AppendASCII("oom_adj");
        if (PathExists(oom_file)) {
          // Max score for the old oom_adj range.  Used for conversion of new
          // values to old values.
          const int kMaxOldOomScore = 15;
       
          int converted_score = score * kMaxOldOomScore / kMaxOomScore;
          std::string score_str = IntToString(converted_score);
          DVLOG(1) << "Adjusting oom_adj of " << process << " to " << score_str;
          int score_len = static_cast<int>(score_str.length());
          return (score_len == WriteFile(oom_file, score_str.c_str(), score_len));
        }
       
        return false;
      }


      Maybe later we should implement something similar for Windows too?
      http://src.chromium.org/chrome/trunk/src/base/process/memory_win.cc - void OnNoMemory(); / void EnableTerminationOnOutOfMemory();

            People

            • Assignee: Unassigned
              Reporter: rspadim roberto spadim
            • Votes: 1
              Watchers: 4
