MariaDB ColumnStore / MCOL-5815

Fix flame graph generation error (Stack count is low (0))

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: N/A
    • Component/s: burza
    • Labels: None

    Description

      When generating flame graphs, an error sometimes occurs:

      2024-10-23 13:19:27,136[D] - flame_graph - Chose representative entry for test case q11: PerfDataEntry(suite_name='tpc_h', case_name='q11', test_run_id=4, process_name='PrimProc', data_file_path='/home/vagrant/columnstore-tooling/burza/results/perf_data/q11_run4_PrimProc.data', test_run_duration=0.2790718078613281, labels={
      'scale_factor': 10, 'code_id': 'MariaDBEnterprise-bb-10.6.19-15-cs-23.02-perf-1-79efe6007c8611e7534211b7ad9f4378e9c10d4b', 'test_start_time': '2024-10-23T11:18:00+00:00', 'perf_freq': '500'},
      created_at='2024-10-23T11:19:15.830578+00:00')
      Stack count is low (0). Did something go wrong?
      ERROR: No stack counts found
      Exception in thread Thread-5 (_run_tests_in_thread):
      Traceback (most recent call last):
      File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
      self.run()
      File "/usr/lib/python3.10/threading.py", line 953, in run
      self._target(*self._args, **self._kwargs)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/run_control/sequential_test_runner.py", line 39, in _run_tests_in_thread
      self.pm.hook.after_test_case_teardown(suite_name=self.suite_name, case_name=test_case.name)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/hooks.py", line 513, in __call__
      return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
      return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
      raise exception.with_traceback(exception.__traceback__)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
      res = hook_impl.function(*args)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 270, in after_test_case_teardown
      flame_graph_path = self.generate_flame_graph(data_file_path, folded_file_path)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 313, in generate_flame_graph
      subprocess.run(flamegraph_cmd, stdout=flame_graph_fh, check=True)
      File "/usr/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['flamegraph.pl', '/tmp/q11_run4_PrimProc.folded']' returned non-zero exit status 2.
      

      Most often this happens at scale factor 1, so I initially assumed that perf collects too little data (some queries complete in about 0.1 seconds). However, the error also occurred at scale factor 100, so this needs investigation.

      How we collect data for flame graphs:

      Perf Data Generation:

      • The perf_events plugin must be enabled in DATA_POINT_GENERATORS.
      • For each test_case_run and each monitored process (by default only PrimProc), we launch perf. Since it needs to run as root, a special FIFO is used to manage the process.
      • Once the query completes, we also generate a metadata file recording the query duration (this matters later). A special hook is called to signal that perf.data has been generated.

      Flame Graph Generation:

      • The flame_graph plugin must be enabled in REPORT_GENERATORS.
      • We run each query multiple times to reduce noise and compute an average, so there are many test_case_runs. However, only one flame graph is needed per test case, and multiple flame graphs cannot be averaged into one. The plugin therefore collects all the perf.data files and metadata for that test case and selects the one considered most representative (currently the run whose duration is closest to the average). The flame graph is then generated from that file.
      • Several Perl scripts are run to produce the flame graph file, and this is where we discover that the chosen file is unusable. We may need to filter out such files earlier, before selecting the representative one, but in any case we need to understand why perf sometimes produces empty files.
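      The selection step, including the earlier filtering suggested above, could look roughly like this. This is a sketch: the `Entry` dataclass mirrors only a few PerfDataEntry fields visible in the log, and the `stack_count` pre-filter is a proposed change, not current behaviour:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    # Mirrors a subset of the PerfDataEntry fields shown in the log.
    case_name: str
    data_file_path: str
    test_run_duration: float
    stack_count: int  # proposed: stacks found when collapsing perf.data

def select_representative(entries):
    # Discard runs whose perf.data collapsed to zero stacks, so that
    # flamegraph.pl never receives an empty folded file.
    usable = [e for e in entries if e.stack_count > 0]
    if not usable:
        return None
    mean = sum(e.test_run_duration for e in usable) / len(usable)
    # The run whose duration is closest to the average is representative.
    return min(usable, key=lambda e: abs(e.test_run_duration - mean))
```

      Pre-computing `stack_count` (e.g. by counting lines produced by the stackcollapse step) would move the failure from flame graph generation to data selection, where a bad run can simply be skipped.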

      The easiest way to catch this problem is on scale 1.


      People

        alan.mologorsky Alan Mologorsky
        AlexanderPresniakov Alexander Presniakov
        Votes: 0
        Watchers: 1

