Details
Type: Task
Priority: Major
Status: Closed
Resolution: Fixed
Description
When generating flame graphs, an error sometimes occurs:
2024-10-23 13:19:27,136[D] - flame_graph - Chose representative entry for test case q11: PerfDataEntry(suite_name='tpc_h', case_name='q11', test_run_id=4, process_name='PrimProc', data_file_path='/home/vagrant/columnstore-tooling/burza/results/perf_data/q11_run4_PrimProc.data', test_run_duration=0.2790718078613281, labels={'scale_factor': 10, 'code_id': 'MariaDBEnterprise-bb-10.6.19-15-cs-23.02-perf-1-79efe6007c8611e7534211b7ad9f4378e9c10d4b', 'test_start_time': '2024-10-23T11:18:00+00:00', 'perf_freq': '500'}, created_at='2024-10-23T11:19:15.830578+00:00')
Stack count is low (0). Did something go wrong?
ERROR: No stack counts found
Exception in thread Thread-5 (_run_tests_in_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vagrant/columnstore-tooling/burza/burza/plugins/run_control/sequential_test_runner.py", line 39, in _run_tests_in_thread
    self.pm.hook.after_test_case_teardown(suite_name=self.suite_name, case_name=test_case.name)
  File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 270, in after_test_case_teardown
    flame_graph_path = self.generate_flame_graph(data_file_path, folded_file_path)
  File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 313, in generate_flame_graph
    subprocess.run(flamegraph_cmd, stdout=flame_graph_fh, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['flamegraph.pl', '/tmp/q11_run4_PrimProc.folded']' returned non-zero exit status 2.
Most often this happens at scale factor 1, so my first guess was that perf collects too few samples (some queries complete in about 0.1 seconds). But the error also reproduced at scale factor 100, so this needs investigation.
How we collect data for flame graphs:
Perf Data Generation:
• The perf_events plugin should be enabled in DATA_POINT_GENERATORS.
• For each test_case_run and each monitored process (by default only PrimProc), we launch perf. Since perf needs to run as root, a dedicated FIFO is used to manage the process.
• Once the query completes, we also write a metadata file that records the query duration (this will be important later). A dedicated hook is then called to signal that perf.data has been generated.
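The metadata step above can be sketched roughly as follows. This is an illustrative guess, not the plugin's actual API: the function name and JSON keys are hypothetical, chosen to mirror the PerfDataEntry fields visible in the log.

```python
import json

# Hypothetical sketch: record the query duration next to the perf.data file
# so the flame_graph plugin can later pick a representative run.
def write_run_metadata(data_file_path: str, duration_s: float, labels: dict) -> str:
    """Write <perf.data>.meta.json beside the perf data file (illustrative)."""
    meta_path = data_file_path + ".meta.json"
    with open(meta_path, "w") as fh:
        json.dump(
            {
                "data_file_path": data_file_path,
                "test_run_duration": duration_s,  # used later to select a representative run
                "labels": labels,
            },
            fh,
        )
    return meta_path
```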
Flame Graph Generation:
• The flame_graph plugin must be enabled in REPORT_GENERATORS.
• We run each query several times to average out noise, so there are many test_case_runs per test case. However, we need only one flame graph per test case, and multiple flame graphs cannot be averaged into one. The plugin therefore collects all the perf.data files and metadata for that test case and selects the run considered most representative (currently the one whose duration is closest to the average). The flame graph is generated from that file.
• Several Perl scripts are run to generate the flame graph file. This is the point where we discover that the chosen file is unusable. We may need to filter out such files earlier, before selecting the representative run, but in any case we need to understand why perf sometimes produces empty files.
The easiest way to reproduce the problem is at scale factor 1.
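Until the root cause is found, a guard like the following sketch could keep the worker thread from dying. It is not the plugin's real code; the function name is hypothetical. It relies only on what the traceback shows: flamegraph.pl printed "ERROR: No stack counts found" and exited with status 2 when given an empty folded file.

```python
import os
import subprocess

def generate_flame_graph_guarded(folded_path: str, svg_path: str) -> bool:
    """Run flamegraph.pl only if the folded file actually contains stacks.

    Sketch of the "filter earlier" idea; returns False (skip) instead of
    letting CalledProcessError propagate out of the hook and kill the thread.
    """
    if not os.path.exists(folded_path) or os.path.getsize(folded_path) == 0:
        return False  # empty capture: nothing flamegraph.pl can render
    with open(svg_path, "wb") as fh:
        subprocess.run(["flamegraph.pl", folded_path], stdout=fh, check=True)
    return True
```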