MariaDB ColumnStore / MCOL-5815

Fix flame graph generation error (Stack count is low (0))

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: N/A
    • Component/s: burza
    • Labels: None

    Description

      When generating flame graphs, an error sometimes occurs:

      2024-10-23 13:19:27,136[D] - flame_graph - Chose representative entry for test case q11: PerfDataEntry(suite_name='tpc_h', case_name='q11', test_run_id=4, process_name='PrimProc', data_file_path='/home/vagrant/columnstore-tooling/burza/results/perf_data/q11_run4_PrimProc.data', test_run_duration=0.2790718078613281, labels={
      'scale_factor': 10, 'code_id': 'MariaDBEnterprise-bb-10.6.19-15-cs-23.02-perf-1-79efe6007c8611e7534211b7ad9f4378e9c10d4b', 'test_start_time': '2024-10-23T11:18:00+00:00', 'perf_freq': '500'},
      created_at='2024-10-23T11:19:15.830578+00:00')
      Stack count is low (0). Did something go wrong?
      ERROR: No stack counts found
      Exception in thread Thread-5 (_run_tests_in_thread):
      Traceback (most recent call last):
      File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
      self.run()
      File "/usr/lib/python3.10/threading.py", line 953, in run
      self._target(*self._args, **self._kwargs)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/run_control/sequential_test_runner.py", line 39, in _run_tests_in_thread
      self.pm.hook.after_test_case_teardown(suite_name=self.suite_name, case_name=test_case.name)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/hooks.py", line 513, in __call__
      return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
      return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
      raise exception.with_traceback(exception.__traceback__)
      File "/home/vagrant/columnstore-tooling/burza/.venv/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
      res = hook_impl.function(*args)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 270, in after_test_case_teardown
      flame_graph_path = self.generate_flame_graph(data_file_path, folded_file_path)
      File "/home/vagrant/columnstore-tooling/burza/burza/plugins/report_generators/flame_graph.py", line 313, in generate_flame_graph
      subprocess.run(flamegraph_cmd, stdout=flame_graph_fh, check=True)
      File "/usr/lib/python3.10/subprocess.py", line 526, in run
      raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['flamegraph.pl', '/tmp/q11_run4_PrimProc.folded']' returned non-zero exit status 2.
      

      Most often this happens at scale factor 1, so I initially assumed that perf collects too little data (some queries complete in about 0.1 seconds). However, the error also occurred at scale factor 100, so this needs investigation.

      How we collect data for flame graphs:

      Perf Data Generation:

      • The perf_events plugin must be enabled in DATA_POINT_GENERATORS.
      • For each test_case_run and each monitored process (by default only PrimProc), we launch perf. Since it needs to run as root, a special FIFO is used to manage the process.
      • Once the query completes, we also generate a metadata file recording the query duration (this matters later). A special hook is called to signal that perf.data has been generated.

      Flame Graph Generation:

      • The flame_graph plugin must be enabled in REPORT_GENERATORS.
      • We run each query multiple times to reduce noise and compute an average, so there are many test_case_runs. However, only one flame graph is needed per test case, and multiple flame graphs cannot be averaged into one. The plugin therefore collects all the perf.data files and metadata for that test case and selects the one considered most representative (currently the run whose duration is closest to the average). The flame graph is then generated from that file.
      • Several Perl scripts are run to produce the flame graph file, and this is where we discover that the chosen file is unusable. We may need to filter out such files earlier, before selecting the representative one, but in any case we need to understand why perf sometimes produces empty files.
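      The selection step, including the earlier filtering suggested above, could look roughly like this. This is a sketch: the `Entry` dataclass mirrors only a few PerfDataEntry fields visible in the log, and the `stack_count` pre-filter is a proposed change, not current behaviour:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    # Mirrors a subset of the PerfDataEntry fields shown in the log.
    case_name: str
    data_file_path: str
    test_run_duration: float
    stack_count: int  # proposed: stacks found when collapsing perf.data

def select_representative(entries):
    # Discard runs whose perf.data collapsed to zero stacks, so that
    # flamegraph.pl never receives an empty folded file.
    usable = [e for e in entries if e.stack_count > 0]
    if not usable:
        return None
    mean = sum(e.test_run_duration for e in usable) / len(usable)
    # The run whose duration is closest to the average is representative.
    return min(usable, key=lambda e: abs(e.test_run_duration - mean))
```

      Pre-computing `stack_count` (e.g. by counting lines produced by the stackcollapse step) would move the failure from flame graph generation to data selection, where a bad run can simply be skipped.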

      The easiest way to catch this problem is on scale 1.


      People

        alan.mologorsky Alan Mologorsky
        AlexanderPresniakov Alexander Presniakov
        Votes: 0
        Watchers: 1

