Details
-
New Feature
-
Status: Open (View Workflow)
-
Critical
-
Resolution: Unresolved
-
Server 12.1 dev sprint
Description
Description:
As a database administrator, I want the sampling mechanism to capture detailed state information for each active session so that I can diagnose what each session was doing during performance issues.
Acceptance Criteria:
For each sampled session, collect the metrics listed in the example table below.
Include a wait_event_group metric derived from wait_event, with predefined groups such as 'CPU', 'IO', 'Lock', 'Network', 'Commit', 'Idle', to categorize waits for easier troubleshooting (e.g., identifying if slow performance is due to CPU usage, I/O delays, or commit operations).
For sessions actively executing (on CPU), set wait_event_group to 'CPU' and wait_event to 'on CPU'.
For sessions waiting, assign wait_event_group based on the type of wait event (e.g., 'IO' for I/O waits, 'Lock' for lock waits).
Ensure data collection is optimized to avoid significant performance overhead (e.g., reuse existing Performance Schema hooks where possible).
Example Metrics Table:
The table below outlines the metrics to collect for each sampled session to support ASH-like monitoring in MariaDB. This comprehensive list ensures detailed performance analysis, incorporating feedback about grouping wait events.
Column Name | Data Type | Description |
---|---|---|
sample_time | TIMESTAMP | Time when the sample was taken |
session_id | INT | Unique identifier for the session |
user | VARCHAR | Username associated with the session |
host | VARCHAR | Host from which the session is connected |
database | VARCHAR | Current database of the session |
connection_type | VARCHAR | Type of connection (e.g., TCP/IP, socket) |
connection_start_time | TIMESTAMP | Time when the session started |
current_sql | TEXT | SQL statement being executed |
sql_id | VARCHAR | Hash or ID of the SQL statement |
query_phase | VARCHAR | Current phase of query execution (e.g., parsing, executing) |
rows_processed | INT | Number of rows processed by the query |
temp_tables_used | INT | Number of temporary tables used |
wait_event | VARCHAR | Specific wait event (e.g., 'lock wait', 'I/O wait') |
wait_event_group | VARCHAR | Group of the wait event (e.g., 'CPU', 'IO', 'Lock') |
wait_time | FLOAT | Time spent waiting on the event (in seconds) |
event_specific_details | VARCHAR | Additional details about the wait event |
cpu_time | FLOAT | CPU time used by the session (in seconds) |
io_operations | INT | Number of I/O operations (reads/writes) |
memory_usage | INT | Memory used by the session (in bytes) |
network_usage | INT | Network bytes sent/received |
transaction_id | INT | Current transaction ID |
transaction_state | VARCHAR | State of the transaction (e.g., active, committed) |
transaction_start_time | TIMESTAMP | Start time of the transaction |
isolation_level | VARCHAR | Isolation level of the transaction |
locks_held | INT | Number of locks held by the session |
locks_waited_on | INT | Number of locks the session is waiting on |
blocking_session_id | INT | ID of the session blocking this one |
lock_wait_time | FLOAT | Time spent waiting for locks (in seconds) |
client_application | VARCHAR | Name of the client application |
optimizer_statistics | VARCHAR | Optimizer decisions (e.g., join type) |
error_flags | VARCHAR | Any error or warning flags |
thread_state | VARCHAR | State of the thread (e.g., executing, waiting, sleeping) |
priority | INT | Priority of the thread |
cpu_core_affinity | INT | CPU core affinity of the thread |
Notes on Wait Event Grouping:
'CPU': Assigned when the session is actively executing (e.g., wait_event = 'on CPU').
'IO': For I/O-related waits (e.g., wait/io/file/sql/binlog → 'IO'). Maps to User I/O or System I/O based on context.
'Lock': For lock-related waits (e.g., wait/synch/mutex/sql/LOCK_global_system_variables → 'Lock').
*'Network': *For network-related delays.
'Commit': For waits related to transaction commits (e.g., redo log or I/O during commit).
'Idle': For sessions in a sleep state (though typically not sampled unless configured otherwise).
This grouping aligns with general user feedback and pgsentinel’s approach, making it easier to troubleshoot performance bottlenecks by category.
Server Enhancement Requirements:
Instrumentation: Extend the MariaDB server’s instrumentation to collect additional metrics such as cpu_time, network_usage, lock_wait_time, and detailed wait event information. This may require adding new hooks or extending existing Performance Schema instruments.
Wait Event Grouping Logic: Implement logic to derive wait_event_group from wait_event. Initially, use a fixed mapping based on the wait event name prefix (e.g., 'wait/io' → 'IO', 'wait/synch/mutex' → 'Lock'). Future enhancements could use a configurable mapping table.
Optimization: Ensure the sampling mechanism efficiently gathers metrics by leveraging existing Performance Schema data (e.g., events_waits_current) and only calculates additional metrics when necessary.
Attachments
Issue Links
- blocks
-
MDEV-36407 Store Sampled Performance Data Efficiently for Performance Analysis
-
- Open
-
- is blocked by
-
MDEV-36404 Implement Session Sampling Mechanism for Performance Analysis
-
- Open
-