Now this is far from being any observability manual for your InnoDB Cluster and let alone go into everything MySQL Shell API Admin, or the collectDiagnostics utility. You can also use the default javascript commands that we all know and love via dba.getCluster() and so on, but here’s a different take.
I just want to share something I’ve been playing with to pull out some key info from mycluster. Hope it helps someone else out there.
General setup:
select cluster_id, cluster_name, description, cluster_type, primary_mode, clusterset_id
from mysql_innodb_cluster_metadata.clusters;
Members of our cluster:
select * from performance_schema.replication_group_members order by MEMBER_ROLE;
GR Applier & Recovery threads:
select CHANNEL_NAME, SERVICE_STATE, COUNT_RECEIVED_HEARTBEATS, RECEIVED_TRANSACTION_SET, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP, LAST_QUEUED_TRANSACTION, QUEUEING_TRANSACTION
from performance_schema.replication_connection_status \G
select CHANNEL_NAME, WORKER_ID, THREAD_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP, LAST_APPLIED_TRANSACTION, APPLYING_TRANSACTION, APPLYING_TRANSACTION_RETRIES_COUNT
from performance_schema.replication_applier_status_by_worker
order by CHANNEL_NAME,WORKER_ID,THREAD_ID ;
From o.s. command line
Check all cluster members have the same topology understanding (if not, we will see split-brain topology):
mysqlsh icadmin@rtnode1:3306 --sqlc --redirect-primary -e "select @@hostname,@@port;"
mysqlsh icadmin@rtnode2:3306 --sqlc --redirect-primary -e "select @@hostname,@@port;"
mysqlsh icadmin@dbnode1:3306 --sqlc --redirect-primary -e "select @@hostname,@@port;"
mysqlsh icadmin@dbnode2:3306 --sqlc --redirect-primary -e "select @@hostname,@@port;"
mysqlsh icadmin@dbnode3:3306 --sqlc --redirect-primary -e "select @@hostname,@@port;"
Now run each of the previous commands from the other 4 servers…
MySQL Routers, according to the cluster:
select router_name, product_name , address, version, last_check_in
from mysql_innodb_cluster_metadata.routers ;
MySQL Router state on each Router server:
ssh rtnode1
ps -ef | grep mysqlrouter
cat /var/lib/mysqlrouter/state.json
ssh rtnode2
ps -ef | grep mysqlrouter
cat /var/lib/mysqlrouter/state.json
Network testing
Make sure these are run from all servers to all servers;
netstat -l | grep mysql
tracepath -b -p 3306 dbnode1
Get the instances config:
select @@hostname,@@port, @@external_user, @@proxy_user,@@read_only,
@@pid_file,
@@pseudo_thread_id,
@@socket,
@@wait_timeout,
@@init_connect,
@@init_file,
@@init_replica,
@@innodb_status_output,
@@net_retry_count
\G
GR & Replica config / status:
select @@group_replication_group_name, @@group_replication_local_address, @@group_replication_group_seeds,
@@group_replication_communication_stack,
@@group_replication_flow_control_mode,
@@group_replication_flow_control_min_quota,
@@group_replication_flow_control_max_quota,
@@group_replication_communication_debug_options,
@@group_replication_autorejoin_tries,
@@replica_transaction_retries,
@@replica_checkpoint_period,
@@group_replication_exit_state_action,
@@group_replication_flow_control_period,
@@group_replication_unreachable_majority_timeout,
@@replica_net_timeout,
@@group_replication_transaction_size_limit,
@@replica_max_allowed_packet,
@@gtid_purged,
@@gtid_executed,
@@gtid_next,
@@replica_exec_mode,
@@replica_preserve_commit_order,
@@replica_type_conversions,
@@replication_sender_observe_commit_only
\G
Connection thread memory consumption (BG vs FG):
select type,sum(TOTAL_MEMORY)/1048576 "Memory(Mb)"
from performance_schema.threads
group by type;
Count of connection types:
select name, connection_type, count()
from performance_schema.threads
where connection_type is not null
group by name, connection_type
order by 3 desc ;
Some connection statistics:
select VARIABLE_NAME, VARIABLE_VALUE
from global_status
where VARIABLE_NAME like 'connect%';
Group Replication thread memory usage:
select name, type, sum(TOTAL_MEMORY)/1048576 "Memory(Mb)", count()
from performance_schema.threads
where name like 'thread/group_rpl%'
group by name, type order by 3 desc ;
Global thread memory usage:
select sum(TOTAL_MEMORY)/1048576 "Memory(Mb)", name, type, CONNECTION_TYPE
from performance_schema.threads
group by name, type, CONNECTION_TYPE
order by 1 desc;
Bytes sent/received per user:
select user, VARIABLE_NAME, VARIABLE_VALUE/1048576 from performance_schema.status_by_user where VARIABLE_NAME in ('Bytes_received','Bytes_sent') order by 3 asc;
Naughty users:
select user, VARIABLE_NAME, VARIABLE_VALUE
from performance_schema.status_by_user
where VARIABLE_NAME in ('Select_full_join','Select_scan','Slow_queries','Max_execution_time_exceeded')
and VARIABLE_VALUE > '0'
order by 1,2;
WITH RECURSIVE ratios (user, Misses, Hits) as (
SELECT user,
MAX(CASE WHEN variable_name='Table_open_cache_misses' THEN variable_value END) AS Misses,
MAX(CASE WHEN variable_name='Table_open_cache_hits' THEN variable_value END) AS Hits
FROM performance_schema.status_by_user
where variable_value > 0
and variable_value is not null
group by user
)
select user, Misses, Hits, round(ratios.Misses/ratios.Hits*100,2) as "Ratio%" from ratios order BY 4 asc;
Addition:
Whilst running some sysbench tests I wanted to see how the transactions were progressing from Primary to the Secondarys and also the time taken to apply each, so I feel I should add the following:
select a.CHANNEL_NAME, substr(b.MEMBER_HOST,1,instr(MEMBER_HOST,'.')-1) DBHost,
b.MEMBER_ROLE Role, b.MEMBER_STATE State,
a.COUNT_TRANSACTIONS_IN_QUEUE "TransQ",
a.COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE "TransRemQ"
from performance_schema.replication_group_member_stats a,
performance_schema.replication_group_members b
where a.MEMBER_ID=b.MEMBER_ID ;
select CHANNEL_NAME, SOURCE_UUID, THREAD_ID,
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP LQTransIMM,
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP LQTransOrig,
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP -
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP LQDiff,
QUEUEING_TRANSACTION, QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP -
QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP QTransDiff
from performance_schema.replication_connection_status ;
PS/ This might make it all easier: https://github.com/khollman5/MySQL-InnoDBCluster/blob/main/obs_mysql_cluster.sh