Advanced system recovery and maintenance
In this chapter, you will learn some advanced troubleshooting methods. Many of these procedures are executed on a senhasegura instance at the operating system layer. You can consult the manuals of all the technologies described here, but remember that the senhasegura development team has changed the default configuration of this software for a better and more secure experience. Do not change any parameter on your own. Call our Support team if needed.
senhasegura Cluster reconciliation
senhasegura uses MariaDB Galera Cluster as its database high-availability cluster technology. This section explains the unavailability scenarios and the steps to recover from them safely.
First, we will briefly explain how MariaDB Galera Cluster syncs data.
A brief explanation of the differences between SST and IST
Data reconciliation with SST (State Snapshot Transfer) is done by transferring a complete dataset from one cluster node to another.
In contrast, data reconciliation with IST (Incremental State Transfer) compares the transactions held by each node and transfers only the missing ones, instead of syncing the entire database.
In a controlled scenario, IST is preferable to SST because it performs better.
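The difference between the two transfers can be sketched as a simple decision. The function below is an illustrative model only, not Galera's actual logic: the name choose_transfer and its three arguments are assumptions made for this example, and the model ignores other conditions Galera checks, such as whether the joiner's state belongs to the same cluster.

```shell
#!/bin/sh
# Simplified model of the choice between IST and SST (not Galera's own code).
#   $1: last sequence number applied by the joiner
#   $2: oldest sequence number still cached by the donor
#   $3: last sequence number committed on the donor
choose_transfer() {
  joiner_seqno="$1"; donor_cached_from="$2"; donor_seqno="$3"
  if [ "$joiner_seqno" -ge "$donor_seqno" ]; then
    echo "NONE"   # nothing is missing
  elif [ $((joiner_seqno + 1)) -ge "$donor_cached_from" ]; then
    echo "IST"    # every missing transaction is still in the donor's cache
  else
    echo "SST"    # the gap is no longer cached: the full dataset is copied
  fi
}

choose_transfer 100 50 120   # prints IST
choose_transfer 10 50 120    # prints SST
```

In this model, the longer a node stays disconnected, the more likely it is that the transactions it needs have already left the donor's cache, which is why long outages end in an SST.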
Data reconciliation inside senhasegura
Usually, in scenarios of temporary interruption of data replication between cluster nodes with standard configurations, there is a tolerance of approximately 3 hours of interruption in which the cluster only needs an IST to resolve the reconciliation, that is, just sending the incremental data. In this case, no intervention is necessary since the cluster solves the problem of reconciliation automatically.
Longer outages usually require a complete data transfer (SST).
In most cases, the senhasegura cluster is resilient and intelligent enough to resolve the reconciliation by performing an SST automatically. Only in some cases is the intervention of the support team necessary to ensure the integrity of the data by performing an SST manually.
Manual intervention to perform an SST in the cluster
First step. Check the synchronization status by logging in to the database and checking the following status variables:
SHOW GLOBAL STATUS LIKE 'wsrep_connected';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
According to the official Galera Cluster documentation:
wsrep_connected: shows whether the node has network connectivity with any other nodes;
wsrep_cluster_status: shows the primary status of the cluster component that the node is in, which you can use in determining whether your cluster is experiencing a partition;
wsrep_local_state_comment: shows the node state in a human readable format;
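As a minimal sketch, the three values can be evaluated together. The helper names check_node and wsrep_value are hypothetical, and the sketch assumes that a node in good condition reports ON, Primary, and Synced, and that the mysql client can authenticate without a prompt.

```shell
#!/bin/sh
# check_node CONNECTED CLUSTER_STATUS LOCAL_STATE
# Prints "healthy" only when all three values match those of a synced node.
check_node() {
  if [ "$1" = "ON" ] && [ "$2" = "Primary" ] && [ "$3" = "Synced" ]; then
    echo "healthy"
  else
    echo "needs attention: connected=$1 cluster_status=$2 local_state=$3"
    return 1
  fi
}

# Hypothetical helper: read one status variable with the mysql client.
# -N drops the header row, -B prints tab-separated output.
wsrep_value() {
  mysql -N -B -e "SHOW GLOBAL STATUS LIKE '$1';" | cut -f2
}

# Example call on a node (commented out because it needs a running database):
# check_node "$(wsrep_value wsrep_connected)" \
#            "$(wsrep_value wsrep_cluster_status)" \
#            "$(wsrep_value wsrep_local_state_comment)"
```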
First steps at primary node (data donor)
Stop the MariaDB process;
sudo systemctl stop mariadb.service
Disable replication by changing the galera.cnf configuration file;
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to OFF;
Save the file and exit the editor;
Delete the old cluster control files;
sudo rm /var/lib/mysql/galera.cache
sudo rm /var/lib/mysql/grastate.dat
sudo rm /var/lib/mysql/multimaster.info
Start the MariaDB process;
sudo systemctl start mariadb.service
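The manual edit of wsrep_on described above can also be scripted. The following is a sketch under assumptions, not a senhasegura tool: the function name set_wsrep_on is hypothetical, GNU sed is assumed, and the parameter is assumed to appear as a single wsrep_on=... line in the file.

```shell
#!/bin/sh
# set_wsrep_on FILE ON|OFF
# Rewrites the wsrep_on line in FILE and keeps a copy in FILE.bak.
set_wsrep_on() {
  file="$1"; value="$2"
  case "$value" in
    ON|OFF) ;;
    *) echo "value must be ON or OFF" >&2; return 1 ;;
  esac
  cp -p "$file" "$file.bak"
  # Replace only the value, keeping the original spacing around "=".
  sed -i -E "s/^([[:space:]]*wsrep_on[[:space:]]*=[[:space:]]*).*/\1${value}/" "$file"
  # Print the resulting line; a non-zero status means no wsrep_on line exists.
  grep -E '^[[:space:]]*wsrep_on[[:space:]]*=' "$file"
}
```

On an instance, the function would need write access to the file, for example by running it from a root shell with /etc/mysql/conf.d/galera.cnf and OFF as arguments while MariaDB is stopped.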
First steps at the secondary node (joiner)
Stop the MariaDB process;
sudo systemctl stop mariadb.service
Rename the current database data folder to keep it as a backup;
sudo mv /var/lib/mysql /var/lib/mysql-$(date +%d%m%y%H%M)
Create a new database data folder;
sudo install -d /var/lib/mysql -o mysql -g mysql
Second steps at primary node (data donor)
Stop the MariaDB process;
sudo systemctl stop mariadb.service
Enable replication;
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to ON;
Save the file and exit the editor;
In another terminal, watch the database logs:
sudo tail -f /var/log/mysql/mysql-error.log
Recreate the cluster;
sudo galera_new_cluster
Wait for the complete initialization;
Second steps at secondary node (joiner)
Confirm that replication is enabled in the galera.cnf configuration file;
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to ON;
Save the file and exit the editor;
In another terminal, watch the database logs:
sudo tail -f /var/log/mysql/mysql-error.log
Start the MariaDB process;
sudo systemctl start mariadb.service
Check that the number of cluster members shown in the database log is correct (e.g., if there are 2 members, the message members = 2/2(joined/total) should be printed);
Check that the sync confirmation appears:
WSREP: Member 0.0 (vsrv-senhasegura-cert05) synced with group.
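The two log checks above can be combined into one small function. This is only a sketch: the name join_status is hypothetical, and the patterns are taken from the example messages in this section, so they may need adjusting if your log uses a different wording.

```shell
#!/bin/sh
# join_status LOGFILE EXPECTED_MEMBERS
# Prints "synced" when the log shows both the full member count
# and the "synced with group" confirmation.
join_status() {
  log="$1"; expected="$2"
  if grep -Eq "members = ${expected}/${expected} ?\(joined/total\)" "$log" &&
     grep -Eq "WSREP: Member .* synced with group" "$log"; then
    echo "synced"
  else
    echo "not synced yet"
    return 1
  fi
}
```

Because the log file also keeps messages from earlier starts, a match is only meaningful for lines written after the current start of the MariaDB process.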
Application status and services
All the services used by the senhasegura platform can be managed with the orbit command-line tool. This powerful tool has its own manual. Check out all available commands for a better senhasegura administration experience.
For now, we will explain the most common command sequences for restarting its basic services.
Restarting primary instance
A primary instance is an instance that centralizes the execution of all services. It is also used as the primary member of the cluster schema.
You can check how the instance is configured using the orbit status command.
To switch an instance to primary and activate it, use the following command sequence:
sudo orbit application stop
sudo orbit application master
sudo orbit application start
sudo orbit proxy fajita restart
sudo orbit proxy rdpgate restart
The orbit application stop and orbit application start commands will also restart the basic web server services, NGINX and PHP-FPM.
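If the sequence is run often, it can be wrapped so that it stops at the first failing step. The wrapper below is a sketch, not part of orbit: the function name promote_to_primary and the DRY_RUN variable are assumptions introduced for this example, and only the orbit commands listed above are used.

```shell
#!/bin/sh
# promote_to_primary: run the sequence in order, stopping at the first failure.
# Set DRY_RUN=1 to print the steps without executing them.
promote_to_primary() {
  for step in \
    "orbit application stop" \
    "orbit application master" \
    "orbit application start" \
    "orbit proxy fajita restart" \
    "orbit proxy rdpgate restart"
  do
    echo ">> $step"
    if [ "${DRY_RUN:-0}" != "1" ]; then
      # $step is intentionally unquoted so it splits into command and arguments.
      sudo $step || { echo "step failed: $step" >&2; return 1; }
    fi
  done
}
```

Running it first with DRY_RUN=1 shows the order of the steps without touching any service.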
Restarting Linux services
All services can be restarted using the orbit command interface.
Use the sudo orbit service command to restart a Linux service.
Pay close attention to the status of the following services. You can restart them yourself if a service stops unexpectedly.
nginx: Web server service. If restarted, also restart the php-fpm service;
php-fpm: PHP Wrapper service;
mariadb: Database service;
docker: Proxy isolation service;
wazuh-manager: HIDS service;
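A quick way to spot a stopped service is to ask systemd about each one. This sketch assumes the systemd unit names match the names listed above, which may not hold on every instance, and the function name report_stopped_services is hypothetical.

```shell
#!/bin/sh
# report_stopped_services: print each listed service that systemd
# reports as not active. Returns non-zero if at least one is not active.
report_stopped_services() {
  stopped=0
  for svc in nginx php-fpm mariadb docker wazuh-manager; do
    if ! systemctl is-active --quiet "$svc"; then
      echo "$svc is not active"
      stopped=1
    fi
  done
  return "$stopped"
}
```

The function only reports; restarting a service that it lists is still done with the orbit command interface.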
IP blocked by HIDS
If an IP has been blocked by the HIDS, you can unblock it using the orbit firewall command.
sudo orbit firewall --show
sudo orbit firewall unblock --host=[blocked IP]
Restarting cluster environment
In a cluster environment, you should restart or shut down instances in the right order to avoid problems.
Use the sudo orbit shutdown command on the cluster members, one instance at a time, waiting for each shutdown to complete before starting the process on another member.
This way, the remaining cluster members will understand that members are going down. Keep the primary node as the last one to be shut down.
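The ordering rule can be sketched as a small helper that puts the primary node at the end of the list. The function name shutdown_order and the node names are hypothetical, used only for illustration.

```shell
#!/bin/sh
# shutdown_order PRIMARY MEMBER...
# Prints the members in the order they should be shut down:
# every other member first, the primary node last.
shutdown_order() {
  primary="$1"; shift
  for node in "$@"; do
    if [ "$node" != "$primary" ]; then
      echo "$node"
    fi
  done
  echo "$primary"
}

shutdown_order node1 node1 node2 node3   # prints node2, node3, node1, one per line
```

The helper only produces the order; each sudo orbit shutdown still has to be run on the member itself and allowed to finish before moving to the next one.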
Orbini services and task execution
Orbini services are the senhasegura abstraction layer for services executed by senhasegura modules.
You can control their execution in the menu Settings ➔ Execution processes ➔ Processes.
Every process has an execution timeout configuration, and sometimes multiple processes can accumulate while waiting to be executed.
To understand why the oldest process is stuck in the task list, execute the process manually:
sudo orbit execution --code ID --verbose --debug