Version: 3.23

Advanced system recovery and maintenance

In this chapter, you will learn some advanced troubleshooting methods. Many of these procedures are executed on a senhasegura instance at the operating system layer. You can consult the manuals of all the technologies described here, but remember that the senhasegura development team has changed the default configuration of all this software to provide a better and more secure experience. Do not change any parameter on your own. Contact our Support team if needed.

senhasegura Cluster reconciliation

senhasegura uses MariaDB Galera Cluster as its database high-availability cluster technology. This section explains the unavailability scenarios and the steps to recover safely.

First, we will explain a little bit about how the MariaDB Galera Cluster syncs data.

A brief explanation of the differences between SST and IST

Data reconciliation with SST¹ is performed by transferring a complete dataset from one member node to another.

Data reconciliation with IST², in contrast, is performed by identifying the transactions missing on a node and syncing only that missing data, instead of syncing the entire database.

In a controlled scenario, it is better to use IST instead of SST for better performance.
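The difference can be illustrated with a small shell sketch. It is a simplified model of the choice made when a node rejoins, not Galera's actual implementation: the function name `choose_transfer` and its three arguments are hypothetical, and the sketch assumes that an IST is possible only while the donor's write-set cache still holds every transaction the joiner missed.

```shell
# Simplified model of the IST/SST decision (illustrative only).
#   $1 = last transaction sequence number applied by the joiner
#   $2 = oldest sequence number still kept in the donor's write-set cache
#   $3 = latest sequence number committed on the donor
choose_transfer() {
  joiner_seqno="$1"
  cache_first="$2"
  donor_seqno="$3"

  if [ "$joiner_seqno" -ge "$donor_seqno" ]; then
    echo "none"   # joiner is already up to date
  elif [ $((joiner_seqno + 1)) -ge "$cache_first" ]; then
    echo "IST"    # every missing transaction is still cached on the donor
  else
    echo "SST"    # part of the gap has left the cache: full copy needed
  fi
}

choose_transfer 950 900 1000   # prints IST
choose_transfer 100 900 1000   # prints SST
```

In the second call the joiner stopped at transaction 100, but the donor only keeps transactions from 900 onward, so the gap cannot be filled incrementally.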

Data reconciliation inside senhasegura

With the standard configuration, when data replication between cluster nodes is temporarily interrupted, the cluster usually tolerates approximately 3 hours of interruption during which only an IST, that is, sending just the incremental data, is needed to reconcile. In this case, no intervention is necessary because the cluster resolves the reconciliation automatically.

Longer outages usually require a complete data transfer (SST).

In most cases, the senhasegura cluster is resilient and intelligent enough to resolve the reconciliation by performing an SST automatically. Only in some cases is the intervention of the support team necessary to ensure the integrity of the data by performing an SST manually.

Manual intervention to perform an SST in the cluster

First step: check the synchronization status by logging in to the database and verifying the following control variables:

SHOW GLOBAL STATUS LIKE 'wsrep_connected'; 
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'; 
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'; 

According to the official Galera Cluster documentation:

wsrep_connected: shows whether the node has network connectivity with any other nodes;
wsrep_cluster_status: shows the primary status of the cluster component that the node is in, which you can use in determining whether your cluster is experiencing a partition;
wsrep_local_state_comment: shows the node state in a human readable format;
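As a sketch of how these three values can be read together, the helper below treats a node as healthy only when it reports `ON`, `Primary` and `Synced`, the values Galera shows for a connected, synced node in the primary component. The function names `wsrep_is_healthy` and `wsrep_value` are hypothetical, and `wsrep_value` assumes the `mysql` client can log in with the current user's credentials.

```shell
# Returns 0 when the three values describe a healthy, synced node.
#   $1 = wsrep_connected, $2 = wsrep_cluster_status,
#   $3 = wsrep_local_state_comment
wsrep_is_healthy() {
  connected="$1"
  cluster_status="$2"
  state_comment="$3"
  [ "$connected" = "ON" ] &&
    [ "$cluster_status" = "Primary" ] &&
    [ "$state_comment" = "Synced" ]
}

# Reads one status variable from the local server (assumes the mysql
# client is installed and credentials are already configured).
wsrep_value() {
  mysql -N -B -e "SHOW GLOBAL STATUS LIKE '$1'" | cut -f2
}

# Example usage on a cluster node:
#   wsrep_is_healthy "$(wsrep_value wsrep_connected)" \
#                    "$(wsrep_value wsrep_cluster_status)" \
#                    "$(wsrep_value wsrep_local_state_comment)" \
#     && echo "node is synced" || echo "node needs attention"
```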

First steps at primary node (data donor)

Stop the MariaDB process;
```
sudo systemctl stop mariadb.service 
```
Disable replication by changing the galera.cnf configuration file:
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to OFF;
Save the file and exit the editor;
Delete the old cluster control files;
sudo rm /var/lib/mysql/galera.cache;
sudo rm /var/lib/mysql/grastate.dat;
sudo rm /var/lib/mysql/multimaster.info;
Start the MariaDB process;
```
sudo systemctl start mariadb.service 
```
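The manual edit of the wsrep_on parameter can also be scripted. The following is a minimal sketch, not an official senhasegura tool: `set_wsrep_on` is a hypothetical helper, and it assumes the parameter is written on a single line in the form `wsrep_on=...` (spaces around `=` are tolerated). It keeps a `.bak` copy of the file before rewriting it.

```shell
# Sets wsrep_on to ON or OFF in the given file, keeping a .bak copy.
#   $1 = path to the configuration file, $2 = ON or OFF
set_wsrep_on() {
  cnf="$1"
  value="$2"
  case "$value" in
    ON|OFF) ;;
    *) echo "value must be ON or OFF" >&2; return 1 ;;
  esac
  # Refuse to continue if the parameter is not present in the file.
  grep -Eq '^[[:space:]]*wsrep_on[[:space:]]*=' "$cnf" || {
    echo "wsrep_on not found in $cnf" >&2
    return 1
  }
  cp -p "$cnf" "$cnf.bak" || return 1
  # Replace only the value, leaving the rest of the file untouched.
  sed -E "s/^([[:space:]]*wsrep_on[[:space:]]*=[[:space:]]*).*/\1$value/" \
    "$cnf.bak" > "$cnf"
}
```

On a node, it would be run as root with MariaDB stopped, for example `set_wsrep_on /etc/mysql/conf.d/galera.cnf OFF`.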

First steps at the secondary node (joiner)

Stop the MariaDB process;
```
sudo systemctl stop mariadb.service 
```
Rename the current database data folder to keep it as a backup;
```
sudo mv /var/lib/mysql /var/lib/mysql-$(date +%d%m%y%H%M) 
```

Create a new database data folder;

sudo install -d /var/lib/mysql -o mysql -g mysql
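The rename and re-create steps can be combined in a guarded helper. This is only a sketch under stated assumptions: `rotate_datadir` is a hypothetical name, and it uses plain `mkdir` so that it can be tried on a scratch directory. On a real node, the `install -d` command above, which also sets the mysql owner and group, is the one to use.

```shell
# Moves a data directory to a timestamped backup and re-creates it empty.
# Prints the backup path on success.
#   $1 = data directory
rotate_datadir() {
  datadir="$1"
  backup="${datadir}-$(date +%d%m%y%H%M)"

  if [ ! -d "$datadir" ]; then
    echo "missing directory: $datadir" >&2
    return 1
  fi
  # The timestamp has minute resolution, so avoid overwriting a backup
  # taken less than a minute ago.
  if [ -e "$backup" ]; then
    echo "backup already exists: $backup" >&2
    return 1
  fi
  mv "$datadir" "$backup" && mkdir "$datadir" && echo "$backup"
}
```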

Second steps at primary node (data donor)

Stop the MariaDB process;
```
sudo systemctl stop mariadb.service 
```
Enable the replication:
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to ON;
Save the file and exit the editor;
In another terminal, watch the database logs:
```
sudo tail -f /var/log/mysql/mysql-error.log
```
Recreate the cluster;
```
sudo galera_new_cluster 
```
Wait for the complete initialization;

Second steps at secondary node (joiner)

Confirm that replication is enabled in the galera.cnf configuration file:
Edit the /etc/mysql/conf.d/galera.cnf configuration file;
Locate the wsrep_on parameter and change its value to ON;
Save the file and exit the editor;
In another terminal, watch the database logs:
```
sudo tail -f /var/log/mysql/mysql-error.log
```
Start the MariaDB process;
```
sudo systemctl start mariadb.service 
```
Check in the database log that the number of cluster members is correct (e.g., if there are 2 members, the message members = 2/2(joined/total) should be printed);

Check that the sync confirmation appears:

WSREP: Member 0.0 (vsrv-senhasegura-cert05) synced with group.
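Both log checks can be sketched as a single shell function. `cluster_synced` is a hypothetical helper, and the patterns are taken from the two messages quoted above, so they are an assumption about the exact log wording. The function searches the whole file, which means messages left over from an earlier start would also match.

```shell
# Succeeds when the log shows all expected members joined and a sync message.
#   $1 = path to the database log, $2 = expected number of members
cluster_synced() {
  log="$1"
  expected="$2"
  # Accepts the message with or without a space before "(joined/total)".
  grep -Eq "members = $expected/$expected ?\(joined/total\)" "$log" &&
    grep -q "synced with group" "$log"
}

# Example usage on the joiner:
#   cluster_synced /var/log/mysql/mysql-error.log 2 && echo "SST finished"
```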

Application status and services

All the services used by the senhasegura platform can be managed through the orbit command line. This powerful tool has its own manual. Check out all the available commands for a better senhasegura administration experience.

For now, we will explain the most common command sequences for restarting its basic services.

Restarting primary instance

A primary instance is the instance that centralizes the execution of all services. It is also used as the primary member of the cluster.

You can check how the instance is configured using the orbit status command.

To switch an instance to primary and activate it, use the following command sequence to ensure correct operation:

sudo orbit application stop;
sudo orbit application master;
sudo orbit application start;
sudo orbit proxy fajita restart;
sudo orbit proxy rdpgate restart;

The orbit application stop and orbit application start commands also restart the basic web server services NGINX and PHP-FPM.
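Because the order of these commands matters, the sequence can be wrapped so that it stops at the first failure. The sketch below is one possible way to do it, not part of orbit: `run` and `promote_to_primary` are hypothetical names, and the `DRY_RUN` variable is an assumption added here so that the sequence can be reviewed without executing anything.

```shell
# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Executes the primary-switch sequence, stopping at the first failure.
promote_to_primary() {
  run sudo orbit application stop &&
    run sudo orbit application master &&
    run sudo orbit application start &&
    run sudo orbit proxy fajita restart &&
    run sudo orbit proxy rdpgate restart
}

# Example usage:
#   DRY_RUN=1 promote_to_primary   # prints the five commands only
#   promote_to_primary             # executes them in order
```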

Restarting Linux services

All services can be restarted using the orbit command interface.

Use the sudo orbit service command to restart a Linux service.

Pay close attention to the status of the following services. You can restart them yourself if a service stops unexpectedly.

nginx: Web server service. If it is restarted, restart the php-fpm service as well;
php-fpm: PHP Wrapper service;
mariadb: Database service;
docker: Proxy isolation service;
wazuh-manager: HIDS service;
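A quick way to see which of these services is down is to ask systemd directly. The sketch below assumes that the systemd unit names match the names in the list above, which may not hold on every instance (for example, the PHP-FPM unit may carry a version number). The names `is_active` and `report_inactive` are hypothetical.

```shell
# Services from the list above; unit names are an assumption.
SERVICES="nginx php-fpm mariadb docker wazuh-manager"

# Succeeds when systemd reports the unit as active.
is_active() {
  systemctl is-active --quiet "$1"
}

# Prints the name of every listed service that is not active.
report_inactive() {
  for svc in $SERVICES; do
    if ! is_active "$svc"; then
      echo "$svc"
    fi
  done
}

# Example usage:
#   report_inactive   # prints nothing when all services are running
```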

IP blocked by HIDS

If an IP has been blocked by the HIDS, you can unblock it using the orbit firewall command.

sudo orbit firewall --show
sudo orbit firewall unblock --host=[blocked IP]
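Since the address is typed by hand, it can be worth validating it before passing it to orbit. This is a minimal sketch with hypothetical helper names (`is_ipv4`, `unblock_host`). It assumes the blocked host is an IPv4 address and that the long option is spelled with a double hyphen, as in `--host=`.

```shell
# Succeeds when $1 is a dotted IPv4 address with four octets in 0-255.
is_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF != 4 { bad = 1 }
    {
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[0-9]+$/ || length($i) > 3 || $i > 255) bad = 1
    }
    END { exit (bad || NR != 1) }'
}

# Unblocks the host only when the address is well formed.
unblock_host() {
  if is_ipv4 "$1"; then
    sudo orbit firewall unblock --host="$1"
  else
    echo "not a valid IPv4 address: $1" >&2
    return 1
  fi
}

# Example usage:
#   unblock_host 192.168.10.25
```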

Restarting cluster environment

In a cluster environment, you should restart or shut down instances in the right order to avoid problems.

Run sudo orbit shutdown on the cluster members one instance at a time, waiting for each shutdown to complete before starting the process on another member.

This way, the remaining cluster members will understand that members are going down. Keep the primary node as the last one to be shut down.
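The ordering rule can be sketched as a small helper that prints the members in the order in which they should be shut down. `shutdown_order` and the node names in the example are hypothetical. The function only computes the order and leaves running `sudo orbit shutdown` on each node, one at a time, to the operator.

```shell
# Prints the nodes in shutdown order: secondaries first, primary last.
#   $1 = primary node, remaining arguments = all cluster members
shutdown_order() {
  primary="$1"
  shift
  for node in "$@"; do
    if [ "$node" != "$primary" ]; then
      echo "$node"
    fi
  done
  echo "$primary"
}

# Example usage: node1 is the primary of a three-node cluster.
shutdown_order node1 node1 node2 node3
```

For the example call, the printed order is node2, node3 and then node1.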

Orbini services and task execution

Orbini services are the senhasegura abstraction layer for services executed by senhasegura modules.

You can control their execution in the menu Settings ➔ Execution processes ➔ Processes.

Every process has an execution timeout configuration, and sometimes multiple processes can accumulate while waiting to be executed.

To understand why the oldest process is stuck in the task list, execute the process manually:

sudo orbit execution --code ID --verbose --debug

¹ State Snapshot Transfer
² Incremental State Transfer
