db1081, currently the s4 (commonswiki) primary master, is on the list of hosts that might suffer a BBU crash at any time (T258386).
We need to promote db1138 to primary master in its place.
When: Tue 26th January 07:00 AM UTC - 07:15 AM UTC
Checklist:
- Restart db1138 to pick up report_host T271106
- Create a task to communicate the chosen date and send an announcement to the community
NEW master: db1138
OLD master: db1081
- Check configuration differences between new and old master
pt-config-diff h=db1081.eqiad.wmnet,F=/root/.my.cnf h=db1138.eqiad.wmnet,F=/root/.my.cnf
- Silence alerts on all hosts
- Set NEW master with weight 0 in s4
dbctl instance db1138 edit
dbctl config commit -m "Set db1138 with weight 0 T271427"
- Topology changes, connect everything to db1138
db-switchover --timeout=15 --only-slave-move db1081.eqiad.wmnet db1138.eqiad.wmnet
- Disable puppet @db1138 and @db1081
puppet agent --disable "switchover to db1138"
- Merge gerrit puppet change to promote db1138: https://gerrit.wikimedia.org/r/c/operations/puppet/+/658211/
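The configuration comparison in the checklist above could be wrapped as follows. This is only a sketch: it relies on pt-config-diff's documented exit status (0 when there are no differences, 1 when there are), and the PT_CONFIG_DIFF override variable and the function name are hypothetical, added here so the wrapper can be dry-run without touching the databases.

```shell
# Sketch only: wraps the pt-config-diff call from the checklist.
# PT_CONFIG_DIFF is a hypothetical override, useful for dry runs.
compare_master_config() {
    old_host="$1"
    new_host="$2"
    rc=0
    ${PT_CONFIG_DIFF:-pt-config-diff} \
        "h=${old_host},F=/root/.my.cnf" \
        "h=${new_host},F=/root/.my.cnf" || rc=$?
    case "$rc" in
        0) echo "OK: no configuration differences" ;;
        1) echo "REVIEW: configuration differs between ${old_host} and ${new_host}" ;;
        *) echo "ERROR: pt-config-diff failed with exit status ${rc}" >&2
           return 2 ;;
    esac
    return "$rc"
}

# Usage (not executed here):
#   compare_master_config db1081.eqiad.wmnet db1138.eqiad.wmnet
```

A non-zero return lets the operator stop before the read-only window rather than discover a difference during it.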
Failover:
- Start the failover
!log Starting s4 eqiad failover from db1081 to db1138 - T271427
- Read only on s4
dbctl --scope eqiad section s4 ro "Maintenance till 07:15 AM UTC T271427" && dbctl config commit -m "Set s4 as read-only for maintenance T271427"
- Check s4 is indeed read-only
- Run switchover script from cumin1001:
db-switchover --skip-slave-move db1081 db1138 ; echo db1081; mysql.py -hdb1081 -e "show slave status\G" ; echo db1138 ; mysql.py -hdb1138 -e "show slave status\G"
- Promote db1138 as new master and remove read-only
dbctl --scope eqiad section s4 set-master db1138 && dbctl --scope eqiad section s4 rw && dbctl config commit -m "Promote db1138 to s4 master and remove read-only from s4 T271427"
- Restart puppet on old and new masters (for heartbeat): db1138 and db1081
run-puppet-agent -e "switchover to db1138"
- Give weight to db1081 in s4
dbctl instance db1081 edit
- left depooled
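The switchover step above prints "show slave status\G" for both hosts so the result can be checked by eye. A small helper along these lines could pull out the relevant field instead. It is a sketch under assumptions: the function names and the MYSQL_CMD override are hypothetical, and the expectation that db1081 reports db1138 as Master_Host while db1138 reports no replication at all is an assumption about the intended end state, not something stated by the tooling.

```shell
# Print the value of one field from "SHOW SLAVE STATUS\G" output on stdin.
slave_status_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)      # \G output right-aligns field names
            sep = index(line, ":")
            if (sep > 0 && substr(line, 1, sep - 1) == key) {
                value = substr(line, sep + 1)
                sub(/^[ \t]+/, "", value)
                print value
                exit
            }
        }'
}

# Report which master a host replicates from, or "(none)" if the host
# returns no slave status. MYSQL_CMD is a hypothetical override.
replication_source() {
    status="$(${MYSQL_CMD:-mysql.py} -h"$1" -e 'show slave status\G')" || return 2
    master="$(printf '%s\n' "$status" | slave_status_field Master_Host)"
    echo "${master:-(none)}"
}

# Usage (not executed here):
#   replication_source db1081
#   replication_source db1138
```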
Clean up tasks:
- change events for query killer:
events_coredb_master.sql on the new master db1138
events_coredb_slave.sql on the new slave db1081
- Update DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/658213/
- Update candidate master dbctl notes and pick new candidate master: db1081
dbctl instance db1138 set-candidate-master --section s4 false
dbctl instance db1081 set-candidate-master --section s4 true
- Check tendril was updated
- Check zarcillo was updated
- Had to be done manually: https://phabricator.wikimedia.org/P13956
- Update/resolve phabricator ticket about failover
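The query killer events step in the clean up list could be scripted roughly as below. This is a sketch, not the established procedure: it assumes mysql.py passes SQL from standard input through to the server, and EVENTS_DIR, MYSQL_CMD and both function names are hypothetical. Only the two file names come from the checklist.

```shell
# Map a replication role to the query-killer events file named in the checklist.
events_file_for_role() {
    case "$1" in
        master) echo "events_coredb_master.sql" ;;
        slave)  echo "events_coredb_slave.sql" ;;
        *)      echo "unknown role: $1" >&2
                return 1 ;;
    esac
}

# Load the matching events file into a host.
# EVENTS_DIR is a hypothetical location for the .sql files (default: current dir).
load_events() {
    host="$1"
    file="$(events_file_for_role "$2")" || return 1
    ${MYSQL_CMD:-mysql.py} -h"$host" < "${EVENTS_DIR:-.}/${file}"
}

# Usage (not executed here):
#   load_events db1138 master
#   load_events db1081 slave
```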