14:00 site outage was reported to another team
14:09 the other team could not solve the problem and reported to my team leader that the website was down
I was interviewing a candidate when I received a call from the team leader that the website was down.
Came back to my desk and found the website was inaccessible, returning 504 Gateway Timeout.
First consideration was the backup plan: switch to the backup version.
SA reported that nginx -s reload was taking a very long time.
Checked the server logs, which showed that a Redis instance could not be reached.
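(For reference, a reachability probe like the sketch below is enough to confirm whether a Redis instance answers at all; the host, port, and timeout are placeholders, not the actual values from the incident.)

```python
# Minimal Redis reachability probe (sketch; host/port/timeout are assumed values).
import redis

def redis_alive(host, port=6379, timeout=2.0):
    client = redis.Redis(host=host, port=port,
                         socket_connect_timeout=timeout,
                         socket_timeout=timeout)
    try:
        return client.ping()  # True if the instance answers PONG
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        return False

if __name__ == "__main__":
    print(redis_alive("redis.internal"))  # hypothetical hostname
```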
Someone reported that the 10.10 network segment was unreachable, and we found that the IDC was down.
Started a local test server to isolate the impact.
Found that two functions were affected by this issue.
14:16 decided to temporarily remove the affected functions
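Temporarily removing features like this is essentially a kill switch. The sketch below is only illustrative; the flag names and how they are stored are assumptions, not what we actually used.

```python
# Illustrative feature kill switch (sketch; feature names are hypothetical).
DISABLED_FEATURES = {"feature_a", "feature_b"}  # stand-ins for the two affected functions

def feature_enabled(name):
    return name not in DISABLED_FEATURES

def render_page():
    sections = ["core_content"]
    if feature_enabled("feature_a"):   # skipped while its backing service is down
        sections.append("feature_a_widget")
    if feature_enabled("feature_b"):
        sections.append("feature_b_widget")
    return sections
```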
14:24 incident report received from TECH-NO:
Incident: power outage in rows 7-12 of Beijing Telecom Zhaowei IDC2-3 server room
Time: May 19, 2015, 13:54 - not yet recovered
Impact: all public and private network access for servers in rows 7-12 of Beijing Telecom Zhaowei IDC2-3 is interrupted
Cause: on-site inspection shows a power outage on the carrier side
14:25 packaged the website and began deployment
14:27 package deployed, but the version switch (nginx reload) got stuck.
14:30 nginx finally reloaded and the website began to recover; part of the site was still unavailable because another service was still down.
14:37 deployed the second part of the patch
14:40 began fixing the remaining parts of the website.
-- re-configured the internal data service to point at an available Redis instance (see the sketch after this list).
-- another part of the website recovered
-- utsf is down; timeline to be filled in
-- some Redis instances recovered, so we re-deployed the old version.
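Re-pointing the data service boiled down to choosing a Redis endpoint that still answered. A minimal sketch of that failover idea, with hypothetical endpoint names and timeouts:

```python
# Pick the first reachable Redis from a candidate list (sketch; endpoints are hypothetical).
import redis

CANDIDATES = [("redis-idc2.internal", 6379), ("redis-idc1.internal", 6379)]

def pick_redis(candidates, timeout=1.0):
    for host, port in candidates:
        client = redis.Redis(host=host, port=port,
                             socket_connect_timeout=timeout,
                             socket_timeout=timeout)
        try:
            client.ping()
            return client      # first endpoint that answers wins
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            continue           # dead endpoint (e.g. in the powered-off IDC), try the next
    return None                # nothing reachable; caller must degrade gracefully
```

In practice this was a configuration change rather than new code, but the ordering of candidates is the same idea.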
17:00 found that some parts of the website were not the same as the rest, though everything else was working fine.
17:04 re-deployed, switched servers, and everything worked
17:08 found another issue with the same symptom.
17:16 isolated that issue and found that one nginx server had been newly added to the site, and the server-switch operation had not covered that server.
17:25 updated the deploy script to include that server.
-- since then, everything has been okay.
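The 17:08 issue came down to the switch/deploy step being driven by a fixed server list that did not include the newly added nginx box. A rough sketch of that failure mode, with hypothetical host names and a placeholder deploy command:

```python
# Sketch of a list-driven rollout; any host missing from SERVERS is silently skipped.
import subprocess

SERVERS = ["web-01", "web-02", "web-03"]  # hypothetical; the new nginx box was absent from this list

def switch_version(host, release):
    # Placeholder for the real switch step (e.g. update a symlink, then reload nginx).
    subprocess.run(["ssh", host, f"deploy --release {release}"], check=True)

def rollout(release):
    for host in SERVERS:
        switch_version(host, release)
```

Keeping the host list in one inventory source instead of duplicating it per script is one way to avoid this kind of drift.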
so, here are the resolutions:
1. allow services to be unavailable; when one is, degrade gracefully and fill in empty data (see the sketch after this list).
2. replicate important services, and mark failed replicas so they are left untouched.
3. unit/cell-ize services across IDCs, which boosts speed and also provides HA for the mainline service.
4. reporting flow improvement. TBD.
5. server partitioning
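For point 1, the idea is that a data accessor bounds how long it waits on a backing service and falls back to empty data instead of letting requests pile up into 504s. A minimal sketch, assuming a Redis-backed lookup with invented key names, host, and timeouts:

```python
# Graceful-degradation sketch for point 1 (key names, host, and timeouts are assumptions).
import json
import redis

client = redis.Redis(host="redis.internal", port=6379,
                     socket_connect_timeout=0.5, socket_timeout=0.5)

def get_widget_data(widget_id):
    """Return the widget payload, or empty data if the backing store is unreachable."""
    try:
        raw = client.get(f"widget:{widget_id}")
        return json.loads(raw) if raw else {}
    except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
        return {}  # degrade to empty data; the page renders without this widget
```

The request then fails fast and locally; the page renders without that widget instead of the whole site hanging until nginx times out.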