Wiki/Operation/Continuity.md

Operational concerns are as important as development or security ones. A well-layered approach to operational continuity allows services to keep running while the network operators follow the incident response guidelines. Users should understand the posture outlined here so they know how an incident will affect them.

## Self-recovering Services

Ideally, services should be designed & developed to recover themselves. This is the second-best option, after not having problems in the first place. Tools like systemd can restart failed services, and monitoring can identify & remediate issues. This kind of automation can sort out problems before users or admins ever know about them.
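
As a small illustration of the systemd piece, a drop-in override can tell a unit to restart itself after a crash; the service name below is hypothetical, not a specific AniNIX unit.

```ini
# /etc/systemd/system/example.service.d/restart.conf
# Hypothetical drop-in: restart the service automatically if it exits abnormally.
[Service]
Restart=on-failure
RestartSec=5s
```

After adding the drop-in, `systemctl daemon-reload` makes systemd pick up the change.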

## High Availability & Geodiversity

High availability keeps an intermittently failing node from taking the service down with it. If one node fails, traffic gets routed to the next one, and admins can get the notification and sort the problem out before users ever see an issue. Tools for this live in webservers or in appliances like F5 load balancers.
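
As one example of doing this in the webserver layer (not necessarily how AniNIX terminates traffic), nginx can proxy one service name to several backend nodes and skip a node that stops responding; the hostnames and port below are hypothetical.

```nginx
# Hypothetical nginx reverse-proxy sketch: two backend nodes behind one service name.
# If node1 stops answering, requests go to node2 instead. TLS omitted for brevity.
upstream app_backend {
    server node1.internal:8080 max_fails=3 fail_timeout=30s;
    server node2.internal:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name service.example.net;

    location / {
        proxy_pass http://app_backend;
    }
}
```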

Geodiversity provides resilience against environmental issues. It needs tools like round-robin DNS or eBGP to advertise the fallback sites, but if an ISP suffers a line cut or a site endures a natural disaster (or planned maintenance), traffic will fail over to the next site. Geodiversity is also a cost decision, and the deployment needs to pick a cost model:

  • If each site can handle peak load on its own, the organization is wasting compute & power that does no work during normal operation.
  • If each site can handle median load, peaks get handled by both sites together, and some cost is saved during normal operation.
  • If both sites are needed to handle peak activity, the service will degrade during an event, but this is the most fiscally conservative option.

Don't design services to only handle median load.
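
A minimal sketch of the round-robin DNS piece, using a hypothetical name and documentation-range addresses: publishing one name with an A record per site spreads clients across both, and a low TTL lets the record for a failed site be withdrawn quickly during an event.

```
; Hypothetical BIND zone fragment: one service name, one A record per site.
$TTL 300
www     IN  A   203.0.113.10    ; site A
www     IN  A   198.51.100.10   ; site B
```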

This option is not currently available to us, as we don't have a second site for peering.

## Disaster Recovery

Disaster recovery is responding to severe issues that can't be caught by the prior two layers. It includes options like Infrastructure-as-Code, backups, and AniNIX/Aether, which provide various ways to rebuild services during an event. DR procedures are critical for recovering from ransomware.
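
As a minimal sketch of the backup piece (not AniNIX/Aether's actual implementation), a scheduled rsync job can keep a restorable copy of a service's state on separate storage; the paths and host below are placeholders.

```bash
#!/bin/bash
# Hypothetical nightly backup: mirror a service's state to a separate backup host.
# Paths and hostname are placeholders, not AniNIX/Aether's real configuration.
set -euo pipefail

SOURCE_DIR="/srv/service-data/"
BACKUP_TARGET="backup-host:/srv/backups/service-data/"

# --archive preserves permissions & timestamps; --delete mirrors removals.
rsync --archive --delete "$SOURCE_DIR" "$BACKUP_TARGET"
```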

## Business Continuity

Business continuity is perhaps the most critical layer for AniNIX operations, since it gives users the best options when an issue takes long enough to resolve that they will notice. AniNIX/Yggdrasil, AniNIX/Foundation, and AniNIX/Singularity offer offline options so that users can keep using the content when the services aren't available. Other services, like AniNIX/WolfPack or AniNIX/Maat, are conveniences; if they aren't available, users can wait before using them. Discord currently provides our fallback for IRC.

Core business continuity procedures:

  • Maintain local clones of any AniNIX/Foundation projects you're working on (see the sketch after this list).
  • Use the "Download Media" option in the Emby web interface for AniNIX/Yggdrasil.
  • AniNIX/Singularity's TT-RSS mobile app has a "work offline" feature that lets the user look through the last set of articles the app downloaded.
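
A minimal sketch of the first item, assuming a hypothetical repository URL; substitute the clone URL shown on the project's AniNIX/Foundation page.

```bash
# Keep a local, fully functional clone of a Foundation project you work on.
# The URL below is a placeholder; use the clone URL from the project's page.
git clone https://foundation.example/AniNIX/SomeProject.git

# Refresh it periodically so the offline copy stays current.
cd SomeProject && git pull
```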