<!-- .slide: data-background="https://upload.wikimedia.org/wikipedia/commons/9/98/Burned_laptop_secumem_11.jpg"-->

# <b><u>Infra</u></b>structure Review

<p style="margin-top: 360px; font-size: 18px; color: #333">
https://upload.wikimedia.org/wikipedia/commons/9/98/Burned_laptop_secumem_11.jpg
</p>

---

## Bare Metal

- ~**9 TB** RAM
- ~**1700** CPU cores
- **1** dead SSD
- **5** dead RAID controllers, most replaced with new servers
- **100%** uptime

----

- (42) hours spent staring at enterprise server boot screens
    + up to 20 minutes reboot time, thank you HP
    + we now have Certified Enterprise Observers!
- ~27% of visitors used IPv6, less than the ~32% Google publishes
    + We aimed for full IPv6 coverage

---

<!-- .slide: data-background="https://i.imgur.com/ireU0VU.jpg"-->

<p style="margin-top: 450px; font-size: 18px; color: #333">
Before the event... And it was DNS. At least once!
</p>

---

## OS

<!-- count(count by (instance) (up{instance!~"(.*at\\.rc3\\.world.*)|(.*\\.infra\\.run.*)|(.*hub\\.rc3\\.world.*)|(.*dereferrer\\.rc3\\.world.*)|(deployer\\.rc3\\.world.*)"} == 1)) -->

~300 nodes provisioned

Ansible-powered madness

----

+ Full disk encryption on all nodes
+ No IPs logged in access logs
+ Minimal logging wherever possible
+ Even piped some logs to /dev/null because software wouldn't stop logging IPs
+ No personal data, no GDPR headache!
Your data is safe (TM)
+ Magical deployment that debootstraps a running system and assimilates it into the rC3 infrastructure hive

----

+ When deployment was broken, it was almost always due to network and/or routing

<div style="margin: 20px; margin-bottom: 60px; color: #1889DD; padding: 4px">
<div style="display: inline-block; border: 3px solid #1889DD; border-radius: 40px; width: 60px; height: 60px; margin-right: 8px">
i
</div>
Official network team stated that this is false and misleading
</div>

+ One time, deployment broke because of a trigger-happy infra angel

----

+ Demand was even greater than expected, with some particularly loyal users sending us millions of requests per second for a while :heart:
+ ddos24.net gladly helped here
+ We quickly provisioned extra infra to satisfy this extraordinary demand.
+ Even during times of such unprecedented demand, most services were still available.
+ Props to VOC for fast deployment on provisioned nodes.
+ Amazing Time to Market (TM)

----

## Locations

6 locations

6 wildly different special snowflakes

----

### DUS

816 CPU cores

2 TB RAM

10 Gbps interconnect (sadly we could not use the 1 Tbps InfiniBand connections)

Weird and ancient IPMI

An admin who has never deployed bare metal to a data center

Maximum heat, one server, 7U. Over 9000 watts of power :fire: (11,600 W to be exact)

----

### FRA

620 Gbps total uplink capacity

22 Gbps / 12 Mpps peak during "peak demand", thanks to our premium sponsor ddos24.net

Zero network congestion

1.5 Gbps of which were IPv6

Sadly no traffic challenge

Full-L3 architecture, MPLS between the WAN routers

Night shift on the 26th-27th
for more servers, because shipments didn't arrive

----

Maximum bandwidth, some servers had 50 Gbit/s

Maximum manual intervention

0 oversubscriptions

----

### STR

Easy deployment

Night shift for server deployments

Thanks to neuner & team

Most silent DC, no incident from Day -5 till now

----

### WOB

Smallest DC: we only had three servers here and killed one HW RAID controller, so we could only use two of them. :cry:

----

### HAM

Minimum uptime, never deployed this DC because of broken netboot

----

### Fun deployment facts

- Received a Covid warning from a data center, luckily it didn't affect us
- Team lead of a sponsor needs to install Proxmox within DC without any clue what he's doing
- We installed Proxmox within HAM DC and no server wanted to talk to us
- "I will take care of this after I finished towing a lorry." - neuner

---

## Jitsi

Peak user count **1105** <!-- sum(max_over_time(jicofo_participants[4d])) -->

Peak conference count **204** <!-- `sum(max_over_time(jicofo_conferences[4d]))` -->

Peak conference size **94** <!-- `max(max_over_time(jicofo_largest_conference[4d]))` -->

Peak outgoing video traffic (JVB) **1.3 Gbit/s**

----

![JVB traffic](https://i.imgur.com/dHCrjV9.png)

----

- Complete deployment with Ansible :heart: (from zero to Jitsi in 15 minutes)
- Four shards for better scalability and resilience
- JVB load peaked at around 42%
- excluding our smallest videobridge (8 cores, 8 GB)
- We overprovisioned a bit <!-- `max(max_over_time(jitsi_stress_level{instance!~"s00-v000.+"}[4d]))` -->
- There will be a blog post on our Jitsi Meet deployment

----

- Thanks to the FFMEET project/FFMUC, their Jitsi tuning tips were invaluable :thumbsup: :heart:
- https://meet.ffmuc.net/
- State of DECT call-in?
- 48 hours of trying to get it to work
- Jigasi doesn't seem ready to work with our deployed version
- jitsi.rc3.world will be running over new year!
\o/

---

## Monitoring

Prometheus with Alertmanager and Grafana

NetBox-driven service discovery

We received

- 34858 critical alerts
- 13070 warnings

Don't worry, we silenced most of them <!-- `sum by(severity) (changes(ALERTS_FOR_STATE[3d]))` -->

---

## Abuse

- 2 emails (from Hetzner, letting us know that someone doesn't like our infrastructure as much as we do)
- 1 call (wanted to know if it's possible to buy tickets from us...)

---

## Other

+ Premium Ansible deployment, brought to you by Turing-complete YAML.
+ 130k DNS updates :fire: :fire_engine:
+ DNS, Prometheus and Grafana deployed on/by NixOS

---

## Sponsors

Thanks a lot. Not possible without you.

<!-- .slide: data-background="white" -->

![DECIX](https://howto.rc3.world/img/logos/DE-CIX.png =100x) ![Deutsche Telekom](https://howto.rc3.world/img/logos/dtag.png =200x) ![Flexoptix](https://howto.rc3.world/img/logos/flexoptix.jpeg =200x) ![German Edge Cloud](https://howto.rc3.world/img/logos/GEC_Logo_4C.png =75x) ![infra.run](https://i.imgur.com/aRn4i7n.png =200x) ![Hetzner Cloud](https://howto.rc3.world/img/logos/Hetzner_Logo.png =100x) ![iphh](https://howto.rc3.world/img/logos/iphh.png =90x) ![myLoc](https://howto.rc3.world/img/logos/myLoc.png =100x) ![marbis](https://howto.rc3.world/img/logos/nitrado.png =100x) ![Syseleven](https://howto.rc3.world/img/logos/sys11.png =100x) ![WOBCOM](https://howto.rc3.world/img/logos/WOBCOM.png =200x)
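---

The peak figures on the Jitsi slide come from queries of the form `sum(max_over_time(metric[4d]))`, kept in the slide comments. As a rough illustration of what that query shape computes, the sketch below mimics it in plain Python on made-up per-shard samples. The shard names, the numbers and both helper functions are hypothetical and are not rC3 data or tooling. It takes each series' own maximum and then sums them, which can exceed the true simultaneous peak when series peak at different times.

```python
from typing import Dict, List


def sum_of_max_over_time(series: Dict[str, List[float]]) -> float:
    """Mimic sum(max_over_time(metric[window])):
    take each series' maximum over the window, then sum across series."""
    return sum(max(samples) for samples in series.values() if samples)


def max_of_sum(series: Dict[str, List[float]]) -> float:
    """Simultaneous peak: sum across series at each sample index, then take the max.
    Assumes all series are sampled at the same (aligned) timestamps."""
    return max(sum(values) for values in zip(*series.values()))


# Hypothetical participant counts for two shards at three aligned timestamps.
samples = {
    "shard-1": [100, 300, 250],
    "shard-2": [200, 150, 400],
}

upper_bound = sum_of_max_over_time(samples)  # 300 + 400
simultaneous = max_of_sum(samples)           # max(300, 450, 650)

print(upper_bound, simultaneous)
```

With this sample data the per-series form yields 700 while the simultaneous peak is 650, so numbers obtained this way are best read as an upper bound on the simultaneous peak.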