Author Archives: Grant Slater

A Year of Infrastructure Progress: Site Reliability Engineer 2023/2024 Update

As the OpenStreetMap Foundation’s Senior Site Reliability Engineer (SRE), my focus in the OpenStreetMap Operations Team over the last year has been on driving efficiency, improving resiliency, and scaling our infrastructure to support the continued growth of the OpenStreetMap project. From cloud migration to server upgrades, we’ve made several improvements since last year to better position OpenStreetMap’s infrastructure to meet these resiliency and growth challenges.

Improving User Facing Services

Upgraded Rendering Services

The tile rendering infrastructure saw notable upgrades, including hardware and software optimisations, faster tile cache expiry to address vandalism, and automation to block non-attributing users. We now re-render low-zoom tiles daily, improving performance and giving mappers a faster feedback loop. The tile service is widely used and keeping up with demand is an ongoing challenge.
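To sketch the expiry idea (this is an illustration, not our actual tooling): when an edit dirties a tile at a high zoom level, every cached ancestor tile at lower zooms covers the same area and must be expired too. The hypothetical helper below shows that upward propagation, using the rule that the parent of tile (z, x, y) is (z - 1, x // 2, y // 2).

```python
def expire_ancestors(z, x, y, min_zoom=0):
    """Return the dirty tile plus every ancestor tile down to min_zoom.

    Each zoom level halves the tile coordinates, so the parent of
    (z, x, y) is (z - 1, x // 2, y // 2).
    """
    tiles = []
    while z >= min_zoom:
        tiles.append((z, x, y))
        z, x, y = z - 1, x // 2, y // 2
    return tiles

print(expire_ancestors(3, 5, 6))
# [(3, 5, 6), (2, 2, 3), (1, 1, 1), (0, 0, 0)]
```

In practice the expiry list is fed to the rendering stack so those cache entries are re-rendered on next request.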

New Aerial Imagery Service

We launched a new aerial imagery service that supports Cloud Optimised GeoTIFFs (COGs). The service now hosts aerial.openstreetmap.org.za, which is backed by 16TB of high-resolution imagery, and makes it easier to host additional imagery in the future.
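For context on why COGs help: a tiled GeoTIFF stores per-tile byte offsets and lengths (the TIFF TileOffsets and TileByteCounts tags), so a client can fetch just one internal tile with a single HTTP range request rather than downloading a whole multi-gigabyte file. A minimal illustration with made-up offset tables:

```python
def tile_http_range(tile_offsets, tile_bytecounts, tile_index):
    """Return the inclusive HTTP Range header value covering one internal tile."""
    start = tile_offsets[tile_index]
    end = start + tile_bytecounts[tile_index] - 1
    return f"bytes={start}-{end}"

# Example: the third tile starts at byte 4096 and is 2048 bytes long.
print(tile_http_range([0, 2048, 4096], [2048, 2048, 2048], 2))
# bytes=4096-6143
```

Real clients (GDAL, rasterio and friends) do this bookkeeping for you; the point is that only the bytes for the requested window ever cross the network.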

Transition to Gmail Alternative & Spam Mitigation

After facing significant spam issues with the OSMF’s Google Workspace, I migrated OSMF email services to mailbox.org. This has reduced the spam volume and improved administrative efficiency. We’re also in the process of transitioning historical OSMF Google Docs data to a self-hosted service.

Dealing with DDoS Attacks and Vandalism

This year, we faced several Distributed Denial of Service (DDoS) attacks, including a major DDoS for ransom incident, which was reported to law enforcement. These attacks tested our infrastructure, but we’ve implemented measures to strengthen our resilience and better protect against future threats.

We also dealt with large-scale vandalism that affected OpenStreetMap services. Thanks to the swift response and adjustments made by the Operations team, we’ve reinforced our infrastructure to better handle abuse and ensure continuous service.

Planet Data Hosting on AWS S3

With the OpenStreetMap Operations Team I’ve moved our planet data hosting to AWS S3 with mirrors in both the EU and US, allowing us to fully reinstate the back catalog of historical data. Through AWS’s OpenData sponsorship, replication diffs and planet data are now more accessible.
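Planet files are published with accompanying checksum files so downloads from any mirror can be verified. A small sketch of that verification step (the helper name is ours, and the hash algorithm here is just for illustration):

```python
import hashlib

def verify_download(data: bytes, expected_hex: str) -> bool:
    """Compare the SHA-256 digest of downloaded bytes against a published digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex

sample = b"planet"
digest = hashlib.sha256(sample).hexdigest()
print(verify_download(sample, digest))  # True
```

For a real multi-gigabyte planet file you would hash the file in chunks rather than loading it into memory.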

Making Systems Easier to Manage

Full AWS Infrastructure Management Using OpenTofu

With the OpenStreetMap Operations Team I’ve successfully migrated all manually managed AWS resources to Infrastructure-as-Code (IaC) using OpenTofu (an open-source fork of Terraform). This transition allowed us to improve cost efficiency, enhance security by adopting a least privilege IAM model, and gain better visibility into expenditures through detailed billing tags. Additionally, we’ve integrated S3 Storage Analytics to further optimise our costs, set up additional backups, and implemented enhanced lifecycle rules.
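One nice property of having everything in code is that policies such as billing tags can be checked mechanically. A hypothetical tag-policy check (the tag names are illustrative, not our actual scheme):

```python
REQUIRED_TAGS = {"project", "environment"}  # illustrative policy, not OSMF's real tags

def untagged(resources):
    """Return names of resources missing any required billing tag."""
    return [r["name"] for r in resources
            if not REQUIRED_TAGS <= set(r.get("tags", {}))]

fleet = [
    {"name": "planet-bucket", "tags": {"project": "planet", "environment": "prod"}},
    {"name": "scratch-bucket", "tags": {"project": "planet"}},
]
print(untagged(fleet))  # ['scratch-bucket']
```

A check like this can run in CI against the OpenTofu plan, so untagged resources never reach production.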

Improved Service Outage Alerting

We implemented SMS-based alerting for critical service outages, alongside a sponsored PagerDuty account. These improvements ensure quicker response times and better coordination during outages, with full integration with Prometheus/Alertmanager and Statuscake in the works.
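The routing idea is straightforward: page loudly for critical outages and use quieter channels for everything else. A toy sketch of severity-based routing (channel names are illustrative; the real setup is built on Alertmanager routes and PagerDuty):

```python
def route(alert):
    """Pick notification channels for an alert based on its severity label."""
    severity = alert.get("severity", "info")
    if severity == "critical":
        return ["sms", "pagerduty"]  # wake someone up
    if severity == "warning":
        return ["irc"]               # visible, but not a page
    return ["email"]                 # informational only

print(route({"name": "tile-render-down", "severity": "critical"}))
# ['sms', 'pagerduty']
```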

Technical Debt Reduction

This year, we made progress in reducing technical debt by moving several legacy services to more maintainable solutions. For instance, we containerised old services, including legacy State of the Map websites that were previously running poorly maintained WordPress installations. This transition has improved the scalability, security, and long-term maintainability of these services.

Additionally, we replaced our custom source installation of OTRS with a Znuny package installation from Debian. This shift simplifies upgrades and reduces the maintenance burden, ensuring the system remains up to date and secure without custom modifications.

Ensuring Infrastructure Resilience Despite Hardware Failures

Over the past year, we’ve maintained a resilient infrastructure even in the face of hardware failures. We replaced numerous disks and RAM, ensuring minimal disruption to services. Our bespoke monitoring system allows us to detect early signs of hardware failure, enabling us to act quickly and replace faulty components before they cause significant issues. This proactive approach has been key to maintaining system uptime and reliability.
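Early-failure detection largely comes down to watching a few disk health counters trend upward. A simplified sketch of the idea (the counter and threshold are illustrative, not our actual monitoring rules):

```python
REALLOCATED_SECTOR_LIMIT = 0  # illustrative: any reallocated sector is suspect

def failing_disks(smart_report):
    """Flag disks whose reallocated-sector count exceeds the limit.

    smart_report maps device name to its reallocated sector count,
    as read from SMART data.
    """
    return sorted(dev for dev, count in smart_report.items()
                  if count > REALLOCATED_SECTOR_LIMIT)

print(failing_disks({"sda": 0, "sdb": 12, "sdc": 3}))  # ['sdb', 'sdc']
```

Flagged disks get replaced during the next maintenance window, before they fail outright.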

Upgrading Infrastructure

Cross-Site Replication of Backups

To ensure robust disaster recovery, I’ve established cross-account, cross-region replication for AWS S3 backups, enabling point-in-time recovery. This safeguards critical data and services, even in the face of major failures, providing long-term peace of mind.
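Point-in-time recovery over a versioned bucket reduces to picking, for each object, the newest version at or before the recovery timestamp. A minimal sketch of that selection logic (assuming integer timestamps for brevity):

```python
def version_at(versions, recover_ts):
    """Return the version_id of the newest version at or before recover_ts.

    versions is a list of (timestamp, version_id) pairs; returns None if
    no version existed at that point in time.
    """
    eligible = [v for v in versions if v[0] <= recover_ts]
    return max(eligible)[1] if eligible else None

history = [(100, "v1"), (250, "v2"), (900, "v3")]
print(version_at(history, 300))  # 'v2'
```

S3 versioning plus replication gives you exactly this history on the replica side, so recovery is a matter of selecting and copying the right versions back.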

High Availability Infrastructure

Key hardware upgrades at our Amsterdam, Dublin, and OSUOSL sites improved performance, storage capacity, and network reliability. New switches were installed in 2022, and we have now finished setting up a high availability (HA) configuration to improve service. We have continued to improve the setup by moving to dual diverse uplinks to our ISP for better resilience.

Debian Migration

We are migrating from Ubuntu to Debian 12 (Bookworm) as our standard distribution. All new servers now run on Debian. Our chef configuration management has been updated with test code to ensure ongoing compatibility. This transition marks a shift towards greater long-term stability and security, and we marked it with a celebratory Mastodon post.

Looking Ahead

The year ahead brings exciting new opportunities as we build on our progress. Key priorities for 2024 / 2025 include:

Engaging

Community Engagement & Outward Communication: Enhancing collaboration with the Communication Working Group (CWG) and improving our public-facing communication around service status and outages.

Improving Documentation and Onboarding: We’ll enhance onboarding documentation and conduct dedicated sessions to help new contributors get involved in operations more easily. This includes improving the reliability and coverage of our testing processes, ensuring smoother contributions and reducing the learning curve for new team members.

Planning and Optimizing

Capacity Planning for Infrastructure Growth: As OpenStreetMap and the demand on our services grow, we will ensure we can scale to meet demand. By anticipating future needs and balancing performance with cost-effective growth, we aim to maintain the service quality and availability our community expects.

Ongoing Cost Optimisation: We’ll continue to find ways to reduce costs by leveraging sponsorships like the AWS OpenData programme, ensuring sustainable operations.

Continuing to Reduce Technical Debt: We will continue simplifying our infrastructure by reducing the maintenance burden of legacy systems, such as increasing the use of containers. This will help streamline management tasks and allow us to focus on other improvements, making the infrastructure more efficient and scalable over time.

Continue Infrastructure Improvements

Implementation of High Availability Load Balancers: Rolling out the HA (VRRP + LVS + DSR) configuration for load balancers to improve system reliability and reduce potential downtime.
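For the curious, VRRP master election is simple: the router advertising the highest priority becomes master, with ties broken in favour of the higher IP address. The sketch below models that election; it is an illustration, not our load balancer configuration.

```python
def elect_master(routers):
    """Elect the VRRP master.

    routers maps a router name to (priority, ip_tiebreak); the highest
    priority wins, and ties are broken by the higher IP value.
    """
    return max(routers, key=lambda name: routers[name])

peers = {"lb1": (150, 10), "lb2": (100, 20), "lb3": (150, 5)}
print(elect_master(peers))  # 'lb1'
```

LVS with DSR then lets the elected balancer forward packets while the real servers reply to clients directly, keeping return traffic off the balancer.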

Finalising Prometheus Integration with PagerDuty: Completing the integration of Prometheus for monitoring and PagerDuty for streamlined alerting and incident response.

Complete the Transition to Full Debian Environment: Migrating all remaining services from Ubuntu to Debian for increased stability and security.

Enhancing Disaster Recovery & Backup Strategies: Further refining our recovery documentation and introducing additional backup measures so that critical services are protected and recoverable in the event of failure.


Powering OpenStreetMap’s Future: A year of improvements from OpenStreetMap Foundation’s Site Reliability Engineer

Just over one year ago, I joined the OpenStreetMap Foundation (OSMF) with the goal of enhancing the reliability and security of the technology and infrastructure that underpins OpenStreetMap. Throughout the past year, I have worked closely with the Operations Working Group, a dedicated team of volunteers. Together, we have made significant progress in improving our processes and documentation, ultimately strengthening our collective effectiveness. I am immensely grateful for the support and collaboration within this group, and I am delighted to witness the remarkable strides we have taken in building a solid foundation for the future of OpenStreetMap.

I’ll go into a little detail below about what’s transpired. At a high level, I made it easier to manage deployment of the software running on our servers; hardened our network infrastructure through better redundancy, monitoring, access, and documentation; grew our use of cloud services for tile rendering, leveraging a generous AWS sponsorship; improved our security practices; refreshed our developer environments; and last but definitely not least, finalised migration of 16 years of content from our old forums to our new forums.

If you want to hear more from me over the course of the work last year, check out my talk at State of the Map 2022 and my interview on the GeoMob podcast. And I’d love to hear from you, email me at osmfuture@firefishy.com.

2022-2023 Site Reliability Details

Managing software on our servers

Containerised small infrastructure components (GitHub Actions for building)

I have containerised many of our small sites, which were previously built using bespoke methods in our chef codebase as part of the “Configuration as code” setup, and moved the build steps to GitHub Actions. This establishes a base for any future container (“docker”) based projects going forward. These are our first container-based projects hosted on OSMF infrastructure.

Our chef based code is now simpler, more secure and deploys faster.

Improved chef testing (ops onboarding documentation)

We use chef.io for infrastructure (configuration) management of all our servers and the software used on them. Over the last year the chef test kitchen tests have been extended and now also work on modern Apple Silicon machines. The tests now reliably run as part of our CI / PR processes, adding quality control and assurance to the changes we make. ARM support was easier to add because we could use test kitchen before moving onto ARM server hardware.

Having reliable tests should help onboard new chef contributors.

Hardened our network infrastructure

Network Upgrades @ AMS (New Switches, Dual Redundant Links, Dublin soon)

Our network setup in Amsterdam was not as redundant as it should have been. We had outgrown the Cisco Small Business equipment we were using, had suffered unexpected power outages due to hardware issues, and the equipment was limiting future upgrades. The ops group decided to replace the hardware with Juniper equipment, which we had standardised on at the Dublin data centre. I replaced the equipment in a live environment with minimal downtime (<15 mins).

Both Dublin and Amsterdam data centres now use a standardised and more secure configuration. Each server now has fully bonded links for improved redundancy and performance, and the switches have improved power and network redundancy. We are awaiting the installation of the fully resilient uplinks (order submitted) in the next month.

Out of Band access to both data centres (4G based)

I built and installed an out-of-band access device at each site. The devices are hard-wired to networking and power management equipment using serial consoles, and each has a resilient 4G link to a private 4G network (1NCE). The devices are custom-built Raspberry Pis with redundant power supplies and 4x serial connectors.

Documentation of Infrastructure to ease maintenance (Racks / Power)

I documented each rack unit, power port (Power Distribution Unit), network connection, and cable at the data centres. This makes it easier to manage the servers, reduces errors, and allows us to properly instruct remote hands (an external support provider) to make any changes on our behalf.

Oxidized (Visibility of Network Equipment)

Our network and power distribution configuration is now stored in git and changes are reported. This improves visibility of any changes, which in turn improves security.

Config is continuously monitored and any config drift between our sites is now much easier to resolve.
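With every device configuration stored in git, drift detection is essentially a diff between the stored config and the running config. A sketch using Python’s standard library:

```python
import difflib

def drift(stored: str, running: str):
    """Return the changed lines between the stored and running configs."""
    diff = difflib.unified_diff(stored.splitlines(), running.splitlines(),
                                "stored", "running", lineterm="")
    # Keep only added/removed content lines, dropping the file headers.
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("+++", "---"))]

print(drift("vlan 10\nvlan 20", "vlan 10\nvlan 30"))
# ['-vlan 20', '+vlan 30']
```

Oxidized does the collection side of this automatically, committing each device’s config so changes show up as ordinary git diffs.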

Terraform Infrastructure as Code (improve management / repeatability)

Terraform is an infrastructure-as-code tool. We now use it for managing our remote monitoring service (StatusCake), and I am in the process of implementing it to manage our AWS and Fastly infrastructure.

Previously these components were all managed manually using the respective web UIs. Infrastructure-as-code allows the Ops team to work collaboratively on changes, and enhances visibility and the repeatability / rollback of any changes.

We manage DNS for all our domains using dnscontrol code. Incremental improvements have been made over the last year, including adding CI tests to improve outside collaboration.

Grew our use of cloud services

AWS in use for rendering infrastructure. Optimised AWS costs. Improved security. Improved backups. More in the pipeline.

The Ops team has slowly been increasing our usage of AWS over a few years. I have built out multiple usage-specific AWS accounts using an AWS organisation model to improve billing and security, as per AWS best practice guidelines. We generously received AWS sponsorship for expanding our rendering infrastructure, and built the experimental new rendering infrastructure on ARM architecture using AWS Graviton2 EC2 instances.
We hadn’t previously used ARM-based servers. As part of improvements to our chef (configuration as code) setup we had already added local testing support for Apple Silicon (ARM), so only small additions were required to make chef compatible with ARM servers.

We were impressed by the performance of AWS Graviton2 EC2 instances for running the OSM tile rendering stack. We also tested on-demand scaling and instance snapshotting for potential further rendering stack improvements on AWS.
We have also increased our usage of AWS for data backup.

Improved our security

Over the last year a number of general security improvements have been made. For example, server access is now via ssh key (password access is disabled). We’ve also moved the ops team from a bespoke gpg-based password manager to gopass (a feature-rich variant of https://www.passwordstore.org/), which improves key management and sharing of the password store.

Additionally, we have enhanced the lockdown of our WordPress instances by reducing installed components, disabling inline updates, and disabling XML-RPC access. We are also working to reduce the number of users with access and to remove unused access permissions.

Documented key areas of vulnerability requiring improvement (Redundancy, Security, etc)

Documentation on technical vulnerability: I am producing a report on key areas of vulnerability requiring improvement (redundancy, security, etc). The document can be used to focus future investment to further reduce our exposure to risk.

Refreshed our developer environments

New Dev Server

We migrated all dev users to a new dev server in the last year. The old server was end of life (~10 years old) and was reaching capacity limits (CPU and storage). The new server was delivered directly to the Amsterdam data centre and physically installed by remote hands; I communicated the migration plan and then moved all users and projects across.

Retired Subversion

I retired our old svn.openstreetmap.org code repository in the last year. The repository had been used since the inception of the project and contains a rich history of the project’s code development. I converted it to git using a custom reposurgeon config, taking care to maintain the full contribution history and correctly link previous contributors (350+) to their respective GitHub and related accounts. The old svn links are maintained and now point to the archive on GitHub: https://github.com/openstreetmap/svn-archive
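The identity-mapping step in a conversion like this pairs each svn username with a git-style author string. The entries below are invented placeholders to show the shape of the mapping, not records from the real 350+ contributor map:

```python
# Invented placeholder entries; the real map covered 350+ contributors.
AUTHOR_MAP = {
    "jdoe": "Jane Doe <jane@example.org>",
}

def map_author(svn_user: str) -> str:
    """Map an svn username to a git author, with a fallback placeholder identity."""
    return AUTHOR_MAP.get(svn_user, f"{svn_user} <{svn_user}@svn.invalid>")

print(map_author("jdoe"))     # Jane Doe <jane@example.org>
print(map_author("ghost"))    # ghost <ghost@svn.invalid>
```

reposurgeon consumes a map in this spirit so every historical commit ends up attributed to the right person.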

Forum Migration

We migrated the old forum, over 1 million posts spanning 16 years, to Discourse. All posts were converted from FluxBB markup to Discourse’s flavour of markdown, and all accounts were merged and converted to OpenStreetMap.org “single sign-on” based auth.

All old forum links now redirect to the corresponding imported content: users, categories (countries etc.), thread topics, and individual posts.
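Conceptually, each redirect resolves an old FluxBB topic URL to its imported Discourse topic via an id mapping built during the import. A simplified sketch (the mapping values are invented):

```python
import re

TOPIC_MAP = {123: 4567}  # invented example: old FluxBB topic id -> Discourse topic id

def redirect(old_url):
    """Resolve an old forum topic URL to its imported Discourse topic, if known."""
    m = re.search(r"viewtopic\.php\?id=(\d+)", old_url)
    if m and int(m.group(1)) in TOPIC_MAP:
        return f"https://community.openstreetmap.org/t/{TOPIC_MAP[int(m.group(1))]}"
    return None

print(redirect("https://forum.openstreetmap.org/viewtopic.php?id=123"))
# https://community.openstreetmap.org/t/4567
```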

Meet Grant Slater, the OpenStreetMap Foundation’s new Senior Site Reliability Engineer

Thanks to the support of corporate donors, the OpenStreetMap Foundation has been able to hire its first employee, who is starting on 1 May 2022. Grant Slater and Guillaume Rischard, the Foundation’s chairman, sat down for a virtual chat.

Hi! Tell us about you?

Hi! I’m Grant Slater, and I’m the new Senior Site Reliability Engineer (SRE) working for the OpenStreetMap Foundation. I’m originally from South Africa, and now live in London (UK) with my wife Ingrida and our son Richard.

What do you do in OSM? Where do you like to map?

I’ve been mapping since 2006, mostly in Southern Africa and the United Kingdom. I have a strong interest in mapping the rail network of South Africa; holidays “back home” often involve booking railway trips across the country, with a GPS in hand.

My latest toy is an RTK GPS base station and rover. I’ll soon be mapping my neighbourhood with centimetre-level accuracy.

For the last 15 years, I’ve been part of the volunteer OpenStreetMap Operations Team who install and maintain the servers and infrastructure which runs the OpenStreetMap.org website and many other related services.

What are your plans for the new SRE job?

My main objective will be helping improve the reliability and security of the project’s technology and infrastructure.

One of my goals will be to improve the project’s long-term stability as we grow. The OWG can’t work without volunteers, and I will be improving the Operations Team’s bus factor by improving our processes and documentation, and by smoothing the path to onboarding new team members.

I will be helping to drive forward modernising the project’s infrastructure by reducing complexity, paying-down technical debt, and reducing our need to maintain undifferentiated heavy lifting, by tactically using Cloud and SaaS services, where suitable.

Is there anything else you’d like to say?

With time, I would like to see OpenStreetMap introduce new tools and services that improve our mappers’ access to opted-in, passively collected data, enhancing their ability to map and to detect change.

Gamification! OpenStreetMap should always remain a fun and gratifying experience for all. We’re building an invaluable and unique dataset with far-reaching consequences, of which we should be incredibly proud. Happy Mapping!

I would like to hear your feedback and suggestions, please email me osmfuture@firefishy.com

Grant gave a talk at State of the Map US (2013) – OSM Core Architecture and DevOps and is hoping to give an updated talk at State of the Map 2022 in Florence, Italy 19–21 August 2022.

https://www.openstreetmap.org/user/Firefishy
https://twitter.com/firefishy1
https://github.com/firefishy

Upcoming Maintenance

On Saturday 5th of July 2014 between 09:00 and 19:00 (GMT / UTC) we are moving our servers hosted by University College London to another data center.

The following services will be affected:

  • Search (nominatim.openstreetmap.org) will be unavailable. *
  • Slower map updates / Reduced tile rendering capacity. (Yevaud outage)
  • OSM Foundation websites and blog.openstreetmap.org will be unavailable.
  • Taginfo (taginfo.openstreetmap.org) will be unavailable.
  • Development Server (errol) will be unavailable.
  • Some imagery services will be unavailable. (GPX Render, OS
    Streetview, OOC, AGRI, CD:NGI aerial)

Other OpenStreetMap provided services should not be affected – all of the following are expected to function normally:

  • www.openstreetmap.org web site will allow edits as per normal (iD or
    Potlatch).
  • API will allow map editing (using iD, JOSM, Merkaartor etc.)
  • Forum
  • trac (bug-tracker)
  • help.openstreetmap.org
  • tile serving (“View The Map” & “Export”)
  • Wiki
  • mailing lists
  • subversion and git (source code repositories)
  • donate.openstreetmap.org

Technical: We are moving all the servers listed here to a new UCL data center. The current building is being closed soon for refurbishment. The new data center has better server racks, power feeds, cooling and faster networking.

* Searches through the website will still work – we will redirect
them to another nominatim instance temporarily.

Sincerely
Grant Slater
On behalf of the OpenStreetMap sysadmin team.

OpenStreetMap Enhances User Privacy

Today, OpenStreetMap has enabled encryption (SSL) to all of the openstreetmap.org website, thereby enhancing the privacy of its users.

You can now browse the site at https://openstreetmap.org (note the ‘https’). This means your browsing activity is secure from snooping.

OpenStreetMap stands with the Open Rights Group and the Electronic Frontier Foundation in asserting greater Internet freedom, including the right to individual privacy. With this action, we continue providing the highest-quality Free/Open Data geographic resource to everyone.

We are proud to roll this out on the same day as the “Day We Fight Back” campaign.

Other aspects of privacy around OpenStreetMap are discussed on this wiki page.

OpenStreetMap infrastructure donation – Bluehost

Thanks to generous donations and active local community members, the OpenStreetMap distributed tile delivery infrastructure continues to grow.

Two tile servers, nadder-01 and nadder-02, have been added to the OpenStreetMap tile cache network. Based in Provo, Utah, USA, these servers provide tiles to the Americas.

Map tiles are delivered to users based on their GeoDNS location. The OpenStreetMap Foundation seeks additional distributed tile servers. If you would like to donate a tile server and hosting, please see the Tile CDN requirements page on the wiki.
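GeoDNS simply means the DNS answer depends on where the query originates. A toy model of continent-based tile server selection (pool contents are illustrative):

```python
POOLS = {  # illustrative pools; the Americas are served from Provo
    "NA": ["nadder-01", "nadder-02"],
    "SA": ["nadder-01", "nadder-02"],
    "EU": ["eu-cache-01"],
}

def resolve(continent, default=("fallback-cache",)):
    """Return the tile-cache pool serving a client's continent."""
    return POOLS.get(continent, list(default))

print(resolve("NA"))  # ['nadder-01', 'nadder-02']
```

The real setup answers `tile.openstreetmap.org` lookups with the nearest healthy cache, so users hit a geographically close server without any client-side logic.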

(Image: tile serving GeoDNS map)

We would like to thank the BOSS team (Bluehost Open Source Solutions) and especially Jared Smith at Bluehost.com for this generous donation to OpenStreetMap infrastructure.

The OpenStreetMap Foundation is a not-for-profit organization, formed in the UK to support the OpenStreetMap Project. It is dedicated to encouraging the growth, development and distribution of free geospatial data and to providing geospatial data for anyone to use and share. The OpenStreetMap Foundation owns and maintains the infrastructure of the OpenStreetMap project. You can support OpenStreetMap by donating to the OpenStreetMap Foundation.

New tile rendering and CartoCSS stylesheet

The default OpenStreetMap.org “standard” map was switched across to a new rendering server setup over the last weekend.

In addition to new hardware, the rendering server also uses the new “openstreetmap-carto” stylesheet. This is a complete re-write of the old XML stylesheet to use CartoCSS, making it easier for our cartographers to work with. The style is designed to look as similar as possible to the old XML stylesheet.

Andy Allan presented a great talk at State of the Map US conference describing the reasons for re-writing the stylesheet: Putting the Carto into OpenStreetMap Cartography


Andy will present a follow-up at State of the Map next month.

The “openstreetmap-carto” stylesheet is maintained on github

“openstreetmap-carto” is a good base for creating custom styles, and should be much easier to work with. If you want to help improve the style, or add new features, please fork it and contribute pull requests!

Please support OSM’s server hardware fundraising drive