Infrastructure for building CRAN binaries

The infrastructure powering rpkgs.com consists of the following components:

  • A multi-architecture k3s cluster (on Hetzner)
  • S3 object storage (on Hetzner)
  • A CDN (Bunny CDN)

Resource needs for this project vary significantly, making Kubernetes an ideal solution for efficient use of shared resources.

Building efficiently on different architectures requires access to native servers for those specific architectures. While most cloud providers offer this capability, Kubernetes simplifies the process by scheduling specific builds onto designated nodes. This ensures that each build runs on hardware matching its target architecture.
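As a minimal sketch of how a build could be pinned to a node of the right architecture, the Python snippet below assembles a pod manifest with a nodeSelector on the well-known node label kubernetes.io/arch. The function name, the pod name, the image, and the set of architectures (amd64, arm64) are assumptions made for illustration; the actual manifests used by the project are not shown here.

```python
# Well-known label that Kubernetes sets on every node to its CPU architecture.
ARCH_LABEL = "kubernetes.io/arch"

# Assumed set of architectures; the real cluster may differ.
SUPPORTED_ARCHES = {"amd64", "arm64"}


def build_pod_manifest(name, arch, image):
    """Return a pod manifest (as a dict) that may only run on nodes of `arch`."""
    if arch not in SUPPORTED_ARCHES:
        raise ValueError(f"unsupported architecture: {arch}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"{name}-{arch}"},
        "spec": {
            # The scheduler only considers nodes whose labels match this selector.
            "nodeSelector": {ARCH_LABEL: arch},
            "restartPolicy": "Never",
            "containers": [{"name": "build", "image": image}],
        },
    }


if __name__ == "__main__":
    import json

    # Hypothetical pod name and image, for illustration only.
    manifest = build_pod_manifest("build-packages", "arm64", "example.invalid/builder:latest")
    print(json.dumps(manifest, indent=2))
```

The resulting dict can be serialized to JSON or YAML and applied to the cluster; a nodeSelector is the simplest of several mechanisms Kubernetes offers for this kind of placement.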

Memory requirements for individual packages vary dramatically, ranging from a few hundred megabytes to approximately 12 GB. These varying needs must be reflected in the resource requirements of each pod. Using requests.memory of 5Gi and limits.memory of 14Gi has proven reliable for both scheduling and individual resource needs.
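The memory settings above can be sketched as the resources section of a container spec. The helper names and the 32 GiB node size are assumptions for illustration. The second function reflects the fact that the Kubernetes scheduler places pods based on their requests, not their limits, which is why a moderate request combined with a higher limit lets several builds share one node.

```python
def container_resources(request_gib=5, limit_gib=14):
    """Build the `resources` section of a container spec for a build pod."""
    if request_gib > limit_gib:
        raise ValueError("memory request must not exceed the memory limit")
    return {
        "requests": {"memory": f"{request_gib}Gi"},
        "limits": {"memory": f"{limit_gib}Gi"},
    }


def schedulable_pods(node_allocatable_gib, request_gib=5):
    """Number of build pods that fit on a node, counting memory requests only."""
    return node_allocatable_gib // request_gib


if __name__ == "__main__":
    print(container_resources())
    # Hypothetical node with 32 GiB of allocatable memory.
    print(schedulable_pods(32))
```

With these values the limits of co-located pods can add up to more than the node's memory, so the setup relies on most packages staying well below the 14Gi limit.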

Daily package updates can be handled with a single process per OS/version. However, building binaries for all CRAN packages requires a different orchestration strategy. On average, each CRAN package has six versions; this figure is the total number of binaries divided by the product of the number of OS/versions built and the number of unique packages. This scale makes some level of parallelization necessary.
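The averaging described above can be written out as a short calculation. The input numbers below are placeholders chosen only so the arithmetic is easy to follow; they are not the project's real counts, and the function name is an assumption.

```python
def average_versions_per_package(total_binaries, os_versions, unique_packages):
    """Average number of versions built per package and OS/version."""
    if os_versions <= 0 or unique_packages <= 0:
        raise ValueError("os_versions and unique_packages must be positive")
    return total_binaries / (os_versions * unique_packages)


if __name__ == "__main__":
    # Placeholder inputs, not real statistics.
    print(average_versions_per_package(
        total_binaries=1_200_000,
        os_versions=10,
        unique_packages=20_000,
    ))
```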

Initially, we implemented parallelization at the package version level. However, this approach led to occasional conflicts when dependencies were installed into a shared package cache. It also introduced unpredictable memory requirements within workflows. Some packages caused memory usage to spike beyond 30 GB, depending on the number of parallel workers. These spikes not only caused individual processes to crash but also demanded significantly higher overall resource limits.

We found a more robust solution by processing individual packages sequentially within each matrix job.

To build all versions of all packages, we divide CRAN packages into subsets. Each subset comprises 1/10 or fewer of the total packages, and we process these subsets in parallel. The total time required depends on factors such as the distribution and the number of parallel workers. Distributions that ship newer C compilers, such as Alpine, tend to build faster. A full build typically takes anywhere from a few days to two weeks.
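The subset strategy can be sketched as follows: split the package list into chunks that each hold at most a tenth of the total, run the chunks in parallel, and build the packages inside each chunk one after another. This is a simplified illustration, not the project's actual orchestration. The thread pool stands in for parallel cluster jobs, and build_package is a hypothetical placeholder for the real build step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(packages, parts=10):
    """Split `packages` into chunks holding at most 1/parts of the total
    (lists shorter than `parts` fall back to one package per chunk)."""
    if not packages:
        return []
    size = max(1, len(packages) // parts)
    return [packages[i:i + size] for i in range(0, len(packages), size)]


def build_package(name):
    """Hypothetical stand-in for building all versions of one package."""
    return f"built {name}"


def build_subset(subset):
    # Packages within a subset are processed sequentially, which keeps
    # memory use predictable and avoids concurrent writes to a shared cache.
    return [build_package(name) for name in subset]


def build_all(packages, parts=10, workers=4):
    """Build every subset in parallel and return the results in input order."""
    subsets = split_into_subsets(packages, parts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(build_subset, subsets))
    return [item for subset in results for item in subset]


if __name__ == "__main__":
    names = [f"pkg{i}" for i in range(25)]
    print(len(split_into_subsets(names)))
    print(build_all(names)[:3])
```

Because the chunk size is rounded down, the number of subsets can exceed ten; each subset still contains no more than a tenth of the packages.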

Binaries need storage, and S3 is a good fit. It is significantly more cost-effective than traditional cloud disk storage and has the added benefit of being accessible via a public API. Beyond AWS (the original S3 provider), numerous S3-compatible alternatives offer lower storage prices and lower transfer costs.

Hetzner introduced its own S3-compatible object storage just as this project's build processing began. This brings several advantages: lower overall storage costs, free internal traffic between Hetzner servers and their S3 storage, and storage located close to the build servers. That proximity significantly reduces upload latency.

Storing binaries in S3 works well for distribution, but it's not inherently fast. Adding a CDN in front of S3 enables caching and allows asset distribution via servers located in various regions worldwide. This significantly reduces download latency, making downloads feel much faster.

All packages are delivered through a CDN, which includes three dedicated static caches. These caches are strategically placed in Germany, the USA, and Asia.

With a CDN in place, downloads are fast from virtually anywhere, with only minor variations depending on the user's location.

The CDN determines when an asset is added to its permanent cache and how often it revalidates against the S3 source. Since package binaries are one-time builds that typically remain unchanged unless a forced rebuild occurs, relying heavily on the permanent cache is highly efficient in this context.