
Requirements for Custom Repositories

Creating custom repositories for your personal R packages requires several key components:

  1. The R package {bincraft} for building and publishing packages.
  2. (Optional) A database (SQLite, Postgres, MariaDB) accessible from your build environment. This component is only needed if you want to track build metadata, which we recommend.
  3. An S3 bucket for package storage that your build environment can access.
  4. (Optional) A Content Delivery Network (CDN) to accelerate S3 assets and enable caching. We recommend this option if you want fast and efficient downloads with cache reuse. Alternatively, you can use a reverse proxy or direct S3 URLs.
  5. (Optional) A DNS entry with a custom URL for your packages.

Any S3-compatible storage service will work for the package storage bucket.
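Independently of how {bincraft} publishes builds, a minimal upload sketch with the aws.s3 package might look like the following; the bucket name and object key are placeholders, and any S3-compatible client (for example paws) works just as well.

    library(aws.s3)

    # Credentials come from the usual AWS environment variables
    # (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION).
    # The bucket name and object key below are placeholders.
    put_object(
      file   = "mypkg_1.0.0_R4.3_x86_64-linux.tar.gz",
      object = "bin/linux/x86_64/4.3/mypkg_1.0.0.tar.gz",
      bucket = "my-r-repo-bucket"
    )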

The database stores build metadata such as timestamps, file sizes, and build errors. Using a database during builds is optional.

The database serves two main purposes:

  1. Storing build metadata for analytics and reporting purposes
  2. Using the recorded metadata to skip rebuilds of packages that have already failed

For analytics and reporting, you need to create and host a dashboard that displays build metadata in a user-friendly format. The rpkgs project currently uses a Shiny-based dashboard. Contributions welcome!
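The rpkgs dashboard is not reproduced here, but a minimal Shiny sketch that reads from the metadata database could look like this; the database file and the table name `builds` are placeholders.

    library(shiny)
    library(DBI)

    # Placeholder database file and table name; adjust to your setup.
    con <- DBI::dbConnect(RSQLite::SQLite(), "build-metadata.sqlite")

    ui <- fluidPage(
      titlePanel("Package build metadata"),
      tableOutput("builds")
    )

    server <- function(input, output, session) {
      output$builds <- renderTable({
        DBI::dbGetQuery(
          con,
          "SELECT name, tag, platform, arch, error_occurred, size, timestamp
           FROM builds ORDER BY timestamp DESC LIMIT 100"
        )
      })
    }

    shinyApp(ui, server)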

The database contains a single table with these columns:

  • name: character
  • tag: character
  • platform: character
  • arch: character
  • error_occurred: logical
  • error_text: character
  • size: integer
  • timestamp: datetime
  • duration: integer
  • removed: logical

The system initializes this table automatically during the first build.
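Although the table is created automatically, the sketch below shows an equivalent schema and the "skip already-failed builds" lookup using DBI and RSQLite; the table name `builds` and the SQLite type mapping are assumptions, not the project's exact definitions.

    library(DBI)

    con <- DBI::dbConnect(RSQLite::SQLite(), "build-metadata.sqlite")

    # Equivalent schema for the columns listed above ("builds" is a
    # placeholder table name; types are an assumed SQLite mapping).
    DBI::dbExecute(con, "
      CREATE TABLE IF NOT EXISTS builds (
        name           TEXT,
        tag            TEXT,
        platform       TEXT,
        arch           TEXT,
        error_occurred INTEGER,
        error_text     TEXT,
        size           INTEGER,
        timestamp      TEXT,
        duration       INTEGER,
        removed        INTEGER
      )
    ")

    # Before building, check whether this package/tag/platform/arch
    # combination has already failed, so the build can be skipped.
    already_failed <- function(pkg, tag, platform, arch) {
      res <- DBI::dbGetQuery(
        con,
        "SELECT COUNT(*) AS n FROM builds
         WHERE name = ? AND tag = ? AND platform = ? AND arch = ?
           AND error_occurred = 1",
        params = list(pkg, tag, platform, arch)
      )
      res$n > 0
    }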

Hardware requirements vary significantly based on whether you build selected packages or maintain a full dynamic repository like CRAN.

Packages requiring C/C++ compilation generally need more resources than pure R packages. Packages that link against system libraries (such as spatial packages) also require additional resources.

We have tested parallel builds and multi-core processing for handling multiple tags simultaneously. However, parallelization doesn't always reduce overall build time: some builds complete so quickly that the parallelization overhead exceeds any time saved compared to running them sequentially.

Parallelization can help at the package level by building multiple packages per runner. This approach is only beneficial for large-scale building and introduces challenges like balancing memory allocation with actual usage. For example, when parallelizing five packages, three might be "high memory" packages that together exceed available memory, causing all builds to fail.
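As a rough, generic sketch of package-level parallelism (not the rpkgs implementation), the snippet below builds several binary packages concurrently with parallel::mclapply; the package directories and core count are placeholders.

    # Placeholder package source directories and core count.
    pkg_dirs <- c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE")

    results <- parallel::mclapply(pkg_dirs, function(dir) {
      # R CMD INSTALL --build produces a binary package; output and exit
      # status are captured so one failing package does not abort the rest.
      out    <- system2("R", c("CMD", "INSTALL", "--build", dir),
                        stdout = TRUE, stderr = TRUE)
      status <- attr(out, "status")
      list(package = dir, failed = !is.null(status) && status != 0, log = out)
    }, mc.cores = 2)  # forks the R process, so Unix-like systems only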

Building binaries for personal or internal packages requires neither speed optimizations nor extensive hardware. Any virtual machine or local machine will suffice. Most packages build within seconds to one minute per version (one minute is already quite long).

For concrete specifications: any machine with at least 1 GB of free memory can process most plain R packages without issues. Over 90% of packages don't exceed 1-2 GB of memory during building. Package authors typically know when their packages require more memory.

Building and maintaining large-scale repositories like CRAN makes hardware considerations more important. The goal is maximizing parallel package builds on a machine while avoiding memory overallocation when multiple high-memory packages build simultaneously.

While it's unlikely that all parallel builds will process high-memory packages simultaneously, at scale it is probable that at least two of them will. This means you must keep a substantial memory buffer for potential usage spikes.

As a real-world example: the rpkgs project allocates 5 GB as "requests" (fully reserved memory) and allows spikes up to 14 GB ("limits") per package build. This configuration applies to five parallel jobs on a 64 GB machine, reserving 25 GB of memory with approximately 40 GB available for spikes. While more parallel runners could work for most builds, this configuration provides a robust balance between memory usage and parallelism.