Service-Enhancements | 2i2c

Protecting our hubs against the CopyFail kernel exploit

Mon, 04 May 2026 00:00:00 +0000

The recently disclosed CopyFail Linux kernel zero-day (CVE-2026-31431) opens up a way for code running inside a container to break out onto the underlying node. We took a close look at our hubs to check whether they were exposed, found that they are likely not at risk, and added another layer of protection just in case.

Are 2i2c’s hubs at risk? #

No - based on our testing and mitigation efforts, our hubs are not vulnerable to CopyFail.

Why do we think we’re not at risk? #

We tried to reproduce the exploit on a staging hub by following the public Kubernetes proof-of-concept on both AWS and EKS, and the exploit was unable to break out of the container.
Existing JupyterHub hardening on Kubernetes from jupyterhub/kubespawner#545 (originally added by Yuvi in 2021 in response to a different security issue) had already significantly reduced our risk exposure, and the exposure of anyone else running Z2JH (the standard way to deploy JupyterHub on Kubernetes).
As an extra layer of protection, we deployed copyfail-ebpf-k8s as a daemonset across all of our clusters in 2i2c-org/infrastructure#8227. This runs on every node and covers all of our hubs (including those on non-commercial cloud infrastructure, like JetStream2). It blocks the specific kernel features that CopyFail depends on. See the project’s explanation for how that works.
We’ve upgraded all GKE clusters to use a patched image in 2i2c-org/infrastructure#8230.

What else did we look into #

Deckhouse’s mitigation was too platform-specific for us.
OVHcloud’s modprobe blocking likely won’t work on Amazon Linux 2023, since the relevant module is built into the kernel image.
AL2023 security advisories - no patched AL2023 image is available yet, so we can’t rely on a kernel-level fix from AWS for now.

Acknowledgements #

Huge thanks to Georgiana for the deep dive into the exploit and whether we’re exposed here.
Thanks to Yuvi for the PR that reduced JupyterHub’s exposure to this back in 2021!
Thanks to iwanhae for the eBPF daemonset we deployed in Kubernetes, and to JupyterHub for the upstream kubespawner hardening that lowered our exposure.
Thanks to our collaborators at NASA VEDA for the ongoing conversations about hub security.
Thanks to our collaborators at Pythia for supporting ongoing work around security in JupyterHub and BinderHub, especially on non-commercial cloud like JetStream.

Upgrading community infrastructure to Kubernetes 1.34 and JupyterHub 4.3.3

Wed, 08 Apr 2026 00:00:00 +0000

We’ve completed a major round of infrastructure upgrades across all 2i2c-managed hubs - every hub is now running Kubernetes 1.34 and Z2JH helm chart 4.3.3.

Running up-to-date versions of both Kubernetes and the JupyterHub helm chart ensures that our communities get the best support and reliability, both in terms of features and security.

A new approach to infrastructure upgrades: upgrading in rounds #

This was the first time we rolled out JupyterHub helm chart upgrades in rounds rather than all at once. By upgrading a subset of hubs at a time, we could identify and fix issues in isolation before they affected the broader network. This made the process safer and more predictable.
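
As a rough illustration of the rounds idea (not our actual deployment tooling), the sketch below splits a list of hubs into fixed-size rounds and stops after the first round in which an upgrade fails. The hub names, the `round_size` parameter, and the `upgrade` callback are all hypothetical.

```python
from typing import Callable


def plan_rounds(hubs: list[str], round_size: int) -> list[list[str]]:
    """Split hubs into consecutive rounds of at most round_size hubs."""
    if round_size < 1:
        raise ValueError("round_size must be at least 1")
    return [hubs[i:i + round_size] for i in range(0, len(hubs), round_size)]


def upgrade_in_rounds(
    hubs: list[str],
    round_size: int,
    upgrade: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Upgrade round by round; stop after the first round with a failure.

    Returns (upgraded, failed). Hubs in later rounds are left untouched,
    so a problem found early never reaches them.
    """
    upgraded: list[str] = []
    failed: list[str] = []
    for hubs_in_round in plan_rounds(hubs, round_size):
        for hub in hubs_in_round:
            # `upgrade` is a hypothetical callback that returns True on success.
            (upgraded if upgrade(hub) else failed).append(hub)
        if failed:
            break  # fix the issue in isolation before continuing
    return upgraded, failed


if __name__ == "__main__":
    hubs = ["staging", "hub-a", "hub-b", "hub-c", "hub-d"]
    print(upgrade_in_rounds(hubs, 2, lambda hub: hub != "hub-b"))
```

In this example the failure in the second round means `hub-d` is never touched, which is the property that makes a rounds-based rollout safer than upgrading everything at once.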

We’re planning to perform these kinds of upgrades on a regular schedule for our member communities. Around every 6 months we’ll create an issue to make sure nothing falls through the cracks (here’s an example config for creating our reminder issues).

Check out our process docs for multi-hub upgrades for more information.

Learn more #

Check out these pages for what kinds of improvements we’ve brought into our clusters / hubs with these latest updates.

Acknowledgements #

Thanks to Georgiana Dolocan for leading this upgrade effort and establishing the rounds-based approach.
Thanks to Chris Holdgraf for adapting and editing Georgiana’s notes into a blog post.

How regularly upgrading core infrastructure leads to upstream improvements and better infrastructure

Fri, 03 Apr 2026 00:00:00 +0000

Our collaborators at NASA VEDA recently asked us about the rationale behind policies for upgrading our infrastructure relatively quickly when new versions come out. Here’s the explanation that we shared with them, in case it’s useful for others as well.

In this case, the decision was whether to upgrade to Helm 4, and you can find our rationale in the /initiatives repository. Here’s a brief summary from Yuvi:

Fundamentally, it helps keep moving us and the ecosystem forward, and drive improvements upstream, in both JupyterHub and Helm.

It has driven these PRs in JupyterHub:

jupyterhub/action-k3s-helm#126 (merged)
jupyterhub/zero-to-jupyterhub-k8s#3797 (validated, but not merged yet)

It’s also driven improvements to helm itself - see this bug report that is being worked on:

helm/helm#31919

Upgrading helm versions can break things (and it has for some of our other communities in the past - see this example). So it’s important we do that on a reasonable timeframe and carefully, to avoid disruptions.

We’re also discovering for example that potentially the new nginx-ingress controller we had to move to has some issues working with older helm versions (ongoing WIP in 2i2c-org/infrastructure#7995). That feels much more tractable because we can now go ‘ok, let us just apply a quick fix now, and wait for the helm 4 rollout, and try again’ instead of being totally stuck.

This is similar to the other part of our VEDA objective - rolling out new versions of jupyterhub. If we need to roll out security fixes, it’s much easier now because we already did the hard work of being up to date:

2i2c-org/infrastructure#7996

This isn’t the case quite yet for helm v3, as it’s still supported, but it’s much better to do this work earlier than wait.

If you encounter a bug in popular open source software, often you can just ‘wait’ for it to be fixed. But this isn’t just about time - someone somewhere has to put in the effort of getting it fixed, filing helpful upstream bug reports, and testing to make sure it works. This is an example of 2i2c continuing to contribute this effort upstream wherever we can.

Acknowledgements #

Thanks to NASA VEDA for collaborating deeply with us on infrastructure questions like this.

Enabling CloudBank to safely manage their own cluster infrastructure

Tue, 20 Jan 2026 00:00:00 +0000

We recently enabled CloudBank to run Terraform changes for their cluster without needing to wait on 2i2c engineers for each request. They run 50+ hubs for various community colleges, and we want to enable them to self-serve as much of that as possible. When we introduced home directory quotas, they were no longer able to set up hubs by themselves without help from 2i2c engineers. Our goal was to empower them to set up new hubs in a safe way while still benefiting from the home directory limits work.

CloudBank simplifies cloud access for computer science research and education.

To do this safely, we needed to avoid granting access to shared Terraform state that could impact other communities. Following Yuvi’s suggestion, we migrated CloudBank’s Terraform state to CloudBank’s own GCP project so that infrastructure changes from the CloudBank team are isolated to their cluster only, making this safe to try. This unblocks CloudBank to run changes like terraform plan and terraform apply themselves, meaning that CloudBank can deploy and update a hub without 2i2c engineers in the loop.

This is a good example of how we aim to balance community autonomy with infrastructure safety. CloudBank can now self-serve routine operations while our broader infrastructure remains protected.

Acknowledgements #

Thanks to Sean Morris and the CloudBank team at UC Berkeley for collaborating on this workflow.

Improving our community hub reliability and stability in Q4 2025

Tue, 16 Dec 2025 00:00:00 +0000

This year we’ve prioritized making the cloud safe to try for our member communities. This has driven work in monitoring, alerting, and automating infrastructure so that we resolve small problems before they become big problems. In the last quarter of 2025, we wrapped up this effort by testing the following hypothesis:

We can reduce P1 incidents if we shorten the time to act on current alerts and learnings from prior incidents.

Here’s what we accomplished and what we learned.

What we accomplished #

In short: we’re now much more confident in the stability of community infrastructure. Here’s a snapshot of our new incident dashboard, which shows high-level trends for the stability of our infrastructure:

See the real-time status of our community hubs at status.2i2c.org

We improved infrastructure reliability for our communities #

We made several technology and team process improvements that led to these benefits for our communities:

We are now more likely to catch outages before a community reports them to us.
We are now less likely to have an outage happen more than once, or affect more than one community, because we consistently fix the issues that cause outages.

We saw a consistent drop in critical alerts that required immediate response:

For August and September we had an average of 7 outages/month (6 from alerts, 1 from community)
In October, November, and December we had an average of 3 outages/month (9 in October, 0 in November, 1 in December, with only one of these being reported by a community)
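
The figures above can be checked with a few lines of arithmetic. This sketch only reuses the numbers quoted in this post, and it reads "6 from alerts, 1 from community" as the per-month breakdown of the August-September average, which is an assumption about how that parenthetical should be interpreted.

```python
# Outage counts quoted above; nothing here is new data.
aug_sep_monthly_average = 6 + 1  # 6 from alerts + 1 from community
q4_outages = {"October": 9, "November": 0, "December": 1}

q4_monthly_average = sum(q4_outages.values()) / len(q4_outages)
reduction = 1 - q4_monthly_average / aug_sep_monthly_average

print(f"Q4 average: {q4_monthly_average:.1f} outages/month")
print(f"Reduction vs. August-September: {reduction:.0%}")
```

The Q4 average works out to 10 / 3, which the text rounds to 3 outages/month, a little over half the August-September rate.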

We became more efficient, responsive, and focused #

We also got several team benefits from this work:

We get fewer interruptions and distractions from deeper work.
We have clear assignment policies to make it clear who is responsible for acting in response to alerts.
We avoid the invisible work of falling down rabbit holes when responding to outages.
We decreased the stress and pressure of doing upgrades, making them easier to split into sprint items and more likely to get done consistently.

The improvements we made #

Infrastructure improvements #

Created a status page for all 2i2c community hubs, giving our team and communities visibility into the status of our infrastructure.
Created an alert that triggers when two servers fail to start consecutively in a 30-minute time window.
Improved deployment infrastructure so that we can roll out sub-chart upgrades to individual clusters, allowing us to roll out major changes in batches.
Removed our “configurator” application from community hubs, because it was causing more confusion than it was resolving.
Allowed servers to start even when users hit their storage quotas.
Provided a number of upgrades to Kubernetes and the support services that we run alongside each community hub.
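
To make the server-start alert in the list above more concrete, here is a minimal sketch of the kind of check it performs. It is not our actual alerting rule: the event format is hypothetical, and it assumes "consecutively" means two back-to-back start attempts that both failed no more than 30 minutes apart.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def should_alert(
    events: list[tuple[datetime, bool]],
    window: timedelta = WINDOW,
) -> bool:
    """Return True if two consecutive server starts failed within `window`.

    `events` is a list of (timestamp, succeeded) pairs, one per start attempt.
    """
    ordered = sorted(events, key=lambda event: event[0])
    for (t_prev, ok_prev), (t_next, ok_next) in zip(ordered, ordered[1:]):
        if not ok_prev and not ok_next and t_next - t_prev <= window:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2025, 12, 1, 9, 0)
    attempts = [
        (start, False),
        (start + timedelta(minutes=10), False),
    ]
    print(should_alert(attempts))
```

A single failed start does not fire the alert under this rule; it takes a second failure right after the first, which filters out one-off problems on individual user servers.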

Process improvements #

Made a team commitment to prioritize issues from incident reports and other stability-related problems.
Defined incident escalation policies using the status page to calibrate the urgency of our response to the severity of incidents.
Defined “on-call” procedures so our team knows when and how to be more responsive to outages.
Time-boxed our alert response process to avoid accidentally falling down rabbit holes for non-urgent problems.
Created a more reliable process for responding to incidents and writing incident reports.

Looking forward #

After this push around infrastructure reliability, we’re significantly more confident in the stability and transparency of our community hub infrastructure. This will deliver better service for our member communities and free up more of our time to engage with them instead of fighting infrastructure fires.

We will continue to improve our infrastructure, and have a better foundation to do so incrementally in the coming quarters. Here are a few things we’d still like to improve:

We still need to improve how reliably we complete follow-up actions from incidents (e.g., writing incident reports). When a process doesn’t fit into planning & scoping ceremonies, we struggle to follow it consistently.
We’d like to improve our testing framework for major upgrades across all hubs (e.g., Kubernetes version upgrades) to catch bugs before communities do.

Faster reporting of user home directory sizes

Tue, 09 Dec 2025 00:00:00 +0000

Storage quotas help users avoid running out of space unexpectedly and give administrators visibility into capacity planning. However, storage usage can change rapidly, and administrators need timely information to know whether users are close to hitting their limits.

We’ve improved how quickly hub administrators can see user home directory sizes across our JupyterHubs. This makes monitoring more responsive and adds quota limit visibility that wasn’t possible before.

Using `jupyterhub-home-nfs` for near-instant disk usage metrics #

Our existing storage monitoring tool, prometheus-dirsize-exporter, deliberately runs slowly to avoid excessive disk I/O. This meant home directory metrics could be hours out of date on systems with many users or large directories. Plus, there was no way to report user quota limits at all.

Our home directory storage is managed by jupyterhub-home-nfs, which enforces per-user quotas. It could also expose usage and limit information as Prometheus metrics using data from the underlying filesystem quota system. Because this information is already tracked by the filesystem, it’s available immediately without scanning individual files.

We made two key improvements:

Make disk usage reporting almost instantaneous. We made jupyterhub-home-nfs export total_size_bytes and hard_limit_bytes metrics to Prometheus for near-instant reporting. We used the same metric names and namespace as prometheus-dirsize-exporter for compatibility. See 2i2c-org/jupyterhub-home-nfs#76
Allow this to be used upstream in JupyterHub Grafana Dashboards so that it can support both types of disk usage reporting. This means users of the upstream JupyterHub Grafana dashboards get the same useful view about home directory usage, regardless of whether the metric comes from prometheus-dirsize-exporter or jupyterhub-home-nfs. See 2i2c-org/prometheus-dirsize-exporter#29
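
For readers unfamiliar with how such metrics reach Prometheus, the sketch below renders usage and quota values in the Prometheus text exposition format. It is not the code from jupyterhub-home-nfs: the `dirsize` prefix and the `directory` label are assumptions made for illustration, and label values are not escaped.

```python
def render_metrics(
    usage: dict[str, tuple[int, int]],
    namespace: str = "dirsize",  # assumed prefix, for illustration only
) -> str:
    """Render per-directory usage and quota as Prometheus text exposition format.

    `usage` maps a directory name to (total_size_bytes, hard_limit_bytes).
    The `directory` label name is an assumption made for this sketch.
    """
    lines = []
    for metric, index in (("total_size_bytes", 0), ("hard_limit_bytes", 1)):
        name = f"{namespace}_{metric}"
        lines.append(f"# TYPE {name} gauge")
        for directory, values in sorted(usage.items()):
            lines.append(f'{name}{{directory="{directory}"}} {values[index]}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({"alice": (1024, 2048)}), end="")
```

Because both values come from numbers the filesystem already tracks, producing this output does not require walking any files, which is why the metrics can be close to real time.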

These changes were deployed across all our communities, so administrators can now access current home directory information within minutes regardless of directory size.

Home Directory Usage dashboard showing total size metrics from jupyterhub-home-nfs and other data from prometheus-dirsize-exporter

Try it out #

2i2c member organizations can try this out now. If you have access to your hub’s Grafana instance, you can see these new metrics in the Home Directory Usage dashboard:

Open your hub’s Grafana dashboard.
Go to Dashboards -> JupyterHub Default Dashboards -> Home Directory Usage.
Check the table for up-to-date total size and quota limit values.

For more details, see our docs on filesystem and disk dashboards.

Coming next #

We’d like to build on this work to enable alerting when individual users near their disk quotas. This will make it easier to more reliably track user disk usage across a community. See this issue for tracking: 2i2c-org/infrastructure#7166
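
One possible shape for that alerting logic, sketched under assumptions: the 90% threshold and the input format are hypothetical, and the values stand in for the total_size_bytes and hard_limit_bytes metrics described earlier.

```python
def users_near_quota(
    usage: dict[str, tuple[int, int]],
    threshold: float = 0.9,  # hypothetical threshold, not a 2i2c setting
) -> list[str]:
    """Return users whose usage is at or above `threshold` of their hard limit.

    `usage` maps a username to (total_size_bytes, hard_limit_bytes).
    Users with no limit (hard_limit_bytes == 0) are skipped.
    """
    return sorted(
        user
        for user, (used, limit) in usage.items()
        if limit > 0 and used / limit >= threshold
    )


if __name__ == "__main__":
    sample = {"alice": (95, 100), "bob": (10, 100), "carol": (5, 0)}
    print(users_near_quota(sample))
```

With the sample data, only `alice` is reported: `bob` is well under the limit and `carol` has no limit set.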

Acknowledgements #

This was a directed contribution supported by NASA VEDA to enable more proactive monitoring and alerting for hub administrators.

Adding User Group Insights to Cloud Cost Dashboards with Grafana

Mon, 24 Nov 2025 00:00:00 +0000

We are excited to announce that we have extended our cloud cost dashboards to display costs filtered by user group using Grafana! This new feature allows administrators to monitor and manage cloud expenses based on user group memberships in JupyterHub.

Available for dedicated AWS clusters only (and excluding CloudBank managed accounts). Other deployments on GCP will be supported in the future.

Learn more #

Take a look at the Community Hub Guide to see what’s new
Check out the documentation of the 2i2c-org/jupyterhub-cost-monitoring project to see how it all works
Jenny presented her work on the cost monitoring system at JupyterCon 2025 earlier this month. Watch a video or look at the slides.

Give us feedback! Click here to provide feedback that will help us make this more impactful.

Acknowledgements #

Tarashish @ Development Seed for collaborating on this project with us.
NASA VEDA and the DSE Team at NASA MSFC ODSI for funding much of this work.
Kyle Lesinger from the NASA MSFC Office of Data Science and Informatics for providing valuable feedback and bug reports during development.

Enabling transparent cloud cost monitoring with user-level dashboards

Tue, 30 Sep 2025 00:00:00 +0000

We are excited to announce that dashboards to monitor cloud usage and costs at a per-user level are now available! See the cost monitoring documentation for more information.

A key goal of 2i2c is to make the cloud safe for science. By providing transparent cost monitoring, we give communities the confidence that they won’t face unexpected bills and can better understand how their usage patterns translate to cloud costs. This visibility is especially valuable in our shared platform model, where each community gets their own independent hub while benefiting from shared infrastructure expertise.

The user-level cost breakdown allows communities to identify individual usage trends and manage their resources more effectively. Communities can now see exactly how their computational work translates to cloud spending, enabling better resource planning and budget management.

Learn more #

Cost monitoring documentation

Acknowledgements #

Tarashish @ Development Seed for working on this with us.
NASA VEDA for funding much of this work.
Andy @ Openscapes, Alex @ Development Seed and Sarah @ Earthscope for giving us close feedback.

Demonstrating our infrastructure's reliability with a hub status page for our communities

Tue, 23 Sep 2025 00:00:00 +0000

One of 2i2c’s goals is to make the cloud safe for science. A big part of this is making the black box of commercial cloud infrastructure more predictable and reliable for our member communities, across our network of community hubs that all operate autonomously.

To that end, we’ve created a status page for 2i2c’s network of community hubs. This is a source of truth that provides a high-level picture of the stability of our infrastructure, lets a community know if their hub is experiencing a problem, and gives us a heads up when things aren’t working as expected. You can check it out at:

👉 status.2i2c.org

The 2i2c Status Page gives communities a high-level view of the uptime for our entire network of community hubs.

While we make status more visible, we’re also streamlining our incident response processes in order to more quickly respond to outages when they occur (ideally, before a community has even noticed!).

There are still plenty of improvements we’d like to make: for example, we’re focusing on major outages right now, but would like to extend some level of reporting for degraded service, like unexpectedly slow start times.

Learn more #

👉 The status page
👉 The status page documentation
👉 Our new process for incident response
👉 Follow an in-progress initiative to improve the reliability of our infrastructure

Reducing base infrastructure costs on AWS with smarter instance types

Wed, 17 Sep 2025 00:00:00 +0000

We’ve been working to reduce the base costs of running our cloud infrastructure on AWS by switching to more efficient instance types for our core nodes. This is the core infrastructure we use to ensure hubs are “always available” for users, even when no one is actively using a hub. By moving from older r5.xlarge instances to newer, more efficient r8i-flex.large instances, we’ve significantly reduced these baseline costs while maintaining the same level of service. Here’s a plot of daily savings for the GeoJupyter community.

The graph above shows the impact on EC2 node costs specifically (this doesn’t include the entire cost of always-on infrastructure, but represents a significant portion). We are rolling out this change to all new clusters, and starting to work through our pre-existing AWS clusters.

Incident report: UC Merced user throttling during class startup

Tue, 16 Sep 2025 00:00:00 +0000

On August 29, 2025 our cloud infrastructure team experienced an incident with the UC Merced community hub when students tried to log in simultaneously at the start of class. For more detailed technical information about this incident, see our full incident report.

What happened #

Students experienced issues when trying to log in to the hub at the same time during the start of class.
The concurrent spawn limit was reached quickly due to the large number of users starting up simultaneously.
New nodes had to be brought up by the autoscaler, which took roughly 10 minutes from start to end.
Users who tried again after 1 minute weren’t guaranteed to get their servers started immediately since new nodes were still spinning up.
This was an “expected” scale-up event but the lack of clear messaging caused users to interpret it as instability.

What we learned #

We need better communication so users understand when infrastructure slowness is “expected” vs. “unstable”.
We need better alerting for concurrent user startup throttling - we found out about this issue from users rather than automated monitoring.
We learned that JupyterHub’s metrics don’t properly expose 429 status codes in our dashboards.
This will happen again if we don’t have proper scaling limits and node provisioning strategies for sudden user influxes.

Resolution #

We implemented several fixes:

Increased the concurrent spawn limit from 64 to 100.
Put UC Merced users on larger nodes to reduce the number of node spinups needed. This will cost more in cloud spend but result in fewer scale-up events.
Created action items to improve logging, alerting, and monitoring for similar incidents.
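
The effect of the first two fixes can be illustrated with a deliberately simplified model. The class size and users-per-node figures below are hypothetical, and the model assumes that no spawn finishes during the initial burst, so it overstates throttling compared with a real hub.

```python
import math


def throttled_users(simultaneous_logins: int, concurrent_spawn_limit: int) -> int:
    """Users who would be asked to retry when everyone logs in at once."""
    return max(0, simultaneous_logins - concurrent_spawn_limit)


def nodes_needed(users: int, users_per_node: int) -> int:
    """Nodes the autoscaler must provide to fit `users` servers."""
    return math.ceil(users / users_per_node)


if __name__ == "__main__":
    class_size = 120  # hypothetical number of students logging in together
    for limit in (64, 100):
        print(f"limit={limit}: {throttled_users(class_size, limit)} users throttled")
    for per_node in (10, 40):  # hypothetical packing on smaller vs. larger nodes
        print(f"{per_node} users/node: {nodes_needed(class_size, per_node)} nodes")
```

Raising the limit shrinks the number of users who are told to retry, and larger nodes mean fewer scale-up events for the same class, at the cost of paying for bigger machines.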

Acknowledgements #

Thanks to UC Merced students and instructors for reporting the issue through our support system.

Our Product and Service goals for Q3 2025

Tue, 22 Jul 2025 00:00:00 +0000

As we enter Q3 2025, our focus remains on enabling better cost controls for our communities and increasing flexibility for end-users. In line with our commitment to transparency, we’re sharing our platform and service objectives for the quarter and inviting feedback to ensure our direction reflects what matters most to the communities we serve. See our product goals from the previous quarter here.

The themes below offer a high-level snapshot of where we aim to evolve our offerings in the coming months.

⭐ Connect with us

Give us feedback about our direction and how it can improve.

Fund parts of this work if you’re interested in making something happen.

Demonstrable reliability of our infrastructure #

Hub management has many moving parts, and things can go wrong. We want tighter control over the reliability of our infrastructure and better visibility into the status of our community hubs for administrators and members. We will take steps to improve alerting, uptime, and overall platform reliability, as well as review our internal incident response practices. Our goal is to improve the reliability and responsiveness of our interactive computing hubs for both administrators and users throughout their community hub lifecycle.

User-level cost monitoring and group-level usage monitoring #

We’ll build on our recent Grafana dashboard usage monitoring to add usage monitoring for user groups and begin work on user-level cost monitoring. These are steps toward group-level cost tracking, which will enable better management of team and departmental expenses, especially in large institutions.

Improving our Incident Response capability #

As a special case of doubling down on infrastructure reliability, we’re making a concerted effort to improve our incident response processes. This will allow us to respond more reliably and transparently to issues as they arise, and to give our communities confidence about the steps we’re taking to resolve them. Our goal is to improve response time and quality, and connect this with infrastructure reliability improvements.

Piloting a feature co-funding model #

This year, 2i2c has experimented with a collaborative community funding model for platform development. This is a way to share funding across communities, and to invite community champions to co-fund projects on the 2i2c roadmap. Our goal is to accelerate upstream development with funded time while minimizing costs for each community. Over the next quarter, we’ll share more details on how the program works, how communities can participate, and what we’re learning along the way.

Early candidates for community funding:

Compute quotas: Building on our user and group management foundation, we aim to let admins assign compute quotas to users or groups, increasing cost control and easing departmental budgeting.
Canvas authentication: Integrating with Canvas will enable management of users and groups within Canvas, while using the same data for usage and cost tracking.
nbgitpuller UX improvements: Enhance nbgitpuller for sharing interactive projects via a simple link, focusing on error handling, conflict resolution, and per-user authentication for content retrieval.

As we develop this initiative, we’ll need to learn how to credit community partners who co-sponsor work and align our roadmap with the interests of the upstream communities we support.

If you have suggestions for improving this process, or if any features interest your community, reach out to us to discuss joining other 2i2c communities in funding and accelerating their development.

Another update coming in Q4 #

Each quarter, we’ll share an update like this to outline our product priorities and track progress. When planning for Q4, we’ll review what we’ve accomplished and provide a community update. Stay tuned!

Meanwhile, let us know what you think about our direction. Your feedback helps us provide the best value to our communities. Thank you!

Acknowledgements #

Our strategic and organization-level work is supported by a grant from The Navigation Fund and fees from our member organizations.

Announcing `jupyterhub-groups-exporter`: monitor usage based on JupyterHub group membership with Prometheus and Grafana

Wed, 11 Jun 2025 00:00:00 +0000

Managing user groups in JupyterHub can be a challenging task, especially in environments with dynamic user bases and complex group structures. This post describes how we can leverage the latest group management features in JupyterHub, along with Prometheus and Grafana, to monitor group-level resource usage effectively.

⭐ Members of 2i2c’s community network can use this feature in their hubs by following our cost attribution documentation.

Motivation #

Hub admins have a strong impetus to monitor usage and costs by user groups because it allows them to advocate for better funding and cost recovery models based on data-driven insights. Group-level resource monitoring can help them to answer questions like:

How many people participated in our workshop group?
How much GPU compute is our power user group using?
Is our resource usage cost-effective for X group persona or Y group persona?

Current methods and workarounds include:

ring-fencing resources for specific user group personas, e.g. creating a separate hub for a workshop group or a separate Dask cluster for a power user group, which increases the admin burden of managing multiple hub instances
writing custom scripts to aggregate the per-user metrics that are already available into groups, which can be time-consuming and error-prone

JupyterHub and user groups #

Recent key developments upstream in JupyterHub for group management, such as Authenticator-managed group membership, make this a timely piece of work to tackle. For more technical details of these upstream contributions, see GitHub PRs jupyterhub/oauthenticator#735 and jupyterhub/oauthenticator#498.

Users can access JupyterHub using a variety of authentication methods. Authentication providers like GitHub have built-in user management features that allow admins to create and manage user groups. These groups can then be configured in JupyterHub to authorize access to the hub, as well as control access to certain hardware profiles.

Following the key upstream contributions above, we can leverage Authenticator-managed group membership to automatically pass user group memberships from the authentication layer to JupyterHub itself. Other services can then use JupyterHub’s REST API to retrieve user group memberships, for example to export them as Prometheus metrics.

Exporting user group memberships to Prometheus #

The jupyterhub-groups-exporter project provides a service that integrates with JupyterHub to export user group memberships as Prometheus metrics. This component is readily deployable as part of any JupyterHub instance, such as a standalone deployment or a Zero to JupyterHub deployment on Kubernetes.

The exporter provides a Gauge metric called jupyterhub_user_group_info, which contains the following labels:

namespace – the Kubernetes namespace where the JupyterHub is deployed
usergroup – the name of the user group
username – the unescaped username of the user
username_escape – the escaped username

Escaped usernames are useful because Kubernetes restricts the character set that is valid in pod labels (this restriction does not apply to pod annotations). Storing both types of usernames allows us to join escaped versions with their more human-readable unescaped usernames.

Exposing this metric as an endpoint for Prometheus to scrape allows us to query and join groups data with a range of usage metrics to gain powerful group-level insights. Here is an example PromQL query that retrieves the memory usage by user group:

sum(
  container_memory_working_set_bytes{name!="", pod=~"jupyter-.*", namespace=~"$hub_name"}
  * on (namespace, pod) group_left(annotation_hub_jupyter_org_username, usergroup)
  group(
    kube_pod_annotations{namespace=~"$hub_name", annotation_hub_jupyter_org_username=~".*", pod=~"jupyter-.*"}
  ) by (pod, namespace, annotation_hub_jupyter_org_username)
  * on (namespace, annotation_hub_jupyter_org_username) group_left(usergroup)
  group(
    label_replace(jupyterhub_user_group_info{namespace=~"$hub_name", username=~".*", usergroup=~"$user_group"},
      "annotation_hub_jupyter_org_username", "$1", "username", "(.+)")
  ) by (annotation_hub_jupyter_org_username, usergroup, namespace)
) by (usergroup, namespace)
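
If PromQL joins are unfamiliar, the same logic can be sketched in plain Python. This is only an analogy for how the query combines its three inputs, not code from the exporter or the dashboards. The pod names, usernames, and group names are made up, and the sketch assumes each user maps to a single group once the `$user_group` filter has been applied.

```python
from collections import defaultdict


def memory_by_group(
    pod_memory: dict[str, int],
    pod_user: dict[str, str],
    user_group: dict[str, str],
) -> dict[str, int]:
    """Sum per-pod memory into per-group totals.

    Mirrors the two joins in the query: pod -> username (pod annotations),
    then username -> group (jupyterhub_user_group_info). Pods without a
    matching user or group are dropped, as in an inner join.
    """
    totals: dict[str, int] = defaultdict(int)
    for pod, memory in pod_memory.items():
        user = pod_user.get(pod)
        group = user_group.get(user) if user is not None else None
        if group is not None:
            totals[group] += memory
    return dict(totals)


if __name__ == "__main__":
    print(
        memory_by_group(
            {"jupyter-alice": 100, "jupyter-bob": 50, "jupyter-carol": 25},
            {"jupyter-alice": "alice", "jupyter-bob": "bob", "jupyter-carol": "carol"},
            {"alice": "workshop", "bob": "workshop", "carol": "power-users"},
        )
    )
```

The first lookup plays the role of the kube_pod_annotations join and the second plays the role of the jupyterhub_user_group_info join; the final sum corresponds to the outer `sum(...) by (usergroup, namespace)`, with the namespace dimension left out for brevity.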

Visualizing user group resource usage with Grafana #

The PromQL query above is rather long and complex to construct! However, you can benefit from an upstream contribution to the jupyterhub/grafana-dashboards project where we have encapsulated the PromQL queries as Jsonnet code and represented them as Grafana Dashboard visualizations (also known as Grafonnet). If you have a Kubernetes cluster running JupyterHub, try deploying these Grafana Dashboards and let us know what you think!

Our particular PromQL query above is visualized in the Grafana Dashboard User Groups Diagnostics under the Memory Usage panel (see also the corresponding screenshot at the top of this post). This is equivalent to its counterpart User Diagnostics dashboard, but with resource usage visualized on a per-group level rather than a per-user level 🎉

Future work #

We have laid the foundation for joining user group data to usage metrics with Prometheus by extracting memberships from JupyterHub’s database. This unlocks potent ways in which observability systems can be extended to group-level reporting and monitoring.

Future directions for this work include:

visualising cloud cost by user group in Grafana
developing more group-level reporting and monitoring dashboards
introducing group-level resource quotas.

What do you think? How would you like to see JupyterHub’s group management features evolve? Have you tried deploying this yourself? We welcome your feedback, so feel free to open GitHub issues or contribute to the repositories mentioned in this post.

Acknowledgements #

Thanks to the JupyterHub project for their collaboration and review of this work.

Solving classes of problems, rather than just an instance of a problem (with an example)

Mon, 09 Jun 2025 00:00:00 +0000

The Problem #

Two of the communities we serve (NMFS Openscapes and CryoCloud) reported issues with starting GPU nodes on their hubs. Upon investigation, I discovered that the cluster autoscaler had suddenly stopped recognizing that GPUs were available in the cluster at all, and hence wasn’t provisioning the nodes. A restart of the cluster-autoscaler pod fixed the issue for both these communities.

An incomplete solution #

But is that the end of the story? Not if we want to provide reliable long term infrastructure to communities with minimal toil on the part of 2i2c engineers!

One of the engineering principles I’m trying to have us more intentionally and structurally embody is the idea that we don’t just fix individual instances of problems, but whole classes of problems. Fixing the immediate issue is not enough - we need to understand what class of issues was manifesting itself in this particular fashion, and fix that.

What was the class of issues we could fix here? #

Digging in, I realized that our version of cluster-autoscaler was a little behind the latest release. I presumed this was a bug in cluster-autoscaler (a restart fixed it, which suggests a bug in its internal state). To me, the class of problem here is that we were not rolling out releases to our “supporting infrastructure” fast enough. Perhaps if we were on the most recent cluster-autoscaler release, this issue would have never happened.

Additionally, this failure to scale up was reported to us by the community rather than by an automated alert. We should change that too!

Structured solutions #

We follow a two week sprint cycle, and I love the (hard won) structure it provides us. I don’t want to arbitrarily start doing work that upsets prior committed work from that structure. However, we also treat support requests seriously and try to work them into the sprint. So I timeboxed myself for one hour, and saw what I could accomplish. Turns out, a lot!

I upgraded all our support components, tested them, and rolled them out to all our communities! This included upgrading Grafana, Prometheus, nginx-ingress as well as the cluster-autoscaler. This also restarted the cluster-autoscaler across our clusters, fixing this issue for other communities (if any had it).
I re-enabled the automatic once-a-month PR for upgrading these support components. We had switched to doing them on a manual sprint cadence, but clearly that was neither fast enough nor automated enough. We will instead work these into the sprint once the bot opens the PR. Credit to Erik Sundell for initially setting this up.
I created an issue to track the alert creation, and put it in our sprint backlog.
(In an additional fifteen-minute timebox) I wrote this blog post, to communicate what we have done to both the affected communities and others.

By timeboxing myself, I didn’t upset our sprint cadence and was able to continue doing other work I had committed to in the sprint, while also fixing this class of issues to the best of my ability.

Moving forward #

While we have been implicitly trying to solve whole classes of issues rather than individual instances of an issue as a team for a while, I want us to explicitly do it from now on. Communicating this out to our communities is an important part of that, as is internal team training. This blog post is the former, and we are continually working on the latter :)

Acknowledgements #

Thanks to the Openscapes and CryoCloud communities for working with us closely on infrastructure to identify improvements like this.

Launching Jupyter Book for 2i2c Communities

Thu, 08 May 2025 00:00:00 +0000

We’re excited to announce out-of-the-box support for Jupyter Book 2 for our community members. This allows communities to create and share knowledge bases together for their community workflows. This post describes the motivation behind this new functionality, and how you can learn more about the project.

⭐ Members of 2i2c’s community network can use this feature in their hubs by following our documentation and sharing guide.

A core component of our mission to make research and education more impactful, accessible, and delightful is leveraging our unique global network of communities to make meaningful improvements to the open-source tools that power their work. Learning from one community can then provide value to our entire network, e.g., our work with PACE on speeding up their CNN model training.

Central to our communities’ work is the importance of sharing new findings, best practices, and community resources. Across our network, we have seen communities creating their own “books” that provide a home for this kind of content. Many of these books feature the concept of a “landing page” that welcomes new members, establishes an identity, and provides jumping-off points (or “calls to action”) to more detailed resources.

Until now, each community has been required to undertake this work independently. 2i2c believes that by building upon existing open-source tools like Jupyter Book 2, we can help communities focus on the content of their home, rather than spending time worrying about its appearance. To that end, we have been working on an initiative to allow communities to rapidly build interactive starter documentation and provide users with a rich, interactive, and informative onboarding experience. Through this initiative, we have:

Improved the user experience of launching into interactive compute environments from a Jupyter Book.
Built components into the Jupyter Book “book theme” for low-density landing page content like call-to-action blocks.
Extended our service to co-locate community documentation alongside community hubs (i.e., docs.hub.2i2c.cloud).

(A screenshot of the 2i2c Showcase Hub landing page, featuring a simple banner image and call-to-action.)

To take advantage of this feature, communities can use the 2i2c-org/community-docs-template to deploy a Jupyter Book site to GitHub Pages. This template demonstrates simple usage of Jupyter Book 2 for computational content and landing page creation, and establishes the necessary CD workflows for web publication. Meanwhile, 2i2c can update our domain name management to point the docs.hub.2i2c.cloud nested subdomain to the newly deployed documentation.

For more information, see our community documentation for deploying Jupyter Books.

Developing these new capabilities taught us a lot about what makes building “good” community documentation so difficult. A wide range of bespoke website-building tools and integration quirks previously made it challenging for communities to both keep documentation current with internal changes and keep up with necessary software updates. We also learned that by trading bespoke complexity for simplicity and readability, we could build a solution that scales to multiple communities, with a consequently reduced maintenance burden.

With these improvements, we have initiated a conversation about what a more unified “look and feel” for our network might entail, and how it might benefit our communities. Much more can be done to build on this first step, and we are eager to gather feedback on how to improve these features for users.

To learn more about this work, consider exploring a minimal example on our Showcase Hub, and check out our service guide. Let us know what you think!

Offering Jetstream2-powered hub support at 2i2c

Mon, 28 Apr 2025 00:00:00 +0000

When we first committed to offering Jetstream2 support at 2i2c, Jetstream2, Magnum, OpenStack, and Cluster API were all new concepts that we hadn’t used at 2i2c before. Although the initial exercise of reading about each of them independently was confusing, learning how they actually fit together was the key. This post is about Jetstream2, 2i2c persistent hub offerings, and the learning that took place in the process.

⭐ Members of 2i2c’s community network can determine their eligibility and learn about Jetstream2 in our supported cloud providers documentation. If needed, reach out to 2i2c for support.

Context #

At 2i2c, we want to be able to deploy k8s clusters on different cloud providers. In a very simplistic way, for this we use:

Infrastructure as code to describe, deploy and manage the actual physical infrastructure from the cloud providers
Cloud specific CLI to authenticate to this infrastructure
Helm to deploy and manage k8s resources onto this infrastructure
And finally kubectl to interact with all of these k8s resources

(Main tools used at 2i2c to deploy and manage k8s clusters on different cloud providers)

On cloud providers like GCP, AWS, and Azure, Kubernetes support feels like an atomic feature of the cloud provider and works out of the box. But on Jetstream2, k8s support is no longer a single atomic feature.

Jetstream2 Kubernetes support stack #

Jetstream2 is a collection of supercomputers that are part of the ACCESS cyberinfrastructure. This ACCESS infrastructure groups together supercomputers like Jetstream2 (but not limited to it) into a mesh that creates the impression of a single, virtual system that scientists can openly access and interactively use.

It offers Infrastructure as a Service (IaaS), which allows users to deploy VMs and manage environments dynamically. The piece that enables this Infrastructure as a Service feature is OpenStack.

OpenStack and Magnum #

OpenStack is an open source platform made of multiple projects that help build and manage both private and public cloud infrastructure.

For our use-case, one of the most relevant OpenStack sub-projects is Magnum. Magnum offers container orchestration engines, such as Kubernetes (but not limited to it), for deploying and managing containers.

Initially, Kubernetes support was provided through a project called HEAT. However, that proved harder to manage and maintain, and it was extremely hard to upgrade a cluster. So, they’ve migrated towards a new driver called Cluster API magnum driver, which offers a more native k8s integration.

Cluster API and CAPI helm driver #

CAPI itself is a k8s project that allows declaring k8s clusters in an easy way.

The helm driver, on the other hand, acts as a bridge between OpenStack’s Magnum and Kubernetes’ Cluster API (CAPI). Its main goal is to manage the lifecycle (create, scale, upgrade, destroy) of Kubernetes-conformant clusters using a declarative API.

In order to do this, Cluster API provides an API for being able to manage the various components of a Kubernetes cluster. This conceptually looks like a Kubernetes cluster managing other Kubernetes clusters; the former, named the ‘CAPI management cluster’, is the one providing the API for managing the latter workload clusters.

Decomposing the previous atomic feature #

(Comparison between Jetstream2 and other cloud providers when it comes to k8s support)

Magnum is part of the OpenStack tent and it’s the first layer on top of Jetstream2 towards achieving k8s support.

The CAPI helm driver is what’s offering CAPI support. This is the last piece that’s needed to link a k8s cluster down to the hardware where it’s deployed, on Jetstream2.

Challenges #

The Jetstream2-OpenStack stack is not a simple one. It’s a complex stack of technologies, and each of the connection points can be challenging to debug and fix when something doesn’t work, especially when you are one of the first to pilot this new Magnum driver setup.

So, it was expected that we would face some issues along the way. However, we were able to work around them and add Jetstream2 to our service menu. Below is a list of some of the issues that we faced:

1. We have to create Terraform resources in sequence, which takes longer, because a race condition makes concurrent nodegroup creation requests fail

bugs.launchpad.net/magnum/+bug/2097946

2. The role and labels of the nodegroups don’t get propagated to the actual nodes, so we cannot put our own labels on nodes at once

azimuth-cloud/capi-helm-charts#84

3. The node count and min node count cannot be set to 0, so each nodegroup has to have at least 1 node

bugs.launchpad.net/magnum/+bug/2098002

4. A default-worker nodegroup is created apart from the default control plane nodegroup, and we cannot delete it due to the same issue as in 2.
5. The latest CAPI helm chart version causes autoscaling to stop working in a persistent hub setup, so we had to downgrade it to a previous version

2i2c-org/infrastructure#5601

Conclusion #

The biggest plus is the people. We got support from Julian Pistorius, who helped us a lot to both fix and validate some of the behaviours we were experiencing. Going through the Jetstream2 support process was also a pleasant experience, because they were super prompt in answering and very nice.

Jetstream2 has a big plus over the other cloud providers: its openness through the ACCESS program. This is very handy for researchers and less costly than other cloud providers. Being able to offer hubs through the ACCESS program lets 2i2c make things more accessible to more researchers and more cost efficient.

Higher complexity also comes with more control over the infrastructure, which has its advantages.

Challenges aside, the experience was a nice one and the outcome was positive: 2i2c is now able to deploy both mybinder.org-like hubs as well as persistent storage hubs on Jetstream2 hardware, from the same cloud-agnostic infrastructure.

Acknowledgements #

Thanks to Project Pythia for funding and collaborating with us on this work.

Enforcing per-user storage quotas now available on GCP

Tue, 25 Feb 2025 14:18:04 +0000

Building upon our previous work developing per-user storage quotas for our AWS infrastructure, we are pleased to announce that this feature is now available for GCP-hosted hubs!

To provide this feature on this vendor, we have updated our infrastructure provisioning system to create persistent disks, and enable automatic backups of the disk for disaster recovery purposes. However, the systems we had already developed for AWS, such as jupyterhub-home-nfs and our alerting system through Prometheus Alertmanager, are vendor agnostic and work right out of the box with the new architecture!

If you would like to try this feature on your 2i2c-managed JupyterHub, please get in touch.

Acknowledgements #

This project was developed and deployed in collaboration with Tarashish Mishra from Development Seed, funded through the NASA VEDA project.

Open infrastructure for collaborative geoscience with Project Pythia: Learning how to deploy a BinderHub on Jetstream2

Wed, 12 Feb 2025 00:00:00 +0000

Project Pythia and the “Jupyter notebook obsolescence” problem #

Project Pythia provides educational resources for essential software tools that enable open, reproducible and scalable geoscience, such as the Pangeo stack of packages (Xarray, Dask, Jupyter). Their Cookbooks are crowdsourced, community-curated, and open-source collections of Jupyter notebooks that demonstrate how to use these tools for cloud-native, geoscientific workflows (see our Project Pythia Cookoff blog post). However, “Jupyter notebook obsolescence” is a common problem: tutorials that were created a few years ago may no longer work due to changes in the software ecosystem, which hampers the reproducibility of scientific results. A reproducible execution environment and the infrastructure to support it are essential for the long-term sustainability of these educational resources.

Leveraging NSF-funded cyberinfrastructure for BinderHub #

A BinderHub allows users to dynamically create custom computing environments from Binder-ready repositories containing computational notebooks and configuration files that describe the software environment required to run them. A public Binder service exists at mybinder.org (see our blog post about joining the mybinder federation 🎉) and is a successful example of how open cloud infrastructure can accommodate reproducible execution environments.

The resources available on such a public service are limited. Therefore 2i2c, together with Project Pythia, have been exploring how to deploy a BinderHub backed by larger resources from the NSF-funded cloud computing platform Jetstream2. This allows for larger simultaneous user loads, such as at workshops, as well as access to more powerful distributed and parallelized workflows required to process large geoscientific datasets, under a persistent resource allocation.

Learning how to deploy on OpenStack #

Jetstream2 uses OpenStack to manage pools of compute, storage and networking resources. For our purposes, we specifically make use of the OpenStack Magnum Cluster API driver to manage Kubernetes for our deployment.

Cluster API needs a CAPI management cluster in order to manage other Kubernetes clusters, called workload clusters. On Jetstream2, this management cluster is graciously created and operated by the Jetstream2 team, which means that the only task to worry about is creating and configuring the workload cluster.

For the workload cluster we used the OpenStack Terraform provider to define the cluster template, the cluster itself and the node groups in a reproducible way.

After the cluster infrastructure was successfully created on Jetstream2, deploying BinderHub to it was a seamless experience, no different than on the other cloud providers that we already supported, thanks to the 2i2c hub infrastructure being cloud agnostic as well.

We also learnt about some limitations of the OpenStack Magnum driver project. These were expected, given that it is a relatively recent project that is slowly being adopted, but they were all reported upstream and hopefully will soon be fixed.

Acknowledgements #

Jetstream2: Explore ACCESS allocation and Julian Pistorius for technical support
Thanks to Project Pythia for funding and collaborating with us on this work.
Andrea Zonca for preliminary work on Kubernetes deployments on Jetstream2

Towards frictionless, portable, and sustainable reproducibility with Binder

Mon, 10 Feb 2025 00:00:00 +0000

Last December I had an opportunity to discuss the current and future state of the open publishing ecosystem at a workshop hosted by HHMI¹. While 2i2c doesn’t primarily focus on “publishing” workflows, we do support communities on a journey that often leads to publishing, and we make choices about technology in our community hub platform that can support different kinds of publishing outcomes.

After listening to folks across the open science and publishing ecosystem, I noticed a common challenge:

Publishers care about reproducibility of computational narratives and the interactivity that computation can provide.
But they lack the capacity to manage computational infrastructure in a way that is flexible enough for all of their authors.

This post is a reflection on how ecosystems like Jupyter and managed community hubs could solve some of these challenges.

A community experiment to provide reproducible environments for published pre-prints #

Many of 2i2c’s communities already care about reproducibility and sharing their computational narratives. That’s one reason that we’ve been improving reproducible environment sharing with Binder, integrating Jupyter Book into our community cloud platform, and supporting the mybinder.org federation.

However, communities often want to publish rather than just share. Publishing is more structured, invites particular kinds of feedback, and requires more Quality Assurance. There’s a huge ecosystem of publishers and services that support formal publishing, and they ensure things like discoverability, long-term archivability, versioning, peer review, DOI referencing, etc.

We recently piloted running a BinderHub for Biorxiv publications with the Loren Frank lab and HHMI, and found this to be a nice proof-of-concept. While the “published article” lives on Biorxiv, the computational infrastructure and environment is provided by a BinderHub (in this case managed by 2i2c, but anybody could manage a hub in this way).

The Biorxiv Binder pilot workflow. An author used a 2i2c-managed BinderHub to generate a reproducible environment for their paper. They included a link to this environment in the abstract. Readers could click this link, and be taken to a fully interactive environment to explore the ideas in the paper and reproduce its computation.

This was a nice proof-of-concept, though I think that broader adoption of this type of workflow would require a deeper connection between publisher workflows and open source communities.

Could we enable communities to “bring their computational environments with them” when publishing? #

Currently, a community’s computational environment and data are not accessible to publishers. Could we relax this by allowing JupyterHub and Binder to be re-used by external services like publishers? This could allow a community’s hub to act like an external service for reproducibility that could be used by one of the many publishing platforms out there. It would require making improvements around a few different areas of Jupyter:

JupyterHub and BinderHub would need improved workflows around external authentication so that services could easily request kernels from a hub.
Jupyter Book and MyST would need the ability to power computation on a page from a variety of computational back-ends, potentially defined by a user (e.g., via Thebe and some UI design in Jupyter Book).
JupyterLab and other user interfaces may need improved ways for defining and sharing their environments and their content to use tools like Jupyter Book and Binder.

We’d also need integration work for the various publishers to leverage this technology for their infrastructure. This is a significant lift - a lot of publishers use very old and bespoke technology in their systems. However, there’s also hope that a subset of the publishing ecosystem is ready to try things like this.

There are many publishing organizations innovating with open source #

I learned that there’s a lot of interest in innovating around publishing workflows, as well as building on top of open source communities and standards. We don’t need the whole industry to move at once (it won’t), but we do need a critical mass of organizations who are interested in innovating. This might be possible with more publishing-focused products that integrate heavily with open source.

For example, Curvenote is a publishing and communication platform that builds heavily on top of the Jupyter and MyST ecosystems. They co-lead many of the open source projects they use in their platform. Curvenote builds largely around the MyST Markdown document engine, which means they could more easily integrate improvements around Portable Computation in the Jupyter ecosystem.

I hope that the broader publishing ecosystem moves in this direction. Because Jupyter is largely based around open standards and protocols, it should be possible for publishers to leverage the BinderHub API and the Reproducible Execution Environment Specification to integrate computation that powers their reproducible articles. This would allow a community’s members to connect their hub’s reproducible environment with each published article. Something like the figure below.

Publishers could re-use the computational environments from a community’s hub, resulting in a de-duplication of infrastructure and effort, and bridging the gap between where a community does its work, and where it submits new ideas for publication. (note these are hypothetical for now, but we think publishing platforms like these are a good starting point!)

Integrating with publishing in this way would allow communities to leverage their pre-existing infrastructure as part of their scholarly workflows. If a community had their own capacity to manage Binder infrastructure they could do so, or they could use a service provider like 2i2c to manage it for them. This would distribute the responsibility of infrastructure management to those who are in the best position to do so - the communities that do the work.
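
As a rough illustration of what "leveraging the BinderHub API" could look like from a publisher's side, the sketch below only constructs the URLs an external service would use for a GitHub repository: the `/build/gh/...` endpoint that streams build events, and the `/v2/gh/...` link a reader would click. The hub address `https://binder.example.org` and the helper names are hypothetical, and no network request is made.

```python
from urllib.parse import quote


def _github_spec(org, repo, ref):
    """URL-encode each part of an org/repo/ref spec (refs may contain slashes)."""
    return "/".join(quote(part, safe="") for part in (org, repo, ref))


def binder_build_url(hub_url, org, repo, ref):
    """URL of the BinderHub build endpoint for a GitHub repository."""
    return f"{hub_url.rstrip('/')}/build/gh/{_github_spec(org, repo, ref)}"


def binder_launch_url(hub_url, org, repo, ref):
    """User-facing launch link for the same repository."""
    return f"{hub_url.rstrip('/')}/v2/gh/{_github_spec(org, repo, ref)}"


# Hypothetical community hub; the repository is a public Binder example.
hub = "https://binder.example.org/"
build_url = binder_build_url(hub, "binder-examples", "requirements", "master")
launch_url = binder_launch_url(hub, "binder-examples", "requirements", "master")
print(build_url)
print(launch_url)
```

A publishing platform could embed the launch link next to an article, much like the link in the pilot abstract described above, while the choice of hub stays with the community.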

How could we sustain the cost of running computation for published articles? #

This raises an important question: how would you sustain services like these? Communities are already nervous about the cost of computation for their workflows. Public services like mybinder.org are free and accessible, but not scalable, nor suitable for complex or mission-critical workflows². Would community stakeholders pay for privileged access to BinderHubs that could reproduce and share their computational narratives? Would publishers be willing to pay a percentage of the cloud and management costs associated with reproduction? Could we use this to sustain a larger public service like mybinder.org?

We don’t have any answers yet but we’re keen to try. Our colleague Jim Colliander recently explored some of these ideas in a talk recorded for AGU 2024.

A talk from 2i2c team member Jim Colliander discussing the right to reproduce computational ideas, the importance of enabling frictionless reproducibility, and how we might sustain such a service.

One thing seems clear - there is an imperative to make reproducibility and interaction frictionless. This is both to ensure the scientific integrity of the work being done, and also to make computational ideas more accessible to the world. Technologies and service partnerships like these can help ensure the broader community’s right to reproduce the work of others.

Exploring Frictionless Portable Computing at 2i2c #

2i2c often plays a role in bridging user communities and open source communities through cycles of development and collaboration. Perhaps we could do the same for the publishing community. We’d like to explore some tooling improvements that lay a foundation for these workflows, and will report back on our experiments in the coming months.

If you are interested in collaborating, please reach out. We’d love to hear from organizations from the scholarly publishing community to understand where these ideas have holes or need significant new development. I’d also love feedback on sustainability models to ensure these services can be relied on as part of the publishing ecosystem. In the meantime, hopefully these ideas serve as an inspiration for what is possible, and where we might be heading with 2i2c’s service and the broader publishing ecosystem.

Acknowledgements #

HHMI for organizing and hosting this workshop.
The Jupyter Book community for their collaboration and feedback on these ideas.

¹ There’s a follow-up meeting for those who are interested.
² The costs associated with running mybinder.org have historically been shouldered by donations from organizations such as OVH, Google, GESIS, Curvenote, and now 2i2c. These donations are not guaranteed, and do not scale directly with the number of users.

Announcing backups for GCP-hosted hubs!

Fri, 07 Feb 2025 13:08:22 +0000

2i2c are pleased to announce the development and deployment of automated backups of home directories on GCP-hosted hubs!

We have developed the gcp-filestore-backups project that regularly creates backups of JupyterHub home directories for disaster recovery purposes. The project is a Python wrapper around the gcloud tool to regularly request backups be made of the Filestore hosting JupyterHub’s user home directories, by default on a daily basis. The script also manages retention of these backups by checking how recently the last backup was made, and the age of existing backups, by default deleting any backup older than 5 days.

Having these backups enabled means that, in the unlikely and unfortunate case of data loss or corruption, we can reinstate the home directories of the hub to a relatively recent state that is at a maximum of 1 day prior to the incident.

We have deployed gcp-filestore-backups to all our GCP hubs presently running, with a retention period of 2 days. If you would like to discuss this further with us, please get in touch!
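
The create-and-prune decision described above can be sketched in a few lines of Python. This is only an illustration of the logic under stated assumptions, not the project's actual code: the `plan_backups` function, its parameters, and the sample timestamps are hypothetical, and the calls to `gcloud` that would create or delete backups are left out.

```python
from datetime import datetime, timedelta


def plan_backups(backup_times, now,
                 backup_every=timedelta(days=1),
                 retention=timedelta(days=5)):
    """Decide whether a new backup is due and which existing ones to delete.

    backup_times: creation times of the existing backups.
    Returns (create_new, to_delete).
    """
    latest = max(backup_times, default=None)
    # A new backup is due if none exists or the latest one is old enough.
    create_new = latest is None or now - latest >= backup_every
    # Backups older than the retention period are marked for deletion.
    to_delete = sorted(t for t in backup_times if now - t > retention)
    return create_new, to_delete


now = datetime(2025, 2, 7, 12, 0)
existing = [now - timedelta(days=d) for d in (0.5, 2, 6)]

create_new, to_delete = plan_backups(existing, now)
print(create_new, to_delete)
```

With the sample data, the most recent backup is only half a day old, so no new backup is requested, and only the six-day-old backup exceeds the default five-day retention.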

As ever, this project has been developed openly in line with our Right to Replicate so you can deploy it against your own infrastructure!

Enforcing per-user storage quotas with `jupyterhub-home-nfs`

Tue, 28 Jan 2025 09:57:28 +0000

When sharing a storage disk between users, as is usually the case in a JupyterHub deployment, it is important to put in guardrails so that one user cannot use up the whole storage capacity at the expense of the other users. To this end, 2i2c, in close collaboration with Development Seed, have developed the jupyterhub-home-nfs project, a Helm chart that permits enforcing per-user quotas on the storage space.

Note that this feature is currently available to AWS hosted hubs only and will be rolled out to other cloud providers in the future.

Under the hood, the Helm chart runs NFS Ganesha as an in-cluster NFS server, backed by XFS as the underlying filesystem. Storage quota is enforced through XFS’s native quota management utility xfs_quota.
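
For readers unfamiliar with `xfs_quota`, the sketch below builds (but does not run) the kind of command that sets a hard block limit. It assumes XFS project quotas with one project per home directory, which is an assumption for illustration and not taken from the chart's source; the helper name, mount point, project name, and limit are all hypothetical.

```python
def xfs_quota_limit_command(mount_point, project, hard_limit):
    """Build an xfs_quota command that sets a hard block limit for a project.

    -x enables expert mode (required for the limit command) and
    -c passes the command to run; nothing is executed here.
    """
    return [
        "xfs_quota",
        "-x",
        "-c",
        f"limit -p bhard={hard_limit} {project}",
        mount_point,
    ]


# Hypothetical values: a 10 GiB limit for the project named "alice".
command = xfs_quota_limit_command("/export", "alice", "10g")
print(" ".join(command))
```

Keeping the command as a list of arguments makes it straightforward to pass to `subprocess.run` on a host where the XFS filesystem is mounted with project quotas enabled.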

Since this feature moves our infrastructure away from managed filesystems (such as AWS’s Elastic File System) that cannot support per-user storage quotas, we have also developed monitoring and alerting mechanisms that will let us know when the disks are getting full, and automated back-ups for disaster recovery.

If you would like to try this on your 2i2c-managed hub, please get in touch.

This project can also be used with any Kubernetes-based JupyterHub, as per our Right to Replicate policy, so please try it out on your own deployment and let us know what you think!

Acknowledgements #

This project was developed and deployed in collaboration with Tarashish Mishra from Development Seed, funded through the NASA VEDA project.

2i2c hubs now run JupyterHub 5.0

Fri, 17 Jan 2025 00:00:00 +0000

We are excited to announce that all 2i2c hubs now run JupyterHub 5.0!

This is an upgrade that brings some exciting new features and improvements. Some of the highlights include:

The possibility to enable user-initiated server sharing
Authenticator-managed roles

Also, JupyterHub 5 will enable us to offer per-group shared directories in the future! See the tracking issue.

Check out the JupyterHub 5.0 migration docs or the changelog for more details.

`frx-challenges`: A new tool to host data challenges for Frictionless Research Exchanges

Fri, 06 Dec 2024 00:00:00 +0000

2i2c is pleased to announce the frx-challenges project, a new open source tool to help communities host data challenges on shared infrastructure:

2i2c-org/frx-challenges

This project aims to make it easier for administrators to provide a service that enables users to submit code and data that are evaluated on secure infrastructure with access to private data and resources. It also provides a leaderboard that helps users compare their performance against others.

An example leaderboard for a data challenge, taken from the Cellmap Challenge. Users make submissions that are run against secure and private infrastructure and data, and the service provides feedback about each submission’s performance. Learn more about the FRX challenges project here: 2i2c.org/frx-challenges/

It is designed to be lightweight and flexible, and can be run on a variety of shared infrastructure. For those who wish to run this project on cloud infrastructure, we’ve also published a Helm Chart to help you deploy frx-challenges with Kubernetes.

While it can be run on its own, we believe that it naturally complements other tools and services for interactive computing and data, such as JupyterHub, Jupyter Book, and Binder. More on that below.

Below is a brief description of the motivation behind this project.

What are Frictionless Research Exchanges #

The project is heavily inspired by David Donoho’s vision of Frictionless Research Exchanges (FRX) as described in Data Science at the Singularity.

In this article, Donoho describes three key pillars for Frictionless Research Exchanges:

The three initiatives are related but separate; and all three have to come together, and in a particularly strong way, to provide the conditions for the new era. Here they are:

[FR-1: Data] datafication of everything, with a culture of research data sharing. One can now find datasets publicly available online on a bewildering variety of topics, from chest x-rays to cosmic microwave background measurements to uber routes to geospatial crop identifications.

[FR-2: Re-execution] research code sharing including the ability to exactly re-execute the same complete workflow by different researchers.

[FR-3: Challenges] adopting challenge problems as a new paradigm powering scientific research. The paradigm includes: a shared public dataset, a prescribed and quantified task performance metric, a set of enrolled competitors seeking to outperform each other on the task, and a public leaderboard. Thousands of such challenges with millions of entries have now taken place, across many fields.

We considered the landscape of tools and services, and felt that [FR-1] and [FR-2] were already well-served by a variety of tools and services for community workspace infrastructure (e.g., JupyterHub: jupyterhub.readthedocs.io), sharable computational environments (e.g., BinderHub: binderhub.readthedocs.io), authoring and reading computational narratives (e.g., Jupyter Book: jupyterbook.org and MyST: mystmd.org), and data I/O tools and standards (e.g., Zarr: zarr.readthedocs.io and Intake: intake.readthedocs.io).

However, there was a natural missing piece for [FR-3: Challenges], and we could not identify any community-managed infrastructure that facilitated data challenges. Filling this gap is the goal of frx-challenges.

Why facilitate data challenges? #

Data challenges are harder than you think! While it is simple enough to run somebody else’s code locally, data challenges require a systematic, secure, and automated approach to accepting and evaluating submissions in a fair and repeatable way. Here are some of the big challenges to tackle:

Submissions must retain user and team identity, which means that we must keep track of users and their submissions over time, since data challenges are designed to encourage iterative improvement and optimization.
Evaluations must use potentially complex resources and data since many data challenges operate by publicly sharing a small dataset, and then running it against a much more complex dataset.
Evaluations must be totally secure, so that submissions can’t do nefarious things like mine cryptocurrency or extract the challenge’s private data in unintended ways.
Evaluations must be automated, so that running the challenge does not require extensive human intervention and can scale to many users.
Evaluations must be flexible, so that the infrastructure can accept a variety of types of submissions (e.g. code, data, model weights, etc.), run them with arbitrary environments designed by the organizers, and run them with the right hardware to get the job done.

These are just a few of the major challenges that we’ve tried to address with frx-challenges, and we’re excited to see how it goes with our first assisted community challenge: the Cellmap Challenge.

If you’re interested in learning more or participating in this project, follow along at its GitHub repository:

2i2c-org/frx-challenges

This is still the very early stages of the project, and we imagine it will evolve significantly. We welcome feedback for how it can more effectively serve a variety of communities.

Acknowledgements #

Thanks to the Howard Hughes Medical Institute (HHMI) for collaborating with us on the Cellmap Challenge, which led to the creation of this project.

Thanks to Kristen Ratan and Strategies for Open Science (Stratos) for enabling this collaboration, and providing strategic guidance and support.

Improving the logged in home page experience in JupyterHub with `jupyterhub-fancy-profiles`

Mon, 18 Nov 2024 12:55:20 -0800

On most research-oriented JupyterHub installations, users would like to customize their server (the environment, resources available, etc.) after logging in. In Kubernetes-based JupyterHub environments, a profile list provides this functionality.

(Profile List for the NASA VEDA JupyterHub with the default implementation from KubeSpawner)

The profile list is the de facto “logged in homepage” for these users, as that is what they see after they have logged in.
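
For readers who have not seen one, here is a minimal sketch of a profile list in the shape that KubeSpawner's `profile_list` setting accepts. The profile names, resource sizes, and image are made-up examples rather than the NASA VEDA hub's configuration, and `default_profile` is a hypothetical helper that only illustrates how the `default` flag is meant to be read.

```python
# Illustrative profile list in the shape KubeSpawner's `profile_list` expects.
# All names, sizes, and the image below are example values (assumptions).
profile_list = [
    {
        "display_name": "Small",
        "slug": "small",
        "description": "2 GB RAM, 1 CPU. Suitable for notebooks and light analysis.",
        "default": True,
        "kubespawner_override": {
            "mem_limit": "2G",
            "cpu_limit": 1,
            "image": "quay.io/jupyter/scipy-notebook:latest",
        },
    },
    {
        "display_name": "Large",
        "slug": "large",
        "description": "16 GB RAM, 4 CPUs. For heavier workloads.",
        "kubespawner_override": {
            "mem_limit": "16G",
            "cpu_limit": 4,
            "image": "quay.io/jupyter/scipy-notebook:latest",
        },
    },
]


def default_profile(profiles):
    """Hypothetical helper: return the profile marked as default, else the first."""
    for profile in profiles:
        if profile.get("default"):
            return profile
    return profiles[0]


# In a jupyterhub_config.py file, the list would be assigned with:
#   c.KubeSpawner.profile_list = profile_list
print(default_profile(profile_list)["display_name"])
```

Each entry becomes one choice on the page users see after logging in, and the values under `kubespawner_override` are applied to the user's server when that choice is selected.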

In collaboration with Development Seed, funded by our earlier grant from GESIS as well as the NASA VEDA project, we have been building the jupyterhub-fancy-profiles project to improve this experience.

(Profile List for the NASA VEDA JupyterHub with jupyterhub-fancy-profiles)

Last week, we rolled this new experience out to all 2i2c managed JupyterHubs! Here’s a quick rundown of what this enables:

Descriptions for choices in the dropdowns, making it much easier for users to know what they are getting with each environment (or resource selection).
Fully backwards compatible with the existing KubeSpawner profile list implementation. In our PR to roll this out to all hubs, you will notice that we didn’t have to change the structure of any profile lists! So you can safely roll this out to your hubs too without needing to fundamentally change how your profiles are set up.
It is a modern web app (built with React), just like the JupyterHub admin panel. This allows us to evolve and satisfy user needs much faster, and expands the pool of people who can contribute to the project!
Support for dynamically building images using mybinder.org style repositories! It talks to the BinderHub API so users can build reproducible environments as they wish, without admin involvement and without needing to fully understand how docker and containers work. Our earlier blog post has more information.

This is just the start, and thanks to ongoing funding from the NASA VEDA project, we are going to continue making improvements to this experience.

Use this in your JupyterHub #

As with everything we build at 2i2c (per our right to replicate policy), this project can be used with any JupyterHub installation that uses Kubernetes. There are instructions in the README. Please try it out on yours and let us know what you think!

Credit #

The project was initiated with funding generously provided by GESIS (see our earlier blog post).
Sanjay Bhangar and Oliver Roick from Development Seed for advocating for this project and contributing heavily to it.
The NASA VEDA project (in particular, Brian Freitag and Alex Mandel), for continued funding (in the form of engineering time) plus being early adopters!

Track and manage cloud costs using Grafana

Fri, 15 Nov 2024 00:00:00 +0000

Grafana dashboard showing cloud costs broken down by compute, storage and other components for the Openscapes hub.

We are pleased to unveil a new feature to track cloud costs within our Grafana dashboards! Community Champions can now monitor the cost and usage of their 2i2c-managed hubs through a dashboard that displays up-to-date aggregated costs as well as detailed breakdowns for more granular reports.

Note that this feature is currently available to AWS-hosted hubs only and will be rolled out to other cloud providers in the future.

Accessing the cloud cost dashboard #

Community Champions can view the Cloud Cost dashboard from their Grafana instance (please see the Service Guide for how to gain access).

From the main menu of Grafana, navigate to Dashboards > Cloud cost dashboards > Cloud cost attribution to view the dashboard.

Understanding the cloud cost dashboard #

A typical 2i2c-managed deployment comprises a staging hub and a production hub, although some communities may have extra hubs such as a workshop hub. By default, costs are not broken down on a per-hub basis unless the community has opted in to this feature.

The dashboard is made of several panels:

Daily costs
Daily costs per hub (opt-in only)
Total daily costs per component
Daily costs per component per hub (opt-in only)

For more detailed information on the data that each panel displays, please consult our Service Guide for reference.

The dashboard can be shared with other community members and stakeholders so they can understand usage and cost patterns. Community Champions can export data to a CSV file, or they can generate a snapshot of the Grafana dashboard and share a public link.

For instructions on how to export data from the dashboard, please see our Service Guide for reference.

Next steps #

We would love to know whether this feature is useful and how it can be improved. We will be contacting individual communities to ask for feedback, so please share your thoughts with us!

We will work on rolling out this service to GCP-hosted clusters in the future. Stay tuned to find out when this feature is available to your community.

Acknowledgements #

Thank you to Erik for spearheading the rollout effort and to the rest of the 2i2c team for their support.
Thanks to Openscapes and Cryocloud communities for providing valuable insights during the prototyping and testing phase, and for funding part of this work.

Low storage alerting for the UToronto cluster

Fri, 14 Jun 2024 00:00:00 +0000

The UToronto hub landing page

2i2c has operated the University of Toronto hub since 2021, and this hub supports over 6000 educators and learners in a day! With a community of this size, file storage can quickly grow out of control and cause issues.

The 2i2c engineering team has implemented a low storage alerting system for Microsoft Azure, so that they can take remedial action before the filesystem runs out of disk space.

Great job team 🚀

UToronto hub usage

Security report for jupyter-server-proxy: CVE-2024-28179

Tue, 19 Mar 2024 00:00:00 +0000

What happened? #

A few weeks ago, the JupyterHub team discovered a security vulnerability in the jupyter-server-proxy package that would allow potential unauthenticated access to a JupyterHub via WebSockets, allowing unauthenticated users to run arbitrary code on the JupyterHub. jupyter-server-proxy is used by many communities to provide alternative user interfaces like RStudio and remote desktops.

This vulnerability was detected by the JupyterHub team, with leadership from 2i2c’s engineers. It was resolved through upstream contributions to the JupyterHub project, and we have deployed a fix that mitigates this vulnerability for all the hubs 2i2c manages.

Does this impact my 2i2c community hub? #

We do not believe that any of 2i2c’s communities were impacted by this vulnerability, and a patch has now been pushed to all community hubs to resolve this issue.

If your community was vulnerable to this problem, you might experience slightly slower startup latency while we work out a long-term solution.

Since this is a vulnerability in the docker image used by our communities, we will be reaching out over the next few weeks to put a more permanent fix in place.

Where can I learn more? #

See the JupyterHub security advisory for CVE-2024-28179 for more information about the security vulnerability, including details on the mitigation we have put in place to protect our communities.

Conclusion #

We’re grateful that the JupyterHub community was quick to acknowledge, respond, and resolve this security vulnerability after it was brought to their attention. We’re also proud that 2i2c’s engineers helped the JupyterHub team throughout the process.

This allowed our team to resolve the problem before it impacted any of 2i2c’s communities. Because 2i2c community infrastructure is managed in a central location, we were able to resolve this for over 80 communities with a single team rather than expecting each community to learn about and fix this problem on their own.

We also believe this reflects the healthy upstream relationships that we hope to encourage with our team’s Open Source strategy and practices. By working with the JupyterHub community and pushing changes upstream, we’ve resolved this issue for any user of jupyter-server-proxy, not just 2i2c’s own ecosystem. In particular, because of 2i2c’s position running hubs for many communities via Kubernetes, we were able to identify a solution that did not require every user image to be updated (as described in section For JupyterHub admins of Z2JH installations).

We believe that all of these lead to a healthier, safer ecosystem of open source tools ❤️.

Integrating BinderHub with JupyterHub: Empowering users to manage their own environments

Wed, 03 Jan 2024 16:56:14 -0800

Thanks to Arnim Bleier, Jenny Wong, Georgiana Elena, Damián Avila, Jim Colliander and James Munroe for contributing to this blog post

mybinder.org is a very popular service that allows end users to specify and share the environment (languages, packages, etc) required for their notebooks to run correctly by placing configuration files they are already familiar with (like requirements.txt or environment.yml) along with their notebooks. While not without its own set of challenges, this is extremely powerful because it puts control of the environment in the hands of the people who write the code. They can customize the environment to fit the needs of their code, instead of having to fit their code into the environment that admins have made available.

But mybinder.org (and the BinderHub software that powers it) is built for sharing your work after you are done with it, not for actively doing work. BinderHubs often do not have persistent storage or persistent user identity, and the UX is centered around ephemeral interactivity that can be shared with others (via a link), rather than persistent interactivity that a single user repeatedly comes back to. JupyterHub is more commonly used for this kind of workflow, but doesn’t currently have the ability for users to easily build their own environments. Admins who are running the JupyterHub can make multiple environments available for users to choose from, but this still puts admins in the critical path for environment customization.

Our collaboration with GESIS, NFDI4DS, and CESSDA aims to bring this flexibility to JupyterHub directly. We aim to empower users to decide for themselves which applications and dependencies are installed on a per-project basis. Our work enables communities with heterogeneous requirements to share a single Hub. Our approach frees administrators from being overwhelmed by installation requests and transforms JupyterHub into a platform for collaborative computational reproducibility. In this update, we report on our progress and upcoming steps in this project.

What does a BinderHub do, exactly? #

It is helpful to understand that BinderHub primarily has 3 responsibilities:

Present a UI to the end user for them to provide details on what to build (this is what you see when you go to mybinder.org)
Call out to repo2docker in a scalable way to actually build and push an image containing the environment for the given repository, and show the user logs as this build process happens. This also allows users to debug issues with their build more easily.
Talk to a JupyterHub instance to launch a user server with the built docker image, and redirect the user to this.

(2) is really the core feature of BinderHub, and we settled on figuring out how to make that available to JupyterHub users. It was really important to us that this was also done in a way that can be sustainably used by everyone, not just 2i2c. This blog post discusses the various improvements to the broad ecosystem of projects in the Jupyter ecosystem to get this done.
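
To make responsibility (2) more concrete, here is a rough sketch of how a client might follow a build through BinderHub's build endpoint, which streams progress as server-sent events. The hub URL, the sample events, and the helper names are illustrative assumptions. The `imageName` field on the `built` event is also an assumption about the event payload, so check it against the BinderHub version you are running.

```python
import json
from urllib.parse import quote


def build_url(binderhub_url, provider, spec):
    """Build the URL of the build endpoint for a repository spec."""
    return f"{binderhub_url.rstrip('/')}/build/{provider}/{quote(spec)}"


def parse_events(lines):
    """Yield decoded JSON events from the `data:` lines of an event stream."""
    for line in lines:
        line = line.strip()
        # Blank lines and `:` comment lines (keepalives) carry no event data.
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])


def final_image(events):
    """Return the built image name, or raise if the build reports failure."""
    for event in events:
        phase = event.get("phase")
        if phase == "failed":
            raise RuntimeError(event.get("message", "build failed"))
        if phase == "built":
            # Assumption: the `built` event carries the image under `imageName`.
            return event.get("imageName")
    return None


# Made-up sample of what a stream might look like (not real BinderHub output).
SAMPLE_STREAM = [
    'data: {"phase": "waiting", "message": "Waiting for build to start"}',
    ": keepalive",
    'data: {"phase": "building", "message": "Step 1/10"}',
    'data: {"phase": "built", "imageName": "registry.example.org/repo:abc123"}',
]

print(build_url("https://binder.example.org/", "gh", "owner/repo/HEAD"))
print(final_image(parse_events(SAMPLE_STREAM)))
```

A real client would read the lines from an HTTP response instead of a list, but the parsing and the phase handling would look much the same.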

Demo #

But first, a very quick demo of what this looks like right now!

This is very much a work in progress, but the basic flow can be seen clearly. Users see a Server Options menu after they log into JupyterHub. They can specify the two primary things that determine the server configuration:

The resources allocated (RAM, CPU and maybe GPU)
The environment (container image) used, which can be specified in one of 3 ways:

a. A pre-selected list of environments (container images), provided by the administrators who set up this JupyterHub
b. A blank text box where users can enter any publicly available docker image they want
c. A mybinder.org style way to specify a GitHub repository, which will then be dynamically built into a docker image for the user!

So what did we need to do to accomplish this, in a way that’s very upstream friendly and usable by everyone (and not just 2i2c)?

A Standalone `binderhub-service` helm chart #

The default upstream BinderHub helm chart includes a JupyterHub as a dependency, and configures itself to be used primarily in a manner similar to mybinder.org. As the person who helped make that choice early on, I can tell you why it was made - for convenience! And it was very convenient, as it allowed us to get mybinder.org going fast. However, it makes it difficult to install a BinderHub service alongside an existing JupyterHub. To this end, we have created a standalone BinderHub helm chart, designed to be installed alongside an existing JupyterHub, so we can use it purely to build images. This allows the BinderHub instance to be used as a JupyterHub Service, which is what we want.

While this helm chart is currently under the 2i2c GitHub org, the hope is that it can eventually migrate to a jupyterhub-contrib organization (once it is created), or it can become the upstream helm chart for BinderHub if enough work can be done in BinderHub to allow it to serve use cases like mybinder.org.

As part of this work, we also added a way for BinderHub to run in API only mode, so we can fully turn off the UI and launching ability of BinderHub. This change decoupled the three responsibilities of BinderHub we discussed previously, allowing us to bring our own UI and JupyterHub. BinderHub could now be used purely for its scalable image building features, which is exactly what we want!

Sustainably extending KubeSpawner’s `profileList` #

We identified KubeSpawner’s profileList feature as the ideal location for UI to dynamically build environments (container images), making it just another ’environment choice’ people can choose, along with picking the resources their server needs. From an end-user perspective, it was also the logical place for them to specify a repository to build into an environment, as they could already choose some pre-built environments from here. They can also select other arbitrary resources they want (such as memory, GPU, etc) from here as well. From a maintainer perspective, it helps with long-term maintenance of the JupyterHub projects.

The implementation of profileList, however, was not easy to extend at this point. So this PR made it easier to extend in more complex ways, without making the implementation in KubeSpawner itself complicated. Even though this had no visible end-user effects, it was an extremely important step in allowing us to experiment with UI in a sustainable way without having to rely on upstream. These kinds of changes can sometimes be hard to sell to stakeholders but are extremely important in ensuring a continuous and sustainable relationship with upstream.

Implementing `unlisted_choice` feature in KubeSpawner #

The profileList feature was built to allow JupyterHub admins to specify an explicit list of container images the end-user can choose from. It did not have a way for any choice that was not pre-approved by the admin to be used. We needed this feature since the BinderHub API will build a new docker image for each environment the user wants, and so this can not be chosen from a pre-approved list. We had to safely add this feature to KubeSpawner in such a way that it was generally useful to everyone. Many other communities had been asking for such a feature anyway - the ability to simply ’type in’ an image and have that be used.
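
As an illustration, here is a sketch of a profile option with `unlisted_choice` enabled, plus a simplified stand-in for the validation and `{value}` substitution that happens when a user types in an image. The option contents are example values, and `resolve_unlisted_choice` is a hypothetical helper written for this post, not KubeSpawner's own code.

```python
import re

# Example `profile_options` entry (values are illustrative assumptions).
image_option = {
    "display_name": "Image",
    "choices": {
        "scipy": {
            "display_name": "SciPy notebook",
            "default": True,
            "kubespawner_override": {
                "image": "quay.io/jupyter/scipy-notebook:latest"
            },
        },
    },
    "unlisted_choice": {
        "enabled": True,
        "display_name": "Custom image",
        # Require an explicit tag, i.e. <image-name>:<tag>.
        "validation_regex": r"^.+:.+$",
        "validation_message": "Must be of the form <image-name>:<tag>",
        # `{value}` is replaced with whatever the user typed in.
        "kubespawner_override": {"image": "{value}"},
    },
}


def resolve_unlisted_choice(option, value):
    """Hypothetical helper: validate a typed-in value and fill in the override."""
    unlisted = option.get("unlisted_choice", {})
    if not unlisted.get("enabled"):
        raise ValueError("unlisted choices are not enabled for this option")
    if not re.match(unlisted["validation_regex"], value):
        raise ValueError(unlisted["validation_message"])
    return {
        key: template.format(value=value) if isinstance(template, str) else template
        for key, template in unlisted["kubespawner_override"].items()
    }


print(resolve_unlisted_choice(image_option, "pangeo/pangeo-notebook:2024.01.15"))
```

The validation regex is what keeps this safe to expose: admins decide which values are acceptable, even though they no longer enumerate every image in advance.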

NASA VEDA was one such community, so we partnered with Sanjay Bhangar from Development Seed (an organization that helps run NASA VEDA) to implement this feature. Engineers from 2i2c contributed heavily to this feature as well, and after several PRs (1, 2, 3, 4 and 5), this feature is now available for everyone to use!

A key component of doing sustainable upstream work is that every addition needs to be useful by itself for a broad group of people. This change was very helpful for many communities that wanted to allow their users the freedom to pick whatever image they want to use, regardless of whether they wanted to use dynamic image building or not. The broad interest allowed us to build a coalition with other interested parties, and get the change accepted upstream more easily!

`jupyterhub-fancy-profiles` #

Once we had all these pieces in place, it was time to actually work on the frontend UI that would allow users to build images dynamically and launch them. Since this will replace the ‘profileList’ feature, it should also allow them to select different resources (RAM, CPU, etc) as needed, as well as type in an existing image if they desire. So it was a full re-implementation of the profileList frontend.

This is ongoing now at the jupyterhub-fancy-profiles project. It is a pure frontend web application, using modern frontend tooling (React, webpack, Babel, etc.) and written in JavaScript. It’s gone through a few revisions, but the demo provided earlier in the blog post is in its current state. Because the default profileList implementation is pure HTML / CSS with very minimal JS, it is limited in what kind of UX it could have. jupyterhub-fancy-profiles aims to be very helpful even when dynamic image-building features are not enabled on a JupyterHub. We hope to roll this out to a few JupyterHubs and improve it over time based on feedback.

`@jupyterhub/binderhub-client` npm package #

While building jupyterhub-fancy-profiles, we wanted to use the same JavaScript code used by the BinderHub frontend to interact with the BinderHub API, instead of re-implementing it. However, the existing BinderHub JavaScript code was not easily consumable by external projects. We refactored the code, added tests, migrated to modern JS practices and published the @jupyterhub/binderhub-client npm package, which can be used not just by jupyterhub-fancy-profiles but by any external project for talking to the BinderHub API.

This had to be done in such a way that current BinderHub installations (such as mybinder.org) do not break. That took quite a few pull requests: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. This refactoring work was very helpful to us, and also appreciated by the broader community.

Defending against cryptojacking with `cryptnono` #

For Open Science to flourish, we need to allow access to resources without login / paywalls wherever possible. A new menace against this has been cryptojacking - where attackers use up any and all available free compute to mine cryptocurrencies. This has affected many folks on the internet, including GitHub Actions and mybinder.org, the primary public BinderHub installation. mybinder.org has some extra protections against cryptojacking that aren’t easily usable elsewhere, and this has unfortunately meant that the demo JupyterHubs we have with these features enabled have been behind a login wall. I personally believe login walls are long term antithetical to open science, and so this was an important problem to solve.

cryptnono is an open source project designed to help fight cryptojacking, and as part of this grant we ported some of this functionality out of mybinder.org specific code into cryptnono, so other deployments may also benefit from it! We also migrated to using the highly efficient eBPF subsystem of the Linux kernel, allowing for more complex heuristics to catch a much broader range of cryptomining activity. We have been slowly tweaking the config on mybinder.org, and it has proven to be very effective! This will be very helpful for anyone who wants to provide a JupyterHub (or any other computational service) without a login wall. If you are interested in using cryptnono in this fashion, please reach out to us so we can work together!

Explored pathways that were then discarded #

Here are the approaches we tried and then decided against:

repo2docker-service, a separate JupyterHub service that could only build images. As we worked on it, we realized that it was replicating a lot of features that BinderHub already has, so we pivoted to working on BinderHub directly instead.
Building off of tljh-repo2docker. While this already had a nice UI, it would be hard to port it to run on a distributed Kubernetes environment without it becoming a ‘hard fork’.

While these did slow down the implementation of the project, it has allowed us to be very confident that the methods we have chosen are long-term sustainable.

Want to try this out? #

We have a demo of this running at imagebuilding-demo.2i2c.cloud, but unfortunately as we are still fine-tuning cryptnono config, at this moment it is not open to the public. Please contact me with your GitHub account if you want access, and promise to not be a cryptominer and you shall be granted access.

Want to set this up on your own JupyterHub? There is some work-in-progress documentation, and more is being written. Drop a line in the linked pull request and we’ll be happy to help. The eventual goal is for anyone to be able to simply follow documentation and set this up for themselves.

We also have user facing documentation on using this service on docs.2i2c.org.

Future work #

This is not complete of course, and there is a lot of future work to be done.

mybinder.org also helps you distribute your content, not just the environment for your code to run in. Since JupyterHub usually comes with a persistent home directory for the user, nbgitpuller is commonly used for this purpose instead. We should explore ways to integrate nbgitpuller (and other ways to distribute content) in the future.
More thorough documentation for how you can recreate what is in the demo for yourself in your own JupyterHub installation.
Better UX for specifying images, including figuring out how to ‘save’ them for future reuse.
Better compatibility with mybinder.org, particularly in allowing other sources of environments (not just GitHub, but Zenodo, raw git repositories, etc) and URL compatibility.
Better authentication workflow between the frontend and the BinderHub API.
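
For context on the nbgitpuller point above, content is typically distributed with a link that pulls a repository into the user's home directory and then opens a file. The sketch below builds such a link. The hub URL and repository are placeholders, and `nbgitpuller_link` is a hypothetical helper rather than part of nbgitpuller itself.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Hypothetical helper: build an nbgitpuller link for a JupyterHub."""
    query = {"repo": repo_url, "branch": branch}
    if urlpath:
        # Where to send the user after the pull, e.g. a notebook in JupyterLab.
        query["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(query)}"


# Placeholder hub and repository, for illustration only.
link = nbgitpuller_link(
    "https://hub.example.org/",
    "https://github.com/example/repo",
    urlpath="lab/tree/repo/index.ipynb",
)
print(link)
```

Any future integration would need to combine a link like this with the choice of a dynamically built environment, which is the open question the first item describes.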

Credit #

All this work would not be possible without a large group of collaborators!

From 2i2c: Erik Sundell, Georgiana Elena, Yuvi, James Munroe, and Damián Avila.
The persistent BinderHub project was the direct inspiration for all this work, with particular thanks to Kenan Erdogan.
The tljh-repo2docker project, which explores similar ideas in the context of running only on a single node.
The broad JupyterHub and MyBinder.org community, particularly Simon Li and MinRK.
Funding generously provided by GESIS in cooperation with NFDI4DS (project number: 460234259) and CESSDA.
Arnim Bleier from GESIS was instrumental in making this project happen.

A QGIS desktop in the cloud with JupyterHub

Sat, 05 Aug 2023 00:00:00 +0000

The QGreenland Researcher Workshop

JupyterHub is a versatile platform that can serve a desktop with Geospatial Information Systems (GIS) software in the cloud. This was demonstrated by the QGreenland Researcher Workshop that was hosted by the NASA CryoCloud hub. The hands-on workshop trained 25-30 researchers, from Germany, India, France, Canada, Poland and the United States, on how to work with geospatial data in an open science framework.

QGreenland Overview #

QGreenland is an open-source geospatial data package designed for QGIS, a community-owned GIS platform. It focuses on Greenland, offering researchers and educators a comprehensive toolset for FAIR (findable, accessible, interoperable and reusable) data analysis. The package integrates a variety of datasets into a single, easy-to-use data-viewing and analysis platform, supporting both offline and online use. This makes it particularly valuable for remote fieldwork and areas with limited internet access.

Workshop Success #

The QGreenland workshop demonstrated several key benefits of using JupyterHub for cloud-based GIS:

Accessibility: Participants from across the world could access the same powerful GIS tools through a web browser, eliminating the need for complex local installations while enhancing reproducibility.
Cloud block storage: Using a JupyterHub in the cloud allowed for faster data access than a traditional NFS file store by provisioning each user with an elastic block store disk, reducing load times from 5 minutes to under 3 seconds.
Cost Efficiency: Utilizing the CryoCloud JupyterHub instance managed by 2i2c drastically cut down setup costs and time, with only minimal cloud operating expenses of roughly $1/person/day.

Conclusion #

The success of the QGreenland workshop underscores the potential of integrating interactive software applications in JupyterHub. This approach not only democratizes access to advanced geospatial tools but also fosters a collaborative research environment. We look forward to supporting more workshops for QGreenland in the future!

Want to know more? Check out the companion post by QGreenland on the Jupyter Blog

Acknowledgements #

Trey Stafford (CIRES)
Matthew Fisher (CIRES)
*Fisher, M., *T. Stafford, T. Moon, and A. Thurber (2023). QGreenland (v3) [software], National Snow and Ice Data Center.
Snow, Tasha, Millstein, Joanna, Scheick, Jessica, Sauthoff, Wilson, Leong, Wei Ji, Colliander, James, Pérez, Fernando, James Munroe, Felikson, Denis, Sutterley, Tyler, & Siegfried, Matthew. (2023). CryoCloud JupyterBook (2023.01.26). Zenodo. 10.5281/zenodo.7576602

* Denotes co-equal lead authorship

CILogon usage at 2i2c

Fri, 24 Feb 2023 00:00:00 +0000

About CILogon #

CILogon is an open source service provider that allows users to log in against over 4000 identity providers, including campus identity providers. The available identity providers are members of InCommon, a federation of universities and other organizations that provide single sign-on access to various resources.

CILogon and 2i2c #

For the past year, 2i2c has been successfully using CILogon for more than fifteen of the hubs it manages.

Currently, most of the hubs that use it are hubs for communities in education that want to manage their hub access through their own institutional providers.

By using a tool like CILogon, we allow hub access to be managed both through the communities’ institutional providers and through social providers like GitHub and Google. Because both authentication mechanisms can coexist, there’s no need to provide specific credentials for 2i2c staff in order to have access to the hub. This reduces both the burden on institutions’ IT departments and the complexity of a hub deployment.

Moreover, as we migrate away from our current Auth0 setup, the number of hubs using CILogon will further increase in the following year.

The setup #

The setup that 2i2c uses is based on two important tools: the CILogon administrative client and the JupyterHub CILogonOAuthenticator.

The CILogon administrative client #

The 2i2c administrative client provided by CILogon allows us to automatically manage the CILogon OAuth applications needed for authenticating into the hub.

For each hub that uses CILogon, we dynamically create an OAuth client application in CILogon and store the credentials safely, using the script at cilogon_app.py. The script can also be used for updating the callback URLs of an existing OAuth application, deleting a CILogon OAuth application when a hub is removed or changes authentication methods, getting details about an existing OAuth application, and listing all existing 2i2c CILogon OAuth applications.
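As a rough illustration of the bookkeeping such a script performs, the sketch below builds a registration payload for a single hub. It is not the contents of cilogon_app.py. The function names, the `<cluster>-<hub>` naming convention, the example domain, and the choice of payload fields are assumptions, loosely following the OAuth 2.0 dynamic client registration format (RFC 7591). The `/hub/oauth_callback` path is JupyterHub's usual OAuth callback route.

```python
import json


def callback_url(hub_domain: str) -> str:
    """Return the OAuth callback URL JupyterHub listens on for a hub domain."""
    return f"https://{hub_domain}/hub/oauth_callback"


def build_client_payload(cluster_name: str, hub_name: str, hub_domain: str) -> dict:
    """Build a registration payload for one hub (hypothetical field choices)."""
    return {
        # Naming clients "<cluster>-<hub>" is an assumed convention that makes
        # it easy to find the client again when updating or deleting it.
        "client_name": f"{cluster_name}-{hub_name}",
        "redirect_uris": [callback_url(hub_domain)],
        "grant_types": ["authorization_code"],
        # Assumed scopes; adjust to whatever claims the hub actually needs.
        "scope": "openid email org.cilogon.userinfo",
    }


if __name__ == "__main__":
    # Example values only; the domain is a placeholder.
    payload = build_client_payload("example-cluster", "staging", "staging.hub.example.org")
    print(json.dumps(payload, indent=2))
```

Updating the callback URLs of an existing application would then amount to sending a new `redirect_uris` list for the same client, and the credentials returned at creation time are what needs to be stored safely.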

The JupyterHub CILogonOAuthenticator #

For CILogon’s integration with JupyterHub’s authentication workflow, we’re using the CILogonOAuthenticator, which is part of the JupyterHub OAuthenticator project. This is what allows JupyterHub to use common OAuth providers for authentication, and it’s also a base for writing other Authenticators with any OAuth 2.0 provider.
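To make the idea of coexisting institutional and social providers more concrete, here is a minimal sketch of how a hub username could be chosen depending on which identity provider a user logged in with. It is a simplified stand-in, not the authenticator's implementation: the `ALLOWED_IDPS` mapping is only loosely modeled on the authenticator's `allowed_idps` option, the campus provider URL is a placeholder, and `derive_username` is a hypothetical helper.

```python
# Hypothetical per-provider settings, keyed by identity provider identifier.
# Each entry names the claim that should become the hub username.
ALLOWED_IDPS = {
    "http://github.com/login/oauth/authorize": {"username_claim": "username"},
    "https://idp.example.edu/shibboleth": {"username_claim": "email"},
}


def derive_username(user_info: dict) -> str:
    """Pick the hub username from the claims returned after login."""
    idp = user_info.get("idp")
    settings = ALLOWED_IDPS.get(idp)
    if settings is None:
        # Logins from providers that are not listed are rejected.
        raise PermissionError(f"Identity provider not allowed: {idp!r}")
    claim = settings["username_claim"]
    if claim not in user_info:
        raise KeyError(f"Claim {claim!r} missing from user info")
    return user_info[claim]


if __name__ == "__main__":
    print(derive_username({
        "idp": "https://idp.example.edu/shibboleth",
        "email": "student@example.edu",
    }))
```

In this sketch a community member logs in through the campus provider while 2i2c staff could log in through GitHub, and both end up with a username without any hub-specific credentials being created.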

As part of integrating the JupyterHub CILogonOAuthenticator at 2i2c, we identified and contributed some important upstream fixes and enhancements to the oauthenticator. For example, the GHSA-r7v4-jwx9-wx43 vulnerability was reported and fixed, and a migration guide was provided that describes the breaking changes and walks users step by step through updating their usage of the JupyterHub CILogonOAuthenticator.

Read more about how CILogon is set up for use at 2i2c in the docs.

Celebration #

Thanks to the 2i2c - CILogon partnership, during this past year we were able to integrate CILogon into 2i2c’s infrastructure and to observe its importance, usefulness and great support for 2i2c and the communities we serve.

We are now happy to announce that the 2i2c - CILogon partnership has been extended for another year!

Acknowledgements: The upstream jupyterhub-oauthenticator project mentioned in this post as being used at 2i2c is a JupyterHub package, kindly developed and maintained by the JupyterHub community and the 2i2c integration described was developed by the 2i2c engineering team. Also, this post was edited by Jim Basney.

Tech update: Multiple JupyterHubs, multiple clusters, one repository.

Tue, 19 Apr 2022 00:00:00 +0000

2i2c manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs from a single open infrastructure repository. This is a challenging problem, as it requires us to centralize information about a number of independent cloud services, and deploy them in an efficient and reliable manner. Our initial attempt at this had a number of inefficiencies, and we recently completed an overhaul of its configuration and deployment infrastructure.

This post is a short description of what we did and the benefit that it had. It covers the technical details and provides links to more information about our deployment setup. We hope that it helps other organizations make similar improvements to their own infrastructure.

Our problem #

2i2c’s problem is similar to that of many large organizations that have independent sub-communities within them. We must centralize the operation and configuration of JupyterHubs in order to boost our efficiency in developing and operating them, but must also treat these hubs independently because their user communities are not necessarily related, and because we want communities to be able to replicate their infrastructure on their own.

A year ago, we built the first version of our deployment infrastructure at github.com/2i2c-org/infrastructure. Over the last year of operation, we identified a number of major shortcomings:

Within a Kubernetes cluster, we deployed hubs sequentially, not in parallel. This grew out of a common practice of Canary deployments that allowed us to test changes on a staging hub before rolling them out to a production hub.
We used a single configuration file for all hubs within a cluster, which led to confusion and difficulty in identifying a hub-specific configuration.
Moreover, any change to a hub within a cluster caused a re-deploy of all hubs on that cluster. This is because we did not know whether a given change touched cluster-wide configuration or hub-specific configuration.

Our goal #

So, we spent several weeks discussing a plan to resolve these major problems - here were our goals:

We should be able to upgrade a specific hub alone, by inspecting which configuration files have been added or modified.
Production hubs should be upgraded in parallel when they are effectively run independently.
We should use staging hubs as “canary” deployments and not continue upgrading production hubs if the staging hub fails.

An overview of our changes #

To accomplish this, we needed to identify which hub required an upgrade based on file additions/modifications. This took a lot of discussion and iteration on design, and so we share it below in the hopes that it is helpful to others!

Improvements to our code and structure #

We made a few major changes to the infrastructure repository to facilitate the deployment logic described above. Here are the major changes we implemented:

We separated each hub’s configuration into its own file, or set of files. For example, here is 2i2c’s staging hub configuration.
We created a separate cluster.yaml file that holds the canonical list of hubs deployed to that cluster and the configuration file(s) associated with each one. For example, here is 2i2c’s GKE cluster configuration, which contains a reference to the previously mentioned staging hub.
We updated our deployer module to do the following things:
- Inspect the list of files modified in a Pull Request.
- From this list, calculate the names of the hubs that require an upgrade, and the names of their respective clusters.
- Trigger a GitHub Actions workflow that deploys changes in parallel for each cluster/hub pair.
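The file-to-hub mapping described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the deployer's code: the `config/clusters/<cluster>/...` layout, the `CLUSTERS` dictionary standing in for parsed cluster.yaml files, the rule that a cluster.yaml change affects every hub on the cluster, and the function name are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory stand-in for the contents of each cluster.yaml:
# cluster name -> hub name -> config files that belong to that hub.
CLUSTERS = {
    "cluster-a": {
        "staging": ["staging.values.yaml"],
        "prod": ["prod.values.yaml"],
    },
    "cluster-b": {
        "staging": ["staging.values.yaml"],
    },
}


def hubs_requiring_upgrade(changed_files, clusters=CLUSTERS, root="config/clusters"):
    """Return the sorted (cluster, hub) pairs touched by a list of changed files."""
    pairs = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        try:
            cluster_name, *rest = path.relative_to(root).parts
        except ValueError:
            continue  # The file is not inside a cluster's config directory.
        hubs = clusters.get(cluster_name)
        if hubs is None or not rest:
            continue
        relative_name = "/".join(rest)
        if relative_name == "cluster.yaml":
            # Assumption: a cluster-wide change affects every hub on the cluster.
            pairs.update((cluster_name, hub) for hub in hubs)
            continue
        for hub_name, files in hubs.items():
            if relative_name in files:
                pairs.add((cluster_name, hub_name))
    return sorted(pairs)


if __name__ == "__main__":
    changed = ["config/clusters/cluster-a/prod.values.yaml", "README.md"]
    print(hubs_requiring_upgrade(changed))
```

Because each hub's configuration lives in its own file, a change to one file maps to exactly one cluster/hub pair, which is what makes targeted, parallel deploys possible.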

In addition to these structural and code changes, we also developed new GitHub Actions workflows that control the entire process.

A GitHub Actions workflow for upgrading our JupyterHubs #

We defined a new GitHub Actions workflow that carries out the logic described above. It is defined in this deploy-hubs.yaml configuration file. Here are the major jobs in this workflow, and what each does:

generate-jobs: Generate a list of clusters/hubs that must be upgraded, given the files that are changed in a Pull Request.
- Evaluate an input list of added/modified files in a PR
- Decide if the added/modified files warrant an upgrade of a hub
- Generate a list of hubs and clusters that require upgrades, and some extra details:
  - Does the support chart that is deployed to the cluster also need an upgrade?
  - Does a staging hub on this cluster require an upgrade?
This produces two outputs to be used in subsequent steps:
- A human-readable table including information on why a given deployment requires an upgrade (using the excellent Rich library).
- JSON outputs that can be interpreted by GitHub Actions as sets of matrix jobs to run.
Our staging and support hub job matrix tells GitHub Actions to deploy staging and support upgrades that act as canaries and stop production deploys if they fail.
upgrade-support-and-staging: Update the support and staging Helm charts on each cluster. These are “shared infrastructure” Helm charts that control services that are shared across all hubs.
- Accepts the JSON list described above to determine what to do next
- Parallelises over clusters
- Upgrades the support chart of each if required
- Upgrades a staging hub for the cluster if required (for canary deployments, this is always required if at least one production hub is to be upgraded on the cluster)
filter-generate-jobs: Allows us to treat the support / staging hubs as canary deployments for all the production hubs on a cluster.
- If a staging/support hub deploy fails, removes any jobs for the corresponding cluster.
- Allows production deploys to continue on other clusters.
Our production hub job matrix tells GitHub Actions which hubs to update with new changes. These are triggered if a cluster’s staging/support job does not fail.
upgrade-prod-hubs: Deploy updates to each production hub.
- Accepts the JSON list described above to determine what to do next
- Parallelises over each production hub that requires an upgrade
- Deploys the relevant changes to that hub
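The way the job matrices and the canary filter fit together can be sketched as below. This is a simplified model, not the workflow's actual code: the function names, the dictionary keys, and the convention that the canary hub is the one named `staging` are assumptions. The `include` key is the form GitHub Actions accepts when a matrix is supplied as JSON.

```python
import json


def build_matrices(pairs):
    """Split (cluster, hub) pairs into a canary matrix and a production matrix."""
    prod_jobs = [
        {"cluster_name": cluster, "hub_name": hub}
        for cluster, hub in pairs
        if hub != "staging"  # Assumed name of the canary hub.
    ]
    # Every cluster with any hub to upgrade gets a staging (canary) job first.
    canary_jobs = [
        {"cluster_name": cluster, "upgrade_staging": True}
        for cluster in sorted({cluster for cluster, _ in pairs})
    ]
    return canary_jobs, prod_jobs


def filter_prod_jobs(prod_jobs, failed_clusters):
    """Drop production jobs on clusters whose canary deploy failed."""
    failed = set(failed_clusters)
    return [job for job in prod_jobs if job["cluster_name"] not in failed]


def as_matrix_output(name, jobs):
    """Format jobs as a `name=<json>` line that a workflow step could emit."""
    payload = json.dumps({"include": jobs})
    return f"{name}={payload}"


if __name__ == "__main__":
    pairs = [("cluster-a", "prod"), ("cluster-a", "staging"), ("cluster-b", "prod")]
    canary_jobs, prod_jobs = build_matrices(pairs)
    # Pretend the canary deploy on cluster-a failed.
    remaining = filter_prod_jobs(prod_jobs, failed_clusters=["cluster-a"])
    print(as_matrix_output("prod-hub-matrix-jobs", remaining))
```

In the example, a failed canary on one cluster removes only that cluster's production jobs, so production deploys on the other cluster still go ahead.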

Concluding Remarks #

We think that this is a nice balance of infrastructure complexity and flexibility. It allows us to separate the configuration of each hub and cluster, which makes each more maintainable by us, and is more aligned with a community’s Right to Replicate their infrastructure. It allows us to remove the interdependence of deploy jobs that do not need to be dependent, which makes our deploys more efficient. Finally, it allows us to make targeted deploys more effectively, which reduces the amount of toil and unnecessary waiting associated with each change. (It also reduces our carbon footprint by reducing unnecessary GitHub Actions time.)

We hope that this is a useful resource for others to follow if they also maintain JupyterHubs for multiple communities. If you have any ideas of how we could further improve this infrastructure, please reach out on GitHub! If you know of a community that would like 2i2c to manage a hub for them, please send us an email.

Acknowledgements: The infrastructure described in this post was developed by the 2i2c engineering team, and this post was edited by Chris Holdgraf.

