Serokell Blog

Rust, C++, and the Tradeoffs Behind Safe Low-Level Code: interview with Nikita Lisitsa

hi+briankean@serokell.co (Brian Kean) — Mon, 08 Jun 2026 00:00:00 GMT

In this post, we interview Nikita Lisitsa about C++, Rust, systems programming, and game development.

We discuss whether C++ is still the default path into systems programming, where Rust fits in, and how both languages may coexist over the next decade. Nikita also shares his perspective on memory safety, language complexity, game engine design, real-time physics simulations, renderer abstractions, and the lessons he learned from shipping Costa Verde.

Rust advertises its “if it compiles – it works” stance and its user-friendly/educational compiler messages. Should we start learning Rust as a first step to C++?

I’d say both languages should be learnt in parallel. Both Rust and C++ have a ton of quirks and idiosyncrasies specific to these languages (borrow checker in Rust, duck-typed templates in C++, etc), and I don’t think they are good prerequisites to each other. However, they share a lot, too (being low-level, RAII/Drop trait, caring about object ownership & lifetimes), which is why I think it’s a good idea to learn both at the same time.

Rust’s compiler and safety guarantees allow “learning by trial and error” in multi-threaded synchronization. Should we use Rust as training wheels while learning parallel programming?

I don’t have a strong opinion on this. Rust forces you to wrap shared data in Arc<Mutex<T>> or something like that, which does prevent data races, but also hides a lot of underlying complexity, and doesn’t prevent a lot of other problems like deadlocks. C++ has a lot of other stuff you have to care about when writing multi-threaded code. To be honest, something like Java still feels like the cleanest language to learn the basics of parallel programming, and then maybe C++ to learn the more complicated, low-level things. But for writing multi-threaded code, Rust is probably the best.

Generalising those two questions – is C++ still the best language to get into system programming?

I don’t think there is a “best” language for system programming, it really depends on what your goals are. Writing Linux drivers and writing high-load multi-threaded data-processing servers are very different tasks. C, C++, Rust, Go, and a ton of other languages can be great for system programming in a suitable context. C++ is generally a pretty good match, but sometimes the manual memory management and unexpected UB can make it not worth it.

C++ keeps accumulating features — modules, coroutines, reflection, contracts. Do you think the language is becoming too complex to use safely and teachably, or is that complexity justified?

The way I see it, there is a certain programming language spectrum between simplicity and complexity. It’s basically a balance of how much you have to keep in your head vs how much your language does for you, but you still have to keep in your head that the language does it for you. C, for example, is closer to the “simplicity” end of this spectrum, while C++ has always been on the opposite end. I personally love C++ for that: you have to learn a lot about what the language can do, but then it pays off as you use what you’ve learnt to your advantage to write simpler, more expressive code.

The current development of C++ follows that same pattern: more complexity that empowers the programmer. There have always been people who refuse to use most of C++’s features and only use some subset of the language, and it makes sense that, from their perspective, the language is becoming worse. I personally think it is justified in the context of using C++ in its fullness.

Do you think C++ is a good choice for agentic development?

Never tried agentic development, so I can’t tell :)

With all this AI stuff going on, shouldn’t memory safety/correctness be the first priority of language design?

Memory correctness is just one type of bug, and, contrary to the widespread misconception, it isn’t the most common bug we have in C++ or any other language. C++, for instance, has been evolving to make memory safety almost effortless: things like smart pointers and proper usage of RAII cover 95% of all such cases. The tooling has been evolving, too: ASAN, for instance, can find most memory-related bugs automatically. Typically, our bugs are simply errors in the program’s logic, and there’s little you can do in the language itself to prevent those. I’d say the focus should be on something like contract-based programming (or, better, proof-based languages that use dependent types / HoTT / etc), and not some specific type of common bugs like invalid memory access.

In game development/system programming, it is considered a good practice to avoid allocations at all costs. Did Rust people get it all wrong with all the fuss around ownership?

The way I see it, allocations and ownership semantics are largely orthogonal. You can use stack variables, a dedicated frame allocator, or make an entirely custom global allocator both in C++ and in Rust.

That being said, I personally believe that the “avoid allocations” mantra is a bit of a cargo cult, and probably stems from the early 2000s, when OOP was the cool way to write code, and everyone allocated everything on the heap for no reason at all, leading to degraded performance. The right thing to do is always to analyze what’s the real performance drain. In my code, I often allocate tons of temporary arrays or strings (e.g. for UI), and it turns out that this is pretty much never the performance bottleneck unless you heap-allocate individual tiny objects or something like that.

C++ is still the language of choice for game engine development. What does Rust miss, and will it ever be able to compete with C++?

I don’t think Rust misses anything in particular. It can be hard to write complicated data processing code with nontrivial inter-object dependencies and a ton of mutable state (which is what game logic often looks like) in Rust, but it should be manageable once you do it a few times and settle on some fitting patterns and data structures. Also, game development often requires a ton of experimenting and ad-hoc tweaking, and Rust’s strictness introduces extra friction.

I believe the reason most game engines still use C++ is mostly inertia. I don’t just mean that people simply don’t want a new language when the old one does the job (though this is a reason as well), but also that C++ has several decades of surrounding tooling, libraries, and platform support (afaik various game console SDKs only support C++).

Your engine seems to provide low-level graphics/platform/audio/utility infrastructure, while you often reimplement the project-specific renderer per game. Where do you draw the boundary between “engine feature” and “game-specific system,” and can you give an example of a system you first put into the engine but later decided should live in the game?

I usually draw the boundary using two criteria: how good the code quality is, and how useful and generic this system is. Ultimately, it is based on how I feel about the code. There were instances of the code migrating into the engine and back. For example, there is a fairly generic deferred renderer in the engine, but I’ve never used it in any project, as I love playing with various styles and techniques and making the rendering engine from scratch is just easier. There is a 2D physics engine as well, which I never used outside of a few toy demonstrations, again because it’s easier to make a dedicated simple physics engine tailored to a specific project.

On the other hand, the engine contains a very primitive 2D painter class that can draw lines, circles, sprites, and text. While the code quality and the API of this class are rather questionable, I’ve used it numerous times because it’s basically all you need for a simple 2D game on a game jam, which is why it’s part of the base engine as well.

In your soft-body spaceship work and water-over-terrain simulation, you repeatedly choose simplified models that preserve the gameplay-relevant behavior while avoiding full physical realism. How do you decide which invariants must be preserved, such as stability, mass conservation, energy behavior, or locality, and which physical inaccuracies are acceptable for a game?

It mostly boils down to how the physics simulation interacts with gameplay, and how costly the simulation is. In my soft-body game for Ludum Dare 53, you play as a soft squishy spaceship. Here, stability is essential, as we don’t want the spaceship to explode, while conservation of momentum and angular momentum aren’t that important. In fact, I introduce artificial friction (which, of course, doesn’t exist in actual space) to dampen momentum over time, to help stabilize the ship if it’s rotating wildly, to prevent the player from accidentally flying too far away, and to force the player to think strategically about fuel consumption when planning distant trips. The simulation itself is extremely cheap, by the way, so performance isn’t an issue here.

Water simulation is way more complicated. Typical methods involve iteratively solving large sparse linear systems, with many iterations per frame. Together with my goal of simulating a reasonably large portion of terrain with a somewhat high resolution (say, 200x200 meters with 1 meter resolution), it’s not really feasible to do that in real-time. (I could do this on the GPU, but it’s already busy doing the rendering, and it only makes things faster by some constant factor.) So, I have to drastically simplify the model in order to achieve real-time performance. There are some properties of the model that I can’t sacrifice: for example, local mass conservation is crucial to prevent small puddles or lakes from disappearing (some models I’ve tried before had this problem). Energy conservation isn’t that important, though: as with the squishy spaceship, I’ll introduce some friction anyway to make the water surface calm down over time. The method I’ve chosen (called virtual pipes) has a ton of drawbacks (for example, it lacks inertia entirely), but it fits the other criteria, and is as fast as it could be.
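Editor’s illustration: the local mass conservation property mentioned above can be shown with a tiny one-dimensional sketch. This is not the code from the game; it is written in Haskell for brevity. The names step, simulate, and the rate parameter are hypothetical. It assumes neighbouring cells are joined by “pipes”, that the flow through a pipe depends only on the current difference in surface height (so there is no inertia), and that a cell’s outflow is scaled down so it never gives away more water than it holds.

```haskell
-- One relaxation step over a 1D strip of cells (hypothetical simplification).
-- 'terrain' and 'water' hold one height per cell; 'rate' controls how quickly
-- neighbouring surface levels equalize (values up to 0.25 keep this sketch stable).
step :: Double -> [Double] -> [Double] -> [Double]
step rate terrain water = zipWith (+) water netIn
  where
    surface  = zipWith (+) terrain water
    -- Desired flow through the pipe between cell i and cell i+1
    -- (positive means water moves to the right).
    raw      = zipWith (\a b -> rate * (a - b)) surface (drop 1 surface)
    -- Outflow each cell is asked to provide through its right and left pipes.
    outRight = map (max 0) raw ++ [0]
    outLeft  = 0 : map (max 0 . negate) raw
    -- Scale a cell's outflow so it never gives away more water than it has.
    scale w o = if o > w then w / o else 1
    k        = zipWith3 (\w r l -> scale w (r + l)) water outRight outLeft
    -- The donor of a pipe is its left cell for rightward flow, else its right cell.
    limited  = zipWith3 (\q kl kr -> if q > 0 then q * kl else q * kr) raw k (drop 1 k)
    -- Whatever leaves one cell through a pipe enters its neighbour,
    -- so the total amount of water never changes.
    netIn    = zipWith (-) (0 : limited) (limited ++ [0])

-- Run n steps of the relaxation.
simulate :: Int -> Double -> [Double] -> [Double] -> [Double]
simulate n rate terrain water = iterate (step rate terrain) water !! n
```

In this sketch a puddle trapped behind a hill stays where it is: step 0.25 [0, 5, 0] [1, 0, 0] leaves the water at [1, 0, 0], because the dry hill cell has nothing to give and the puddle’s surface is below the hill.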

You have used OpenGL 3.3, ported toward WebGPU, built WebGPU demos/raytracing, and used WebGPU compute for browser simulations. If you were designing your renderer backend abstraction today, what would you expose as stable engine-level concepts, and what would you avoid abstracting because WebGPU, OpenGL, and future APIs differ too much?

I actually don’t like the idea of a “renderer backend” for my personal projects. I usually rewrite the rendering engine from scratch for each project, using some rendering API like OpenGL or WebGPU directly. There are some things that seem common across many rendering engines, like textures or 3D models, but I find that it’s hard to make them truly generic because each project has its special needs. Maybe you need instancing, maybe you need some extra vertex or instance attributes to control per-vertex wind strength, maybe you need to pass an extra per-instance color transformation matrix, etc. If you try to accommodate all possible use cases, you end up with something that basically directly mimics the underlying API with no real benefit other than satisfying one’s need for abstraction. So, in some sense, I avoid abstracting the API altogether. This means that switching APIs for a particular project can be hard, but that doesn’t happen too often. In fact, it never happened before, but I might rewrite my current project in Vulkan. The rendering code is about 10k lines, and most of them aren’t related to the underlying API at all, so I don’t expect it to be that complicated.

Costa Verde was your first commercial release, and your projects show many jam games and technical prototypes. Looking back, which technical bets paid off when shipping Costa Verde, and which engine or design choices slowed you down the most? How would those lessons change your current village-building game’s architecture, production scope, and tool priorities?

This might sound surprising, but, for the most part, releasing Costa Verde didn’t really teach me anything about the technical side of game development that I didn’t already know. There weren’t really any technical bets either: I used well-supported technologies (OpenGL 3.3, SDL2, etc) which are known to work, and, well, they did work perfectly. The game isn’t really that rendering-intensive, so I didn’t need a sophisticated graphics API.

I did learn one architectural thing, though. The game features a lot of clearly-defined object classes, like road segments, cars, buildings, traffic lights (those were removed in the final game, though), etc, and they all were mostly just stored in big arrays. I had to support arbitrary removal from these arrays, so I had to use some generation-based handles to reference the objects instead of bare indices or pointers. If you look closely, this is basically half of the implementation of a typical ECS engine (the other half being the components — adding/removing/querying/etc). Realizing that I’ve made several independent half-baked ECS implementations throughout the game’s code convinced me to implement a proper full ECS engine for the next project.
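Editor’s illustration: the generation-based handles described above can be sketched in a few lines. This is not the code from Costa Verde; it is written in Haskell for brevity and uses an IntMap where a game would use a plain array. All names (Handle, Pool, insert, remove, get) are hypothetical. The idea is that every slot carries a generation counter that is bumped on removal, so a stale handle to a reused slot no longer resolves.

```haskell
import qualified Data.IntMap.Strict as IM

-- A handle names a slot plus the generation it was created in.
data Handle = Handle { slotIndex :: Int, generation :: Int }
  deriving (Eq, Show)

-- A slot remembers its current generation and, if occupied, a value.
data Slot a = Slot { slotGen :: Int, slotValue :: Maybe a }

-- The pool: all slots ever created, plus the indices that are free for reuse.
data Pool a = Pool (IM.IntMap (Slot a)) [Int]

emptyPool :: Pool a
emptyPool = Pool IM.empty []

-- Reuse a free slot if there is one, otherwise append a fresh slot.
insert :: a -> Pool a -> (Handle, Pool a)
insert x (Pool ss (i : free)) =
  let gen = maybe 0 slotGen (IM.lookup i ss)
  in (Handle i gen, Pool (IM.insert i (Slot gen (Just x)) ss) free)
insert x (Pool ss []) =
  let i = IM.size ss  -- slots are never deleted, so the size is the next index
  in (Handle i 0, Pool (IM.insert i (Slot 0 (Just x)) ss) [])

-- A handle resolves only if its generation matches the slot's.
get :: Handle -> Pool a -> Maybe a
get (Handle i gen) (Pool ss _) = do
  Slot g v <- IM.lookup i ss
  if g == gen then v else Nothing

-- Removing bumps the generation, which invalidates all older handles.
remove :: Handle -> Pool a -> Pool a
remove h pool@(Pool ss free) =
  case get h pool of
    Nothing -> pool
    Just _  ->
      let i = slotIndex h
      in Pool (IM.insert i (Slot (generation h + 1) Nothing) ss) (i : free)
```

After inserting a value, removing it, and inserting another one, both handles point at slot 0, but only the newer handle resolves; the older one yields Nothing instead of silently returning the wrong object.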

The 3-year release cycle — is it the right cadence? Too fast, too slow, or does the cadence even matter given how long adoption takes?

I think it’s alright. The adoption of new standards is rather slow, but I’d guess if the standards were released faster, the adoption would be faster as well, and the same if it were slower. The programming language landscape is extremely diverse these days, and languages are forced to evolve as fast as they can. However, releasing too often leads to a versioning hell. So, to me the 3-year cycle feels about right.

How do you see C++ and Rust coexisting over the next decade? Competition, gradual replacement in some domains, or peaceful coexistence?

To me, it feels that it will be a peaceful competition, with varying rates of success of both languages in different fields. I don’t think Rust will be able to replace C++ entirely, but it definitely will succeed in some areas, especially where safety is a high concern. I’d say that Rust has a higher chance of replacing C, though there are other contenders for that as well (e.g. Zig or Odin).

The US government and NSA have recommended moving away from C/C++ toward memory-safe languages. Does that concern you, and how do you think the C++ community should respond?

I’m all for using Rust in safety-critical software, but I don’t think this recommendation makes sense for software in general. As I’ve said before, I believe that, in general everyday-use software, memory bugs are a minor problem, and by that logic we should all move to programming in proof assistants instead.

That being said, currently there are many proposals to make C++ itself safer, and while this hasn’t led to any particular large language feature like a borrow checker, it’s still heading in an interesting direction.

Blog: https://lisyarus.github.io/blog/about.html
X: https://x.com/lisyarus
YouTube: https://www.youtube.com/@lisyarus
StackOverflow: https://stackoverflow.com/users/2315602/lisyarus

Serokell’s Work on GHC: Dependent Types, Part 5

hi+vladislavzavialov@serokell.co (Vladislav Zavialov) — Mon, 01 Jun 2026 00:00:00 GMT

This article continues the fine tradition of Serokell’s GHC team sharing their progress on bringing dependent types to Haskell. A lot has happened since the last report, and there is plenty to cover.

In this edition, Vladislav Zavialov presents three major contributions and a host of smaller improvements that push Dependent Haskell closer to becoming a practical reality.

Summary

The highlights of this report are:

Visible forall in GADTs
Namespace-specified imports
Type instances in kind checking

After that, we are going to go through a number of other improvements:

Progress on unifying HsType and HsExpr
The star kind syntax in required type arguments
Pun detection in required type arguments
New type families: Tuple, Constraints, Tuple#, Sum#
Rework of name resolution for built-in and punned names

Visible `forall` in GADTs

The design of dependent types for Haskell, as described by GHC Proposal #378 “Design for Dependent Types”, includes support for at least 6 quantifiers:

Quantifier        Dependence     Visibility     Erasure
------------------------------------------------------------
forall a. ty      Dependent      Invisible      Erased
forall a -> ty    Dependent      Visible        Erased
foreach a. ty     Dependent      Invisible      Retained
foreach a -> ty   Dependent      Visible        Retained
Eq a => ty        Non-dependent  Invisible      Retained
t1 -> t2          Non-dependent  Visible        Retained

The one we all care about is foreach a -> ty, also known as the dependent product, dependent function, or Π-type (all three are synonymous). But before we can tackle something so ambitious, it helps to deal with the other quantifiers, such as forall a -> ty, often referred to as the visible forall or VDQ (visible dependent quantification).

The majority of design questions for VDQ were resolved when GHC Proposal #281 “Visible forall in types of terms” was accepted back in 2021, and we have been relentlessly chipping away at its implementation ever since, one engineering challenge at a time.

The latest advancement in this direction is the implementation of VDQ in GADTs. Starting with GHC 9.14, the RequiredTypeArguments extension allows declarations such as the following:

data T a where
  Typed :: forall a -> a -> T a

In this example, the Typed data constructor takes two visible arguments: a type, and then a value of that type, e.g. Typed Int 42 or Typed String "hello". This surely has a dependently-typed look to it! Here are some examples of what it might look like in various contexts:

-- Expressions
t1 = Typed Int 42
t2 = Typed String "hello"
t3 = Typed (Int -> Bool) even

-- Patterns
f1 (Typed a x) = x :: a
f2 (Typed Int n) = n*2
f3 (Typed ((->) w Bool) g) = not . g

-- Types (with DataKinds)
type T1 = Typed Nat 42
type T2 = Typed Symbol "hello"
type T3 = Typed (Type -> Constraint) Num

One thing to keep in mind, and why this doesn’t really count as dependent types, is that the type argument is guaranteed to be erased, i.e. has no effect on how data is represented on the heap, and consequently can’t be pattern matched on:

f4 (Typed a x) =
  case a of   -- Nuh-uh! This is a compile-time error
    Int  -> negate x
    Bool -> not x
    _    -> x
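For contrast, here is a minimal sketch of how a type can be inspected at run time in today’s Haskell: by storing a class dictionary in the constructor, which corresponds to the retained “Eq a => ty” row of the table above. The names TypedR and flipKnown are hypothetical and are not part of the work described in this report; the sketch only relies on Typeable and cast from base.

```haskell
{-# LANGUAGE GADTs #-}

import Data.Typeable (Typeable, cast)

-- A conventional GADT without visible forall. The Typeable dictionary is
-- stored in the constructor, so information about 'a' survives to run time.
data TypedR a where
  TypedR :: Typeable a => a -> TypedR a

-- Branch on the run-time type of the payload via 'cast'.
flipKnown :: TypedR a -> a
flipKnown (TypedR x)
  | Just n <- cast x, Just r <- cast (negate (n :: Int)) = r  -- a is Int
  | Just b <- cast x, Just r <- cast (not b)             = r  -- a is Bool
  | otherwise                                            = x  -- anything else
```

The price is an extra dictionary field on the heap, which is exactly what the erased forall a -> avoids.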

Nonetheless, allowing this form of quantification comes with its own technical challenges, which are now behind us. This means that when it comes to adding proper dependent types, we won’t have to worry about syntactic trivia.

Here is what it took to make this work.

The first challenge was that GHC’s AST for constructor patterns MkE @tp1 @tp2 p1 p2 kept the type arguments and term arguments in entirely separate lists. We refactored it to use a mixed-list representation:

ConPat "MkE" [tp1, tp2] [p1, p2]               -- old
ConPat "MkE" [InvisP tp1, InvisP tp2, p1, p2]  -- new

The immediate effect of the new representation was an improvement to error messages. Consider the pattern Con x @t y. Previously it resulted in a parse error because @t could not occur after x, and now it is reported as [GHC-14964].

More interestingly, the form Con x @t y has become, in principle, representable, so it has become possible to handle it properly down the compilation pipeline.

The second challenge was to allow visible forall in constructor signatures. It took a few tries to get this right, as the forall-or-nothing rule that governs implicit quantification meant that some quantifiers had to be kept in a separate field. The type checker was also updated to no longer assume that all subpatterns without the @ herald were value arguments. Indeed, some or all of them may very well turn out to be required type arguments.

The third challenge was to update the Core representation of data constructors to allow foralls of varying visibility to occur in the list of quantifiers:

- dcUserTyVarBinders :: [InvisTVBinder]
+ dcUserTyVarBinders :: [TyVarBinder]

This change not only necessitated mechanical changes scattered throughout the GHC codebase, but also resulted in cryptic Core Lint errors. The patch had been shelved for a few months until @mniip helped to debug this issue at ZuriHac 2025. As it turned out, the predicate that determines whether to introduce a data constructor wrapper was returning a false negative.

With all three challenges resolved, GHC can now handle the following example (constructed to stress-test the feature rather than demonstrate a realistic use case):

newtype Checker a b c where
  C :: forall a. forall b -> forall c. (a -> (c, b)) -> Checker a b c

cDouble = C Double (length &&& read)

-- ghci> map (test cDouble) ["1.5", "0.5", "1.05", "0.05"]
-- [(3,True),(3,True),(4,True),(4,False)]
test :: Show b => Checker String b c -> String -> (c, Bool)
test (C t f) s = fmap (\r -> show @t r == s) (f s)

And there we have it: we are one step closer to freely mixing terms and types in our Haskell programs. Future work in that direction includes nested quantification in GADTs (#18389) and VDQ in pattern synonyms (#23704).

Namespace-specified imports

Haskell has two namespaces, as GHC Proposal #581 explains:

the type namespace, including the names of type constructors, type synonyms, type families and type classes;
the data namespace, including the names of term-level values, functions, data constructors and pattern synonyms.

This separation allows us to define data constructors and type constructors whose names coincide:

data T = T

At use sites, GHC infers which T is referred to from the context. For example, in t :: T the occurrence of T resolves to the type constructor, whereas in t = T to the data constructor.

The reliance on context to select the namespace creates ambiguities when it comes to import/export lists, fixity declarations, pragmas, TH name quotation, and most notably when mixing terms and types with DataKinds and RequiredTypeArguments.

The easiest way to sidestep these issues is to avoid introducing type and data constructor names that coincide, and this is what dependent-types-flavored Haskell tends to do. However, we can’t expect all libraries to adopt such a convention, so we still need ways to disambiguate T against T in a context-independent manner.
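As a small illustration of the existing, context-dependent tools, the sketch below uses DataKinds and the tick syntax to pick the promoted data constructor in a type. The names tyT and conT are hypothetical. Note that the tick only helps inside types; it is of no use in import and export lists, where the ambiguity described above remains.

```haskell
{-# LANGUAGE DataKinds #-}

import Data.Proxy (Proxy (..))

-- A type constructor and a data constructor that share a name.
data T = T

-- In a term, T is the data constructor; in a signature, T is the type.
t :: T
t = T

-- In a type, an unticked T refers to the type constructor ...
tyT :: Proxy T
tyT = Proxy

-- ... while the tick selects the promoted data constructor, whose kind is T.
conT :: Proxy 'T
conT = Proxy
```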

After a number of false starts, with no fewer than three proposals from various authors, the consensus was to introduce namespace-specified imports looking like this:

import Data.Monoid as M.Type (type ..)
import Data.Monoid as M (data ..)

Notice the new .. syntax; it is a wildcard that stands for all names in the specified namespace.

Recall that the Data.Monoid module exports the following newtypes:

newtype All       = All     {getAll :: Bool}
newtype Alt f a   = Alt     {getAlt :: f a}
newtype Any       = Any     {getAny :: Bool}
newtype Ap f a    = Ap      {getAp :: f a}
newtype Dual a    = Dual    {getDual :: a}
newtype Endo a    = Endo    {appEndo :: a -> a}
newtype First a   = First   {getFirst :: Maybe a}
newtype Last a    = Last    {getLast :: Maybe a}
newtype Product a = Product {getProduct :: a}
newtype Sum a     = Sum     {getSum :: a}

With imports written as above, one can refer to the type constructors as M.Type.Dual, M.Type.Endo, M.Type.Product, and so on, and to the data constructors as M.Dual, M.Endo, and M.Product respectively.

At the outset, this seemed like a straightforward feature to implement: a new form of import/export item that simply selects names according to the namespace specifiers. However, as we started to tackle this problem, we found a surprising number of pre-existing bugs that had to be fixed first:

In subordinate import and export lists, the use of the type namespace specifier used to be silently ignored (#12488, #22581)
In subordinate import lists within a hiding clause, non-existent items led to a poor warning message with -Wdodgy-imports (#25983)
In subordinate import lists within a hiding clause, non-existent items resulted in the entire import declaration being discarded (#25984)
In subordinate import lists, it was not possible to refer to a class method if there was an associated type of the same name (#25991)

With so many corner cases discovered, it was no longer clear how many other bugs were lurking in the import/export logic, but we decided to proceed cautiously with preparations:

We introduced the option to use the data keyword with individual import/export items, e.g. import Data.Proxy as D (data Proxy).
In accordance with the proposal, we deprecated the pattern namespace specifier and introduced the -Wpattern-namespace-specifier warning to aid migration to the new data syntax.
We increased the test coverage of -Wduplicate-exports and -Wdodgy-exports, which revealed some typos and dead code.

At last, it seemed like the foundation was solid enough to tackle the actual feature. The next step was to add support for top-level namespace-specified wildcards type .. and data .. to import and export lists:

import M (type ..) -- imports all type and class constructors from M
import M (data ..) -- imports all data constructors and terms from M

module M (type .., f) where
  -- exports all type and class constructors defined in M,
  -- plus the function 'f'

We then carried out a refactoring, after which the implementation resumed and reached another milestone: support for subordinate namespace-specified wildcards X(type ..) and X(data ..) in import and export lists:

import M (Cls(type ..))  -- imports Cls and all its associated types
import M (Cls(data ..))  -- imports Cls and all its methods

module M (R(data ..), C(type ..)) where
  -- exports R and all its data constructors and record fields;
  -- exports C and all its associated types, but not its methods

At this point, GHC could handle all examples from the proposal, so we might as well declare victory. Studying the spec revealed a few corner cases that are still being worked on (#27268), but those are unlikely to arise in practice.

Et voilà, Haskellers have another tool to deal with the Dreaded Namespace Problem.

Type instances in kind checking

If you ever wrote $(return []) in your module to help GHC kind-check your code, you know what this section is going to be about. If not, read on: this one is a heavy hitter.

We are excited to announce that, after 10 years of futile attempts, we finally taught GHC to find open type family instances during kind checking, regardless of the order in which they are written in the source file. Consider the following program:

type family Open a
type family F a :: Open a

type instance F Int = True
type instance Open Int = Bool

When kind-checking the F Int = True instance, we need to know that Open Int = Bool, so we’d better check the other type instance first, otherwise we’ll see the following error:

error: [GHC-83865]
    • Expected kind ‘Open Int’, but ‘True’ has kind ‘Bool’
    • In the type ‘True’
      In the type instance declaration for ‘F’

But look again at this line:

type instance F Int = True

It mentions F, Int, and True. Nothing points to Open! So how can we know to kind-check the other instance first?

We tried various heuristics, but every time, we found corner cases that couldn’t be handled. In the meantime, some variation of this issue was reported at the rate of about once a year:

#12088 “Type/data family instances in kind checking” (May 20, 2016)
#12239 “Dependent type family does not reduce” (June 29, 2016)
#13790 “GHC doesn’t reduce type family in kind signature unless its arm is twisted” (June 5, 2017)
#14668 “Ordering of declarations can cause typechecking to fail” (January 14, 2018)
#15561 “Type error conditioned on ordering of GADT and type family definitions” (August 24, 2018)
#16410 “Order of declarations matters” (March 8, 2019)
#16448 “Unresolved type family kind bug” (March 16, 2019)
#16693 “Order of declarations affects which programs are accepted” (May 25, 2019)
#19611 “Typechecking of dependently-kinded type family instances doesn’t have enough equalities in scope” (March 28, 2021)
#20875 “Declaration order of aliases and type families” (December 26, 2021)
#21172 “No reduction of kind families” (March 5, 2022)
#22257 “Dependent type families cannot be separated from their instances” (October 5, 2022)
#25238 “Kind reduction with open type families depends on data type declaration order” (September 5, 2024)
#25834 “Splices seem to break type inference for type families” (March 8, 2025)

Haskellers are phenomenal at writing programs that break the compiler, and thanks to the bug reports, we have accumulated an excellent set of test cases.

Let’s now dive into the technical details.

After renaming type, class, and instance declarations in a module, GHC does dependency analysis on the renamed declarations to figure out in which order to kind-check them. The dependency analysis returns a topologically sorted list of SCCs (strongly-connected components), where an SCC represents a set of mutually recursive declarations.

If the dependency analysis were complete, i.e. if it were able to discover all dependencies between all declarations, then GHC could simply kind-check SCCs in order. Unfortunately, this is not the case, because dependencies come in two varieties:

Lexical dependencies arise when X mentions Y by name:

data X (a :: Y) = MkX   -- depends on Y
data Y = MkY

Non-lexical dependencies arise when an instance must be in the typing environment:

type family F x
data X (a :: F Int) = MkX a   -- depends on (F Int ~ Type)
type instance F x = Type

Non-lexical dependencies can’t be discovered by looking at the free variables of a declaration, and attempts to find a good heuristic did not bear fruit. As a consequence, the order of SCCs coming out of the renamer is determined solely by lexical dependencies.

In other words, the SCCs are arranged in lexical dependency order, meaning:

definitely in dependency order if all dependencies are lexical
possibly not in dependency order if there are non-lexical dependencies
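The lexical part of this analysis can be mimicked with Data.Graph from the containers package. This is only a sketch of the idea, not GHC’s own implementation, and the name lexicalOrder is hypothetical. Each declaration is given as its name together with the names it mentions; the result lists the groups with dependencies before dependents.

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- Input: (declaration name, names it mentions).
-- Output: groups of mutually recursive declarations, dependencies first.
lexicalOrder :: [(String, [String])] -> [[String]]
lexicalOrder decls =
  map flattenSCC (stronglyConnComp [(name, name, deps) | (name, deps) <- decls])

-- Example input: X mentions Y, while A and B mention each other.
example :: [(String, [String])]
example =
  [ ("X", ["Y"])
  , ("Y", [])
  , ("A", ["B"])
  , ("B", ["A"])
  ]
```

In lexicalOrder example, the group containing Y comes before the group containing X, and A and B end up in one group. A non-lexical dependency such as F Int ~ Type never appears in the list of mentioned names, so no edge is created for it, which is why the resulting order cannot be trusted in that case.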

Here is another example of how type checking declarations might go wrong due to non-lexical dependencies:

type family F a
type instance F Int = Bool

data R = MkR (F Int)

type S = MkR True

For S to kind-check, we need to know that (F Int) ~ Bool. But we won’t know that unless we’ve looked at the type instance declaration for F before kind-checking S.

The solution we ended up adopting is to discover the correct kind-checking order by trial and error. The algorithm works as follows:

1. Perform the dependency analysis on declarations without instances and considering lexical dependencies only. The result is a topologically sorted list of SCCs.
2. Create one singleton “SCC” per instance and put them at the end.
3. Check all SCCs in order, skipping any that are blocked (free variables not in the environment), flawed (unusable unpack pragmas), or failing (type errors).
- (3a) if none were skipped, we are done
- (3b) if all were skipped and none are flawed, we are stuck; report errors and exit
- (3c) if all were skipped and some are flawed, redo the pass allowing flawed SCCs
- (3d) if some were skipped and some weren’t, we’ve made progress; iterate

In the common case of lexical dependencies only, the algorithm is linear in the number of groups: it completes in one pass if the program is kind-correct, or two passes if there are kind errors.

In the less common case of non-lexical dependencies, the algorithm is worst-case quadratic in the number of groups: if each pass manages to check only one group, we end up doing a pass per group.

Now, regarding the “flawed” groups. These are the ones where the programmer used the {-# UNPACK #-} pragma on a field, yet we could not unpack. One possible reason for this is that we lack a data instance in the environment that would allow for the field to be unpacked, so it is beneficial to treat this like a kind error: skip the flawed group and retry it in a later pass, when we might have more data instances in the environment. However, if the only reason a pass gets stuck is due to flawed groups, then we can make progress by treating unpacking failure as a warning. This way, we maximize unpacking with explicit {-# UNPACK #-} pragmas. Later we might check SCCs for other “flaws”, but for now the property is just about unusable unpack pragmas.

Finally, a short comment on why it is necessary to check whether a group is ready (all free variables are in the environment) or blocked (some free variables are not in the environment). One might expect this check to be redundant, as the SCCs come in lexical dependency order. However, as soon as we skip a group, the rest of the pass can no longer rely on this property, hence the check. It is rare to encounter this problem in a kind-correct program, but we managed to construct a test case.

That is all for the technical details. It took many iterations to arrive at this algorithm. Each previous attempt failed on some corner case or another, which is, of course, what made this a 10-year problem to begin with. If you are one of the many Haskellers who ran into this bug and found yourself reordering declarations or reaching for the $(return []) workaround, you can now drop those workarounds: GHC 9.14 handles the ordering automatically.

Beyond the immediate quality-of-life improvement, this fix unblocks exciting developments such as singletonisation of GADTs.

Progress on unifying `HsType` and `HsExpr`

Internally, GHC’s frontend uses two distinct types to represent types and terms: HsType and HsExpr respectively. A dependently typed language would normally give uniform treatment to types and terms, so one of our goals is a refactoring to use one representation for both, arriving at the following declaration in GHC:

type HsType = HsExpr

The rationale for this is to increase code reuse between the term- and type-level pipelines in the compiler front-end (AST, parser, renamer, type checker).

Given that types and terms already share a number of similarities, one would expect this to be a straightforward refactoring. For example, the following constructs are found in both:

Variables a
Constructors Con
Prefix applications f a
Infix applications a # b
Type applications f @t
Literals 42, "hello"
Signatures a :: t
Lists [a, b, c]
Tuples (a, b, c)
Parentheses (a)
Holes _

However, as it turned out, the representations for those constructs in HsExpr and HsType differ in various ways, leading to what one might call “death by a thousand cuts”.

We expect that it will take a good number of patches to handle each discrepancy individually and work out its user-facing consequences. For example, when we dealt with a difference in how infix operator applications were represented, we ended up allowing infix holes a`_`b in types, following the precedent set by term-level expressions.

The issue is further exacerbated by the fact that each construct actually has three variants depending on the compilation phase: the AST is either parsed, renamed, or typechecked, and each construct can have phase-specific extensions.

To make the effort more systematic, we created an automated test case that tracks the current status of the refactoring. If you open testsuite/tests/ghc-api/T25121_status.stdout in GHC’s sources, you will find a detailed report on which representations match or mismatch.

The goal is to make that test case report “match” for every constructor; then we will be able to merge HsType with HsExpr. To that end, here are some constructs that we have already updated for improved uniformity:

Literals (dropped HsTyLit in favor of HsLit)
Operators (added infix holes in types)
Wildcards (refactored HsWildCardTy to use HoleKind)
Tuples and sums (updated exact-print annotations)

We are looking forward to making more changes in that direction.

The star kind syntax in required type arguments

The StarIsType extension allows the programmer to write * instead of Type:

Maybe :: * -> *

Although it is eventually going to be deprecated per GHC Proposal #143 “Remove the * kind syntax”, there are currently no plans to fully remove it. Indeed, the proposal uses the following phrasing:

the -XStarIsType extension may be removed from GHC to simplify the internals

And “may be removed” only hints at the possibility of removal; given the magnitude of the potential breakage, this is not going to happen without friction.

Looking at this from the perspective of unifying term- and type-level parsers, this means we need to parse * in term syntax somehow. The problem is that it conflicts with the multiplication operator! This is a known problem and the raison d’être of the aforementioned proposal.

Back in 2020, we prototyped a solution to this. With a carefully crafted set of edits to the Haskell grammar, it turned out to be possible to add limited support for the * syntax where it does not conflict with multiplication.

Indeed, consider the common kinds:

Maybe :: * -> *
Either :: * -> * -> *
Monad :: (* -> *) -> Constraint

None of them actually involve the a * b ambiguity.

The prototype that implemented this was actually shelved, because it was difficult to motivate the change to the grammar. But now, with the introduction of RequiredTypeArguments, we can allow the following code:

{-# LANGUAGE RequiredTypeArguments, StarIsType #-}
x1 = f (* -> * -> *)
x2 = f (forall k. k -> *)
x3 = f ((* -> *) -> Constraint)

So we dusted off the old patch, rebased and refined it, and now we finally have a sound backwards compatibility story for *. Merging term and type syntax no longer requires its complete removal.

Pun detection in required type arguments

The initial implementation of RequiredTypeArguments left out pun checking due to engineering difficulties. Consider:

x = 42

f, g :: forall a -> ...
f (type x) = g x

In accordance with the specification, the g x function call is renamed as a term, so x refers to the top-level binding x = 42, not to the type variable binding type x as one might expect.

This is somewhat counterintuitive because g expects a type argument. Forbidding puns in required type arguments allows us to produce a helpful error message:

error: [GHC-09591]
  Illegal punned variable occurrence in a required type argument.
  The name ‘x’ could refer to:
    ‘x’ defined at Test.hs:3:1
    ‘x’ bound at Test.hs:5:9

Unfortunately, the initial attempt to introduce this check was stalled, as the necessary information was not available in the type-checking phase.

Luckily, in an effort to improve error messages, @sheaf corrected the architectural flaw in GHC that was blocking the punning check, and we implemented the check.

New type families: `Tuple`, `Constraints`, `Tuple#`, `Sum#`

GHC Proposal #145 “Non-punning list and tuple syntax” describes the NoListTuplePuns extension, which removes the overloading of (a, b): with the extension enabled, (a, b) always refers to the data constructor, never the type constructor. The corresponding tuple type can still be written as Tuple2 a b, but the Tuple type family offers better notation by reusing the familiar (a, b) syntax:

Tuple (Int, Bool)          = Tuple2 Int Bool
Constraints (Show a, Eq a) = CTuple2 (Show a) (Eq a)
Tuple# (Int#, Float#)      = Tuple2# Int# Float#
Sum# (Int#, Float#)        = Sum2# Int# Float#

When the extension was first implemented by @tek, these type families had to be left out because GHC’s type inference was not yet powerful enough to handle them. The missing piece was more aggressive injectivity analysis for closed type families, which @rae proposed and, three years later, @simonpj implemented. With that in place, we were able to add the type families.

As part of this work, we also bumped the maximum sum arity from 63 to 64. Adding the Sum64# constructor had previously been blocked by a separate issue, which @luite resolved.

We also submitted a proposal amendment, as the type family definitions had to be simplified to a form that GHC’s injectivity checker could handle.

Rework of name resolution for built-in and punned names

While working on the name resolution component of GHC, we stumbled upon some outdated documentation and technical debt. Further inspection revealed four lurking bugs:

#25174 Template Haskell mistakenly assumes "FUN" is built-in syntax
#25179 mkName (template-haskell) ignores NoListTuplePuns
#25180 Valid hole fits don’t suggest tuple constructors
#25182 MkSolo and MkSolo# are mistakenly classified as BuiltInSyntax

Fixing them in a principled way called for a complete overhaul of the relevant logic. We cleaned up which names truly qualify as built-in syntax, unified the treatment of tuples of all arities, made the implementation aware of NoListTuplePuns, and updated the internal documentation.

Previous updates on dependent types

Conclusion

This was the fifth blog post of our series “Work on GHC: Dependent Types”. In this installment, we covered visible forall in GADTs, namespace-specified imports, and the long-awaited fix for type instance ordering in kind checking. We also shared incremental progress on unifying HsType and HsExpr, and a handful of smaller but meaningful improvements throughout the compiler.

We are committed to our vision of dependently-typed programming in Haskell, so stay tuned for future updates!

The Hidden Perils of MonadBaseControl

hi+diogocastro@serokell.co (Diogo Castro) — Mon, 23 Mar 2026 00:00:00 GMT

MonadBaseControl is notoriously tricky to use correctly. It’s really easy to misuse and end up introducing subtle unexpected behaviour or downright bugs, even in the hands of experienced developers.

The goal of this article is to establish a clear mental model of how to work with MonadBaseControl, to recognize its dangers, and to learn how to avoid them.

Lastly, I’ll leave you with recommendations for best practices: when to reach for MonadBaseControl, and when not to.

This post assumes basic familiarity with MonadBaseControl. If you’ve never used it, I wholeheartedly recommend reading Alexis King’s Demystifying MonadBaseControl first and then coming back to this article. I will be reiterating some of the pitfalls mentioned in her article, expanding upon those, and diving into others.

Note: this page is available as a Literate Haskell file. Refer to that module to find the GHC extensions and imports used throughout the article.

Before we begin, let’s define a couple of helper functions to make our examples easier to read:

-- | Append a value to the state.
appendToState :: MonadState [a] m => a -> m ()
appendToState a =
  -- For the purpose of this article, excuse the inefficiency.
  -- A `DList` or `Seq` would be more appropriate.
  modify (<> [a])

-- | Print the current state with a label for context.
printState :: Show s => String -> StateT s IO ()
printState context = do
  st <- get
  liftIO $ putStrLn $ "State observed from '" <> context <> "': " <> show st

A quick refresher

Say you had this function; it takes an IO action as input and returns another IO action.

foo :: forall a. IO a -> IO a

If we wanted to call foo with an action of type StateT s IO a, we could “lift” it like this:

fooState :: forall a s. StateT s IO a -> StateT s IO a
fooState stateAction = do
  inputState <- get
  let ioAction = runStateT stateAction inputState :: IO (a, s)
  (a, outputState) <- liftBase $ foo @(a, s) ioAction
  put outputState
  pure a

That is, we need to:

Use get to capture the input state.
Run the StateT s IO a action with the input state, yielding an IO (a, s) action.
Call foo with the IO (a, s) action.
Restore the output state with put.

Observe how foo’s type parameter is instantiated to @(a, s), materializing as foo :: IO (a, s) -> IO (a, s). In essence, we’re threading the state through foo, and then restoring it afterwards.

MonadBaseControl abstracts over this pattern and provides a general-purpose way of lifting functions like foo into some transformer stack.

foo' :: forall m a. (MonadBaseControl IO m) => m a -> m a
foo' action = do
  st <- liftBaseWith \(runInBase :: m a -> IO (StM m a)) -> do
    let ioAction = runInBase action :: IO (StM m a)
    foo @(StM m a) ioAction
  restoreM st

Notice the parallels between this and the StateT version:

Use liftBaseWith to capture the input state; this gives us runInBase, a closure over that state.
runInBase will run an m a action with the input state, yielding an IO (StM m a) action.
Call foo with the IO (StM m a) action¹.
Restore the output state with restoreM.

Again, we’re instantiating foo as foo :: IO (StM m a) -> IO (StM m a), allowing the state to be threaded through.

The monad-control package gives us the machinery to do this, and the packages lifted-base and lifted-async build on top of it to give us the lifted version of commonly used functions from the base and async packages, respectively.

Discarded state

To start off simple, let’s take a look at this function:

whenJust_ :: Maybe b -> (b -> IO a) -> IO ()
whenJust_ Nothing _ = pure ()
whenJust_ (Just x) f = void $ f x

A naive first attempt at lifting it might look like this:

whenJust_' :: forall m a b. (MonadBaseControl IO m) => Maybe b -> (b -> m a) -> m ()
whenJust_' mb f = do
  liftBaseWith \runInBase ->
    whenJust_ mb (runInBase . f)

This typechecks, but it won’t work as expected. Remember: lifting a function with MonadBaseControl implies threading the state through it, getting the output state back, and then restoring it.

However, even though whenJust_ does take a polymorphic action IO a, it unfortunately returns IO (). This means we simply can’t get our output state back, which means we can’t possibly call restoreM!

As a result, all state modifications will be discarded:

-- | >>> execStateT testWhenJust1 [0]
-- [0]
testWhenJust1 :: StateT [Int] IO ()
testWhenJust1 = do
  whenJust_' (Just 1) \x -> do
    appendToState x

The only way to preserve state modifications is if whenJust_ returns the a produced by the input action IO a.

In most situations, you can modify the function being lifted, or reimplement it, to allow the state to flow through. Luckily, in this instance, that’s fairly easy to fix:

whenJust :: Maybe b -> (b -> IO a) -> IO (Maybe a)
whenJust Nothing _ = pure Nothing
whenJust (Just x) f = Just <$> f x

whenJust_'' :: forall m a b. (MonadBaseControl IO m) => Maybe b -> (b -> m a) -> m ()
whenJust_'' mb f = do
  stMaybe :: Maybe (StM m a) <- liftBaseWith \runInBase ->
    whenJust mb (runInBase . f)
  case stMaybe of
    Just st -> do
      _ :: a <- restoreM st
      pure ()
    Nothing ->
      pure ()

Unlike whenJust_, whenJust does allow a to flow through. The input action produces an a, which is then wrapped in a Maybe and returned. We can now capture the output state when Just is returned, and restore it.

-- | >>> execStateT testWhenJust2 [0]
-- [0,1]
testWhenJust2 :: StateT [Int] IO ()
testWhenJust2 = do
  whenJust_'' (Just 1) \x -> do
    appendToState x

Threading state

While lifting a function with 1 input action is usually straightforward, lifting a function with 2 or more can get really thorny. Let’s have a look at a more nuanced (and rather contrived) example and try lifting logDuration:

-- | >>> logDuration (threadDelay 1_e6) (\d -> putStrLn $ "Took " <> show d)
-- Took ...s
logDuration :: IO a -> (NominalDiffTime -> IO b) -> IO a
logDuration action logFn = do
  (a, duration) <- timed action
  _ <- logFn duration
  pure a

timed :: IO a -> IO (a, NominalDiffTime)
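
Only the type of timed is shown above. A minimal sketch of one possible implementation, assuming wall-clock time from the time package is good enough (the body below is an assumption, not necessarily the definition used in the Literate Haskell module):

```haskell
import Control.Concurrent (threadDelay)
import Data.Time.Clock (NominalDiffTime, diffUTCTime, getCurrentTime)

-- | Run an action and measure how long it took (wall-clock time).
timed :: IO a -> IO (a, NominalDiffTime)
timed action = do
  start <- getCurrentTime
  a <- action
  end <- getCurrentTime
  pure (a, end `diffUTCTime` start)

main :: IO ()
main = do
  (_, duration) <- timed (threadDelay 100000)
  putStrLn ("Took " <> show duration)
```

Note that timed returns the a produced by its input action, which is what later allows the state to flow through it.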

To avoid the trap described in the last section, we’re going to be using the higher-order combinator control, which ensures we do call restoreM.

logDuration' :: (MonadBaseControl IO m) => m a -> (NominalDiffTime -> m b) -> m a
logDuration' action logFn = do
  control \runInBase ->
    logDuration (runInBase action) (runInBase . logFn)

However, there are 2 problems with this.

First, the input state is being forked and passed into both action and logFn. Recall that runInBase is a closure that captures the input state. And because we’re applying it twice, once to action and once to logFn, both actions will see the same input state.

-- | >>> evalStateT testLogDuration1 [0]
-- State observed from 'action': [0]
-- State observed from 'logFn': [0]
testLogDuration1 :: StateT [Int] IO ()
testLogDuration1 = do
  logDuration'
    (printState "action" >> appendToState 1)
    (\_ -> printState "logFn")

This is not the behaviour most users would expect. A more sensible implementation would thread the output state of action into logFn.

The second problem is that, even though we are using restoreM to restore the output state, we’re only restoring the output state of action. The output state of logFn is being discarded.

-- | >>> execStateT testLogDuration2 [0]
-- [0]
testLogDuration2 :: StateT [Int] IO ()
testLogDuration2 = do
  logDuration'
    (pure ())
    (\_ -> appendToState 1)

Why? Let’s have a closer look at the type of logDuration. Note how it takes 2 input actions, IO a and NominalDiffTime -> IO b, but only returns the output a of the first action. b is never returned, so its state cannot be restored.

Just as before, we need to break the function apart and reimplement it in terms of its primitives.

logDuration is defined in terms of timed, which takes a single input action. Its type is IO a -> IO (a, NominalDiffTime), so it allows the state to flow through.

logDuration'' :: (MonadBaseControl IO m) => m a -> (NominalDiffTime -> m b) -> m a
logDuration'' action logFn = do
  (st, duration) <- liftBaseWith \runInBase -> do
    timed (runInBase action)
  a <- restoreM st
  _ <- logFn duration
  pure a

Now the input state is observed by action, action’s output state is observed by logFn, and logFn’s output state will be observed by the caller.

-- | >>> execStateT testLogDuration3 [0]
-- State observed from 'action': [0]
-- State observed from 'logFn': [0,1]
-- [0,1,2]
testLogDuration3 :: StateT [Int] IO ()
testLogDuration3 = do
  logDuration''
    (printState "action" >> appendToState 1)
    (\_ -> printState "logFn" >> appendToState 2)

Brick walls

So far, we’ve learned to be mindful of how the state flows through the function, how it’s captured and restored, and how we can “massage” the function’s definition to make things work.

Still, there are times when we’ll hit a wall and find functions that are just impossible to lift with MonadBaseControl in a satisfactory way.

Two very common pitfalls are functions related to concurrency and exception handling.

concurrently

Concurrency is a rather obvious issue, and concurrently illustrates it well.

concurrently :: IO a -> IO b -> IO (a, b)

At a fundamental level, the input state must be forked and given to both branches and, once they’re done, we must only keep the state of one branch.

In the implementation below, I arbitrarily chose to always keep the state of the second branch, regardless of which action finishes first. This is exactly how concurrently is implemented in the lifted-async package.

-- | >>> execStateT (concurrently' (appendToState 1) (appendToState 2)) []
-- [2]
concurrently' :: (MonadBaseControl IO m) => m a -> m b -> m (a, b)
concurrently' ma mb = do
  (stateA, stateB) <- liftBaseWith \runInBase -> do
    Async.withAsync (runInBase ma) \asyncA ->
      Async.withAsync (runInBase mb) \asyncB -> do
        Async.waitBoth asyncA asyncB

  a <- restoreM stateA -- here we restore the output state of the 1st branch, but then...
  b <- restoreM stateB -- ... we immediately overwrite it with the output state of the 2nd branch.
  pure (a, b)

bracket

The issues with mixing exception handling and MonadBaseControl are a bit more subtle and deceptive. Let’s have a look at bracket to understand why.

bracket :: IO a -> (a -> IO b) -> (a -> IO c) -> IO c

If we start out with control and “follow the types”, we’ll get this:

bracket' :: (MonadBaseControl IO m) => m a -> (a -> m b) -> (a -> m c) -> m c
bracket' acquire release use =
  control $ \runInBase ->
    bracket
      (runInBase acquire)
      (\st -> runInBase $ restoreM st >>= release)
      (\st -> runInBase $ restoreM st >>= use)

In fact, this is the exact example given in the docs for control.

To understand it, it helps to see how exactly bracket’s type parameters are being instantiated here:

bracket
  :: IO (StM m a)              -- acquire
  -> (StM m a -> IO (StM m b)) -- release
  -> (StM m a -> IO (StM m c)) -- use
  -> IO (StM m c)

Let’s break it down:

The input state is captured and passed to our acquire action.
The output state of acquire (StM m a) is passed to both use and release; restoreM st >>= ... ensures our use and release actions will see it.
The output state of release (StM m b) is discarded.
The output state of use (StM m c) is returned by bracket and will be restored by control.

There are 2 issues with this implementation:

release runs after acquire and use, but only sees the output state of acquire, not use.
The output state of release is discarded.

-- | >>> execStateT testBracket1 []
-- State observed from 'release': ["acquire"]
-- ["acquire","use"]
testBracket1 :: StateT [String] IO ()
testBracket1 =
  bracket'
    (appendToState "acquire")
    (\_ -> printState "release" >> appendToState "release")
    (\_ -> appendToState "use")

This is not how we want the state to be threaded. Again, we’ll break the function apart and reimplement it in terms of its primitives. bracket is defined using mask and onException, so we’ll redefine it using lifted versions of those same functions from the lifted-base package (which do behave sensibly).

bracket'' :: (MonadBaseControl IO m) => m a -> (a -> m b) -> (a -> m c) -> m c
bracket'' acquire release use =
  Lifted.mask \restore -> do
    a <- acquire
    c <- restore (use a) `Lifted.onException` release a
    _ <- release a
    pure c

Now we can observe the state being threaded correctly through all 3 actions:

-- | >>> execStateT testBracket2 []
-- State observed from 'acquire': []
-- State observed from 'use': ["acquire"]
-- State observed from 'release': ["acquire","use"]
-- ["acquire","use","release"]
testBracket2 :: StateT [String] IO ()
testBracket2 =
  bracket''
    (printState "acquire" >> appendToState "acquire")
    (\_ -> printState "release" >> appendToState "release")
    (\_ -> printState "use" >> appendToState "use")

Looking good, right? Well… there’s a very insidious bug hiding in there.

It works great when run on StateT, but if we run it on ExceptT, we run into trouble. If the use function exits with throwError, then that effect will cause bracket'' to short-circuit and skip the release handler!

-- | >>> runExceptT testBracketExcept
-- acquire
-- use
-- Left "use error"
testBracketExcept :: ExceptT String IO ()
testBracketExcept = do
  bracket''
    (liftIO (putStrLn "acquire"))
    (\_ -> liftIO (putStrLn "release"))
    (\_ -> liftIO (putStrLn "use") >> throwError "use error")

The use function does not throw an exception (so `Lifted.onException` release a is never run) and exits early (before _ <- release a has a chance to run).

In an attempt to fix the threading of the state, we ended up making things much worse and broke bracket’s semantics!

The issue here is that we want the output state of use to be passed to release, except when dealing with transformers with multiple exit points like ExceptT and MaybeT.

In the latter case, in order to preserve bracket’s semantics, we want to:

Run use, but don’t restore its output state yet.
Run release.
If both use and release exited with an error, rethrow release’s.
Otherwise, restore use’s output state.

We simply cannot do this with MonadBaseControl. A function lifted with MonadBaseControl has to capture/restore state uniformly for all possible transformers. But what we want here is to have multiple implementations of bracket, one for each concrete monad transformer, that decides how best to capture/restore state.

And that is precisely how the exceptions package solves this problem. It defines a MonadMask typeclass that, among other things, describes the semantics of an abstract generalBracket. The instances for StateT, ExceptT, MaybeT, etc., then implement the correct state threading behaviour for each transformer, while upholding the prescribed semantics.
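
To see the difference, here is a small sketch that reruns the ExceptT scenario with bracket from Control.Monad.Catch (the exceptions package). The name testBracketMask is made up for this example, and throwE from transformers stands in for throwError to keep the imports minimal.

```haskell
import Control.Monad.Catch (bracket)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)

-- Same shape as testBracketExcept, but using the MonadMask-based bracket.
testBracketMask :: ExceptT String IO ()
testBracketMask =
  bracket
    (liftIO (putStrLn "acquire"))
    (\_ -> liftIO (putStrLn "release"))
    (\_ -> liftIO (putStrLn "use") >> throwE "use error")

main :: IO ()
main = runExceptT testBracketMask >>= print
-- acquire
-- use
-- release
-- Left "use error"
```

This time the release handler runs even though use exited through ExceptT’s short-circuiting rather than through a runtime exception.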

Conclusion

The good ol’ “if it compiles, it works” just doesn’t apply when dealing with MonadBaseControl. It’s not just a matter of making the types line up; it’s a matter of semantics.

If you can get away with using only stateless transformers, do it! None of the issues described here apply to stateless transformers (e.g., ReaderT, LogT). Forking state is innocuous, and there’s no output state to restore afterwards.

You often can avoid stateful transformers by replacing StateT s with mutable variables such as ReaderT (IORef s), and ExceptT with runtime exceptions.
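
As a rough illustration of the ReaderT (IORef s) approach, the sketch below lifts concurrently by hand. App, appendToRef, and concurrentlyR are hypothetical names introduced for this example; since both branches share the same IORef, there is no per-branch state to fork or discard.

```haskell
import Control.Concurrent.Async (concurrently)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Reader (ReaderT (..), ask)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import Data.List (sort)

type App = ReaderT (IORef [Int]) IO

-- | Append a value to the shared mutable state.
appendToRef :: Int -> App ()
appendToRef x = do
  ref <- ask
  liftIO (atomicModifyIORef' ref (\xs -> (xs <> [x], ())))

-- | Both branches receive the same environment, hence the same IORef.
concurrentlyR :: App a -> App b -> App (a, b)
concurrentlyR ma mb = ReaderT (\ref ->
  concurrently (runReaderT ma ref) (runReaderT mb ref))

main :: IO ()
main = do
  ref <- newIORef []
  _ <- runReaderT (concurrentlyR (appendToRef 1) (appendToRef 2)) ref
  final <- readIORef ref
  -- The order of the two writes is not deterministic, so sort before printing.
  print (sort final)  -- [1,2]
```

Unlike the concurrently' example earlier, both modifications survive.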

You can constrain your functions with StM m a ~ a to rule out stateful transformers. This is what the “safe” module Control.Concurrent.Async.Lifted.Safe from the lifted-async package does, and you should prefer it over Control.Concurrent.Async.Lifted.
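
A minimal sketch of that constraint, applied to the whenJust_ example from earlier. whenJustPure_ is a made-up name, and this is a simplified per-function version of the idea rather than the exact encoding used by lifted-async.

```haskell
{-# LANGUAGE FlexibleContexts, TypeFamilies, TypeOperators #-}

import Control.Monad (void)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.Control (MonadBaseControl, StM, liftBaseWith)
import Control.Monad.Trans.Reader (ReaderT, ask, runReaderT)

whenJust_ :: Maybe b -> (b -> IO a) -> IO ()
whenJust_ Nothing _ = pure ()
whenJust_ (Just x) f = void (f x)

-- | Only usable with monads that carry no state: StM m a must be just a.
whenJustPure_
  :: (MonadBaseControl IO m, StM m a ~ a)
  => Maybe b -> (b -> m a) -> m ()
whenJustPure_ mb f =
  liftBaseWith (\runInBase -> whenJust_ mb (runInBase . f))

main :: IO ()
main =
  runReaderT
    (whenJustPure_ (Just (1 :: Int)) (\x -> do
      label <- ask
      liftIO (putStrLn (label <> show x))))
    "value: "
-- value: 1
```

Calling whenJustPure_ in StateT [Int] IO is rejected at compile time, because StM (StateT s IO) a is (a, s) rather than a, so the silent state loss from testWhenJust1 cannot happen.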

Another alternative is MonadUnliftIO. It’s roughly equivalent to MonadBaseControl with StM m a ~ a, but with a simpler API. The downside is that the base monad is constrained to IO.
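
For comparison, here is a sketch of the same whenJust_ lifting written against MonadUnliftIO, using withRunInIO from the unliftio-core package. whenJustU_ is a hypothetical name for this example.

```haskell
import Control.Monad (void)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.IO.Unlift (MonadUnliftIO, withRunInIO)
import Control.Monad.Trans.Reader (ask, runReaderT)
import Data.IORef (modifyIORef', newIORef, readIORef)

whenJust_ :: Maybe b -> (b -> IO a) -> IO ()
whenJust_ Nothing _ = pure ()
whenJust_ (Just x) f = void (f x)

-- | runInIO has type m a -> IO a: there is no StM and nothing to restore.
whenJustU_ :: MonadUnliftIO m => Maybe b -> (b -> m a) -> m ()
whenJustU_ mb f = withRunInIO (\runInIO -> whenJust_ mb (runInIO . f))

main :: IO ()
main = do
  ref <- newIORef [0 :: Int]
  runReaderT
    (whenJustU_ (Just 1) (\x -> do
      r <- ask
      liftIO (modifyIORef' r (<> [x]))))
    ref
  readIORef ref >>= print  -- [0,1]
```

StateT has no MonadUnliftIO instance, so trying to use whenJustU_ with it fails to compile instead of discarding state at runtime.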

Nevertheless, if you must support stateful transformers (e.g., StateT, ExceptT, MaybeT), lifting functions with only 1 input action, like withMVar, is usually easy enough. Just remember to always restore the output state using restoreM. Prefer using higher-order combinators like control and liftBaseOp, if possible.
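
For instance, a single-action function like withMVar can be lifted with liftBaseOp in one line. This is a sketch: withMVar' is a name chosen for this example, and it is not claimed to match how lifted-base defines its own version.

```haskell
{-# LANGUAGE FlexibleContexts #-}

import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Monad.Trans.Control (MonadBaseControl, liftBaseOp)
import Control.Monad.Trans.State.Strict (execStateT, modify)

-- | liftBaseOp runs the callback in the base monad and restores its output state.
withMVar' :: MonadBaseControl IO m => MVar a -> (a -> m b) -> m b
withMVar' mvar = liftBaseOp (withMVar mvar)

main :: IO ()
main = do
  mvar <- newMVar (1 :: Int)
  finalState <- execStateT (withMVar' mvar (\x -> modify (<> [x]))) [0]
  print finalState  -- [0,1]
```

Because withMVar returns the result of its callback, the callback’s output state flows back out and liftBaseOp can restore it.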

Lifting functions with 2 or more input actions, on the other hand, is when things get complicated. Having to use the runInBase closure more than once is a dead giveaway that something might be off. Reimplementing the function in terms of simpler functions that only take 1 input action each can sometimes work, but it’s not guaranteed.

For exception handling, I’d recommend avoiding MonadBaseControl and lifted-base. Instead, go with the exceptions package or, better yet, safe-exceptions for safer handling of async exceptions. You get the best of both worlds: power (it supports stateful transformers) and safety (it behaves sensibly with regard to state).

MonadBaseControl is a big hammer; wield it wisely.

¹: StM m a is an associated type family of MonadBaseControl. It represents a value a enriched with the state of a monad m. For example, StM (StateT s IO) a ~ (a, s), and StM (ExceptT e IO) a ~ Either e a. For stateless monads, StM m a evaluates to a: StM (ReaderT r IO) a ~ a.

Rust in Production: JetBrains

hi+ivangromakovsky@serokell.co (Ivan Gromakovsky) — Mon, 23 Feb 2026 00:00:00 GMT

In our Rust in Production interview series, we talk with developers and technical leaders who are shaping how Rust is built and used in practice.

This interview explores JetBrains’ strategy for supporting the Rust Foundation and collaborating around shared tooling like rust-analyzer, the rationale behind launching RustRover, and how user adoption data shapes priorities such as debugging, async Rust workflows, and test tooling (including cargo nextest).

Today’s guest is the Head of the Rust Ecosystem at JetBrains, Vitaly Bragilevsky.

In talks and interviews you often emphasize JetBrains’ long-term commitment to the Rust ecosystem, including your involvement in the Rust Foundation and collaboration around rust-analyzer. How do you balance building proprietary JetBrains tooling with contributing to shared, community-owned Rust infrastructure in a way that actually accelerates Rust adoption rather than fragmenting the tooling landscape?

We think about this balance very deliberately, because the last thing the Rust ecosystem needs is artificial fragmentation driven by vendors pulling in different directions.

First, it’s important to be precise about our role. We don’t directly contribute code to core Rust open-source projects like rust-analyzer, but we very much share the same underlying problems. Because of that, we stay in close communication with the rust-analyzer team, exchange feedback, and align on direction where it makes sense. In parallel, we participate in Rust Foundation programs focused on supporting the people and processes behind Rust development, not on controlling technology.

From a product perspective, our default stance is: use what the Rust ecosystem already provides whenever possible. We rely on the standard Rust toolchain and regularly evaluate which existing components can be reused or integrated into RustRover rather than reinventing them. This helps us stay compatible with how Rust developers already work and lowers the cognitive cost of adopting the language.

Regarding fragmentation, I see it a bit differently. Diversity of tools and approaches isn’t a failure mode by default – it’s often a strength. Different developers, teams, and domains need different workflows. A strictly limited, highly opinionated tooling setup may be elegant, but it can also exclude people. Healthy competition and multiple well-integrated tools give users real choice, and that choice is what ultimately accelerates adoption.

Our goal is not to replace or overshadow community-owned infrastructure, but to build on top of it and around it: providing a polished, integrated experience for users who value that, while staying aligned with the broader ecosystem. When developers can pick the tools that fit them best – whether that’s lightweight editors, full IDEs, or something in between – Rust becomes more accessible, not more fragmented.

RustRover is JetBrains’ first dedicated Rust IDE after years of offering Rust support as plugins for IntelliJ IDEA and CLion. From your perspective as Head of the Rust Ecosystem, what concrete signs inside JetBrains and in the wider community convinced you that the time had come to invest in a standalone Rust product rather than “just” improving the plugins?

The decision was driven by a combination of external signals from the Rust ecosystem and very practical internal considerations at JetBrains.

Externally, we saw sustained, long-term growth of Rust – not just in developer adoption, but in serious production use by companies across very different industries. More teams were betting on Rust for core systems, making the ecosystem commercially relevant in a way that clearly went beyond enthusiasts and early adopters. At that point, simply treating Rust as an add-on to other IDEs no longer reflected how important it had become for many users.

Internally, having a standalone product matters a lot. It allows us to put Rust development on a clear roadmap, dedicate a focused team, and invest in the experience end-to-end rather than competing for attention and resources within a broader product. From an organizational and product-management perspective, this is a much healthier way to build something long-term.

There’s also a signaling effect that I personally consider very important. When JetBrains launches a dedicated, commercial IDE for a language, it sends a strong message to companies: this technology is mature, well-supported, and a safe bet. In that sense, RustRover is not only a response to Rust’s growth – it’s also a way of reinforcing it, giving teams additional confidence that Rust is ready for broader adoption in professional environments.

For us, surveys are not just about measuring popularity — they’re a way to understand how Rust is actually used in practice and where developers are still paying a lot of friction tax.

The most valuable signals for us are things like the industries where Rust is being adopted, the kinds of applications people are building, and the tooling problems they repeatedly run into. These answers have a very direct impact on our roadmap, because they tell us where better tooling can realistically move the needle for adoption and productivity.

A concrete example is debugging. We consistently see that many Rust developers avoid debuggers altogether, not because they don’t need them, but because the existing experience is often unreliable or hard to use. That’s a clear signal for us to invest more heavily in debugger quality and integration, rather than assuming that “Rust developers just don’t debug.”

We also see that a large share of Rust development today is backend work, with heavy use of asynchronous Rust. That has consequences for everything from code insight and diagnostics to debugging and profiling, and it means we need to focus on improving the developer experience specifically for async-heavy scenarios, not just for small libraries or toy examples.

Finally, the surveys show a significant cluster of Rust usage in areas like blockchain — for example, ecosystems such as Solana. For us, this isn’t a niche curiosity; it’s a signal that real teams are building production systems there, and that investing in better support for these workflows can have a tangible impact. In that sense, our roadmap is shaped less by abstract ideas of what Rust could be used for, and more by careful observation of what Rust developers are already doing today — and where better tools can help them do it with less friction.

For someone who uses Zed/Neovim with rust-analyzer, what difference would they feel with RustRover? Is the proprietary JB engine better than the tools Rust provides, and what’s the story behind choosing a proprietary engine over a community-driven analyzer?

If you’re coming from Zed or Neovim with rust-analyzer, the first difference you’ll notice is that RustRover is not “just” code analysis – it’s a full IDE experience that’s available out of the box and designed to work as a coherent system.

RustRover combines Rust code analysis with a lot of other IDE capabilities: a debugger, a profiler, advanced dependency management, collaboration tools, support for web technologies and databases, and AI features ranging from simple model-based interactions to more advanced agent-style workflows. On the debugging side in particular, we use our own forks of LLDB and GDB, tuned specifically for Rust, because upstream debuggers still struggle with many Rust-specific constructs.

I usually try not to frame this as a direct “rust-analyzer vs. JetBrains engine” comparison. Both analyzers cover a broadly similar feature set, and both have rough edges – Rust is a very complex language, and large real-world codebases stress tools in different ways. Depending on project size, architecture, macros, build setup, and many other factors, developers can get noticeably different results from different analyzers.

Our engine has a long history. It started about ten years ago as part of the IntelliJ Rust plugin, well before rust-analyzer existed, and it was built using the traditional JetBrains approach to deep IDE integration. Interestingly, it was originally started by Aleksey Kladov (matklad) – the same person who later initiated rust-analyzer, which is based on very different architectural principles.

Today, we don’t see a strong reason to abandon our own analysis stack. One major advantage is that we’re much closer to the IDE itself: we’re not constrained by the LSP protocol, and we can build UX features that simply aren’t possible when the analyzer is a separate, generic service. That tight integration enables things like richer refactorings, more context-aware navigation, and smoother interactions across debugging, profiling, and code insight.

Finally, I actually think it’s healthy that Rust has two serious analyzers. It means no one can afford to be complacent. The competition pushes both approaches forward – and in the end, Rust developers are the ones who benefit from that constant pressure to improve.

JetBrains actively supports the Rust Foundation. What motivated this decision, and what kind of value does JetBrains expect to get back?

Joining the Rust Foundation was a very natural step for us at the time. As Rust was entering a more mature phase of adoption, the Foundation emerged as a focal point for coordinating long-term, ecosystem-level efforts: supporting core infrastructure, improving the sustainability of key projects, investing in developer education, and providing a neutral space where companies and the community can work together.

For JetBrains, participation in the Rust Foundation gives us a structured and transparent way to engage at that level. We get the opportunity to talk directly with companies that are deeply invested in Rust, to contribute to discussions about priorities and initiatives, and to propose or support programs that improve the overall developer experience. While the Foundation doesn’t dictate technical direction, it plays an important role in aligning efforts around shared problems that no single company can solve alone.

From a practical standpoint, working with an organization like the Rust Foundation is also simply more convenient and scalable for us as a company. We still communicate with individual open-source projects and with members of the Rust community directly, but the Foundation gives us a central forum where those conversations can happen more systematically and with broader impact.

Ultimately, the value we expect to get back is not a specific technical advantage, but a healthier, more sustainable Rust ecosystem. That directly benefits our users – and, by extension, our products – because better infrastructure, better-supported maintainers, and clearer long-term signals make Rust a safer and more attractive choice for teams and companies.

There is a fairly common perception in the community that testing in Rust can feel less flexible and more verbose – especially when it comes to mocks, fixtures, and test infrastructure, which often rely on third-party crates and careful architectural design. Do you agree that this is a real pain point for Rust developers today? How do you see this area evolving, and is improving test ergonomics something that the language team and tooling vendors like JetBrains are actively focusing on?

I think cargo nextest is a great example of how the Rust ecosystem is evolving to address real testing pain points without overloading the language itself. Rust deliberately keeps its built-in testing model minimal and reliable, but that means that questions of scale – performance, isolation, flaky tests, CI ergonomics – are pushed into external tooling. As projects grow, cargo test often becomes a bottleneck, and that’s exactly the space where nextest provides a much more robust and production-ready test execution model.

What’s important is that nextest doesn’t change how tests are written in Rust at all. All the existing approaches – standard #[test], async tests, fixtures and mocks from third-party crates – continue to work as they are. Instead, it focuses on execution, observability, and control, which are some of the biggest sources of friction for teams working with large Rust codebases. In that sense, it complements the existing ecosystem rather than competing with it.

From a tooling perspective, this is exactly where IDEs can add a lot of value. We see strong demand for better test workflows, and that’s why we’re actively working on deeper integration of cargo nextest into RustRover – including running, debugging, and visualizing test results. Our goal is to hide as much of the infrastructural complexity as possible behind a coherent UX, so developers can benefit from powerful tools like nextest without having to constantly think about how all the pieces are wired together.

Beyond the Hype: Crossing the GenAI Divide in Real-World Business

hi+ivan-smetannikov@serokell.co (Ivan Smetannikov) — Wed, 24 Dec 2025 00:00:00 GMT

If you judged the state of AI in business by your LinkedIn feed, you’d think the revolution already happened.

Boards ask about GenAI at every quarterly review. Teams are spinning up pilots. Vendors are promising “copilots” for every function, from procurement to payroll. And yet, when you look at the P&L, a quieter story emerges: for most organizations, nothing fundamental has changed.

Hard numbers come from the recent MIT-backed study on the State of AI in Business 2025. Despite an estimated $30-40 billion being poured into GenAI, 95% of organizations are seeing no measurable return. Only 5% of integrated AI pilots deliver real, meaningful value.

The report terms this chasm the GenAI Divide, and for anyone building or buying AI, it is the defining challenge of the next few years.

As a company that designs and implements AI solutions for enterprises, we see this divide every day: between demos and deployment, between experimentation and transformation, between tools that impress in slide decks and systems that quietly move business metrics.

This essay is about what that divide really is and how to cross it.

High Adoption, Low Transformation

On paper, at least, the adoption of GenAI is booming: more than 80% of organizations have explored or piloted tools like ChatGPT, Copilot, and similar assistants, and about 40% report some level of deployment.

But these deployments mostly live at the edge of the business: helping individual employees draft emails, summarize documents or clean up code. Useful? Absolutely. Transformative? Not yet.

When the study turned to structural change across industries (things like new market leaders, AI-native business models, or changes in customer behavior), the picture was stark: only two sectors clearly show signs of AI-driven disruption, Technology and Media & Telecom.

Seven out of the nine industries show lots of pilots but almost no fundamental change.

Executives feel the disconnect. As one manufacturing COO put it candidly: “The hype says everything has changed. In our operations, we’re just processing contracts a bit faster.”

The divide isn’t about interest or investment; it’s about impact.

Why Pilots Stall: The Learning Gap

Stripping the buzzwords away, most of the failures of GenAI in the enterprise come down to one problem: the tools don’t learn.

The research in the report across 52 organizations found that the biggest barriers to scaling AI weren’t regulation, infrastructure, or even talent. They were all symptoms of a deeper learning gap:

Systems do not store feedback.
They don’t adapt to messy reality.
They can’t evolve with changing workflows.

That’s why generic tools and “LLM wrappers” tend to do well for ad-hoc work but fall apart in mission-critical workflows. Chat-based tools still expect users to paste the full context every time. Internal tools often ship as rigid, static products that ignore how people actually work.

When asked what holds back GenAI pilots, concerns about “model quality” showed up high on the list, but not because the underlying models are weak. These same users happily rely on ChatGPT in their personal workflows. But once AI is embedded into enterprise systems, people expect something more: context, continuity, and memory.

In other words, intelligence without learning just isn’t enough.

Nowadays, most companies focus on putting their own data to work with basic RAG pipelines or out-of-the-box solutions. But these attempts run into a harsh reality: in-context search is not the same as real knowledge. Throwing everything into a mixer, blending it together, and passing the result to the LLM doesn’t work without dedicated effort to build a proper knowledge base (in the mathematical sense). And such a knowledge base, combined with intelligent retrieval on top, is costly and very complicated to develop properly.

The Shadow AI Economy: What Actually Works

Here’s the paradox: while official AI programs struggle, AI is already changing work—just not in ways IT can see.

The study revealed a burgeoning “shadow AI economy” within organizations:

Only about 40% of companies report having purchased an official LLM subscription.
But employees at more than 90% of the responding companies report using personal tools for work on a regular basis, including ChatGPT and Claude.

Many of these are being used multiple times a day by staff, while corporate AI pilots are stuck in protracted “evaluation phases.”

This is an uncomfortable truth for leadership, yet it is also a powerful signal: employees will adopt AI when it is flexible, responsive, and clearly useful.

They won’t use brittle, overengineered, or badly integrated AI in their workflows, no matter how strategic the initiative appears on the roadmap.

The forward-looking organizations are starting to pay attention to this shadow economy, studying how power users actually get value from AI and using that as a blueprint for official solutions.

Builders vs. Buyers: Two Ways Across the Divide

The report highlights another uncomfortable finding: internal builds fail twice as often as external partnerships. Externally developed, learning-capable tools reached deployment roughly 67% of the time in the sample, versus around 33% for in-house builds.

That doesn’t mean “never build.” It means:

Manage AI more like a BPO or consulting engagement than a traditional SaaS rollout.
Anticipate a co-evolution process with your vendor, rather than an off-the-shelf, set-and-forget product.

In our experience, organizations that successfully cross the divide from the buyer’s side tend to do three things differently:

They buy like operators, not tourists passing by a flashy storefront. They benchmark vendors on business outcomes, not model benchmarks or fancy feature demos. Their questions: What P&L metric will this move? How fast? With what baseline?
They empower line managers, not just central AI labs. The most successful deployments often start with power users who already rely on AI in their daily work. These managers come with concrete use cases and own the rollout, supported by central teams providing governance and guardrails.
They demand tools that learn. In the interviews, executives consistently identified a short list of must-haves: deep understanding of their workflow, minimal disruption to existing tools, clear data boundaries, and, crucially, the ability to improve over time.

From the builder’s side of things, the winning playbook is surprisingly consistent:

Start with narrow, high-value, well-specified workflows: for example, contract review, call summarization, repetitive coding tasks.
Integrate deeply into the system, rather than trying to replace it.
Build ground truth datasets and validation metrics to be able to accurately evaluate results of your non-deterministic AI agents.
Build in persistent memory and feedback loops from the first day.
Scale from visible, low-risk edge workflows into wider applications.
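
The point about ground truth datasets and validation metrics can be illustrated with a small sketch. It assumes the simplest possible metric, an exact-match pass rate over repeated runs, which is one way to account for non-deterministic output. The function name `pass_rate`, the toy dataset, and the stub agent are all hypothetical and only stand in for a real agent call.

```python
from typing import Callable


def pass_rate(
    agent: Callable[[str], str],
    dataset: list[tuple[str, str]],
    runs: int = 5,
) -> float:
    """Share of (question, run) pairs where the agent matches the ground truth.

    Each question is asked `runs` times because a non-deterministic agent
    may answer the same question differently on each call.
    """
    if not dataset or runs < 1:
        raise ValueError("need a non-empty dataset and at least one run")
    correct = 0
    for question, expected in dataset:
        for _ in range(runs):
            answer = agent(question)
            # Exact match after normalization; real metrics are usually richer.
            if answer.strip().lower() == expected.strip().lower():
                correct += 1
    return correct / (len(dataset) * runs)


# Hypothetical ground-truth set and a stub agent that only knows one answer.
GROUND_TRUTH = [("2+2", "4"), ("capital of France", "Paris")]


def stub_agent(question: str) -> str:
    return {"2+2": "4"}.get(question, "unknown")


score = pass_rate(stub_agent, GROUND_TRUTH, runs=3)
```

Tracking a number like this per workflow, before and after each change, is what turns "the demo looked good" into a measurable baseline.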

Those startups doing this well are achieving seven-figure run rates within 6–12 months, not by promising a generic “AI platform,” but by quietly becoming indispensable in one or two workflows first.

The ROI Nobody Brags About: Back-Office Wins

Another pattern in the data is almost comically human: executives tend to put the AI budget where it’s easiest to tell a story.

Asked to hypothetically budget for GenAI, respondents steered about 70% of it toward sales and marketing use cases. Email automation, lead scoring, content generation, campaign orchestration – easy to pitch, easy to measure, easy to talk about in board decks.

But some of the biggest and clearest ROI the study found lives elsewhere:

Savings of up to $2–10M annually in BPO spend on customer service and document processing, achieved by automating back-office processes.
Reductions of 30% in external agency costs related to content and creative work.
Significant savings in risk and compliance workflows in financial services.

Interestingly, these gains do not often come from mass layoffs. Organizations that have crossed the GenAI Divide tend instead to:

Reduce outsourced, third-party spend
Avoid incremental hiring in certain functions
Reassign internal teams to higher-value work rather than cutting them altogether.

Front-office AI gets the limelight, but back-office AI often gets the returns.

What Comes Next: From Agents to the Agentic Web

The tools of GenAI today still live mostly inside single products: an assistant inside your CRM, a copilot in your office suite, a chatbot on your support site.

The next leap isn’t just “more capable models”, but an Agentic Web: a mesh of interoperating agents able to learn, coordinate, and act across tools, vendors, and even companies. Protocols such as MCP (Model Context Protocol), A2A (Agent-to-Agent), and NANDA form early infrastructure for that world.

In such an environment, AI systems would be able to:

Discover and evaluate vendors independently
Spin up dynamic API connections, instead of waiting for hand-coded integrations.
Orchestrate multi-step workflows across multiple platforms and organizations
Continuously optimize processes based on actual outcomes.

The key implication for enterprises is this: the vendors you train today will shape your flexibility tomorrow.

Once a system has deeply trained on your processes, data, and edge cases, the switching costs become enormous. Procurement leaders in the study estimate that many of these relationships will effectively “lock in” over the next 18-24 months.

That’s why the GenAI Divide is more than a research term; it’s a strategic clock.

A Practical Way Forward

For those organizations still on the wrong side of the divide, the road ahead is less glamorous than the hype would suggest, but far more achievable:

Stop chasing demos. Focus on workflows where time, cost and error rates are already measured and where AI can plug into existing systems rather than replace them.
Consider AI initiatives to be learning systems, not static products. Demand persistent memory, feedback loops, and measurable improvement over time.
Start where your people already are. Talk to the “shadow AI” users. Understand what tasks they quietly automate. And turn those hacks into supported, secure, enterprise-grade solutions.
Partner with builders that live in your workflow. Whether you build, buy, or blend both, work with teams that understand your approvals, data flows, and edge cases – not just your industry buzzwords.

The GenAI Divide is real, but it’s not permanent. The organizations that cross it first won’t necessarily be the ones with the biggest budgets or flashiest AI labs. They’ll be the ones that treat AI not as a miracle button but as a new kind of infrastructure: embedded, adaptive, and above all, capable of learning.

Design Patterns for Long-Term Memory in LLM-Powered Architectures

hi+ivan-smetannikov@serokell.co (Ivan Smetannikov) — Tue, 09 Dec 2025 00:00:00 GMT

The explosive growth of large language models (LLMs) has reshaped the AI landscape. Yet their core design is still fundamentally stateless: a drawback often referred to as “conversational amnesia.” An LLM can only operate within a limited context window and, paradoxically, loses more signal as that window grows, making it unable to reliably carry information forward across extended interactions.

This limitation remains the key blocker to building truly persistent, collaborative, and personalized AI agents that can handle complex, multi-step, long-running workflows. To overcome it, the industry is moving past traditional stateless Retrieval-Augmented Generation (RAG) and toward more advanced architectural patterns purpose-built for long-term memory.

This report outlines and contrasts four leading design philosophies that have emerged:

The Operating System Paradigm (MemGPT). Treats memory as a managed computational resource, virtualizing LLM context to simulate infinite capacity.
OpenAI memory management. A product-driven approach where memory enables seamless, persistent personalization across all interactions.
Claude memory management. Prioritizes user control and strict data isolation, using memory as a project-scoped workspace tool.
AI Toolkits memory management. Open-source frameworks that provide building blocks for developers to design custom, domain-specific memory systems.

Our high-level findings reveal a clear trade-off between automated convenience and explicit control. Emerging systems such as MemGPT and modular agent frameworks point toward a future of autonomous, self-managing memory.

System I: MemGPT — The Operating System Paradigm

MemGPT represents a fundamental architectural shift: it reframes memory not as a content-management problem but as a resource-management challenge. Inspired directly by computer operating system design, it treats the LLM’s finite context window not as a hard limit but as a form of fast, volatile memory (analogous to RAM), to be intelligently managed alongside a larger, persistent storage layer (analogous to disk).

According to the well-known researcher Andrej Karpathy, a lot of computing concepts carry over in this paradigm, where the LLM acts as the kernel process of a new kind of operating system. Concepts from computer security carry over as well, with attacks, defenses, and emerging vulnerabilities. Today, for example, the LLM orchestrates:

Input & Output across modalities (text, audio, vision)
Code interpreter, ability to write and run programs
Browser / internet access
Embeddings database for files and internal memory storage and retrieval

According to him, looking at LLMs as chatbots is the same as looking at early computers as calculators. We’re seeing an emergence of a whole new computing paradigm, and it is very early.

Core Architecture: Virtual Context Management

At its core, MemGPT features a hierarchical memory architecture closely mirroring that of a traditional OS:

Primary Context (RAM) — The fixed-size prompt that the LLM can directly “see” during inference.
It consists of three partitions:
1. Static system prompt, containing base instructions and function schemas.
2. Dynamic working context, serving as a scratchpad for reasoning steps and intermediate results.
3. FIFO message buffer, holding the most recent conversational turns.
External Context (Disk Storage) — An effectively infinite, out-of-context layer inaccessible to the model without explicit retrieval.
It includes:
1. Recall Storage, a searchable document or log database containing the full historical record of interactions for literal recall.
2. Archival Storage, a long-term, vector-based memory for large documents and abstracted knowledge retrievable via semantic search.

Information flow between these tiers is governed by a strict paging mechanism. The LLM “processor” works only within the primary context. To access external data, it must autonomously issue explicit function calls (e.g., conversation_search, archival_memory_search).

Results from these calls are then paged in, replacing less relevant segments in the FIFO queue, and creating the illusion of an unbounded context window while remaining within the physical token limits of the underlying model.
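
The paging behaviour described above can be sketched in a few lines of Python. This is not MemGPT's or Letta's actual code: the `VirtualContext` class, the word-count token estimate, the oldest-first eviction rule, and the keyword-based `conversation_search` are simplifying assumptions made purely for illustration.

```python
from collections import deque


def count_tokens(text):
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


class VirtualContext:
    """Toy two-tier memory: a bounded FIFO buffer plus an unbounded recall log."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.fifo = deque()   # primary context, the "RAM" the model can see
        self.recall = []      # external context, the "disk" with full history

    def used_tokens(self):
        return sum(count_tokens(m) for m in self.fifo)

    def _push(self, message):
        """Put a message into the FIFO buffer, evicting the oldest on overflow."""
        self.fifo.append(message)
        while self.used_tokens() > self.max_tokens and len(self.fifo) > 1:
            self.fifo.popleft()   # leaves the context but stays in recall storage

    def append(self, message):
        """Record a new conversational turn in both tiers."""
        self.recall.append(message)
        self._push(message)

    def conversation_search(self, keyword):
        """Stand-in for the function call the LLM would issue explicitly."""
        return [m for m in self.recall if keyword.lower() in m.lower()]

    def page_in(self, keyword):
        """Bring matching messages that fell out of context back into the buffer."""
        for hit in self.conversation_search(keyword):
            if hit not in self.fifo:
                self._push(hit)


ctx = VirtualContext(max_tokens=6)
for turn in ["my name is Alice", "I like tea", "what now"]:
    ctx.append(turn)
ctx.page_in("alice")   # the first turn was evicted earlier and is paged back in
```

After the final call the buffer holds the most recent turn plus the recalled one, while the recall log still contains all three turns. A real system would rank retrieved segments by relevance rather than evicting strictly by age.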

Memory Formation: The Self-Managed Write-Back Cycle

Persisting information into long-term memory occurs through an event-driven write-back cycle, analogous to an OS interrupt.

The process is triggered by memory pressure: when token usage in the primary context approaches a defined threshold (e.g., 70% capacity), the system inserts an internal alert.
Upon receiving this signal, the LLM halts its current reasoning, reviews its working memory, determines which content is least critical, summarizes it, and writes it to the appropriate external tier.

Crucially, the LLM itself manages this cycle—autonomously deciding what to keep, what to discard, and where to store it.
This self-reflective capability allows for dynamic memory correction (e.g., overwriting outdated facts) and embodies a primitive form of cognitive self-regulation.
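
As a rough sketch of this cycle, the function below checks token usage against a threshold and, under pressure, summarizes the oldest part of the context into an archive. In MemGPT the LLM itself decides what to evict and how to summarize it. Here both decisions are replaced by hard-coded stand-ins (evict the oldest half, keep the first three words of each message), and all names are hypothetical.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per word."""
    return len(text.split())


def truncate_summary(messages: list[str]) -> str:
    """Placeholder for an LLM summarization call: first three words of each message."""
    return " | ".join(" ".join(m.split()[:3]) for m in messages)


def write_back(
    context: list[str],
    archive: list[str],
    max_tokens: int,
    threshold: float = 0.7,
    summarize: Callable[[list[str]], str] = truncate_summary,
) -> bool:
    """Move a summary of the oldest messages to the archive under memory pressure.

    Returns True if a write-back happened. Both lists are modified in place.
    """
    used = sum(count_tokens(m) for m in context)
    if used < threshold * max_tokens:
        return False                     # below the pressure threshold, nothing to do
    cut = max(1, len(context) // 2)      # assumption: the oldest messages matter least
    evicted, kept = context[:cut], context[cut:]
    archive.append(summarize(evicted))   # persist the summary to the external tier
    context[:] = kept                    # free space in the primary context
    return True


context = [
    "the user prefers short answers",
    "the project deadline is Friday",
    "please draft the email now",
]
archive: list[str] = []
triggered = write_back(context, archive, max_tokens=20)
```

With 15 of 20 tokens in use, the 70% threshold is crossed, so the oldest message is summarized into the archive and removed from the context.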

Tooling and Technology Stack

Typical implementations leverage the following stack:

Core Framework: The original MemGPT open-source Python framework, now evolved into Letta.
LLM Compatibility: Model-agnostic but optimized for function-calling models such as OpenAI’s GPT-4 or GPT-3.5.
Vector Databases: Used for archival semantic search; common options include Chroma, LanceDB, or pgvector.
Persistent Storage: Recall layers often use lightweight databases or file systems for event logs and message histories.

Strengths:
Elegant abstraction of the finite-context problem; creates the illusion of infinite memory via virtualization.

Limitations:
All reasoning and memory management are handled by a single agent, consuming valuable cognitive bandwidth.
Because stored data is unstructured, performing complex relational queries (e.g., “Which decisions were influenced by facts from source X?”) is nearly impossible without heavy post-processing—precisely what MaaS aims to solve.

In short, MemGPT’s brilliance lies in its autonomy—but that autonomy comes at a cost. Every cycle spent on memory logistics is a cycle not spent on task reasoning.

System II: OpenAI Memory Management

OpenAI’s memory for ChatGPT represents a product-first architecture designed to deliver a seamless, deeply personalized experience.
Unlike other models, it implements a global, user-centric memory that persists across all conversations, making the assistant continuously aware and contextually intelligent with minimal user effort.

Core Architecture: Hybrid Fact + Semantic Storage

The system combines two complementary memory layers:

Saved Memories. While we do not know exactly how it is organized, it can be viewed as a page or two of facts collected from all of your chats. The LLM decides automatically which of the data you provide in the current conversation is added there. These facts can be explicitly provided (“Remember that I live in San Francisco”) or automatically classified by the model as potentially useful. Each session begins with this document prepended to the LLM prompt, ensuring continuity. We assume that OpenAI might also use a simple key-value database to manage it more accurately.
Chat History Reference. A large-scale RAG-style retrieval layer that semantically searches across all previous user interactions to find relevant context fragments for the current query. Judging by the system’s behaviour, we assume that this search runs over the full text of those chats, not over their summaries.

During response generation, both layers are queried in parallel. Memories are added to the beginning of the conversation. Semantically matched chat snippets are combined into the model’s context according to the concrete request, yielding personalized, contextually aware outputs.
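
A minimal sketch of how two such layers could be combined into one prompt is shown below. OpenAI's implementation is proprietary, so everything here is an assumption: a bag-of-words cosine similarity stands in for a real embedding model, and `build_prompt` with its section headers is a hypothetical layout.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(saved_memories, chat_history, query, top_k=1):
    """Prepend every saved memory, then add the top_k most similar past snippets."""
    q = bow(query)
    ranked = sorted(chat_history, key=lambda s: cosine(q, bow(s)), reverse=True)
    lines = ["# Saved memories"] + [f"- {m}" for m in saved_memories]
    lines += ["# Relevant past chats"] + [f"- {s}" for s in ranked[:top_k]]
    lines += ["# Current request", query]
    return "\n".join(lines)


prompt = build_prompt(
    saved_memories=["Lives in San Francisco"],
    chat_history=[
        "we discussed a sourdough bread recipe",
        "we planned a trip to Lisbon in May",
    ],
    query="what was that bread recipe",
)
```

The saved memories are always included, while only the chat snippet that overlaps with the query makes it into the context. That asymmetry is the core of the hybrid design described above.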

Memory Formation

The write-back cycle operates in two modes:

Explicit Commands. The user directly instructs the model to remember or forget something.
Automatic Extraction. Background classifiers continuously scan conversations to identify recurring or salient information (e.g., profession, tone preferences). Extracted facts are either suggested to or silently added into the saved memory layer.

User oversight remains central. Through the ChatGPT settings interface, users can view, edit, or delete any stored memory, ensuring transparency, privacy, and control.

Technology Stack (Presumably)

While proprietary, the architecture likely includes:

A high-performance vector store for semantic retrieval over historical chats.
A scalable key-value or document database for structured “saved memories.”
A background extraction pipeline leveraging a classification model.

Strengths:
Delivers effortless, “magical” personalization at global scale, which is ideal for consumer applications.

Limitations:
Global scope makes it unsuitable for enterprise or professional contexts. The risk of context leakage (where information from one client or topic influences another) is inherent to its design. It also does not support multiple users sharing the same memory scope, at least for now.

In essence, OpenAI’s architecture mirrors its business strategy: prioritize simplicity and personalization for a mass audience over data compartmentalization or symbolic structure. Its strength lies in accessibility and its limitation lies in lack of user control.

System III: Claude Memory Management

Claude’s approach stands as a philosophical counterpoint to OpenAI’s global personalization. Anthropic emphasizes user control, explicit activation, and strict data compartmentalization, producing a memory system that is less automated but more predictable: well-suited to professional workflows where data separation is non-negotiable.

Core Architecture: Project Summaries and File-Scoped Context

Claude’s memory combines official features with strong, community-driven patterns:

Project Memory (Team/Enterprise). Users create distinct projects, each with its own editable memory summary of key facts, instructions, and context. That summary is automatically injected into prompts for all chats within the same project, enforcing hard boundaries – nothing from Project A can bleed into Project B.
Community Patterns in CLAUDE.md. In developer workflows, teams often include a CLAUDE.md file at the repo root. When interacting with Claude in that working directory, the file is read and included in context. This acts like “context injection,” not dynamic RAG: the whole file is loaded, enabling versioned, reviewable, and Git-managed context (architecture principles, coding standards, API specs) that sits alongside the codebase.
On-Demand Retrieval Tools. When explicitly asked to “recall” something from the past, Claude appears to use internal tools (e.g., conversation_search) confined to the current project. It doesn’t rely on a global index, reinforcing compartmentalization.

Memory Formation: Mostly Curated by the User

In contrast to OpenAI’s automation, Claude’s write-back cycle is largely explicit:

Project memory is updated because a user asks for it to be updated.
The CLAUDE.md pattern is edited by humans via normal version control.
Memory is not “always on”; users often prompt Claude to look back, which in turn activates the project-limited search tools.
The search implementation has two key elements:
- Conversation search. Makes keyword and topic-based searches across the entire conversation history.
- Recent chats. Provides time-based access to the conversation history with a customizable chronological sort order and optional pagination using ‘before’ and ‘after’ datetime filters.
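
The two search elements can be sketched as follows. This is an illustration of project-scoped retrieval, not Anthropic's implementation: the `Chat` record, the function signatures, and the `limit` parameter used for pagination are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Chat:
    project: str
    updated_at: datetime
    text: str


def conversation_search(chats: list[Chat], project: str, keyword: str) -> list[Chat]:
    """Keyword search that never looks outside the given project."""
    return [
        c for c in chats
        if c.project == project and keyword.lower() in c.text.lower()
    ]


def recent_chats(
    chats: list[Chat],
    project: str,
    before: Optional[datetime] = None,
    after: Optional[datetime] = None,
    newest_first: bool = True,
    limit: int = 10,
) -> list[Chat]:
    """Time-based access with optional 'before'/'after' filters and a page size."""
    hits = [
        c for c in chats
        if c.project == project
        and (before is None or c.updated_at < before)
        and (after is None or c.updated_at > after)
    ]
    hits.sort(key=lambda c: c.updated_at, reverse=newest_first)
    return hits[:limit]


CHATS = [
    Chat("A", datetime(2025, 1, 5), "Invoice template for client A"),
    Chat("A", datetime(2025, 2, 1), "Kickoff notes"),
    Chat("B", datetime(2025, 1, 20), "Invoice dispute for client B"),
]

invoice_hits = conversation_search(CHATS, "A", "invoice")
latest_a = recent_chats(CHATS, "A", after=datetime(2025, 1, 10))
```

Because every query takes a `project` argument, the chat about client B's invoice never appears in results for project A, which is the compartmentalization property the section describes.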

Technology Stack

Anthropic Platform: Project memory is a first-class feature in Claude’s web app/API for Team/Enterprise plans.
Files + Version Control: The CLAUDE.md pattern relies only on standard files, editors, and Git.
3rd-Party Connectors: An emerging ecosystem (e.g., CLI tools) exposes long-term or local stores via Claude tool integrations.

Strengths:
Advanced control, predictability, and transparency. Ideal for client work and regulated environments where data isolation is paramount.

Limitations:
High cognitive load and limited scalability. Memory grows only as fast as users curate it.
Large, monolithic summaries risk the loss of information.

System IV: AI Toolkits Memory Management

LangChain and Microsoft Autogen aren’t memory products; they’re frameworks that give you the primitives to assemble sophisticated memory architectures. The payoff is fine-grained control, but the cost is engineering effort.

Core Architecture: Composable and Protocol-Oriented

LangChain Memory Modules:
- Buffer Memory and Window Memory (short-term chat windows),
- Summary Memory (periodic LLM summaries to save tokens),
- Entity Memory (structured capture of people/orgs/concepts),
- Knowledge-Graph Memory that builds nodes/edges on the fly for relational querying.
LangGraph for Stateful Agents. A graph-based orchestration layer for building cyclic, multi-agent workflows where memory is an explicit part of the agent’s state persisted across steps and sessions.
Autogen Memory Protocol. A generalized Memory interface with a RAG-centric pattern: agents query external memory stores to enrich their context. Community integrations cover vector DBs and third-party memory services.

Memory Formation: Your Design

These frameworks don’t ship with an “always-on” write-back loop. You design it yourself. For example:

The agent completes a step.
A “memory extraction” tool (often LLM-powered) distills facts, entities, or relationships.
A “memory store” tool writes structured results to the chosen backend (vector store, graph DB, KV/Doc store).
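
The three steps above can be sketched without committing to any particular framework. The code below deliberately avoids LangChain and Autogen APIs: the regex-based `extract_facts` stands in for an LLM-powered extraction tool, and `MemoryStore` is an in-memory placeholder for whatever backend you choose. Both are hypothetical.

```python
import re
from collections import defaultdict

# Matches statements such as "Alice works on Apollo" (toy pattern, assumption).
WORKS_ON = re.compile(r"\b([A-Z][a-z]+) works on ([A-Z][A-Za-z]+)\b")


def extract_facts(step_output):
    """Stand-in for an LLM-powered extractor: returns (subject, relation, object) triples."""
    return [(person, "works_on", project) for person, project in WORKS_ON.findall(step_output)]


class MemoryStore:
    """In-memory placeholder for a vector, graph, or key-value backend."""

    def __init__(self):
        self.triples = defaultdict(set)

    def write(self, facts):
        for subject, relation, obj in facts:
            self.triples[(subject, relation)].add(obj)

    def read(self, subject, relation):
        return sorted(self.triples.get((subject, relation), set()))


def remember_step(step_output, store):
    """Run after the agent completes a step: extract facts, then persist them."""
    facts = extract_facts(step_output)
    store.write(facts)
    return facts


store = MemoryStore()
remember_step("Alice works on Apollo. Bob works on Hermes and Alice works on Hermes.", store)
alice_projects = store.read("Alice", "works_on")
```

Swapping the regex for a model call and the dictionary for a real datastore changes the components but not the shape of the loop.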

Technology Stack

Languages: Python and JavaScript/TypeScript.
Orchestration: LangChain, LangGraph, Autogen
Pluggable Datastores:
- Vector stores: Pinecone, Chroma, Weaviate, LanceDB.
- Graph DBs: Neo4j, Memgraph, Kùzu.
- KV/Document stores: Redis, MongoDB.

Essentially, you can plug in any datastore you want, since both frameworks are open source.

Strengths:
Unmatched flexibility. You can encode the exact memory structure, DB tech, and agent logic your domain needs, often surpassing off-the-shelf systems through precise tuning.

Limitations:
Complex and expensive to build and maintain. Reliability, scale, consistency, and data governance are your responsibility.

Industry Trajectory:
Early memory systems leaned on raw text plus vector search. That hits a ceiling for relational questions (“Who’s working with Alice on currently blocked projects?”). Later systems added entities, relationships, and knowledge graphs. Patterns like LangChain’s KG memory and growing graph DB integrations signal a shift from mere similarity search toward relational reasoning, the most plausible path to collaborative, “systems-level” agents.
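To make the relational question above concrete, here is a small sketch that answers it over explicit edges instead of text similarity. Everything in it is made up for illustration: the people other than Alice, the project names, and the relation labels `works_on` and `has_status`. A real system would run an equivalent query against a graph database.

```python
# Edges of a tiny, made-up knowledge graph: (subject, relation, object).
edges = [
    ("Alice", "works_on", "billing-revamp"),
    ("Bob", "works_on", "billing-revamp"),
    ("Carol", "works_on", "search-v2"),
    ("Alice", "works_on", "search-v2"),
    ("billing-revamp", "has_status", "blocked"),
    ("search-v2", "has_status", "active"),
]


def objects(subject: str, relation: str) -> set[str]:
    """All objects reachable from `subject` via `relation`."""
    return {o for s, r, o in edges if s == subject and r == relation}


def subjects(relation: str, obj: str) -> set[str]:
    """All subjects pointing at `obj` via `relation`."""
    return {s for s, r, o in edges if r == relation and o == obj}


def blocked_collaborators(person: str) -> set[str]:
    """People who share a currently blocked project with `person`."""
    blocked = {
        project
        for project in objects(person, "works_on")
        if "blocked" in objects(project, "has_status")
    }
    result: set[str] = set()
    for project in blocked:
        result |= subjects("works_on", project)
    return result - {person}


collaborators = blocked_collaborators("Alice")
```

Answering the question takes two hops (person to project, project to status and back to people), which is exactly what a nearest-neighbor lookup over text snippets does not express.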

Final thoughts on the current state

Every approach has its strengths and weaknesses. To summarize quickly:

OpenAI. Frictionless, “magical” continuity for individuals: great for consumers but risky for enterprise separation.
Claude. Strong isolation and user control: great for client work and regulated contexts but manual effort required.
MemGPT. Autonomous context virtualization: near-infinite memory feel but single-agent overhead and unstructured storage limit relational queries.
Toolkits. Maximum customizability: best path to tailor-made, domain-specific systems but highest build complexity.

While there is no universally accepted approach to memory management for AI agents right now, we can see strong emerging trends and patterns that should lead to more stable, reliable, safe, and interpretable AI systems in the near future. For example, modern systems went from direct context sharing to RAG capabilities and are now moving toward autonomous memory orchestration, with LLMs using tools provided by developers. To handle the logic and provide additional guardrails, the industry is shifting from unstructured snippets and restrictive prompts full of corner cases toward direct use of knowledge graphs. And to prevent context overflow, more systems are moving from single-agent calls to multi-agent pipelines. These pipelines can accumulate a lot of errors due to the snowball effect of inaccurate agents being called at each step, a problem that knowledge graphs, shared memory pipelines, and strict typing with audit capabilities should help overcome.

The Real Limits of AI Agents in 2025

hi+ivan-smetannikov@serokell.co (Ivan Smetannikov) — Sun, 02 Nov 2025 00:00:00 GMT

TL;DR: Everyone says 2025 is the year of autonomous AI agents. We’ve built a lot of them in production, and that’s exactly why we think most of the current hype just doesn’t add up. In this post, we’ll break down the most common misconceptions, talk about what actually works in the real world, and explain why the math and economics behind the hype don’t hold up yet.

Rumors and Speculations Breakdown

“Autonomous AI agents will replace traditional workflows in 2025!”
Not really. The idea of fully autonomous multi-step agents sounds great, but in practice it falls apart under simple math. The issue isn’t intelligence or prompt quality, it’s compounded error rates. Even small per-step error rates compound across steps, so the chance of a flawless run shrinks exponentially with the length of the workflow, which makes true end-to-end autonomy impossible at scale.

“Conversational agents are the next big thing!”
Maybe, but not in the way most people think. Long-context agents suffer from quadratic token costs. Every new message has to reprocess the entire conversation, and that makes long sessions ridiculously expensive.

“We just need better APIs and the agents will figure it out!”
Nope. The real bottleneck isn’t model capability, it’s bad tool design. Most “AI agents” today fail because the tools they use don’t give them structured feedback. The AI doesn’t need human-style interfaces, it needs clean, machine-readable signals that help it reason about what just happened.

Engineering Reality Breakdown

Let’s move from the hype to the practical side: what actually happens when you build and ship AI agents in production.

1. The Mathematics Behind Failure

Error compounding quietly kills multi-step autonomy.
Say your model performs each step with 95% accuracy (which is already optimistic). Here’s what happens:

5 steps → 77% success
10 steps → 60% success
20 steps → 36% success

Real production systems often need 99.9% reliability. Even if you somehow reach 99% per step, you still only get about 82% success across 20 steps. That’s not a prompt issue, that’s just math.
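The arithmetic behind these figures is a single exponentiation. The sketch below assumes that steps succeed independently and with the same probability, which is a simplification; the function names are ours, chosen for illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` end-to-end success."""
    return target ** (1 / steps)


for steps in (5, 10, 20):
    print(f"95% per step, {steps:>2} steps: {chain_success(0.95, steps):.1%}")
print(f"99% per step, 20 steps: {chain_success(0.99, 20):.1%}")
print(f"needed per step for 99.9% over 20 steps: {required_per_step(0.999, 20):.3%}")
```

Under the same assumption, reaching 99.9% end-to-end reliability over 20 steps would need roughly 99.995% accuracy on every single step.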

By contrast, our DevOps agent works precisely because it’s not truly autonomous. It runs 3–5 well-defined operations with rollback points and optional human confirmations. Each step is verifiable, and errors don’t pile up. The “autonomy” part is an illusion built on careful architecture.

2. Token Economics Nobody Mentions

There’s another uncomfortable reality: conversational agents are usually too expensive to scale.
Every new exchange requires processing the full conversation history, so token usage grows quadratically.

When we built a conversational database agent, the first few queries were cheap. By the 50th turn, each response could cost several dollars, more than the value of the query itself. That doesn’t work in production.
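A back-of-the-envelope sketch shows where the quadratic growth comes from. It assumes that every turn adds a fixed number of tokens and that the full history is re-sent on each turn; the value of `TOKENS_PER_TURN` is an arbitrary assumption, not a measurement, and real pricing depends on the model and provider.

```python
def tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-reads the whole history.

    Turn k re-processes k * tokens_per_turn tokens, so the total is
    tokens_per_turn * turns * (turns + 1) / 2, i.e. quadratic in turns.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


TOKENS_PER_TURN = 2_000  # assumed average size of one exchange

for turns in (1, 10, 50):
    total = tokens_processed(turns, TOKENS_PER_TURN)
    print(f"{turns:>2} turns: {total:,} input tokens in total")
```

With these assumed numbers, 50 turns process 2,550,000 input tokens in total, against 100,000 for 50 independent single-turn calls of the same size.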

That’s why stateless, single-turn agents are often more practical. Our function generator, for example, does one thing: it takes a description, produces a function, and stops there. No memory management, no exploding costs, just fast, cheap, and reliable execution. Ideally you want to validate and store intermediate results in conventional databases and at least try to guardrail them into being deterministic.

3. The Tool Design Wall

Even if you solve the math and the cost, there’s another wall waiting: tool engineering. LLMs are now quite good at calling tools, but the real challenge is designing tools that talk back in a way the AI can understand.

You need to think carefully about:

How to report partial successes
How to summarize large outputs without burning context
How to recover when a tool fails
How to handle dependencies between tools

Our database agent works well only because each tool returns structured, meaningful feedback, not just raw API dumps. That took weeks to get right. The truth is, the AI handles maybe 30% of the logic. The other 70% is the surrounding engineering: feedback design, context management, AI guardrails, error handling, and recovery mechanisms. All these mechanisms exist to fit non-deterministic, unpredictable AI behavior into a strict frame, which drastically reduces the error rate.
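As an illustration of what structured feedback can look like, here is a minimal sketch of a result envelope a tool might return. The `ToolResult` fields, the `run_query_tool` helper, and the notion of "shards" are hypothetical and are not taken from our production agent.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolResult:
    """Hypothetical envelope a tool returns to the agent."""
    status: str                      # "ok", "partial", or "error"
    summary: str                     # short text the model can reason about
    data: list[Any] = field(default_factory=list)
    truncated: bool = False          # True if data was cut to save context
    retryable: bool = False          # hint for recovery after a failure


def run_query_tool(rows: list[dict], failed_shards: int, max_rows: int = 3) -> ToolResult:
    """Wrap raw query output instead of dumping it into the context."""
    if not rows and failed_shards:
        return ToolResult("error", "all shards failed", retryable=True)
    status = "partial" if failed_shards else "ok"
    summary = f"{len(rows)} rows returned, {failed_shards} shard(s) failed"
    return ToolResult(status, summary, rows[:max_rows], truncated=len(rows) > max_rows)


result = run_query_tool([{"id": i} for i in range(10)], failed_shards=1)
```

The envelope covers three of the questions from the list above: partial success (`status`), large outputs (`truncated` plus a short `summary`), and recovery (`retryable`).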

Integration Breakdown

And even if you fix everything else, you still need to connect your agent to real systems, and real systems are messy.
Enterprise software isn’t a collection of clean APIs. It’s full of quirks, legacy components, unpredictable rate limits, and compliance rules that change overnight.

Our production database agent doesn’t just “run queries on its own.” It manages transaction safety, connection pools, audit logs, and rollback logic — all the boring, reliable stuff you need to make things actually work. Integration is where most AI agents fail quietly.
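A minimal sketch of the transaction-safety and audit-log part, using Python's built-in `sqlite3` module as a stand-in for a production database. The `run_guarded` helper and the `accounts` table are assumptions made for the example.

```python
import sqlite3


def run_guarded(conn: sqlite3.Connection, statements: list[str], audit: list[str]) -> bool:
    """Run statements in one transaction; roll back all of them on any error."""
    try:
        with conn:  # commits on success, rolls back on exception
            for sql in statements:
                conn.execute(sql)
                audit.append(f"executed: {sql}")
        return True
    except sqlite3.Error as exc:
        audit.append(f"rolled back: {exc}")
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

audit_log: list[str] = []
ok = run_guarded(
    conn,
    [
        "UPDATE accounts SET balance = balance - 30 WHERE id = 1",
        "UPDATE accounts SET balance = NULL WHERE id = 1",  # violates NOT NULL
    ],
    audit_log,
)
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

The second statement fails, so the first one is rolled back as well and the balance stays at 100, while the audit log records both the executed statement and the rollback.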

What Actually Works

After building several different agent systems, a clear pattern has emerged. The ones that work all look surprisingly similar:

UI generation agents succeed because humans review everything before deployment.
Database agents work because potentially destructive actions require confirmation.
Function generators work because they’re stateless and self-contained.
DevOps agents work because they output infrastructure-as-code that humans can review and roll back.
CI/CD agents work because the pipeline enforces strict success and rollback criteria.

And all these agents work only if you give them clear, straightforward guidelines for the granular task you want them to perform, just as you would explain it to a real person who has never done it: with all the caveats and potential problems.

The pattern is simple: AI handles complexity, humans keep control, and traditional software ensures reliability.
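The "humans keep control" part can be as small as a confirmation gate in front of the executor. The sketch below is hypothetical: the prefix-based `is_destructive` check is a deliberately naive policy chosen for brevity, and `fake_run` stands in for a real database call.

```python
from typing import Callable

DESTRUCTIVE_PREFIXES = ("drop", "delete", "truncate", "update")  # assumed policy


def is_destructive(sql: str) -> bool:
    """Naive check: classify a statement by its leading keyword."""
    return sql.strip().lower().startswith(DESTRUCTIVE_PREFIXES)


def execute_with_gate(
    sql: str,
    run: Callable[[str], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run read-only statements directly; ask a human before destructive ones."""
    if is_destructive(sql) and not confirm(sql):
        return "skipped: confirmation denied"
    return run(sql)


executed: list[str] = []


def fake_run(sql: str) -> str:
    """Stand-in executor that only records what it was asked to run."""
    executed.append(sql)
    return "ok"


read_result = execute_with_gate("SELECT * FROM users", fake_run, confirm=lambda s: False)
drop_result = execute_with_gate("DROP TABLE users", fake_run, confirm=lambda s: False)
```

In a real deployment, `confirm` would prompt an operator, and the classification would come from a proper SQL parser or from database permissions rather than from string prefixes.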

Predictions for the end of 2025

Here’s how we think 2025 will play out:

Startups chasing “fully autonomous agents” will hit a hard wall with cost and reliability. Few-step demos don’t survive real 20-step workflows. Real data and tools accessed through the magic of MCP, but without clear guidelines, will not yield high accuracy even on simple pipelines of a few steps.
Big enterprise tools that just slap “AI agent” onto their existing products will stall because their integrations can’t handle the real world.
The real winners will build focused, domain-specific assistants that use AI where it helps most, but still rely on humans or deterministic systems for critical control points and for general guidance of the AI agents.

Eventually, people will realize the difference between AI that demos well and AI that actually ships. It’s going to be an expensive lesson.

Building the Right Way

If you’re building AI agents this year, start with these principles:

Define a clear problem you want to solve or automate. AI is not a magic box; using it rests on the same principles as classical software development. The only differences are that it can handle unstructured data with much less development effort and that its output is non-deterministic.
Split the problem into verifiable pieces where possible. Instead of building a single complex agent, make several small ones. If needed, add “manager” or “intermediate” agents that aggregate the results of several other agents.
Provide clean instructions. Each set of instructions should be clear and straightforward. Try to cover all corner cases, but don’t overthink it: if the instructions become too complicated or too long, return to step 2.
Define clear boundaries. Know exactly what your agent can do and when it should stop. Be ten times more careful with agents that supply you with data rather than just summarizing it and building reports. Do not give the agent direct write access if it works with sensitive information.
Design for failure. Assume 20–40% of operations will go wrong. Have rollback plans. Always keep a “ground truth” source of information that has not been touched by AI at all. Keep a full log of what was done in the system, along with an emergency script that can use your “ground truth” data to rebuild everything if needed.
Mind the economics. Measure token costs and scale realistically. Stateless often beats stateful. Cache agent responses if needed, especially if their job was to generate some intermediate data.
Prioritize reliability over autonomy. People trust consistent tools more than “magical” ones. Never deploy a new agent to a wide audience before you have proved it to be effective. Use beta testers with expertise relevant to the application area to tune it.
Use AI where it shines. Let it handle reasoning, intent, and generation. Give it unstructured data as input to work with. Leave data processing, execution and state to proven software patterns.
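The caching advice from the "Mind the economics" principle can be sketched in a few lines. `ResponseCache` is a hypothetical helper, and `fake_agent` stands in for a real model call; a production version would also need expiry and persistent storage.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """Hypothetical cache keyed by a hash of the agent name and its input."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    def get_or_call(self, agent: str, payload: dict, call: Callable[[dict], str]) -> str:
        # Stable key: same agent + same payload -> same hash.
        raw = json.dumps({"agent": agent, "payload": payload}, sort_keys=True)
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = call(payload)
        self._store[key] = result
        return result


calls = 0


def fake_agent(payload: dict) -> str:
    """Stand-in for an expensive model call."""
    global calls
    calls += 1
    return f"def {payload['name']}(): pass"


cache = ResponseCache()
first = cache.get_or_call("function-generator", {"name": "parse_date"}, fake_agent)
second = cache.get_or_call("function-generator", {"name": "parse_date"}, fake_agent)
```

Caching like this only makes sense for stateless agents whose output you are happy to reuse for identical input, which is one more argument for keeping agents stateless.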

Outro

The agent revolution is real, but it’s not going to look like the hype suggests. The winning systems won’t be fully autonomous. They’ll be thoughtful combinations of AI reasoning, human judgment, and traditional engineering discipline.

We are not betting against AI. We are betting against the current obsession with its overpromising use. The real breakthroughs will come from teams who understand the limits, respect the math, and build around reality instead of wishful thinking.

Long-term outlook:
Still, it’s worth thinking about where this all leads. Just like deep learning eventually replaced handcrafted pipelines with end-to-end systems, agents will likely follow the same path.

Over time, meta-learning and new reinforcement-learning methods — ones that don’t even exist yet — will let models learn not just tasks, but how to learn.

They’ll be able to adapt to feedback, handle rare edge cases, and self-correct in ways we currently have to hardcode. When that happens, the rigid guardrails we depend on today will turn into adaptive self-tuning mechanisms, and we’ll finally reach the true end-to-end agent era, where you simply feed in new input as the system runs and it corrects itself without any further intervention on your side.

Reviving an Old iMac with NixOS

hi+alexeydanilevsky@serokell.co (Alexey Danilevsky) — Sun, 14 Sep 2025 00:00:00 GMT

People have been using computers for decades. Information technology advances by leaps and bounds. As a result, yesterday’s new, powerful machines quickly become today’s obsolete hardware, gathering dust on shelves and in closets.

However, these old computers can still be useful for low-resource tasks, such as working with documents, surfing the web, and watching videos. Typically, though, obsolete hardware is supported only by very old software (operating systems and applications) that is vulnerable and unsafe to use. Newer software either cannot be installed or runs too slowly because of the slow CPU and the lack of memory and disk space.

I personally have two examples of such obsolete computers: an Acer Extensa 5220 laptop and an 18-year-old iMac. The iMac has a large display, and it is a shame that I cannot use it safely, since its hardware is supported only by the very old Mac OS X Lion. I tried to install several lightweight Linux distributions on it, but ran into two problems:

It was not easy to get the Wi-Fi card to work properly.
The software was not responsive.

That is why I decided to experiment with NixOS on my old computers, since I know that NixOS allows fine-tuning of the operating system and other software.

Goals

Get the old computer working.
Use modern and secure software with available security and feature updates.
Ensure a pleasant and responsive user experience.

Breathing new life into the old iMac

Generally, the following steps are suitable for any low-spec computer. However, we will focus more on the iMac, as it is a bit more challenging and requires more effort.

We have an iMac made in 2007:

CPU: Core 2 Duo, 2.4 GHz
RAM: 2 GB
SSD: 120 GB (upgraded some time ago from HDD)
Wi-Fi, Bluetooth

Problems we have to solve:

First of all, we need to remove the thick layer of dust that has collected on it.
The computer’s hardware is very limited, so we cannot use the graphical installer.

Furthermore, Apple computers have a Broadcom Wi-Fi card that requires a proprietary driver, which is not included in the default NixOS installation image. Of course, it’s possible to connect to the network using an Ethernet cable or tethering, but I don’t have a suitable cable and don’t want to use my mobile data.

That is why we are going to create a custom minimal ISO image with Broadcom Wi-Fi support to be able to install NixOS conveniently.

We are going to create a NixOS configuration that is sufficient to meet the requirements of Goal 3.

Creating custom minimal installation image

Note: This step is required only for Apple computers and should be skipped for all other hardware. In that case, simply download the minimal installation image from the NixOS website.

Prerequisites: Nix should be installed on the computer which you are going to use for creating the installation image.

Create a working directory, say custom-nixos-iso.

Creating image with nix flakes

Create a flake.nix file with the following content:

{
  description = "Custom NixOS ISO with Broadcom";

  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.05";

  outputs = { self, nixpkgs }:
    let
      system = "x86_64-linux";
    in {
      nixosConfigurations.install-iso = nixpkgs.lib.nixosSystem {
        inherit system;
        modules = [
          ({ config, pkgs, ... }: {
            imports = [
              "${nixpkgs}/nixos/modules/installer/cd-dvd/installation-cd-minimal.nix"
            ];
            hardware.enableRedistributableFirmware = true;
            # boot.extraModulePackages = [ config.boot.kernelPackages.broadcom_sta ];
            boot.kernelModules = [ "b43" ];
            boot.blacklistedKernelModules = [ "wl" ];
            networking.enableB43Firmware = true;
          })
        ];
      };
    };
}

Then run the following command in the working directory:

nix build .#nixosConfigurations.install-iso.config.system.build.isoImage

The image will be placed in the result directory.

Creating image without flakes

If you don’t want to use flakes, you can create the installation image with ordinary Nix. Create an iso.nix file with the following content:

{ config, pkgs, ... }:

{
  imports = [
    <nixpkgs/nixos/modules/installer/cd-dvd/installation-cd-minimal.nix>
    <nixpkgs/nixos/modules/installer/cd-dvd/channel.nix>
  ];

  hardware.enableRedistributableFirmware = true;
  boot.kernelModules = [ "b43" ];
  networking.enableB43Firmware = true;
  # boot.extraModulePackages = [ config.boot.kernelPackages.broadcom_sta ];
  boot.blacklistedKernelModules = [ "wl" ];
}

Run in the working directory:

nix-build '<nixpkgs/nixos>' -A config.system.build.isoImage -I nixpkgs=channel:nixos-25.05 -I nixos-config=iso.nix

Notes on selecting Wi-Fi drivers

Actually, there are only two options:

The original proprietary driver, broadcom_sta, which provides the kernel module wl.
The free driver b43, which requires redistributable proprietary firmware.

The best driver is the one that works for you. My experience shows that both can get the Wi-Fi card working. However, broadcom_sta has known security vulnerabilities and, in my case, it does not work correctly: it creates an incorrect ARP configuration. This means that although the Wi-Fi card obtains an IP address, it cannot reach the gateway. That is why I have enabled b43 in my Nix configurations.

If you want to try broadcom_sta, you need to:

Uncomment the line that enables this driver.
Activate the kernel module wl.
Disable the B43Firmware option.
Blacklist the b43 kernel module.

Write the image to a USB stick

You can use any suitable software for this. However, note that Ventoy should not be used for installing NixOS on Apple computers.

Booting from the USB stick

When booting your old computer, you must select the USB stick as the boot device. On Apple computers, this is done by pressing the Option key immediately after powering it on.

Setting up the Internet connection

Before you start the installation, make sure that you have a working Internet connection.

First, run the command

ip link

and check if you can see your Wi-Fi card in the output. In my case, it’s the wlan0 interface. If you don’t see a Wi-Fi card, it means the driver does not support your hardware. Make sure the driver is loaded (lsmod) and try setting up an alternative driver when creating the installation image.

If the Wi-Fi driver works and recognizes your hardware, configure and start your Wi-Fi connection:

wpa_passphrase YOUR_SSID > wpa.conf
sudo wpa_supplicant -B -i wlan0 -c wpa.conf

Please adjust values for your Wi-Fi network SSID and network interface name to match your setup.

Normally, the dhcpcd daemon is running after boot, so you should get a working Internet connection shortly. If not, make sure you have entered correct values for the SSID and password. To check your connection run the following commands:

ip addr # check that you got an IP address
ping 8.8.8.8 # check if the Internet is accessible

Installing NixOS

Create and mount disk partitions

Notes:

This guide assumes your disk has no existing partitions. If it does, ensure you have backed up any important data and remove all partitions before proceeding.
In all following commands, replace sdX with your actual device name (e.g., sda, nvme0n1). You can use the lsblk command to list all available block devices.
You are free to partition the disk according to your needs. However, the layout below is a reasonable setup for an older computer with limited disk space.
The size of the swap partition should correspond to the amount of RAM in your system. In this guide, we will create a swap partition equal to the system’s memory size.

To remove existing partitions use the following commands:

sudo parted /dev/sdX -- print # prints existing partitions
sudo parted /dev/sdX -- rm n # remove partition with number n

Create new partitions:

# Create partition table GPT
sudo parted /dev/sdX -- mklabel gpt 

# Create boot partition
sudo parted /dev/sdX -- mkpart ESP fat32 1MiB 512MiB
sudo parted /dev/sdX -- set 1 esp on

# Create swap partition (size 2048 MiB)
sudo parted /dev/sdX -- mkpart primary linux-swap 512MiB 2560MiB

# Create root partition
sudo parted /dev/sdX -- mkpart primary ext4 2560MiB 100%

Format new partitions:

sudo mkfs.fat -F 32 -n boot /dev/sdX1
sudo mkswap -L swap /dev/sdX2
sudo mkfs.ext4 -L nixos /dev/sdX3

Mount partitions:

sudo mount /dev/disk/by-label/nixos /mnt

sudo mkdir -p /mnt/boot
sudo mount /dev/disk/by-label/boot /mnt/boot

sudo swapon /dev/disk/by-label/swap

Create NixOS configuration

sudo nixos-generate-config --root /mnt

After that, open /mnt/etc/nixos/hardware-configuration.nix:

sudo -e /mnt/etc/nixos/hardware-configuration.nix

If you use the b43 Wi-Fi driver, make sure that broadcom_sta is not enabled there.

Add the noatime option to the root file system configuration.

Replace the entire contents of /mnt/etc/nixos/configuration.nix with the following configuration. Adjust it according to your needs.

# This configuration.nix file is optimized for an old (2007) iMac.
# It can be adapted for any old computer with weak hardware.
# To do this, you just need to configure it for your specific hardware:
# Wi-Fi card, video driver, printer, etc. Along with that, you may
# choose your preferences for locale, basic system software, themes, etc.
{ config, lib, pkgs, ... }:

{
  imports = [ ./hardware-configuration.nix ];

  hardware = {
    # Wi-Fi card and Bluetooth on iMac require redistributable firmware
    enableRedistributableFirmware = true;
    # Enable hardware graphics support
    graphics = {
      enable = true;
      enable32Bit = true;
    };
    # Enable Bluetooth support
    bluetooth = {
      enable = true;
      powerOnBoot = true;
      package = pkgs.bluez;
      settings = {
        General = {
          Experimental = true;
          ControllerMode = "dual";
          FastConnectable = true;
        };
      };
    };
  };

  boot = {
    # Use EFI bootloader
    loader = {
      systemd-boot = {
        enable = true;
        configurationLimit = 5;
      };
      efi.canTouchEfiVariables = true;
    };
  };
  # Enable compressed RAM swap. Useful for low RAM systems
  zramSwap = {
    enable = true;
    # Allow utilizing 60% of RAM for compressed swap
    memoryPercent = 60;
  };
  # Configure networking
  networking = {
    # Use network manager
    networkmanager.enable = true;
    # Set desired host name
    hostName = "nixos";
    # Enable firmware to make Wi-Fi card working (in my case BCM4321)
    enableB43Firmware = true;
  };
  # Setup time zone and locale
  time.timeZone = "Europe/Moscow";
  i18n.defaultLocale = "ru_RU.UTF-8";
  # Configure services
  services = {
    # GUI
    xserver = {
      # Enable X server
      enable = true;
      # Setup video driver. Old iMac has Radeon graphics card
      videoDrivers = [ "radeon" ];
      # Configure lightweight display manager
      displayManager = {
        lightdm = {
          enable = true;
          greeters.gtk.enable = true;
        };
        # The following commands are to configure a good-looking default GUI
        # theme. The script does not change the user settings made manually.
        sessionCommands = ''
          set_default() {
            local ch="$1" key="$2" val="$3" type="$4"
            if ! xfconf-query -c "$ch" -p "$key" >/dev/null 2>&1; then
              xfconf-query -c "$ch" -p "$key" -s "$val" --create -t "$type"
            fi
          }

          set_default xsettings /Net/ThemeName "Arc-Dark" string
          set_default xsettings /Net/IconThemeName "Papirus-Dark" string
          set_default xsettings /Gtk/CursorThemeName "Bibata-Modern-Classic" string
          set_default xsettings /Gtk/CursorThemeSize "24" int
          set_default xfwm4 /general/theme "Arc-Dark" string
          set_default xsettings /Gtk/FontName "Inter 10" string
        '';
      };
      # Setting XFCE as a desktop manager. It is rather lightweight but
      # feature-rich
      desktopManager.xfce.enable = true;
      # Configure multi language support. This section could be omitted because
      # XFCE supports configuring keyboard layout switching.
      xkb = {
        layout = "us,ru";
        # Switch keyboard layouts with CMD+Space like by default on Mac
        options = "grp:win_space_toggle";
      };
    };
    # Configure autologin for your user. Weak computers are usually not intended
    # to be used by several users, so this configuration can be useful. If you
    # want more than one user, just omit this section.
    displayManager = {
      autoLogin = {
        enable = true;
        user = "your-user-name";
      };
    };
    # Configure multimedia support
    pipewire = {
      enable = true;
      # Enable ALSA compatibility
      alsa.enable = true;
      # Enable PulseAudio compatibility
      pulse.enable = true;

      wireplumber = {
        enable = true;
        # Configure audio support via Bluetooth. The dotted names are quoted
        # because WirePlumber expects them as literal keys, not nested sets.
        extraConfig = {
          "50-bluez" = {
            "monitor.bluez.properties" = {
              "bluez5.enable-msbc" = true;
              "bluez5.enable-sbc-xq" = true;
              "bluez5.enable-hw-volume" = true;
              "bluez5.roles" = [ "a2dp_sink" "a2dp_source" "hsp_hs" "hfp_hf" ];
            };
          };
        };
      };
    };
    # Enable power management
    tlp.enable = true;
    # Enable daemon for temperature monitoring
    thermald.enable = true;
    # Enable SSH.
    openssh.enable = true;
    # Configure printing
    printing = {
      enable = true;
      # Setup correct drivers. The following driver is for most Epson printers.
      drivers = [ pkgs.epson-escpr ];
    };

    avahi = {
      enable = true;
      nssmdns4 = true;
      openFirewall = true;
    };
    # Enable Bluetooth manager
    blueman.enable = true;
  };
  # Configure fonts
  fonts = {
    # Install most useful and popular fonts
    packages = with pkgs; [
      dejavu_fonts
      liberation_ttf
      noto-fonts noto-fonts-cjk-sans noto-fonts-emoji
      inter
      roboto roboto-mono
      jetbrains-mono
      fira-code
      font-awesome
      corefonts
    ];
    # Font settings
    fontconfig = {
      antialias = true;
      hinting.enable = true;
      hinting.style = "slight";
      subpixel.rgba = "rgb";
      subpixel.lcdfilter = "default";

      defaultFonts = {
        serif = [ "Noto Serif" "DejaVu Serif" "Liberation Serif" ];
        sansSerif = [ "Inter" "Noto Sans" "DejaVu Sans" "Liberation Sans" ];
        monospace = [ "JetBrains Mono" "Fira Code" "DejaVu Sans Mono" "Roboto Mono" ];
        emoji = [ "Noto Color Emoji" ];
      };
    };
  };
  # By default XFCE installs thunar (file manager), but does not install
  # thunar-archive-plugin. Here we explicitly install thunar with plugins.
  # Having thunar-archive-plugin allows us to compress/decompress files from the
  # context menu.
  programs.thunar = {
    enable = true;
    plugins = with pkgs.xfce; [
      thunar-archive-plugin
      thunar-volman
    ];
  };
  # At the moment of creating this configuration file there is an issue with
  # thunar-archive-plugin: it does not work with xarchiver. The following hack
  # resolves the issue.
  nixpkgs.overlays = [
    (final: prev: {
      xfce = prev.xfce.overrideScope (_self: super: {
        thunar-archive-plugin = super.thunar-archive-plugin.overrideAttrs (old: {
          postInstall = lib.concatStringsSep "\n" [
            (old.postInstall or "")
            ''
              mkdir -p $out/libexec/thunar-archive-plugin
              cp -r ${prev.xarchiver}/libexec/thunar-archive-plugin/* \
                    $out/libexec/thunar-archive-plugin/ 2>/dev/null || true
            ''
          ];
        });
      });
    })
  ];
  # Install system-wide packages
  environment = {
    # Do not install any package by default. We want full control of packages
    # configuration
    defaultPackages = [];
    # We install only the following minimum of packages. Other packages user can
    # install in their profile.
    systemPackages = with pkgs; [
      htop ncdu iotop lm_sensors pciutils usbutils # System utilities
      xfce.xfce4-terminal # Graphical terminal
      xfce.xfce4-xkb-plugin # plugin to display keyboard layout variant in status bar
      rofi # A utility for quickly launching applications and other useful features
      system-config-printer # GUI utility for printer settings
      arc-theme papirus-icon-theme bibata-cursors # Themes for XFCE
      zathura # Document viewer
      mpv # Media player
      feh # Image viewer
      micro # Tiny simple text editor used as default editor
      neovim # Text editor for advanced users
      xarchiver zip unzip p7zip xz zstd bzip2 gzip unrar # Archiver utilities
      firefox # Web browser
    ];
    # Set some environment variables
    variables = {
      # Default editor
      EDITOR = "micro";
      # Ask Qt applications to use the GTK theme so they look better in XFCE
      QT_QPA_PLATFORMTHEME = "gtk3";
    };
  };
  # Nix configuration
  nix = {
    # Configure garbage collection
    gc = {
      automatic = true;
      dates = "weekly";
      options = "--delete-older-than 7d";
    };
    # Additional useful settings
    settings = {
      auto-optimise-store = true;
      experimental-features = [ "nix-command" "flakes" ];
    };
  };
  # Allow proprietary software
  nixpkgs.config.allowUnfree = true;
  # Users configuration
  users.users.your-user-name = {
    isNormalUser = true;
    extraGroups = [
      "wheel" # Enables sudo for user
      "networkmanager" # Enable Wi-Fi connections for user
    ];
  };
  # Enable sudo
  security.sudo.enable = true;

  system.stateVersion = "25.05";
}

Install the system

sudo nixos-install

After the installation process finishes, reboot the system and enjoy. Of course, you can fine-tune the system now as you like.

Haskell in Production: Scrive

hi+denisoleynikov@serokell.co (Denis Oleynikov) — Tue, 29 Jul 2025 00:00:00 GMT

In our Haskell in Production series, we interview developers and technical leaders from companies that use Haskell for real-world tasks. We cover benefits, downsides, common pitfalls, and tips for building useful Haskell products. Today’s guest is Flavio Corpa, Senior Software Engineer at Scrive, a company providing electronic signature services and eID solutions.

Technical Stack & Architecture

What was the architecture of Scrive like when Haskell was first introduced? What kind of problems was it intended to solve at that point?

One of the founders, Gracjan Polak, was a functional programming enthusiast. He started programming Scrive using Haskell for the backend from the very beginning.

Which parts of your system are written in Haskell today, and where did you deliberately choose not to use it?

Most of the electronic signature and eID backend is written in Haskell, with a notable exception in the final PDF handling logic, where the iText library is used via Kotlin.

Were there any unexpected advantages or friction points when using Haskell in a highly regulated domain, with legal traceability and audit requirements?

In audits, we have to respond to questions like: “Which static code analysis are you using?”. It is a bit hard to explain why Haskell does not have any.

Do you use AI for Haskell development? Could you share your experience with it? Is it helpful? How do you think the rapid growth of programming AI will affect the future of Haskell?

Yes, the company encourages us to use AI in our Haskell development and provides us with GitHub Copilot licenses. I have been using it recently, and outside of my job I’ve also been using tools like Cursor and Windsurf for developing with TypeScript and Svelte. I have to say that the support of LLMs for Haskell feels quite limited. This was to be expected because of the reduced sample size of the dataset we have compared to other mainstream languages, but I really hope they will become smarter in the future and aid us even in such complex languages as Haskell!

What build and deployment tools do you use in dev environments and for CI/CD?

We have a mix of Docker and Nix company-wise, but mostly our CI/CD is Docker with Kubernetes, and it is working fine for us, but I would say I feel it is a bit slower than it should be. Definitely, there’s room for improvement there.

Scale, Interfacing & Observability

Scrive integrates with many external partners — from banks to government entities. How has Haskell held up when dealing with complex, real-world API interactions? Context: Digging into serialization, FFI, latency, and error resilience.

There was a point where we had to hand-roll a SOAP client in Haskell for an external integration :) But overall, we mostly find what we need in the ecosystem - sometimes resorting to maintaining some of the packages (for example, hpqtypes).

How do you handle observability in your Haskell services — logging, tracing, metrics? Was this an area where you had to build extra tooling or abstractions?

We use Prometheus through the prometheus-haskell suite of packages. Logging is done with our own open-source toolkit: https://github.com/scrive/log. Metrics are produced with a fork of the tracing library and sent to Grafana Tempo.

Haskell in Production: Lessons and Tactics

What are some hard-earned lessons from maintaining Haskell in production at scale? Any anti-patterns you’d warn others to avoid?

Functions that acquire locks on the database can be at the root of production issues if they are not well-understood by developers, and this starts with the name. A function called withLockedDocument will be used more carefully than a function simply called withDocument.
Trying to fit several entities that have admittedly similar shapes, but wildly different life cycles, into one type will come to haunt you sooner than you think. You can resort to a number of tricks to express two different types of users in the same data type, but this will put a burden on your code. Write a new data type; it’s free.
In some parts of the codebase, we are using effectful, and we are very happy with it, btw ;)

How do you onboard new engineers who aren’t experienced in Haskell? Has this influenced how you write or structure code internally?

Mob and pair programming are a huge thing within Scrive. Even engineers who were once QAs are invited to try out Haskell (and Elm), and sometimes they even stick with it! Since the company uses mostly “Boring Haskell” (except for a few complex abstractions here and there), the onboarding process is not so difficult once the developer starts to understand Haskell’s basic principles and syntax.

Have you built any internal libraries or DSLs to help model domain logic (e.g., document flows, signing rules, legal states)?

Yes! I joined a team within Scrive that had built its own DSL following the recommendation of Sandy Maguire in his book Algebra Driven Design (which I had to read, of course, haha). The domain was very specific to documents, signatures, and the way we handle them in the company, but the resulting DSL and property tests written as a consequence were really nice and incredibly powerful!

How do you approach GHC upgrades? Do you usually try to keep up with the latest stable GHC version, or do you stay on the version that works for you for as long as you can? How painful are GHC upgrades for you? Do they require a lot of work to build your whole codebase?

We upgrade to new minor versions of GHC based on how well-supported they are by the ecosystem and known regressions, the latter being the main adoption blockers for us.

Do you follow GHC development with regard to new features? Are you open to introducing them into your code, or do you prefer being more conservative? Are there any features added over the last few years that you find particularly useful?

We keep an eye out for the latest improvements in features that allow us to debug our systems in production in an efficient manner, like thread labels, backtraces, and callstacks. In this regard, GHC 9.12 brings many good things. We were very happy to adopt OverloadedRecordDot, which decreased our usage of the Optics library a lot.

Functional Mindset vs Business Constraints

How do you balance functional purity with product velocity, especially in a legal-tech setting where correctness is critical but deadlines are real?

Our tooling and performance team exists to lay reliable foundations upon which we can build rapidly. Percentage-wise, the majority of “new” features are built upon existing features. We rarely have to rebuild something from the ground up. Moreover, our QA team ensures that what we deliver is up to spec. They maintain their own list of invariants to check for, and they are a big reason why we deliver quality software today.

Do you feel that using Haskell has shaped the way your team thinks about problems, not just technically, but organizationally or culturally?

I think we are using Haskell in a practical way and not an academic way. Outside the scope of a single service, I see no difference from other tech companies that could possibly be influenced by tech choice. Having said that, working in a company full of functional developers (frontend and backend) is quite a delight to be honest :)

The Road Ahead

Are there areas of Scrive’s platform where you’re planning to increase (or reduce) Haskell usage in the future?

While backend services for the main product suite keep being written in Haskell, we are also letting other teams (integrations, analytics) use the tools they are comfortable with.

If you had to rebuild a critical part of your system today, would you still choose Haskell, or would another tool be more appropriate now?

Considering our investments in development experience and the in-house expertise: Yes.

Thanks Flávio for this interview!

A Bit Late but Ultimate Analysis: DeepSeek

hi+ivan-smetannikov@serokell.co (Ivan Smetannikov) — Mon, 03 Mar 2025 00:00:00 GMT

DeepSeek R1

tl;dr: if you want to find out everything about DeepSeek, this blog post has three sections, each more technical and more difficult than the previous one, so you can stop reading at any point once you have enough information or the material becomes too difficult to follow.

DeepSeek recently released its latest model, R1, which has performance comparable to the latest available OpenAI models at a much lower computational cost. Unfortunately, because of many optimistic claims by their team and many hard-to-understand innovations introduced in their work, a lot of rumours and misunderstandings are circulating around this model.

In this blog post, we will briefly break down the most common rumours and speculations about the R1 model, give detailed but accessible explanations of all the DeepSeek innovations in this model, explain why it was so cheap to train and so easy to operate, and finally go deeper into the most difficult parts of their research, so that you can understand how it works down to the last bit.

Rumours and speculations breakdown

“DeepSeek R1 is on the same level as OpenAI models, but much cheaper!” While DeepSeek’s inference is definitely much cheaper, its superiority in performance is not so clear. Yes, it shows comparable or better performance than some of OpenAI’s models on several open benchmarks, but this holds only for math and coding; it shows much worse results on other common tasks. There is also some independent research suggesting that it is worse on more general math and coding tasks outside of popular benchmarks, which was partially confirmed at the latest AIME competition (see the NB in the Data Labelling Pipeline section for details). I definitely recommend thinking of this model as a competitor to Google Gemini Flash Thinking rather than to a full-fledged OpenAI model.
“DeepSeek is dirt-cheap to use!” Well, yes and no. Yes, you can use the DeepSeek model through their official API for a fraction of the cost of other popular models like Llama. Unfortunately, their team was not ready for such hype, so their API is very often down and is very unstable to use. And if you try to run the model internally or buy access from other providers that run it, you will quickly find out that it is several times more expensive. The main problem is that while the weights of the model and the white paper about it were openly published, the hardware-specific source code was not, and that code contains tons of optimizations that make this model cheaper to run.
“DeepSeek stole OpenAI’s data!” From what we see in our internal tests and other independent tests, this statement seems quite unlikely to be true and was probably made to calm OpenAI’s investors. Later, in the second section, you will see some details on their innovative data-gathering technique, described in the DeepSeekMath paper. In the third section, we will discuss how this technique was further improved and changed to produce the DeepSeek-R1-Zero and then the DeepSeek-R1 model. These innovations also contradict OpenAI’s initial statement.
“DeepSeek spent 5.58 million to train, over 89 times cheaper than OpenAI’s rumored 500 million budget for its o1 model!” Well, that’s complete nonsense. While 5.58 million is probably a true number and it is much cheaper than what competitors spend, we are talking about a 4-8 times difference at most. The main issue is that the 5.58 million was spent only on the single final training run of the model, which for other models of comparable size with known costs was between 7 and 20 million. This price tag does not include all the intermediate runs, which are usually much cheaper, but there can be up to several hundred of them. It also does not include spending on data gathering, research, development, and human resources. So, in the end, the fully developed DeepSeek model probably cost at least 200 million. Nevertheless, they introduced many innovations to reduce both training and inference costs, which we discuss later in this blog post.

Innovations breakdown

Now let’s take a look at all optimisations and innovations made by DeepSeek. I will mostly focus on either general scientific achievements or technical cost-reduction innovations. This section is still general-public oriented, so I hope it will be easy to digest.

1. Low-Level Optimization for Faster Computation

Most AI models are trained using PyTorch, a popular deep-learning framework that provides ease of use but adds extra computational overhead. For faster training, many advanced AI teams use NVIDIA’s NCCL instead (a high-performance library for communication between GPUs). However, DeepSeek went even deeper — they customized NCCL itself, optimizing GPU Streaming Multiprocessors (SMs) using super low level PTX (Parallel Thread Execution) assembly language. This super low-level tuning allowed them to better match their specific hardware architecture, reducing latency and improving data transfer between GPUs. This approach was introduced in their DeepSeek V2 paper.

2. 8-bit hybrid Training Instead of 32-bit for Cost Efficiency

Most AI models train in 32-bit floating point (FP32) or 16-bit floating point (FP16) precision. This is a standard approach that ensures stability but requires significant computational power. DeepSeek was able to stabilize 8-bit training (FP8), drastically cutting memory usage and increasing speed. But they didn’t just naively apply 8-bit across the board which is well known to be unstable. They used a hybrid approach where most layers operated in FP8, but some carefully picked ones were aggregated in 32-bit precision when needed for stability. This “Floating Point Adaptive” (FPA) training balances efficiency and accuracy while reducing training costs and memory requirements.
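A small sketch can show why the precision of the accumulator matters. Python's standard library has no FP8 type, so this example uses IEEE half precision (FP16, via `struct`) as a stand-in for the low-precision format and ordinary Python floats as a stand-in for the higher-precision accumulator. The function names are made up for illustration, and this is not DeepSeek's actual scheme.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_all_low_precision(values):
    """Keep both the inputs and the running sum in half precision."""
    acc = 0.0
    for v in values:
        acc = to_half(acc + to_half(v))
    return acc


def sum_with_wide_accumulator(values):
    """Keep the inputs in half precision but accumulate in a wider float."""
    acc = 0.0
    for v in values:
        acc += to_half(v)
    return acc


# 10,000 small contributions whose exact sum is 100.
values = [0.01] * 10_000
low = sum_all_low_precision(values)
mixed = sum_with_wide_accumulator(values)
print(low, mixed)
```

In the all-low-precision version, the running sum eventually becomes so large that adding 0.01 no longer changes it, so the result stalls far below 100. With a wide accumulator the result stays close to 100, even though every individual input was still rounded to low precision.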

3. Mixture of Experts (MoE) for Massive Parameter Efficiency

DeepSeek R1 uses a Mixture of Experts (MoE) architecture, meaning that instead of activating all 671 billion parameters during inference, it selectively activates only 37 billion. This drastically reduces the computational load while still leveraging a large model’s capability. While the MoE approach itself is well known and had already been used in OpenAI and Mistral models, DeepSeek put an extra spin on it. MoE introduces a new challenge: balancing the GPU workload. Since only a subset of experts is active at any given time, not all GPUs are used equally, and some of them are basically idling and waiting for data. Instead of relying on NVIDIA’s default load management, DeepSeek developed a custom load balancer to distribute work optimally across the specific GPU infrastructure they had, according to their specific architecture.
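As a rough illustration of selective activation, here is a minimal top-k routing sketch. The toy "experts" are simple scaling functions, and the gate scores, the value of `k`, and all function names are assumptions for the example. It does not reproduce DeepSeek's router or its load balancer.

```python
import math


def softmax(scores):
    """Turn raw gate scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs."""
    weights = route_top_k(gate_scores, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four toy experts; only two of them are evaluated per input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, -1.0, 1.5]

weights = route_top_k(gate_scores, k=2)
output = moe_forward(3.0, experts, gate_scores, k=2)

# Share of parameters active per token, using the figures quoted in the text.
active_fraction = 37 / 671
print(sorted(weights), output, round(active_fraction, 3))
```

With the figures quoted above, roughly 5.5% of the parameters take part in each forward pass, which is where the inference savings come from.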

4. Optimized Hardware Choices for US Export-Limited GPUs

Training and running large models depend on three key factors:

Compute power (FLOPs) – Main speed multiplier for training base LLMs.
Memory bandwidth – How fast GPUs can access and process data.
Interconnect speed – How efficiently GPUs communicate with each other.

Due to US export restrictions, DeepSeek was unable to access the highest-end NVIDIA GPUs, which limited them in FLOPs. However, this was offset by the specialized cards NVIDIA provided, which had high memory bandwidth and fast interconnect speeds, much higher than those of their top-performing server GPUs. This turned out to be more important for reasoning models (models optimized for tasks like problem-solving and step-by-step reasoning rather than raw number crunching), which DeepSeek-R1 is. So, unintentionally, NVIDIA helped them overcome US export limitations, at least for their reasoning model. I assume that this might result in additional restrictions later.

5. Efficient Attention Mechanism: MLA Attention

Traditional Transformer models, like those introduced in the famous “Attention is All You Need” paper, have attention mechanisms with quadratic complexity, meaning computational cost grows rapidly with longer input sequences. DeepSeek R1 uses Multi-head Latent Attention (MLA), which reduces this cost by working with fewer, compressed latent representations while maintaining accuracy. This helps improve speed and scalability when processing large inputs. Moreover, they once again did it with a low-level, hardware-specific implementation. This approach showed up to a 50% performance boost in attention calculations when it was applied by other AI labs, so the gain is probably comparable here.
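One way to picture both points is a back-of-the-envelope count. The sketch below compares how many numbers have to be cached per sequence when every head stores its own keys and values versus when a single compressed latent vector is stored per token, and shows how the number of attention scores grows with sequence length. All sizes are hypothetical and chosen only for illustration; they are not DeepSeek's real dimensions.

```python
def kv_cache_floats(num_tokens, num_heads, head_dim):
    """Standard attention: one key and one value vector per head per token."""
    return num_tokens * num_heads * head_dim * 2


def latent_cache_floats(num_tokens, latent_dim):
    """Latent variant: one shared compressed vector per token."""
    return num_tokens * latent_dim


def attention_score_count(num_tokens):
    """Every token attends to every token, hence the quadratic growth."""
    return num_tokens * num_tokens


# Hypothetical sizes, for illustration only.
tokens, heads, head_dim, latent_dim = 4096, 32, 128, 512

full = kv_cache_floats(tokens, heads, head_dim)
compressed = latent_cache_floats(tokens, latent_dim)
print(full // compressed)  # how many times smaller the latent cache is

# Doubling the sequence length quadruples the number of attention scores.
print(attention_score_count(2 * tokens) // attention_score_count(tokens))
```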

6. From TRPO to PPO and GRPO: Evolution of Reinforcement Learning

DeepSeek R1 improves training stability by leveraging policy optimization techniques in reinforcement learning. Originally, Trust Region Policy Optimization (TRPO) was used in many RL-based training approaches, but it had limitations: it imposed strict constraints that could slow down learning. The transition to Proximal Policy Optimization (PPO) relaxed these constraints while maintaining stability, making it more efficient for fine-tuning AI models. The main issue with PPO is that it has to keep an additional model that approximates a special value function used to optimise the LLM’s parameters. DeepSeek introduced a novel approach based on PPO, called Group Relative Policy Optimization (GRPO), which completely removes this costly requirement. For more details on this approach, see the last section of this blog post.

7. Self-Learning with Automated Rule-Based Rewards

While it is not really related to the cost of the final training run, or inference costs, one of DeepSeek’s most cost-effective strategies was minimizing human intervention in fine-tuning. Instead of relying heavily on Reinforcement Learning from Human Feedback (RLHF) (which requires expensive human labelers), they introduced a rule-based self-learning system with two types of rewards:

Accuracy Rewards – For tasks with clear right/wrong answers (e.g., math problems, programming challenges), the system automatically evaluates correctness using predefined test cases or expected formats.
Format Rewards – The model was trained to structure its reasoning process clearly by placing intermediate thoughts between <think> and </think> tags, making its responses more interpretable.
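The two reward types can be sketched in a few lines. This is a simplified illustration: it assumes the reasoning is wrapped in `<think>` and `</think>` tags as described in the R1 paper, it checks accuracy by exact string match on the text after the reasoning, and the reward values of 0 and 1 and the function names are assumptions. Real setups run test cases or parse the final answer more carefully.

```python
import re

# Reasoning inside <think>...</think>, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the expected reasoning format, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, expected: str) -> float:
    """1.0 if the final answer matches the expected one exactly, else 0.0."""
    match = THINK_PATTERN.match(response.strip())
    answer = match.group(2) if match else response
    return 1.0 if answer.strip() == expected.strip() else 0.0


def total_reward(response: str, expected: str) -> float:
    """Sum of both rule-based rewards; no human labeler is involved."""
    return accuracy_reward(response, expected) + format_reward(response)


good = "<think>2 + 2 is 4</think> 4"
unformatted = "4"
wrong = "<think>2 + 2 is 5</think> 5"

print(total_reward(good, "4"), total_reward(unformatted, "4"), total_reward(wrong, "4"))
```

A correct and well-formatted answer earns both rewards, while a correct but unformatted answer and a well-formatted but wrong answer each earn only one.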

This automation reduced costs while surprisingly maintaining high-quality learning outcomes. While the idea behind this approach is not novel, the model was able to effectively train itself to reason from the ground up, which had not been properly achieved before. I will focus more on the whole pipeline in the next section.

8. Complicated but efficient dataset generation and R1 training pipelines

In their work, they used the original DeepSeekMath paper as a starting point. In that paper, they took the open Common Crawl repository and expanded their corpus over multiple iterations through a semi-automated approach, using an old-fashioned fastText model to filter and annotate web pages. As a result, they obtained a good reasoning dataset of math and programming problems. Problems of this kind not only involve internal reasoning, but that reasoning can also be validated automatically.

DeepSeekMath showed outstanding performance in math and programming tasks within its weight class. From there, they trained the DeepSeek-R1-Zero model using a prompt and applying the automated rewards you’ve seen in the previous point. Unfortunately, DeepSeek-R1-Zero was mixing languages in its thinking process, so they had to perform extra steps in order to obtain DeepSeek-R1. You can get more technical details in the next section.

This approach excluded both Supervised Fine Tuning (SFT), the process of using a large, specially labelled dataset (in this case, with handcrafted reasoning chains) to train the initial model, and Reinforcement Learning from Human Feedback (RLHF), a long process of running the model again and again and using humans to evaluate its outputs. As you can imagine, both of these processes are quite costly.

9. Potentially Lower Safety Standards?

Some experts speculate that DeepSeek R1 was able to ship faster and more affordably by cutting back on certain safety features. One indicator is that the model sometimes incorrectly identifies itself as “ChatGPT” instead of “DeepSeek,” suggesting that less effort was spent on refining safety guardrails and brand-specific fine-tuning. This makes sense for an open-source model, where users are expected to modify and adapt the AI themselves. The model also has almost no safeguards and produces harmful and discriminatory outputs with ease, so far fewer resources were spent there. But maybe this is even better for some applications: try using OpenAI to automatically translate dubs for any TV show where the main characters swear a lot, and you will get rejected pretty fast. Just to be clear: DeepSeek’s official API still has some extra guardrails incorporated, but most of them are not in the model weights themselves.

Technical breakdown

In this section, we will focus on deeper technical details that will give you a better perspective on some of the innovations and the math behind the scenes, and also provide some extra evidence that both their corpus and their research are novel, contradicting some of OpenAI’s claims.

Data Labelling Pipeline and DeepSeek-R1-Zero

As the foundation for its data labelling, DeepSeek-R1 used the DeepSeekMath corpus, which was constructed from the open Common Crawl dataset. In their paper, they provide this picture of the iterative pipeline.

It starts with an initial seed corpus, the OpenWebMath dataset, a small high-quality math dataset.
Then they trained a simple and lightweight fastText model (from 2016!), using 500k data points from the seed corpus as positive examples and the same number of web pages from Common Crawl as negative ones.
In the next step, they applied this model to find deduplicated URLs (i.e. pages with the same URL prefix were merged into one point) that lead to math-related pages, preserving only the top-ranking ones.
As the initial dataset lacked diversity, their next step was to find “disjoint domains”, i.e. internet resources where some percentage of web pages were math-related.
After finding these domains, they labelled them manually, adding more positive examples to the positive corpus, and the cycle started over again with the new math seed.
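One iteration of this loop can be sketched with a toy stand-in. To stay self-contained, the example replaces the fastText classifier with a trivial word-overlap scorer. The URLs, texts, thresholds, and function names are all made up for illustration, and the manual labelling step is left out.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def train_scorer(positives, negatives):
    """Toy replacement for fastText: score by the share of 'math-like' words."""
    pos = Counter(w for doc in positives for w in doc.lower().split())
    neg = Counter(w for doc in negatives for w in doc.lower().split())
    math_words = {w for w in pos if pos[w] > neg[w]}

    def score(text):
        words = text.lower().split()
        return sum(w in math_words for w in words) / max(len(words), 1)

    return score


def math_heavy_domains(pages, score, page_threshold=0.3, domain_threshold=0.5):
    """Find domains where a large enough share of pages looks math-related."""
    hits, totals = defaultdict(int), defaultdict(int)
    for url, text in pages.items():
        domain = urlparse(url).netloc
        totals[domain] += 1
        hits[domain] += score(text) >= page_threshold
    return {d for d in totals if hits[d] / totals[d] >= domain_threshold}


seed_positives = ["prove the theorem using induction", "solve the integral of x squared"]
negatives = ["celebrity news and gossip today", "best pasta recipe for dinner"]
crawl = {
    "https://mathforum.example/a": "prove that the integral converges",
    "https://mathforum.example/b": "solve the equation using induction",
    "https://news.example/c": "celebrity gossip and dinner news",
}

score = train_scorer(seed_positives, negatives)
domains = math_heavy_domains(crawl, score)
print(domains)
```

In the pipeline described above, pages from the discovered domains would then be reviewed by hand and added to the positive examples before the classifier is retrained for the next round.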

NB. Some of these websites contain tasks from known benchmarks. DeepSeek’s team applied extra filtering to avoid benchmark contamination in their training data, but, as the latest American Invitational Mathematics Examination (AIME) competition showed, although all models saw a notable decline in performance, R1 suffered a far greater drop. This might be a signal that some degree of benchmark contamination remained.

Obtaining DeepSeek-R1

The first model they created was DeepSeek-R1-Zero. Basically, they took DeepSeek-V3 and their math and code dataset and trained the model with this prompt using simple rule-based RL training:

They used the same reward model I showed in point 7 of the previous section.

From that point, they had to transition to R1. Why do we need such a complicated pipeline instead of simply using DeepSeek-R1-Zero once we’ve got it? Unfortunately, this model suffers from both poor readability and mixing of English and Chinese. While tests showed that a single-language restriction reduced benchmark metrics, it was still the preferable way to go, as the main point of this model is to show a proper and understandable reasoning process behind the answer.

Before moving forward just a small reminder: Reinforcement Learning (RL) is a machine learning approach where an agent learns to make decisions by performing actions and receiving feedback in the form of rewards or penalties, aiming to maximize cumulative rewards over time.

It starts with a pre-trained DeepSeek-V3, which is an LLM trained in the standard way, like all other LLMs, but using the optimizations we discussed in the previous section.
Supervised Fine Tuning is performed on this V3 model using a carefully selected small set (several thousand samples) of R1-Zero outputs that were manually validated as high-quality and readable.
The same reasoning self-learning procedure as for R1-Zero is applied, using the math and coding dataset where auto-validation is possible for calculating the Reinforcement Learning rewards.
Rejection sampling is applied. For all the generated samples obtained in the third step, DeepSeek-V3 is used as an external expert that decides which samples should be kept. This helps to generate more reasoning chains across more general-purpose domains.
Reinforcement-learning-based training is applied once again. At this stage, rule-based rewards are used in areas where that is possible (like math), and LLM validation is used for the others.

Deep dive into the TRPO → PPO → GRPO transition

While TRPO and PPO were already known in the RL domain, GRPO is new: DeepSeek first proposed it in the DeepSeekMath paper and then applied it in DeepSeek-R1. Let’s start from the beginning to understand how it works.

In Reinforcement Learning, you usually have some Actor A and some Environment E. E gives you an observation (in this case, a question q), and A gives an output (in this case, a direct answer or a chain-of-thought answer, depending on the model). The last element of the schema is the reward that E gives to A depending on the answer quality.

In RL, this actor internally has a neural network (an LLM in our case). In mathematical terms, we can call it a policy $\pi_{\Theta}(\text{obs})$, where $\Theta$ represents the tunable parameters of the LLM. The output can then be denoted as $o = \text{LLM}(q, \Theta)$. The task is to fine-tune the LLM’s parameters to get the most reward.

The main issue is that, in order to tune the LLM, you need some loss function $L(o, \bar{o})$, where $\bar{o}$ is the correct answer. Using the loss function, you can then calculate gradients and update the model parameters. In our problem statement, we do not have correct answers, as most of the data is unlabelled. So instead we perform the following trick.

We perform an action and assume that this action was correct. In this case, the loss is $L = -\log(\pi(\text{obs})) \cdot \text{reward}$. We then calculate a gradient and perform gradient descent as usual; the reward here determines how big the step should be, which would normally be based on a known correct answer. As we have no way to calculate it directly in our case, we introduce a new function, the advantage: $A = r - b$, where $r$ is the post-action reward and $b$ is a baseline.

The reward $r(\text{obs}, \text{act})$ is calculated via (1) some external reward estimation, like a compiler with tests in the case of code, (2) some direct internal validation via unsupervised or rule-based metrics, or (3) an LLM-as-a-judge setting, where you use an external LLM or even train one in parallel with this one. DeepSeek went with the direct approach described in point 7 of the previous section.

The baseline $b$ is calculated via a value function, a regression pre-trained on the labelled data you have, which answers the question “what will the average reward be for an action from a given state?”.

Then loss can be written as

\begin{matrix} L = -\log(\pi(\text{obs})) \cdot A = -\log(\pi(\text{obs})) \cdot (r - b), \quad (1) \end{matrix}

where the advantage $A > 0$ when the action we performed is better than the average expected one, and $A < 0$ otherwise.
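The sign behaviour of this loss is easy to check numerically. The sketch below is a direct transcription of the formula for a single action; the probability, reward, and baseline values are made up for illustration.

```python
import math


def advantage_loss(action_prob, reward, baseline):
    """Loss -log(pi) * (r - b) for one action taken with probability action_prob."""
    advantage = reward - baseline
    return -math.log(action_prob) * advantage


# The same action, once rewarded above the baseline and once below it.
better = advantage_loss(0.5, reward=1.0, baseline=0.4)
worse = advantage_loss(0.5, reward=0.0, baseline=0.4)
print(better, worse)
```

With a positive advantage, minimizing the loss pushes the probability of the action up; with a negative advantage, the sign flips and minimizing the loss pushes the probability down.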

TRPO (Trust Region Policy Optimization) works the following way. You have a gradient, but you assume that it is dangerous to trust your gradient too much, as it was produced by a random stochastic process (by working with concrete data samples). To account for that, you modify your original loss by adding the KL-divergence, which basically says how different two distributions are:

\begin{matrix} L_{\text{new}} = L_{\text{old}} + D_{KL}(\pi, \hat{\pi}). \quad (2) \end{matrix}

Basically, you measure how different your new policy is from the previous one and apply an extra penalty for that, forcing gradient descent not to move too far away from the policy you had, which adds extra stability to the optimization process. Unfortunately, TRPO is computationally intensive: to perform this estimation, you need to calculate extra derivatives, make second-order approximations, evaluate the landscape, and perform an extra line search. So PPO was developed as an approximation instead.
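The effect of the penalty can be sketched for small discrete policies. The example below computes the KL-divergence between a new and an old policy and adds it to a base loss. The argument order of the divergence, the `weight` parameter, and the example distributions are assumptions for illustration; real TRPO enforces the constraint in a more involved way.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def penalized_loss(base_loss, new_policy, old_policy, weight=1.0):
    """Base loss plus a penalty for moving away from the old policy."""
    return base_loss + weight * kl_divergence(new_policy, old_policy)


old_policy = [0.5, 0.3, 0.2]
small_step = [0.45, 0.35, 0.2]  # stays close to the old policy
large_step = [0.05, 0.05, 0.9]  # moves far away from it

print(penalized_loss(1.0, small_step, old_policy))
print(penalized_loss(1.0, large_step, old_policy))
```

The update that stays close to the old policy is barely penalized, while the radical update pays a much larger penalty.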

PPO (Proximal Policy Optimization) works with the following, more complicated formula. Let’s break it down:

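\begin{matrix} J_{PPO}(\Theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\Theta_{\text{old}}}(O \mid q)} \frac{1}{|o|} \sum_{t=1}^{|o|} \min\left[ \frac{\pi_{\Theta}(o_{t} \mid q, o_{<t})}{\pi_{\Theta_{\text{old}}}(o_{t} \mid q, o_{<t})} A_{t},\ \text{clip}\left( \frac{\pi_{\Theta}(o_{t} \mid q, o_{<t})}{\pi_{\Theta_{\text{old}}}(o_{t} \mid q, o_{<t})}, 1 - \varepsilon, 1 + \varepsilon \right) A_{t} \right] \quad (3) \end{matrix}

Here $\varepsilon$ is the clipping range hyperparameter and $A_{t}$ is the advantage.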

At its core, it once again has the policy multiplied by the advantage, $\pi_{\Theta}(o_{t} \mid q, o_{<t}) \cdot A_{t}$. The $\pi_{\Theta_{\text{old}}}$ in formula (3) is the instantiated model that produced the output $o$ for the question $q$, so it is just a number that can be computed directly from the current instance. The $\pi_{\Theta}$ is the model you are optimizing gradients for, which will be used as the old one at the next step. As a result, the whole equation is similar to (1), but with different pre-calculated constants and a slightly different form. The main idea is that this ratio helps to correct the sampling bias: outputs that are rare under the old policy are undersampled by it, so they must be weighted more strongly.

But as we find more radical new policies, the first part increases drastically and moves the new policy too far away, at which point the second term under the $\min$ comes into play. This $\text{clip}$ function basically imitates a simple cutting rule that works the same way as in TRPO, but without the complicated calculations.
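The clipping rule is short enough to write out for a single token. The sketch below follows the standard PPO clipped surrogate. The default `epsilon=0.2` is a commonly used value rather than something stated in this post, and the probabilities are made up.

```python
def clipped_surrogate(new_prob, old_prob, advantage, epsilon=0.2):
    """PPO objective for one token: the smaller of the raw and clipped terms."""
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# A modest policy change is left untouched.
print(clipped_surrogate(0.22, 0.2, advantage=1.0))

# A radical change with positive advantage is capped at (1 + epsilon) * advantage.
print(clipped_surrogate(0.6, 0.2, advantage=1.0))

# With negative advantage the min keeps the more pessimistic, unclipped term.
print(clipped_surrogate(0.6, 0.2, advantage=-1.0))
```

Because the objective is capped once the ratio leaves the range from 1 - epsilon to 1 + epsilon, there is no extra gain from pushing the policy further in a single update, which is what keeps the steps small.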

Next, they applied the same mechanism that was shown in (2), but instead of using heavy calculations to obtain the KL-divergence, they used a term that is similar in spirit:

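\begin{matrix} D_{KL}\left[\pi_{\Theta} \,\|\, \pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\Theta}(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\Theta}(o_{i,t} \mid q, o_{i,<t})} - 1 \quad (4) \end{matrix}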

The $\pi_{\text{ref}}$ here is the reference model, which is usually the initial SFT model they had at the start of the whole optimization process. The main idea is that while we want to perform RL optimization, we still assume that the initial model already had a reasonably good representation of the world, and we do not want to move too far away from that. Although this is not exactly the same as the KL-divergence used in TRPO, it still gives similar results.
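A lightweight per-token estimator of this kind, and the one given in the DeepSeekMath paper, is the ratio of the reference probability to the policy probability, minus its logarithm, minus one. The sketch below evaluates it for a single token; the function name and the probabilities are made up for illustration.

```python
import math


def kl_estimate(ref_prob, policy_prob):
    """Per-token estimate r - log(r) - 1 with r = ref_prob / policy_prob."""
    ratio = ref_prob / policy_prob
    return ratio - math.log(ratio) - 1.0


print(kl_estimate(0.3, 0.3))  # identical models: no penalty
print(kl_estimate(0.1, 0.4))  # policy drifted above the reference
print(kl_estimate(0.4, 0.1))  # policy drifted below the reference
```

The estimate needs only the two probabilities of the sampled token, is zero when the models agree, and is positive whenever they differ, so it can act as a penalty without any sum over the whole vocabulary.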

Next, they decided to move from the direct advantage to a more computationally efficient approach. They formed groups of outputs by sampling directly from the old policy $\pi_{\Theta_{\text{old}}}$ and optimized the policy model $\pi_{\Theta}$ by maximizing this objective:

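\begin{matrix} J_{GRPO}(\Theta) = \mathbb{E}_{q \sim P(Q),\, \{o_{i}\}_{i=1}^{G} \sim \pi_{\Theta_{\text{old}}}(O \mid q)} \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_{i}|} \sum_{t=1}^{|o_{i}|} \left\{ \min\left[ \frac{\pi_{\Theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\Theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_{\Theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\Theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1 - \varepsilon, 1 + \varepsilon \right) \hat{A}_{i,t} \right] - \beta D_{KL}\left[\pi_{\Theta} \,\|\, \pi_{\text{ref}}\right] \right\} \quad (5) \end{matrix}

Here $G$ is the number of outputs sampled per question and $\beta$ is the weight of the KL term.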

It is the same as before in (3), but with (4) added at the end as a penalty, group sampling instead of single-output sampling, and a new advantage function $\hat{A}_{i,t}$ calculated from the relative rewards of the outputs inside each group only. This group-relative advantage calculation makes it possible to drop the value model, which usually requires running a second LLM instance, and thus saves a lot of computational resources.
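The group-relative advantage can be sketched as a simple normalization of rewards within one group: each reward minus the group mean, divided by the group standard deviation, as in the DeepSeekMath paper. The use of the population standard deviation, the handling of the all-equal case, and the example rewards are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Normalize each reward against the mean and spread of its own group."""
    avg = mean(rewards)
    spread = pstdev(rewards)
    if spread == 0:
        # All outputs scored the same, so none is better than the others.
        return [0.0 for _ in rewards]
    return [(r - avg) / spread for r in rewards]


# Rule-based rewards for four sampled answers to the same question.
rewards = [2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

The group mean plays the role of the baseline that the value model used to provide, so the only extra cost is sampling several outputs per question.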

If you want some extra comments on how and why it works, you can read the original R1 paper. The main idea of the benefits is shown in this image of the transition from PPO to GRPO.

To sum up, as a result of this transition:

The value model, which was computationally heavy, is excluded.
Instead of estimating the average possible reward with a value model, we just sample several outputs, evaluate them with the reward model, and use the average reward as our “value” in the older terms.
For the reward model, we use the simple rule-based system described in point 7 of the previous section. This also reduces computational costs.
The heavy KL-divergence calculation is replaced with a lightweight approximation.

As a result, we only need to do some extra sampling, apply a lightweight reward model to get the average reward, and apply the same procedure as in PPO with some extra tweaks, while nearly halving all the other, much heavier calculations.

Outro

Hopefully, this blog post gave you a better understanding of the foundations of the DeepSeek innovations. I tried to collect everything you need to know about it and tackle every rumour we have had so far. You can share it with anyone who makes provocative and unsupported claims about various aspects of the R1 model. Hopefully, this helps.

Sources

More detailed analyses in visual form:

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by Yannic Kilcher https://youtu.be/bAWV_yrqx4w
DeepSeek-R1 Paper Explaned — A New RL LLMs Era in AI? by AI Papers Academy https://www.youtube.com/watch?v=DCqqCLlsIBU
DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate, and AI Megaclusters | Lex Fridman Podcast 459 https://www.youtube.com/watch?v=_1f-o0nqpEI

For the advanced readers:

Proximal Policy Optimization Algorithms https://arxiv.org/abs/1707.06347
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models https://arxiv.org/abs/2402.03300
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948

Serokell Blog

Rust, C++, and the Tradeoffs Behind Safe Low-Level Code: interview with Nikita Lisitsa

Rust advertises its “if it compiles – it works” stance and its user-friendly/educational compiler messages. Should we start learning Rust as a first step to C++?

Rust’s compiler and safety guarantees allow “learning by trial and error” in multi-threaded synchronization. Should we use Rust as training wheels while learning parallel programming?

Generalising those two questions – is C++ still the best language to get into system programming?

C++ keeps accumulating features — modules, coroutines, reflection, contracts. Do you think the language is becoming too complex to use safely and teachably, or is that complexity justified?

Do you think C++ is a good choice for agentic development?

With all this AI stuff going on, shouldn’t memory safety/correctness be the first priority of language design?

In game development/system programming, it is considered a good practice to avoid allocations at all costs. Did Rust people get it all wrong with all the fuzz around ownership?

C++ is still the language of choice for game engine development. What does Rust miss, and will it ever be able to compete with C++?

The 3-year release cycle — is it the right cadence? Too fast, too slow, or does the cadence even matter given how long adoption takes?

How do you see C++ and Rust coexisting over the next decade? Competition, gradual replacement in some domains, or peaceful coexistence?

The US government and NSA have recommended moving away from C/C++ toward memory-safe languages. Does that concern you, and how do you think the C++ community should respond?

Serokell’s Work on GHC: Dependent Types, Part 5

Summary

Visible forall in GADTs

Namespace-specified imports

Type instances in kind checking

Progress on unifying HsType and HsExpr

The star kind syntax in required type arguments

Pun detection in required type arguments

New type families: Tuple, Constraints, Tuple#, Sum#

Rework of name resolution for built-in and punned names

Previous updates on dependent types

Conclusion

The Hidden Perils of MonadBaseControl

A quick refresher

Discarded state

Threading state

Brick walls

concurrently

bracket

Conclusion

Rust in Production: JetBrains

For someone who uses Zed/Neovim with rust-analyzer, what difference would they feel with RustRover? Is the proprietary JB engine better than the tools Rust provides, and what’s the story behind choosing a proprietary engine instead of a community-driven analyzer?

JetBrains actively supports the Rust Foundation. What motivated this decision, and what kind of value does JetBrains expect to get back?

Beyond the Hype: Crossing the GenAI Divide in Real-World Business

High Adoption, Low Transformation

Why Pilots Stall: The Learning Gap

The Shadow AI Economy: What Actually Works

Builders vs. Buyers: Two Ways Across the Divide

The ROI Nobody Brags About: Back-Office Wins

What Comes Next: From Agents to the Agentic Web

A Practical Way Forward

Design Patterns for Long-Term Memory in LLM-Powered Architectures

System I: MemGPT — The Operating System Paradigm

Core Architecture: Virtual Context Management

Memory Formation: The Self-Managed Write-Back Cycle

Tooling and Technology Stack

System II: OpenAI Memory Management

Core Architecture: Hybrid Fact + Semantic Storage

Memory Formation

Technology Stack (Presumably)

System III: Claude Memory Management

Core Architecture: Project Summaries and File-Scoped Context

Memory Formation: Mostly Curated by the User

Technology Stack

System IV: AI Toolkits Memory Management

Core Architecture: Composable and Protocol-Oriented

Memory Formation: Your Design

Technology Stack

Final thoughts on the current state

The Real Limits of AI Agents in 2025

Rumors and Speculations Breakdown

Engineering Reality Breakdown

1. The Mathematics Behind Failure

2. Token Economics Nobody Mentions

3. The Tool Design Wall

Integration Breakdown

What Actually Works

Predictions for the end of 2025

Building the Right Way

Outro

Reviving an Old iMac with NixOS

Goals

Breathing new life into the old iMac

Problems we have to solve:

Creating custom minimal installation image

Creating image with nix flakes
