
Reading list round up II

· 7 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

How to write complex software

Grant Slatton's article provides an overview of his process for implementing complex software systems1. Two principles particularly stand out:

First, understanding your system's performance envelope is key to assessing different implementation options. Writing disposable toy programs to identify the hardware/software limits that apply in the most optimistic case (i.e. with little to no overhead) can help sense-check key design decisions, ensuring they are fit for purpose before committing significant resources.
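As a concrete (and entirely hypothetical) illustration of such a toy program, the few lines of Python below measure how many fsync'd appends the local machine can sustain, which bounds the throughput of any design that promises durability per record:

```python
import os
import time

# Disposable toy program: rough limit for durable appends on this machine.
# Meant only to sense-check a design, not to be a rigorous benchmark.
RECORD = b"x" * 256
N = 2_000

fd = os.open("toy_wal.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
start = time.perf_counter()
for _ in range(N):
    os.write(fd, RECORD)
    os.fsync(fd)  # force durability on every record: the most pessimistic case
elapsed = time.perf_counter() - start
os.close(fd)
os.remove("toy_wal.bin")

print(f"{N / elapsed:,.0f} durable appends/s ({elapsed / N * 1e3:.2f} ms each)")
```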

Second, working top-to-bottom. Traditional software development often starts by implementing lower-level dependencies first - like building a bridge from both riverbanks simultaneously. However, this approach frequently leads to misaligned interfaces, kludges and accidental complexity in part due to the lack of a good feedback loop as different stakeholders are working in parallel without the means to view how the system works as a whole. Delivery pressures and the sunk cost fallacy make it very tempting for teams to accept this state of affairs, apply band-aids and move on. Instead, designing from the top down allows teams to craft interfaces that support readable code and sensible abstractions. This requires more imagination and a sense of what good looks like (as you will be writing code for dependencies that don't exist yet), but it provides faster feedback on component fit and enables teams to demonstrate working software earlier, even with stubbed components.
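A minimal sketch of what top-down development with stubs can look like (my example, not the article's): the top-level flow is written first against an interface that has no real implementation yet, so the dependency's shape is dictated by the caller's needs, and the system can be demoed with a canned stand-in.

```python
from typing import Protocol


class RateStore(Protocol):
    """Interface shaped by what the caller needs; the real backend comes later."""

    def get_rate(self, base: str, quote: str) -> float: ...


class StubRateStore:
    """Throwaway stand-in so the top-level flow can run and be demoed today."""

    def get_rate(self, base: str, quote: str) -> float:
        return 1.0 if base == quote else 0.9  # canned value, obviously fake


def convert(amount: float, base: str, quote: str, store: RateStore) -> float:
    # Top-level logic written first; it defines what the dependency must offer.
    return amount * store.get_rate(base, quote)


print(convert(100.0, "EUR", "USD", StubRateStore()))  # runs before the real store exists
```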

Nobody Gets Fired for Picking JSON, but Maybe They Should?

JSON is an incredibly successful data interchange format and is often adopted without further consideration. Unsurprisingly, despite its simplicity and human-friendliness it is far from perfect. Nobody Gets Fired for Picking JSON, but Maybe They Should? is a great breakdown of the various problems that plague JSON. In particular:

  • Numbers: Decimal number encoding is not fully defined in RFC 8259, leaving precision and range up to each implementation. Because of how decimal numbers are represented in computers, values may not be exact, so issues like rounding become important - and with JSON you get whatever behavior the implementation chooses, or worse, some other program changes the CPU rounding mode using fesetround! The behavior around representing infinity or NaN (Not a Number) is also just weird and wonderful.

  • Data loss on large integers: 64-bit integers can represent larger whole numbers exactly than 64-bit floating point numbers can. Since every number in JSON is just a decimal number, and many parsers store numbers as 64-bit floats, integers above 2^53 are at risk of silent data loss. Even if your application doesn't use very large numbers, if you are not careful when modeling your data and store, for example, barcodes as a numeric type, you are at risk of data loss (see the snippet after this list).

  • Strings: Generally okay, and the whole JSON document should be encoded in UTF-8 (this is a great start). However, the standard still permits unpaired surrogate code points, which can lead to some strange artifacts when encoding/decoding.

  • Binary data: If you need to transmit binary data, it has to be wrapped in a base64-encoded string, which adds roughly a third in size plus encoding/decoding overhead.

  • Streaming is not supported: Nothing else to say about this, you either get the full JSON document or you don't.

  • Canonicalization woes: JSON does not care about whitespace or field ordering. Digital signatures, however, operate on byte blobs and are therefore sensitive to exactly these things, which is why RFC 8785 defines a JSON Canonicalization Scheme. It reuses the ECMA-262 (JavaScript ES6+) serialization rules, which introduces its own subtle issues when dealing with strings and numbers (e.g. unpaired surrogate code points are not supported).
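To make the large-integer point above concrete, here is a small Python sketch (mine, not the article's). Python's own json module happens to preserve big integers, so parse_int=float is used purely to mimic what a float-only parser such as JavaScript's JSON.parse does:

```python
import json

barcode = 9007199254740993              # 2**53 + 1, a perfectly valid integer
payload = json.dumps({"barcode": barcode})

# Simulate a consumer that stores every JSON number as a 64-bit float
# (JavaScript's JSON.parse, many dynamic languages, some databases).
lossy = json.loads(payload, parse_int=float)

print(payload)                # {"barcode": 9007199254740993}
print(int(lossy["barcode"]))  # 9007199254740992 -- silently off by one
```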

Generative AI – The Power and the Glory

AI continues to capture headlines and funding. Michael Liebreich's analysis of how energy has become a crucial limiting factor for the tech industry's plans for AI is a very interesting read.

As of today data centers account for less than 2% of US power consumption; however, generative AI has quite an energy appetite: a Google search takes about 0.3Wh while a ChatGPT query takes about 2.9Wh, a full order of magnitude more! With massive investments flowing into AI, managing this growing energy demand alongside existing infrastructure needs requires planning and collaboration between all parties2.

Given the boom-bust nature of the tech sector, forecasting power needs is a challenging exercise. Historical forecasts have often proved wildly inaccurate, reflecting inflated market sentiment (as seen during the crypto boom) and failing to account for actual demand, technological advances, or economies of scale. While hundreds of billions of dollars are being deployed to build new data centers, a crucial question remains: Is there sufficient demand to justify these capital expenditures?

According to the analysis, it would require $600 billion of annual revenue (and counting) to turn a profit on all the capital expenses in the pipeline. Actual adoption is nowhere near those numbers despite the hype. And while everyone is happy to play with these tools as long as they are free, convincing individuals and businesses to commit to recurring expenses is an entirely different matter, especially in a context where money is no longer cheap - Liebreich does a very good job of highlighting this.

Underlying a lot of forecasts is the assumption that transformer based systems like ChatGPT will continue scaling up following the trend of the past two years, and therefore chips and energy will be the bottlenecks. There are warning signs that this may not be the case:

  • The available stock of public text data for training large language models may be approaching exhaustion3. Other modalities like video are also available, but whether exploiting them is economically and technically viable remains an open question.
  • Hardware efficiency for AI workloads keeps improving: between existing architectures getting better and entirely new classes of custom-designed chips/co-processors4 emerging, it will be possible to run increasingly advanced AI workloads at lower energy cost.
  • Technology is increasingly political (especially given the erratic behavior of certain figures in tech). This means that there may be a market and incentives for privacy friendly AI models that run locally on devices without sharing information with the outside world.
  • Regulatory oversight could limit the tech industry's more ambitious expansion plans.

Overall it is important to remember that when you are in the thick of it, it's hard to tell the difference between a sigmoid and an exponential curve. However, if a system requires the output of a nuclear plant to work, and in many ways still underperforms the human brain (which operates on just 20W), then the situation warrants a healthy dose of scepticism.


Footnotes

  1. There is no hard and fast rule that defines complex software, but I would define it along three dimensions: it is fairly large (i.e. it does not fit easily in one's head), it is connected to other systems/processes in an organization, and it deals with a non-trivial amount of load.

  2. The article cites examples of how actors in this space are adapting to this reality: Microsoft data centers in Wyoming share their backup generators with the rest of the grid in exchange for better energy prices, and hyperscalers move compute-intensive tasks such as training new models to data centers with access to plentiful, cheap (and hopefully green) energy, while using data centers closer to densely populated areas (which also have more demand) for lower-latency, less compute-intensive tasks such as inference.

  3. Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data

  4. TPU - Tensor Processing Units

Reading list round up I

· 5 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

In this new series I am going to start capturing my notes and reflections on articles that I find interesting/valuable for future reference.

LLMs won't save us

"Safety and predictability often go hand-in-hand, and I fear that in the rush to destaff unfashionable things, we will sacrifice predictability in the expectation of safety, and receive neither."

LLMs won't save us by Niall Murphy presents an interesting look at the role of LLMs in the context of SRE/DevOps, particularly in incident management.

Despite the immense power of LLMs, there are significant challenges to their adoption in the operations space, which stem from the nature of the technology and the incentives at play (e.g. reducing headcount). First, LLMs are probabilistic in nature, and their behavior is therefore unpredictable. This is compounded by the significant risk of drift between how models behave during testing/evaluation and in production1 (techniques to keep this in check, like evals and using other LLMs as judges, are still incipient).

Second, humans provide essential adaptive capacity in operational systems. While LLMs may handle the most common and trivial incidents just fine, this comes with important second-order consequences: it may deprive human operators of the chance to build up critical skills, degrading the human expertise crucial to handling severe incidents. The rich body of work on the negative effects of automation2, and on the "bumpy transfer of control" that happens at the worst possible time, substantiates this risk.
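On the evals/LLM-as-judge point above, the sketch below shows the basic shape of such a loop. Everything in it is hypothetical (call_model is a placeholder, and the judge is a trivial keyword check rather than a second model): run the model over a fixed set of cases, grade the answers, and track the pass rate over time to catch drift.

```python
# Hypothetical eval harness; call_model() is a stand-in, not a real API.
CASES = [
    {"prompt": "Summarize last night's incident in one sentence.",
     "must_mention": "replication lag"},
    {"prompt": "Is it safe to fail over to region B right now?",
     "must_mention": "replication lag"},
]

def call_model(prompt: str) -> str:
    # Placeholder answer; wire this up to the model/provider of your choice.
    return "Failover is risky while replication lag on the primary is above 30s."

def judge(answer: str, must_mention: str) -> bool:
    # Simplest possible judge: a keyword check. In practice this is often a
    # second LLM prompted with a grading rubric ("LLM as judge").
    return must_mention.lower() in answer.lower()

def run_evals() -> float:
    passed = sum(judge(call_model(c["prompt"]), c["must_mention"]) for c in CASES)
    return passed / len(CASES)

# Re-run on every model or prompt change and compare against a baseline pass
# rate to catch drift between evaluation-time and production behavior.
print(f"pass rate: {run_evals():.0%}")
```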

The limits of data

"Data is supposed to be consistent and stable across contexts. The methodology of data requires leaving out some of our more sensitive and dynamic ways of understanding the world in order to achieve that stability."

The limits of data by C. Thi Nguyen examines the inherent limitations of quantitative data. While data's power lies in its aggregation and portability – allowing information to be understood across different contexts – this very strength comes at the expense of context. The article highlights several critical considerations when working with large datasets:

First, the availability of easy-to-collect metrics shapes how goals are formulated, but "the map is not the territory" and this may lead to negative outcomes3.

Second, the broader the audience for a dataset, the more context is lost. This may lead to what essentially amounts to pernicious, even if well-intentioned, metrics (e.g. using ticket sales - which everyone understands - as a metric for determining arts funding).

Third, how data is collected and classified may be biased. Therefore the idea that quantitative data is an immaculate objective view of the world is flawed. Working with data at scale requires decisions about relevance and exclusion, choices that become invisible once embedded in taxonomies and methodologies.

Fourth, metrics can become detached from their original purposes, and can be internalized by individuals leading to "value capture" – where the metric itself becomes the goal rather than what it was meant to measure (e.g. citation rates vs actual understanding in academia).

Therefore when working with large quantitative datasets, there are a few things to be mindful of:

  • Who collected the data, and how?
  • Who created the system of categories into which the data is sorted? What information does that system emphasize, and what does it leave out?
  • Whose interests are served by that filtration system?
  • Not everything is tractable as quantitative data. The world is messy and context matters. Quantitative data, and large datasets in particular, are inherently limited in this regard, so caveat emptor.

“Founder Mode” and the Art of Mythmaking

“Founder Mode” and the Art of Mythmaking by Charity Majors dissects the Founder Mode talk, and does an excellent job of capturing some of the valuable insights that would otherwise be buried under a ton of founder mythologization (there is some frankly cringe-worthy stuff in the original talk by Chesky).

Airbnb, like many other companies, is now reckoning with the consequences of zero interest rate policy era practices. The incentives to massively scale up operations in an environment of quasi-unrestricted resource allocation (AKA throw-money-and-bodies-at-a-problem) led to organizational dysfunctions. In the new economic reality operational efficiency and profitability are becoming the name of the game.

The main take-aways: running an efficient organization, with the right number of people, is incredibly valuable. A leaner organization needs less alignment and fewer meetings, and leads to flatter organizational structures, less politics and less empire building. In addition, having managers who are subject matter experts and manage through the work just makes sense, and I am happy to see this becoming a mainstream opinion.

As for some of the other takes in the original material, let's just say that I am very skeptical about them...


Footnotes

  1. There is some indication that LLMs exhibit different behavior during training compared to production: Alignment Faking in Large Language Models

  2. The Ironies of Automation

  3. The example of Nike's post-Pandemic digital transformation strategy comes to mind.

Reliability, robustness and resilience

· 3 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

"No system can escape the constraints of finite resources and changing conditions" - David D. Woods

The topic of failure, its nature, consequences and especially how individuals and organizations think about it has fascinated me for a long time. As we approach Black Friday, Cyber Monday and Christmas - a period when many organizations face increased load and stress on their systems - and with my upcoming discussion of the classic 'How Complex Systems Fail'1 treatise (link to the event here), I thought it was a good time to revisit the concepts of reliability, robustness and resilience.

These terms are often used interchangeably; however, as you probably guessed, they mean very different things, and having a clear understanding of them helps in reasoning about the failure modes and stressors that affect the systems we build and operate2.

Reliability is the probability of a system performing adequately for a given period of time when it is used under the specified operating conditions. A classic example of this is the failure rate of hard disks over a period in well defined operating conditions.

Robustness is the ability of a system to continue to operate despite failures of some of its subcomponents or parts. Staying in the world of storage, creating redundant copies of the information on different disks provides a degree of robustness in the event of one or more disk failures.
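As a back-of-the-envelope illustration of the first two properties (my numbers and assumptions, not from any particular source): assuming independent disk failures with a constant failure rate, reliability over a period is exponential, and mirroring buys robustness because data is lost only if every copy fails.

```python
import math

AFR = 0.015    # assumed annualized failure rate of a single disk (1.5%)
YEARS = 5
rate = -math.log(1 - AFR)   # constant hazard rate implied by that AFR

# Reliability: probability a single disk survives the whole period.
r_single = math.exp(-rate * YEARS)

# Robustness via redundancy: a 3-way mirror loses the data only if all copies
# fail (ignoring rebuild windows and correlated failures, which matter a lot
# in practice).
r_mirror = 1 - (1 - r_single) ** 3

print(f"single disk over {YEARS} years: {r_single:.3%}")
print(f"3-way mirror over {YEARS} years: {r_mirror:.5%}")
```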

Resilience is a system's ability to adapt to disruptions stemming from fundamental surprises (black swan events, key assumptions that no longer hold true, etc.). The experience during the COVID-19 pandemic is a good example of resilience, with many sectors of society adapting to novel circumstances by leveraging technology to enable remote work and digital service delivery.

A keen reader has probably observed that reliability, robustness and resilience are increasingly difficult properties to achieve: from ensuring that individual components or processes behave according to specification under well-known conditions, to tolerating known failure modes of sub-components/processes, all the way to adapting to unforeseen events. Resilience in particular is not a property of technical components alone; system operators play a key role in providing adaptive capacity in situations of crisis.

"Everything Fails All the Time" - Werner Vogels

The deployment of increasingly ambitious systems in the context of a hyper connected world facing multiple stressors (economic pressure, geo-politics, climate) makes it valuable to be able to reason about failure using more precise terminology. A better understanding of failure is also humbling: no matter how good technical solutions are, they operate in a messy environment and failure is almost guaranteed. Allowing a system to "extend its capacity to adapt when surprise events challenge its boundaries"3, i.e. Graceful Extensibility, requires investments to build up adaptive capacity (e.g. build anticipatory capacity through game days4 and fault injection5, up-skill personnel), but is increasingly important in our contemporary world.
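Fault injection can start very small. As a hedged sketch (not tied to any particular chaos-engineering tool), a wrapper that makes a configurable fraction of calls fail on purpose already lets a team rehearse how the surrounding code, alerts and operators react:

```python
import random
from functools import wraps

def inject_faults(failure_rate=0.1, exc=TimeoutError):
    """Wrap a dependency call so that a fraction of invocations fail on purpose."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise exc(f"injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.2)
def fetch_inventory(sku: str) -> int:
    return 42  # stand-in for a real downstream call

# Exercised during a game day: do retries, fallbacks and humans behave as hoped?
for _ in range(5):
    try:
        fetch_inventory("SKU-123")
    except TimeoutError as err:
        print("handled:", err)
```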


Footnotes

  1. How complex systems fail

  2. I'm obviously retreading old ground here, and if this topic interests you, Lorin Hochstein's excellent Github page offers a wealth of great resources to get you started. I cannot recommend it enough.

  3. Resilience as Graceful Extensibility to Overcome Brittleness

  4. "A game day simulates a failure or event to test systems, processes and team responses" - AWS Well Architected Framework

  5. This goes into the topic of Chaos Engineering. You can find a good description of what it is here

Revisiting the C compilation pipeline

· 17 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Over the past few years my interests have taken a turn towards system programming and database internals.

In that context languages like C, Rust or Zig make a lot of sense. As such I am going to start a more focused effort to refresh my memory of C and its toolchain. In my day job I currently use Python, Java and occasionally TypeScript, so I have been out of the systems languages game for a while. Time to fix that gap!

Macro beats micro

· 5 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Science, engineering and to some extent management are permeated by the ideas of Cartesian reductionism. Treating phenomena as machines that can be decomposed into their constituent parts laid the foundation of our modern world. Once analyzed and understood, assembling the constituents back together should yield a complete understanding of the phenomenon. This is both enticing and arguably successful, however the approach struggles when faced with emergent properties and behaviors of systems.

Innovation under the radar

· 4 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

These days LLMs are capturing the lion's share of media, investor and corporate attention. Judging by the headlines and the gargantuan amounts of funding being mobilized, one would almost think that the tech sector completely pivoted to this technology.

Tailscale and Docker networking

· 5 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

After a well deserved rest for the past few weeks I'm back to my normal routine. One of the ideas I've been toying around with (more on that at some point in the future) benefits from remotely accessing resources running on your local network. If you grew up in the 90s or early 00s you probably remember setting up a NAT in your router to forward certain ports so you could play your favorite game with your friends. Tailscale[^1] has been on my radar for a while so I decided to take it for a spin.

Unicode audio analyzer

· 4 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

What do unicode, audio processing and admittedly bad early 2000s Internet memes have to do with one another?

In the previous post in the deep dive into unicode series we explored how combining characters like diacritics work. One interesting property of unicode is that it is possible to combine multiple combining characters together.
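For instance (a quick Python illustration of that stacking property, not taken from the original post):

```python
# A base character followed by several combining marks renders as one glyph.
base = "e"
marks = "\u0301\u0308\u0361"  # combining acute, diaeresis, double inverted breve
print(base + marks)           # one visually "stacked" character on screen
print(len(base + marks))      # ...but 4 code points under the hood
```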

Notes on "Programming as theory building"

· 7 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Programming as Theory Building[^1] is an almost 40 year-old paper that remains relevant to this day. In it, the author (Peter Naur, Turing Award winner and the "N" in BNF[^2]) dives into the fundamental question of what is programming, and builds up from that to answer the question about what expectations can one have, if any, on the modification and adaptation of software systems.

A deep dive into unicode and string matching - II

· 8 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

In the previous entry of this series I went through a lightning tour of what is Unicode and provided some details into the various encodings that are part of the standard (UTF-8/16/32). This serves as the baseline knowledge for further exploration of how Unicode strings work, and some of the interesting problems that arise in this space.