The Prospect of an AI Winter

2023-03-27 • 20 min read • comment via LW, EAF


Summary #

The Prospect of a New AI Winter #

What does a speculative bubble look like from the inside? Trick question – you don’t see it.

Or, I suppose some people do see it. One or two may even be right, and some of the others are still worth listening to. William Eden, tweeting out a long thread explaining why he’s not worried about risks from advanced AI, is one example – of which kind, I don’t know. In support of his thesis that another AI winter is looming, he makes the following points:

  1. AI systems aren’t that good. In particular (argues Eden), they are too unreliable and too inscrutable. It’s far harder to achieve three or four nines reliability than merely one or two nines; as an example, autonomous vehicles have been arriving for over a decade. The kinds of things you can do with low reliability don’t capture most of the value.
  2. AI systems won’t get that much better. Some people think we can scale up current architectures to AGI. But, Eden says, we may not have enough compute to get there. Moore’s law is “looking weaker and weaker”, and price-performance is no longer falling exponentially. We’ll most likely not get “more than another 2 orders of magnitude” of compute available globally, and 2 orders of magnitude probably won’t get us to TAI.[3] “Without some major changes (new architecture/paradigm?) this looks played out.” Besides, the semiconductor supply chain is centralised and fragile and could get disrupted, for example by a US-China war over Taiwan.
  3. AI products won’t be that profitable. AI systems (says Eden) seem good for “automating low cost/risk/importance work”, but that’s not enough to meet expectations. (See point (1) on reliability and inscrutability.) Some applications, like web search, have such low margins that the inference costs of large ML models are prohibitive.

I’ve left out some detail and recommend reading the entire thread before proceeding. Also before proceeding, a disclosure: my day job is doing research on the governance of AI, so if we’re about to see another AI winter, I’d pretty much be out of a job, as there wouldn’t be much to govern anymore. That said, I think an AI winter, while not the best thing that could happen, is vastly better than some of the alternatives, axiologically speaking.[4] I also think I’d be of the same opinion even if I still worked as a programmer today (assuming I knew as much, or as little, about AI as I actually do).

Past Winters #

There is something of a precedent.

The first AI winter – traditionally, from 1974 to 1980 – was precipitated by the unsympathetic Lighthill report. More fundamentally, it was caused by AI researchers’ failure to achieve their grandiose objectives. In 1965, Herbert Simon famously predicted that within 20 years machines would be capable of doing any work a human can do, and Marvin Minsky wrote in 1967 that “within a generation […] the problem of creating ‘artificial intelligence’ will be substantially solved”. Of Frank Rosenblatt’s Perceptron Project – whose extravagant claims aroused the ire of other AI researchers – the New York Times reported: “[It] revealed an embryo of an electronic computer that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Later perceptrons will be able to recognize people and call out their names and instantly translate speech in one language to speech and writing in another language, it was predicted” (Olazaran 1996). Far from achieving human-level intelligence, the field did not even produce adequate machine translation (that had to wait until the mid-2010s, when DeepL and Google Translate’s deep learning upgrade arrived).

The second AI winter – traditionally, from 1987 to 1993 – again followed unrealised expectations. This was the era of expert systems and connectionism (in AI, the application of artificial neural networks). But expert systems failed to scale, and neural networks learned slowly, had low accuracy and didn’t generalise. It was not the era of 1e9 FLOP/s per dollar; I reckon the LISP machines of the day were ~6-7 orders of magnitude less price-performant than that.[5]

Wikipedia lists a number of factors behind these winters, but to me it is the failure to actually produce formidable results that seems most important. Even in an economic downturn, and even with academic funding dried up, you still would’ve seen substantial investments in AI had it shown good results. Expert systems did have some success, but nowhere near what we see AI systems do today, and with none of the momentum but all of the brittleness. This seems like an important crux to me: will AI systems fulfil the expectations investors have for them?

Moore’s Law and the Future of Compute #

For AI systems, improving these days largely means scaling up. One reason why scaling might fail is if the hardware used to train AI models stops improving.

Moore’s Law is the dictum that the number of transistors on a chip will double every ~2 years, and that as a consequence hardware performance can double every ~2 years as well (Hobbhahn and Besiroglu 2022). (Coincidentally, Gordon Moore died last week at the age of 94, survived by his Law.) It’s often claimed that Moore’s Law will slow as transistor sizes approach the silicon atom limit (a fact that never ceases to amaze me). In Eden’s words, Moore’s Law looks played out.

I’m no expert on semiconductors or GPUs, but as I understand things it’s (1) not a given that Moore’s Law will fail in the next decade, and (2) quite possible that, even if it does, hardware performance will keep improving through means other than increased transistor density. It wouldn’t be the first time something like this happened: single-thread performance went off-trend as Dennard scaling failed around 2005, but transistor counts kept rising thanks to increasing numbers of cores:

[Figure: microprocessor trend data showing transistor counts continuing to climb after ~2005 while single-thread performance goes off-trend and core counts rise.]

Some of the technologies that could keep GPU performance going as the atom limit approaches include vertical scaling, advanced packaging, new transistor designs and 2D materials, as well as improved architectures and connectivity. (To be clear, I don’t have a detailed picture of what these things are; I’m mostly just deferring to the linked source.) TSMC, Samsung and Intel all have plans for <2 nm process nodes (the current SOTA is 3 nm). Some companies are exploring more out-there solutions, like analog computing for speeding up low-precision matrix multiplication. Technologies on exponential trajectories always look like they are running out of frontier ideas, until they aren’t (at least so long as there is immense pressure to innovate, as there is for semiconductors). Peter Lee said in 2016, “The number of people predicting the death of Moore’s law doubles every two years.” By the end of 2019, the Metaculus community gave “Moore’s Law will end by 2025” a 58% chance, whereas now one oughtn’t give it more than a few measly per cent.[6]

Is Transformative AI on the Horizon? #

But the main thing we care about here is not FLOP/s, and not even FLOP/s per dollar, but how much compute AI labs can afford to pour into a model. That’s affected by a number of things beyond theoretical peak performance, including hardware costs, energy efficiency, line/die yields, utilisation and the amount of money that a lab is willing to spend. So will we get enough compute to train a TAI system in the next few decades?
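Before getting to that question: to make the list of factors above concrete, here’s a toy way to frame the affordability side. Every number below is an illustrative assumption of mine, not a figure from the post or any source:

```python
def affordable_training_flop(budget_usd, flop_per_dollar, utilisation):
    """Total training FLOP ~= compute budget
    x peak FLOP bought per dollar (hardware amortisation, energy, yields, ...)
    x the fraction of peak performance actually achieved."""
    return budget_usd * flop_per_dollar * utilisation

# E.g. $100M of compute at ~3e17 peak FLOP per dollar and ~40% utilisation:
print(f"{affordable_training_flop(1e8, 3e17, 0.4):.1e}")  # ~1.2e25 FLOP
```

The point is just that a GPU’s headline FLOP/s figure is several steps removed from the compute that actually ends up in a model.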

There are many sophisticated attempts to answer that question – here’s one that isn’t, but that is hopefully easier to understand.

Daniel Kokotajlo imagines what you could do with 1e35 FLOP of compute on current GPU architectures. That’s a lot of compute – about 11 orders of magnitude more than what today’s largest models were trained with (Sevilla et al. 2022). The post gives a dizzying picture of just how much you can do with such an abundance of computing power. Now it’s true that we don’t know for sure whether scaling will keep working, and it’s also true that there can be other important bottlenecks besides compute, like data. But anyway, something like 1e34 to 1e36 FLOP of 2022-compute seems like it could be enough to create TAI.

Entertain that notion and make the following assumptions:

What all that gives you is a 50% probability of TAI by 2040, and 80% by 2045:

[Figure: cumulative probability of TAI arriving by a given year under the simple model.]
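As a minimal sketch of the arithmetic behind a calculation like this – where the ~1e24 FLOP starting point roughly follows Sevilla et al. (2022) and the doubling times are my own illustrative assumptions, not the model’s exact parameters:

```python
import math

def year_reaching(target_flop, start_flop=1e24, start_year=2022, doubling_years=0.5):
    """Year in which the largest training run reaches target_flop, assuming the
    largest affordable run keeps growing with the given effective doubling time
    (hardware price-performance x rising spending x algorithmic progress,
    folded into one number)."""
    doublings = math.log2(target_flop / start_flop)
    return start_year + doublings * doubling_years

for d in (0.5, 0.75, 1.0):  # effective doubling time in years (my assumptions)
    print(d, round(year_reaching(1e35, doubling_years=d)))  # ~2040, ~2049, ~2059
```

Reaching 1e35 FLOP by ~2040 requires the effective doubling time to stay around six months; stretch it to a year and the date slips out towards 2060.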

That is a simple model of course. There’s a far more sophisticated and rigorous version, namely Cotra (2020) which gives a median of ~2050 (though she’s since changed her best guess to a median of ~2040). There are many reasons why my simple model might be wrong:

Still, I really do think a 1e35 2022-FLOP training run could be enough (>50% likely, say) for TAI, and I really do think, on roughly this model, we could get such a training run by 2040 (also >50% likely). One of the main reasons why I think so is that as AI systems get increasingly more powerful and useful (and dangerous), incentives will keep pointing in the direction of AI capabilities increases, and funding will keep flowing into efforts to keep scaling laws going. And if TAI is on the horizon, that suggests capabilities (and as a consequence, business opportunities) will keep improving.

You Won’t Find Reliability on the Frontier #

One way that AI systems can disappoint is if it turns out they are, and for the foreseeable future remain, chronically unreliable. Eden writes, “[Which] areas of the economy can deal with 99% correct solutions? My answer is: ones that don’t create/capture most of the value.” And people often point out that modern AI systems, and large language models (henceforth, LLMs) in particular, are unreliable. (I take reliable to mean something like “consistently does what you expect, i.e. doesn’t fail”.) This view is both true and false:

John McCarthy lamented: “As soon as it works, no one calls it AI anymore.” Larry Tesler declared: “AI is whatever hasn’t been done yet.”

Take for example the sorting of randomly generated single-digit integer lists. Two years ago janus tested this on GPT-3 and found that, even with a 32-shot (!) prompt, GPT-3 managed to sort lists of 5 integers only 10/50 times, and lists of 10 integers 0/50 times. (A 0-shot, Python-esque prompt did better, at 38/50 and 2/50 respectively.) I tested the same thing with ChatGPT using GPT-3 and it got it right 5/5 times for 10-integer lists.[9] I then asked it to sort five 10-integer lists in one go, and it got 4/5 right! (NB: I’m pretty confident that this improvement didn’t come with ChatGPT exactly, but rather with the newer versions of GPT-3 that ChatGPT is built on top of.)
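This sort of check is easy to script, by the way. Here’s a hedged sketch of the kind of harness you could use, where ask_model is a stand-in for whichever model or API you’re testing (a placeholder, not a real library call):

```python
import random
import re

def make_list(n):
    """A random list of single-digit integers, as in the tests above."""
    return [random.randint(0, 9) for _ in range(n)]

def parse_list(reply):
    """Pull the last bracketed list of integers out of the model's reply."""
    matches = re.findall(r"\[([\d,\s]+)\]", reply)
    return [int(x) for x in matches[-1].split(",")] if matches else None

def score(ask_model, n_items=10, n_trials=50):
    """Count how often the model returns the correctly sorted list."""
    correct = 0
    for _ in range(n_trials):
        xs = make_list(n_items)
        reply = ask_model(f"Can you sort this list in ascending order? {xs}")
        correct += parse_list(reply) == sorted(xs)
    return correct, n_trials
```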

(Eden also brings up the problem of accountability. I agree that this is an issue. Modern AI systems are basically inscrutable. That is one reason why it is so hard to make them safe. But I don’t expect this flaw to stop AI systems from being put to use in any except the most safety-critical domains, so long as companies expect those systems to win them market dominance and/or make a profit.)

Autonomous Driving #

But then why are autonomous vehicles (henceforth, AVs) still not reliable enough to be widely used? I suspect because driving a car is not a single task, but a task complex, a bundle of many different subtasks with varying inputs. The overall reliability of driving is highly dependent on the performance of those subtasks, and failure in any one of them could lead to overall failure. Cars are relatively safety-critical: to be widely adopted, autonomous cars need to be able to reliably perform ~all subtasks you need to master to drive a car. As the distribution of the difficulties of these subtasks likely follows a power law (or something like it), the last 10% will always be harder to get right than the first 90%, and progress will look like it’s “almost there” for years before the overall system is truly ready, as has also transparently been the case for AVs. I think this is what Eden is getting at when he writes that it’s “hard to overstate the difference between solving toy problems like keeping a car between some cones on an open desert, and having a car deal with unspecified situations involving many other agents and uncertain info navigating a busy city street”.
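To see why this bites, here’s a toy calculation – the independence assumption and the numbers are mine, purely for illustration:

```python
def overall_reliability(per_subtask, n_subtasks):
    """Overall success rate if failing any subtask fails the whole task,
    with independent failures."""
    return per_subtask ** n_subtasks

def required_per_subtask(target_overall, n_subtasks):
    """Per-subtask reliability needed to hit a target overall reliability."""
    return target_overall ** (1 / n_subtasks)

print(overall_reliability(0.99, 100))     # ~0.37: two nines per subtask isn't enough
print(required_per_subtask(0.9999, 100))  # ~0.999999: four nines overall needs ~six nines per subtask
```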

This seems like a serious obstacle for more complex AI applications like driving. And what we want AI for is complicated tasks – simple tasks are easy to automate with traditional software. I think this is some reason to think an AI winter is more likely, but only a minor one.

One, I don’t think what has happened to AVs amounts to an AV winter. Despite expectations clearly having been unmet, and public interest clearly having declined, my impression (though I couldn’t find great data on this) is that investment in AVs hasn’t declined much, and maybe not at all (apparently 2021 saw >$12B of funding for AV companies, above the yearly average of the past decade[10]), and that AV patents are steadily rising (both in absolute numbers and as a share of driving technology patents). Autonomous driving exists on a spectrum anyway; we do have partially automated (L2) features like adaptive cruise control and auto lane change in cars on the road today, with adoption apparently increasing every year. The way I see it, AVs have undergone the typical hype cycle and are now climbing, by steady incremental change, the so-called slope of enlightenment. Meaning: plausibly, even if expectations for LLMs and other AI systems are mostly unmet, there still won’t be an AI winter comparable to previous winters, with investment plateauing rather than declining.

Two, modern AI systems, and LLMs specifically, are quite unlike AVs. Again, cars are safety-critical machines. There’s regulation, of course. But people also just don’t want to get in a car that isn’t highly reliable (where highly reliable means something like “far more reliable than an off-brand charger”). For LLMs, there’s no regulation, and people are incredibly motivated to use them even in the absence of safeguards (in fact, especially in the absence of safeguards). I think there are lots of complex tasks that (1) aren’t safety-critical (i.e., where accidents aren’t that costly) but (2) can be automated and/or supported by AI systems.

Costs and Profitability #

Part of why I’m discussing TAI is that it’s probably correlated with other AI advancements, and part is that, despite years of AI researchers’ trying to avoid such expectations, people are now starting to suspect that AI labs will create TAI in this century. Investors mostly aren’t betting on TAI – as I understand it, they generally want a return on their investment in <10 years, and had they expected AGI in the next 10-20 years they would have been pouring far more than some measly hundreds of millions (per investment) into AI companies today. Instead, they expect – I’m guessing – tools that will broadly speed up labour, automate common tasks and make possible new types of services and products.

Ignoring TAI, will systems similar to ChatGPT, Bing/Sydney and/or modern image generators become profitable within the next 5 or so years? I think they will within 1-2 years if they aren’t already. Surely the demand is there. I have been using ChatGPT, Bing/Sydney and DALL-E 2 extensively since they were released, would be willing to pay non-trivial sums for all these services and think it’s perfectly reasonable and natural to do so (and I’m not alone in this, ChatGPT reportedly having reached 100M monthly active users two months after launch, though this was before the introduction of a paid tier; by way of comparison, Twitter reportedly has ~450M).[11]

Eden writes: “The All-In podcast folks estimated a ChatGPT query as being about 10x more expensive than a Google search. I’ve talked to analysts who carefully estimated more like 3-5x. In a business like search, something like a 10% improvement is a killer app. 3-5x is not in the running!”

An estimate by SemiAnalysis suggests that ChatGPT (prior to the release of GPT-4) costs $700K/day in hardware operating costs, meaning (if we assume 13M active users) ~$0.054/user/day or ~$1.60/user/month (the subscription fee for ChatGPT Plus is $20/user/month). That’s $700K × 365 = $255M/year in hardware operating costs alone, quite a sum, though to be fair these hardware costs likely exceed the remaining costs – employee salaries, marketing and so on – by an order of magnitude or so. OpenAI apparently expects $200M in revenue in 2023 and a staggering $1B by 2024.

At the same time, as mentioned in a previous section, the hardware costs of inference are decreasing rapidly: the price-performance of AI accelerators doubles every ~2.1 years (Hobbhahn and Besiroglu 2022).[12] So even if Eden is right that GPT-like models are 3-5x too expensive to beat old-school search engines right now, based on hardware price-performance trends alone that difference will be ~gone in 3-6 years (though I’m assuming there’s no algorithmic progress for inference, and that traditional search queries won’t get much cheaper). True, there will be better models available in future that are more expensive to run, but it seems that this year’s models are already capable of capturing substantial market share from traditional search engines, and old-school search engines seem to be declining in quality rather than improving.
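The arithmetic behind that claim, as a quick sketch – it assumes the cost gap is closed by the ~2.1-year price-performance doubling alone, with everything else held fixed:

```python
import math

def years_to_close(cost_ratio, doubling_time_years=2.1):
    """Years until a cost gap of cost_ratio is erased by hardware
    price-performance doubling every doubling_time_years."""
    return math.log2(cost_ratio) * doubling_time_years

print(round(years_to_close(3), 1))  # ~3.3 years
print(round(years_to_close(5), 1))  # ~4.9 years
```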

It does seem fairly likely (>30%?) to me that AI companies building products on top of foundation models like GPT-3 or GPT-4 are overhyped. For example, Character.AI recently raised >$200M at a $1B valuation for a service that doesn’t really seem to add much value on top of the standard ChatGPT API, especially now that OpenAI has added the system prompt feature. But as I think these companies may disappoint precisely because they are obsoleted by other, more general AI systems, I don’t think their failure would lead to an AI winter.

Reasons Why There Could Be a Winter After All #

Everything I’ve written so far is premised on something like “any AI winter would be caused by AI systems’ ceasing to get more practically useful and therefore profitable”. AIs being unreliable, hardware price-performance progress slowing, compute for inference being too expensive – these all matter only insofar as they affect the practical usefulness/profitability of AI. I think this is by far the most likely way that an AI winter happens, but it’s not the only plausible way; other possibilities include restrictive legislation/regulation, spectacular failures and/or accidents, great power conflicts and extreme economic downturns.

But if we do see an AI winter within a decade, I think the most likely reason will turn out to be one of:

I still think an AI winter looks really unlikely. At this point I would put only 5% on an AI winter happening by 2030, where AI winter is operationalised as a drawdown in annual global AI investment of ≥50%. This is unfortunate if you think, as I do, that we as a species are completely unprepared for TAI.

References #

Cotra, Ajeya. 2020. “Forecasting TAI with Biological Anchors.”
Erdil, Ege, and Tamay Besiroglu. 2022. “Revisiting Algorithmic Progress.” https://epochai.org/blog/revisiting-algorithmic-progress.
Hobbhahn, Marius, and Tamay Besiroglu. 2022. “Trends in GPU Price-Performance.” https://epochai.org/blog/trends-in-gpu-price-performance.
Odlyzko, Andrew. 2010. “Collective Hallucinations and Inefficient Markets: The British Railway Mania of the 1840s.”
Olazaran, Mikel. 1996. “A Sociological Study of the Official History of the Perceptrons Controversy.” Social Studies of Science 26 (3): 611–659.
Sevilla, Jaime, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, and Pablo Villalobos. 2022. “Compute Trends across Three Eras of Machine Learning.” https://epochai.org/blog/compute-trends.
Villalobos, Pablo, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho. 2022. “Will We Run Out of ML Data? Evidence from Projecting Dataset Size Trends.” https://epochai.org/blog/will-we-run-out-of-ml-data-evidence-from-projecting-dataset.

Footnotes #

  1. By comparison, there seems to have been a drawdown in corporate investment in AI from 2014 to 2015 of 49%, in solar energy from 2011 to 2013 of 24% and in venture/private investment in crypto companies from 2018 to 2019 of 48%. The share prices of railways in Britain declined by about 60% from 1845 to 1850 as the railway mania bubble burst (Odlyzko 2010), though the railway system of course left Britain forever changed nonetheless. ↩︎

  2. Well, this depends a bit on how you view Moore’s Law. Gordon Moore wrote: “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year.” Dennard scaling – which says that as transistors shrink, their performance improves while power consumption per unit area remains constant – failed around 2005. I think some traditionalists would say that Moore’s Law ended then, but clearly the number of transistors on a chip keeps doubling (only by other means). ↩︎

  3. William Eden actually only talks about artificial general intelligence (AGI), but I think the TAI frame is better when talking about winters, investment and profitability. ↩︎

  4. It’s interesting to note that the term AI winter was inspired by the notion of a nuclear winter. AI researchers in the 1980s used it to describe a calamity that would befall themselves, namely a lack of funding, and, true, both concepts involve stagnation and decline. But a nuclear winter happens after nuclear weapons are used. ↩︎

  5. Apparently the collapse of the LISP machine market was also a contributing factor. LISP machines were expensive workstations tailored to the use of LISP, at the time the preferred programming language of AI researchers. As AI programs were ~always written in LISP, and required a lot of compute and memory for the time, the loss of LISP machines was a serious blow to AI research. It’s a bit unclear to me how exactly the decline of LISP machines slowed AI progress beyond that, but perhaps it forced a shift to less compute- and/or memory-hungry approaches. ↩︎

  6. The question is actually operationalised as: “Will the transistors used in the CPU of Apple’s most modern available iPhone model on January 1st, 2030 be of the same generation as those used in the CPU of the most modern available iPhone on January 1st, 2025?” ↩︎

  7. That said, MosaicBERT (2023) achieves similar performance to BERT-Base (2018) with lower costs but seemingly more compute. I estimate that BERT-Base needed ~1.2e18 FLOP in pre-training, and MosaicBERT needed ~1.6e18. I’m not sure if this is an outlier, but it could suggest that the algorithmic doubling time is even longer for text models. When I asked about this, one of the people who worked on MosaicBERT told me: “[W]e ablated each of the other changes and all of them helped. We also had the fastest training on iso hardware a few months ago (as measured by MLPerf), and MosaicBERT has gotten faster since then.” ↩︎

  8. $10B may seem like a lot now, but I’m thinking world-times where this is a possibility are world-times where companies have already spent $1B on GPT-6 or whatever and seen that it does amazing things, and is plausibly not that far from being transformative. And spending $10B to get TAI seems like an obviously profitable decision. Companies spend 10x-100x that amount on some mergers and acquisitions, yet they’re trivial next to TAI or even almost-TAI. If governments get involved, $10B is half of a Manhattan-project-equivalent, a no-brainer. ↩︎

  9. Example prompt: “Can you sort this list in ascending order? [0, 8, 6, 5, 1, 1, 1, 8, 3, 7]”. ↩︎

  10. FT (2022): “It has been an outrageously expensive endeavour, of course. McKinsey put the total invested at over $100bn since 2010. Last year alone, funding into autonomous vehicle companies exceeded $12bn, according to CB Insights.” – If those numbers are right, that at least suggests the amount of funding in 2021 was substantially higher than the average over the last decade, a picture which seems inconsistent with an AV winter. ↩︎

  11. Well, there is the ethical concern. ↩︎

  12. I’m not exactly sure whether this analysis is done on training performance alone, but I expect trends in training performance to be highly correlated with trends in inference performance. Theoretical peak performance isn’t the only thing that matters – e.g. interconnect speed matters too – but it seems like the most important component.

    I’m also guessing that demand for inference compute is rising rapidly relative to training compute, and that we may see more R&D on GPUs specialised for inference in the future. I think that hasn’t been the focus so far, as training compute has been the main bottleneck. ↩︎

  13. By true out-of-distribution generalisation, I mean to point at something like “AI systems are able to find ideas obviously drawn from outside familiar distributions”. To make that more concrete, I mean the difference between (a) AIs generating entirely new Romantic-style compositions and (b) AIs ushering in novel kinds of music the way von Weber, Beethoven, Schubert and Berlioz developed Romanticism. ↩︎

  14. I’m not confident that this would scale, though. A quick back-of-the-envelope calculation suggests OpenAI would get the equivalent of about 0.016% of the data used to train Chinchilla if it spent the equivalent of 10 well-paid engineers’ salaries (in total ~$200K per month) for one year. That’s not really a lot.

    That also assumes:

    1. A well-paid engineer is paid $200K to $300K annually.
    2. A writer is paid $10 to $15 per hour (this article suggests OpenAI paid that amount for Kenyan labourers – themselves earning only $1.32 to $2 an hour – to provide feedback on data for ChatGPT’s reinforcement learning step).
    3. A writer generates 500 to 1,500 words per hour (that seems reasonable if they stick to writing about themselves or other things they already know well).
    4. A writer works 9 hours per day (the same Kenyan labourers apparently worked 9-hour shifts), about 21 days per month (assumes a 5-day work week).
    5. Chinchilla was trained on ~1.4T tokens which is the equivalent of ~1.05T words (compare with ~374B words for GPT-3 davinci and ~585B words for PaLM) (Sevilla et al. 2022). I use Chinchilla as a point of comparison since that paper, which came after GPT-3 and PaLM were trained, implied LLMs were being trained on too little data.

    Those assumptions imply OpenAI would afford ~88 labourers (90% CI: 66 to 118) who’d generate ~173M words per year (90% CI: 94M to 321M), as mentioned the equivalent of 0.016% of the Chinchilla training data set (90% CI: 0.009% to 0.031%). And that implies you’d need 6,000 years (90% CI: 3,300 to 11,100) to double the size of the Chinchilla data set. ↩︎
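    For what it’s worth, a hedged Monte Carlo version of this estimate – treating the ranges in the assumptions above as 90% intervals of lognormals, with all other modelling choices mine – lands on roughly the same numbers:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000

    def lognormal_from_90ci(lo, hi, size):
        """Lognormal samples whose 5th and 95th percentiles are roughly lo and hi."""
        mu = (np.log(lo) + np.log(hi)) / 2
        sigma = (np.log(hi) - np.log(lo)) / (2 * 1.645)
        return rng.lognormal(mu, sigma, size)

    budget_per_month = 200_000                  # ~10 well-paid engineers' salaries
    wage = lognormal_from_90ci(10, 15, N)       # $/hour per writer
    words_per_hour = lognormal_from_90ci(500, 1500, N)
    hours_per_month = 9 * 21                    # 9-hour days, ~21 days per month
    chinchilla_words = 1.05e12

    writers = budget_per_month / (wage * hours_per_month)
    words_per_year = writers * words_per_hour * hours_per_month * 12

    print(round(np.median(writers)))                            # ~86 writers
    print(round(np.median(words_per_year) / 1e6))               # ~170M words/year
    print(round(np.median(chinchilla_words / words_per_year)))  # ~6,200 years to double the data set
    ```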