I mainly want to write today about AI, and its future. The case I want to make is that AI is not immune to the gravity of logic, and that the principles of causal inference dictate the surly bounds of artificial intelligence.
Before that, I want to update you on what we know about the stunning violence decline of 2024. As always, Jeff Asher has the news: through the end of October 2024, violence is down a bit from 2023, and murder is down a lot. Although it gets far less attention, perhaps because it has been dropping steadily for more than three decades, property crime is down even more. See more on Jeff’s excellent Substack channel.
This is great news. What is clear from Jeff’s graphic is that, in 2024 at least, the violence decline was mainly about fewer homicides. So let’s dig into that a little bit.

The best source of data on firearms homicides is the CDC provisional mortality statistics, which have updated cause-of-death data (provisionally) through April 2024. Comparing April 2024 to April 2023, firearms homicides were down 15.3 percent, and compared to the peak of the COVID-era firearms spike in April 2021, they are down almost 23 percent. The cumulative monthly trends show the decline quite clearly.
Given how well this matches Asher’s data, it suggests that there will be around 15,000 firearms homicide victims in 2024, down from a peak of 20,958 in 2021. That is almost 6,000 fewer firearms homicide victims, which is a notable accomplishment.
Now, the hard work of understanding what is working and how it can be replicated should be at the top of the agenda.
The Limits of AI
I am not a computer scientist, I don’t play one on TV, and I didn’t sleep at a Holiday Inn last night. But I have spent a career thinking about causal inference, and I have some thoughts about the future of AI. My main thought is that there is an inherent problem with scaling AI that gets too little attention. I get why the AI tech folks don’t want to talk about it, because it potentially puts a ceiling on what AI can be. But I think it’s worth thinking about.
In the social sciences, a lot of effort goes into solving inference problems: trying to tease out cause and effect. Charles Manski famously wrote:
If you have an inference problem that is solved by getting more of the same data, you have a statistical problem.
If more of the same data does not solve your problem of inference, you have an identification problem.
The crux of my argument is that AI has an identification problem, but, as far as I can tell, AI is treating inference problems in large language models (LLMs) as statistical problems. If the inference problem the AI folks are trying to solve is in fact an identification problem, that puts a ceiling on AI that is somewhat lower than its boldest proponents suggest.
Now, identification is one of the most complex ideas in social science, and I certainly can’t do it justice in a Substack essay. But the idea is this. Suppose you are interested in reducing violence. If you looked at the data, you would quickly observe that places with a lot of violence also often have deep poverty. And you would ask yourself, did poverty cause the violence, or did the violence cause the poverty? And the answer is, yes. Both are true.
Cause and effect are a toxic stew of endogeneity in this example: the effect causes the cause. And the effect and the cause are simultaneously determined (as poverty goes up, violence goes up, and as violence goes up, poverty goes up). And there is a lot more going on that explains both violence and poverty, so the model is woefully incomplete. And, and, and. More data about crime and poverty is not going to solve this problem of what causes what.
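To make that concrete, here is a minimal simulation in Python of a simultaneous system. All of the numbers are invented for illustration: poverty pushes up violence, violence pushes up poverty, and then I naively regress violence on poverty the way you would with whatever data is lying around.

```python
import numpy as np

rng = np.random.default_rng(0)

B_PV = 0.5  # assumed true effect of poverty on violence
B_VP = 0.5  # assumed true effect of violence on poverty

def naive_slope(n):
    """Simulate the simultaneous system, then regress violence on poverty."""
    u = rng.normal(size=n)  # shock to violence
    v = rng.normal(size=n)  # shock to poverty
    denom = 1 - B_PV * B_VP
    violence = (u + B_PV * v) / denom  # reduced form of the two-equation system
    poverty = (v + B_VP * u) / denom
    cov = np.cov(poverty, violence)
    return cov[0, 1] / cov[0, 0]       # OLS slope of violence on poverty

for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}  estimated effect = {naive_slope(n):.3f}  (truth = {B_PV})")
```

No matter how big n gets, the estimate settles at roughly 0.8, not the true 0.5 I wired in. The bias is baked into the structure of the problem; more of the same data cannot shake it loose.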
What are other examples of identification/endogeneity problems? It turns out that these problems lurk everywhere!
I love you and you love me. Do you love me because I love you or do I love you because you love me or are they independent (I love you regardless of whether you love me, and you love me regardless of whether I love you)? I’ll bet you a dollar you’ve been stuck on this one at some point.
I want to sell you an essay I wrote on Substack. If I set a high price, I will get fewer subscribers, but I will also collect more dollars per subscriber. If I set a low price, I will enroll a lot more subscribers but collect fewer dollars from each of them. So, what price do I set? (I put a little sketch of this trade-off in code a few paragraphs down.)
I am a football coach, and I direct some players onto the field to run a certain play. In response, the defense substitutes specific players to counter my deployment. So I substitute some different offensive players in response to their response. And so on. So, ChatGPT, what is the right answer, what’s the right offense for me to run?
I want to write a book on endogeneity and identification that normal people can read. But no one has ever written a book on identification and endogeneity for normal people. Does that mean there is no market for a book on identification and endogeneity for normal people to buy and read?
The key issue is that in every case, more of the same data does not provide information that solves the problem.
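Back to the Substack pricing example for a second. Here is a toy sketch with a completely made-up linear demand curve; the numbers are invented, and that is sort of the point.

```python
# Toy pricing trade-off with a completely made-up demand curve.
def subscribers(price):
    """Hypothetical demand: the higher the price, the fewer the subscribers."""
    return max(0, 1000 - 80 * price)

for price in (2, 5, 8, 11):
    n = subscribers(price)
    print(f"price ${price:>2}: {n:4d} subscribers, revenue ${price * n:,}")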
Anyway, the point is that these kinds of identification problems are everywhere and intrinsic to the human condition. By contrast, the effort to scale AI has focused on collecting ever larger samples of the same kinds of data. That approach rests on an insidious (and incorrect) assumption: that everything is a statistical problem.
Now, that’s a real problem if you want to use AI to solve problems that we puny humans have been unable to solve. It is perhaps obvious to point out, but if AI models treat all problems as statistical problems, then they can’t solve identification problems.
Now, you may ask, what are the identification problems that AI wants to solve that can’t be solved if AI thinks of the world purely in terms of statistical inference?
Reasoning. Any problem that requires reasoning to resolve. Especially when reasoning in the presence of uncertainty. Which is always the case.
Consider this small example about the uncertainty around the meaning of words. AI builds on LLMs that are, in essence, massive correlation matrices that statistically predict how words should be ordered in sentences. AI does a brilliant job in normal circumstances, in the middle of the distribution of word relationships, where there is little uncertainty. But AI wants to be great, it wants to reason, so it also needs to be great out in the tails of the distribution, where the relationship between words is fraught with uncertainty.
There are lots of examples that would trip up AI here. For instance, there are reverse causality problems: words that mean one thing in one order and something totally different in the reverse order. Running water and water running. Car racing and racing cars. And that’s just two-word combinations. Here’s a three-word combination for ya: High School students vs. School Students High.
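Here is a deliberately crude illustration (a toy, nothing like how a production LLM actually represents text): if you only track which words appear together, and not the order they appear in, “racing cars” and “cars racing” are literally the same data point.

```python
from collections import Counter

phrases = ["racing cars", "cars racing", "running water", "water running"]

# Order-blind tally: unordered word pairs.
unordered = Counter(frozenset(p.split()) for p in phrases)

# Order-aware tally: ordered word pairs.
ordered = Counter(tuple(p.split()) for p in phrases)

print(unordered)  # two entries, each counted twice: the reversals collapse together
print(ordered)    # four distinct entries, one per phrase
```

The order-blind tally throws away exactly the information that separates car racing from racing cars.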
Consider the Oxford comma. Consider the confusion when you do and don’t include that extra comma. For funsies, I asked ChatGPT to give me some examples of Oxford comma confusion, and boy did it deliver!
"We invited the strippers, JFK and Stalin."
"We invited the strippers, JFK, and Stalin."
Let’s see you work with that data, large language model!
I suspect this kind of confusion is one source of AI hallucinations. AI researchers will tell you this is because AI can’t (yet!) reason. But reason is a funny word for this. If it means we just need to expose AI to lots of these kinds of situations so it can figure them out and apply what it learned in the future, doesn’t that just mean the AI folks think they have a statistical problem, which more data solves, rather than an identification problem, which more data does not solve? And if by ‘reason’ they mean that AI is going to figure out how to solve identification problems, I say awesome! Let’s do it! Because humans have not been able to do this…
But I don’t think it will be able to solve this problem. All of these issues are inherent in the way AI collects data: it just hoovers up everything. Anyone who has taken a class in social science knows there are limitations to just grabbing all the data that is available to you. What you get is called a convenience sample: it’s just whatever data is lying around. And social scientists know that trying to draw an inference from whatever data happens to be lying around is extremely difficult.
To figure out whether this causes that, you somehow have to construct a group that gets whatever treatment you are interested in and a group that does not (a comparison group). For the most part, you have to construct those groups after the fact, after people did or did not get the treatment. This creates big problems, because the reason people did or did not get the treatment is often highly related to their outcome.
Suppose you look at a bunch of school data showing how well people did on a test. Some of those people volunteered for a tutoring program, some did not, and you want to know how well the tutoring program worked. The people who volunteered for tutoring surely would have done better than the people who did not, even with no tutoring at all, because the same motivation that led them to volunteer would likely have led them to do other things that help on a test, like study more. And you can dig, dig, dig into this same data. But at the end of the day, you still won’t know for sure how much of the better outcomes came from the tutoring and how much came from whatever made those students volunteer in the first place.
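A tiny simulation makes the point. Everything here is invented: motivation drives both volunteering and test scores, and I wire in a tutoring effect of five points so we know the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

TRUE_TUTORING_EFFECT = 5.0  # assumed, in test-score points

motivation = rng.normal(size=n)
# Motivated students are more likely to volunteer for tutoring.
volunteered = rng.random(n) < 1 / (1 + np.exp(-2 * motivation))
# Motivation helps scores on its own, on top of the tutoring effect.
score = 70 + 8 * motivation + TRUE_TUTORING_EFFECT * volunteered + rng.normal(scale=5, size=n)

naive_gap = score[volunteered].mean() - score[~volunteered].mean()
print(f"naive volunteer-vs-not gap: {naive_gap:.1f} points "
      f"(true tutoring effect: {TRUE_TUTORING_EFFECT})")
```

The naive comparison comes out far above the five points I wired in, because it credits the tutoring with gains that motivation would have delivered anyway.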
This is the AI problem I don’t think we talk about enough. It’s just digging ever deeper into the same pile. I see in the newspapers that the proposed solution is for the AI companies to create more of the same data to plug the holes in the data. For one, I don’t see how that solves the identification problem. And for two, all I can think about when I read about creating data, e.g., synthetic data, is all the problems created by the last great experiment in synthetic data: the synthetic collateralized debt obligations (CDO-squared) and all the harm they did to society.
The thing is, there are strategies you can employ to solve some identification/endogeneity problems. The most straightforward one is to run experiments, randomized controlled trials, instead of relying on whatever data is just lying around. When I see articles saying that AI companies are hiring mathematicians and other data scientists to “solve math problems” that are presumably not in the existing literature so that new LLMs can learn from them, I wonder if that is what they are doing. But I wonder how far it will get them, because the number of hypotheses they would need to test this way is infinite…
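Here is what randomization buys you, using the same invented tutoring world as above, except that a coin flip, not motivation, now decides who gets tutored.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

TRUE_TUTORING_EFFECT = 5.0  # assumed, in test-score points

motivation = rng.normal(size=n)
# Randomized assignment: tutoring is independent of motivation.
tutored = rng.random(n) < 0.5
score = 70 + 8 * motivation + TRUE_TUTORING_EFFECT * tutored + rng.normal(scale=5, size=n)

rct_gap = score[tutored].mean() - score[~tutored].mean()
print(f"randomized tutored-vs-not gap: {rct_gap:.1f} points "
      f"(true effect: {TRUE_TUTORING_EFFECT})")
```

The simple difference in means lands right on the wired-in effect, because randomization breaks the link between who gets the treatment and who was going to do well anyway.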
But again, this does not solve the bigger issue: more of the same data does not solve an inference problem that is, at bottom, an identification problem. Because the intrinsic problem is that some problems just don’t have solutions.
If I want to know whether you love me because you just do, or whether you love me because you are just reflecting my love for you back at me, then, well, it’s a pickle. You can seek counseling from a trained professional or the counsel of friends and loved ones. However, they only know what you tell them (you might even call the data they get from you synthetic), and regardless, their ability to help is limited. You could run an experiment and break out of the endogeneity mess you have gotten yourself into: you could test whether your partner loves you innately by withholding your love from them. And you will get an answer. But running the experiment at all will indelibly change the course of your love story.
You get the idea.