If at first you don’t succeed, try, try again

7 Feb 2024 | Data is not evidence

One of the papers that I cited most prominently in Grow the Pie is “Corporate Sustainability: First Evidence on Materiality”. It shows that ESG doesn’t always pay off: firms with high ESG scores don’t beat the market; only those that focus their ESG efforts on issues material to their industry do. For example, climate change is a serious global threat, but it isn’t the most important concern for a tech company that conducts its business in the cloud rather than along the coastline. Thus, a tech company that’s best-in-class in its carbon footprint doesn’t beat the market; instead, it outperforms by leading the pack on relevant societal issues, such as data privacy, cyberaddiction and misinformation.

I included this paper in virtually every talk I gave on my book, and every presentation I gave on ESG investing. I also featured it in the first draft of Chapter 2 of my forthcoming book May Contain Lies, the chapter on Black-and-White Thinking – that ESG is not black-or-white, or always good or always bad, but there are shades of grey. It was one of my favorite papers ever written on ESG. This is because I viewed the results, as stated by the authors and published in a top peer-reviewed journal, as extremely important. On the one hand, ESG can pay off, in contrast to the concerns of ESG skeptics. But on the other hand, in contrast to the views of some ESG supporters, you can’t do ESG in a scattergun way – you have to focus on material issues, and show restraint on immaterial ones.

But a few months after I penned Chapter 2, a new study by Luca Berchicci and Andy King came across my LinkedIn feed. It suggested that this paper, which I knew and loved, might be the product of data mining. And it’s not just me who was fooled; the paper was used in testimony before the US Senate and appeals to the US Securities and Exchange Commission, as well as having a major impact on both academia and practice.

What is data mining? This is when authors run lots of different tests, and only publish the results that give the answers they want. “If at first you don’t succeed, try, try again” isn’t just an abstract proverb – it’s true in real life when it comes to some studies. What are the multiple tries that the authors might have had here?

What’s Material?

The Sustainability Accounting Standards Board lists what the most material ESG topics are for each industry; the data provider MSCI rates companies according to their performance on various ESG issues. Since MSCI provides data company-by-company, and SASB defines materiality industry-by-industry, the researchers had to decide which industry each firm is in. For most companies, it’s obvious – General Motors is in motors, Bank of America is a bank – but for others, it’s not so clear. Is Comcast in telecoms, or media and entertainment because it owns NBCUniversal? The list of issues that MSCI scores doesn’t precisely correspond to the topics whose materiality SASB rates, so judgment is also needed here. For example, it’s unclear whether to count sulphur dioxide as a greenhouse gas, a toxic emission, or both.

Measuring Material ESG Scores

Those first steps produced a “material ESG score” – a company’s performance, as measured by MSCI, on the ESG issues that SASB judges as being material for its industry. But there are even more forks in the road. Do you use a static measure that captures your ESG score at a point in time, or a dynamic measure of changes over time? The authors decided on the latter, but then faced another decision. Do you look at a company’s improvement in isolation, or compared to its peers to address concerns of grade inflation? Again they chose the latter, which then required them to select a peer group – they plumped for companies with similar size, profitability, leverage, R&D (research and development), advertising, and a few other factors. This gave them a measure of how much a company’s material ESG score had improved relative to its peers.

The Control Group

The final step was the most crucial. In Chapter 4 of May Contain Lies, I stress the importance of having a control sample, which is as close as possible to the treated sample except for the input variable. That works if the input variable is binary. You either receive a drug or a placebo, so you compare patients with one to those with the other. In this case, the analysis is simple – you find the average recovery for the treated sample, do the same for the control sample, calculate the difference and check if it’s significant.

But many variables aren’t zero-one; they’re continuous. Compared to its peers, one company might have improved its material ESG score by 0.87, another by 0.21, and three more by -0.39, 0.85, and -0.72. As a result, the data isn’t neatly divided into treated and control companies; they’re all treated, but to different degrees. We deal with continuous variables by running a regression. This takes companies’ actual changes in ESG scores, correlates them with their actual changes in financial performance, and draws a best-fit line to find the overall relationship. Here’s an example (using hypothetical data):

Company | Change in Material ESG Score | Financial Performance
A | 0.87 | 1.9
B | 0.21 | -0.1
C | -0.39 | 0.2
D | 0.85 | 0.1
E | -0.72 | 0.9

The power of a regression is that it uses the full range of your data. Company A with an ESG change of 0.87 improved over four times faster than B with 0.21, and the regression takes this into account – in the above graph, it’s four times further away from zero on the right of the horizontal axis.
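To see what the best-fit line does with the five hypothetical companies, here is a minimal Python sketch. The helper `ols_slope` and the variable names are my own labels for illustration; the numbers are the hypothetical data from the table above.

```python
# Hypothetical data from the table above (companies A to E).
esg_change = [0.87, 0.21, -0.39, 0.85, -0.72]
performance = [1.9, -0.1, 0.2, 0.1, 0.9]


def ols_slope(x, y):
    """Slope of the least-squares best-fit line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-movement of x and y, scaled by how much x itself varies.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


slope = ols_slope(esg_change, performance)
print(round(slope, 2))  # 0.24
```

Every company contributes to the slope in proportion to how far its ESG change sits from the average, which is what "using the full range of your data" means in practice.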

Throwing Away Data

But this isn’t what the researchers did. They took only the firms in the top 20% of ESG improvements, ignoring all the rest, and showed that they beat the market. Their claim: companies with big increases in their ESG scores outperform.

Sounds convincing, right? Actually, there’s a serious problem. The authors divided the sample into black-and-white, ignoring all the shades of grey. Company A with an ESG improvement of 0.87 did barely any better than D with 0.85, so they should be treated almost identically. But focusing only on big ESG improvers, and defining them as companies in the top fifth, means that A counts as a big improver and D does not. This suits the authors since A had performance of 1.9 but D earned only 0.1, consistent with the idea that big ESG improvers had superior performance.

The original, continuous variable “ESG improvement” was 0.87, 0.21, -0.39, 0.85 and -0.72 for A to E and took into account the full colour spectrum. Dividing the sample into the top 20% and bottom 80% converts it into a black-and-white variable “big ESG improver or not”, which was 1 (yes) for A and 0 (no) for B through E. But the last four shouldn’t be bucketed together, because there’s a huge variation in their actual ESG improvements – converting 0.21, -0.39, 0.85, and -0.72 to 0 throws out all the information in the different shades of grey. Far from studying whether your ESG improvement affects performance, the researchers’ approach actually ignores your ESG improvement if you’re not in the top 20%.

It’s easy to see why this approach is attractive, if you want a positive result. In the original graph, if you ignore company A and consider only the bottom 80%, the correlation between ESG changes and financial performance is negative:

That’s why the overall relationship in the first graph, with all five datapoints, is weak. Company A supports the authors’ hypothesis because it was a strong ESG improver and performed strongly, but B to E contradict it due to their negative correlation, and so the best-fit line is close to flat. But if you don’t like some of the data, you can make up an excuse to get rid of it. By focusing on the top 20% (Company A), which supports your story, you can hide the fact that the bottom 80% don’t fit. When throwing out the shades of grey and classifying firms as black-and-white (big improvers or not), the graph now shows a much stronger relationship:

Top 20% or not? (1 = yes, 0 = no) | Financial Performance
1 | 1.9
0 | -0.1
0 | 0.2
0 | 0.1
0 | 0.9
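Both effects can be checked on the same hypothetical numbers. The sketch below (again with my own helper and variable names, purely for illustration) computes the best-fit slope among the bottom 80% only, and then the gap between the top 20% and everyone else. With a 0/1 variable, the regression slope is simply that gap in group averages.

```python
# Hypothetical data for companies A to E, as used earlier in this post.
esg_change = [0.87, 0.21, -0.39, 0.85, -0.72]
performance = [1.9, -0.1, 0.2, 0.1, 0.9]


def ols_slope(x, y):
    """Slope of the least-squares best-fit line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# With five companies, the top 20% is the single largest ESG improver.
cutoff = max(esg_change)
big_improver = [1 if e >= cutoff else 0 for e in esg_change]

# Bottom 80% only: the relationship between ESG change and performance.
rest_x = [e for e, d in zip(esg_change, big_improver) if d == 0]
rest_y = [p for p, d in zip(performance, big_improver) if d == 0]
rest_slope = ols_slope(rest_x, rest_y)

# Black-and-white version: regress performance on the 0/1 indicator.
binary_slope = ols_slope(big_improver, performance)

print(round(rest_slope, 2))    # -0.45
print(round(binary_slope, 3))  # 1.625
```

The binary slope of 1.625 is just Company A's 1.9 minus the 0.275 average of B to E, so a single datapoint drives the whole result.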

In addition to throwing away information, binarization gives you huge flexibility to decide your treated and control groups. If classifying the top fifth as a big ESG improver doesn’t work, why not try the top quarter or the highest third? And you could classify non-improvers as the bottom 80%, weakest 20%, or lowest half.

You can also do the same trick with the outcome variable. Rather than considering actual financial performance (1.9, -0.1, 0.2, 0.1, and 0.9), you reduce it to black-and-white by studying whether you beat the average, in which case 0.9 is just as good as 1.9 as they’re both above-par. A conclusion that “companies with big improvements in their ESG scores are twice as likely to have above-average financial performance” sounds striking – but we care about actual performance, not just moving from below- to above-average. On a scale of 1 to 10, edging up from 4.9 to 5.1 counts for your results if all that matters is whether you’re higher or lower than the middle, but jumping from 5.1 all the way to 10 has no effect.

Tying It All Together

Given how serious these problems are, how can researchers get away with ignoring the shades of grey? Because of black-and-white thinking. We like to divide the world into good and bad. “The top 20% in ESG changes beat the bottom 80% by 5% per year” has a clear interpretation – good firms beat bad firms by 5%. With a regression, there’s no such thing as good or bad; instead, you summarise it with a regression coefficient. That takes into account all the data and represents the slope of the best-fit line. The slope in our first graph was 0.24 – a higher ESG change of 1 is associated with higher financial performance of 0.24, on average. This still has a real-world meaning, but is harder to interpret and remember than the goodies beating the baddies by 5%.

Now binarization can sometimes be justified. Perhaps ESG only boosts your performance if you’re in the upper echelon to begin with. Improving your football skills from average to good won’t see you make a living as a professional athlete, but rising from excellent to world class can bring home the millions – and it might be the same for ESG. For the other judgment calls made by the authors, there might also be many reasonable paths to take.

So Berchicci and King took an agnostic approach. They conducted a “model uncertainty” analysis, where they looked over all the possible specifications the authors could have chosen. Doing so produced a range of possible values for how much ESG improvers beat the market. They found the outperformance reported in the original paper, by Khan et al., was larger than 98% of the other numbers they’d have ended up with by making different choices.
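To give a flavour of the idea, here is a toy sketch on the five hypothetical companies from earlier in this post. It is not Berchicci and King’s procedure or data: I simply assume that the only choices are how many top improvers count as “treated” and how many bottom companies count as the “control”, and then ask what share of the alternative choices give a smaller gap than top 20% versus bottom 80%. All function and variable names are made up for illustration.

```python
from itertools import product

# Hypothetical data for companies A to E, as used earlier in this post.
esg_change = [0.87, 0.21, -0.39, 0.85, -0.72]
performance = [1.9, -0.1, 0.2, 0.1, 0.9]

# Performance ordered from the largest ESG improver to the smallest.
ranked = [p for _, p in sorted(zip(esg_change, performance), reverse=True)]
n = len(ranked)


def mean(values):
    return sum(values) / len(values)


def gap(top_k, bottom_m):
    """Average performance of the top_k improvers minus the bottom_m."""
    return mean(ranked[:top_k]) - mean(ranked[n - bottom_m:])


# Every way of picking a top group and a non-overlapping bottom group.
specs = [(k, m) for k, m in product(range(1, n), repeat=2) if k + m <= n]
gaps = {spec: gap(*spec) for spec in specs}

reported = (1, 4)  # top 20% versus bottom 80%
others = [g for spec, g in gaps.items() if spec != reported]
share_smaller = mean([g < gaps[reported] for g in others])

print(len(specs), share_smaller)  # 10 1.0
```

In this toy example the reported choice beats every one of the nine alternatives, which is the kind of pattern a model uncertainty analysis is designed to reveal.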

Statistics can never prove or disprove a hypothesis, and so this 98% statistic can’t prove data mining – the configurations the authors reported might genuinely be the only ones they tried and they just got very lucky. Indeed, Berchicci and King never made this accusation. Their goal wasn’t to embarrass anyone, but scientific inquiry – to understand whether there really is a link between material ESG performance and stock returns. Regardless of whether the published results were data-mined or reached by accident, the 98% figure suggests they’re not reliable. Nearly any other plausible specification leads to a weaker result. As a result, I had to take this paper out of Chapter 2 of May Contain Lies, and all my future talks on ESG investing.

The Aftermath

Berchicci and King’s paper was published in the Journal of Financial Reporting. They initially sent it to The Accounting Review, the same journal that published the initial paper. Often journals have been willing to publish papers that overturn prior results in the same journal, as this is the way science evolves. Andrew Gelman’s blog covered what happened at The Accounting Review, which I find disappointing.

Normally, if researchers’ results are overturned, they take the L, stop promoting them, and move on. Sometimes they may have indeed made an honest mistake. Even though the goal of this blog is to uncover misinformation, I didn’t think to feature this paper because Berchicci and King had already highlighted its flaws, and I assumed the authors would stop promoting their overturned results. In addition, this blog focuses on practitioner papers, since for academic papers there is a process by which results get overturned – other authors write critiques, which get peer-reviewed and published if they’re valid.

But today I was surprised to see an article in the Harvard Business Review by one of the authors referring to his results as if they were gospel:

“A main criticism of corporate sustainability has long been that it results in firms not putting shareholders first, thus contradicting managers’ fiduciary duty. In 2016, however, I published a paper, “Corporate Sustainability: First Evidence on Materiality” … that began to overturn that narrative. We documented that considering financially material ESG factors (i.e., those sustainability activities that are related to the core sector practices of the firm) improve portfolio returns, which is consistent with financially material sustainability activities creating shareholder value.”

The author then claims “In part as a result of our paper, ESG investing took off”, emphasizes how the Financial Times “called the paper a turning point on how investors viewed and integrated ESG information” and refers to the study as “my foundational paper”. Even if your results have not been overturned, it is not for you to call your own paper foundational; that’s for others to judge. This is even more surprising when your findings have been called into question. The academic process of publishing critiques seems to have had no effect here.

But perhaps Berchicci and King’s analysis was wrong? Maybe it was the Journal of Financial Reporting that erred by publishing their critique, rather than The Accounting Review that erred by publishing the original paper. Then, the authors are perfectly entitled to continue to share their work, because the concerns are specious. Perhaps they should market it even more prominently, to drown out the flawed criticisms.

Unfortunately (for supporters of the original paper like myself), this does not seem to be the case. The authors had a right to reply to the critique, and took up this right. They actually do not refute any of the main points in Berchicci and King’s analysis, or convincingly defend their own study, but instead suggest that the general idea that materiality matters remains correct because it’s consistent with the findings of other ESG papers (including one of my own). But as Berchicci and King point out in their rejoinder to the reply, only 2 of the 28 articles cited study materiality, and both are self-citations. And even if other papers found similar results, this does not make the original paper correct – the HBR article should have referred to those other papers. Thus, the authors seem to recognize the critique is valid, but still promote their overturned results. I thus hope that this post helps inform readers about this important issue, one in which I was misled myself.
