Why we need a new standard
For years, FiveThirtyEight (538) has been considered the gold standard for pollster rankings. According to their system, four pollsters currently sit at the top with a perfect 3.0 score: The New York Times, ABC News, Marquette, and YouGov. On the surface, this seems like a reliable marker of pollster accuracy. But does this ranking actually reflect reality? The evidence suggests otherwise.
Take RealClearPolitics (RCP) as a point of comparison. Their aggregated scorecard for accuracy paints a very different picture. While The New York Times performed well in 2022, calling the midterms correctly, the broader data reveals glaring inconsistencies. Among 18 pollsters evaluated over multiple years, The New York Times ranks only sixth for accuracy, and YouGov drops all the way down to sixteenth! These discrepancies raise serious questions about the methodology behind 538's rankings.
Atlas Intel CEO Andrei Roman has been leading the charge in exposing the flaws in 538’s pollster ratings, sharing a series of Xeets that highlight just how inconsistent—and often nonsensical—their scoring system is. In upcoming posts, I’ll dive deeper into 538's methodology and the weaknesses that undermine its credibility.
For now, one thing is clear: we need a new, more reliable standard for evaluating pollster accuracy.
Defining a good pollster
If we’re going to create a new system for rating pollsters, the first step is defining what makes a pollster "good." FiveThirtyEight’s (538) methodology attempts to capture this by considering factors like margin of error, type of race, partisan bias, transparency, and more. But according to Roman, this approach may actually incentivize systemic bias in pollster behavior. When you look at how disconnected 538’s rankings are from actual pollster accuracy, it’s hard to disagree.
What Should We Value in a Pollster?
Most people don’t care about things like sample sizes, transparency, or methodology jargon. They care about results. If I were to boil down what makes a pollster valuable to the average person, it might come down to three key qualities:
Accuracy: How close were their predictions to the final results, in terms of margins?
Signal: Are they able to deliver useful signals in high-stakes, close, and controversial races?
Breadth: Do they perform well across different races, including off-year elections and diverse contexts?
I disagree with 538 that factors like partisanship or transparency should influence the score itself. For instance, a partisan pollster who consistently gets the margins right is more valuable than a transparent, neutral one that doesn’t. Methodological details like sample sizes are logistical challenges for pollsters but don’t tell us much about their end results.
Accuracy Over Everything
If accuracy is the goal, we need to ensure our scoring system reflects it. Under the current 538 methodology, pollsters appear to be rewarded for meeting specific criteria—many of which don’t directly correlate with accuracy. This ties into Goodhart’s Law, which warns that “when a measure becomes a target, it ceases to be a good measure.” Over time, any ranking system risks becoming meaningless if pollsters optimize for the ranking rather than for actual performance.
To rebuild trust in pollster rankings, we need a simpler, more intuitive approach—one focused squarely on accuracy, signal, and breadth. Leave the logistics to the pollsters and the partisanship debates to the political commentators.
Measuring Accuracy: Why Standard Metrics Fall Short
Accuracy in polling isn’t as simple as calculating the average error. To see why, let’s compare two hypothetical pollsters, Alice and Bob, in the context of the 2024 Presidential election.
The Case of Alice and Bob
Alice’s Results:
Polled four battleground states: Pennsylvania (PA), Wisconsin (WI), Michigan (MI), and Nevada (NV).
Predicted results consistently skewed by +3 points in favor of Harris.
Called 3 out of 4 races incorrectly.
Bob’s Results:
Polled the same four battleground states as Alice, plus South Dakota (SD).
In the battleground states:
Predictions were within 1 point of the actual results.
Called all 4 battleground states correctly.
In South Dakota:
Predicted Trump at 55.5% vs. Harris at 41.5% (an error of 15 points, as the actual result was Trump 63% - Harris 34%).
Intuitive vs. Metric-Based Accuracy
Most people would intuitively judge Bob to be the more accurate pollster:
He excelled in the high-stakes, close battleground states.
His large error in South Dakota, a non-competitive race, feels less consequential.
However, standard error metrics tell a different story. Because they heavily penalize outliers, Bob’s performance in South Dakota drags down his scores, making Alice appear more accurate overall. Here’s how they compare under traditional metrics:
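To make the comparison concrete, here is a rough sketch using illustrative per-state errors consistent with the scenario above (Alice misses by 3 points everywhere; Bob misses by 1 point in each battleground and by 15 in South Dakota). The exact figures are assumptions for illustration only:

```python
import math

# Hypothetical per-race margin errors (in points): Alice is off by 3 in every
# battleground; Bob is off by 1 in each battleground and by 15 in South Dakota.
alice_errors = [3, 3, 3, 3]        # PA, WI, MI, NV
bob_errors = [1, 1, 1, 1, 15]      # PA, WI, MI, NV, SD

def mae(errors):
    """Mean Absolute Error: the average size of the misses."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root Mean Square Error: misses are squared before averaging."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"Alice  MAE={mae(alice_errors):.2f}  RMSE={rmse(alice_errors):.2f}")  # 3.00 / 3.00
print(f"Bob    MAE={mae(bob_errors):.2f}  RMSE={rmse(bob_errors):.2f}")      # 3.80 / 6.77
```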
RMSE in particular harshly penalizes Bob for his one large error, even though his battleground performance was far better than Alice's.
A Weighted Approach to Accuracy
To better measure a pollster's accuracy, we need to emphasize races that matter. One way to achieve this is by weighting races based on their competitiveness, using the inverse of the race's spread as a weight. This ensures that tighter races carry more significance in the accuracy calculation.
Here’s how Alice and Bob fare when using this weighted approach:
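Here is a rough sketch of that calculation. The spreads below are approximate, illustrative figures rather than official results; each race's weight is simply 1 divided by its final spread:

```python
import math

# Approximate final spreads (in points), used purely for illustration.
spreads = {"PA": 2, "WI": 1, "MI": 1.5, "NV": 3, "SD": 29}

alice_errors = {"PA": 3, "WI": 3, "MI": 3, "NV": 3}
bob_errors = {"PA": 1, "WI": 1, "MI": 1, "NV": 1, "SD": 15}

def weighted_mae(errors):
    """Weighted Mean Absolute Error with weights of 1 / spread."""
    w = {r: 1.0 / spreads[r] for r in errors}
    return sum(w[r] * abs(e) for r, e in errors.items()) / sum(w.values())

def weighted_rmse(errors):
    """Weighted Root Mean Square Error with the same weights."""
    w = {r: 1.0 / spreads[r] for r in errors}
    return math.sqrt(sum(w[r] * e ** 2 for r, e in errors.items()) / sum(w.values()))

print(f"Alice  WMAE={weighted_mae(alice_errors):.2f}  WRMSE={weighted_rmse(alice_errors):.2f}")  # 3.00 / 3.00
print(f"Bob    WMAE={weighted_mae(bob_errors):.2f}  WRMSE={weighted_rmse(bob_errors):.2f}")      # ~1.19 / ~2.01
```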
Bob's Weighted Scores clearly reflect his superior performance in the critical battleground states.
Alice’s Scores remain unchanged because her errors were uniform across all races, competitive or not.
The key distinction between Weighted Root Mean Square Error (WRMSE) and Weighted Mean Absolute Error (WMAE) lies in how they handle outliers. WRMSE squares the errors, so it disproportionately penalizes large outliers, which in turn can punish pollsters for big misses in non-competitive races with wide margins. WMAE, on the other hand, avoids this issue by treating all errors linearly. By adopting WMAE as the primary metric for accuracy, we ensure that pollster rankings better reflect a pollster's true ability to provide reliable predictions in high-stakes races.
As such, we adopt WMAE as our Accuracy Metric.
Measuring Signal: What Value a Pollster Adds
Focusing solely on accuracy can inadvertently favor low-volume, niche pollsters. For instance, if we used Weighted Mean Absolute Error (WMAE) as our only metric, a pollster like CWS Research would top the rankings based on their single poll from the 2022 Texas Governor race. While their poll result was impressive, we need to consider a broader range of races and assess the influence each pollster has on collective insights.
Introducing the Signal Metric
One way to achieve a more comprehensive evaluation is by measuring the "signal" each pollster contributes to the aggregation of polls. Here's how it works:
Iterative Exclusion: For each race, we remove a pollster's data from the pool and calculate the average of the remaining polls.
Right-sizing the Pollster’s Impact: To avoid under-measuring signals in highly-polled races and over-measuring signals in lightly-polled races, we assume each pollster contributes 20% to the aggregate rather than computing a simple mean.
Calculating the Delta: We compute the difference—or delta—between this new average and the actual election result.
Assessing the Impact:
Positive Signal: If including the pollster's data moves the average prediction closer to the actual result, they receive a positive signal.
Negative Signal: If their inclusion shifts the average away from the actual result, they receive a negative signal.
Accumulating Signals: We repeat this process for each pollster across all races, summing their signals to gauge overall performance.
Applying Weightings: To prevent overemphasis on uncompetitive races, we apply weightings that give more significance to closely contested races.
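Putting those steps together, a minimal sketch of the calculation might look like this. The data layout and helper names are hypothetical; the only fixed ingredients are the 20% contribution share and the competitiveness weighting described above:

```python
def race_signal(pollster_margin, other_margins, actual_margin, share=0.20):
    """Signal earned by one pollster in one race.

    The 'without' aggregate is the simple average of everyone else's polls;
    the 'with' aggregate assumes this pollster contributes a fixed 20% share.
    Positive means including the pollster pulled the aggregate closer to the
    actual result; negative means it pushed the aggregate away."""
    without = sum(other_margins) / len(other_margins)
    with_pollster = (1 - share) * without + share * pollster_margin
    return abs(without - actual_margin) - abs(with_pollster - actual_margin)

def total_signal(races, share=0.20):
    """Sum a pollster's weighted signals across races. Each race is a dict
    holding the pollster's margin, the other pollsters' margins, the actual
    margin, and a competitiveness weight (e.g. 1 / spread)."""
    return sum(
        race["weight"] * race_signal(race["pollster"], race["others"],
                                     race["actual"], share)
        for race in races
    )
```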
This metric rewards pollsters who improve collective predictions. By measuring the unique contribution of each pollster, we discourage "herding"—the tendency to align with the consensus without adding new information. However, relying solely on the signal metric can lead to unintended consequences.
An Illustrative Example:
Imagine a scenario with ten pollsters assessing a particular race:
Majority Pollsters: Nine pollsters correctly predict that the Democratic candidate will win. However, they overestimate the margin by five points due to a consistent bias in favor of the Democrat.
Outlier Pollster: One Republican-leaning pollster incorrectly predicts a win for the Republican candidate. Despite being wrong about the outcome, this pollster's prediction brings the average margin closer to the actual result because it counteracts the collective bias of the others.
In this case, the Republican pollster, despite predicting the wrong winner, would receive a positive signal score because their input reduces the overall error in the aggregated prediction. This outcome highlights a limitation of the Signal Metric: it can reward incorrect predictions if they happen to adjust the average in the right direction.
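With rough, made-up numbers, the effect is easy to see:

```python
# Nine pollsters show the Democrat +10; the actual result is D +5.
# The Republican-leaning outlier shows R +2 (a margin of -2).
others_avg = sum([10] * 9) / 9                   # aggregate without the outlier = 10.0
with_outlier = 0.8 * others_avg + 0.2 * (-2)     # 20% share rule -> 7.6
signal = abs(others_avg - 5) - abs(with_outlier - 5)
print(round(signal, 2))  # 2.4 -> positive, even though the outlier called the wrong winner
```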
Accuracy remains paramount. We must continue to emphasize the Weighted Mean Absolute Error (WMAE) as the primary metric because it directly assesses how close a pollster's predictions are to the actual results.
Measuring Breadth: Balancing Accuracy and Participation in Polling
When evaluating pollster performance, it's crucial to consider not just how often they're correct but also the scope of their participation. Enter our final metric in redefining pollster rankings: the Breadth Metric.
What Is the Breadth Metric?
The Breadth Metric is straightforward—it counts the total number of races a pollster has correctly called. By focusing on the sheer number of accurate predictions, we balance both the volume of polls conducted and their correctness.
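As a minimal sketch, assuming each poll is recorded as a (predicted margin, actual margin) pair with the sign of the margin indicating the winner:

```python
def breadth(polls):
    """Count the races where the pollster called the winner, i.e. the
    predicted margin has the same sign as the actual margin. A predicted
    exact tie is not credited as a call."""
    return sum(1 for predicted, actual in polls if predicted * actual > 0)

# Example: three of these four polls back the eventual winner -> breadth of 3.
print(breadth([(2, 1), (-4, -7), (1, -2), (6, 3)]))  # 3
```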
Why Count the Number of Correctly Called Races?
Using the total number of correct calls as a metric addresses several challenges that arise when relying on percentages or simply counting the number of races polled:
Avoids Overvaluing Minimal Participation
The Pitfall of Percentages: If we used success rates based on percentages, a pollster who conducts just one poll and predicts it correctly would boast a 100% success rate. This could misleadingly place them above a pollster who accurately predicts 90 out of 100 races (a 90% success rate).
The Breadth Advantage: By focusing on the total number of races correctly called, we acknowledge consistent performance across a larger sample size.
Acknowledges the Challenge of Volume
Recognizing Increased Difficulty: The complexities of different regions, voter behaviors, and unforeseen events make high-volume polling a formidable task.
Rewarding Consistent Accuracy: By counting the total number of correct calls, we recognize pollsters who maintain high accuracy despite the increased difficulty.
Encourages Taking a Stand in Competitive Races
Moving Beyond Predicting Ties: In 2024, there was a perception that pollsters predicted ties in high-stakes races to hedge against being wrong. The Breadth Metric motivates pollsters to make clear calls in competitive races.
While the WMAE serves as a pollster’s accuracy metric, and the Signal Metric measures a pollster's unique contribution, the Breadth Metric adds another vital layer by:
Encouraging Volume with Accuracy: It rewards pollsters who not only participate widely but also maintain correctness across that breadth.
Promoting Comprehensive Data Collection: It benefits analysts and the public by increasing data availability across more races.
By integrating the Breadth Metric into our evaluation system, we're not just assessing how often pollsters get it right—we're acknowledging their commitment across the electoral spectrum.
Tying the Metrics Together
To provide a comprehensive assessment of pollster performance, we combine the three key metrics—Accuracy, Signal, and Breadth—into a final report card. Each metric is normalized across all pollsters to ensure fair comparisons. We then weight the metrics to calculate the Final Rating:
Accuracy: 60%
Signal: 20%
Breadth: 20%
This weighting emphasizes not only a pollster's precision in predicting outcomes but also their ability to improve collective insights and their commitment to covering a wide range of races.
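Here is a minimal sketch of that final combination. The min-max normalization is an assumption about how "normalized across all pollsters" is implemented, and WMAE is inverted so that a higher number is always better:

```python
# Final Rating weights from above: Accuracy 60%, Signal 20%, Breadth 20%.
WEIGHTS = {"accuracy": 0.60, "signal": 0.20, "breadth": 0.20}

def min_max(values, higher_is_better=True):
    """Normalize raw scores to the 0-1 range across all pollsters."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1 - s for s in scaled]

def final_ratings(wmae, signal, breadth):
    """Combine the three normalized metrics into a single Final Rating per
    pollster. WMAE is an error (lower is better), so it gets inverted."""
    acc = min_max(wmae, higher_is_better=False)
    sig = min_max(signal)
    brd = min_max(breadth)
    return [
        WEIGHTS["accuracy"] * a + WEIGHTS["signal"] * s + WEIGHTS["breadth"] * b
        for a, s, b in zip(acc, sig, brd)
    ]

# Example with three hypothetical pollsters (lists are aligned by pollster):
print(final_ratings(wmae=[1.2, 3.0, 2.1], signal=[4.0, -1.0, 2.5], breadth=[12, 30, 18]))
# -> approximately [0.80, 0.20, 0.51]
```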
Lookback Period
We concentrate on the most recent midterm and presidential elections. For instance, when assessing pollsters for the 2024 presidential election, we consider their results from the 2022 midterms onward, intentionally excluding their performance in the 2020 election.
While this approach might spark debate, we believe it's justified for several reasons:
Rapid Technological Advancements: Polling technologies and methodologies evolve quickly. Recent performance is more indicative of a pollster's current capabilities.
Relevance of Current Data: Focusing on the latest elections ensures that our assessment reflects the most up-to-date information, capturing recent shifts in voter behavior and polling environments.
Standard Practice: In many fields, individuals and organizations are judged based on their recent accomplishments rather than historical performance.
By limiting the lookback period to the most recent elections, we prioritize current proficiency over historical success.
Assigning Letter Grades
For the final grading, we categorize pollster scores into quintiles and assign letter grades accordingly:
A, B, C, D, F
To further distinguish performance within each quintile, we add modifiers:
The top third of each quintile receives a "+" (e.g., B+)
The bottom third receives a "-" (e.g., B-)
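A minimal sketch of the grading step, assuming pollsters are ranked by Final Rating and grades are assigned purely by rank position (quintile for the letter, thirds within the quintile for the modifier):

```python
LETTERS = ["A", "B", "C", "D", "F"]

def letter_grades(ratings):
    """Assign quintile-based letter grades with +/- modifiers to a list of
    Final Ratings (higher is better). Ties are broken arbitrarily here."""
    n = len(ratings)
    order = sorted(range(n), key=lambda i: ratings[i], reverse=True)
    grades = [None] * n
    for rank, idx in enumerate(order):
        quintile = rank * 5 // n             # 0 = top quintile, 4 = bottom
        within = rank * 5 - quintile * n     # position inside the quintile, scaled by n
        if 3 * within < n:                   # top third of the quintile
            modifier = "+"
        elif 3 * within >= 2 * n:            # bottom third of the quintile
            modifier = "-"
        else:
            modifier = ""
        grades[idx] = LETTERS[quintile] + modifier
    return grades

# With 15 pollsters each quintile holds three, so the pattern is X+, X, X-:
print(letter_grades([0.95, 0.90, 0.86, 0.82, 0.78, 0.74, 0.70, 0.65,
                     0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30]))
# ['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D', 'D-', 'F+', 'F', 'F-']
```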
By integrating these metrics and weighting them thoughtfully, we establish a balanced and meaningful standard for evaluating pollsters—one that recognizes accuracy, rewards valuable contributions, and encourages broad participation.