Personal scientists prefer quantitative data whenever possible, which is why you’ll see us acquiring the latest gizmos and diagnostics for our own testing. But how much testing is enough?
This week we present one answer to that question, plus links to other news and information you may have missed.
How much data should you keep?
Modern computer storage is both a blessing and a curse: it makes it easy to store years, decades, even a lifetime of your personal data. Many personal scientists keep this data in a huge spreadsheet that they laboriously update, hoping to glean that last micron of significance out of their testing. But how much is “enough”?
Answer: Not much.
In a 2016 study, Stanford researchers looked at five years of data from 75,000 patients who were admitted to the hospital and whose subsequent medical and recovery histories were carefully documented. They ran a machine learning model against the data to predict which treatments worked and which didn’t, and then retrospectively varied the amount of input to each prediction to see how much the extra data mattered.
Bottom line: four months is about all you need. You’ll get slightly better predictions with a little more data, but the performance quickly levels off and eventually drops. In fact, more (older) data can actually hurt your model by twisting it with information that is no longer relevant.
This makes sense when you think about it. If you hope to improve your sleep, for example, data from a few years ago will be contaminated by all the differences between the you that you are today and the you that you were then.
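If you want to try the same idea on your own tracking data, here is a minimal sketch in Python. It is not the Stanford team's method or model. It is a toy version that assumes a made-up daily metric that slowly drifts over time, and it predicts each day from the plain average of the previous N days. The function names, the drift rate, and the window lengths are all our own illustrative assumptions.

```python
import random
from statistics import mean


def make_series(n_days=720, drift_per_day=0.01, noise=1.0, seed=0):
    """Synthetic daily metric (an assumption, not real data):
    a slow upward drift plus random day-to-day noise."""
    rng = random.Random(seed)
    return [drift_per_day * day + rng.gauss(0, noise) for day in range(n_days)]


def backtest_error(series, window, start):
    """Mean absolute error when each day from `start` onward is predicted
    by the average of the `window` days immediately before it."""
    errors = []
    for day in range(start, len(series)):
        history = series[day - window:day]
        prediction = mean(history)
        errors.append(abs(series[day] - prediction))
    return mean(errors)


def compare_windows(series, windows):
    """Score every window length on the same set of days so the
    comparison is fair."""
    start = max(windows)
    return {w: backtest_error(series, w, start) for w in windows}


series = make_series()
results = compare_windows(series, [7, 30, 120, 360])
for window, err in results.items():
    print(f"{window:>4} days of history -> mean abs error {err:.2f}")
```

With a drifting metric like this one, the longest window should score worse than the moderate ones, because the old values pull the average away from where you are today. To run it on real data, replace `make_series()` with a list of your own daily measurements and see where the error stops improving.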
The Stanford study includes more tidbits:
“Most item types may be ignored”: Although the researchers had access to the entire electronic health record for these patients, the vast majority of the data was irrelevant to treatment outcomes. Vital signs data, like body temperature or blood pressure, won’t add anything to the model predictions for conditions whose treatment doesn’t depend on them. In other words, be careful not to complicate your models needlessly with extra data that won’t matter.
Beware of subtle methodological changes: The longer the span of data you collect, the more likely your data will be affected by differences in measurement. If you upgraded your Apple Watch, for example, the new data might not be comparable to the old.
None of these results is terribly surprising. Ultimately this is just a variation on the old “garbage-in-garbage-out” truth that we already know. Don’t waste time or effort on data that doesn’t matter — including data older than a few months.
Links worth your time
In Taiwan and much of Asia, 90% of high-schoolers need glasses, up from 10% in the 1950s. The reason? New evidence suggests it’s related to how much time kids spend outdoors. (Wired: “The World Is Going Blind. Taiwan Offers a Warning, and a Cure”)
Vinay Prasad’s latest appearance on the Econtalk podcast to discuss cancer screening includes an important nuanced perspective on preventative tests. We need to distinguish between tests that just provide information, he says, and those that will actually change the outcome. Surprisingly, many common tests like mammograms, colonoscopies, and prostate exams come up short — you may not be better off.
See this tweet from Dan Go for my favorite longevity test. A study of 2000 adults aged 51-80 shows that people who can do this exercise will live at least 6 years longer. If you can’t do this, better work on it.
About Personal Science
Although this newsletter is published weekly, most of our posts are intended to be timeless. Please check our archives for additional ideas on how to use science for personal, rather than professional, reasons.
And please let us know if you have other topics you’d like us to cover.
4 months seems a bit short for data that may have seasonal trends? Also, data that is more sparse (e.g. blood tests, for which I may have just one or two data points per year) may not be "predictive", but still provides a valuable baseline.