A Harvard biostatistician is rethinking plans to use Apple Watches as part of a research study after finding inconsistencies in the heart rate variability data collected by the devices. Because Apple tweaks the watch’s algorithms as needed, the data from the same time period can change without warning.
“These algorithms are what we would call black boxes — they’re not transparent. So it’s impossible to know what’s in them,” JP Onnela, associate professor of biostatistics at the Harvard T.H. Chan School of Public Health and developer of the open-source data platform Beiwe, told The Verge.
Onnela doesn’t usually include commercial wearable devices like the Apple Watch in research studies. For the most part, his teams use research-grade devices that are designed to collect data for scientific studies. As part of a collaboration with the department of neurosurgery at Brigham and Women’s Hospital, though, he was interested in the commercially available products. He knew that there were sometimes data issues with those products, and his team wanted to check how severe they were before getting started.
So, they checked in on heart rate data his collaborator Hassan Dawood, a research fellow at Brigham and Women’s Hospital, exported from his Apple Watch. Dawood exported his daily heart rate variability data twice: once on September 5th, 2020, and a second time on April 15th, 2021. For the experiment, they compared data collected over the same stretch of time, early December 2018 to September 2020.
Because the two exports covered the same time period, the two datasets should theoretically have been identical. Still, Onnela says he was expecting some differences. The “black box” of wearable algorithms is a consistent challenge for researchers: rather than showing the raw data collected by a device’s sensors, the products usually only let researchers export information after it has been analyzed and filtered through an algorithm of some kind.
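As a rough sketch of the kind of check involved (this is not the team’s actual code, and the file names and column layout here are assumptions), comparing two daily HRV exports in Python might look something like this:

```python
import pandas as pd

# Load the two exports; each is assumed to hold one row per day, with a
# "date" column and an "hrv_ms" column (daily heart rate variability in ms).
sep_export = pd.read_csv("hrv_export_2020-09-05.csv", parse_dates=["date"])
apr_export = pd.read_csv("hrv_export_2021-04-15.csv", parse_dates=["date"])

# Line the exports up on the days they both cover.
merged = sep_export.merge(apr_export, on="date", suffixes=("_sep2020", "_apr2021"))

# If nothing changed on the algorithm side, every daily value should match.
merged["diff_ms"] = (merged["hrv_ms_apr2021"] - merged["hrv_ms_sep2020"]).abs()
mismatched = merged[merged["diff_ms"] > 0]

print(f"{len(mismatched)} of {len(merged)} overlapping days differ")
print(f"mean absolute difference: {merged['diff_ms'].mean():.2f} ms")
```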
Companies change their algorithms regularly and without warning, so the September 2020 export may have included data analyzed using a different algorithm than the April 2021 export. “What was surprising was how different they were,” he says. “This is probably the cleanest example that I have seen of this phenomenon.” He published the data in a blog post last week.
Apple did not respond to a request for comment.
It was striking to see the differences laid out so clearly, says Olivia Walch, a sleep researcher who works with wearable and app data at the University of Michigan. Walch has long advocated for researchers to use raw data, pulled directly from a device’s sensors rather than filtered through its software. “It’s validating, because I get on my little soapbox about the raw data, and it’s nice to have a concrete example where it would really matter,” she says.
Constantly changing algorithms make it almost prohibitively difficult to use commercial wearables for sleep research, Walch says. Sleep studies are already expensive. “Are you going to be able to strap four Fitbits on someone, each running a different version of the software, and then compare them? Probably not.”
Companies have incentives to change their algorithms to make their products better. “They’re not super incentivized to tell us how they’re changing things,” she says.
That’s a problem for research. Onnela compared it to tracking body weight. “If I wanted to jump on a scale every week, I should be using the same scale every time,” he says. If that scale was tweaked without him knowing about it, the day-to-day changes in weight wouldn’t be reliable. For someone who has just a casual interest in tracking their health, that may be fine — the differences aren’t going to be major. But in research, consistency matters. “That’s the concern,” he says.
Someone could, for example, run a study using a wearable and come to a conclusion about how people’s sleep patterns changed based on adjustments in their environment. But that conclusion might only be true for that particular version of the wearable’s software. “Maybe you would have a completely different result if you’d just been using a different model,” Walch says.
Dawood’s Apple Watch data isn’t from a study and is just one informal example. But it shows the importance of being cautious with commercial devices that don’t allow access to raw data, Onnela says. It was enough to make his team back away from plans to use the devices in studies. He thinks commercial wearables should only be used if raw data is available, or — at minimum — if researchers are able to get a heads-up when an algorithm is going to change.
There might be some situations where wearable data could still be useful. The heart rate variability information showed similar trends at both time points: the data went up and down at the same times. “If you’re caring about stuff on that macro scale, then you can make the call that you’d keep using the device,” Walch says. But if the specific heart rate variability value calculated for each day matters to a study, the Apple Watch may be riskier to rely on, she says. “It should give people pause about using certain wearables, if the rug runs the risk of being ripped out from underneath their feet.”
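To put that macro-versus-micro distinction in concrete terms, a hypothetical extension of the earlier sketch (same assumed files and columns) could check whether the two exports track each other directionally even where the daily numbers disagree:

```python
import pandas as pd

# Same hypothetical files and columns as the earlier sketch.
sep_export = pd.read_csv("hrv_export_2020-09-05.csv", parse_dates=["date"])
apr_export = pd.read_csv("hrv_export_2021-04-15.csv", parse_dates=["date"])
merged = sep_export.merge(apr_export, on="date", suffixes=("_sep2020", "_apr2021"))

# Macro scale: do the two exports rise and fall together?
trend_corr = merged["hrv_ms_sep2020"].corr(merged["hrv_ms_apr2021"])

# Micro scale: how often do the daily values agree exactly?
exact_match = (merged["hrv_ms_sep2020"] == merged["hrv_ms_apr2021"]).mean()

print(f"day-to-day correlation between exports: {trend_corr:.2f}")
print(f"share of days with identical values: {exact_match:.1%}")
```

A high correlation alongside a low exact-match rate would be the pattern Walch describes: workable for macro-scale trends, shaky for any analysis that leans on individual daily values.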