
Data Integrity

In remote patient monitoring systems, data integrity is crucial. Delivering wrong data could result in lack of necessary treatment, or in administering incorrect treatment.

Straightforward storage, transmission and display of data in itself rarely creates problems, as long as basic precautions are taken. In fact, under the current EU Medical Devices Directive, such functionality does not even qualify a system as a medical device, although the situation is set to change with the new Medical Device Regulation coming into force in May 2020. Storage systems and transmission protocols typically have various types of built-in data integrity protection anyway, and more so if you use encryption – which you should do, since there is rarely much additional cost to it if you design for it from the start. Backups are one potential point of failure: everyone makes backups, but sometimes people neglect to continuously verify that they are working. Still, not having working backups might cause an extinction-level event for your business, but it would be less of a safety issue, since in a situation where lack of backups becomes a problem, you would already be very aware that things are going wrong.

The worst kind of data integrity problem is the one you have no way of detecting before the damage is done. Maybe data is associated with the wrong patient ID; maybe a format conversion fails in certain cases; maybe in some failure modes data is silently discarded. You need to have monitoring and warning systems in place to ensure you detect the problem before it has a chance to affect patient health.

Causes

It is helpful to consider potential causes for data integrity problems, when deciding on a mitigation strategy. Some more common ones are listed below. Of course, not all of these apply to every system.

Security Threats

Security threats affecting integrity include mainly ransomware attacks, which target the data directly, and collateral damage from other attacks. An attacker using your server as part of their illegal operations might cause data to be lost simply because their use of the server disturbs its normal operation. For example, when reconfiguring the server to install their own software, they might accidentally replace some system components with versions that are incompatible with the server software. Or your cloud provider might detect that your server is being used to distribute malware and shut it down.

In the future, there might also be attacks where patients manipulate their own data. For example, insurance companies might set their premiums based on the monitoring results, or organize competitions based on adherence data. In both cases there would be a financial benefit from being able to alter the data. However, it is very difficult to prevent data from being manipulated by the patients themselves – after all, they can simply put their monitoring device on someone else, and in any case they have physical possession of the monitoring device which makes it hard to secure completely. For the time being this issue is largely ignored because it is difficult to mitigate, occurs rarely, and damage is limited. Also, remote patient monitoring is still far more trustworthy than traditional home monitoring where the patient records results on pen and paper.

Human Errors

Software engineers sometimes think that human errors are not their problem. However, medical device regulation requires us to consider foreseeable human errors as risks and mitigate those risks. For example:

  • The user may set the current date wrong in their mobile phone, causing wrong timestamps in data
  • A nurse may give the patient a loaner phone that has already been allocated to another patient, causing the data to be assigned to the wrong patient
  • If the name of the currently shown patient is not prominently displayed in the healthcare professional’s user interface, the user may lose track of which patient they are viewing and e.g. change medication for the wrong patient, causing wrong medication dosage

It is actually not enough to consider only human error; we must also consider human intransigence. If the system has safety features that can be bypassed in an emergency, we must consider the possibility that people bypass them merely for convenience. If the system is not suitable for a certain type of use, we must consider the possibility that people will nevertheless attempt to apply it for that purpose. Fortunately, we do not quite have to add the medical equivalent of a warning against drying your cat in the microwave; users of medical systems are professionals, so we can assume a certain amount of competence. Of course, if the system is also used by patients, we must be more careful.

Hardware Errors

Networks and storage media can and will fail. These risks can typically be fully mitigated by following best practices in data storage and transfer: checksums, backups, and redundancy.

For example, in case of time-critical data transfers, such as health-related alarms, redundant networks might be needed to ensure that failure of the Wide Area Network (WAN) does not cause the system to malfunction, e.g. ADSL + cellular, or a dual SIM solution (with SIM cards from two different operators) if there is no fixed network available. Similarly, in case of storage, you should use a database clustering configuration that stores redundant copies of data in multiple physical locations.

Software Errors

All kinds of software errors might result in data integrity loss. Adequate testing will catch most, but some specific types of errors are worth extra attention.

Patient Identity Errors

This is probably one of the most insidious errors, because the erroneous values are completely valid – they are just associated with the wrong patient. In practice there is no way to detect the error after the fact, unless you have mitigations in place specifically for this purpose.

You might think this could only happen because of some really egregious programming error. Unfortunately, that is not always the case. For example, concurrency problems might cause data to be mixed up when several threads (each handling a different patient) write to the same data sink. Also, the problem might not be in the data transfer phase at all; perhaps the identities of two patients can get mixed up during the patient provisioning phase.

Format Conversion Errors

Transfer of data from a measurement device (such as a blood pressure monitor) to the cloud usually entails some format conversions. For example, numbers are converted from internal binary representation to a transfer format (such as FHIR), and vice versa. Sometimes the numbers are also converted from one measurement system to another, e.g. from imperial units to metric units, or from local date and time to UTC. The best way to avoid mistakes is to use mature libraries for the conversions, and avoid conversions which cannot be done without a large amount of custom code. For example, using a hand-crafted on-the-wire format that replaces all the attribute names with a short identifier (1=weight, 2=systolic, 3=diastolic, etc.) might save you a few bytes but introduces the possibility of mixing up values during marshalling and unmarshalling.
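The risk with such a hand-crafted format can be sketched in a few lines. The field table below is hypothetical; the point is that both sides must keep an out-of-band mapping in sync, and a stale copy on one side produces valid-looking but silently swapped values:

```python
# Hypothetical hand-crafted wire format: attribute names replaced by
# short numeric identifiers to save bytes (1=weight, 2=systolic, 3=diastolic).
FIELD_IDS = {"weight": 1, "systolic": 2, "diastolic": 3}
ID_FIELDS = {v: k for k, v in FIELD_IDS.items()}

def marshal(measurement: dict) -> dict:
    return {FIELD_IDS[name]: value for name, value in measurement.items()}

def unmarshal(wire: dict) -> dict:
    return {ID_FIELDS[field_id]: value for field_id, value in wire.items()}

reading = {"systolic": 128, "diastolic": 82}
assert unmarshal(marshal(reading)) == reading  # round trip works when tables match

# The danger: if the receiving side was built from a stale copy of the table
# (say 2=diastolic, 3=systolic), unmarshalling still "succeeds" -- the values
# are simply attached to the wrong attributes, and no error is ever raised.
stale_table = {2: "diastolic", 3: "systolic"}
mixed_up = {stale_table[i]: v for i, v in marshal(reading).items()}
# mixed_up is {"diastolic": 128, "systolic": 82} -- valid-looking, silently wrong
```

A self-describing format with named attributes (like FHIR) makes this particular mix-up impossible by construction.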

Timestamp Errors

Remote patient monitoring produces time series data. The time part can be very important; for example, a sudden increase in weight might indicate accumulation of fluid, which is a symptom of congestive heart failure. If the timestamps are wrong, the weight increase might not register as sudden.

Timestamps can be wrong for a variety of reasons. Often the time is recorded by the measurement device, and its internal clock might not be set correctly. Or the time might be taken from the user’s mobile phone, and the user might have set the phone’s clock incorrectly. Or, as in the previous example, the conversion of the local time used by the phone to UTC used by the server might be incorrect, e.g. because the user was moving between time zones and had not set the correct time zone yet.

In some cases it is impossible to get truly accurate timestamp information, but if the measurement data is transmitted from the measurement device to the mobile device soon after the measurement, you can obtain the time from the network on the mobile device and use it for at least a sanity check.
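Such a sanity check could look something like this sketch, where the tolerance is an assumption to be tuned per system, and the reference time is assumed to come from a trusted network source:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance: how far the device clock may drift from the
# network-derived reference time before we flag the timestamp as suspect.
MAX_CLOCK_SKEW = timedelta(minutes=5)

def timestamp_plausible(measurement_time: datetime, reference_time: datetime) -> bool:
    """Return True if the device timestamp is close to the reference time.

    Assumes the measurement was transferred to the mobile device shortly
    after it was taken, so its timestamp should be near the reference.
    """
    skew = reference_time - measurement_time
    return -MAX_CLOCK_SKEW <= skew <= MAX_CLOCK_SKEW

now = datetime(2020, 3, 1, 12, 0, tzinfo=timezone.utc)
assert timestamp_plausible(now - timedelta(minutes=2), now)   # normal transfer delay
assert not timestamp_plausible(now - timedelta(days=30), now) # device clock badly off
```

A failed check does not tell you which clock is wrong, but it lets you flag the measurement for review instead of silently storing a misleading time series.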

Combination of Errors

Software testing tends to consider errors in isolation, i.e. each error situation is tested individually using some typical test data that triggers the error. However, combinations of errors are not always tested due to the number of test cases needed, and the rarity of occurrence. For example, when there is some problem with the data, the normal error handling might recover and salvage the data. However, if there is simultaneously a problem with the network, the code that fixes the network problem might change the control flow so that the code that fixes the data problem is skipped, and the data is lost. This kind of rarely surfacing programming error is difficult to prevent and can lurk undetected for a long time if there is no system in place to detect data loss.

Solutions

IEC 62304 mandates a mitigation for each identified risk, and as you can see, data integrity problems are always a risk. Below are some possible solutions.

Audit Trail Verification

Your system should in any case produce a detailed audit trail so that you are able to afterwards accurately reconstruct the sequence of events that led to an incident. However, if the audit trail is detailed enough, it can be used to detect data integrity problems. You just need to turn around the typical log monitoring logic which looks for abnormal events in the logs. Instead, you need to verify that the logged events follow the normal, expected processing logic. Once a new measurement is logged for the first time (on the mobile device), you need to follow the audit trail that tracks the processing of that data, until you see the log output that signifies the data was successfully stored in the backend database.

To enable this kind of verification, you need to introduce the concept of a processing flow identifier. Whenever a tracked processing flow starts, you need to generate a unique identifier that is then passed on throughout the processing flow, and is included in all log output. This allows you to easily track the processing flow from start to finish.
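As a minimal sketch of the idea, using Python's standard logging module (the event names and log format are illustrative, not a prescribed audit trail schema):

```python
import logging
import uuid

# Include the flow identifier in every log line via the "extra" mechanism.
logging.basicConfig(format="%(levelname)s flow=%(flow_id)s %(message)s")
log = logging.getLogger("measurement")
log.setLevel(logging.INFO)

def start_processing_flow() -> str:
    """Generate the unique identifier that accompanies one measurement
    through every processing step and every log line."""
    return uuid.uuid4().hex

def handle_measurement(payload: dict) -> str:
    flow_id = start_processing_flow()
    extra = {"flow_id": flow_id}
    log.info("measurement received", extra=extra)
    # ... validation, format conversion, upload to backend ...
    log.info("stored in backend database", extra=extra)
    return flow_id
```

Every log line between "received" and "stored" carries the same identifier, so the verification logic can later group events by flow and confirm each one completed.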

The nice thing about audit trail verification is that it builds on systems that are needed in any case for other purposes. Audit trails are needed for access auditing, security and incident management purposes. You need a separate log server in order to guarantee log immutability, but it also ensures that a failure in the main system cannot disable the audit trail verification system and vice versa. The only extra functionality is really the reverse monitoring logic, and the processing flow identifiers.
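The reverse monitoring logic itself can be quite simple. A sketch, assuming log events have already been parsed into records with a flow identifier and an event name (both names are assumptions, not a real log schema):

```python
# "Reverse" monitoring: instead of looking for abnormal events, verify that
# every started flow reaches its terminal event. Flows that started but never
# finished within the monitoring window indicate possible data loss.

def find_incomplete_flows(events: list[dict]) -> set[str]:
    started, finished = set(), set()
    for event in events:
        if event["event"] == "received":
            started.add(event["flow_id"])
        elif event["event"] == "stored":
            finished.add(event["flow_id"])
    return started - finished

events = [
    {"flow_id": "a1", "event": "received"},
    {"flow_id": "a1", "event": "stored"},
    {"flow_id": "b2", "event": "received"},  # never stored -- raise an alert
]
assert find_incomplete_flows(events) == {"b2"}
```

In practice you would run this periodically over a sliding window, with a grace period long enough to cover normal processing delays.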

Digital Signatures

Digital signatures are a well-known mechanism for ensuring data integrity. The most secure way is to generate a patient- and device-specific PKI key-pair in the mobile phone (or gateway), and use the private key to sign the data. If the private key never leaves the mobile phone / gateway, it would be impossible for someone to falsify data later even if they had access to the database. This does require that the patient’s public key is stored in a secure key vault on server side. If someone were able to replace the public key associated with the patient, they would also be able to replace the patient’s data (assuming that the patient is no longer using that key pair for new data, e.g. because they have a new mobile phone / gateway). The key vault must therefore be completely isolated from the rest of the system and have separate and strict security controls.

There are a few weaknesses with digital signatures though:

  1. They cannot detect loss of data. You can mitigate this issue by assigning a steadily increasing sequence number to each new piece of data; data loss can then be detected from missing sequence numbers.
  2. You need to establish a canonical data format that is used for calculating the checksum or signature. However, if the format used in actual processing is different, there might be a format conversion error that is not apparent when using the canonical format.
  3. Handling of derivative data gets complex. For example if the system detects atrial fibrillations (afibs) from ECG data, and stores the detected afibs, how should the checksum/signature and sequence numbering of that data work? You can solve this e.g. by requiring the number of detected afibs to be stored for each ECG data block, and by signing the afib data with a patient-specific private key stored at the server, but this increases complexity. For example, to keep the server-side patient private key safe, you would need to create some kind of signing service.
  4. Sometimes you need to correct data afterwards and you need to establish a procedure for that (e.g. signing by an authorized administrator).
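The sequence-number mitigation from point 1 is straightforward to sketch. Assuming numbering starts at 1 and increases by one per measurement:

```python
def missing_sequence_numbers(received: list[int]) -> set[int]:
    """Detect data loss from gaps in a steadily increasing sequence number.

    Assumes numbering starts at 1, so the highest number seen tells us how
    many measurements should exist in total.
    """
    if not received:
        return set()
    return set(range(1, max(received) + 1)) - set(received)

assert missing_sequence_numbers([1, 2, 4, 5]) == {3}  # measurement 3 was lost
assert missing_sequence_numbers([1, 2, 3]) == set()   # no gaps
```

Note that this cannot detect loss of the most recent measurements (a gap only appears once a later number arrives), so it complements rather than replaces end-to-end audit trail verification.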

You can simplify things a little by not requiring signing keys to be secure. In that case there would be no point in public key cryptography either; a Hash-based Message Authentication Code (HMAC, e.g. HMAC-SHA-256) that is calculated using both the data and a patient-specific key is sufficient. If the signing keys are not protected, the solution only protects against accidental changes, but that may be sufficient for you.
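The HMAC variant fits in a few lines with Python's standard library; the key and record below are illustrative:

```python
import hashlib
import hmac

def sign(data: bytes, patient_key: bytes) -> str:
    """Compute an HMAC-SHA-256 tag over the canonical form of the data."""
    return hmac.new(patient_key, data, hashlib.sha256).hexdigest()

def verify(data: bytes, patient_key: bytes, tag: str) -> bool:
    # compare_digest avoids timing side channels -- cheap insurance even
    # when the keys are not treated as secrets.
    return hmac.compare_digest(sign(data, patient_key), tag)

key = b"patient-1234-key"  # hypothetical patient-specific key
record = b'{"systolic":128,"diastolic":82}'
tag = sign(record, key)
assert verify(record, key, tag)
assert not verify(b'{"systolic":129,"diastolic":82}', key, tag)  # any change is detected
```

Note that the `data` passed in must already be in the canonical format discussed in point 2 above, otherwise two semantically identical records can produce different tags.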

Immutable Data Monitoring

Measurement data is usually immutable. There might be corrections to it on occasion, but that typically happens very rarely. So once the data is in the database, any changes to it should be carefully controlled and monitored for.

Digital signatures are one way of accomplishing this, but implementation can be much simpler when you are only looking to detect any change to the data, rather than ensure that only the patient can generate the data. Commercial data integrity protection tools can likely be used. The downside is a lower level of protection, e.g. the data is vulnerable while en route to the database.
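The core of such change detection is simply recording a digest of each record at write time and periodically re-verifying it. A minimal in-memory sketch (a real system would persist the baseline digests separately from the data they protect):

```python
import hashlib

def digest(record: bytes) -> str:
    return hashlib.sha256(record).hexdigest()

baseline = {}  # record id -> digest captured when the record was first written

def on_write(record_id: str, record: bytes) -> None:
    baseline[record_id] = digest(record)

def changed_records(current: dict[str, bytes]) -> set[str]:
    """Periodic check: flag records whose content no longer matches the
    digest captured at write time."""
    return {rid for rid, rec in current.items() if digest(rec) != baseline.get(rid)}

on_write("m1", b"weight=80.2")
on_write("m2", b"weight=80.5")
# Later, m2 has been altered in the database:
assert changed_records({"m1": b"weight=80.2", "m2": b"weight=81.5"}) == {"m2"}
```

Unlike a patient-held signing key, this only tells you that the stored data changed after it arrived, which is exactly the scope of immutable data monitoring.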

Sanity Checking

Sanity checking only detects a subset of data integrity problems, but it is good programming practice in any case. For example, a patient’s age should never be less than zero or greater than 150. Systolic or diastolic blood pressure should never be less than zero or greater than 400 (although if you have anything above 180 systolic or 110 diastolic you are probably in need of emergency medical care, so much lower upper bounds would also be justified).
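A table-driven check keeps the limits in one place and easy to review. The limits below are taken from the examples above; tune them for your own system:

```python
# Plausible value ranges per field (inclusive). Illustrative values only --
# a real system would review these limits with clinical experts.
LIMITS = {
    "age": (0, 150),
    "systolic": (0, 400),
    "diastolic": (0, 400),
}

def sanity_check(field: str, value: float) -> bool:
    """Return True if the value falls within the plausible range for the field."""
    low, high = LIMITS[field]
    return low <= value <= high

assert sanity_check("age", 47)
assert not sanity_check("age", -1)
assert not sanity_check("systolic", 612)  # likely a unit or parsing error
```

A failed check should trigger review rather than silent rejection: the out-of-range value may itself be the symptom of a format conversion error worth investigating.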

Versioning

Once you have detected an error, obviously you want to correct it, but it is not always that simple. Patient data regulation often requires (obviously, this depends on your jurisdiction) that changes to patient data do not delete the old data. In other words, you need data versioning. This is also very good from a data integrity point of view, because changes caused by human or software errors can easily be reversed without having to manually import the old data from backups. Of course, versioning does not help if data is lost or corrupted before it is written to the database for the first time.

The requirement for data versioning means that you should consider choosing a database that supports versioning natively. Then you can more easily guarantee that data is immutable. If you implement versioning on top of a non-versioning database such as MongoDB, you must carefully engineer the versioning layer so that a software error cannot cause it to be accidentally circumvented.
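The essence of such a versioning layer is that writes only ever append. A minimal in-memory sketch of the idea (a real implementation would sit on top of the database and enforce this at the storage layer):

```python
from collections import defaultdict
from typing import Optional

class VersionedStore:
    """Append-only store: every change creates a new version, nothing is
    ever overwritten or deleted."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> list of versions, oldest first

    def write(self, key: str, value: dict) -> int:
        self._versions[key].append(value)
        return len(self._versions[key])  # version number, starting at 1

    def read(self, key: str, version: Optional[int] = None) -> dict:
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version - 1]

store = VersionedStore()
store.write("patient-1/weight/2020-01-05", {"kg": 80.2})
store.write("patient-1/weight/2020-01-05", {"kg": 82.0})  # a correction
assert store.read("patient-1/weight/2020-01-05") == {"kg": 82.0}          # latest wins
assert store.read("patient-1/weight/2020-01-05", version=1) == {"kg": 80.2}  # old data preserved
```

The critical property is that no code path exposes an update-in-place or delete operation, so a software error elsewhere cannot accidentally destroy history.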

Buffering

As already mentioned, data versioning does not help if the data never makes it to the database. Data buffered on the mobile phone or gateway must not be deleted before the server has confirmed that the data has been successfully written to the database. It is not necessarily enough that the data has been written to a persistent message queue such as Kafka or RabbitMQ, because subtle software errors might still cause the data to be lost before it can be written to the database.

In a simple architecture where the data upload REST request handler directly accesses the database, this is trivial, but highly scalable systems may have a processing pipeline that does not immediately process the data. In such systems the data upload request handler may instead return sequence number information for data that has already been written to the database, allowing the mobile phone (or gateway) to remove that data from its buffer.
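The mobile-side contract can be sketched as follows, assuming the server acknowledges a highest fully persisted sequence number:

```python
class UploadBuffer:
    """Client-side buffer: measurements stay buffered until the server
    confirms, by sequence number, that they are safely in the database."""

    def __init__(self):
        self._pending = {}  # sequence number -> measurement

    def add(self, seq: int, measurement: dict) -> None:
        self._pending[seq] = measurement

    def acknowledge(self, stored_up_to: int) -> None:
        """Server reported everything up to this sequence number is in the
        database -- only now is it safe to drop the local copies."""
        for seq in [s for s in self._pending if s <= stored_up_to]:
            del self._pending[seq]

    def pending(self) -> list[int]:
        return sorted(self._pending)

buf = UploadBuffer()
buf.add(1, {"weight": 80.2})
buf.add(2, {"weight": 80.5})
buf.acknowledge(1)          # server has persisted #1; #2 is still in flight
assert buf.pending() == [2]
```

Note that acknowledgement is tied to the database write, not to the upload request succeeding or the data reaching a message queue.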

Of course, buffering does not help if the data was corrupted rather than lost. Note also that writing data to the database makes the data safe only if the database clustering configuration stores redundant copies of data in multiple physical locations. If you are relying on backups to keep the data safe, there is a time window between writing data to the database and the time of the next backup during which the data is vulnerable. In theory you could mitigate the risk by not reporting data sequence numbers as completed to the mobile phone or gateway before the data has been backed up. However, you would then be implementing an inferior (and probably buggy) solution to a problem that database clustering already solves.

User Interface

The preferred way to handle human error is to design the user interface and processes in such a way that mistakes cannot happen. If that is not possible, then the probability of human error should be minimized, and the probability of discovering an error early should be maximized. Sometimes this can be at odds with usability, however. For example:

  • If the user has to enter the social security number for a patient in order to make changes, it is very unlikely that they will enter data for the wrong patient, but usability will suffer.
  • If the mobile phone application always shows the name of the assigned patient when it is started, it is very likely that the patient will notice if they have been given a loaner phone that was assigned to a different patient.

You will have to weigh the pros and cons of each solution carefully, and verify your solutions through user testing.

Conclusion

Being able to guarantee data integrity is an essential part of making medical systems safe to use. Protecting against malicious actors and hardware failure is not enough. You also need to identify where human and software error poses risks to data integrity, and mitigate those risks. Multiple risk controls will be needed, and you have to review the available solutions and select the ones appropriate for your system and risk profile.
