Cardiovascular disease is the leading cause of mortality worldwide, accounting for about a third of annual deaths, and patients’ medical care produces a large amount of data. DataTools4Heart will design methods to reuse such data to facilitate research and improve the conditions of cardiovascular patients. The DataTools4Heart toolbox will be designed, tested, and implemented in several countries while ensuring compliance with all legal (e.g., privacy) constraints in the cardiology domain.
DataTools4Heart aims to improve the lives of patients suffering from cardiovascular diseases by developing a comprehensive, federated, privacy-preserving toolbox for data reuse in cardiology. The tools include a platform that will securely access data from different hospitals and transform them into their digital counterparts (synthetic data) to be used by researchers and clinicians across the world. Access to large-scale multi-source cardiology data will be facilitated by virtual assistants.
Drawing on the combined capacity of its consortium partners, DataTools4Heart will focus on unlocking currently inaccessible cardiovascular health data. DataTools4Heart responds to the European Society of Cardiology (ESC) call for a major shift towards integrative data-driven approaches to develop personalised cardiovascular medicine. Multi-site federated use of health data will be made possible by contributions from clinical centres in seven European countries, namely Spain, the Netherlands, the United Kingdom, Italy, Sweden, Romania, and the Czech Republic. Together they constitute a representative sample of the European healthcare landscape and will contribute to the creation of the European Health Data Space.
DataTools4Heart will create a comprehensive cardiology data toolbox for clinicians, researchers, and data scientists. The tools will support data ingestion and harmonisation, Natural Language Processing in multiple languages, federated machine learning, and data synthesis. Virtual assistants will help users navigate large multi-source cardiology data while adhering to European regulations and data standards.
Data ingestion and harmonisation
DataTools4Heart will develop a common data extraction tool to improve metadata and data interoperability while addressing data heterogeneity across European regions and cardiology units. This tool will be developed and validated through a modular and flexible Data Ingestion Suite deployed at 7 European sites. Interoperability of the Data Ingestion Suite will be ensured with at least 4 standards-based data models (HL7 V2, HL7 CDA, OMOP CDM, and i2b2) and tested in 3 different use cases for AI modelling.
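To illustrate what harmonisation can mean in practice, here is a minimal Python sketch that maps site-specific records onto one shared representation. It is not the DataTools4Heart Data Ingestion Suite: the site names, field names, concept label, and the HarmonisedObservation type are all hypothetical, and a real pipeline would target a standard model such as OMOP CDM rather than this toy schema.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class HarmonisedObservation:
    """Hypothetical common representation shared by all sites."""
    patient_id: str
    concept: str
    value: float
    unit: str


# Hypothetical per-site mapping rules: the source field to read, the shared
# concept it maps to, and a scale factor converting to the common unit.
SITE_MAPPINGS: Dict[str, Dict[str, Any]] = {
    "site_a": {"field": "lvef_pct", "concept": "lv_ejection_fraction",
               "unit": "%", "scale": 1.0},
    "site_b": {"field": "ef_fraction", "concept": "lv_ejection_fraction",
               "unit": "%", "scale": 100.0},
}


def harmonise(site: str, record: Dict[str, Any]) -> HarmonisedObservation:
    """Convert one site-specific record into the common representation."""
    rule = SITE_MAPPINGS[site]
    return HarmonisedObservation(
        patient_id=str(record["id"]),
        concept=rule["concept"],
        value=float(record[rule["field"]]) * rule["scale"],
        unit=rule["unit"],
    )


# Two sites report the same measurement in different shapes and units.
obs_a = harmonise("site_a", {"id": 1, "lvef_pct": 50})
obs_b = harmonise("site_b", {"id": "P-2", "ef_fraction": 0.5})
```

After harmonisation both observations carry the same concept and unit, so downstream analyses can treat them uniformly regardless of the originating site.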
Natural Language Processing
DataTools4Heart will introduce a multilingual Natural Language Processing (NLP) suite to standardise the structuring of cardiology reports across European regions, including cardiology-specific entity recognition and machine translation. This suite will include the adaptation of 7 language models to the cardiology domain in English, Spanish, Italian, Romanian, Czech, Swedish, and Dutch, using EHR data from clinical site partners. The project will include the release of clinical multilingual corpora (CardioSynth and Paraclite) in 7 languages, more than 50% of them low-resource, containing more than 500,000 words of clinical text.
Federated machine learning and data synthesis
With the aim of developing innovative methods for synthesising data, DataTools4Heart will build a privacy-preserving cardiology data toolbox to improve data reusability while adhering to ethical and legal standards. A secure, federated network for cardiology data will be established in 7 European locations across all regions, as a result of the cooperation of different stakeholders. Differentially private synthetic data will make it possible to work with data that are representative of a target population, scalable, shareable for research purposes, and able to reduce bias in algorithmic development. The legacy will be the creation of an open-source, privacy-conscious synthetic dataset, CardioSynth. The process and the quality of synthetic data generation will be thoroughly evaluated over the course of the project.
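As a rough illustration of the federated idea, the Python sketch below shows weighted federated averaging, a common aggregation step in which each site trains locally and shares only model parameters, never patient records. The text does not state which algorithm DataTools4Heart uses, so this choice is an assumption; the function name and the example numbers are hypothetical, and the sketch omits differential privacy and secure communication.

```python
from typing import List, Sequence, Tuple


def federated_average(
    updates: Sequence[Tuple[List[float], int]],
) -> List[float]:
    """Average model parameters across sites, weighted by local sample count.

    Each element of `updates` is (parameters, n_samples) from one site.
    """
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one site must contribute samples")

    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for params, n in updates:
        if len(params) != dim:
            raise ValueError("all sites must share the same model shape")
        for i, p in enumerate(params):
            # Sites with more local data contribute proportionally more.
            averaged[i] += p * n / total
    return averaged


# Hypothetical round: two hospitals send locally trained parameters.
global_params = federated_average([
    ([1.0, 2.0], 100),   # site with 100 local samples
    ([3.0, 4.0], 300),   # site with 300 local samples
])
```

Only the parameter vectors and sample counts leave each site in this scheme, which is what makes the approach attractive where raw clinical data cannot be moved.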