German Medical Text Corpus

The main aim of GeMTeX is to create a large annotated collection of German medical texts from daily patient care. With patients' consent, the plan is to collect documents from the electronic health records (ePA) of six university hospitals. Using natural language processing (NLP), the documents will be prepared and made available in anonymised form for joint use. This creates a valuable text resource for research and development. The potential of NLP is growing due to rapid advances in machine learning, particularly deep learning. Clinical language, however, differs considerably from both everyday and scientific language. Progress in clinical NLP will therefore depend critically on specially trained language models, which in turn require realistic clinical documents. GeMTeX, a methodological platform within Module 3 of the MII, will address the two major bottlenecks of previous language models: data accessibility and data annotation.

The Medical Informatics Initiative (MII) provides a unique opportunity to make clinical documents accessible on a large scale and to enrich them with systematic annotations. A German medical text collection will promote the development of NLP resources that support the analysis of German clinical texts. GeMTeX will create a technical and organisational structure to collect anonymised texts, make them available, and process them for enrichment in accordance with the guidelines. GeMTeX covers a broad spectrum of annotation tasks. These will be tested, reviewed and applied on a large scale to create a unique database. AI models can then be trained on this database and tested for their usefulness in everyday clinical practice. The enriched text documents and the models will be made available via the Central Library of Medicine (ZBMED) and via the DFG-funded project NFDI4Health, with which GeMTeX works closely.

Institute for AI in Medicine (IKIM)