The global shortage of qualified pathologists has become a pressing issue, significantly impacting healthcare systems worldwide [1]. Efficient use of pathologists’ time and expertise is important. However, not all healthcare facilities have consistent access to experienced pathologists, and in some instances, even skilled pathologists may face limitations in their knowledge or require second opinions. Consequently, providing reliable and timely support to pathologists through advanced computational models can substantially enhance diagnostic accuracy and operational efficiency. Recent advancements in multi-modal large language models (MLLMs) offer promising solutions for supporting medical professionals. These models integrate visual and textual data to deliver detailed, precise, and contextually relevant responses, making them ideal candidates for providing rapid second opinions in pathology.
Foundational Models: As a result of advances in image analysis using transformer architectures [2, 3], numerous foundational models (FMs) suitable for various histopathological applications have emerged, for instance H-Optimus-0 [4], a vision transformer trained on over 500,000 whole slide images (WSIs); the Hibou family [5], vision transformers of different sizes trained for pathology on over 1 million WSIs; and Virchow [6], also a vision transformer, trained on 1.5 million WSIs.
Open-Source Contributions: Parallel developments in open-source large language models (LLMs) such as Qwen [7], LLaMA [8], Mistral [9], DeepSeek [10], and many more provide a robust basis for specialized applications, thanks to their flexibility and accessibility.
Medical-Specific Pretraining: Several studies have pursued specialized training approaches for medical domains, resulting in LLMs optimized specifically for medical knowledge, including pathology. Examples include BioMed-LLaMA-3 [11], a finetuned LLaMA 3 model using 54K biomedical-focused samples. Other recent efforts [12] trained a model on multilingual medical data, creating one of the leading open-source models in this field. Such tailored models hold promise for enhancing precision and reliability in medical diagnosis.
Multi-modal Approaches: The integration of visual and textual modalities has significantly evolved. LLaVA [13] trains in two stages: first, a projector that maps image tokens extracted by a vision encoder into the text token space of an LLM is trained; in a second stage, this projector is finetuned together with the LLM. LLaVA-Med [14] successfully demonstrated the potential of rapidly fine-tuning generic multi-modal approaches on medical datasets, achieving notable performance despite relatively limited training time, resources, and data, using 600K samples for pretraining and 60K for finetuning. Building upon this, researchers have introduced even more specialized multi-modal models trained specifically on histopathological data, such as PA-LLaVA [15], which pretrained a vision encoder on 800K samples, followed by pretraining the projector on 500K samples and finetuning on 36K samples. Furthermore, Quilt LLaVA collected a larger dataset from sources like YouTube and Twitter to pretrain their model on 700K samples and finetune on 100K samples [16]. PathGen LLaVA used an even larger dataset of 1.6M samples for pretraining and 200K for finetuning [17].
Challenges in Multimodal Pathology Models: Despite their potential, multimodal pathology models face several inherent challenges. Original frameworks like LLaVA extract numerous tokens from images, many of which contain redundant information, thus limiting computational efficiency. Approaches like PruMerge have tried to address this issue by significantly reducing the number of tokens from 576 down to 32, achieving improved efficiency without notable performance degradation.
Furthermore, handling large-scale images, particularly whole slide images (WSIs), poses unique difficulties. Simple downscaling is unsuitable as it risks losing critical diagnostic details. Recent models such as PathAlign [18] explicitly tackle WSIs, enabling better diagnostic interpretations from extensive image data. Additionally, models like PA-LLaVA have attempted to preserve detailed information by employing advanced attention mechanisms to selectively focus on diagnostically relevant features within larger images.
This paper builds upon these foundational advances, addressing existing limitations to provide pathologists with an efficient, precise, and reliable multi-modal tool to enhance diagnostic practices and patient outcomes.
The proposed Multi-Modal Large Language Model (MLLM) consists of three main components: a visual encoder, a projector module, and a language model.
Visual Encoder: For feature extraction from images, we selected H-optimus-0 [4], a state-of-the-art foundational vision encoder for histopathology. H-optimus-0 expects inputs of size (224 ⨉ 224) pixels and returns 256 patch tokens, 4 register tokens, and one classification token, each with a dimensionality of 1536. It has 1.1B parameters.
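As a minimal sketch of how such an encoder can be loaded and queried (the Hugging Face hub identifier and the exact token ordering are assumptions based on the public release and may need adjustment):

```python
import timm
import torch

# Load the H-optimus-0 backbone via timm from the Hugging Face hub
# (hub path assumed from the public release; adjust if it differs).
encoder = timm.create_model(
    "hf-hub:bioptimus/H-optimus-0", pretrained=True, init_values=1e-5
)
encoder.eval()

with torch.inference_mode():
    dummy = torch.randn(1, 3, 224, 224)        # one normalized 224x224 tile
    tokens = encoder.forward_features(dummy)   # (1, 1 + 4 + 256, 1536)
    cls_token = tokens[:, 0]                   # classification token
    register_tokens = tokens[:, 1:5]           # 4 register tokens
    patch_tokens = tokens[:, 5:]               # 256 spatial patch tokens
```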
Language Model: We employ MMed Llama [12], a specialized medical domain LLM, which the authors claim exhibits robust multilingual performance across various medical contexts. The model has 8B parameters with an embedding dimension of 4096.
Projection Module: The projector serves as a bridge mechanism, translating visual encoder outputs (dimension 1536) into token embeddings compatible with the language model (dimension 4096). It is implemented as a lightweight neural network with two fully connected layers and GELU activation functions and has only 23M parameters.
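A minimal sketch of such a projector in PyTorch is given below. The layer sizes follow the dimensions stated above; using a hidden width equal to the LLM embedding dimension is an assumption that happens to reproduce the reported ~23M parameter count.

```python
import torch.nn as nn

class Projector(nn.Module):
    """Maps vision-encoder tokens (dim 1536) into the LLM embedding space (dim 4096)."""

    def __init__(self, vision_dim: int = 1536, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # ~6.3M parameters
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),     # ~16.8M parameters, ~23M total
        )

    def forward(self, x):
        # x: (num_tokens, vision_dim) -> (num_tokens, llm_dim)
        return self.net(x)
```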
Our approach follows the stages illustrated in Figure 1 and is described below:
(A) Encoding: Images are first split into multiple patches, each matching the encoder’s input size of (224⨉224) pixels. Simultaneously, a down-sampled thumbnail representing the entire input is created. Each individual patch, along with the thumbnail, is encoded separately by H-optimus-0, producing 256 patch tokens, 4 register tokens, and 1 CLS token per patch, each with an embedding dimension of 1536.
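A simplified sketch of this tiling step is shown below, assuming a PIL input image and a generic `encode` function wrapping the vision encoder; preprocessing details such as normalization and border handling are omitted.

```python
import torch
from PIL import Image
from torchvision.transforms import functional as TF

TILE = 224  # encoder input size

def tile_and_encode(image: Image.Image, encode):
    """Split an image into 224x224 tiles, add a thumbnail, and encode each one.

    `encode` is assumed to map a (1, 3, 224, 224) tensor to a
    (1, 261, 1536) token sequence (1 CLS + 4 register + 256 patch tokens).
    """
    w, h = image.size
    tiles = []
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            crop = image.crop((left, top, left + TILE, top + TILE))
            tiles.append(TF.to_tensor(crop).unsqueeze(0))

    # Down-sampled thumbnail of the whole input, encoded like any other tile.
    thumbnail = TF.to_tensor(image.resize((TILE, TILE))).unsqueeze(0)

    tile_tokens = [encode(t) for t in tiles]   # list of (1, 261, 1536)
    thumb_tokens = encode(thumbnail)           # (1, 261, 1536)
    return torch.cat(tile_tokens, dim=0), thumb_tokens
```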
(B) Patch Selection: CLS tokens from all non-thumbnail patches are used to calculate attention scores, indicating the relative importance of each patch. Patches are ranked based on these scores, and the top-k (e.g., 10) most relevant patches are selected for further processing.
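The exact scoring mechanism is not spelled out here; the sketch below assumes a simple variant in which each tile's CLS token is scored by its softmax-normalized similarity to the thumbnail CLS token, after which the top-k tiles are kept.

```python
import torch

def select_top_k(tile_tokens, thumb_tokens, k=10):
    """Rank tiles by CLS-token similarity to the thumbnail CLS token (assumed scoring).

    tile_tokens:  (num_tiles, 261, 1536) token sequences of all non-thumbnail tiles
    thumb_tokens: (1, 261, 1536) token sequence of the thumbnail
    """
    tile_cls = tile_tokens[:, 0]        # (num_tiles, 1536)
    thumb_cls = thumb_tokens[:, 0]      # (1, 1536)

    scores = (tile_cls @ thumb_cls.T).squeeze(-1) / tile_cls.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=0)  # relative importance of each tile

    top_idx = torch.topk(attn, k=min(k, tile_tokens.shape[0])).indices
    return tile_tokens[top_idx], top_idx
```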
(C) Token Reduction: For each selected patch, the CLS token is used directly. The remaining tokens (patch and register tokens) are pooled into a smaller number of tokens (e.g., 5 per patch) using average pooling, significantly reducing redundancy while preserving fine-granular visual information.
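A minimal sketch of this reduction, keeping the CLS token and average-pooling the remaining 260 tokens down to m groups (here m = 5):

```python
import torch
import torch.nn.functional as F

def reduce_tokens(selected_tokens, m=5):
    """Reduce each selected tile to 1 CLS token plus m pooled tokens.

    selected_tokens: (k, 261, 1536) = CLS + 4 register + 256 patch tokens per tile
    returns:         (k, 1 + m, 1536)
    """
    cls = selected_tokens[:, :1]   # (k, 1, 1536), kept as-is
    rest = selected_tokens[:, 1:]  # (k, 260, 1536)

    # Average-pool along the token axis down to m tokens per tile.
    pooled = F.adaptive_avg_pool1d(rest.transpose(1, 2), m).transpose(1, 2)
    return torch.cat([cls, pooled], dim=1)
```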
(D) Projection: The reduced tokens from selected patches undergo dimensionality projection. The projector converts these visual embeddings from their original dimension (1536) into the embedding dimension of the language model (4096).
(E) Text Embedding: User-provided textual input is tokenized and embedded directly into the shared latent embedding space, enabling seamless fusion with visual data.
(F) LLM Integration: Embeddings from visual and textual inputs are concatenated into a unified sequence, which is fed into MMed Llama. The LLM generates predictions by autoregressively decoding the combined multimodal embeddings.
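Steps (E) and (F) can be sketched as follows with a Hugging Face causal LLM, where the projected visual tokens are simply prepended to the embedded text tokens and passed in via `inputs_embeds`. This is a simplification; the actual prompt template and the placement of the image tokens may differ.

```python
import torch

def multimodal_generate(llm, tokenizer, projector, visual_tokens, prompt, max_new_tokens=128):
    """visual_tokens: (num_visual_tokens, 1536) reduced tokens from the selected tiles."""
    # (E) Tokenize and embed the user text with the LLM's own embedding table.
    ids = tokenizer(prompt, return_tensors="pt").input_ids    # (1, seq_len)
    text_embeds = llm.get_input_embeddings()(ids)             # (1, seq_len, 4096)

    # (D/F) Project the visual tokens into the LLM space and prepend them to the text.
    vis_embeds = projector(visual_tokens).unsqueeze(0)        # (1, n_vis, 4096)
    inputs_embeds = torch.cat([vis_embeds, text_embeds], dim=1)

    # Autoregressive decoding over the fused multimodal sequence.
    out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```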
The training setup follows a two-step process inspired by the LLaVA framework [13]: first, the projector is pretrained to align visual and textual embeddings; second, the projector and the LLM are finetuned jointly.
Our model is trained on an extensive, diverse dataset drawn from several sources: our own curated version of PubMedOA, containing images published on PubMed with their captions as labels; annotated patches from The Cancer Genome Atlas (TCGA); and the YouTube and Twitter portions of the Quilt-1M dataset [19].
Data curation and preprocessing were specifically tailored for each dataset, ensuring high-quality training inputs. For textual data preprocessing and curation, automated methods, including ChatGPT-4o mini and other natural language processing tools, were employed.
This results in a pretraining dataset of over 2.3M images and over 1M samples for finetuning. Drawing data from multiple sources ensures diversity and enables us to rigorously remove samples that do not meet our quality expectations.
The PAiChat model presented here is only trained on a small portion of this dataset, using approximately 500K samples for pretraining and 60K samples for finetuning.
For quantitative evaluation, we consider the following histopathology zero-shot classification tasks and datasets, meaning the model was not trained on these specific classification tasks:
BACH: 400 breast images of 4 classes (benign, in situ, invasive, and normal) [20]. The images have a resolution of 2048 ⨉ 1536.
NCTCRC7k: 7,000 colorectal patches of 9 tissue classes (Adipose, Background, Debris, Lymphocytes, Mucus, Smooth muscle, Normal colon mucosa, Cancer-associated stroma, Colorectal adenocarcinoma epithelium) [21]. The images have a resolution of 224 ⨉ 224.
Pcam: 1,000 patches from the PatchCamelyon dataset, which contains lymph node sections with a binary label indicating the presence of metastatic tissue [22]. The images have a resolution of 96 ⨉ 96.
OSCC: Oral cancer dataset containing 439 images with 100x magnification and 495 with 400x magnification. The images contain either normal epithelium or Squamous Cell Carcinoma [23]. The images have a resolution of 2048 ⨉ 1536.
We compare our results with the general medical MLLM LLaVA-Med [14], Quilt LLaVA [16], which was trained on Quilt-1M, and PathGen LLaVA [17].
The metric used is balanced accuracy as implemented in scikit-learn [24]. For balanced data it is equivalent to standard accuracy, but for imbalanced data it has the advantage of weighting each class equally, so that majority classes do not dominate the score.
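For reference, the metric can be computed directly with scikit-learn; the labels below are purely illustrative and not taken from our evaluation data.

```python
from sklearn.metrics import balanced_accuracy_score

# Toy imbalanced binary task: plain accuracy would be 0.875, but balanced
# accuracy averages per-class recall (0.5 for tumor, 1.0 for normal) = 0.75.
y_true = ["tumor", "tumor", "normal", "normal", "normal", "normal", "normal", "normal"]
y_pred = ["tumor", "normal", "normal", "normal", "normal", "normal", "normal", "normal"]

print(balanced_accuracy_score(y_true, y_pred))
```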
In Table 1, we observe that LLaVA-Med achieves the lowest performance among the evaluated models, with an average balanced accuracy of 38.12%.
Quilt LLaVA shows strong performance on binary tasks (Pcam and OSCC). However, its accuracy declines considerably on the multiclass breast cancer (4-class) and colorectal (9-class) tasks, resulting in an overall average balanced accuracy of 50.36%.
PathGen-LLaVA demonstrates notable superiority on multiclass tasks compared to other models, achieving the highest performance among the baseline models with an average balanced accuracy of 55.69%.
Finally, our proposed model, PAiChat, surpasses all compared models, achieving the highest average balanced accuracy of 64.37%.
In Figure 2 we can see a minimalistic UI for PAiChat.
The relatively poor performance of LLaVA-Med is expected, as it was trained on diverse medical datasets with limited exposure to histopathological images specifically. Upon examining the quantity of histopathological images included during pretraining and fine-tuning, a clear correlation emerges: models pretrained on larger datasets perform better. For instance, PathGen-LLaVA, which was trained on the largest histopathology dataset among LLaVA-Med, Quilt LLaVA, and PathGen-LLaVA, achieves the highest performance among these baseline models.
Although the dataset used to train our proposed model (PAiChat) is not larger, it benefits notably from incorporating specialized medical models, namely H-Optimus-0 and MMed Llama, both pretrained extensively on medical and histopathological data, resulting in superior performance. This highlights that, alongside large and diverse datasets, building on strong domain-specific pretrained components is critical for achieving good performance in MLLM training.
Figure 1: (A) Encoding: The input image is divided into multiple patches matching the input requirement of the vision encoder (here 224×224 pixels). Additionally, a downscaled thumbnail of the same size is created. All image patches, including the thumbnail, are encoded individually using the vision encoder, generating a sequence of t+1 tokens (256 patch tokens, 4 register tokens, and 1 CLS token per patch), each with the embedding dimension of the vision encoder (1536). (B) Patch reduction: The CLS tokens of all patches (excluding the thumbnail) are used to compute a patch attention score, which is used to reorder the patches and select the top-k patches (e.g., 10). (C) Token reduction: From each selected patch, the CLS token is preserved directly, while the remaining tokens (patch and register tokens) are reduced to m tokens through average pooling. (D) Projection: The reduced set of patch tokens is subsequently passed through the projector module, a shallow two-layer neural network employing GELU activations, which maps the visual embedding dimension (1536) into the latent space of the Large Language Model (LLM). (E) Text Embedding: The input text is tokenized and then embedded into the latent space of the LLM. (F) LLM: The encoded text is combined with the flattened image encodings and fed into the LLM to predict tokens.
Figure 2: The model in action with the minimalistic UI. The user can upload an image and select patches from it or load the whole image. Then the user can chat with PAiChat, receiving short and precise answers.