Council on Undergraduate Research (CUR) - Leveraging Multimodal AI for Medical Image Diagnosis through LLM's and Visual Question Answering

Leveraging Multimodal AI for Medical Image Diagnosis through LLM's and Visual Question Answering

Artificial Intelligence (AI), particularly large language models (LLMs), has significantly advanced medical diagnostics, especially in interpreting medical images. This research explores the application of the LLaVa (Large Language and Vision Assistant) model for medical visual question answering (VQA) in analyzing chest X-rays, with a focus on COVID-19 diagnoses. Accurate, automated diagnostic tools are crucial in healthcare, where rapid and precise analysis is often life-saving. Chest X-rays, essential for diagnosing conditions like COVID-19, require expertise and time to interpret. This study investigates how LLaVa can augment or partially automate these processes, generating accurate responses to clinical queries using LLMs.

LLaVa integrates natural language processing and computer vision to generate text responses from image inputs. Leveraging its ability to learn from image-text pairs, LLaVa was fine-tuned for medical VQA to analyze chest X-rays. We utilized the LLaVa 1.5 model, known for its dynamic high-resolution capabilities, enhancing its perception of intricate medical details. To enhance the performance of LLaVa, we extracted mask features of the images using a convolutional neural network (CNN) and integrated these features into the model, enabling it to better capture structural details and improve diagnostic accuracy.

A dataset of 7,146 chest X-ray images paired with detailed doctor notes was used for model training, refining LLaVa’s ability to associate visual cues with clinical findings. Training spanned 10 epochs on two NVIDIA RTX 3090 GPUs. Hyperparameters, such as learning rate, were optimized to improve accuracy, while a test set of 2,114 images evaluated the model's performance.

Preliminary results demonstrate LLaVa’s potential in identifying pulmonary infections and handling complex clinical questions. However, challenges remain with nuanced cases and ambiguous visual cues. Further refinement is necessary to improve robustness and accuracy. LLaVa’s integration into medical VQA systems shows promise in supporting healthcare professionals, enhancing diagnostic accuracy, and improving patient outcomes.

Presenter

Shazab Ali

Leveraging Multimodal AI for Medical Image Diagnosis through LLM's and Visual Question Answering

Description

Back to Sessions

Custom JS

Leveraging Multimodal AI for Medical Image Diagnosis through LLM's and Visual Question Answering

Category

Description

Custom CSS