AMIA Jt Summits Transl Sci Proc. 2024; 2024: 105–114.
Published online 2024 May 31.
PMCID: PMC11141822
PMID: 38827047

Local Large Language Models for Complex Structured Tasks

Abstract

This paper introduces an approach that combines the language reasoning capabilities of large language models (LLMs) with the benefits of local training to tackle complex language tasks. The authors demonstrate their approach by extracting structured condition codes from pathology reports. The proposed approach utilizes local, fine-tuned LLMs to respond to specific generative instructions and provide structured outputs. Over 150k uncurated surgical pathology reports containing gross descriptions, final diagnoses, and condition codes were used. Different model architectures were trained and evaluated, including LLaMA, BERT, and LongFormer. The results show that the LLaMA-based models significantly outperform BERT-style models across all evaluated metrics. LLaMA models performed especially well with large datasets, demonstrating their ability to handle complex, multi-label tasks. Overall, this work presents an effective approach for utilizing LLMs to perform structured generative tasks on domain-specific language in the medical domain.

Introduction

In recent years, artificial intelligence (AI) and natural language processing (NLP) have been applied to medicine, from clinical prognosis to diagnostic and companion diagnostic services. One of the most potentially groundbreaking developments in this domain has been the emergence of generative large language models (LLM), such as OpenAI’s ChatGPT1 and its successors. These user-facing, AI-driven systems have proven to be attractive resources, revolutionizing the way medical professionals interact with AI for both research and patient care.

LLMs possess a great capacity to analyze vast amounts of medical data, ranging from research papers and clinical trial results to electronic health records and patient narratives.2,3 By integrating these diverse, potentially multimodal4 data sources, these models can identify patterns, correlations, and insights that might have otherwise remained hidden. With their ability to understand natural language, these AI-powered systems can process patient symptoms, medical histories, and test results to assist in diagnosing diseases more efficiently. LLMs have demonstrated encouraging capabilities to generate5 and summarize6 medical reports, including radiology7,8,9 and pathology10,11,12 diagnostic reports.

The large volume of language data used in training LLMs has enabled so-called zero-shot13 data operations across classes of data not necessarily observed during model training. While LLMs are useful for many transferable language tasks, performance depends on distinguishable associations between observed and non-observed classes. Medical terminologies, domain-specific jargon, and institutional reporting practices produce unstructured data that may not contain the transferable associations used by general-purpose LLMs. Additional medical context (rules, association mappings, reference materials, etc.) can be supplied with a request only if it does not exceed the input limits of the model, which at the time of writing are approximately 1.6k and 13k words for ChatGPT and GPT4, respectively; these associations and context count against the input limit. The process of manipulating LLM results through input content and structure is commonly referred to as "prompt engineering".14 However, technical limitations aside, data policy, privacy, bias, and accuracy concerns associated with AI in medicine persist. With limited information on the underlying data or model training process, it is not clear that third-party use of services like ChatGPT is consistent with FDA guidance15 on the application of AI in clinical care, which, along with privacy concerns, has contributed to the outright restriction of generative AI services in many healthcare organizations.

The theme of "bigger is better" remains dominant in the realm of AI, particularly as it pertains to language model data and parameter sizes. Five years ago, Google's BERT16 language transformers revolutionized deep learning for NLP tasks. While large compared to vision models of the same generation, BERT-style models were publicly available, provided with permissive licenses, and rapidly incorporated into NLP pipelines. BERT-style models are small enough to be fine-tuned for specific tasks, allowing the incorporation of medical and other data within the model itself. The latest LLMs, such as GPT4, reportedly consist of trillions of parameters, are trained with trillions of input tokens, and reportedly cost hundreds of millions of dollars to train. Even if such models were publicly available, few institutions would have the expertise or capacity to run inference on GPT4-sized models, much less train them. Fortunately, three months after the release of ChatGPT, Meta released LLaMA,17 and later LLaMA 2,18 foundational LLMs that are small enough to be trained yet large enough to approach ChatGPT performance on many tasks. Following the release of LLaMA, additional foundational models such as Falcon19 and MPT20 were released. Similar to previous community models such as BERT, these new foundational LLMs are provided in a range of sizes from 3 to 70 billion parameters. Table 1 provides the number of parameters, and Table 2 lists vRAM requirements for common language models. There are now tens of thousands21 of derivative LLMs trained for specific tasks, including the medical domain,22 which can benefit from both complex language reasoning and domain-specific training. We will refer to LLMs that can be trained and operated without relying on external services, such as OpenAI's ChatGPT and Google Bard,23 as local LLMs.

Table 1.

Comparison of model sizes

Model                # Parameters
GPT 4                1.7T (reportedly)
GPT 3.5              175B1
LLaMA                7B, 13B, 33B, 65B30
Longformer           149M31
BERT-base            110M31

Table 2.

LLaMA vRAM requirements

Model                vRAM
LLaMA 7B             14GB
LLaMA 13B            27GB
LLaMA 33B            67GB
LLaMA 65B            133GB

Using LLMs to extract machine-readable values is an area of research that has recently attracted significant attention. This research aims to leverage the capabilities of LLMs to extract24,25 specific numerical or discrete information from unstructured text in a format that can be used by downstream computational pipelines. Typical approaches to LLM structured data output include prompt engineering and post-processing, which can be applied to both services and local LLMs. Most recently, projects such as Microsoft Guidance,26 LangChain,27 and JsonFormer28 have emerged to manage the input structure, model interaction, and output structure of both online and local LLMs. In addition, local LLMs can be fine-tuned to provide structured data in response to specific generative instructions, which can be combined with LLM data control software.

In this paper, we provide an approach to harness the language reasoning power of LLMs with the benefits of locally trained and operated models to perform complex, domain-specific tasks. We will demonstrate our approach by extracting structured condition codes from pathology reports. We have found that ChatGPT does not have sufficient medical context to report structured conditions from pathology reports, providing the response "I don't have the capability to perform specific queries to extract information like ICD codes from medical reports." Likewise, while BERT-style models work well for limited-sized text and frequently used condition codes, they lack the language processing capabilities to perform well across complicated unstructured data with large numbers of multi-label codes. We test the efficacy of our local LLMs against BERT-style models that have been trained with pathology language data and LongFormer,29 an extended-context BERT-like model, both of which we fine-tuned for data extraction.

Methods

This section will describe our process for curating LLM datasets, model training and evaluation, quantization32 approaches, and operational hosting of local LLM models.

LLM Instruction Datasets

We derived our dataset from over 150k uncurated surgical pathology reports containing gross descriptions, final written diagnoses, and ICD condition codes33 obtained from clinical workflows at the University of Kentucky. ICD codes were chosen over other condition codes because they were available in both new and historical reports. Gross reports describe the characteristics of tissue specimens, and final reports describe the diagnosis based on microscopic review of tissues in conjunction with laboratory results and clinical notes. A single case may contain many tissue specimens, each of which produces individual gross and final reports. It is common practice to report gross and final diagnosis results for individual specimens within semi-structured template text, while the resulting specimen condition codes are assigned to the entire case. The result of this practice is that there is no direct association between case-reported condition codes and specimens. Because a specimen often carries multiple condition codes, conflicting codes can occur within a case; for example, if one specimen is malignant and another benign, the codes assigned to the case would conflict. As a result of this reporting practice, extracting condition codes at the specimen level is a complex NLP challenge. Beyond demonstrating the use of LLMs, our motivation for this effort is to better identify specimens and their associated digital slides for multimodal and vision-based clinical AI efforts.

We limited our dataset to cases with cancer-related codes, reducing the potential ICD label range from 70k to 3k. We further eliminated cases that did not include condition codes or a final report, reducing the dataset case count to 117k. In order to test the performance of various model architectures and parameter sizes, we created three datasets: large (all data), small (10% of large), and tiny (1% of large). For each dataset, code combinations that did not appear at least ten times were eliminated. Training and test sets were generated with a 10% code-stratified split. The random sampling of cases in the reduced sets, combined with the imposed code distribution requirements, yields smaller datasets dominated by more common codes.
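As a rough illustration of this curation step (a sketch, not the authors' released code; the file and column names are assumptions), the following Python snippet drops rare code combinations and performs a code-stratified 90/10 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per case, with concatenated report text and a
# sorted, delimiter-joined string of ICD codes (column names are illustrative).
cases = pd.read_csv("pathology_cases.csv")  # columns: text, codes

# Keep only code combinations that appear at least ten times.
combo_counts = cases["codes"].value_counts()
cases = cases[cases["codes"].isin(combo_counts[combo_counts >= 10].index)]

# 90/10 train/test split, stratified on the code combination.
train_df, test_df = train_test_split(
    cases, test_size=0.10, stratify=cases["codes"], random_state=42
)
```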

Given that the condition codes are reported on the case level, we concatenated gross and final reports into a single text input and assigned associated ICD codes as the output label. Each model class and training system has its own format, which we will explain in the following sections.

BERT and LongFormer models can be trained with the same datasets. These datasets are most often CSV files, where the first column is the text input and the remaining columns are binary hot-encoded labels indicating label class, as shown in Table 3.

Table 3.

Example BERT and LongFormer training data format

Input Text                                              code 0   code 1   code N
biopsy basal cell carcinoma type tumor...               0        1        0
lateral lesion and consists of tan soft tissue...       1        0        0
omentum omentectomy metastatic high grade carcinoma...  0        0        1
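As an illustration of how such a CSV might be prepared for multi-label fine-tuning (a sketch under assumed column names, not the authors' exact pipeline), the one-hot code columns can be collapsed into a float label vector per example:

```python
import pandas as pd
from datasets import Dataset, Sequence, Value

df = pd.read_csv("bert_training.csv")   # first column "text", remaining columns are 0/1 code flags
code_columns = [c for c in df.columns if c != "text"]

# Multi-label classification heads expect one float vector of label indicators per example.
df["labels"] = df[code_columns].values.tolist()
dataset = Dataset.from_pandas(df[["text", "labels"]])
dataset = dataset.cast_column("labels", Sequence(Value("float32")))
```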

LLMs are typically trained using an instruction-based format, where instructions, (optional) input, and model response are provided for one or more interactions in JSON format. For each pathology case, we concatenate all text input into a single input field with the associated codes as the model response. Each case is represented as a single conversation. An example of an abbreviated case instruction is shown in Listing 1.
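Listing 1 itself is not reproduced in this text, so purely as an illustration, the sketch below assembles a case in the conversation-style JSON commonly used for Vicuna/FastChat instruction tuning; the field names, instruction wording, and example codes are our assumptions, not taken from the paper.

```python
import json

def build_case_record(case_id, gross_text, final_text, icd_codes):
    """Assemble one pathology case as a single instruction-style conversation."""
    instruction = "Extract the ICD condition codes for the following pathology case."  # wording assumed
    case_text = f"GROSS DESCRIPTION:\n{gross_text}\n\nFINAL DIAGNOSIS:\n{final_text}"
    return {
        "id": case_id,
        "conversations": [
            {"from": "human", "value": f"{instruction}\n\n{case_text}"},
            # Codes are written as an alphabetically ordered, line-break-separated list,
            # matching the output format described later in the Discussion.
            {"from": "gpt", "value": "\n".join(sorted(icd_codes))},
        ],
    }

records = [build_case_record("case-0001",
                             "omentum, omentectomy: tan soft tissue ...",
                             "metastatic high grade carcinoma ...",
                             ["C48.2", "C79.89"])]  # illustrative codes only
with open("path_llama_train.json", "w") as f:
    json.dump(records, f, indent=2)
```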

Model Training

As part of this effort, we trained over 100 models across multiple datasets, model architectures, sizes, and training configurations. For each dataset (tiny, small, and large), we increased model size, where applicable, and training epochs until performance on the test dataset diminished, which we discuss in detail in the Results section. All training was conducted on a single server with four NVIDIA A100 80GB GPUs.34 For the LLaMA 7B and 13B parameter models, the average training time was 25 minutes per epoch and two hours per epoch, respectively. In the following sections, we describe the training process for each unique model architecture.

BERT and its successor transformer models are available in three forms: foundational models, extended (domain-adapted) language models, and fine-tuned models. Foundational models, as the name suggests, are trained on a wide corpus of language, which provides a base for fine-tuning on downstream tasks such as code extraction.

In areas where common language and words do not adequately represent the applied domain, unsupervised language modeling can be used to train a new model on domain-specific language. For example, the popular BioBERT35 model, which was trained using biomedical text, has been shown to outperform the foundational BERT model for specific biomedical tasks. Using example Hugging Face transformer language modeling code,36 we trained our own BERT-based language model using pathology case notes as inputs. Except for the removal of condition code columns, the training data is identical to the format shown in Table 3.

All BERT models were fine-tuned using example Hugging Face transformer training code.37
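For orientation, a minimal multi-label fine-tuning sketch with the Hugging Face Trainer might look like the following; the hyperparameters, checkpoint name, and reuse of the dataset and code_columns objects from the loading sketch above are illustrative assumptions rather than the authors' exact configuration.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"   # or a domain-adapted checkpoint such as a pathology language model
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(code_columns),
    problem_type="multi_label_classification",  # sigmoid outputs with per-code BCE loss
)

args = TrainingArguments(output_dir="bert-path-codes", num_train_epochs=24,
                         per_device_train_batch_size=16, fp16=True)
Trainer(model=model, args=args, train_dataset=encoded, tokenizer=tokenizer).train()
```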

LongFormer is a BERT-like model that makes use of a sliding window and sparse global attention, which allows for an increased maximum input token size of 4096 compared to 512 for BERT. While the majority of gross or diagnostic reports would not exceed the capacity of BERT models, the concatenation of report types across all specimens in the case could easily exceed the 512-token limit. LongFormer models, which provide twice the input token size of our local LLM (2048), allow us to test the impacts of maximum token size on BERT-style model performance.
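To illustrate how these limits interact with concatenated case text, token counts can be checked directly with the public tokenizers (a sketch; the case text is a placeholder):

```python
from transformers import AutoTokenizer

case_text = "gross description ... final diagnosis ..."   # stand-in for a concatenated case report

limits = {"bert-base-uncased": 512, "allenai/longformer-base-4096": 4096}
for name, limit in limits.items():
    n_tokens = len(AutoTokenizer.from_pretrained(name)(case_text)["input_ids"])
    print(f"{name}: {n_tokens} tokens (limit {limit}, truncation needed: {n_tokens > limit})")
```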

No language modeling was performed with LongFormer models, and all models were fine-tuned using example Hugging Face LongFormer transformer training code.38

LLaMA-Based LLMs are by far the most popular local LLM variants. Models can vary based on training data, model size, model resolution, extended context size,39 and numerous training techniques such as LoRA40 and FlashAttention.41 Research associated with local LLMs is developing at a very rapid pace, with new models and techniques being introduced daily. The result of such rapid development is that not all features are supported by all training and inference systems. Fortunately, support has coalesced around several projects that provide a framework for various models and experimental training techniques. We make use of one such project named FastChat,42 an open platform for training, serving, and evaluating large language models. The FastChat team released the popular LLaMA-based LLM Vicuna. Following the Vicuna training code described by the FastChat team, we trained our LLMs using our pathology case data in instruction format, as shown in Listing 1. We trained both 7B and 13B parameter LLaMA models across our three datasets. In all cases, our LLaMA-based models were trained with half-precision (fp16).

Local LLM Hosting

As previously noted in Table 1, the sizes of foundational language models have grown significantly since the release of BERT. As model sizes increase, model-level parallelism must be used to spread model layers across multiple GPUs and servers. In addition, model checkpoints themselves can be hundreds of gigabytes in size, resulting in transfer and load latency at model inference time. The development of inference services that implement the latest models and techniques while optimizing resource utilization is an active area of research. We make use of vLLM,43 an open platform that supports numerous model types, extensions, and resource optimizations. vLLM and other inference platforms provide API services, allowing users to decouple inference services from applications. In addition, vLLM includes an OpenAI-compatible API, allowing users to seamlessly compare ChatGPT/GPT4 results with local LLM results.

Unless otherwise noted, all local LLM performance testing was conducted using vLLM's OpenAI-compatible API.
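As an illustration of this decoupled setup, a client query can reuse the standard OpenAI Python bindings pointed at the local endpoint (a sketch assuming a vLLM OpenAI-compatible server on port 8000 and the pre-1.0 openai client; the served model name and prompt wording are placeholders):

```python
import openai

openai.api_key = "EMPTY"                      # the local server does not require a real key
openai.api_base = "http://localhost:8000/v1"  # assumed local vLLM endpoint

response = openai.ChatCompletion.create(
    model="path-llama-13b",                   # name under which the fine-tuned model is served (placeholder)
    messages=[{"role": "user",
               "content": "Extract the ICD condition codes for the following pathology case.\n\n<case text>"}],
    temperature=0.0,
)
predicted_codes = response["choices"][0]["message"]["content"].splitlines()
print(predicted_codes)
```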

Generative Pre-trained Transformer Quantization (GPTQ)44 is a technique that is used to reduce the GPU memory requirements by lowering the precision of model weights and activations. To match the resolution of the foundational LLaMA models and to reduce resource requirements, local LLMs are commonly trained at half-(fp16) or quarter-precision (int8). However, even at half-precision, the GPU memory requirements are significant and can exceed the capacity of the largest single GPUs, as shown in Table 2.

Quantization for CPUs has become extremely popular as LLM model sizes and associated resource requirements increase. Using CPU-focused libraries, such as GGML,45 models can be further quantized to even lower precision (int4, int3, int2). High levels of quantization can drastically reduce resource requirements and increase inference speed, allowing LLMs to be run directly on CPUs. As with model size, the performance impacts of precision reduction are highly dependent on the workload. Quantization can occur post-training, allowing a single model to be trained and reduced to various quantization levels for evaluation. Similar to vLLM, LLaMA.cpp46 is an open platform that focuses on the support of GGML quantized models on CPUs. LLaMA.cpp provides tools to quantize pre-trained models and supports bindings for common languages such as Python, Go, Node.js, .Net, and others. The LLaMA.cpp Python47 project provides an OpenAI-compatible API, which we use to evaluate quantized local LLMs where indicated.
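For reference, running a GGML-quantized model directly through the llama-cpp-python bindings can be as simple as the sketch below; the model path, context size, and prompt format are placeholders, and the OpenAI-compatible server mode mentioned above is an alternative to this direct API.

```python
from llama_cpp import Llama

# Load an int4 (q4) GGML quantization of the fine-tuned 7B model (path is a placeholder).
llm = Llama(model_path="path-llama-7b.q4_0.bin", n_ctx=2048)

prompt = ("Extract the ICD condition codes for the following pathology case.\n\n"
          "<case text>\n\nResponse:")
output = llm(prompt, max_tokens=128, temperature=0.0)
print(output["choices"][0]["text"])
```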

Results

Seven different model architectures were tested on the three dataset sizes (tiny, small, and large). This includes four separate BERT models: BERT-base-uncased, BioClinicalBERT, PathologyBERT,48 and UKPathBERT. BERT-base-uncased is the original foundational BERT model, BioClinicalBERT is trained on biomedical and clinical text, PathologyBERT is trained on pathology reports external to our institution, and UKPathBERT is our own BERT-base-uncased language model trained on our own pathology report dataset.

Additionally, the BERT-like Longformer model with an increased input context size was trained. The performance of these BERT-style models serves as benchmarks and evidence for the complexity of our language tasks.

Finally, LLaMA 7B and 13B parameter models were trained using the same datasets in an instruction-based format, which we will refer to as Path-LLaMA. Unlike other generative LLMs, our intended output is a structured set of condition codes. The stability of the structured output allowed us to statistically evaluate model results as we would other non-generative models.

In both the generative (LLM) and BERT-style transformer model cases, multilabel classification results are evaluated in the same way. Accuracy (ACC) refers to the frequency with which the full set of predicted labels is exactly correct. For example, if a particular case has two labels assigned to it and the model only correctly predicts one of them, the accuracy is 0% for that case. Because of this strict definition, accuracy is somewhat low compared to the other performance metrics. The AUC (Area Under the ROC Curve) is calculated for each possible class, and the macro (unweighted) average is taken. This was performed using the sklearn metrics49 package. In the context of multilabel classification, the AUC represents how likely each class is to be labeled correctly; as in binary classification, an AUC below 0.5 indicates that the model performs worse than random chance on average. Similarly, precision, recall, and F1 score are calculated for each class and the macro average is taken to produce a final result, using the classification report function from sklearn's metrics package. In multilabel classification, precision measures the proportion of predicted labels that are correct, recall measures the proportion of true labels that are predicted, and the F1 score is the harmonic mean of the two.
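The evaluation described above corresponds roughly to the following scikit-learn sketch (exact-match accuracy plus macro-averaged AUC, precision, recall, and F1 over binary label matrices); the toy arrays are illustrative, and precision_recall_fscore_support is used here in place of the classification report for brevity.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_recall_fscore_support)

# y_true, y_pred: binary indicator matrices of shape (n_cases, n_codes);
# y_score: per-code prediction scores used for the AUC calculation.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]])

exact_match_acc = accuracy_score(y_true, y_pred)            # every code for a case must be correct
macro_auc = roc_auc_score(y_true, y_score, average="macro")
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(exact_match_acc, macro_auc, precision, recall, f1)
```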

The best results of any architecture were achieved with the LLaMA-based LLM, as seen in Table 4, which shows the overall model performance results, averaged across all datasets and parameter settings.

Table 4.

Average performance of each model on all datasets

Model                Accuracy   AUC     Precision   Recall   F1
Path-LLaMA 13B       0.748      0.816   0.779       0.777    0.775
Path-LLaMA 7B        0.647      0.763   0.68        0.674    0.674
UKPathBERT           0.058      0.506   0.059       0.059    0.059
PathologyBERT        0.057      0.502   0.059       0.059    0.059
BioClinicalBERT      0.053      0.507   0.055       0.054    0.055
BERT-base-uncased    0.036      0.498   0.04        0.042    0.04
Longformer 149M      0.001      0.5     0.063       0.42     0.103

LLaMA-based models outperform BERT-style models across all evaluation metrics. As expected, larger parameter models tend to outperform smaller models, and models trained within a specific domain outperform those that are not. In the remainder of this section, we go into more detailed evaluations of model size, numbers of epochs, dataset size, and other potential performance factors.

Model Size

The two most commonly used sizes of LLaMA models, 7B and 13B, were tested to determine the impact of parameter size on performance. In testing, we observed very similar inference performance of 0.3-0.4 seconds per case between the 7B and 13B models using fp16. We attribute this to our multi-GPU test system, which is less fully utilized by the 7B model, and to the overhead of the decoupled API interface. We also tested GGML int4 quantized versions of the 7B models, whose results were nearly identical to their fp16 counterparts but whose inference time was 7.5 seconds per case. Despite the lower precision, CPU-based inference resulted in significantly longer inference times.

As seen in Table 4, the larger model performed better on average. However, as shown in Table 5, when comparing results on the large dataset only, their performance was very similar. Both achieved an F1 score of 0.785, while the 7B model obtained a slightly higher AUC of 0.79 compared to the 13B model's 0.786. This indicates that the increase in model size had little effect on performance on the largest dataset.

Table 5.

Path-LLaMA dataset comparison

Model        Accuracy   AUC     Precision   Recall   F1
7B Tiny      0.507      0.701   0.507       0.507    0.507
13B Tiny     0.778      0.82    0.778       0.778    0.778
7B Small     0.699      0.812   0.74        0.735    0.734
13B Small    0.724      0.842   0.765       0.765    0.761
7B Large     0.737      0.79    0.793       0.787    0.785
13B Large    0.742      0.786   0.793       0.787    0.785

Number of Epochs

Each model architecture was trained on a range of epochs. The number of epochs tested for each was dependent on two things: model training time (dataset and parameter sizes) and how many epochs it took before the results on the training set no longer improved. The average F1 score for each model and the number of epochs are given in Table 6.

Table 6.

F1 of each model for each number of epochs tested

Model                1       3       6       12      24      48      96
Path-LLaMA 13B       0.749   0.767   0.80    -       -       -       -
Path-LLaMA 7B        0.486   0.586   0.761   0.825   0.759   -       -
UKPathBERT           0.002   0.008   0.032   0.041   0.148   0.2     0.2
PathologyBERT        0.004   0.006   0.055   0.037   0.117   0.267   0.133
BioClinicalBERT      0.004   0.007   0.009   0.059   0.118   0.4     -
BERT-base-uncased    0.015   0.007   0.007   0.016   0.088   0.2     0.133
Longformer           0.081   0.075   0.072   0.229   0.219   -       -

This table shows that the number of epochs during training can have a significant impact on the results of the model. In each case, at least six epochs were required to train the best model, in some cases significantly more. Optimal epoch count is very much experimental in practice, as it is highly dependent on the dataset, model parameter size, and other training parameters.

Dataset Size

The average number of words per pathology case was approximately 650, so assuming token counts are 1.25X larger than words, our largest dataset contained over 80M tokens from 100k cases. As previously mentioned, larger datasets include a wider range of condition codes, so in this context, a larger dataset does not necessarily guarantee better performance. The performance of each model on each dataset size is shown in Table 7. Here, the LLaMA models are shown to perform the best on the largest datasets, while the BERT-style models perform best on the smallest.

Table 7.

F1 of each model on each dataset

Model                Tiny    Small   Large
Path-LLaMA 13B       0.778   0.761   0.785
Path-LLaMA 7B        0.641   0.764   0.783
UKPathBERT           0.114   0.014   0.018
PathologyBERT        0.114   0.011   0.021
BioClinicalBERT      0.105   0.012   0.019
BERT-base-uncased    0.073   0.012   0.018
Longformer           0.206   0.025   0.009

Discussion

Leveraging LLMs for tasks requiring structured output is not trivial. We experimented with pre- and post-processing techniques to ensure structured output, including Microsoft Guidance, LangChain, and JsonFormer. The best results were obtained by ordering condition codes into alphabetical lists separated by line breaks. While structured data tools are useful for extracting entities from generated sentences, there is little to no development or published research on how these tools should be used with models fine-tuned to produce structured output. With the exception of single-epoch training of the Path-LLaMA 7B model, we did not observe deviation (hallucination) from the intended format.

The largest LLaMA model, with 13 billion parameters, performed the best on average, as seen in Table 4. Both Path-LLaMA models performed significantly better than any other model. The BERT transformers performed poorly on average, but the versions that were trained specifically on pathology-related text outperformed the base model. The Longformer had better recall than the BERT models because it tended to predict a large number of different codes, giving it a higher chance of guessing correctly; however, this brings down the precision and accuracy of the model because many of its guesses are wrong. The Longformer's accuracy is extremely low because of how accuracy was calculated, which requires all predictions for a case to be correct.

The results on model size and number of epochs are mostly unsurprising, with larger models trained for more epochs generally outperforming smaller models trained for less time. For Path-LLaMA, the difference in performance between the 7B and 13B parameter models largely dissipated as the datasets increased in size, as shown in Table 5. This trend may be due to the fact that the larger model was trained for fewer epochs overall as the dataset size increased. On the smallest datasets, both models could be trained for the same number of epochs without an unwieldy time cost, and it is on these datasets that we see the 13B model outperform the 7B model. On the largest dataset, however, the 13B model could not be trained for the same number of epochs due to the increased time per epoch. Therefore, the smaller model appears to "catch up" to the larger one on the largest dataset. We expect that, if training time had not been a constraint, the larger model would have continued to outperform the smaller model at every stage.

The results in Table 7 show that the LLaMA models perform best on the largest datasets, while the BERT-style models perform best on the smallest. The smaller dataset is an easier classification problem, with fewer possible class labels and examples, but the larger dataset provides more complex data to train from. This further reinforces the advantage of LLaMA over the other models: when the dataset grows, the other models fail, while LLaMA only improves with more data, demonstrating its greater capacity to learn and correctly classify condition codes.

In addition to the factors already discussed, other possible influences were considered to determine their impact on the results. The length of the input description for each sample was analyzed against how often that sample was predicted correctly by each model. This was done to determine whether, for example, longer descriptions allowed the model to understand the text better and classify the correct code more often. However, we found no significant correlation between the length of the description and how often that sample was correctly predicted. We speculate that the complexity of the language far outweighed the size of the input context window, as indicated by LongFormer performance.

Certain classification codes were far more frequent in the dataset than others. This was especially true for the tiny and small datasets, which might have only ten examples of specific code combinations. The frequency of each code in the dataset was analyzed along with what percentage of the time that code was correctly predicted by the models. Unsurprisingly, it was found that the most common codes were predicted correctly a greater percentage of the time when compared to the less common classification codes. Likewise, smaller models performed better with a limited range of codes.

More work was done to analyze the difference in performance between more common and less common codes. The various performance metrics were evaluated for each model on only the top 20% most common classification codes. This was then compared with the same performance metrics when evaluated on all codes in the dataset. We found that performance generally increases when considering the most common codes. However, most models see only a slight improvement when considering the top 20% most common classification codes. Given that some codes have hundreds of appearances in each dataset and others have no more than a few, it is likely that the most common codes already impact the overall results significantly enough to see little difference between the two sets of performance results.

Conclusion

In this paper, we described the end-to-end process of training, evaluating, and deploying a local LLM to perform complex NLP tasks and provide structured output. We analyzed model performance across parameter and dataset sizes along with data complexity. We compared these results with BERT-style models trained on the same data. The results of this effort provide overwhelming evidence that local LLMs can outperform smaller NLP models that have been trained with domain knowledge. In addition, we demonstrate that, albeit with higher latency, LLMs can be deployed without GPUs. While we make no claims that local LLMs provide language processing capabilities comparable to ChatGPT and its successors, technical and policy limitations make local LLMs actionable alternatives to commercial model services. We have also shown that accurate models (such as LLaMA 7B) can be made usable on reasonable CPU/GPU hardware with minimally increased overhead.

In future efforts, we aim to explore newer and larger models, such as LLaMA 2 and Falcon. We would like to further explore the impact of LLM context size and post-training context extension on model performance. Finally, we aim to explore the structure of instruction and input training data on model results.

With the exception of the identified example dataset, code and instructions to recreate this work can be found in the following repository: https://github.com/innovationcore/LocalLLMStructured

Acknowledgements

The project described was supported by the University of Kentucky Institute for Biomedical Informatics; the Department of Pathology and Laboratory Medicine; and the Center for Clinical and Translational Sciences through NIH National Center for Advancing Translational Sciences grant number UL1TR001998. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Figures & Table

Figure 1. Local LLM High-level View

Listing 1. LLM Instruction JSON Format

References

1. OpenAI. ChatGPT. 2023. Accessed: 2023-07-30. https://chat.openai.com.
2. Xue VW, Lei P, Cho WC. The potential impact of ChatGPT in clinical and translational medicine. Clinical and Translational Medicine. 2023;13(3) [PMC free article] [PubMed] [Google Scholar]
3. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in Artificial Intelligence. 2023;6:1169595. [PMC free article] [PubMed] [Google Scholar]
4. Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, et al. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:230600890. 2023.
5. Zhou Z. Evaluation of ChatGPT's capabilities in medical report generation. Cureus. 2023;15(4) [PMC free article] [PubMed] [Google Scholar]
6. Temsah O, Khan SA, Chaiah Y, Senjab A, Alhasan K, Jamal A, et al. Overview of early ChatGPT’s presence in medical literature: insights from a hybrid literature review by ChatGPT and human experts. Cureus. 2023;15(4) [PMC free article] [PubMed] [Google Scholar]
7. Ma C, Wu Z, Wang J, Xu S, Wei Y, Liu Z, et al. ImpressionGPT: an iterative optimizing framework for radiology report summarization with chatGPT. arXiv preprint arXiv:230408448. 2023.
8. Biswas S. ChatGPT and the future of medical writing. Radiology (Radiological Society of North America). 2023.
9. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, et al. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports. arXiv preprint arXiv:221214882. 2022. [PMC free article] [PubMed]
10. C S, SC S, E L, S P, OO F, GP P, et al. Application of ChatGPT in Routine Diagnostic Pathology: Promises, Pitfalls, and Potential Future Directions. Advances in Anatomic Pathology. 2023. [PubMed]
11. Sinha RK, Roy AD, Kumar N, Mondal H, Sinha R. Applicability of ChatGPT in assisting to solve higher order problems in pathology. Cureus. 2023;15(2) [PMC free article] [PubMed] [Google Scholar]
12. Brennan G. Using ChatGPT to Write Pathology Results Letters. @Gijournal. 2023;3 [Google Scholar]
13. Wang W, Zheng VW, Yu H, Miao C. A survey of zero-shot learning: Settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 2019;10(2):1–37. [Google Scholar]
14. White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, et al. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:230211382. 2023.
15. Washington, DC, USA: The US Food and Drug Administration; 2021. Good machine learning practice for medical device development: guiding principles. [Google Scholar]
16. Devlin J, Chang MW, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.
17. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:230213971. 2023.
18. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:230709288. 2023.
19. ZXhang YX, Haxo YM, Mat YX. Falcon LLM: A New Frontier in Natural Language Processing. AC Investment Research Journal. 2023;220(44) [Google Scholar]
20. MosaicML NLP Team and others. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. 2023. Accessed: 2023-07-30.
21. LLM Explorer. Extractum.io. 2023. Accessed: 2023-08-01. https://llm.extractum.io/
22. Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Pmc-llama: Further finetuning llama on medical papers. arXiv preprint arXiv:230414454. 2023.
23. Google. Bard. 2023. Accessed: 2023-07-30. https://bard.google.com/
24. Hu Y, Ameer I, Zuo X, Peng X, Zhou Y, Li Z, et al. Zero-shot clinical entity recognition using chatgpt. arXiv preprint arXiv:230316416. 2023.
25. Wei X, Cui X, Cheng N, Wang X, Zhang X, Huang S, et al. Zero-shot information extraction via chatting with chatgpt. arXiv preprint arXiv:230210205. 2023.
26. Microsoft. Guidance. GitHub. 2023. https://github.com/microsoft/guidance .
27. Harrison C. LangChain. GitHub. 2023. https://github.com/hwchase17/langchain .
28. Jsonformer Team. Jsonformer. GitHub. 2023. https://github.com/1rgs/jsonformer .
29. Beltagy I, Peters ME, Cohan A. Longformer: The long-document transformer. arXiv preprint arXiv:200405150. 2020.
30. The Vicuna Team. vicuna-13b-4bit. HuggingFace. 2023. https://huggingface.co/elinas/vicuna-13b-4bit .
31. Quijano AJ, Nguyen S, Ordonez J. Grid Search Hyperparameter Benchmarking of BERT, ALBERT, and LongFormer on DuoRC. arXiv preprint arXiv:210106326. 2021.
32. Dettmers T, Lewis M, Belkada Y, Zettlemoyer L. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:220807339. 2022.
33. World Health Organization; 2023. ICD-10: international statistical classification of diseases and related health problems : tenth revision. https://apps.who.int/iris/handle/10665/42980 . [PubMed] [Google Scholar]
34. Choquette J, Gandhi W, Giroux O, Stam N, Krashinsky R. NVIDIA A100 tensor core GPU: Performance and innovation. IEEE Micro. 2021;41(2):29–35. [Google Scholar]
35. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40. [PMC free article] [PubMed] [Google Scholar]
39. Chen S, Wong S, Chen L, Tian Y. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:230615595. 2023.
40. Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:210609685. 2021.
41. Dao T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:230708691. 2023.
42. Zheng L, Chiang WL, Sheng Y, Zhuang S, Wu Z, Zhuang Y, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:230605685. 2023.
43. vLLM. vLLM Team. 2023. https://github.com/vllm-project/vllm .
44. Frantar E, Ashkboos S, Hoefler T, Alistarh D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:221017323. 2022.
45. GGML. GGML Team. 2023. https://github.com/ggerganov/ggml .
46. LLaMA.cpp. LLaMA.cpp Team. 2023. https://github.com/ggerganov/llama.cpp .
47. LLaMA.cpp Python. LLaMA.cpp Python Team. 2023. https://github.com/abetlen/llama-cpp-python .
48. Santos T, Tariq A, Das S, Vayalpati K, Smith GH, Trivedi H, et al. AMIA Annual Symposium Proceedings. vol. 2022. American Medical Informatics Association; 2022. PathologyBERT-Pre-trained Vs. A New Transformer Language Model for Pathology Domain; p. 962. [PMC free article] [PubMed] [Google Scholar]

