Local Execution
Local LLMs
Ollama
Setup
- Download Ollama (from https://ollama.com)
- Download a model
    - Search for the desired model and go to the corresponding page
        - The dolphin models are uncensored (search for dolphin)
    - Click on the desired version, copy the CLI command, and execute it in the terminal
    - Make sure you have enough RAM for the model to run
        - Memory needed is roughly equal to the size of the model (e.g., a 4-bit-quantized 8B model is roughly 5 GB)
- Once the terminal command executes, we are ready to start using the LLM locally
    - When we run the 'ollama run' command for a model for the first time, it is downloaded into the ~/.ollama/models/manifests/registry.ollama.ai/library folder on macOS
Useful Commands
| Command | Description |
|---|---|
| /bye (or Ctrl+D) | Quit interacting with the LLM |
| ollama | Lists all available commands |
| ollama list | Lists the models available for use |
| ollama run | Runs the specified model |
| ollama serve | Starts a local server |

- ollama serve shows the IP and port of the server, which can then be used as an endpoint to interact with the LLM
    - This should start the Ollama server on http://127.0.0.1:11434
    - Accessing the above URL should display the message "Ollama is running"
    - If it shows the error Error: listen tcp 127.0.0.1:11434: bind: address already in use, it means that Ollama is already running on port 11434
Running Ollama in Colab
- Install Ollama: Install the Ollama CLI in the Colab environment. This typically involves downloading and running an installation script.
- Start Ollama Server: Start the Ollama server in the background. This will make the API endpoint available for use.
- Download an Ollama Model: Download a specific language model (e.g., 'llama3.1:8b') using the Ollama CLI.
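A minimal sketch of these three steps from a Colab cell, using subprocess to run the shell commands (the model name llama3.1:8b and the short wait are just examples):

import subprocess, time

# install the Ollama CLI using the official install script
subprocess.run("curl -fsSL https://ollama.com/install.sh | sh", shell=True, check=True)

# start the Ollama server in the background (API becomes available on localhost:11434)
ollama_server = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to start

# download the model
subprocess.run(["ollama", "pull", "llama3.1:8b"], check=True)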
Also see Ollama Prompting in Python
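As a quick sketch, Ollama also exposes an OpenAI-compatible API at http://localhost:11434/v1, so prompting it from Python looks much like the llamafile sample below (this assumes a model such as llama3.1:8b has already been pulled):

from openai import OpenAI

# point the client at the local Ollama server; an API key is required but not checked
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is an LLM?"}],
)
print(response.choices[0].message.content)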
LlamaFile
Setup
- Download the TinyLlama llamafile from the "Other example llamafiles" section
- Grant execute permission to the downloaded file (chmod +x on Mac/Linux)
- Start the llamafile server by running the downloaded file from the terminal
    - This should launch the llamafile UI on http://127.0.0.1:8080
Sample Code
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",  # local llamafile url
    api_key="no-key"  # empty string does not work
)

prompt = "What is an LLM?"
messages = [{"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}]

response = client.chat.completions.create(
    model="TinyLLM",
    messages=messages,
)
print(response.choices[0].message.content)
Local Frameworks
LlamaIndex
Implementation Steps Using LlamaFile
- Install required packages
# core package for LlamaIndex, which provides the framework for building and querying indexes
pip install llama-index
# package for integrating LLaMA models with LlamaIndex, allowing for the use of LLaMA models as the language model (LLM) component
pip install llama-index-llms-llamafile
# package for providing the embedding functionality using LLaMA models, which is essential for creating vector representations of documents
pip install llama-index-embeddings-llamafile
# package for openai integration (required only if using openai models)
pip install llama-index-llms-openai

- Set up LlamaIndex with LlamaFile
- Define custom prompt using prompt templates (Optional)
- Load Local Data
- Build the Index
- Create a Query Engine
- Query the Index
Sample Code
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.llamafile import LlamafileEmbedding
from llama_index.llms.llamafile import Llamafile
# configure LlamaIndex to use the LLaMAFile
# create an embedding model that utilizes the LLaMAFile
Settings.embed_model = LlamafileEmbedding(base_url="http://localhost:8080")
# Set up the Llama model as the LLM component of LlamaIndex
Settings.llm = Llamafile(base_url="http://localhost:8080", temperature=0, seed=0)
# load local pdf docs
# the `load_data` method loads the documents from the directory and returns them as a list.
local_reader = SimpleDirectoryReader(input_dir='DirPath/')
docs = local_reader.load_data(show_progress=True)
# create an index to store vector representations of the loaded documents.
# the `from_documents` method builds the index from the provided list of documents.
index_pdf = VectorStoreIndex.from_documents(docs)
# convert the index into a query engine that can handle queries and query the index
query_engine_pdf = index_pdf.as_query_engine()
# use the query engine to retrieve relevant information from the index
query = "What is the main topic of the document?"
response = query_engine_pdf.query(query)
print(response)
Custom Prompts
The most commonly used prompts are text_qa_template and refine_template.
- text_qa_template: used to get an initial answer to a query using the retrieved nodes.
- refine_template: used when the retrieved text does not fit into a single LLM call with response_mode="compact" (the default), or when more than one node is retrieved using response_mode="refine". The answer from the first query is inserted as an existing_answer, and the LLM must update or repeat the existing answer based on the new context.
Sample Code with Custom Prompts
from llama_index.core import PromptTemplate
text_qa_template_str = (
    "Context information is below.\n---------------------\n{context_str}"
    "\n---------------------\nUsing both the context information and also using"
    " your own knowledge, answer the question: {query_str}\nIf the context isn't"
    " helpful, you can also answer the question on your own.\n"
)
text_qa_template = PromptTemplate(text_qa_template_str)

refine_template_str = (
    "The original question is as follows: {query_str}\nWe have provided an existing answer: "
    "{existing_answer}\nWe have the opportunity to refine the existing answer (only if needed)"
    " with some more context below.\n------------\n{context_msg}\n------------\nUsing both the"
    " new context and your own knowledge, update or repeat the existing answer.\n"
)
refine_template = PromptTemplate(refine_template_str)

print(index_pdf.as_query_engine(
    text_qa_template=text_qa_template,
    refine_template=refine_template,
).query("What is the main topic of the document?"))
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.core import ChatPromptTemplate

qa_prompt_str = (
    "Context information is below.\n---------------------\n{context_str}\n---------------------\n"
    "Given the context information and not prior knowledge, answer the question: {query_str}\n"
)
refine_prompt_str = (
    "We have the opportunity to refine the original answer (only if needed) with some more context below.\n"
    "------------\n{context_msg}\n------------\nGiven the new context, refine the original"
    " answer to better answer the question: {query_str}. If the context isn't useful, output"
    " the original answer again.\nOriginal Answer: {existing_answer}"
)

# Text QA Prompt
chat_text_qa_msgs = [
    ("system", "Always answer the question, even if the context isn't helpful."),
    ("user", qa_prompt_str),
]
text_qa_template = ChatPromptTemplate.from_messages(chat_text_qa_msgs)

# Refine Prompt
chat_refine_msgs = [
    ("system", "Always answer the question, even if the context isn't helpful."),
    ("user", refine_prompt_str),
]
refine_template = ChatPromptTemplate.from_messages(chat_refine_msgs)

print(index_pdf.as_query_engine(
    text_qa_template=text_qa_template,
    refine_template=refine_template,
).query("What is the main topic of the document?"))
LangChain
Implementation Steps Using LlamaFile
- Install required packages
- Set up Langchain with Llamafile
- Create prompt
- Invoke prompt using model
Sample Code
from langchain_community.llms.llamafile import Llamafile
from langchain.prompts import PromptTemplate
# Initialize Llamafile
llm = Llamafile(temperature=0.7)
# Define prompt
prompt_template = PromptTemplate.from_template("What are LLMs?")
# Invoke prompt
prompt = prompt_template.invoke({})
result = llm.invoke(prompt)
print(result)
- For using local files as context:
    - Load local data and split it into manageable chunks
Splitters
CharacterTextSplitter:
- Tries to preserve paragraphs, sentences, and words as coherent units.
- Can specify chunk_size, chunk_overlap, and separator.
- Does not automatically handle very large chunks; instead, it relies on the user setting appropriate values for chunk_size and chunk_overlap.
RecursiveCharacterTextSplitter:
- Similar to CharacterTextSplitter, but adds recursive splitting capabilities.
- Automatically handles very large chunks by attempting to split them according to the specified chunk_size and separator list.
- If a chunk remains too large after the first round of splitting, it will try again with subsequent separators in the list.
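For contrast, a minimal sketch of CharacterTextSplitter on a plain string (the separator, chunk_size, and chunk_overlap values and sample_text are just example placeholders):

from langchain.text_splitter import CharacterTextSplitter

sample_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."  # placeholder text

# split on a single separator; pieces are merged up to chunk_size but never recursively re-split
char_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=50, chunk_overlap=0)
chunks = char_splitter.split_text(sample_text)
print(chunks)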
Loading Files
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Load local data
pdf_path = 'DirPath/Filename'
pdf_loader = PyPDFLoader(file_path=pdf_path)

# Split using the recursive splitter
# the split is based on the specified chunk size
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
pdf_data_RS = pdf_loader.load_and_split(text_splitter=splitter)

- Define custom prompt using prompt templates (Optional)
- Instantiate chain
- Generate response by invoking chain
Sample Code with Custom Prompts
from langchain.chains.summarize import load_summarize_chain

# Create prompt template
template = """Write a summary that highlights the main ideas in 3 bullet points of the following:
"{text}"
SUMMARY:
"""

# Create prompt
prompt = PromptTemplate.from_template(template)

# Instantiate chain
chain = load_summarize_chain(
    llm=llm,
    chain_type='stuff',
    prompt=prompt,
    verbose=False  # Setting this to true will print the formatted prompt
)

# Invoke chain
results = chain.invoke(pdf_data_RS)
print(f"Result Keys: {results.keys()}")
print(f"\nOutput: {results['output_text']}")

The same kind of summary can also be produced with create_stuff_documents_chain:

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.llm import LLMChain
from langchain_core.prompts import ChatPromptTemplate

# Define prompt
prompt = ChatPromptTemplate.from_messages(
    [("system", "Summarize the highlights of the following in 3 bullet points:\n\n{context}")]
)

# Instantiate chain
chain = create_stuff_documents_chain(llm, prompt)

# Invoke chain
result = chain.invoke({"context": pdf_data_RS[1:5]})