Open Source GenAI
Huggingface
Two API levels:
- Pipelines: Higher level APIs to carry out standard tasks. These are primarily of two types:
  - Transformation: For out of the box inference tasks
    - Sentiment Analysis, Classification, Named Entity Recognition (NER), Q&A, Summarization, Translation
  - Diffusion: For out of the box audio, video and image generation
- Tokenizers and Models: Lower level APIs providing more control
Pipelines
Transformation Pipelines
- Accepts the pipeline type, a model id and a device as parameters
  - Pipeline type: One of the supported types
    - Examples: sentiment-analysis, text-classification, ner, question-answering, summarization, translation, text-generation etc.
  - Model id: The id of the model to use
    - Optional
    - Can be a model id from Huggingface
    - If not specified, Huggingface selects the default model for the task
  - Device: The device to use for inference
    - Specify "cuda" to use an NVIDIA GPU
    - Specify "mps" on a Mac
- Returns a pipeline object that can be used to perform inference
  - Can be used to perform inference on a single text or a list of texts
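The device choice described above can be sketched as a small helper. This is only a minimal sketch: pick_device is a hypothetical name (not part of transformers), and the "cpu" fallback is an assumption for machines without a supported GPU.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string to pass to pipeline(..., device=...)."""
    if cuda_available:
        return "cuda"  # NVIDIA GPU
    if mps_available:
        return "mps"   # GPU on a Mac
    return "cpu"       # assumed fallback when no GPU is present


# Hard-coded flags keep the sketch runnable without PyTorch installed.
print(pick_device(cuda_available=False, mps_available=True))
```

With PyTorch installed, the two flags could come from torch.cuda.is_available() and torch.backends.mps.is_available().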
Basic Pipeline Usage
from huggingface_hub import login
from transformers import pipeline
# Login to Huggingface (hf_token must hold your Huggingface access token)
login(hf_token, add_to_git_credential=True)
# Load a pipeline for the desired task
nlp = pipeline("pipeline_type for desired task", model="model_id", device="device")
# Use the pipeline to analyze a text
result = nlp("text")
print(result)
Sample Pipeline Code
# Sentiment Analysis
nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
result = nlp("I love this product!")
print(result)
# Named Entity Recognition (NER)
nlp = pipeline("ner", device="cuda")
result = nlp("I love this product!")
for entity in result:
    print(entity)
# Question Answering
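# The question-answering pipeline is extractive: the answer is a span taken from the context, so the context must contain it
question = "What is Apple?"
context = "Apple is a technology company that designs computers and phones."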
nlp = pipeline("question-answering", device="cuda")
result = nlp(question=question, context=context)
print(result)
# Summarization
nlp = pipeline("summarization", device="cuda")
text = """
Some text to summarize
"""
summary = nlp(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
# Translation
translator = pipeline("translation_en_to_hi", model="facebook/m2m100_418M", device="cuda")
result = translator("I am translating from English to Hindi.")
print(result[0]['translation_text'])
Diffusion Pipelines
Basic Pipeline Usage
from huggingface_hub import login
from IPython.display import display
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
import torch
# Login to Huggingface
login(hf_token, add_to_git_credential=True)
# Load a pipeline for the desired task
# NVIDIA's GPU technology is cuda. The .to("cuda") is used to run the pipeline on the GPU
diffusion_pipe = DiffusionPipeline.from_pretrained("diffusion_model_id", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")
# Using AutoPipelineForText2Image
diffusion_pipe = AutoPipelineForText2Image.from_pretrained("diffusion_model_id", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")
# Use the diffusionpipeline to generate an image
prompt = "Prompt for image"
image = diffusion_pipe(prompt=prompt, num_inference_steps=5, guidance_scale=0.0).images[0]
display(image)
# Adding a refiner to the pipeline
# The refiner reuses the second text encoder and the VAE of the base pipeline (diffusion_pipe) to save memory
refiner_pipe = DiffusionPipeline.from_pretrained("refiner_model_id", text_encoder_2=diffusion_pipe.text_encoder_2, vae=diffusion_pipe.vae, torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")
# Use both diffusion pipeline and refiner pipeline to generate an image
n_steps = 40
high_noise_frac = 0.8
prompt = "Prompt for image"
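# The base pipeline handles the high-noise part of the denoising and returns latents instead of a decoded image
latents = diffusion_pipe(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images
# The refiner takes the latents and finishes the remaining part of the denoising
image = refiner_pipe(prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac, image=latents).images[0]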
display(image)
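To make the effect of n_steps and high_noise_frac concrete, the split between the two pipelines can be estimated with simple arithmetic. This is a rough sketch: split_steps is a hypothetical helper, and diffusers derives the real cut-off from the scheduler's timesteps, so the counts should be read as an approximation.

```python
def split_steps(n_steps: int, high_noise_frac: float):
    """Approximate how many denoising steps the base and the refiner each run."""
    base_steps = round(n_steps * high_noise_frac)
    return base_steps, n_steps - base_steps


base_steps, refiner_steps = split_steps(40, 0.8)
print(base_steps, refiner_steps)
```

With the values used above (40 steps, 0.8), the base pipeline covers roughly 32 steps and the refiner the remaining 8.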
Tokenizers
- Translates text to tokens and token ids (See Tokenization)
- Contains a vocab with
  - tokens and token ids
  - special tokens that mark the beginning of the sequence and the end of the sequence
    - these help the model understand where a sequence starts and ends
- Uses the encode and decode methods to translate between text and token ids
  - encode: Translates text to token ids
  - decode: Translates token ids back to text
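The ideas above (a vocab, special tokens, encode and decode) can be illustrated with a toy tokenizer. This is a simplified sketch, not how Huggingface tokenizers are implemented: ToyTokenizer and the token names <bos>, <eos> and <unk> are made up for illustration, and it splits on whitespace whereas real tokenizers work on subword units.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocab (real tokenizers use subwords)."""

    def __init__(self, words):
        # Special tokens get the first ids; the names here are hypothetical.
        self.bos, self.eos, self.unk = "<bos>", "<eos>", "<unk>"
        tokens = [self.bos, self.eos, self.unk] + sorted(set(words))
        self.token_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.special = {self.bos, self.eos}

    def encode(self, text):
        # Unknown words map to <unk>; the sequence is wrapped in <bos> ... <eos>.
        unk_id = self.token_to_id[self.unk]
        ids = [self.token_to_id.get(w, unk_id) for w in text.lower().split()]
        return [self.token_to_id[self.bos]] + ids + [self.token_to_id[self.eos]]

    def decode(self, token_ids, skip_special_tokens=False):
        tokens = [self.id_to_token[i] for i in token_ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in self.special]
        return " ".join(tokens)


tokenizer = ToyTokenizer(["i", "love", "this", "product"])
ids = tokenizer.encode("I love this product")
print(ids)                                              # [0, 3, 4, 6, 5, 1]
print(tokenizer.decode(ids))                            # <bos> i love this product <eos>
print(tokenizer.decode(ids, skip_special_tokens=True))  # i love this product
```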
Basic Tokenizer Usage
from huggingface_hub import login
from transformers import AutoTokenizer
import torch
# Login to Huggingface
login(hf_token, add_to_git_credential=True)
# Load a tokenizer for the desired task
tokenizer = AutoTokenizer.from_pretrained("tokenizer_model_id", trust_remote_code=True)
# Use the tokenizer to encode and decode text
text = "Text to encode and decode"
token_ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(token_ids, skip_special_tokens=True) # skip_special_tokens=True removes special tokens
decoded_tokens = tokenizer.batch_decode(token_ids) # on a flat list of ids, batch_decode decodes each id separately and returns one string per token
How LLMs predict chat outputs
- Chat models expect the input prompt in a specific format
  - The LLM does not know how to handle the messages list format we are familiar with
  - LLMs are models that take a sequence of numbers and predict the probability of the next number
- The tokenizer has a utility method apply_chat_template that converts the messages list format into the right input prompt for the LLM
  - The right prompt format contains a sequence of words with special tags that separate the System, User and Assistant prompts
  - Then the words are broken down into tokens
  - Then the tokens are replaced with token ids
- The input to an LLM is this sequence of token ids
- The output is the probability distribution of the next token id to follow this input
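The last point can be made concrete with a small worked example. A common way to get a probability distribution from a model's raw scores is the softmax function; the sketch below assumes a hypothetical 4-token vocabulary with made-up scores, and picks the most likely next token id (greedy choice).

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a 4-token vocabulary, indexed by token id.
scores = [2.0, 0.5, -1.0, 0.1]
probs = softmax(scores)

# Greedy choice: the token id with the highest probability.
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], next_token_id)
```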
# Convert message to right input prompt for the LLM
# The add_generation_prompt=True parameter ensures that the LLM generates a response to the question, instead of just predicting how the user prompt continues.
# See https://huggingface.co/docs/transformers/main/en/chat_templating#addgenerationprompt
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# With the default tokenize=True, the same call returns the tokenized form of the formatted prompt
tokenized_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
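What a chat template does can be sketched in plain Python. This is a toy illustration only: toy_chat_template and the tags <|system|>, <|user|>, <|assistant|> and <|end|> are made up here, and each real model defines its own template and special tags.

```python
def toy_chat_template(messages, add_generation_prompt=False):
    """Flatten a messages list into one prompt string using made-up role tags."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of
        # continuing the user's text.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell me a joke"},
]
prompt = toy_chat_template(messages, add_generation_prompt=True)
print(prompt)
```

The resulting string is what would then be broken into tokens and replaced with token ids.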
Quantization
- Huggingface uses a library called bitsandbytes for Quantization
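The reason to quantize is memory: fewer bits per weight means a smaller model in memory. A back-of-the-envelope estimate is shown below; weight_memory_gb is a hypothetical helper, the 1-billion-parameter count is an assumed round number, and the estimate covers the weights only (it ignores activations and the small overhead of quantization constants).

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 1_000_000_000  # hypothetical 1-billion-parameter model
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(num_params, bits):.2f} GB")
```

Under these assumptions, going from 32-bit to 4-bit weights cuts the estimate from 4.00 GB to 0.50 GB.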
Sample Model with Quantization
# accelerate is a companion library that allows LLMs to run effectively on GPUs
!pip install -q --upgrade bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch
messages = [
{"role": "user", "content": "Tell me a joke"}
]
# Define the quantization config so that the model is loaded in 4-bit precision and uses less memory
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4"
)
# Load a tokenizer for the desired task
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# Set the special token for padding to the end of sentence token. If this is not set, the processing may result in errors
tokenizer.pad_token = tokenizer.eos_token
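# The parameter return_tensors="pt" converts the token ids to PyTorch tensors that are ready to be run on the GPU
# return_dict=True returns a dict-like object so that inputs['input_ids'] below works; add_generation_prompt=True makes the model answer as the assistant
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_dict=True, return_tensors="pt").to("cuda")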
# Initialize the model
# Huggingface uses 'CausalLM' in the names for models that generate content
# Downloads the full precision model and then reduces the precision based on the quant config
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto", quantization_config=quant_config)
# Run the model and print the output tokens
outputs = model.generate(inputs['input_ids'], max_new_tokens=80)
outputs[0]
# We can add a TextStreamer to stream the output tokens
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs['input_ids'], max_new_tokens=80, streamer=streamer)
# print the decoded output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))