Skip to content

Open Source GenAI

Huggingface

  • Two API levels:

    • Pipelines: Higher level APIs to cary out standard tasks. These can primarily be of two types:
      • Transformation: For out of the box inference tasks
        • Sentiment Analysis, Classifications, Named Entity Recognition (NER), Q&A, Summarization, Translation
      • Diffusion: For out of the box audio, video and image generation
    • Tokenizers and Models: Lower level APIs providing more control

Pipelines

Transformation Pipelines

  • Accepts the pipeline type, a model id and a device as parameters
    • Pipeline types: One of the supported types
      • Examples: sentiment-analysis, text-classification, ner, question-answering, summarization, translation, text-generation etc.
    • Model id: The id of the model to use
      • Optional
      • Can be a model id from Huggingface
      • Huggingface will select the model that's the default for the task if not specified
    • Device: The device to use for inference
      • Specify "cuda" for the device to use an NVIDIA GPU
      • Specify "mps" on a Mac
  • Returns a pipeline object that can be used to perform inference
    • Can be used to perform inference on a single text or a list of texts

Basic Pipeline Usage

from huggingface_hub import login
from transformers import pipeline

# Login to Huggingface
login(hf_token, add_to_git_credential=True)

# Load a pipeline for the desired task
nlp = pipeline("pipeline_type for desired task", model="model_id", device="device")

# Use the pipeline to analyze a text
result = nlp("text")
print(result)
Sample Pipeline Code

# Sentiment Analysis
nlp = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device="cuda")
result = nlp("I love this product!")
print(result)
# Named Entity Recognition (NER)
nlp = pipeline("ner", device="cuda")
result = nlp("I love this product!")
for entity in result:
    print(entity)
# Question Answering
question="What ia apple?"
context="Explain in context of computers"

nlp = pipeline("question-answering", device="cuda")
result = nlp(question=question, context=context)
print(result)
# Summarization
nlp = pipeline("summarization", device="cuda")
text = """
Some text to summarize
"""
summary = nlp(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
# Translation
translator = pipeline("translation_en_to_hi", model="facebook/m2m100_418M", device="cuda")
result = translator("I am translating from English to Hindi.")
print(result[0]['translation_text'])
# Text Classification
classifier = pipeline("zero-shot-classification", device="cuda")
result = classifier("Apple is looking at buying U.K. startup for $1 billion", candidate_labels=["tech", "sports", "movies"])
print(result)
# Text Generation
generator = pipeline("text-generation", device="cuda")
result = generator("If there is a will, there")
print(result[0]['generated_text'])

Diffusion Pipelines

Basic Pipeline Usage

from huggingface_hub import login
from IPython.display import display
from diffusers import DiffusionPipeline, AutoPipelineForText2Image
import torch

# Login to Huggingface
login(hf_token, add_to_git_credential=True)

# Load a pipeline for the desired task
# NVIDIA's GPU technology is cuda. The .to("cuda") is used to run the pipeline on the GPU
diffusion_pipe = DiffusionPipeline.from_pretrained("diffusion_model_id", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")

# Using AutoPipelineForText2Image
diffusion_pipe = AutoPipelineForText2Image.from_pretrained("diffusion_model_id", torch_dtype=torch.float16, use_safetensors=True, variant="fp16").to("cuda")

# Use the diffusionpipeline to generate an image
prompt = "Prompt for image"
image = diffusion_pipe(prompt=prompt, num_inference_steps=5 guidance_scale=0.0).images[0]
display(image)

# Adding a refiner to the pipeline
refiner_pipe = DiffusionPipeline.from_pretrained("refiner_model_id", text_encoder_2=base.text_encoder_2, vae=base.vae, torch_dtype=torch.float16, use_safetensors=True, variant="fp16",).to("cuda")
diffusion_pipe = diffusion_pipe.to_refiner(refiner_pipe)

# Use both diffusion pipeline and refiner pipeline to generate an image
n_steps = 40
high_noise_frac = 0.8
prompt = "Prompt for image"
image = diffusion_pipe(prompt=prompt, num_inference_steps=n_steps, denoising_end=high_noise_frac, output_type="latent").images[0]
image = refiner_pipe(prompt=prompt, num_inference_steps=n_steps, denoising_start=high_noise_frac, image=image).images[0]
display(image)

Tokenizers

  • Translates text to tokens and token ids (See Tokenization)
  • Contains a vocab with
    • tokens and token ids
    • special tokens such as (beginning of the sequence) and (end of the sequence)
      • helps understand where a sequence starts and ends
  • Uses encode and decode methods to translate text to tokens and token ids
    • encode: Translates text to tokens and token ids
    • decode: Translates tokens and token ids to text

Basic Tokenizer Usage

from huggingface_hub import login
from transformers import AutoTokenizer
import torch

# Login to Huggingface
login(hf_token, add_to_git_credential=True)

# Load a tokenizer for the desired task
tokenizer = AutoTokenizer.from_pretrained("tokenizer_model_id", trust_remote_code=True)

# Use the tokenizer to encode and decode text
text = "Text to encode and decode"
token_ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(token_ids, skip_special_tokens=True) # skip_special_tokens=True removes special tokens
decoded_tokens = tokenizer.batch_decode(token_ids)  # batch_decode decodes a list of tokens
How LLMs predict chat outputs
  • We use a specific format as the input prompt for Chat models
  • The LLM does not know how to handle the messages list format we are familiar with
    • LLMs are Data Science models that take a sequence of numbers and predict the probability of the next number
  • The tokenizer has a utility method apply_chat_template that converts the messages list format into the right input prompt for the LLM
    • The right prompt format will contain a sequence of words with special tags to separate the System, User, Assistant prompt
  • Then the words are broken down into tokens
  • Then the tokens are replaced with token ids
  • The input to an LLM is this sequence of Token IDs
    • The output is the probability distribution of the next Token ID to follow this input
# Convert message to right input prompt for the LLM 
# The add_generation_prompt=True parameter ensures that the LLM generates a response to the question, instead of just predicting how the user prompt continues.
# See https://huggingface.co/docs/transformers/main/en/chat_templating#addgenerationprompt
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# View tokenized right input prompt format
prompt = tokenizer.apply_chat_template(messages)

Quantization

  • Huggingface uses a library called bitsandbytes for Quantization

Sample Model with Quantization

# accelerate is a companion library that allows LLMs to run effectively on GPUs
!pip install -q --upgrade bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer
import torch

messages = [
    {"role": "user", "content": "Tell me a joke"}
]

# define the quantization config to load the model into memory and use less memory
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

# Load a tokenizer for the desired task
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Set the special token for padding to the end of sentence token. If this is not set, the processing may result in errors
tokenizer.pad_token = tokenizer.eos_token
# The parameter return_tensors="pt" is used to convert the tokens to pytorch type data structures that are ready to be run by the GPUs
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

# Initialize the model
# Huggingface uses 'CausalLM' in the names for models that generate content
# Downloads the full precision model and then reduces the precision based on the quant config
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto", quantization_config=quant_config)

# Run the model and print the output tokens
outputs = model.generate(inputs['input_ids'], max_new_tokens=80)
outputs[0]

# We can add a TextStreamer to stream the output tokens
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs['input_ids'], max_new_tokens=80, streamer=streamer)

# print the decoded output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))