Hugging Face - AI

Source code: https://github.com/jkmaina/huggingface_book

1. Machine Learning

1.1. Supervised Learning

Data has label.
Example:
- Predict house prices.
- Classify email is spam or not.
- Recoginize handwritten digits.

1.2. Unsupervised Learning.

Group data by patterns
Example:
- Classify customer to segments.
- Detect anomolies in financial transaction.
- Reduce dimensions of data for easier visualization.

1.3. Reinforcement Learning

Learning by rewards and penalties
Example:
- Robotics
- Game development, AI agents.
- Dynamic pricing system.

2. Deep learning

2.1. Neural networks work

Appky in image recognition, natural language processing, speech synthesis.
Contains:
- Input layer: input data.
- Hidden Layers: multiple mathematic computations to identify patterns.
- Output Layer: produce the result, classify label or numerical value.
Activation functions: determine how signals are pased through the network
- ReLU: number/linear.
- Sigmoid: Binary classification problem 0 or 1.
- Softmax statistic tasks from 0 - 1.
Training neural network: backpropagation.

3. Machine Learning Programming

Using the library scikit-learn.

3.1. Dataset Utilities

load: load smallm built-in datasets.
fetch: fetch larger datasets from external sources.

3.2. Preprocessing tools

StandardScaler: scale missing data to average, standar deviation.
MinMaxScaler: Change numbers to fit within a range.
OneHotEncoder: Covert text data into numbers by change it to a column 0 or 1.

3.3. Paramameters:

param_grid: n_estimators, max_depth, min_samples_split for Grid Search.
Machine learning models: LinearRegression, LogisticRegression, RandomForestClassifier (Decision Tree), SupportVectorMachines (Hyperlane),

3.4. Key metrics

Accuracy: If a model makes 100 predictions and 90 are correct, the accuracy = 90%.
Precision: If the model predicts 20 people have a disease, but only 15 really do, precision = 15/20 = 75%.
Recall: If 30 people actually have a disease, and the model correctly identifies 25, recall = 25/30 = 83%.
F1 Score: Harmonic mean of Precision and Recall, If Precision = 75% and Recall = 83%, then F1 ≈ 79%.
Support: 70 healthy, 30 sick → Support for “healthy” = 70, Support for “sick” = 30.

4. Building Content Moderation System

4.1. Decision Tree (Text/Structure Data)

The model is work well for text but not image/audio.
Model: DecisionTreeClassifier, RandomForestClassifier.

4.2. CNN (Image Data)

Split the image to small parts, analyze them and stick everything together.
Encoder: AutoFeatureExtractor, AutoModelForImageClassification
Model: google/vit-base-patch16-224.

4.3. RNN (Sequential Data)

Remember past input to make sense the current one.
Encoder: AutoTokenizer
Model: AutoModelForCausalLM(“gpt2”), know how to continue the sentences.

4.4. Transformer (NLP and Vision)

Using tensorflow, transformer.
Encoder: BertTokenizer, bert-based-uncased.
Model: TFBertModel, bert-base-uncased.

5. LLM Pipelines

5.1. Text Classification

Classify text is positive or negative.
Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english

5.2. Summarization

Summary long text to concise summaries.
Model: facebook/bert-lage-cnn

5.3. Other models

facebook/bart-large-cnn: Best all-around performance
google/pegasus-xsum: Best for very concise summaries
t5-small: Best if you want something fast and lightweight
sshleifer/distilbart-cnn-12-6: Good balance of speed and quality

5.4. Fine-tuning a Pre-trained Model

Prepare the Dataset: Convert your data into a format compatible with Hugging Face.
Load the Pre-trained Model: Choose a model suited for your task.
Define the Trainer: Set up training parameters, including evaluation metrics, e.g. learning rates, evaluation strategy => save the model to ./my_model.

6. Hugging Face Repo

Push the model after trained to Hugging Face as Github.

Multi-modal Transformers process and combine data from different modalities, such as text and images.
Example: image captioning, visual question answering, text-to-image alignment.

8. Zero-Shot Classification

Zero-shot classification is like asking a person to categorize something they’ve never seen before by using common sense.
It enables models to categorize text without being specifically trained for that task.
Model: zero-shot-classification.
Tips: clear categories, enough context, start simple.

9. Fine-tuning for domain adaption

Insert the dataset to models => train.
Expose it to the API.

10. Few-shot learning

Few-shot learning uses pre-trained models to adapt to new tasks with minimal labeled data.
The data is lacked of label.
Encoder: AutoModelForSeq2SeqLM.
Model: t5-small

11. Metrics

11.1. Classification Metrics

Accuracy: Good when classes are balanced and all errors are equally costly.
Precision: Important when false positives are costly.
Recall: Important when false negatives are costly.
F1-Score: Good when you want a balance between precision and recall, and when classes are imbalanced.

11.2. Text summarization and translation (Generation Metrics)

BLEU: how closely the generated text matches the reference text.
ROUGE: Compares overlaps between generated text and reference text.
METEOR: Evaluate semantic similarity.

11.3. Regression & Predict sequences

Mean Squared Error (MSE): For regression task.
Perplexity: Check the predict sequences text.

12. Domain-Specific Evaluations

Finance: sentiment models based on financial impact (e.g. market sentiment classification), e.g. model + fine-tuning: google/flan-t5-base
Healthcare: medical text models classify conditions and diseases.
E-comerece: Analyze user behaviors and recommendations.

=> Compare multi-model to test precision.

13. Top models at December 2024

mistralai/Mistral-Large-Instruct-2411

Architect: transformer (NLP, Computer Vision) + 70 billion parameters.

Use cases: Exeptional for text generation and intruction-following tasks, chatbot.

Qwen/Qwen2.5-72B

Architect: Transformer + 72 billion parameters

Use cases: Machine Translation, text summarization.

Qwen/Qwen2.5-32B

Architect: Transformer + 32 billion parameters

Use cases: Sumarization, Question answerng, conversational AI.

Qwen/Qwen2.5-14B

Architect: Transformer + 14 billion parameters

Use cases: Chatbots and medium-scale applications.

Qwen/Qwen2.5-7B

Architect: Transformer + 7 billion parameters

Use cases: chatbots, smaller-scale applications.

google/gemma-2-2b-jpn-it

Architect: Transformer + 2 billion parameters

Use cases: Multilingual content creation and translation.

google/flan-t5-large

Architect: Fine-tuned variant with T5 model.
Use cases: Academic research.

01-ai/Yi-1.5-9B

Architecture: Transformer + 9 bilition parameters.
Use cases: Sentiment analysis, customer insights,

Qwen/Qwen1.5-110B

Architecture: Transformer + 110 bilion parameters
Use case: Cross-lingual applications.

IntervitensInc/internlm2_5-20b-llamafied

Architecture: 20 billion parameter model optimized for structured tasks.
Use cases: Document summarization, data extraction.

Key insights:

Scalability Matters: Larger models like Qwen/Qwen2.5-72B and Qwen1.5- 110B deliver high performance but require significant computational resources.
Specialization Adds Value: Models like google/gemma-2-2b-jpn-it excel in niche domains, proving the importance of task-specific tuning.
Efficiency is Key: Smaller models, such as Qwen/Qwen2.5-7B, balance performance and resource usage, making them suitable for edge applications.

14. Deployment and Scaling

14.1. Deployment

Direct API deployment.
Containered deployment.
Cloud deployment: AWS SageMaker, Google Vertex AI.
On-premise deployment

14.2. Production Deployment

Deploy to virtual machine or containerization.

14.3. Deploying Models with Serverless Architectures

Deploy model to AWS SageMaker.
Deploy AWS Lambda function

14.4. Accelerating Inference with GPUs

Increase the speed of inference time.
Change the model to ONNX format, to deploy to GPU Nvidia, install NVIDIA Triton

14.5. Use case product:

Customer support chatbots.
Real-time social media monitoring.
Financial risk analysis.

15. Optimize model performance

15.1. Quantization:

Reduce from 32-bit precision to 16-bit, 8-bit

15.2. Model Pruning

Prunning without losing significant loss of accuracy.

15.3. Distillation

Use larger model (Teacher) to train a smaller model (student)

15.4. Distributed Training

Data paralellism: processed in multiple GPUs, and result are aggregated.
Model parallelism: Split across multiple devices, and each device handling a subset of the layers.

15.5. Cloud Solution

AWS SageMaker
Google Vertex AI.

15.6. Monitoring

Horizontal Scaling: K8s.
Monitoring Tool: Grafana.
Auto scaling: CPU/GPU ultilization.

15.7. Model Prunning

Remove weights: remove in smaller features, assuming they contribute less to the model predictions.
Remove layers in model architecture.
Dynamic prunning based on runtime conditions.

16. NLP

16.1. Text Classification

16.1.1. Use cases

Sentiment Analysis: Determine comment is positive, negative or neutral.
Spam detection: detect an email is spam or not.
Topic categorization: classify news is politics, sports or technology.

16.1.2. Data Processing

Text processing

Tokenization: breaking text into smaller chunk/units.
Cleaning: clean noise data.
Normalization: standard it to a consistent format.

Feature extraction

Embedding text to vector.
Capturing sematic meaning and contexual relationships.
Using techniques like: TF-IDF, Word2Vec, BERT Embedding.

16.1.3. Model selections

Choose models pre-trained on similar domains.
Consider computational requirements and latency.
Balance accurancy with resource contraints.

16.1.4. Sample models

Model: distilbert-base-uncased-finetuned-sst-2-english

16.2. Named Entity Recognization (NER)

16.2.1. Use cases

Information extraction: Extract structured data from unstructured text.
Search engines: enhance search results with relevant entities.

16.2.3. References use case

Persons (PER): name of individuals.
Organization (ORG): companies, institutions, agencies.
Locations (LOC): Cities, countries, geographic features.
Date/Time expressions (DATE): Temporal references
Quantities (QUANTITY): Numerical expressions, measurements
Miscellaneous (MISC): Other named entities

16.2.4. Data Processing

Text processing

Tokenization and sentence segmentation
Part-of-speech tagging
Dependency parsing

Entity Detection

Boundary detection
Entity classification.
Contextual understanding

16.2.5. Sample models

Model: dbmdz/bert-large-cased-finetuned-conll03-english
Example: text = “Barack Obama was born in Hawaii and became the 44th President of the United States.”

=> Extracted Information: Person (Barack Obama), Location (Hawaii, United States), and Potentially numbers (44th)

16.2.6. Use cases

Creates a class for analyzing news articles
Implements confidence thresholds for entity filtering
Tracks entity frequencies across multiple articles
Categorizes entities by type
Provides summary statistics
Handles multiple articles in a batch
Uses type hints for better code clarity
Implements error handling and entity validation

16.3. Question Answering

16.3.1. Use cases

Chatbots: provide accurate answers to user queries based on context.
Customer support: answer for common questions.
Information retrieval: find specific information in documents.
Educational Tools: Supporting learning with Q & A.

16.3.2. Data processing

Question Processing
- Question understanding.
- Query formulation
- Answer type prediction
Context analysis
- Text pre-processing
- Relevant passasge idenfitication.
- Answer extraction.

16.3.3. Sample models

Model: distilbert-base-cased-distilled-squad

Example:

context = “Hugging Face is a company that creates tools for machine learning.”
question = “What does Hugging Face create?”

=> Answer: “tool”

16.4. Retrieval Augmented Generation (RAG)

16.4.1. Context

Use to dynamically fetch relevant information in Internet, or multiple datasources in run time.

16.4.2. Features

Retrieve Context Dynamically: Automatically fetch relevant information when responding to queries
Generate Synthesized Information: Combine retrieved content with generative AI for better contextual answers.
Scale to Larger Knowledge Bases: Process and handle vast document collections effectively.
Maintain Conversational Context: Enable follow-up questions while maintaining the thread of the conversation.
Update Dynamically: Keep knowledge bases updated to ensure relevance over time.

17. Computer Vision

17.1. Image Classification

Facial Recognition
Product categories by image
Medical imaging.

17.2. Architecture

VIT (Vision Transformer)
Object Detection:

YOLO: fast for each speedn and efficiency.
Faster R-CNN: high accurracy but more computational resource.
SSD (Single shot multibox detector): balance speed and accuracy.

Image Generation:

GANs (Generative Adversarial Networks): create realistic image.
Diffusion models: transform noise into detailed image, make the image cleaner.
DALL-E: combing vision and NLP => Gen image and textual description.

Multi-modal models.

Image Captioning: Generating captions for image.
Visual Question Answering (VQA): Answer questions based on image context.
CLIP (Contrstive Language-Image Pretraining): matching text with image for retrieval and understanding tasks.

17.3. ViT (Vision Transformer)

Classify 10 images using ViT.

Model: google/vit-base-patch16-224

17.4. Object Detection

Autonomous Vehicles: Detecting pedestrians, vehicles, and obstacles
Retail and Inventory: Manage inventory.
Security: Manage unauthorized activities.

Model: facebook/detr-resnet-50

17.5. Image Generation

Art and Design: Generating creative artwork.
Content Creation: Producing images for marketing or entertainment.
Synthetic Data: Generating training data for machine learning.

Model: CompVis/stable-diffusion-v1-4, diffusers library

Image Captioning: Generating descriptive captions for images.
Visual Question Answering (VQA): Answering questions about images.
Content Moderation: Identifying inappropriate content in multimedia.

Model: Salesforce/blip-image-captioning-base

18. Audio Processing

18.1. Speech Recognition

Virtual Assistants
Transciption services.
Accesibility tools

=> Translate speech to text

Model: openai/whisper-base

18.2. Text-to-speech

Accesibility Tools
Interactive System: Add voice input to chatbots.
Entertainment: generate audio for games.

Model: suno/bark-small

18.3. Audio Classification

Environment sounds: apply in filter noise in Airpod.
Music Genre classification: classify songs by their genres.
Speech emotion recognition: detect emotions from speech.

Model: MIT/ast-finetuned-audioset-10-10-0.4593

18.4. Voice CLoning

Personal assistants: creating assistants with user-specific voices.
Entertainment: replica voices for media or games.
Accesibility: Enhance voice restruction for users with speech impairment.

Model: tts = TTS(“tts_models/en/vctk/vits”, gpu=device==”cuda”)

19. Create dataset

19.1. Create datasets

19.2. Preprocessing and Transforming datasets

19.3. Normalize datasets

normalized_value = (x - min(x))/ (max(x) - min(x))

19.4. Upload datasets

19.5. Visualizae datasets

20. Top datasets in Hugging Face

20.1. GLUE

GLUE is a benchmark dataset designed to evaluate the performance of NLP models on a variety of language understanding tasks, including sentiment analysis, sentence similarity, and natural language inference.

20.2. IMDB Reviews

Customer Feedback Analysis: Analyzing product reviews.
Content Moderation: Identifying offensive or negative reviews.
Sentiment Analysis: Building models to classify text as positive or negative.

20.3. SQuAD (Stanford Question Answering Dataset)

SQuAD is a dataset for training and evaluating question-answering models.

20.4. MNIST (Modified National Institute of Standards and Technology)

MNIST is a collection of handwritten digits commonly used for image classification.

20.5. CIFAR-10

CIFAR-10 contains images categorized into 10 classes, such as airplanes, cats, and cars.

20.6. Common Voice

Common Voice is a crowd-sourced dataset of speech recordings for various languages.

20.7. LibriSpeech

LibriSpeech contains audiobook recordings in English, widely used for ASR.

20.8. COCO (Common Objects in Context)

COCO provides richly annotated images for object detection, segmentation, and captioning.

20.9. WikiText

WikiText contains high-quality text extracted from Wikipedia articles.

20.10. SNLI (Stanford Natural Language Inference)

SNLI is a dataset for natural language inference, categorizing sentence pairs as entailment, contradiction, or neutral.

21. Hugging Face Space

Hugging Face Spaces are cloud-hosted environments where you can deploy and showcase machine learning models, applications, and demos.
Demo frontend: Streamlit, Gradio
Backend: Docker

22. Hugging Face Inference API

Provide easy to use API to demo it => but it charge the token usage.

=> You can create access token free.

Tasks:
- Text classification
- Summarization
- Translation
- Question answering
- Image classification
- Object detection
- Audio transcription

23. Deploy Model to Hugging Face Inference API

23.1. Batch Processing

Process multiple inputs in a single request:

23.2. Deploy model to hugging face

Fine-tune models on your dataset, upload them to the Hugging Face Hub, and access them via the Inference API.

23.3. Integration with Web Applications

Integrate with web application using Flask.

23.4. Sample projects

Building a Text Summarization App
Multi-Model Pipelines: multiple models together to perform complex tasks.
Batch Inference: reduce network calls.
Create a Python Lambda function to use the Hugging Face Inference API.
Monitoring and Logging API Calls

24. Diffusers

Diffusion models work by gradually adding noise to data (forward diffusion) and then learning to reverse this process (reverse diffusion).

25. Low-Rank Adaptation (LoRA) + Diffusions

A training technique to fine-tune large models like Diffusers

25.1. SDXL Base Lightning

Reduces inference time by up to 3-4x
Maintains quality comparable to base SDXL

25.2. Pixel Art XL

Converting images into pixel art style

25.3. LCM LoRA

Fast inference with Latent Consistency Models

25.4. ToonYou

Anime/cartoon style conversion

25.5. Juggernaut XL

Enhanced detail and realism

25.6. Toy Face

Toy/plastic figure style conversion

25.7. Detail Tweaker XL

Enhanced image detail and clarity

25.8. DreamShaper XL

Overall quality improvement

25.9. Add Detail

Detail enhancement for specific areas

25.10. Realistic Vision

Photorealistic image generation

26. KolorsTryOn and Flux:

Flux is a series of advanced text-to-image models
Model to support full-body image of yourself after trying on clothes.

27. SOTA

This is a model has set a new record or is among the best available for solving a specific problem => great of the time model.

27.1. Llama Models

Developed by Meta.
Suitable: text generation, translation.

27.2. Mistral

Developed by Mistral AI team.
Suitable: Question-answering, text generation.

27.3. Qwen and Coding Models

Suitable: Coding algorithms

28. Domain-specific models

28.1. Instruction-Pretrained/Finance-Llama3-8B

Application: Credit Risk Assessment for Loan Applications

28.2. Instruction-Pretrained/Medicine-Llama3-8B - Specialized AI for Healthcare

Application: Diagnostic Support, Research Summarization, Patient Education, Medical Training

28.3. Law-AI/InLegalBERT - Specialized AI for Legal Applications

Application: Legal Document Classification, Legal Question Answering.

28.4. Chronos-Bolt-Base

Demand Forecasting
Financial Predictions
Anomaly Detection
Climate and Weather Analysis
Multivariate Forecasting

29. JavaScript and Transformer

Library JavaScript: transformer.js, allow to call with model in Hugging Face by model name.
Use case:
- Text classification: ‘distilbert-base-uncased-finetuned-sst-2-english’.
- Question and answer: ‘deepset/roberta-base-squad2’

30. Text to video models

30.1. THUDM/CogVideoX-5b

Model to train text to video.

30.2. ByteDance/AnimateDiff-Lightning

Generating animated sequences from textual prompts with a focus on character and object motion.

30.3. genmo/mochi-1-preview

A text-to-video model that emphasizes stylized and creative video outputs, blending artistic styles with descriptive prompts.

31. Leaderboards

31.1. Text Classification

Task: Classify text into predefined categories.
Metrics: Accuracy, F1 score.
Top Models:
- distilbert-base-uncased-finetuned-sst-2-english
- roberta-large-mnli

31.2. Question Answering

Task: Answer questions based on a given context.
Metrics: Exact Match (EM), F1 score.
Top Models:
- deepset/roberta-base-squad2
- bert-large-uncased-whole-word-masking-finetuned-squad

31.3. Summarization

Task: Generate concise summaries for long documents.
Metrics: ROUGE-L, BLEU.
Top Models:
- facebook/bart-large-cnn
- t5-small

31.4. Machine Translation

Task: Translate text between languages.
Metrics: BLEU, METEOR.
Top Models:
- Helsinki-NLP/opus-mt-en-fr
- facebook/wmt19-de-en

32. How Leaderboards Are Evaluated

Datasets Models are evaluated on standardized datasets, such as:

GLUE: For text classification.
SQuAD: For question answering.
CNN/Daily Mail: For summarization.

Metrics Each leaderboard specifies metrics that reflect model performance:

Accuracy: Percentage of correct predictions.
F1 Score: Harmonic mean of precision and recall.
BLEU: Measures translation quality.
ROUGE: Evaluates summary relevance.

Evaluation Protocols

Reproducibility: Models must be evaluated in a consistent environment.
Fair Comparison: Benchmarks ensure fair comparisons across models.

33. Enterprise Options on Hugging Face

Interence Endpoints: Scalability, Security.
Private models.
AI workflows: AI Workflows provide a structured approach to building and deploying machine learning pipelines.
Collaboration tools.
Enterprise support.

34. Open-source models

Open-Source Models and Datasets
Collaborative Tools
Funding with NVDIA, Microsoft, Google.

Last Updated On September 2, 2025