Data-efficient foundation models
Foundation models are notorious for being very data hungry: Segment Anything was trained on more than 1 billion masks, and CLIP on 400 million image-text pairs.
However, it is possible to train useful foundation models with far fewer labels by applying a number of very pragmatic tricks that use preexisting models and available data as effectively as possible. These tricks are powering a new generation of data-efficient foundation models that bring recent innovations from social platforms and web-scale training into real-world domains.
- Foundation model overview
- Foundation model implies multi-task: usable across different real-world workflows, by different people trying to do different things
- Semantic multi-task requires an open vocabulary: arbitrary natural language can be used to describe objects
- Multi-task requires some degree of human direction about which task to focus on
- Encode the task into the model input (see the prompting sketch after this list)
- Segment Anything: point/box prompt as decoder input
- CLIP: can choose either the image or the text to be the prompt
- Conditional diffusion model: supply a masked image, get back the full one
- Chatbot: encode the prompt into a token sequence
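As a concrete illustration of encoding the task into the input, here is a minimal sketch of point prompting with the original Segment Anything codebase (facebookresearch/segment-anything); the checkpoint path, input image, and click coordinates are placeholder assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a ViT-B SAM model; the checkpoint path is a local placeholder.
sam = sam_model_registry["vit_b"](checkpoint="checkpoints/sam_vit_b.pth")
predictor = SamPredictor(sam)

# image: an HxWx3 uint8 RGB numpy array (assumed to be loaded already).
predictor.set_image(image)

# The task is carried entirely by the prompt: one foreground click.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return several candidate masks
)
```

Because the image embedding is computed once in `set_image` and reused across prompts, interactive correction with new clicks is cheap.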
Key ideas include:
- Transfer learning from social media foundation models
- Focus on diversity over quantity
- Smaller model architectures (specialized domains are a big step down in data diversity vs. natural images, so less capacity suffices)
- Richer prompting to allow manual error correction
Examples include:
- Medical foundation models
- PLIP:
- 208,414 image-text pairs scraped from Twitter via pathology hashtags (the OpenPath dataset)
- Pathologists collaborate by posting unusual or interesting cases on Twitter to discuss what they might be
- Critical dependency on transfer learning from CLIP
- Extreme diversity: mostly rare and interesting cases make their way to Twitter
- A basis for virtually all open-vocabulary pathology models
- Virchow2: self-supervised feature encoding for pathology
- Focus on dataset diversity
- Pathology-specific loss function
- MedSAM:
- 1,570,263 image-mask pairs
- 10 imaging modalities and over 30 cancer types
- CT images
- MRI images
- Dermatology photographs
- H&E pathology scans
- Smaller models (ViT-B is the “big” model, followed by lots of micro-models)
- Focus on diversity over volume
- ScribblePrompt
- Rich prompts (clicks, boxes, scribbles) give better control when the model is not as smart
- Needs a somewhat more expensive decoder
- Medical LLMs
- Competitive and rich field, with proposed models from many LLM vendors (Google, Meta, etc.)
- Lots of scrapable medical data in pure-text format, though several orders of magnitude less than general text data
- Transfer learning from regular LLMs
- Medical diffusion models
- No clear winning model
- Serve classic tasks such as anomaly detection, nearest-neighbor search, and classification by looking at reconstruction loss or another discriminator signal (a scoring sketch follows)
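A minimal sketch of reconstruction-loss anomaly scoring; `model` stands in for any trained reconstruction network (autoencoder or diffusion inpainter), and its call signature here is a hypothetical assumption.

```python
import torch

@torch.no_grad()
def anomaly_score(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Score images by how poorly a reconstruction model reproduces them.

    images: (N, C, H, W) tensor scaled to [0, 1].
    model:  assumed to map images -> reconstructions of the same shape.
    """
    recon = model(images)
    # Per-image mean squared reconstruction error; anomalies reconstruct
    # poorly because the model only ever saw normal data during training.
    return ((recon - images) ** 2).flatten(start_dim=1).mean(dim=1)

# Usage: fit a threshold on held-out normal images, then flag exceedances.
# threshold = anomaly_score(model, normal_images).quantile(0.99)
# is_anomaly = anomaly_score(model, new_images) > threshold
```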
Clever bootstrapping and applications
- PLIP -> grounded object tagging: use noisy PLIP-generated labels to automatically train an object detection model (see the zero-shot labeling sketch after this list; CVPR notes)
- Can generate a diversity of text headers for each source image
- Can manually review/clean those generated headers for accuracy.
- Medical Segment Anything -> cell segmenter (MIDL notes)
- No need to train a specialized cell counter; only a very rough model to tell cells of interest apart from other cells
- PLIP + cell segmenter -> open-vocabulary instance segmentation
- Medical LLMs + Virchow/PLIP image features -> a good start for training custom open-vocabulary models on top of novel data
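A minimal sketch of the noisy-labeling step, assuming the publicly released PLIP checkpoint on Hugging Face (vinid/plip); PLIP is a fine-tuned CLIP, so the standard CLIP interface applies. The candidate prompts and image path are illustrative placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# PLIP is CLIP-compatible, so it loads through the standard CLIP classes.
model = CLIPModel.from_pretrained("vinid/plip")
processor = CLIPProcessor.from_pretrained("vinid/plip")

# Hypothetical candidate labels; in practice these come from the
# diversity of text headers generated per source image.
prompts = ["an H&E image of tumor tissue",
           "an H&E image of normal tissue",
           "an H&E image of lymphocytes"]

image = Image.open("patch.png")  # placeholder path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

# The argmax becomes a (noisy) training label for a downstream detector.
noisy_label = prompts[probs.argmax().item()]
```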
Anomaly detection models
Negative data labeling:
- Decision boundary theory of AI
- Positive labels on one side, negative on the other
- High-dimensional spaces allow complex boundaries (in a 100-dimensional space the decision boundary is a 99-dimensional hypersurface)
- Deep neural networks have millions of parameters, a rough measure of decision-boundary complexity, allowing extremely complex boundaries
- If there are few positive examples, overfitting is very likely: the model will miss some out-of-domain true positives with very high confidence
- Some tricks to mitigate this (a small sketch follows this list)
- Don’t label visually similar negatives (a tricky judgement call)
- Explicit regularization (e.g., weight decay, dropout)
- Implicit regularization (e.g., early stopping, the noise of SGD)
- Deep learning simply works best with balanced datasets, but the tricks above can compensate to some degree
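A minimal sketch of combining explicit regularization with class rebalancing, using scikit-learn; the classifier choice and hyperparameter values are illustrative assumptions, not a prescription.

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" reweights the loss by inverse class frequency,
# compensating for scarce positive labels; a small C applies strong explicit
# L2 regularization, keeping the decision boundary simple.
clf = LogisticRegression(class_weight="balanced", C=0.1, max_iter=1000)

# X_train: (N, D) feature vectors, y_train: 0/1 labels (placeholders).
clf.fit(X_train, y_train)
```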
- Pretrained feature engines to the rescue
- Supervised
- Bottleneck hypothesis
- Needs a strictly harder pretraining task than the downstream task (otherwise the model will throw away distinguishing features that are useful downstream)
- Self-supervised
- Learn as much as possible about the image’s internal structure (cross-correlations)
- Lots of progress here, e.g. DINOv2
- Visual similarity scoring (scene typing)
- Much simpler decision boundaries can be effective for screening (hundreds of dimensions, simple linear structures; see the sketch after this list)
- Linear models
- Random forests
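A minimal sketch of the feature-engine pattern, assuming the public DINOv2 torch.hub release; the preprocessing values are standard ImageNet statistics, and the training/screening images and labels are placeholders.

```python
import torch
from torchvision import transforms
from sklearn.linear_model import LogisticRegression

# Frozen self-supervised feature engine (ViT-S/14, 384-dim features).
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),   # 224 is divisible by the 14-px patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    return encoder(batch)  # (N, 384) feature vectors

# A simple linear boundary over a few hundred dimensions is often enough
# for screening; train_images, train_labels, new_images are placeholders.
probe = LogisticRegression(class_weight="balanced", max_iter=1000)
probe.fit(embed(train_images).numpy(), train_labels)
scores = probe.predict_proba(embed(new_images).numpy())[:, 1]
```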
In-the-wild labeling
- Report/image cross-correlations: mine weak image labels from the free-text reports paired with each image (a toy sketch follows)
- Patient metadata/image cross-correlations: e.g. demographics or diagnosis codes as weak labels
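A toy sketch of report/image cross-correlation as weak labeling; the finding terms and negation pattern are illustrative assumptions, and real report parsing needs far more care.

```python
import re
from typing import Optional

# Illustrative finding vocabulary; a real system would use a curated
# ontology and a proper negation detector.
FINDING = r"(pneumothorax|pleural effusion|consolidation)"
NEGATED = re.compile(rf"\bno (?:evidence of )?{FINDING}\b", re.IGNORECASE)
PRESENT = re.compile(rf"\b{FINDING}\b", re.IGNORECASE)

def weak_label(report_text: str) -> Optional[int]:
    """Derive a weak image label from the paired free-text report.

    Returns 0 (negative), 1 (positive), or None (abstain) so that
    ambiguous reports never pollute the training set.
    """
    if NEGATED.search(report_text):
        return 0
    if PRESENT.search(report_text):
        return 1
    return None
```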