New Mixture-of-Experts (MoE) Models. MS Phi-2 2.7B Small Model. StripedHyena 7B Models. DeepMind Imagen 2. Diffusion Models + XGBoost. promptbase. Automated Continual Learning. CogAgent V-L Model.
New Mixture-of-Experts (MoE) Models. I read somewhere that Jeff Bezos once said: “consensus & compromise between experts is not good for seeking truth.” He is probably right. Well, it seems Mixture-of-Experts models are all the rage in the AI community these days. Let’s see why.
Dense transformer models are hugely demanding in compute and in model pipeline execution. MoE models offer faster pre-training and faster inference for a given parameter count, although they still need plenty of VRAM because all the experts sit in memory. And the new MoE models popping up recently seem to outperform GPT-3.5 and Llama 2 as well.
How do Mixture-of-Experts models work? A group of leading AI researchers just posted this excellent blogpost on MoEs. The researchers take a look at the building blocks of MoEs, how they’re trained, and the tradeoffs to consider when serving them for inference. Blogpost: Mixture of Experts Explained
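To make those building blocks concrete, here is a minimal sketch of a sparse, top-2 gated MoE feed-forward layer in PyTorch. It is an illustrative simplification (no capacity limits, no load-balancing loss), not the blogpost's or any production model's actual implementation:

```python
# Minimal sketch of a sparse, top-2 gated MoE feed-forward layer (PyTorch).
# Simplified for illustration: no expert capacity limits or load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)          # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])                  # flatten to (n_tokens, d_model)
        logits = self.router(tokens)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)       # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over the chosen experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[:, k] == e                        # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(tokens[mask])
        return out.reshape_as(x)

if __name__ == "__main__":
    layer = MoEFeedForward()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

Only the chosen top-k experts run for each token, which is where the compute savings over a dense feed-forward block of the same total parameter count come from.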
Mixtral-8x7B: A new SMoE model. Mistral AI just announced this high-quality sparse mixture-of-experts (SMoE) model with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It’s the strongest open-weight model with a permissive license and the best model overall on cost/performance trade-offs. Blogpost: Mixtral of Experts: A High-Quality Sparse Mixture-of-Experts.
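If you just want to poke at Mixtral, the weights are also on the Hugging Face Hub. A rough sketch, assuming a recent transformers release with Mixtral support and enough GPU memory (or quantization) to hold the model:

```python
# Rough sketch: generate with Mixtral-8x7B-Instruct via Hugging Face transformers.
# Assumes recent transformers + accelerate and sufficient GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "[INST] Explain sparse mixture-of-experts in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```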
Mixtral 8x7B: Overview and fine-tuning. A great video explainer in which Greg reviews the architecture of Mixtral 8x7B, explains where it stands relative to other models, and shows how it differs from a classic transformer architecture. The video also includes a section on how to run inference with Mixtral and how to instruct-tune the model using Mosaic Instruct V3!
SwitchHead: A new MoE attention model. A few days ago, a group of AI researchers (including Schmidhuber) released SwitchHead. The model uses Mixture-of-Experts (MoE) layers for the value and output projections, and computes 4 to 8 times fewer attention matrices than standard Transformers. This novel attention mechanism can also be combined with MoE MLP layers, resulting in an efficient, fully-MoE “SwitchAll” Transformer model. Paper and source code: SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention.
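To get a feel for the idea, here is a rough, simplified sketch of a single MoE attention head in PyTorch: queries and keys stay dense, while each token's value and output projections are routed to one of a few experts. The top-1 routing (and the missing causal mask) is my simplification for illustration, not the paper's exact SwitchHead formulation:

```python
# Rough sketch of MoE attention in the spirit of SwitchHead (not the paper's exact method):
# queries/keys are dense; value and output projections are expert-routed per token.
import torch.nn as nn
import torch.nn.functional as F

class MoEAttentionHead(nn.Module):
    def __init__(self, d_model=512, d_head=64, n_experts=4):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v_router = nn.Linear(d_model, n_experts)   # picks a value expert per token
        self.o_router = nn.Linear(d_head, n_experts)    # picks an output expert per token
        self.v_experts = nn.ModuleList(nn.Linear(d_model, d_head) for _ in range(n_experts))
        self.o_experts = nn.ModuleList(nn.Linear(d_head, d_model) for _ in range(n_experts))
        self.scale = d_head ** -0.5

    @staticmethod
    def _route(x, router, experts):
        # top-1 routing: each token is projected by a single selected expert
        idx = router(x).argmax(dim=-1)                               # (batch, seq)
        out = x.new_zeros(*x.shape[:-1], experts[0].out_features)
        for e, expert in enumerate(experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

    def forward(self, x):                                            # x: (batch, seq, d_model)
        q, k = self.q(x), self.k(x)
        v = self._route(x, self.v_router, self.v_experts)            # expert-routed values
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # no causal mask, for brevity
        ctx = attn @ v                                               # (batch, seq, d_head)
        return self._route(ctx, self.o_router, self.o_experts)       # expert-routed output
```

The point is that the expensive per-head projections become conditional: only one small expert projection runs per token, which is how the attention block's parameter count can grow without a matching growth in compute.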
CausalLM / Qwen with 8 MoEs. The awesome team at CausalLM have come up with a new model trained (not merged) on 8 completely different expert models based on Qwen-7B / CausalLM. Six of them are domain-specific experts: a Toolformer/agent expert model, a multilingual translation expert model, a mathematics expert model, a visual expert model, a coding and computer expert model, and an uncensored knowledge model; the other two are the Qwen-Chat and Qwen-Base models. Check out the model description and repo: CausalLM / Qwen 8x7B MoE - This is not Mixtral / Mistral 7B
Running Mixtral 8x7B on the new Apple MLX. A couple of days ago, Apple published a repo showing how to run the Mixtral 8x7B MoE model on the brand-new MLX framework. The example also supports the instruction fine-tuned Mixtral model. Repo: Mixtral 8x7B on Apple MLX example.
Have a nice week.
10 Link-o-Troned
Google Research - Advancements in ML for ML
MSR Phi-2 2.7B: The Surprising Power of Small LMs
DeepMind Imagen 2 - Our Most Advanced Txt-to-Img Model
The New StripedHyena 7B Models: Beyond Transformers
A Hacker's Guide to Open Source LLMs (12/2023)
A Systems Programmer's Perspectives on Generative AI
MS promptbase - A Repo on All Things Prompt Engineering
[free book] Deep Learning: Foundations and Concepts, Nov 2023
the ML Pythonista
Samsung AI: Diffusion & Flow Models + XGBoost for Tabular Data
Google AI Gemini API - Getting Started Notebook
Spin up a Swarm of 10,000 Internet Agents, Let Them Work for You
Deep & Other Learning Bits
High Dimensional, Tabular DL Aided with a Knowledge Graph
[free course] RL with Human Feedback (RLHF)
[free NeurIPS 2023 tutorial] On World Models, Agents & LLMs
AI/DL ResearchDocs
Introducing Automated Continual Learning (ACL)
Dense X Retrieval: What Retrieval Granularity Should We Use?
CogAgent: A SOTA Visual Language Model for GUI Agents
data v-i-s-i-o-n-s
1,374 Days: My Life with Long COVID
[interactive] Cost of Living: Why Things are Expensive?
How Many Hobbits? 3,000 Years of Middle Earth Population History
MLOps Untangled
How to Setup VS Code for AI/ML & MLOps in Python
BricksLLM: AI Gateway for Putting LLMs in Production
The Big Dictionary of MLOps & LLMOps
AI startups -> radar
Relevance - Build & Deploy Your Own AI Agents with No Code
Delphina - A Copilot for Data Science & ML
Typeface - A Platform for Personalised Enterprise GenAI
ML Datasets & Stuff
The AI Art Dataset - 200k Txt-to-Img Prompts
UTD19: Largest Public Multi-City Traffic Dataset Available
Toxic DPO - A Highly Toxic, Harmful Dataset for DPO & AI Unalignment
Postscript, etc
Tips? Suggestions? Feedback? email Carlos
Curated by @ds_ldn in the middle of the night.