Data Machina #232


New Mixture-of-Experts (MoE) Models. MS Phi-2 2.7B Small Model. StripedHyena 7B Models. DeepMind Imagen2. Diffusion Models + XGBoost. promptbase. Automated Continual Learning. CogAgent V-L Model.

New Mixture-of-Experts (MoE) Models. I read somewhere that Jeff Bezos once said that “consensus & compromise between experts is not good for seeking truth.” He’s probably right. Either way, Mixture-of-Experts models are all the rage in the AI community these days. Let’s see why.

Dense transformer models are hugely demanding in terms of compute and model-pipeline execution. MoE models offer faster pre-training and faster inference for the same number of active parameters, though serving them still needs enough VRAM to hold all the experts. On top of that, the new MoE models popping up recently seem to outperform GPT-3.5 and Llama 2 on many benchmarks.

How do Mixture-of-Experts models work? A group of leading AI researchers just posted an excellent blogpost on MoEs. They take a look at the building blocks of MoEs, how they’re trained, and the trade-offs to consider when serving them for inference. Blogpost: Mixture of Experts Explained
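For intuition, here is a minimal PyTorch sketch of the basic building block the post describes: a learned router scores a pool of expert feed-forward networks, each token is processed only by its top-k experts, and the expert outputs are combined using the router weights. All class names, dimensions, and hyperparameters below are illustrative, not taken from the blogpost.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sparse MoE feed-forward layer: a router picks the top-k experts per token."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                               # x: (batch, seq, d_model)
        logits = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)            # normalise over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Quick shape check: 4 sequences of 16 tokens
print(MoELayer()(torch.randn(4, 16, 512)).shape)  # torch.Size([4, 16, 512])
```

Because only top_k experts run per token, a sparse MoE can hold far more parameters than it spends compute on; production implementations also add the load-balancing losses and capacity limits that the blogpost walks through.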

Mixtral-8x7B: A new SMoE model. Mistral AI just announced this high-quality sparse mixture-of-experts (SMoE) model with open weights, licensed under Apache 2.0. Mixtral outperforms Llama 2 70B on most benchmarks with 6x faster inference. It’s the strongest open-weight model with a permissive license and the best model overall in terms of cost/performance trade-offs. Blogpost: Mixtral of Experts, A High-Quality Sparse Mixture-of-Experts.

Mixtral 8x7B: Overview and fine-tuning. A great video explainer in which Greg reviews the architecture of Mixtral 8x7B, explains where it stands relative to other models, and shows how it differs from a classic transformer architecture. The video also includes a section on how to run inference with Mixtral and how to instruct-tune the model using Mosaic Instruct V3.
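For readers who just want to poke at the model before watching, here is a rough inference sketch using Hugging Face transformers rather than the video's exact setup. The model id is the official Mixtral instruct checkpoint; the prompt format and generation settings are illustrative, and the full 8x7B weights need multiple GPUs or aggressive quantisation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # or load in 4-bit via bitsandbytes on smaller setups
    device_map="auto",          # shard the experts across available GPUs
)

prompt = "[INST] Explain mixture-of-experts routing in two sentences. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```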

SwitchHead: A new MoE attention model. A few days ago, a diverse group of AI researchers (including Schmidhuber) released SwitchHead. The model uses Mixture-of-Experts (MoE) layers for the value and output projections and requires 4 to 8 times fewer attention matrices than standard Transformers. This novel attention mechanism can also be combined with MoE MLP layers, resulting in an efficient, fully-MoE “SwitchAll” Transformer model. Paper and source code: SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention.
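Very roughly, the idea looks like the sketch below: keep dense query/key projections, but replace the value and output projections with small per-token mixtures of expert linear maps. This is an illustrative single-head simplification under assumed names and sizes, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEProjection(nn.Module):
    """Per-token top-1 mixture of expert linear projections (illustrative only)."""

    def __init__(self, d_in, d_out, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_in, num_experts)
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(num_experts)])

    def forward(self, x):                            # x: (batch, seq, d_in)
        gate = F.softmax(self.router(x), dim=-1)     # (batch, seq, num_experts)
        idx = gate.argmax(dim=-1)                    # pick one expert per token
        out = x.new_zeros(*x.shape[:-1], self.experts[0].out_features)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[..., e][mask].unsqueeze(-1) * expert(x[mask])
        return out

class MoEAttention(nn.Module):
    """Single-head attention with MoE value and output projections, in the spirit of SwitchHead."""

    def __init__(self, d_model=256, num_experts=4):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = MoEProjection(d_model, d_model, num_experts)  # MoE value projection
        self.o = MoEProjection(d_model, d_model, num_experts)  # MoE output projection

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        return self.o(attn @ v)

print(MoEAttention()(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```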

CausalLM / Qwen with 8 MoEs. The awesome team at CausalLM has come up with a new model trained (not merged) on 8 completely different expert models based on Qwen-7B / CausalLM. Six of these are domain-specific experts: a Toolformer/agent expert, a multilingual-translation expert, a mathematics expert, a visual expert, a coding and computer expert, and an uncensored-knowledge model, alongside the Qwen-Chat and Qwen-Base models. Check out the model description and repo: CausalLM / Qwen 8x7B MoE - This is not Mixtral / Mistral 7B

Running Mixtral 8x7B on the new Apple MLX. A couple of days ago, Apple published a repo showing how to run the Mixtral 8x7B MoE model on the brand-new MLX framework. The example also supports the instruction fine-tuned Mixtral model. Repo: Mixtral 8x7B on Apple MLX example.
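If you'd rather call it from Python than run the repo's scripts, the separate mlx-lm package exposes a similar load/generate workflow. A minimal sketch, assuming a quantised community checkpoint (the model id below is an assumption; check the mlx-community org on Hugging Face and the example repo's README for the exact conversion steps and memory requirements):

```python
# Assumes `pip install mlx-lm` on Apple Silicon; not the example repo's own scripts.
from mlx_lm import load, generate

# Hypothetical quantised checkpoint id; even a 4-bit Mixtral needs a lot of unified memory.
model, tokenizer = load("mlx-community/Mixtral-8x7B-Instruct-v0.1-4bit")

prompt = "[INST] Summarise what a sparse mixture-of-experts model is. [/INST]"
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```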

Have a nice week.


10 Link-o-Troned

Google Research - Advancements in ML for ML

MSR Phi-2 2.7B: The Surprising Power of Small LMs

DeepMind Imagen 2 - Our Most Advanced Txt-to-Img Model

The AI Trust Crisis

The New StripedHyena 7B Models: Beyond Transformers

A Hacker's Guide to Open Source LLMs (12/2023)

A Systems Programmer's Perspectives on Generative AI

MS promptbase - A Repo on All Things Prompt Engineering

[free book] Deep Learning: Foundations and Concepts, Nov 2023

Bash One-Liners for LLMs




the ML Pythonista

Samsung AI: Diffusion Models + Flow XGBoost for Tabular Data

Google AI Gemini API - Getting Started Notebook

Spin up a Swarm of 10,000 Internet Agents, Let Them Work for You

Deep & Other Learning Bits

High Dimensional, Tabular DL Aided with a Knowledge Graph

[free course] RL with Human Feedback (RLHF)

[free NeurIPS 2023 tutorial] On World Models, Agents & LLMs

AI/ DL ResearchDocs

Introducing Automated Continual Learning (ACL)

Dense X Retrieval: What Retrieval Granularity Should We Use?

CogAgent: A SOTA Visual Language Model for GUI Agents

data v-i-s-i-o-n-s

1,374 Days: My Life with Long COVID

[interactive] Cost of Living: Why Things are Expensive?

How Many Hobbits? 3,000 Years of Middle Earth Population History

MLOps Untangled

How to Setup VS Code for AI/ML & MLOps in Python

BricksLLM: AI Gateway for Putting LLMs in Production

The Big Dictionary of MLOps & LLMOps

AI startups -> radar

Relevance - Build & Deploy Your Own AI Agents with No Code

Delphina - A Copilot for Data Science & ML

Typeface - A Platform for Personalised Enterprise GenAI

ML Datasets & Stuff

The AI Art Dataset - 200k Txt-to-Img Prompts

UTD19: Largest Public Multi-city Traffic Dataset Available

Toxic DPO - A Highly Toxic, Harmful Dataset for DPO & AI Unalignment

Postscript, etc

Enjoyed this post? Tell your friends about Data Machina. Thanks for reading.


Tips? Suggestions? Feedback? email Carlos

Curated by @ds_ldn in the middle of the night.

