Phi-4 Bug Fixes

(unsloth.ai)

193 points | by danielhanchen 354 days ago

16 comments

danielhanchen 354 days ago
Hey HN family! I found a few bugs for Phi-4 - Microsoft's latest MIT licensed LLM to be on par with GPT-4o mini
1. End of sentence should be <|im_end|> not <|endoftext|>
2. Chat template should not auto add an assistant prompt
3. Padding token should not be EOS but <|dummy_87|>
I also converted Phi-4 to Llama-arch. I uploaded GGUFs, 4bit quants, dynamic quants and all fixes to https://huggingface.co/unsloth
I also made a Colab notebook to finetune Phi-4 on a free GPU: https://colab.research.google.com/github/unslothai/notebooks...
[-]
- CGamesPlay 354 days ago
  > We converted Phi-4 to Llama’s architecture for better accuracy and easier use.
  What does this mean? When I think about "model architecture", I think about the number of weights in each layer, the organization of the layers, etc. And AFAIK, it's untenable to "port" a model from one to the other without effectively retraining it. So what does it actually mean to "convert to Llama's architecture"?
  [-]
  - danielhanchen 354 days ago
    Oh Phi-4's architecture is inspired from Llama itself, except they merged the attention matrices into 1 large matrix for better FLOP utilization, and the gate/up matrices in the MLP.
    Phi-3 use to use sliding window attention, but they got rid of that in Phi-4.
    So, you can "Mistral-fy" Phi-3 and convert it to Mistral arch (by unmerging the merges), and now you can "Llama-fy" Phi-4 to Llama arch.
    The reason why accuracy increases in finetuning is because during LoRA finetuning, you learn only 1 A matrix for merged QKV, whilst unmerging it creates 3 A matrices - this allows the model to have more freedom to learn new features.
    [-]
    - behnamoh 353 days ago
      I know some of those words... Man, do you recommend any blog/book/etc. that teaches me how to know this stuff?
      Most books are either too low level or too high level.
      [-]
      - danielhanchen 353 days ago
        If it helps there a few YouTube recordings of conferences and workshops I did about stuff!
        Low level Technicals of LLMs: https://www.youtube.com/watch?v=pRM_P6UfdIc
        CUDA / GPU Mode talk about it here: https://www.youtube.com/watch?v=hfb_AIhDYnA
        Chat with PyTorch team here: https://www.youtube.com/watch?v=MQwryfkydc0
        PyTorch Conference talk here: https://www.youtube.com/watch?v=PdtKkc5jB4g
  - Sn0wCoder 354 days ago
    Would guess GGUF so you can run on llama.cpp, LM Studio, etc..., but OP can hopefully clarity further for you.
    [-]
    - danielhanchen 354 days ago
      Yep converting to Llama arch definitely makes accessibility much better - also many fast LLM serving libraries normally support Llama, so it makes it easier to port and use!
- sunaookami 354 days ago
  Wasn't Phi-3 also bugged/is still bugged? Seems like Microsoft just doesn't care.
  >to be on par with GPT-4o mini
  Phi is known to overfit benchmarks. It's way, way worse then that.
  [-]
  - danielhanchen 354 days ago
    Phi-3 should be fixed as well - but yes there were bugs as well! https://x.com/danielhanchen/status/1782853167572832650
    Phi-3's sliding window should be 2048 and not 2047, and they also had chat template issues - I uploaded correct versions to https://huggingface.co/unsloth/Phi-3.5-mini-instruct
  - throwaway314155 354 days ago
    Anecdotally, I've been experimenting with Phi-4 the past hour or so (so, yeah, not very comprehensive) and it's certainly a strong model. Definitely better than the previous Phi models.
    [-]
    - danielhanchen 354 days ago
      Yep Phi-4 definitely is better than Phi-3.5!
- simonw 354 days ago
  Huh! That may explain why I kept on getting visible <|im_end|> output when I tried running a Phi-4 GGUF file using llama.cpp.
  [-]
  - danielhanchen 354 days ago
    Oh yes exactly! I trimmed it out now :)
    The better chat template should be:
    {% for message in messages %}{% if (message['role'] == 'system') %}{{'<|im_start|>system<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'user') %}{{'<|im_start|>user<|im_sep|>' + message['content'] + '<|im_end|>'}}{% elif (message['role'] == 'assistant') %}{{'<|im_start|>assistant<|im_sep|>' + message['content'] + '<|im_end|>'}}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant<|im_sep|>' }}{% endif %}
- magicalhippo 353 days ago
  > I uploaded GGUFs, 4bit quants, dynamic quants
  The dynamic quantization[1] looks really interesting. Now, I've just been dabbling, but did I understand correctly that this dynamic quantization is compatible with GUFF? If so, how do you convert it? Just the standard way or?
  I was really curious to try the dynamic 4-bit version of the Llama-3.2 11B Vision model as I found the Q8 variant much better than the standard Q4_K_M variant in certain cases, but it doesn't fully fit my GPU so is significantly slower.
  [1]: https://unsloth.ai/blog/dynamic-4bit
  [-]
  - danielhanchen 353 days ago
    Oh the dynamic 4bit quants sadly are not GGUF compatible yet - it currently works through Hugging Face transformers, Unsloth and other trainers.
    My goal was to make a dynamic quant for GGUF as well - it's just a tad bit complicated to select which layers to quantize and not with GGUF - I might have to manually edit the llama.cpp quantize C file
    Also I'm unsure yet if llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?) I do remember Qwen / Llava type models are working
    [-]
    - magicalhippo 353 days ago
      > Oh the dynamic 4bit quants sadly are not GGUF compatible yet
      Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
      > Also I'm unsure yet if llama.cpp as of 11th Jan 2025 supports Llama Vision (or maybe it's new?)
      Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
      I'm using Ollama because was so easy to get running on my main Windows rig so I can take advantage of my GPU there, I still do a bit of gaming, while all the stuff which uses Ollama for inference I run on my server.
      Anyway thanks for the response.
      [-]
      - danielhanchen 353 days ago
        > Ah bummer. Is this a GGUF file-format issue or mostly "just" a code-doesn't-exist issue?
        Just code! Technically I was working on dynamic quants for DeepSeek V3 (200GB in size), which will increase accuracy by a lot for a 2bit model (if you leave attention in 4bit), and just use 20GB more. -> But I'm still working on it!
        > Ah, I totally forgot Ollama did that on their own and didn't merge upstream.
        Yep they have Llama vision support! Llama.cpp has Qwen, Llava support - I think Llama V support is coming, but it'll take much more time - the arch is vastly different than normal transformers due to cross attention
- sroussey 354 days ago
  Can you convert to ONNX so I can try in web browser?
  [-]
  - sroussey 354 days ago
    Would like to update this:
    https://huggingface.co/spaces/webml-community/phi-3.5-webgpu
  - danielhanchen 354 days ago
    Oh I can probs try doing this!
- mugivarra69 354 days ago
  [dead]
danielhanchen 354 days ago
Update: The Phi-4 team is actively working on adding all our fixes into the original model! https://huggingface.co/microsoft/phi-4/discussions/21
RandyOrion 354 days ago
Hi. It's nice to see these fixes.
I got a question after checking results on the open LLM leaderboard[1].
Comparing the result of NyxKrage/Microsoft_Phi-4 and microsoft/phi-4 or unsloth/phi-4, I can see fixing both the tokenizer and chat template causes the performance of both IFEval and BBH to increase. However, the performance on MATH, GPQA and MUSR degrades A LOT.
Is there any explanation on why this is happening?
[1] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
[-]
- danielhanchen 353 days ago
  Yep that is something I've been dumbfounded by as well - the official Microsoft phi4 upload also suffers on MATH so at least we can rule out it's because I did something wrong.
  I thought of two possibilities:
  1. 509 does better on MATH but absolutely terribly on IFEVAL because it does not use a chat template - whilst others so use the chat template.
  2. I think HF uses exact matching I think so maybe that's the culprit.
  I can test 1. by resubmitting without using the chat template!
  [-]
  - RandyOrion 353 days ago
    Thanks for the reply. Would like to see whether the chat template is cause of this strange behavior.
t1amat 354 days ago
Daniel’s fixes to Phi-4 make it the best scoring Phi-4 on HF’s Open LLM Leaderboard. Great job on that.
Unsloth is a masterpiece, keep up the great work!
[-]
- danielhanchen 354 days ago
  Thanks a lot!
lostmsu 354 days ago
The benchmark results of the model before and after the "fixes" do not match numbers reported in the model card: https://huggingface.co/microsoft/phi-4
According to Microsoft MATH score should be 80.4, while both original and the "fixed" models as run by unsloth only score just over 12.3. So either Microsoft made a few huge mistakes, or unsloth was not able to run their model correctly.
[-]
- danielhanchen 354 days ago
  Oh yes I found this to be a bit strange - I uploaded our versions and Microsoft's own version to Hugging Face's public LLM leaderboard - https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
  You can see Microsoft's own original Phi-3 scores 12.31% - I'm unsure why. My fixes at least pushes it to 20%.
  It's possible because HF's benchmark does "Scoring: Exact match: Was the solution generated correct and in the expected format" which might be the issue
dorian-graph 354 days ago
These seem like amazingly egregious mistakes MS made? Or is it not as bad as it seems? I suppose, I'm curious how these kinds of mistakes happen for a model release.
[-]
- danielhanchen 353 days ago
  This happens quite often sadly - for example I fixed 8 bugs in Gemma https://x.com/danielhanchen/status/1765446273661075609, multiple bugs in Llama, Mistral, Qwen, a gradient accumulation bug https://x.com/danielhanchen/status/1846235913443262891 etc
  I wouldn't blame model training teams - sadly it's relatively hard to coordinate large teams so it might have been overlooked.
  But hey - I'm always here to fix them up :))
  [-]
  - dorian-graph 353 days ago
    Fair enough! I haven't been through the train and release of a model like this, so it's new to me. Thanks for your work! ;)
excerionsforte 354 days ago
Available on Ollama already: https://ollama.com/vanilj/phi-4-unsloth
[-]
- tandr 354 days ago
  looking at "original" Phi4 on ollama, it looks like they have fixed parameters issue for im_start/end
- danielhanchen 354 days ago
  Oh fabulous! :)
NooneAtAll3 354 days ago
Application Error
TypeError: m(...).findLast is not a function
at L (https://unsloth.ai/assets/root-DexjOeLv.js:1:340)
at ia (https://unsloth.ai/assets/components-D38fXVcE.js:7:30549)
at Ac (https://unsloth.ai/assets/components-D38fXVcE.js:7:98661)
at Am (https://unsloth.ai/assets/components-D38fXVcE.js:7:94250)
at o0 (https://unsloth.ai/assets/components-D38fXVcE.js:7:93401)
at ha (https://unsloth.ai/assets/components-D38fXVcE.js:7:93212)
at Mm (https://unsloth.ai/assets/components-D38fXVcE.js:7:90555)
at Om (https://unsloth.ai/assets/components-D38fXVcE.js:7:89963)
at MessagePort.M (https://unsloth.ai/assets/components-D38fXVcE.js:1:11235
[-]
- danielhanchen 354 days ago
  Sorry are there some issues with our website?
  [-]
  - NooneAtAll3 354 days ago
    yep, it appears for a second - then displays only this :(
    [-]
    - danielhanchen 353 days ago
      I did post a Twitter post here: https://x.com/danielhanchen/status/1877781452818968615 which essentially is a long form Tweet about the bug fixes if that helps! Apologies on the issue
    - danielhanchen 354 days ago
      Oh no :( Do you know which device / platform?
adultSwim 354 days ago
Are there alternatives to unsloth?
I would love to use it but the open/free version only handles one GPU, and it's unclear how much the paid version would cost. I have some limited access to multiple older NVidia cards and would love to make better use of them while I'm still learning. My budget for learning/projects is rather modest.
Hopefully they succeed. At work I could make a strong case for going with them as they allow keeping data local only, instead of relying on an API.
[-]
- danielhanchen 354 days ago
  Multi GPU support is definitely coming to Unsloth OSS! Our goal was to release it this month, but unsure on exact timelines - maybe next month!!
  [-]
  - adultSwim 354 days ago
    Thank you!
    [-]
    - danielhanchen 354 days ago
      I'll ping you when it comes along!
greensh 354 days ago
Microsoft developed and trained Phi-4. How can there be bugs in their official implementation? Does this mean they trained und evaluated it on their own completly different code and then ported it to the huggingface library for compatibility?
[-]
- danielhanchen 353 days ago
  The chat template adding an assistant prompt by default for example is also shown in the technical report - so they did this during training. The issue is inference workloads should not have this, otherwise inference workloads might inadvertently append extra assistant prompts or forget about it - so hence I removed it.
  The rest I'm not sure - for eg the EOS token should be im_end and not endoftext - it could be a small mistake
  [-]
  - greensh 350 days ago
    Thanks. I guess this means for Benchmarks they didn't use it. I find it fascinating and admire your dedication to fixing and improving those models.
sinuhe69 353 days ago
How big is GPT4o-mini? Some sources say it's 8b big, but I guess they have different models with different sizes. But if GPT4o-mini is just 8b, I don't see the point of a "distilled" model, which requires a much bigger network but still not on par with the original. Because it's open source?
[-]
- danielhanchen 353 days ago
  I think it's a MoE probably with 8b activated parameters - so not exactly 8b but maybe 8b x 8
  But on many benchmarks it does surpass GPT 4o mini - some benchmarks are even better than GPT 4o.
  But yes in general it's because its a powerful open source models!
make3 354 days ago
"Yes it improves performance!" proceeds to show the most unconvincing stats ever
you can probably blow on your GPU and get a similar performance change
[-]
- danielhanchen 354 days ago
  I uploaded our fixed versions to https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... which show the difference in scores.
  I agree it's not super convincing, so I provided anecdotal evidence as well - I'll work with the Phi-4 team to upstream these fixes!
  PS for further credibility, we also fixed 8 bugs in Gemma 1 - see https://x.com/danielhanchen/status/1765446273661075609 , multiple bugs in Llama, Mistral, Qwen and other models
- refulgentis 354 days ago
  I'm sorry, I don't understand what you mean. I checked the original article again too. As it stands, my understanding is you are claiming:
  - blowing on a GPU (which I take to mean doing roughly nothing)
  - gets roughly the same perf change
  - as moving from fp16 to q4
  [-]
  - danielhanchen 354 days ago
    Update - the Phi-4 team is working on adding all our fixes to the original model! https://huggingface.co/microsoft/phi-4/discussions/21
    [-]
    - make3 353 days ago
      hey this is great work, I'm sorry I complained, I'm thankful for what you're doing here
      [-]
      - danielhanchen 353 days ago
        No worries at all! :)
  - danielhanchen 354 days ago
    Are you referring to the finetuning part?
    The multiple bug fixes are separate from the finetuning sections - Unsloth itself makes finetuning 2x faster and use 70% less memory - the bug fixes are totally detached from finetuning - ie you can take the fixed version we uploaded at https://huggingface.co/unsloth/phi-4, and use it in any framework or inference engine.
    Apologies I'm confused on the comment sorry.
    If you're questioning the credibility of the bug fixes - we fixed 8 bugs in Gemma https://x.com/danielhanchen/status/1765446273661075609, multiple bugs in Llama, Mistral, Qwen, a gradient accumulation bug https://x.com/danielhanchen/status/1846235913443262891 and much more
    [-]
    - grumpopotamus 354 days ago
      2x faster than what?
      [-]
      - danielhanchen 354 days ago
        Oh 2x faster and uses >70% less memory than Hugging Face + Flash Attention 2! I did a CUDA / GPU Mode talk about it here: https://www.youtube.com/watch?v=hfb_AIhDYnA Also to the PyTorch team here: https://www.youtube.com/watch?v=MQwryfkydc0 and the PyTorch Conference here: https://www.youtube.com/watch?v=PdtKkc5jB4g
        [-]
        kouteiheika 354 days ago
        > Oh 2x faster and uses >70% less memory than Hugging Face + Flash Attention 2!
        Is this doing the same type of fine-tuning, or are you comparing full bf16 fine-tuning in HF with 4-bit QLoRA in Unsloth (in which case it's not really an apples-to-apples comparison)? If it's the latter then do you have a comparison of the former?
        [-]
        danielhanchen 353 days ago
        Oh I compared 4bit QLoRA HF+FA2 with Unsloth 4bit QLoRA.
        16bit LoRA have similar boosts in performance!
        Full bf16 full finentuning is not yet supported, but it'll come out soon!
c1b 354 days ago
daniel youre a legend, thanks for all you do!
one question, I see perf comparisons here are done on an L4, but isn't this SKU very rare? Im used to T4 at that tier
[-]
- danielhanchen 353 days ago
  Thanks!! Oh Colab provides L4s - but the benchmarks are similar for T4!
  In fact Unsloth is the only framework afaik that fits in a t4 for finentuning with reasonable sequence lengths!
m3kw9 353 days ago
But fixing a model is the first I’ve heard of.
[-]
- danielhanchen 353 days ago
  Oh I do this quite a lot and tweet about it! For example I fixed 8 bugs in Gemma https://x.com/danielhanchen/status/1765446273661075609, multiple bugs in Llama, Mistral, Qwen, a gradient accumulation bug https://x.com/danielhanchen/status/1846235913443262891 etc
TZubiri 354 days ago
Ah yes, drawing ASCII art, the de facto benchmark for evaluating LLM quality.
[-]
- danielhanchen 354 days ago
  Anecdotal evidence was provided to show some Redditors tested it out - but I do agree it's not correct to show that as an example - so I uploaded our fixed versions to Hugging Face's public LLM leaderboard here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_... - this shows the fixes do in fact work!
  [-]
  - aghilmort 354 days ago
    [dead]
wsintra2022 354 days ago
>Reddit comments show our fixes make Phi-4 inference much better
I’d like to try ‘Reddit comments show my fixes make app better’ in my next review
[-]
- danielhanchen 354 days ago
  Fixed versions are also independently scored by Hugging Face's Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
  The Reddit LocalLlama community is actually pretty cool - tonnes of research actually comes from the community - for example kaiokendev's linear RoPE scaling, YaRN, NTK Aware RoPE Scaling, many LLM benchmarks - many researchers use LocalLlama to share research and discuss on new stuff.
  I know a lot of AI researchers use the "LocalLlama vibe check" which essentially is an anecdotal approach to LLM evaluation - ie instead of relying on Chat LMsys or LLM benchmarks, 3rd party crowd sourced vibe checks sometimes do much better.
- danielhanchen 354 days ago
  As an update - the Phi-4 team is actively working on incorporating all fixes! See https://huggingface.co/microsoft/phi-4/discussions/21