Suggestion: Using Karpathy's "AutoResearch" for Anima training optimization
I first wanted to congratulate the Anima dev team on releasing Preview 2. The natural language understanding has definitely improved.
I wanted to share a new method recently released by Andrej Karpathy called "AutoResearch" that might be useful for your current training process. I first heard about it in Matt Wolfe's YouTube roundup of the latest AI news, and the AutoResearch tool really caught my eye.
In simple terms, it's a tool that uses an AI agent to automatically test hundreds of small changes to the training code. It keeps the changes that lower the loss and reverts the ones that don't. Karpathy found that it could independently discover ways to make training about 11% more efficient. While Karpathy tested this on text models (nanochat), I was wondering whether the same logic, having an agent optimize hyperparameters and architecture, could eventually be adapted to image models like Anima as well.
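My reading of the loop described above, sketched as greedy hill-climbing (this is my interpretation, not Karpathy's actual code; `probe_loss` and `tweak` are toy stand-ins for the real probe run and the agent's proposals):

```python
import random

def autoresearch_loop(config, score, propose_tweak, budget=50, seed=0):
    """Greedy hill-climbing sketch of the AutoResearch idea: propose a
    tweak, keep it if the probe loss drops, revert it otherwise."""
    rng = random.Random(seed)
    best_loss = score(config)
    for _ in range(budget):
        candidate = propose_tweak(dict(config), rng)  # agent proposes a change
        loss = score(candidate)                       # short probe run
        if loss < best_loss:                          # keep it if loss drops
            config, best_loss = candidate, loss
    return config, best_loss

# Toy stand-ins: "loss" is the distance of the learning rate from a hidden
# optimum, and the "agent" just halves or doubles it.
def probe_loss(cfg):
    return (cfg["lr"] - 3e-4) ** 2

def tweak(cfg, rng):
    cfg["lr"] *= rng.choice([0.5, 2.0])
    return cfg

best, loss = autoresearch_loop({"lr": 1e-3}, probe_loss, tweak)
```

The toy loop walks the learning rate from 1e-3 down to 2.5e-4, the closest reachable point to the hidden optimum, which is roughly the shape of what the real tool does at much larger scale.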
Since Anima is still being refined, this could be a great way to find the best architecture and settings automatically.
Let me be straightforward, recommending something like this to a developer who is putting serious effort into sophisticated model fine-tuning is as pointless, and sometimes even insulting or counterproductive, as suggesting an AI art tool to a painter who is dedicated to their craft. If you genuinely want to be helpful, it would be far better to share properly verified results, or firsthand reviews and insights gained from actually using the model yourself.
I believe suggestions on open-source projects should be encouraged and received with gratitude, but we should also remember that important opinions can be drowned out by noise. Dismissing an opinion as 'noise' would be disrespectful, but I'd ask you to think about this once more: why did some people feel that suggesting AutoResearch sounds like 'noise'?
> Let me be straightforward, recommending something like this to a developer who is putting serious effort into sophisticated model fine-tuning is as pointless, and sometimes even insulting or counterproductive, as suggesting an AI art tool to a painter who is dedicated to their craft. If you genuinely want to be helpful, it would be far better to share properly verified results, or firsthand reviews and insights gained from actually using the model yourself.
First of all, I'd like to apologise if it came across that way. I'm really appreciative of all the hard work the dev team has put into this amazing model. The only reason I shared this is that I follow AI news quite a lot, and this new open-source method caught my eye. The agents could work alongside the dev team, more like a collaborative tool. Obviously there would still need to be human input, because it's AI and it's not perfect; it could make bad training changes, so a human would need to monitor it. But it's still pretty cool technology. I 100% agree with you regarding AI art, and even creative writing, but most LLMs and image models are trained on human artwork and writing. Whether we like it or not, that's how they're made. Also, this is open source, so people can further enhance the code; if it were closed source, I obviously wouldn't be as keen on it.
Thank you for your honesty as well. I know AI is a bit of a mixed topic at the moment especially as it's trained on human data.
I hope you're enjoying this amazing model. I definitely am. The prompt adherence is very good compared to Illustrious and NoobAI, which I'm used to.
I don't really have much ML experience and I can't comment super well on this, but I'd still like to.
> Since Anima is still being refined, this could be a great way to find the best architecture and settings automatically.
Since Anima is being refined, that's precisely why its architecture shouldn't be changed, at least not in major ways, as that would basically mean starting over.
A lot of architectural changes that improve things are already known; they're not something that needs discovering with AI agents. I can go off a buncha bullet points:
- Muon should be a better optimizer, not quite architectural but I imagine it has more benefit from the start rather than just swapping to it mid-training
- Decoupled diffusion transformer is an architectural change that claims some massive convergence and FID and IS improvements.
- The Flux 2 VAE is basically REPA packaged into a VAE and besides the benefits of that, it should be a bit better in quality than the Qwen VAE, assuming that one is like the Flux 1 one
- Speaking of REPA, BFL very recently published self-flow, which claims to be even better
- Swapping the TE to Qwen 3.5 0.8B at least, or something a bit bigger, should be better (and tdrussel mentioned testing a TE swap IIRC?)
- No doubt imagegen models memorize things just like LLMs, a deepseek engram-style architectural change likely would help
- Randomized positional encodings seem better than RoPE, though they are old and I wonder if there's something even betterer or if I've missed something wrong with them
- Hell, recently, there's this paper on drifting models. Imagine if anima was both 1-step AND higher quality? Though drifting models remain to be proven and I am sus of why everyone I've seen talk about them omits training costs...
- And in general, having anima be an omni, edit-and-generation model would've been nice too. Multimodal = gooder, right? Danbooru almost has edit model data, there's variation sets, style reference is right there if you pick 2 images with the same artist tags, same for characters though you might want to stick to official art for those and/or include a special prompt when it's not official<->official, and of course the typical controlnet stuff fits in here nicely
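Of the bullets above, the Muon one is the easiest to make concrete. Muon's core step replaces the raw momentum update for 2D weight matrices with an approximately orthogonalized one; a minimal numpy sketch (the a/b/c coefficients are the ones quoted in public Muon implementations; treat the exact values as my assumption):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    # Quintic Newton-Schulz iteration that pushes the singular values of G
    # toward 1, i.e. approximately orthogonalizes the update matrix. This is
    # the key step of the Muon optimizer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                       # iterate on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

Muon then uses this orthogonalized matrix (suitably scaled) in place of the raw momentum; whether it still helps when swapped in mid-training, as the bullet wonders, I don't know.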
You probably could do self-flow right now rather than from scratch, but man that's risky. Engram and DDT though I guess it's too late to implement. To this day I'm still confused on why DDT is seemingly abandoned other than risk, even a random furry supposedly reproduced it fast (not so for drifting models).
...However, all of those are risky, not quite proven, and/or carry training/testing/implementation costs. And I really don't have the knowledge to sus out their pitfalls.
I do wish they happened and I got the ultimate anime model to ever exist, but it's too much to expect 1 person do/risk all of that for me for free, even more when the others hate art, their users, or both - this is a pretty exceptional charity already.
Autoresearch takes ~2/7 hours (successful/all experiments) for the entire training run, from scratch, with 5-minute test periods, on some small (100M+?) LLM, discarding 4 experiments for every 1 success. That doesn't scale to a bigger model like Anima on a limited budget: nothing meaningful happens in as little as 5 minutes, and the overall training time is months.
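Rough arithmetic on those figures (every number here is my assumption or reading, not a measurement):

```python
# Back-of-envelope on the budget above: ~7 hours of 5-minute probes on a
# tiny LLM, keeping roughly 1 experiment in 5.
probe_minutes = 5
total_hours = 7
probes = total_hours * 60 // probe_minutes   # probes that fit in the budget
kept = probes // 5                           # ~1 success per 5 experiments

# Hypothetical comparison: if Anima trains for ~2 months and a probe had to
# cover even 1% of the run to say anything meaningful about final quality:
anima_days = 60
probe_hours_needed = anima_days * 24 * 0.01
print(probes, kept, probe_hours_needed)      # 84 probes, 16 kept, 14.4 h/probe
```

So even under a generous 1% assumption, a single informative probe on Anima would cost more than AutoResearch's entire budget on the toy LLM.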
Would it even translate from a smaller toy model when part of that is hyperparameters, different for a bigger model? Also anima is a finetune, right? Extra variable.
Since it's short - what about improvements that initially underperform? See BFL's self-flow, initially worse vs REPA, then not. Or vice-versa, initially overperform?
It's meant to optimize a tiny LLM that's already blazing fast to train, to be even faster to train, for research. To me it seems mostly non-applicable here.
I imagine you could inspect what the agents did and try to generalize it (would it?), but that sounds like a worse & more work version of applying what some papers say, which I already see as risky. Do you trust Qwen, Deepseek, BFL more than a swarm of AI agents?
> I believe suggestions on open-source projects should be encouraged and received with gratitude, but we should also remember that important opinions can be drowned out by noise. Dismissing an opinion as 'noise' would be disrespectful, but I'd ask you to think about this once more: why did some people feel that suggesting AutoResearch sounds like 'noise'?
It's "noise" because it's what we call "out of scope".
Autoresearch relies on rapid iteration: as the author states, it checks loss after 5 minutes of training, tweaks settings, and starts again, all while training an LLM. LLM training loss =/= DiT training loss.
In transformer training (especially LLMs), early loss can tell you whether you'll hit your perplexity target down the line; in diffusion training, early loss only tells you whether you have to restart (abnormally high loss or a bad loss curve = optimizer exploded).
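That "restart or not" signal is about the only thing early diffusion loss is good for, and it's simple enough to codify. A hypothetical sketch (nothing like this is claimed to be in Anima's actual code; names and thresholds are made up):

```python
import math

def should_restart(loss_history, warmup=100, window=100, blowup=5.0):
    # Crude restart heuristic: early diffusion loss mainly signals whether
    # the optimizer exploded, so flag a restart when recent loss is NaN/inf
    # or far above the warmup baseline, rather than trying to extrapolate
    # final sample quality from it.
    if len(loss_history) < warmup + window:
        return False
    baseline = sum(loss_history[:warmup]) / warmup
    recent = loss_history[-window:]
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return True
    return (sum(recent) / window) > blowup * baseline
```

A check like this is a go/no-go gate, not an optimization signal, which is exactly why a 5-minute-probe loop has little to grab onto here.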
Questions about how autoresearch could be used in diffuser training should be posted in the autoresearch discussions, not here, so that's why some feel it's "noise".