Instruction tuning looks like the way forward, but the delay in launching RLHF via PPO suggests that PPO-based RLHF is finicky in practice. Practitioners will likely need to stick with instruction tuning for the time being.
Is FeedME equivalent to just Step 1 of RLHF, or am I misunderstanding?
I believe so, but what confused me is that this would make it the same as FLAN with different input data (which maybe it is?)
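The equivalence being discussed can be made concrete: Step 1 of RLHF, FeedME-style tuning, and FLAN all fine-tune with the same supervised objective, next-token cross-entropy on (prompt, response) pairs, and differ mainly in where those pairs come from. Here is a toy sketch of that shared objective; the "model" (a fixed probability table) and the example targets are hypothetical stand-ins, not anyone's actual training data.

```python
import math

def next_token_loss(model_probs, target_tokens):
    """Average negative log-likelihood the model assigns to the target tokens."""
    return -sum(math.log(model_probs[t]) for t in target_tokens) / len(target_tokens)

# Stand-in "model": a fixed distribution over a tiny three-token vocabulary.
probs = {"yes": 0.7, "no": 0.2, "maybe": 0.1}

# FLAN-style targets: responses taken from academic NLP task templates.
flan_targets = ["yes", "no"]
# FeedME/SFT-style targets: responses human labelers rated highly.
feedme_targets = ["yes", "yes"]

# The same loss function is applied to both -- only the data differs.
flan_loss = next_token_loss(probs, flan_targets)
feedme_loss = next_token_loss(probs, feedme_targets)
print(round(flan_loss, 4))
print(round(feedme_loss, 4))
```

On this view the methods really do coincide as training procedures, and the interesting differences live entirely in the data-collection pipeline.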