Instruction tuning is the future, but the delay in launching RLHF via PPO suggests that PPO-based RLHF is finicky in practice. Practitioners will likely need to stick to instruction tuning for the time being.
Is FeedMe equivalent to just Step 1 of RLHF or am I misunderstanding?