Reinforcement learning from human feedback

In machine learning, reinforcement learning from human feedback (RLHF) is a technique that trains a "reward model" from human feedback. That model is then used as a reward function to better guide an intelligent agent. Human feedback improves the agent's decision-making by allowing it to adapt to new situations.[1] Feedback is typically collected by asking human users to rate the agent's behaviour.[2][3]
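
To make the two-stage structure of RLHF concrete, the sketch below shows the reward-model step in Python/PyTorch: a small network is fitted to pairwise human preference comparisons with the standard Bradley-Terry-style loss, and the trained network then scores new behaviour as a stand-in reward function. This is a minimal illustration, not any particular system's implementation; the fixed-size feature vectors and the names RewardModel, preferred and rejected are assumptions made for the example.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a representation of agent behaviour to a scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

dim = 8                      # toy feature size; real systems embed text or trajectories
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy preference pairs standing in for human comparisons: each row of
# `preferred` was (hypothetically) rated better than the matching row of `rejected`.
preferred = torch.randn(256, dim) + 0.5
rejected = torch.randn(256, dim) - 0.5

for step in range(200):
    # Bradley-Terry / logistic loss: -log sigmoid(r_preferred - r_rejected)
    # pushes the reward of preferred behaviour above that of rejected behaviour.
    loss = -nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted model can now serve as the reward function that guides the agent,
# e.g. inside a policy-optimisation loop.
with torch.no_grad():
    print("estimated reward:", model(torch.randn(1, dim)).item())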

Examples of language models that have been trained with human feedback in this way include ChatGPT[4][5] and Sparrow.[6]

References

  1. ^ MacGlashan, James; Ho, Mark K; Loftin, Robert; Peng, Bei; Wang, Guan; Roberts, David L.; Taylor, Matthew E.; Littman, Michael L. (6 August 2017). "Interactive learning from policy-dependent human feedback". Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org: 2285–2294. arXiv:1701.06049.
    • Warnell, Garrett; Waytowich, Nicholas; Lawhern, Vernon; Stone, Peter (25 April 2018). "Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces". Proceedings of the AAAI Conference on Artificial Intelligence. 32 (1). doi:10.1609/aaai.v32i1.11485. S2CID 4130751.
    • Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". arXiv:2204.05862.
  2. ^ Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Gray, Alex; Schulman, John; Hilton, Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie; Askell, Amanda; Welinder, Peter; Christiano, Paul; Leike, Jan; Lowe, Ryan (31 October 2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155.
  3. ^ Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  4. ^ Edwards, Benj (1 December 2022). "OpenAI invites everyone to test ChatGPT, a new AI-powered chatbot—with amusing results". Ars Technica. Retrieved 4 March 2023.
  5. ^ Farseev, Aleks. "Council Post: Is Bigger Better? Why The ChatGPT Vs. GPT-3 Vs. GPT-4 'Battle' Is Just A Family Chat". Forbes. Retrieved 4 March 2023.
  6. ^ Glaese, Amelia; McAleese, Nat; Trębacz, Maja; Aslanides, John; Firoiu, Vlad; Ewalds, Timo; Rauh, Maribeth; Weidinger, Laura; Chadwick, Martin; Thacker, Phoebe; Campbell-Gillingham, Lucy; Uesato, Jonathan; Huang, Po-Sen; Comanescu, Ramona; Yang, Fan; See, Abigail; Dathathri, Sumanth; Greig, Rory; Chen, Charlie; Fritz, Doug; Elias, Jaume Sanchez; Green, Richard; Mokrá, Soňa; Fernando, Nicholas; Wu, Boxi; Foley, Rachel; Young, Susannah; Gabriel, Iason; Isaac, William; Mellor, John; Hassabis, Demis; Kavukcuoglu, Koray; Hendricks, Lisa Anne; Irving, Geoffrey (2022). "Improving alignment of dialogue agents via targeted human judgements". arXiv:2209.14375.