: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Abstract
Recent advances in large multimodal models (LMMs) have extended their capabilities to video understanding. In particular, text-to-video (T2V) models have made remarkable progress in quality, comprehension, and video length, and excel at creating videos from simple text prompts. Yet they still frequently produce hallucinated content, clear signs that a video is AI-generated. We introduce a large-scale text-to-video benchmark of hallucinated videos produced by T2V models. We identify five major types of hallucination: Vanishing Subject, Numeric Variability, Temporal Dysmorphia, Omission Error, and Physical Incongruity. Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos, comprising 3,782 videos human-annotated with these five categories. The dataset was created by prompting the T2V models with MS COCO captions and manually classifying the resulting videos by hallucination type. It offers a unique resource for evaluating the reliability of T2V models and lays the groundwork for improving hallucination detection and mitigation in video generation. We establish classification as a baseline and present configurations of various ensemble classifiers, among which the TimeSFormer + CNN combination performs best, achieving an accuracy of 0.345 and an F1 score of 0.342. This benchmark aims to drive the development of robust T2V models that generate videos precisely aligned with their input prompts.
1 Introduction
ããã¹ãããåç»ãžã®å€æã¢ãã«ã¯è¿å¹Žã倧ããªé²æ©ãéããŠãããããã¹ãããã³ããããå°è±¡çãªäžè²«æ§ãšèŠèŠçå¿ å®åºŠãæã€åç»ã³ã³ãã³ããçæããããšãå¯èœã«ãªã£ãŠããããããã®ã¢ãã«ã¯ãå ¥åããã¹ãã®æå³ã«å¯Ÿå¿ããè€éãªèŠèŠç詳现ãå¹æçã«æããé«å質ã®åç»ãçæããèœåãåŸã ã«åäžãããŠãããããããªããããããã®é²æ©ã«ããããããããã®åéã§æãå·®ãè¿«ã£ã課é¡ã®äžã€ã¯ãå¹»èŠãããã³ã³ãã³ãã®çæâããã¹ãããã³ããã§èšè¿°ãããæå³ãããã·ãŒã³ãšäžäžèŽãŸãã¯æªæ²ããèŠèŠçèŠçŽ âã§ãããå¹»èŠã¯T2Våºåã®çŸå®æ§ãšä¿¡é Œæ§ãæãªããããã³ã³ãã³ãå¶äœãæè²ãã·ãã¥ã¬ãŒã·ã§ã³ã·ã¹ãã ãªã©ãå ¥åããã¹ããžã®æ£ç¢ºãªéµå®ãæéèŠã§ããå¿çšåéã«ãããŠé倧ãªåé¡ãšãªã£ãŠããã
To address this problem, we introduce a comprehensive large-scale dataset designed to systematically investigate and categorize hallucinations within T2V models. The dataset was developed by collecting 700 randomly selected captions from the MS-COCO dataset and using them to prompt 10 major open-source T2V models, including MS1.7B, MagicTime, AnimateDiff-MotionAdapter, and Zeroscope V2 XL. The resulting dataset consists of 3,782 videos, each annotated by humans. It identifies the various types of hallucination frequently encountered in T2V generation, including errors such as the omission of key scene elements, mismatched subject counts, temporal inconsistencies, physical incongruities, and unexpectedly vanishing subjects.
The dataset serves as a valuable resource for evaluating and advancing hallucination detection in T2V models. It is meticulously annotated to enable detailed analysis, giving researchers the tools to assess the limitations of current T2V systems and to explore methodologies for reducing these errors. By providing a standardized framework for categorizing hallucinations and establishing a benchmark, it paves the way for developing more accurate and reliable T2V models that better reflect the intended meaning of the input text.
In summary, the main contributions of this paper are as follows:
- We introduce a new benchmark for evaluating text-to-visual hallucination phenomena. The benchmark is designed to rigorously assess a model's ability to generate visual content from text input, focusing in particular on the accuracy and consistency of the generated visuals and their fidelity to the provided text description (see Section 3).
- We provide a standardized framework for quantifying hallucinations (instances where the generated visuals deviate from or misrepresent the input text), thereby promoting the understanding and mitigation of errors in T2V models and aiming to improve their consistency and reliability (see Section 3).
- We conduct a comprehensive benchmark evaluation of various classification models, assessing their performance on key metrics such as accuracy and F1 score (see Section 4).
2 Related Work
The phenomenon of hallucination in generative models has been studied extensively across modalities, including text, image, and video [22]. In text generation, large language models (LLMs) such as GPT-3 [3] have been shown to produce content that is syntactically plausible yet factually inaccurate or contradictory to the input prompt. This hallucination problem has been addressed systematically through the development of dedicated benchmarks such as the Hallucinations Leaderboard [11], which provides an evaluation framework for LLMs on hallucination-prone tasks.
Image generation: Text-to-image models such as DALL-E [21] and Imagen [23] have demonstrated an advanced ability to generate highly realistic images from text descriptions. Nevertheless, these models are not immune to producing artifacts or visual elements that contradict the input description. To address this problem, datasets such as the Hallucination Detection dataset (HADES) [15] have been introduced, providing a benchmark for token-level, reference-free hallucination detection in free-form generation.
Video generation: Hallucination in video generation is further complicated by the need to maintain temporal consistency across a sequence of frames. Recent work in this area aims to mitigate the problem. For example, Sora Detector [5] proposes a unified framework for detecting hallucinations in large T2V models. The approach incorporates techniques such as keyframe extraction and knowledge-graph construction to identify inconsistencies both within individual frames and along the temporal dimension of the video sequence. In addition, the VideoHallucer benchmark [30] provides a detailed evaluation of hallucination in video-to-text models by categorizing hallucinations into types such as object relations, temporal details, semantic details, and extrinsic factual and extrinsic non-factual hallucinations.
Despite these advances, the availability of large-scale, human-annotated datasets dedicated to hallucination in T2V models remains severely limited. Our dataset is designed to fill this gap by providing a comprehensive resource for the systematic study and evaluation of hallucination in T2V models. By supplying a large volume of annotated video data in which hallucinations are classified into distinct types, it serves as an important benchmark for developing and evaluating methods aimed at detecting and mitigating hallucination in T2V models.
While remarkable progress has been made in understanding and mitigating hallucination across modalities, our dataset represents a significant advance in the specific context of T2V models. It provides researchers and practitioners with the tools needed to develop more accurate and reliable video generation systems, ultimately improving the fidelity and applicability of T2V technology.
3 Dataset
3.1 Dataset construction
To construct the dataset, we selected 700 random captions from the MS COCO dataset [14]. MS COCO is known for its diverse, descriptive text prompts, making it an ideal resource for evaluating the generative performance of T2V models. These captions were then used as input to 10 different open-source T2V models, chosen to represent a variety of architectures, model sizes, and training paradigms. The specific models included in this study are: (i) MS1.7B [1], (ii) MagicTime [33], (iii) AnimateDiff-MotionAdapter [9], (iv) zeroscope_v2_576w [24], (v) zeroscope_v2_XL [25], (vi) AnimateLCM [29], (vii) HotShotXL [19], (viii) AnimateDiff Lightning [13], (ix) Show1 [35], (x) MORA [34].
These models generated video outputs from the MS COCO captions, which were then systematically analyzed to identify the presence and frequency of hallucinations. In addition to these open-source models, we produced roughly 40-50 videos using two closed-source state-of-the-art models, Runway [8] and Luma [17]. Videos generated by both the open- and closed-source models were rigorously inspected to highlight examples of hallucination, further corroborating that such artifacts are widespread across both model categories. This analysis provides evidence that hallucination is pervasive across diverse T2V systems, whether open- or closed-source. The pipeline is depicted in Figure 3.
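To make the generation step concrete, the following is a minimal sketch of prompting one of the ten models (MS1.7B) with COCO captions, assuming the Hugging Face diffusers library; the caption list, inference-step count, and fp16 settings are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load one open-source T2V checkpoint (MS1.7B); the other models
# from the study would be swapped in analogously.
pipe = DiffusionPipeline.from_pretrained(
    "ali-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

# Placeholder stand-in for the 700 randomly drawn MS COCO captions.
captions = ["A man riding a wave on top of a surfboard."]

for i, prompt in enumerate(captions):
    # .frames[0] holds the generated frame sequence in recent diffusers versions.
    frames = pipe(prompt, num_inference_steps=25).frames[0]
    export_to_video(frames, f"ms17b_{i:04d}.mp4")
```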
3.2 Hallucination Categories
To systematically categorize the various types of hallucination we observed, we established five distinct categories, shown in Figure 2, which cover the majority of common hallucinations present in T2V outputs.
1. Vanishing Subject (VS): A subject in the generated video, or part of it, persistently disappears at some point in the video (see Figure 4).
2. Numeric Variability (NV): When the prompt specifies a number of subjects, the number of subject instances in the generated video increases or decreases (see Figure 5).
3. Temporal Dysmorphia (TD): Objects rendered in the video exhibit continuous temporal deformation, with shape, scale, or orientation changing gradually or persistently over the duration of the sequence (see Figure 6).
4. Omission Error (OE): The generated video omits key elements of the original prompt (see Figure 7), excluding cases involving the specified number of subjects, resulting in an incomplete or inaccurate depiction, or introduces unscripted actions and behaviors that misrepresent the intended scene.
5. Physical Incongruity (PI): The generated video violates basic physical laws or juxtaposes incompatible elements (see Figure 8), causing perceptual contradictions and cognitive dissonance for the viewer.
Physical Incongruity and Temporal Dysmorphia are the dominant hallucination categories, together accounting for nearly half of the hallucinated content observed in current T2V models. This distribution suggests that these models frequently struggle to maintain logical consistency between the generated visuals and the text input, and to faithfully depict every element specified in the prompt.
In contrast, the least frequent category, Vanishing Subject, shows that T2V models do sometimes struggle to depict the intended subject consistently, but that this issue is rarer than physical incongruities or omissions, highlighting that subject-persistence failures are less common among T2V hallucinations.
No hallucination: A generated video is confirmed to be hallucination-free when it accurately reflects the given context and contains no extraneous or distorted elements; its visual output aligns seamlessly with a real-world depiction of the prompt, remaining faithful to the scenario and avoiding misleading elements (see Figure 9).
3.3 Dataset Analysis
The videos generated by the open-source T2V models are 1-2 seconds long, and the dataset comprises a total of 3,782 individual videos, each exhibiting characteristics corresponding to one of the five predefined hallucination categories. This distribution ensures a diverse dataset, enabling comprehensive evaluation and analysis of content containing hallucinated elements. Table 1 shows the distribution of hallucination categories across the different video models.
| T2V Model | VS | NV | TD | OE | PI | Total |
|---|---|---|---|---|---|---|
| AnimateLCM [29] | 2 | 70 | 70 | 70 | 70 | 282 |
| zeroscope_v2_XL [25] | 18 | 0 | 37 | 109 | 199 | 363 |
| Show1 [35] | 13 | 71 | 88 | 111 | 55 | 338 |
| MORA [34] | 82 | 96 | 99 | 202 | 215 | 694 |
| AnimateDiff Lightning [13] | 11 | 33 | 52 | 56 | 63 | 215 |
| AnimateDiff-MotionAdapter [9] | 28 | 59 | 158 | 182 | 94 | 521 |
| MagicTime [33] | 70 | 70 | 70 | 69 | 70 | 349 |
| zeroscope_v2_576w [24] | 17 | 0 | 41 | 115 | 187 | 360 |
| MS1.7B [1] | 51 | 50 | 70 | 70 | 70 | 311 |
| HotShotXL [19] | 70 | 70 | 70 | 69 | 70 | 349 |
| Total | 362 | 519 | 755 | 1053 | 1093 | 3782 |
| Hallucination Categories | Cohen's Kappa | Krippendorff's Alpha |
|---|---|---|
| Vanishing Subject | 0.7660 | 0.7669 |
| Numeric Variability | 0.8500 | 0.8508 |
| Temporal Dysmorphia | 0.8173 | 0.8181 |
| Omission Error | 0.7474 | 0.7487 |
| Physical Incongruity | 0.8737 | 0.8743 |
3.4 Annotation Details
Each of the 3,782 videos was assigned a label corresponding to the most salient hallucination type identified. Although some videos may contain multiple hallucinations, we chose to annotate each video according to its most dominant hallucination category to ensure consistency in the annotation process.
3.5 Human Annotation
A total of 6,950 videos were generated across the 10 T2V models (695 per model). Because human annotation is resource-intensive, it was performed only on a limited sample. The annotation guidelines are presented in Algorithm 1.
3.5.1 Inter-Annotator Agreement
To assess the consistency and reliability of our annotations, we computed Cohen's kappa (κ) [31] and Krippendorff's alpha (α) [32] for each hallucination category. These inter-annotator agreement metrics quantitatively measure how closely the classifications of different annotators converge. The closely matching agreement scores in Table 2 stem from the limited sample size: only two annotators (two graduate students) took part in the video annotation process, which limits the interpretive divergence and disagreement that might be observed with a larger annotator pool. The results show a high level of agreement for most categories, indicating strong consistency in the annotation process.
Our analysis shows that the Physical Incongruity category achieved the highest inter-annotator reliability, with both κ and α reaching 0.87. This suggests that the criteria for identifying this particular type of hallucination are clear and well-defined, leading to consistent judgments across annotators. In contrast, the Omission Error category showed the lowest agreement scores, at 0.7474 and 0.7487 respectively. This lower consistency may stem from the subjective nature of the assessment, with annotators possibly differing in their interpretation and identification thresholds.
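For reference, here is a minimal sketch of how per-category agreement can be computed, assuming scikit-learn for Cohen's kappa, the krippendorff package for alpha, and a hypothetical binary per-video encoding (category present/absent) for each of the two annotators; the encoding is our assumption, not a detail stated above.

```python
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Hypothetical labels from the two annotators for one category.
labels_a = [1, 0, 1, 1, 0, 1]  # annotator 1: category present (1) / absent (0)
labels_b = [1, 0, 1, 0, 0, 1]  # annotator 2

kappa = cohen_kappa_score(labels_a, labels_b)
alpha = krippendorff.alpha(
    reliability_data=[labels_a, labels_b],  # one row per annotator
    level_of_measurement="nominal",
)
print(f"Cohen's kappa: {kappa:.4f}, Krippendorff's alpha: {alpha:.4f}")
```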
Annotation challenges: Multiple hallucinations can occur within a single video, and human perception tends to prioritize one of them. The example shown in Figure 10 was drawn from the videos used to assess inter-annotator agreement: it was initially labeled as Vanishing Subject, while a subsequent annotator classified it as Physical Incongruity. Both interpretations are valid: the batter disappears over time (Vanishing Subject), while the mismatch between the rendered players and the camera angle creates cognitive dissonance (Physical Incongruity).
3.6 Open-source vs. Closed-source T2V models
Closed-source models typically generate videos longer than 4 seconds, whereas all videos from the 10 open-source models in our study were capped at 2 seconds. Hallucinations do occur in closed-source models, but their frequency appears lower: of the 40 videos generated with each model, at least 6-8 showed no hallucination. This suggests that closed-source models tend to follow a given prompt more faithfully and consistently than open-source models. Closed-source models were also superior in video quality and in the sharpness of rendered objects. By contrast, for open-source models, especially at low resolutions, it can be difficult to determine whether an object is genuinely deformed or whether apparent temporal dysmorphia is merely an artifact of the low video resolution.
4 Benchmark
As the challenge of video hallucination grows, addressing the problem is critical. Currently, the only T2V hallucination benchmark in the literature is T2VHaluBench [5], which consists of just 50 videos, limiting its usefulness for robust evaluation (Table 3). To overcome this limitation, we propose a more comprehensive benchmark to drive further research, and we provide several classical classification baselines that support hallucination-category prediction. We believe this benchmark will be an important resource for advancing research in this area.
| T2V Hallucination Benchmark | # Videos |
|---|---|
| T2VHaluBench [4] | 50 |
| Ours | 3,782 |
4.1 T2V Hallucination Classification
We evaluated the dataset using a variety of classification models, framing hallucination classification in text-to-video generation as a new task. In the first step, video embeddings were extracted from two pretrained models: VideoMAE (a video masked autoencoder for data-efficient self-supervised pretraining) [27] and TimeSFormer (a space-time attention network for video understanding) [2]. These extracted embeddings were then used as feature representations for seven different classification algorithms: Long Short-Term Memory (LSTM) [26], Transformer [28], Convolutional Neural Network (CNN) [12], Gated Recurrent Unit (GRU) [6], Recurrent Neural Network (RNN) [18], Random Forest (RF) [10], and Support Vector Machine (SVM) [7]. This comprehensive evaluation across model architectures enables a thorough comparison of classification performance on the given video dataset.
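A minimal sketch of the embedding-extraction step follows, assuming the Hugging Face transformers checkpoints "MCG-NJU/videomae-base" and "facebook/timesformer-base-finetuned-k400" and mean-pooling over token sequences; the specific checkpoints and pooling strategy are our assumptions, as the text does not name them.

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerModel, VideoMAEModel

# Placeholder clip: 16 random RGB frames standing in for a decoded video.
frames = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))

vmae_proc = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
vmae = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")
tsf_proc = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
tsf = TimesformerModel.from_pretrained("facebook/timesformer-base-finetuned-k400")

with torch.no_grad():
    # VideoMAE-base consumes 16 frames; mean-pool tokens into one vector.
    v_emb = vmae(**vmae_proc(frames, return_tensors="pt")).last_hidden_state.mean(1)
    # TimeSFormer-base (Kinetics-400) consumes 8 frames, so subsample by 2.
    t_emb = tsf(**tsf_proc(frames[::2], return_tensors="pt")).last_hidden_state.mean(1)

print(v_emb.shape, t_emb.shape)  # (1, 768) each
```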
4.2 Experimental Setup
The dataset was split 80% for training and 20% for testing, and the Adam/AdamW optimizers were used [16]. Additional details are given in Table 4.
| Model | # epochs | batch size | optimizer | loss |
|---|---|---|---|---|
| GRU | 30 | 32 | AdamW | categorical_crossentropy |
| LSTM | 120 | 128 | Adam | categorical_crossentropy |
| Transformer | 100 | 128 | Adam | categorical_crossentropy |
| CNN | 100 | 128 | Adam | categorical_crossentropy |
| RNN | 120 | 128 | Adam | categorical_crossentropy |
| RF | N/A | N/A | N/A | N/A |
| SVM | N/A | N/A | N/A | N/A |
Classification was performed using the video embeddings extracted by the TimeSFormer and VideoMAE models, which operate on individual frames; however, the classification task did not explicitly exploit a frame-by-frame approach.
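As an illustration of the classical baselines under this setup, here is a minimal sketch of the Random Forest pipeline with the 80/20 split; the placeholder features, RF hyperparameters, and macro F1 averaging are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Placeholder inputs: one pooled video embedding and one of the five
# hallucination-category labels (VS, NV, TD, OE, PI) per video.
X = np.random.rand(3782, 768)
y = np.random.randint(0, 5, size=3782)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("macro F1:", f1_score(y_te, pred, average="macro"))
```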
4.3 Results and Analysis
Table 5 presents a comprehensive comparison of each model's performance metrics (accuracy and F1 score) across the two feature sets, VideoMAE and TimeSFormer embeddings.
Among models trained on VideoMAE embeddings, the RF model achieved the highest accuracy at 0.331, while the LSTM model led on F1 score with a best value of 0.299. The GRU model performed worst, with an accuracy of 0.268 and an F1 score of 0.190, both substantially below the other models in this category.
With TimeSFormer embeddings, the CNN model outperformed all others, achieving both the highest accuracy (0.345) and the highest F1 score (0.342). The LSTM model was competitive, with an accuracy of 0.337 and an F1 score of 0.334. In contrast, the SVM model was least effective, with an accuracy of 0.270 and an F1 score of 0.274, markedly lower than the other models.
Overall, TimeSFormer embeddings consistently outperformed VideoMAE embeddings across most models, yielding higher accuracy and F1 scores. The combination of TimeSFormer embeddings with the CNN model delivered the best performance on both metrics, making it the most effective configuration in this study.
| Model | Accuracy | F1 Score |
|---|---|---|
| VideoMAE + GRU | 0.268 | 0.190 |
| VideoMAE + LSTM | 0.302 | 0.299 |
| VideoMAE + Transformer | 0.284 | 0.254 |
| VideoMAE + CNN | 0.303 | 0.290 |
| VideoMAE + RNN | 0.289 | 0.289 |
| VideoMAE + RF | 0.331 | 0.279 |
| VideoMAE + SVM | 0.277 | 0.282 |
| TimeSFormer + GRU | 0.325 | 0.279 |
| TimeSFormer + LSTM | 0.337 | 0.334 |
| TimeSFormer + Transformer | 0.322 | 0.284 |
| TimeSFormer + CNN | 0.345 | 0.342 |
| TimeSFormer + RNN | 0.299 | 0.299 |
| TimeSFormer + RF | 0.341 | 0.282 |
| TimeSFormer + SVM | 0.270 | 0.274 |
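To make the best-performing configuration concrete, the following is a minimal PyTorch sketch of a CNN head over a sequence of per-frame TimeSFormer embeddings; the layer widths, kernel size, and temporal pooling are illustrative assumptions, since the exact architecture is not specified above.

```python
import torch
import torch.nn as nn

class CNNHallucinationClassifier(nn.Module):
    def __init__(self, embed_dim=768, num_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over the temporal (frame) axis
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, embed_dim, num_frames)
        return self.fc(self.conv(x).squeeze(-1))

# Train with categorical cross-entropy, matching Table 4.
model = CNNHallucinationClassifier()
logits = model(torch.randn(4, 768, 16))  # 4 videos, 16 frames each
print(logits.shape)  # (4, 5): one score per hallucination category
```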
5 Conclusion, Limitations, and Future Work
With the rapid progress of generative AI, and of T2V models in particular, their performance is approaching parity with other modalities. However, hallucination in these models poses a serious challenge. To address this problem, we introduced a new large-scale benchmark for evaluating hallucination in T2V models, enabling standardized evaluation and laying the groundwork for future research, comparative studies, and model improvement. The main contributions of this paper are:
- The introduction of a new large-scale benchmark dedicated to evaluating hallucination in T2V models.
- A comprehensive analysis of the dataset, including the establishment of baseline hallucination-detection performance with classification models.
A limitation of our current work is that it does not address the detection of multiple hallucination categories within a single video, which remains a complex problem. Annotation is also inherently subjective: individual assessments may differ on the threshold at which a given level of hallucination is acceptable or warrants exclusion.
Future work will focus on extending the dataset to include newly emerging hallucination categories and on exploring potential techniques for mitigating these errors.
References
- ali vilab [2023] ali vilab. ali-vilab/text-to-video-ms-1.7b · hugging face. https://huggingface.co/ali-vilab/text-to-video-ms-1.7b, 2023. (Accessed on 10/28/2024).
- Bertasius et al. [2021] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding?, 2021.
- Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
- Chu et al. [2024a] Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, and Kui Ren. Sora detector: A unified hallucination detection for large text-to-video models. arXiv preprint arXiv:2405.04180, 2024a.
- Chu et al. [2024b] Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, and Kui Ren. Sora detector: A unified hallucination detection for large text-to-video models, 2024b.
- Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. Presented at the NIPS 2014 Deep Learning and Representation Learning Workshop.
- Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- [8] Anastasis Germanidis. Gen-2: Generate novel videos with text, images or video clips.
- Guo [2023] Yuwei Guo. guoyww/animatediff-motion-adapter-v1-5-2 · hugging face. https://huggingface.co/guoyww/animatediff-motion-adapter-v1-5-2, 2023. (Accessed on 10/28/2024).
- Ho [1995] Tin Kam Ho. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, pages 278–282. IEEE, 1995.
- Hong et al. [2024] Giwon Hong, Aryo Pradipta Gema, Rohit Saxena, Xiaotang Du, Ping Nie, Yu Zhao, Laura Perez-Beltrachini, Max Ryabinin, Xuanli He, Clémentine Fourrier, and Pasquale Minervini. The hallucinations leaderboard - an open effort to measure hallucinations in large language models. CoRR, abs/2404.05904, 2024.
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
- Lin and Yang [2024] Shanchuan Lin and Xiao Yang. Animatediff-lightning: Cross-model diffusion distillation, 2024.
- Lin et al. [2015] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
- Liu et al. [2022] Tianyu Liu, Yizhe Zhang, Chris Brockett, Yi Mao, Zhifang Sui, Weizhu Chen, and Bill Dolan. A token-level reference-free hallucination detection benchmark for free-form text generation, 2022.
- Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [17] Lumalabs. Dream machine.
- Mikolov et al. [2010] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech 2010, pages 1045–1048, 2010.
- Mullan et al. [2023] John Mullan, Duncan Crawbuck, and Aakash Sastry. Hotshot-XL, 2023.
- Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022.
- Rawte et al. [2023] Vipula Rawte, Amit Sheth, and Amitava Das. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.
- Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
- Sterling [2023a] Spencer Sterling. cerspense/zeroscope_v2_576w · hugging face. https://huggingface.co/cerspense/zeroscope_v2_576w, 2023a. (Accessed on 10/28/2024).
- Sterling [2023b] Spencer Sterling. cerspense/zeroscope_v2_xl · hugging face. https://huggingface.co/cerspense/zeroscope_v2_XL, 2023b. (Accessed on 10/28/2024).
- Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014.
- Tong et al. [2022] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
- Wang et al. [2024a] Fu-Yun Wang, Zhaoyang Huang, Weikang Bian, Xiaoyu Shi, Keqiang Sun, Guanglu Song, Yu Liu, and Hongsheng Li. Animatelcm: Computation-efficient personalized style video generation without personalized video data, 2024a.
- Wang et al. [2024b] Yuxuan Wang, Yueqian Wang, Dongyan Zhao, Cihang Xie, and Zilong Zheng. Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models, 2024b.
- [31] Wikipedia. Cohen's kappa.
- [32] Wikipedia. Krippendorff's alpha.
- Yuan et al. [2024a] Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, and Jiebo Luo. Magictime: Time-lapse video generation models as metamorphic simulators, 2024a.
- Yuan et al. [2024b] Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, Chi Wang, Yanfang Ye, and Lichao Sun. Mora: Enabling generalist video generation via a multi-agent framework, 2024b.
- Zhang et al. [2023] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation, 2023.