From 21c541222add1c12fe2d5a1e892fc40ffa952b71 Mon Sep 17 00:00:00 2001 From: Max Ku Date: Sun, 7 Jan 2024 15:17:56 -0500 Subject: [PATCH] update ghpage --- index.html | 316 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 311 insertions(+), 5 deletions(-) diff --git a/index.html b/index.html index ee1b9af..b18b754 100644 --- a/index.html +++ b/index.html @@ -227,12 +227,318 @@

How well do traditional metrics correlate with humans compared to VIEScore?

Looking into the details, we found that GPT-4v achieves ratings on par with humans on the text-to-image generation task, but it struggles on image editing tasks. We also compared VIEScore against the traditional metrics.


+<table>
+  <tr>
+    <th>Method</th>
+    <th>Method-Human SC corr</th>
+    <th>Method-Human PQ corr</th>
+    <th>Method-Human O corr</th>
+  </tr>
+  <tr><th colspan="4">Text-guided Image Generation Model (5 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.5044</td><td>0.3640</td><td>0.4652</td></tr>
+  <tr><td>CLIP-Score</td><td>-0.0817</td><td>-0.0114</td><td>-0.0881</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.4885</td><td>0.2379</td><td>0.4614</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.4531</td><td>0.1770</td><td>0.3801</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.1809</td><td>0.0306</td><td>0.1410</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.1789</td><td>-0.0020</td><td>0.1309</td></tr>
+  <tr><th colspan="4">Mask-guided Image Editing Model (4 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.5390</td><td>0.5030</td><td>0.4981</td></tr>
+  <tr><td>LPIPS</td><td>-0.1012</td><td>0.0646</td><td>-0.0694</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.4508</td><td>0.2859</td><td>0.4069</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.4088</td><td>0.2352</td><td>0.3810</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.1180</td><td>-0.0531</td><td>0.0675</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.1263</td><td>-0.0145</td><td>0.1040</td></tr>
+  <tr><th colspan="4">Text-guided Image Editing Model (8 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.4230</td><td>0.5052</td><td>0.4184</td></tr>
+  <tr><td>LPIPS</td><td>0.0956</td><td>0.2504</td><td>0.1142</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.2610</td><td>0.4274</td><td>0.2456</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.2428</td><td>0.3402</td><td>0.2279</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.0448</td><td>0.0583</td><td>0.0273</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.0185</td><td>-0.0107</td><td>0.0258</td></tr>
+  <tr><th colspan="4">Subject-driven Image Generation Model (4 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.4780</td><td>0.3565</td><td>0.4653</td></tr>
+  <tr><td>DINO</td><td>0.4160</td><td>0.1206</td><td>0.4246</td></tr>
+  <tr><td>CLIP-I</td><td>0.2961</td><td>0.1694</td><td>0.3058</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.3979</td><td>0.1903</td><td>0.3738</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.2757</td><td>0.2261</td><td>0.2753</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.0326</td><td>-0.0303</td><td>0.1219</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.1334</td><td>0.0858</td><td>0.1248</td></tr>
+  <tr><th colspan="4">Subject-driven Image Editing Model (3 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.4887</td><td>0.2986</td><td>0.4747</td></tr>
+  <tr><td>DINO</td><td>0.3022</td><td>-0.0381</td><td>0.3005</td></tr>
+  <tr><td>CLIP-I</td><td>0.2834</td><td>0.1248</td><td>0.2813</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.3274</td><td>0.2960</td><td>0.1507</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>-0.0255</td><td>0.1572</td><td>-0.0139</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.0360</td><td>-0.0073</td><td>0.0168</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.0587</td><td>-0.0249</td><td>0.0309</td></tr>
+  <tr><th colspan="4">Multi-concept Image Composition Model (3 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.5927</td><td>0.5145</td><td>0.5919</td></tr>
+  <tr><td>DINO</td><td>0.0979</td><td>-0.1643</td><td>0.0958</td></tr>
+  <tr><td>CLIP-I</td><td>0.1512</td><td>-0.0963</td><td>0.1498</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.3209</td><td>0.3025</td><td>0.3346</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.1859</td><td>0.1185</td><td>0.1918</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.1022</td><td>0.1194</td><td>0.1070</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.0828</td><td>0.0379</td><td>0.0293</td></tr>
+  <tr><th colspan="4">Control-guided Image Generation Model (2 models)</th></tr>
+  <tr><td>Human Raters</td><td>0.5443</td><td>0.5279</td><td>0.5307</td></tr>
+  <tr><td>LPIPS</td><td>0.3699</td><td>0.4204</td><td>0.4133</td></tr>
+  <tr><td>VIEScore (GPT-4v, 0-shot)</td><td>0.4360</td><td>0.4975</td><td>0.3999</td></tr>
+  <tr><td>VIEScore (GPT-4v, 1-shot)</td><td>0.3892</td><td>0.4132</td><td>0.4237</td></tr>
+  <tr><td>VIEScore (LLaVA, 0-shot)</td><td>0.2207</td><td>0.1060</td><td>0.1679</td></tr>
+  <tr><td>VIEScore (LLaVA, 1-shot)</td><td>0.1121</td><td>0.0247</td><td>0.0416</td></tr>
+</table>

+<p>
+  Table 2: Correlation comparison of available methods (SC: semantic consistency; PQ: perceptual quality; O: overall). We highlight the best method and the correlation numbers closest to those of the human raters. In conclusion, VIEScore is the best-performing metric for evaluating synthetic images across all tasks. DINO, on the other hand, proves to be an effective metric for the subject-driven image generation and editing tasks.
+</p>
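
For context, each "Method-Human corr" cell in Table 2 measures how well a metric's scores track the human ratings on the same set of generated images. Below is a minimal sketch of how such a number could be computed; the choice of Spearman's rank correlation, the variable names, and the toy scores are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: correlating a metric's scores with human ratings.
# Assumption: Spearman's rho and the toy data below are for illustration
# only; the paper's exact correlation protocol may differ.
from scipy.stats import spearmanr

# Hypothetical per-image scores for one task and one metric.
human_overall = [8.0, 3.5, 6.0, 9.0, 2.0, 7.5]        # averaged human O ratings
metric_overall = [0.90, 0.40, 0.55, 0.85, 0.30, 0.70]  # the metric's O scores

# spearmanr returns the rank correlation and its p-value.
rho, p_value = spearmanr(human_overall, metric_overall)
print(f"Method-Human O corr: {rho:.4f} (p = {p_value:.4f})")
```

A rank correlation is a natural fit here because metrics and human raters use different scales; only the relative ordering of images needs to agree.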