<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Scaling White-Box Transformers for Vision</title>
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<!-- <link rel="icon" href="./resources/icon.png"> -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">Scaling White-Box Transformers for Vision</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">
<span>Jinrui Yang</a></span><sup>*1</sup>,</span>
<span class="author-block">
<span>Xianhang Li</a></span><sup>*1</sup>,</span>
<span class="author-block">
<span>Druv Pai</a></span><sup>2</sup>,</span>
</span>
<span class="author-block">
<span>Yuyin Zhou</a></span><sup>1</sup>,</span>
</span>
<span class="author-block">
<span>Yi Ma</a></span><sup>2</sup>,</span>
</span>
<span class="author-block">
<span>Yaodong Yu</a></span><sup>†2</sup>,</span>
</span>
<span class="author-block">
<span>Cihang Xie</a></span><sup>†1</sup></span>
</span>
</div>
<!-- Equal contribution note -->
<div class="equal-contribution-advising-note">
<p>* equal contribution, † equal advising</p>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block">
<span><sup>1</sup>UC Santa Cruz</span>,</span>
<span class="author-block">
<span><sup>2</sup>UC Berkeley</span></span>
<!-- <span class="author-block"><sup>3</sup>UC, Santa Cruz</span> -->
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- PDF Link. -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2405.20299"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa-solidassasas fa-face-smiling-hands"></i>
<img src="./resources/ar.svg" alt="img" style="width: 100%; height: 100%" />
</span>
<span>arXiv</span>
</a>
</span>
<!-- Code Link. -->
<span class="link-block">
<a href="https://github.com/UCSC-VLAA/CRATE-alpha"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- Model Link. -->
<span class="link-block">
<a href="https://huggingface.co/UCSC-VLAA/CRATE-alpha/tree/main"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa-solidassasas fa-face-smiling-hands"></i>
<img src="./resources/hg.svg" alt="img" style="width: 100%; height: 100%" />
</span>
<span>Model</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<br>
<section class="hero teaser">
<div class="container">
<div class="hero-body">
<center><h2 class="title is-3">Framework of CRATE-α</h2></center>
<center><img src="./resources/crate-alpha-arch.png" alt="alt text"
style="width: 80%; object-fit: cover; max-width:80%;"></a></center>
<h2 class="subtitle has-text-centered">
One layer of the CRATE-α model architecture.
<span class="math">MSSA</span> (<strong>M</strong>ulti-head <strong>S</strong>ubspace <strong>S</strong>elf-<strong>A</strong>ttention) represents the compression block, and <tt>ODL</tt> (<strong>O</strong>vercomplete <strong>D</strong>ictionary <strong>L</strong>earning) represents the sparse coding block.
</h2>
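<div class="content has-text-justified">
<p>
As an illustrative sketch only, and not the released implementation, the hypothetical PyTorch-style pseudocode below shows how one such layer could chain an MSSA compression step with an ODL sparse-coding step. The class and parameter names (e.g., <tt>CrateAlphaLayerSketch</tt>, <tt>overcomplete_ratio</tt>) are placeholders; see the <a href="https://github.com/UCSC-VLAA/CRATE-alpha">code repository</a> for the actual model definition.
</p>
<pre><code>
# Illustrative sketch only -- not the released CRATE-alpha code.
import torch.nn as nn
import torch.nn.functional as F

class CrateAlphaLayerSketch(nn.Module):
    """One layer: an MSSA (compression) block followed by an ODL (sparse coding) block."""
    def __init__(self, dim, heads, overcomplete_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # MSSA stand-in: multi-head attention used here as the compression step.
        self.mssa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # ODL stand-in: an overcomplete dictionary (encoder) and its decoder.
        self.encode = nn.Linear(dim, overcomplete_ratio * dim, bias=False)
        self.decode = nn.Linear(overcomplete_ratio * dim, dim, bias=False)

    def forward(self, z):
        # Compression step: residual MSSA update of the token representations.
        x = self.norm1(z)
        z = z + self.mssa(x, x, x, need_weights=False)[0]
        # Sparse coding step: encode against the overcomplete dictionary,
        # apply a thresholding nonlinearity, then decode back to token space.
        codes = F.relu(self.encode(self.norm2(z)))
        return self.decode(codes)
</code></pre>
</div>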
</div>
</div>
</section>
<br>
<section class="section">
<div class="container">
<!-- Abstract. -->
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
CRATE, a white-box transformer architecture designed to learn compressed and sparse representations, offers an intriguing alternative to standard vision transformers (ViTs) due to its inherent mathematical interpretability. Despite extensive investigations into the scaling behaviors of language and vision transformers, the scalability of CRATE remains an open question, which this paper aims to address. Specifically, we propose CRATE-α, featuring strategic yet minimal modifications to the sparse coding block of the CRATE architecture, along with a light training recipe designed to improve the scalability of CRATE. Through extensive experiments, we demonstrate that CRATE-α can effectively scale with larger model sizes and datasets. For example, our CRATE-α-B substantially outperforms the prior best CRATE-B model on ImageNet classification by 3.7%, achieving an accuracy of 83.2%. Meanwhile, when scaling further, our CRATE-α-L obtains an ImageNet classification accuracy of 85.1%. More notably, these performance improvements are achieved while preserving, and potentially even enhancing, the interpretability of learned CRATE models, as we demonstrate by showing that the learned token representations of increasingly larger trained CRATE-α models yield increasingly higher-quality unsupervised object segmentation of images. The project page is <a href="https://rayjryang.github.io/CRATE-alpha/">https://rayjryang.github.io/CRATE-alpha/</a>.
</p>
</div>
</div>
</div>
</div>
</section>
<br>
<section class="hero teaser">
<div class="container">
<div class="hero-body">
<center><h2 class="title is-3">Comparison of CRATE, CRATE-α, and ViT </h2></center>
<center><img src="./resources/fig_1_crate_alpha.png" alt="alt text"
style="width: 80%; object-fit: cover; max-width:80%;"></a></center>
<h2 class="subtitle has-text-centered">
<i>Left:</i> We show how modifications to individual components improve the performance of the <b>CRATE</b> model on ImageNet-1K. <i>Right:</i> We compare the FLOPs and ImageNet-1K accuracy of our models with ViT (<a href="https://arxiv.org/abs/2010.11929">Dosovitskiy et al., 2020</a>) and CRATE (<a href="https://ma-lab-berkeley.github.io/CRATE/">Yu et al., 2023</a>). CRATE is trained only on ImageNet-1K, while <b>ours</b> and ViT are pre-trained on ImageNet-21K.
</h2>
</div>
</div>
</section>
<br>
<section class="hero teaser">
<div class="container">
<div class="hero-body">
<center><h2 class="title is-3">Visualize the Improvement of Semantic Interpretability of CRATE-α</h2></center>
<center><img src="./resources/figure_cutler_segmentation.png" alt="alt text"
style="width: 80%; object-fit: cover; max-width:80%;"></a></center>
<h2 class="subtitle has-text-centered">
<strong>Visualization of segmentation on COCO val2017 (<a href="https://arxiv.org/abs/1405.0312">Lin et al., 2014</a>) with MaskCut (<a href="https://arxiv.org/abs/2301.11320">Wang et al., 2023</a>).</strong>
<em>Top row</em>: Our supervised model effectively identifies the main objects in each image. Compared with <strong>CRATE</strong> (<em>middle row</em>), ours achieves better segmentation in terms of boundary quality.
<em>Bottom row</em>: Supervised ViT fails to identify the main objects in most images. We mark failed cases with <img src="./resources/red_box.png" alt="red box" style="width: 0.25cm;">.
</h2>
</div>
</div>
</section>
<br>
<footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This website is based on the <a href="http://nerfies.github.io">Nerfies</a> template.
</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>