<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>Python Web Scraping Tutorial 02</title>
<url>/2024/10/17/02Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-测试网址"><a href="#1-测试网址" class="headerlink" title="1. Test Site"></a>1. Test Site</h1><p><a class="link" href="https://www.spiderbuf.cn/list" >Test site<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a></p>
<h1 id="2-编写代码"><a href="#2-编写代码" class="headerlink" title="2. Writing the Code"></a>2. Writing the Code</h1><h2 id="2-1-代码示例"><a href="#2-1-代码示例" class="headerlink" title="2.1 Sample Code"></a>2.1 Sample Code</h2><div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://spiderbuf.cn/playground/s01"</span></span><br><span class="line"></span><br><span class="line">html = requests.get(url=url).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/01course/01.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line"><span class="comment"># print(html)</span></span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/01course/data01.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> <span class="comment"># str() guards against empty <td></td> cells, where td.text is None</span></span><br><span class="line"> s = s + <span class="built_in">str</span>(td.text) + <span class="string">' | '</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="comment"># save the parsed data to a local file</span></span><br><span class="line"> <span class="keyword">if</span> s != <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>In many cases it is a good idea to save the fetched page to a local file first (as the code above does). Parsing rarely succeeds on the first attempt, and re-requesting the page on every retry means hitting the server over and over, which makes the crawler easy to detect.</p>
<h1 id="3-lxml-基础使用"><a href="#3-lxml-基础使用" class="headerlink" title="3. lxml Basics"></a>3. lxml Basics</h1><h2 id="3-1-XPath-基本语法"><a href="#3-1-XPath-基本语法" class="headerlink" title="3.1 Basic XPath Syntax"></a>3.1 Basic XPath Syntax</h2><ul>
<li><code>/</code>: select from the root node. The root node is the topmost node of an XML or HTML document; every other element is a descendant of it.</li>
<li><code>//</code>: select all matching nodes anywhere in the document, regardless of their position.</li>
<li><code>.</code>: the current node.</li>
<li><code>..</code>: the parent node.</li>
<li><code>@</code>: select an attribute.</li>
<li><code>[ ]</code>: a predicate that adds a filtering condition.</li>
<li><code>text()</code>: select text nodes.</li>
</ul>
<h2 id="3-2-示例说明"><a href="#3-2-示例说明" class="headerlink" title="3.2 Example"></a>3.2 Example</h2><p>Suppose we have the following HTML structure, with <code><html></code> as the root node:</p>
<div class="code-container" data-rel="Html"><figure class="iseeu highlight html"><table><tr><td class="code"><pre><span class="line"><span class="tag"><<span class="name">html</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">table</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"status"</span>></span>在线<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>123<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"status"</span>></span>离线<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>456<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span> <span class="attr">class</span>=<span class="string">"status"</span>></span>在线<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>789<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">tr</span>></span></span><br><span class="line"> <span 
class="tag"></<span class="name">table</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">body</span>></span></span><br><span class="line"><span class="tag"></<span class="name">html</span>></span></span><br></pre></td></tr></table></figure></div>
<h2 id="3-3-使用-XPath-提取数据"><a href="#3-3-使用-XPath-提取数据" class="headerlink" title="3.3 Extracting Data with XPath"></a>3.3 Extracting Data with XPath</h2><ol>
<li><strong>Extract the <code><td></code> content of every row:</strong></li>
</ol>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br></pre></td></tr></table></figure></div>
<p><code>//tr</code> selects every <code><tr></code> node in the document, no matter at which level it appears.</p>
<ol start="2">
<li><strong>Extract the data of one particular column:</strong></li>
</ol>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">status_list = root.xpath(<span class="string">'//td[@class="status"]/text()'</span>)</span><br><span class="line"><span class="built_in">print</span>(status_list)</span><br></pre></td></tr></table></figure></div>
<ul>
<li><code>//td[@class="status"]</code>: select every <code><td></code> node whose class is "status".</li>
<li><code>/text()</code>: select the text content of those nodes.</li>
</ul>
<p><strong>Result:</strong></p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">[<span class="string">'在线'</span>, <span class="string">'离线'</span>, <span class="string">'在线'</span>]</span><br></pre></td></tr></table></figure></div>
<ol start="3">
<li><strong>Combine a <code>for</code> loop to extract several columns row by row:</strong></li>
</ol>
<figure class="iseeu highlight python"><table><tr><td class="code">
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> status = tr.xpath(<span class="string">'./td[@class="status"]/text()'</span>)[<span class="number">0</span>] <span class="comment"># get the status</span></span><br><span class="line"> number = tr.xpath(<span class="string">'./td[2]/text()'</span>)[<span class="number">0</span>] <span class="comment"># get the number</span></span><br><span class="line"> <span class="built_in">print</span>(<span class="string">f"状态: <span class="subst">{status}</span>, 数字: <span class="subst">{number}</span>"</span>)</span><br></pre></td></tr></table></figure></div>
<ul>
<li><code>./td[@class="status"]/text()</code>: from the current row, select the <code><td></code> whose class is "status".</li>
<li><code>./td[2]/text()</code>: select the text of the current row's second <code><td></code> node.</li>
</ul>
<p><strong>Output:</strong></p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">状态: 在线, 数字: 123</span><br><span class="line">状态: 离线, 数字: 456</span><br><span class="line">状态: 在线, 数字: 789</span><br></pre></td></tr></table></figure></div>
<h2 id="3-4-XPath-实用技巧"><a href="#3-4-XPath-实用技巧" class="headerlink" title="3.4 Practical XPath Tips"></a>3.4 Practical XPath Tips</h2><ol>
<li><p><strong>Select nodes by attribute:</strong></p>
<ul>
<li><code>//td[@class="status"]</code> selects every <code><td></code> whose class is "status".</li>
<li><code>//a[@href]</code> selects every <code><a></code> tag that has an <code>href</code> attribute.</li>
</ul>
</li>
<li><p><strong>Select specific nodes by index:</strong></p>
<ul>
<li><code>//tr[1]</code> selects the first <code><tr></code> node.</li>
<li><code>//td[2]</code> selects the second <code><td></code> node within the current <code><tr></code>.</li>
</ul>
</li>
<li><p><strong>Select text and attributes:</strong></p>
<ul>
<li><code>//td/text()</code> selects the text inside <code><td></code>.</li>
<li><code>//a/@href</code> selects the <code>href</code> attribute of every <code><a></code> tag.</li>
</ul>
</li>
<li><p><strong>Filter by condition:</strong></p>
<ul>
<li><code>//tr[td[1]="在线"]</code> selects the rows whose first column's text is "在线".</li>
</ul>
</li>
</ol>
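Putting the tips above together, here is a minimal, self-contained sketch that runs the attribute filter and the condition filter against the sample table from section 3.2 (it assumes lxml is installed, as in the earlier examples):

```python
from lxml import etree

html = """
<html><body><table>
  <tr><td class="status">在线</td><td>123</td></tr>
  <tr><td class="status">离线</td><td>456</td></tr>
  <tr><td class="status">在线</td><td>789</td></tr>
</table></body></html>
"""

root = etree.HTML(html)

# Attribute filter: the text of every status cell
statuses = root.xpath('//td[@class="status"]/text()')

# Condition filter: rows whose first <td> text is "在线",
# then the second column of each matching row
online_rows = root.xpath('//tr[td[1]="在线"]')
online_numbers = [tr.xpath('./td[2]/text()')[0] for tr in online_rows]

print(statuses)        # ['在线', '离线', '在线']
print(online_numbers)  # ['123', '789']
```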
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python Web Scraping Tutorial 01</title>
<url>/2024/09/17/01Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-浏览器"><a href="#1-浏览器" class="headerlink" title="1. The Browser"></a>1. The Browser</h1><h2 id="1-1-获取页面信息"><a href="#1-1-获取页面信息" class="headerlink" title="1.1 Inspecting the Page"></a>1.1 Inspecting the Page</h2><p>Open the page you want to scrape and press <code>F12</code> to enter developer mode:<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s2.loli.net/2024/09/17/2L7UlQKkAjaNVud.png"
alt="豆瓣Top250电影.png"
><figcaption>豆瓣Top250电影.png</figcaption></figure></p>
<p>Switch to the <code>network</code> tab. The <code>red dot</code> in the top-left corner pauses recording; refreshing the page captures its traffic. The entry that matters most to us is <code>top250?start=</code>, which corresponds exactly to the last part of the URL <code>https://movie.douban.com/top250?start=</code>.<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s2.loli.net/2024/09/17/uDgx4JOBaVZT6w3.png"
alt="豆瓣Top250电影.png"
><figcaption>豆瓣Top250电影.png</figcaption></figure></p>
<p>Select the <code>top250?start=</code> entry. This is the information we use to disguise our crawler as a browser (in effect making <code>Python</code> pose as the <code>Chrome</code> browser), and the <code>User-Agent</code> field matters most.<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s2.loli.net/2024/09/17/z5tewy4Zmr2AD3V.png"
alt="top250?start=.png"
><figcaption>top250?start=.png</figcaption></figure></p>
<h1 id="2-配置爬虫编程环境"><a href="#2-配置爬虫编程环境" class="headerlink" title="2. Setting Up the Scraping Environment"></a>2. Setting Up the Scraping Environment</h1><ul>
<li>Create a virtual environment<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">conda create -n scrapy python</span><br></pre></td></tr></table></figure></div></li>
<li>Activate it:<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">conda activate scrapy</span><br></pre></td></tr></table></figure></div></li>
<li>Install the required packages (<code>urllib</code> and <code>sqlite3</code> ship with Python and cannot be installed via pip; Beautiful Soup is installed as <code>beautifulsoup4</code>):<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">pip install pandas beautifulsoup4 xlwt</span><br></pre></td></tr></table></figure></div></li>
</ul>
<h1 id="3-urllib-包讲解"><a href="#3-urllib-包讲解" class="headerlink" title="3. The urllib Package"></a>3. The urllib Package</h1><p>We need to import these two modules:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> urllib.parse</span><br><span class="line"><span class="keyword">import</span> urllib.request</span><br></pre></td></tr></table></figure></div>
<h2 id="3-1-get-请求"><a href="#3-1-get-请求" class="headerlink" title="3.1 GET Requests"></a>3.1 GET Requests</h2><p>Send a <code>get</code> request to Baidu and save the content to a file named <code>baidu.html</code>:</p>
<figure class="iseeu highlight python">
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Send an HTTP GET request to the Baidu homepage</span></span><br><span class="line">response = urllib.request.urlopen(<span class="string">"http://www.baidu.com"</span>)</span><br><span class="line"><span class="comment"># read() reads the page content from the response object</span></span><br><span class="line">content = response.read()</span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">"baidu.html"</span>, <span class="string">"wb"</span>) <span class="keyword">as</span> f:</span><br><span class="line"> f.write(content)</span><br><span class="line"><span class="built_in">print</span>(<span class="string">"Content saved to baidu.html"</span>)</span><br></pre></td></tr></table></figure></div>
<p>Open <code>baidu.html</code> in a browser and you will see that it is the Baidu search page.</p>
<p>The <code>read()</code> method returns the page as byte data (<code>bytes</code> type), i.e. the HTML source before any decoding.<br>The file is opened with mode <code>wb</code> (write binary), which opens it for writing in binary mode. Since <code>response.read()</code> returns bytes rather than text, binary mode is the right way to write it to a file.</p>
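The bytes-vs-text distinction can be seen in isolation, without any network access (the literal string here is just an illustration):

```python
# response.read() returns undecoded bytes; simulate that here
raw = "百度一下,你就知道".encode("utf-8")
print(type(raw))           # <class 'bytes'>

# decode("utf-8") turns the bytes back into readable text
text = raw.decode("utf-8")
print(text)                # 百度一下,你就知道
```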
<h2 id="3-2-post-请求"><a href="#3-2-post-请求" class="headerlink" title="3.2 POST Requests"></a>3.2 POST Requests</h2><p>The test URL is <code>https://httpbin.org/post</code>:</p>
<figure class="iseeu highlight python">
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Encode the dict {"hello": "world"} as the query string 'hello=world'</span></span><br><span class="line">data = <span class="built_in">bytes</span>(urllib.parse.urlencode({<span class="string">"hello"</span> : <span class="string">"world"</span>}), encoding=<span class="string">"utf-8"</span>)</span><br><span class="line"><span class="comment"># Send a POST request; data carries the payload</span></span><br><span class="line">response = urllib.request.urlopen(<span class="string">"https://httpbin.org/post"</span>, data=data)</span><br><span class="line"><span class="built_in">print</span>(response.read().decode(<span class="string">"utf-8"</span>))</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">{</span><br><span class="line"> "args": {}, </span><br><span class="line"> "data": "", </span><br><span class="line"> "files": {}, </span><br><span class="line"> "form": {</span><br><span class="line"> "hello": "world"</span><br><span class="line"> }, </span><br><span class="line"> "headers": {</span><br><span class="line"> "Accept-Encoding": "identity", </span><br><span class="line"> "Content-Length": "11", </span><br><span class="line"> "Content-Type": "application/x-www-form-urlencoded", </span><br><span class="line"> "Host": "httpbin.org", </span><br><span class="line"> "User-Agent": "Python-urllib/3.12", </span><br><span class="line"> "X-Amzn-Trace-Id": "Root=1-66e971aa-12f8daa034aa05d42f8a080c"</span><br><span class="line"> }, </span><br><span class="line"> "json": null, </span><br><span class="line"> "origin": "113.57.44.61", </span><br><span class="line"> "url": "https://httpbin.org/post"</span><br><span class="line">}</span><br></pre></td></tr></table></figure></div>
<p>Note the <code>User-Agent</code> in the output: <code>Python-urllib/3.12</code>. Without extra headers the request is not disguised at all; the server can tell it came from a Python script rather than a browser.</p>
<p><code>POST</code> requests are ubiquitous in web development and are typically used to submit data to a server: form submissions, file uploads, creating new resources, and so on. When a <code>POST</code> request is sent, the server expects the data in the request body (<code>body</code>), not appended to the URL as a query string the way a <code>GET</code> request does. That is why the data to send must be wrapped in the <code>data</code> parameter.</p>
<h3 id="3-2-1-POST-请求的用途"><a href="#3-2-1-POST-请求的用途" class="headerlink" title="3.2.1 What POST Requests Are Used For"></a>3.2.1 What POST Requests Are Used For</h3><ul>
<li><strong>Submitting form data:</strong> when a user fills in a form on a page and clicks submit, the browser sends a <code>POST</code> request and the form data travels in the request body.</li>
<li><strong>Uploading files:</strong> file uploads usually go through <code>POST</code> requests, with the file data wrapped in the request body.</li>
<li><strong>Creating or modifying resources:</strong> in a REST API, for example, <code>POST</code> is commonly used to create new resources on the server, sending JSON- or XML-formatted data.</li>
<li><strong>Submitting sensitive data:</strong> because <code>POST</code> data travels in the request body rather than the URL, it is better suited to sensitive information such as passwords and personal data (though HTTPS is still needed for real security).</li>
</ul>
<h3 id="3-2-2-封装数据的用途"><a href="#3-2-2-封装数据的用途" class="headerlink" title="3.2.2 Why the Data Is Wrapped"></a>3.2.2 Why the Data Is Wrapped</h3><ul>
<li>In a <code>POST</code> request the server expects to receive the data from the request <strong>body</strong>, so the data must be sent through the <code>data</code> parameter in byte (<code>bytes</code>) form.</li>
<li>If no <code>data</code> parameter is given, <code>urllib.request.urlopen()</code> sends a <code>GET</code> request by default (i.e. carrying no data), and <code>GET</code> can only fetch resources, not modify them. Hitting an API such as <code>https://httpbin.org/post</code> that only handles <code>POST</code> then makes the server return an <code>HTTP 405</code> error, <code>METHOD NOT ALLOWED</code>, because the server accepts <code>POST</code> but not <code>GET</code>.</li>
<li>Once the data is wrapped as <code>bytes</code> and passed in, <code>urlopen</code> knows you are sending a <code>POST</code> request rather than a <code>GET</code>.</li>
</ul>
<h3 id="3-2-3-将数据封装为-bytes-的原因"><a href="#3-2-3-将数据封装为-bytes-的原因" class="headerlink" title="3.2.3 Why the Data Is Encoded as bytes"></a>3.2.3 Why the Data Is Encoded as bytes</h3><p>In a <code>POST</code> request the data travels in the HTTP request body, and the HTTP protocol transmits data as bytes (<code>bytes</code>), so the data has to be converted to <code>bytes</code> first. Python's <code>urllib.parse.urlencode()</code> function encodes a dict into query-string form, and <code>bytes()</code> then converts that string into a byte stream.</p>
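The two conversion steps can be watched in isolation, standard library only (the payload dict is just an illustration):

```python
import urllib.parse

payload = {"hello": "world", "page": 2}

# Step 1: dict -> query string (values are converted to str and percent-encoded)
encoded = urllib.parse.urlencode(payload)
print(encoded)  # hello=world&page=2

# Step 2: str -> bytes, the form required for an HTTP request body
data = bytes(encoded, encoding="utf-8")
print(data)     # b'hello=world&page=2'
```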
<h2 id="3-3-超时处理"><a href="#3-3-超时处理" class="headerlink" title="3.3 Timeout Handling"></a>3.3 Timeout Handling</h2><p>Using a <code>get</code> request as the example:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">try</span>:</span><br><span class="line"> response = urllib.request.urlopen(<span class="string">"https://httpbin.org/get"</span>, timeout=<span class="number">0.01</span>)</span><br><span class="line"> <span class="built_in">print</span>(response.read().decode(<span class="string">"utf-8"</span>))</span><br><span class="line"><span class="keyword">except</span> urllib.error.URLError <span class="keyword">as</span> e:</span><br><span class="line"> <span class="built_in">print</span>(<span class="string">"time out"</span>)</span><br></pre></td></tr></table></figure></div>
<h2 id="3-4-查看响应"><a href="#3-4-查看响应" class="headerlink" title="3.4 Inspecting the Response"></a>3.4 Inspecting the Response</h2><div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">response = urllib.request.urlopen(<span class="string">"http://www.baidu.com"</span>)</span><br><span class="line"><span class="built_in">print</span>(response.status)</span><br></pre></td></tr></table></figure></div>
<p>The output is <code>200</code>, meaning the HTTP request succeeded and the server processed it and returned the requested resource. An HTTP status code is the standard response code a server returns after handling a client request.</p>
<p><strong>Common HTTP status codes:</strong></p>
<ul>
<li><code>2xx</code> (success):<ul>
<li><code>200 OK</code>: the request succeeded and the resource was returned.</li>
<li><code>201 Created</code>: the request succeeded and the server created a new resource.</li>
</ul>
</li>
<li><code>3xx</code> (redirection):<ul>
<li><code>301 Moved Permanently</code>: the resource has permanently moved to a new location.</li>
<li><code>302 Found</code>: the resource has temporarily moved to a new location.</li>
</ul>
</li>
<li><code>4xx</code> (client errors):<ul>
<li><code>400 Bad Request</code>: the request is invalid, usually because it is malformed.</li>
<li><code>401 Unauthorized</code>: not authorized, usually because no credentials were supplied.</li>
<li><code>404 Not Found</code>: the server cannot find the requested resource.</li>
<li><code>418</code>: officially the joke status "I'm a teapot"; in practice it often means the site's anti-crawler protection rejected the request, and the fix is to send proper request headers (in particular <code>User-Agent</code>).</li>
</ul>
</li>
<li><code>5xx</code> (server errors):<ul>
<li><code>500 Internal Server Error</code>: an internal server error.</li>
<li><code>503 Service Unavailable</code>: the server temporarily cannot handle the request, usually due to overload or maintenance.</li>
</ul>
</li>
</ul>
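As a summary of the classes above, here is a tiny helper of our own (not part of urllib; the function name is made up for illustration) that maps a code to its class:

```python
def describe_status(code: int) -> str:
    """Map an HTTP status code to its broad class (custom helper, not urllib API)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

print(describe_status(200))  # success
print(describe_status(404))  # client error
print(describe_status(503))  # server error
```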
<p>View the response headers:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">response = urllib.request.urlopen(<span class="string">"http://baidu.com"</span>)</span><br><span class="line"><span class="built_in">print</span>(response.getheaders())</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">[('Date', 'Tue, 17 Sep 2024 15:43:39 GMT'), ('Server', 'Apache'), ('Last-Modified', 'Tue, 12 Jan 2010 13:48:00 GMT'), ('ETag', '"51-47cf7e6ee8400"'), ('Accept-Ranges', 'bytes'), ('Content-Length', '81'), ('Cache-Control', 'max-age=86400'), ('Expires', 'Wed, 18 Sep 2024 15:43:39 GMT'), ('Connection', 'Close'), ('Content-Type', 'text/html')]</span><br></pre></td></tr></table></figure></div>
<p>This output is exactly the response header of the corresponding page (here <code>baidu</code>), as the screenshot shows:<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s2.loli.net/2024/09/17/1JbukLBxFqTNgOm.png"
alt="百度响应头.png"
><figcaption>百度响应头.png</figcaption></figure></p>
<h2 id="3-5-把访问伪装成浏览器"><a href="#3-5-把访问伪装成浏览器" class="headerlink" title="3.5 Disguising the Request as a Browser"></a>3.5 Disguising the Request as a Browser</h2><p>The <code>user-agent</code> string can be found by pressing <code>F12</code> in the browser and looking at the request headers of the corresponding page.</p>
<figure class="iseeu highlight python">
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">url = <span class="string">"https://douban.com"</span></span><br><span class="line"><span class="comment"># Add more key-value pairs here for a more convincing disguise</span></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"user-agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"</span></span><br><span class="line">}</span><br><span class="line">data = <span class="built_in">bytes</span>(urllib.parse.urlencode({<span class="string">'name'</span>: <span class="string">'eric'</span>}), encoding=<span class="string">"utf-8"</span>)</span><br><span class="line">req = urllib.request.Request(url=url, data=data, headers=headers, method=<span class="string">"POST"</span>)</span><br><span class="line">response = urllib.request.urlopen(req)</span><br><span class="line"><span class="built_in">print</span>(response.read())</span><br></pre></td></tr></table></figure></div>
<p>This way we can retrieve the page content of that site.</p>
<p>Wrapping <code>data</code> is not mandatory here: we <strong>don't need to send any data to Douban</strong>, so the <code>data</code> parameter can be omitted, but then <code>urllib.request.Request()</code> defaults to the <code>GET</code> method instead of <code>POST</code>.</p>
<p> When using a <code>GET</code> request, keep in mind that <code>GET</code> <strong>appends the data to the URL</strong> as a query string instead of putting it in the request body. <code>GET</code> requests are therefore a poor fit for submitting large amounts of data, and unsuitable for sensitive data.</p>
<p>The data can be spliced into the URL, like this:<br> <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">url = <span class="string">"https://douban.com?name=eric"</span></span><br><span class="line">req = urllib.request.Request(url=url, headers=headers)</span><br><span class="line">response = urllib.request.urlopen(req)</span><br><span class="line"><span class="built_in">print</span>(response.read().decode(<span class="string">"utf-8"</span>))</span><br></pre></td></tr></table></figure></div><br> This puts the parameter <code>name=eric</code> directly after the base URL, forming a complete query URL.</p>
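Rather than concatenating by hand, urllib.parse.urlencode can build the query string safely, which matters once a value contains characters that are not URL-safe (the parameters below are hypothetical, chosen only to show the percent-encoding):

```python
import urllib.parse

base = "https://douban.com"
params = {"name": "eric", "city": "武汉"}  # hypothetical parameters

# urlencode percent-encodes anything that is not URL-safe
url = base + "?" + urllib.parse.urlencode(params)
print(url)  # https://douban.com?name=eric&city=%E6%AD%A6%E6%B1%89
```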
<h1 id="4-bs4-包讲解"><a href="#4-bs4-包讲解" class="headerlink" title="4. The bs4 Package"></a>4. The bs4 Package</h1><p><code>BeautifulSoup</code> parses HTML or XML into a tree of Python objects, so developers can find and manipulate data in many ways, e.g. by tag, class name, or attribute.<br>We need to import this class:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> bs4 <span class="keyword">import</span> BeautifulSoup</span><br></pre></td></tr></table></figure></div>
<ul>
<li><p>Main features</p>
<ol>
<li><strong>Parsing HTML/XML documents</strong>: parses an HTML or XML document into a tree structure that is easy to work with.</li>
<li><strong>Extracting data</strong>: offers many ways to find and filter page content, e.g. by tag, class name, or attribute.</li>
<li><strong>Pretty-printing</strong>: can format an HTML document to make it easier to read.</li>
<li><strong>Repairing markup</strong>: copes with malformed HTML and still produces structured output.</li>
</ol>
</li>
<li><p>Common methods</p>
<ol>
<li><code>find(tag, **kwargs)</code>: finds the first tag that matches.</li>
<li><code>find_all(tag, **kwargs)</code>: finds every tag that matches.</li>
<li><code>select(css_selector)</code>: finds elements via a CSS selector.</li>
<li><code>get(attribute)</code>: gets the value of one of a tag's attributes.</li>
<li><code>text</code>: gets a tag's text content.</li>
</ol>
</li>
</ul>
<h2 id="4-1-获取标签及其里的内容"><a href="#4-1-获取标签及其里的内容" class="headerlink" title="4.1 Getting Tags and Their Content"></a>4.1 Getting Tags and Their Content</h2><p>The test file here is the <code>baidu</code> homepage we saved in section <code>3.1 GET Requests</code>.</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">file = <span class="built_in">open</span>(<span class="string">"course/test/baidu.html"</span>, <span class="string">"rb"</span>)</span><br><span class="line">html = file.read().decode(<span class="string">"utf-8"</span>)</span><br><span class="line">bs = BeautifulSoup(html, <span class="string">"html.parser"</span>)</span><br></pre></td></tr></table></figure></div>
<p>Two arguments are passed to <code>BeautifulSoup</code> here. The first, <code>html</code>, is the document to parse (besides HTML, <code>BeautifulSoup</code> can also handle formats such as <code>xml</code>). The second, <code>"html.parser"</code>, names the parser to use. The returned object <code>bs</code> holds the parsed result, and all later operations go through <code>bs</code>.</p>
<ul>
<li><strong>Getting a tag and its content</strong></li>
</ul>
<p>For example:</p>
<figure class="iseeu highlight python">
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.title) <span class="comment"># the tag and its content</span></span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line"><title>百度一下,你就知道</title></span><br></pre></td></tr></table></figure></div>
<p>It pulls the document's <code>title</code> straight out.</p>
<p>Another example; run:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.a)</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line"><a class="toindex" href="/">百度首页</a></span><br></pre></td></tr></table></figure></div>
<p>This pulls out the document's first <code>a</code> tag and its content.</p>
<p>The pattern is clear: when you look things up this way, it returns the <strong>first</strong> occurrence of that tag in the file.</p>
<p>If this is still unclear, try printing the type of the returned object:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(<span class="built_in">type</span>(bs.a))</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line"><class 'bs4.element.Tag'></span><br></pre></td></tr></table></figure></div>
<p>As you can see, the return type is a <strong>tag (<code>Tag</code>)</strong>.</p>
<p>If we add a qualifier to the object we print:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.title.string)</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">百度一下,你就知道</span><br></pre></td></tr></table></figure></div>
<p>only the content inside the tag is returned.</p>
<p>Get a tag's attributes as a dictionary:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.a.attrs)</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">{'class': ['toindex'], 'href': '/'}</span><br></pre></td></tr></table></figure></div>
<p>This corresponds to the matching line in the page source, <code><a class="toindex" href="/">百度首页</a></code>; as you can see, the attributes come back as a dictionary.</p>
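<p>Beyond <code>attrs</code>, individual attributes can be read by subscript or with <code>get()</code>. A minimal self-contained sketch (the tiny stand-in document below is ours; the original examples parse Baidu's homepage):</p>

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (illustrative; the tutorial parses Baidu's homepage)
html = '<html><head><title>Demo</title></head><body><a class="toindex" href="/">首页</a></body></html>'
bs = BeautifulSoup(html, "html.parser")

print(bs.a.attrs)      # the whole attribute dict: {'class': ['toindex'], 'href': '/'}
print(bs.a["href"])    # subscript access to a single attribute
print(bs.a.get("id"))  # .get() returns None instead of raising KeyError
```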
<h2 id="4-2-文档的遍历"><a href="#4-2-文档的遍历" class="headerlink" title="4.2 文档的遍历"></a>4.2 Traversing the Document</h2><div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.head.contents) </span><br></pre></td></tr></table></figure></div>
<p>Output: a list is returned, as shown below<br><a href="https://imgse.com/i/pAK8ozd"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/09/18/pAK8ozd.png"
alt="返回结果.png"
><figcaption>返回结果.png</figcaption></figure></a></p>
<p>Since we know the return value is a list, we can access it by index:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="built_in">print</span>(bs.head.contents[<span class="number">0</span>]) </span><br></pre></td></tr></table></figure></div>
<p>To learn more, search your browser for <strong>遍历文件树</strong> (traversing the parse tree).</p>
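<p>A small self-contained sketch (the stand-in HTML is ours) showing <code>contents</code> next to its iterator twin <code>children</code>:</p>

```python
from bs4 import BeautifulSoup

# Stand-in document (illustrative)
html = "<html><head><title>Demo</title><meta charset='utf-8'/></head></html>"
bs = BeautifulSoup(html, "html.parser")

print(bs.head.contents)         # direct children as a real list
print(bs.head.contents[0])      # so it can be indexed like any list
for child in bs.head.children:  # .children yields the same nodes as an iterator
    print(child.name)
```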
<h2 id="4-3-文档的搜索"><a href="#4-3-文档的搜索" class="headerlink" title="4.3 文档的搜索"></a>4.3 Searching the Document</h2><ol>
<li><p><strong>String filtering:</strong> finds content that exactly matches the given string</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.find_all(<span class="string">"a"</span>)</span><br></pre></td></tr></table></figure></div>
<p>This finds all <code>a</code> tags and stores every matching tag in a list.</p>
</li>
<li><p><strong>Regular-expression search:</strong> matches content using the <code>search</code> method</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> re <span class="comment"># 正则表达式解析库</span></span><br><span class="line"></span><br><span class="line">t_list = bs.find_all(re.<span class="built_in">compile</span>(<span class="string">"a"</span>))</span><br><span class="line"><span class="built_in">print</span>(t_list)</span><br></pre></td></tr></table></figure></div>
<p>This stores, in a list, every tag whose name contains <code>a</code> (the pattern is matched against tag names, so it also catches tags such as <code>head</code> and <code>label</code>).</p>
</li>
<li><p><strong>Passing a function:</strong> pass in a function (method) and search according to its criteria</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">name_is_exist</span>(<span class="params">tag</span>):</span><br><span class="line"> <span class="keyword">return</span> tag.has_attr(<span class="string">"name"</span>)</span><br><span class="line"></span><br><span class="line">t_list = bs.find_all(name_is_exist)</span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> item <span class="keyword">in</span> t_list:</span><br><span class="line"> <span class="built_in">print</span>(item)</span><br></pre></td></tr></table></figure></div>
<p>When you pass a custom filter function to <code>find_all()</code>, the function is evaluated for every tag object. It must return <code>True</code> or <code>False</code>: tags for which it returns <code>True</code> are included in the result; tags for which it returns <code>False</code> are skipped. <code>has_attr()</code> is a <code>BeautifulSoup</code> method that checks whether a tag has a given attribute; here, <code>tag.has_attr("name")</code> tests whether the tag has a <code>name</code> attribute.</p>
</li>
<li><p><strong>kwargs:</strong> search by keyword argument</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.find_all(<span class="built_in">id</span>=<span class="string">"head"</span>)</span><br></pre></td></tr></table></figure></div>
<p>This stores every tag with <code>id="head"</code> in the <code>t_list</code> list.</p>
</li>
</ol>
<p>Another example:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.find_all(class_ =<span class="literal">True</span>)</span><br></pre></td></tr></table></figure></div>
<p>This finds every tag that has a <code>class</code> attribute. Why write <code>class_</code>? Because <code>class</code> is a reserved keyword in Python, and reserved keywords cannot be used directly as parameter names.</p>
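<p>A quick sketch of these keyword filters on a tiny stand-in document (the HTML snippet is ours, not from the tutorial's target page):</p>

```python
from bs4 import BeautifulSoup

# Stand-in snippet (illustrative)
html = ('<div class="mnav"><a href="/news" class="mnav" id="top">新闻</a>'
        '<a href="/map">地图</a></div>')
bs = BeautifulSoup(html, "html.parser")

print(bs.find_all(class_="mnav"))     # tags whose class is "mnav" (the div and one <a>)
print(bs.find_all("a", class_=True))  # only <a> tags that carry any class
print(bs.find_all(id=True))           # tags that have an id attribute
```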
<ol start="5">
<li><p><strong>The string parameter:</strong> searches for the specified text</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.find_all(string =[<span class="string">"hao123"</span>, <span class="string">"地图"</span>, <span class="string">"贴吧"</span>])</span><br><span class="line"><span class="built_in">print</span>(t_list)</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">['hao123', '地图', '贴吧', '贴吧', '地图']</span><br></pre></td></tr></table></figure></div>
<p>This approach can also take a regular expression to find content containing particular text, i.e. the strings inside tags.<br>For example:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> re</span><br><span class="line"></span><br><span class="line">t_list = bs.find_all(string =re.<span class="built_in">compile</span>(<span class="string">r"\d"</span>))</span><br></pre></td></tr></table></figure></div>
</li>
<li><p><strong>CSS selectors:</strong> use the <code>select()</code> method</p>
</li>
</ol>
<p>Search by tag name:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.select(<span class="string">"title"</span>)</span><br></pre></td></tr></table></figure></div>
<p>Search by class name:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">t_list = bs.select(<span class="string">".mnav"</span>) </span><br></pre></td></tr></table></figure></div>
<p>In CSS, <code>.</code> (the dot) selects elements with a particular class name; <code>.mnav</code> selects all elements with <code>class="mnav"</code>.</p>
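<p><code>select()</code> accepts the full CSS selector syntax, not just tag and class names. A small sketch on a stand-in snippet (the HTML is ours):</p>

```python
from bs4 import BeautifulSoup

# Stand-in snippet (illustrative)
html = ('<div id="head"><a class="mnav" href="/news">新闻</a>'
        '<a class="mnav" href="/map">地图</a></div>')
bs = BeautifulSoup(html, "html.parser")

print(bs.select("#head"))           # "#" selects by id
print(bs.select("div a"))           # descendant combinator: every <a> inside a <div>
print(bs.select('a[href="/map"]'))  # attribute selector
print(bs.select("div > a.mnav"))    # direct child carrying a class
```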
<h1 id="5-re-包讲解"><a href="#5-re-包讲解" class="headerlink" title="5.re 包讲解"></a>5. The re Module</h1><p>Python's <code>re</code> module handles regular expressions. It provides pattern matching, substitution, splitting, and related operations on strings. Regular expressions are a powerful tool for finding, matching, and manipulating specific patterns in strings.</p>
<h2 id="5-1-re-包的常用方法"><a href="#5-1-re-包的常用方法" class="headerlink" title="5.1 re 包的常用方法"></a>5.1 Common re Methods</h2><ol>
<li><p><strong>re.match():</strong></p>
<ul>
<li>Purpose: matches the pattern starting from the beginning of the string.</li>
<li>On success, returns a <code>Match</code> object; otherwise returns <code>None</code>.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> re</span><br><span class="line">result = re.<span class="keyword">match</span>(<span class="string">r'\d+'</span>, <span class="string">'123abc456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result.group()) <span class="comment"># Output: '123'</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.search():</strong></p>
<ul>
<li>Purpose: scans the entire string and finds the first match of the pattern.</li>
<li>On success, returns a <code>Match</code> object; otherwise returns <code>None</code>.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">result = re.search(<span class="string">r'\d+'</span>, <span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result.group()) <span class="comment"># Output: '123'</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.findall():</strong></p>
<ul>
<li>Purpose: returns all parts of the string that match the pattern, as a list.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">result = re.findall(<span class="string">r'\d+'</span>, <span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result) <span class="comment"># Output: ['123', '456']</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.finditer():</strong></p>
<ul>
<li>Purpose: returns an iterator that yields a <code>Match</code> object for every match in the string.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">matches = re.finditer(<span class="string">r'\d+'</span>, <span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="keyword">for</span> <span class="keyword">match</span> <span class="keyword">in</span> matches:</span><br><span class="line">    <span class="built_in">print</span>(<span class="keyword">match</span>.group()) <span class="comment"># Output: '123' '456'</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.sub():</strong></p>
<ul>
<li>Purpose: replaces the matched parts with the given string.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">result = re.sub(<span class="string">r'\d+'</span>, <span class="string">'NUMBER'</span>, <span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result) <span class="comment"># Output: 'abcNUMBERxyzNUMBER'</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.split():</strong></p>
<ul>
<li>Purpose: splits the string at the parts matched by the pattern and returns a list.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">result = re.split(<span class="string">r'\d+'</span>, <span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result) <span class="comment"># Output: ['abc', 'xyz', '']</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
<li><p><strong>re.compile():</strong></p>
<ul>
<li>Purpose: compiles the pattern into a regex object so it can be reused many times.</li>
<li>Example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">pattern = re.<span class="built_in">compile</span>(<span class="string">r'\d+'</span>)</span><br><span class="line">result = pattern.findall(<span class="string">'abc123xyz456'</span>)</span><br><span class="line"><span class="built_in">print</span>(result) <span class="comment"># Output: ['123', '456']</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
</ol>
<h2 id="5-2-正则表达式的常用操作符"><a href="#5-2-正则表达式的常用操作符" class="headerlink" title="5.2 正则表达式的常用操作符"></a>5.2 Common Regular-Expression Operators</h2><ol>
<li><p><strong><code>.</code></strong> (dot): matches any single character (except the newline <code>\n</code>).</p>
<ul>
<li>Example: <code>a.b</code> matches <code>acb</code> and <code>a9b</code>, but not <code>ab</code>.</li>
</ul>
</li>
<li><p><strong><code>^</code></strong> (caret): matches the start of the string.</p>
<ul>
<li>Example: <code>^a</code> matches the <code>a</code> in <code>abc</code>, but not in <code>bca</code>.</li>
</ul>
</li>
<li><p><strong><code>$</code></strong> (dollar sign): matches the end of the string.</p>
<ul>
<li>Example: <code>a$</code> matches the <code>a</code> in <code>bca</code>, but not in <code>abc</code>.</li>
</ul>
</li>
<li><p><strong><code>*</code></strong> (asterisk): matches the preceding character 0 or more times.</p>
<ul>
<li>Example: <code>ab*</code> matches <code>a</code>, <code>ab</code>, <code>abb</code>, <code>abbbbb</code>.</li>
</ul>
</li>
<li><p><strong><code>+</code></strong> (plus sign): matches the preceding character 1 or more times.</p>
<ul>
<li>Example: <code>ab+</code> matches <code>ab</code>, <code>abb</code>, <code>abbbbb</code>, but not <code>a</code>.</li>
</ul>
</li>
<li><p><strong><code>?</code></strong> (question mark): matches the preceding character 0 or 1 times.</p>
<ul>
<li>Example: <code>ab?</code> matches <code>a</code> and <code>ab</code>, but not <code>abb</code>.</li>
</ul>
</li>
<li><p><strong><code>{m}</code></strong>: matches the preceding character <strong>exactly m times</strong>.</p>
<ul>
<li>Example: <code>a{3}</code> matches only <code>aaa</code>.</li>
</ul>
</li>
<li><p><strong><code>{m,n}</code></strong>: matches the preceding character <strong>at least m and at most n times</strong>.</p>
<ul>
<li>Example: <code>a{2,4}</code> matches <code>aa</code>, <code>aaa</code>, <code>aaaa</code>.</li>
</ul>
</li>
<li><p><strong><code>[]</code></strong> (character set): matches any single character from the set.</p>
<ul>
<li>Example: <code>[abc]</code> matches any one of <code>a</code>, <code>b</code>, <code>c</code>.</li>
</ul>
</li>
<li><p><strong><code>|</code></strong> (pipe): the "or" operator; matches the expression on either side.</p>
<ul>
<li>Example: <code>a|b</code> matches <code>a</code> or <code>b</code>.</li>
</ul>
</li>
<li><p><strong><code>\</code></strong> (backslash): escapes special characters or introduces special sequences.</p>
<ul>
<li>Example: <code>\d</code> matches any digit, <code>\s</code> any whitespace character, and <code>\w</code> any word character (letter, digit, underscore).</li>
</ul>
</li>
<li><p><strong><code>()</code></strong> (parentheses): groups expressions, or captures matched substrings.</p>
<ul>
<li>Example: <code>(abc)+</code> matches <code>abc</code>, <code>abcabc</code>, and so on.</li>
<li>Capture example: <div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">result = re.search(<span class="string">r'(abc)(\d+)'</span>, <span class="string">'abc123'</span>)</span><br><span class="line"><span class="built_in">print</span>(result.groups()) <span class="comment"># Output: ('abc', '123')</span></span><br></pre></td></tr></table></figure></div></li>
</ul>
</li>
</ol>
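<p>The operators above, exercised with the standard-library <code>re</code> module (each assertion passes as written):</p>

```python
import re

assert re.fullmatch(r"a.b", "a9b")       # . matches any single character
assert not re.fullmatch(r"a.b", "ab")    # ...but exactly one character is required
assert re.fullmatch(r"ab*", "a")         # * allows zero occurrences
assert not re.fullmatch(r"ab+", "a")     # + requires at least one
assert re.fullmatch(r"a{2,4}", "aaa")    # bounded repetition
assert re.search(r"^abc$", "abc")        # ^ and $ anchor both ends
print(re.findall(r"[abc]|\d", "a1x2c"))  # character class plus alternation: ['a', '1', '2', 'c']
```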
<h2 id="5-3-常用的特殊序列"><a href="#5-3-常用的特殊序列" class="headerlink" title="5.3 常用的特殊序列"></a>5.3 Common Special Sequences</h2><ul>
<li><strong><code>\d</code></strong>: matches any digit character; equivalent to <code>[0-9]</code>.</li>
<li><strong><code>\D</code></strong>: matches any non-digit character; equivalent to <code>[^0-9]</code>.</li>
<li><strong><code>\s</code></strong>: matches any whitespace character (space, tab, etc.).</li>
<li><strong><code>\S</code></strong>: matches any non-whitespace character.</li>
<li><strong><code>\w</code></strong>: matches any word character (letter, digit, underscore); equivalent to <code>[a-zA-Z0-9_]</code>.</li>
<li><strong><code>\W</code></strong>: matches any non-word character; equivalent to <code>[^a-zA-Z0-9_]</code>.</li>
</ul>
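<p>A quick demonstration of these sequences on a sample string:</p>

```python
import re

s = "user_42 paid $3.50 on 2024-09-18"
print(re.findall(r"\d+", s))         # runs of digits: ['42', '3', '50', '2024', '09', '18']
print(re.findall(r"\w+", s))         # word characters: letters, digits and _
print(re.sub(r"\s+", "_", s))        # replace every whitespace run
print(re.findall(r"\D+", "a1b22c"))  # everything that is not a digit: ['a', 'b', 'c']
```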
<h2 id="5-4-实际应用示例"><a href="#5-4-实际应用示例" class="headerlink" title="5.4 实际应用示例"></a>5.4 Practical Examples</h2><ol>
<li><p><strong>Matching an email address:</strong></p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">email_pattern = <span class="string">r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'</span></span><br><span class="line">email = <span class="string">"[email protected]"</span></span><br><span class="line"><span class="keyword">if</span> re.<span class="keyword">match</span>(email_pattern, email):</span><br><span class="line"> <span class="built_in">print</span>(<span class="string">"Valid email"</span>)</span><br></pre></td></tr></table></figure></div>
</li>
<li><p><strong>Replacing the digits in a string:</strong></p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">text = <span class="string">"My phone number is 12345"</span></span><br><span class="line">result = re.sub(<span class="string">r'\d+'</span>, <span class="string">'#####'</span>, text)</span><br><span class="line"><span class="built_in">print</span>(result) <span class="comment"># Output: "My phone number is #####"</span></span><br></pre></td></tr></table></figure></div></li>
</ol>
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 03</title>
<url>/2024/10/17/03Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-无请求头访问"><a href="#1-无请求头访问" class="headerlink" title="1.无请求头访问"></a>1. Requests Without Headers</h1><p>If you send a request straight to the target site without building request headers:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://spiderbuf.cn/playground/s02"</span></span><br><span class="line"></span><br><span class="line">html = requests.get(url=url).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/02course/02.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(html)</span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/02course/data02.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> <span class="comment"># print(td.text)</span></span><br><span class="line"> s = s + <span class="built_in">str</span>(td.text) + <span class="string">' | '</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span 
class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>Output:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line"><!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"></span><br><span class="line"><html></span><br><span class="line"><head><title>403 Forbidden</title></head></span><br><span class="line"><body></span><br><span class="line"><center><h1>403 Forbidden</h1></center></span><br><span class="line"><hr><center>tengine</center></span><br><span class="line"></body></span><br><span class="line"></html></span><br></pre></td></tr></table></figure></div>
<p>The site can easily detect that the request comes from a crawler.</p>
<h1 id="2-添加请求头"><a href="#2-添加请求头" class="headerlink" title="2.添加请求头"></a>2. Adding Request Headers</h1><p>So in practice, an HTTP header block is almost always built before sending the request:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html = requests.get(url=url, headers=headers).text</span><br></pre></td></tr></table></figure></div>
<p>Sometimes you also need to include a <code>Cookie</code>. To avoid being detected as a crawler, you may even rotate the <code>User-Agent</code>, for example using Firefox's string or those of different browser versions; these can be looked up online.</p>
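<p>One common way to vary the <code>User-Agent</code> is to keep a small pool of strings and pick one per request. A minimal sketch; the <code>pick_headers</code> helper and the UA strings other than Chrome's are illustrative, not from the original:</p>

```python
import random

# A small pool of desktop User-Agent strings (the non-Chrome entries are illustrative)
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:131.0) Gecko/20100101 Firefox/131.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def pick_headers(cookie=None):
    """Build request headers with a randomly chosen User-Agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if cookie:
        headers["Cookie"] = cookie
    return headers

print(pick_headers())
# usage would then be: requests.get(url, headers=pick_headers())
```

<p>Calling <code>pick_headers()</code> fresh for each request spreads traffic across the pool without touching the scraping logic.</p>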
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 04</title>
<url>/2024/10/17/04Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-测试案例"><a href="#1-测试案例" class="headerlink" title="1.测试案例"></a>1.测试案例</h1><div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://spiderbuf.cn/playground/s03"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html = requests.get(url=url, headers=headers).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/03course/03.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line"><span class="comment"># print(html)</span></span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/03course/data03.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span 
class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> s = s + <span class="built_in">str</span>(td.text) + <span class="string">' | '</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>If you run the code above as-is, the output looks like this:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">1 | None | CD-82-76-71-65-75 | 堡垒机 | 服务器 | Windows10 | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">2 | None | E6-84-22-55-44-BE | 摄像头 | 摄像头 | HUAWEI | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">3 | None | 37-01-AE-BE-5E-C0 | 文件服务器 | 服务器 | Linux | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">4 | None | 84-A9-97-A8-B2-99 | 交换机 | 交换机 | HUAWEI | None | </span><br><span class="line"> | </span><br><span class="line">5 | None | 8C-94-9D-85-6C-C1 | 堡垒机 | 服务器 | Windows10 | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">6 | None | F2-10-E3-CA-DF-DC | 数据库服务器 | 服务器 | Windows10 | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">7 | None | 46-9A-AF-F4-DC-33 | 数据库服务器 | 服务器 | Windows10 | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">8 | None | B5-66-A6-0C-C6-57 | 堡垒机 | 服务器 | Linux | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">9 | None | 5C-3F-8E-6E-D9-C5 | OA服务器 | 服务器 | Linux | 80,22,443 | </span><br><span class="line"> | </span><br><span class="line">10 | None | EC-C6-79-88-4C-BA | 测试服务器 | 服务器 | Linux | 80,22,443 | </span><br><span class="line"> | </span><br></pre></td></tr></table></figure></div>
<p>Notice that the second column is all <code>None</code>. Why is that?</p>
<h1 id="2-分析页面源码结构"><a href="#2-分析页面源码结构" class="headerlink" title="2.分析页面源码结构"></a>2. Analyzing the Page Source Structure</h1><p>Let's first look at the source structure of the page we want to scrape; the relevant fragment is:</p>
<div class="code-container" data-rel="Html"><figure class="iseeu highlight html"><table><tr><td class="code"><pre><span class="line"><span class="tag"><<span class="name">tr</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>1<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span><span class="tag"><<span class="name">a</span> <span class="attr">href</span>=<span class="string">"#"</span>></span>172.16.80.178<span class="tag"></<span class="name">a</span>></span><span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>CD-82-76-71-65-75<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>堡垒机<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>服务器<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>Windows10<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span>80,22,443<span class="tag"></<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">td</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">font</span> <span class="attr">color</span>=<span class="string">"green"</span>></span>在线<span class="tag"></<span class="name">font</span>></span></span><br><span class="line"> <span class="tag"></<span class="name">td</span>></span></span><br><span class="line"><span class="tag"></<span class="name">tr</span>></span></span><br></pre></td></tr></table></figure></div>
<p>Notice two parts:</p>
<ul>
<li>The second <code>td</code>: <code><td><a href="#">172.16.80.178</a></td></code></li>
<li>The last <code>td</code>: <code><td><font color="green">在线</font></td></code><br>In both parts the <code><td></code> itself holds no content. Only a form like <code><td>text</td></code> means the <code><td></code> holds the content; with <code><td><a>text</a></td></code> it is the <code><a></code> that holds the content, not the <code><td></code>.<br>So how do we extract the content inside the <code><a></code> node? One idea is to make a second pass that extracts the content of the <code><a></code> node under each <code><td></code>, but that is rather cumbersome.</li>
</ul>
<p>There is a simpler way: change the parsing part of the code to this:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/03course/data03.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line">    tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line">    s = <span class="string">''</span></span><br><span class="line">    <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line">        s = s + <span class="built_in">str</span>(td.xpath(<span class="string">'string(.)'</span>)) + <span class="string">'|'</span> <span class="comment"># the modified parsing step</span></span><br><span class="line">    <span class="built_in">print</span>(s)</span><br><span class="line">    <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line">        f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>This retrieves the full text of an element and all of its descendants, not just the text of the direct children. The <code>.</code> denotes the current node, so here it means: extract all of the current node's text content, including text inside child nodes.</p>
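<p>The difference between <code>td.text</code> and <code>string(.)</code> can be seen directly on the row structure above; a self-contained sketch:</p>

```python
from lxml import etree

# The same row structure as the page fragment above
html = ('<table><tr><td><a href="#">172.16.80.178</a></td>'
        '<td><font color="green">在线</font></td></tr></table>')
root = etree.HTML(html)
tds = root.xpath("//td")
for td in tds:
    # td.text holds only the text directly inside <td>; string(.) concatenates
    # the text of the node and all of its descendants
    print(td.text, "|", td.xpath("string(.)"))
```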
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 05</title>
<url>/2024/10/17/05Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-网页分析"><a href="#1-网页分析" class="headerlink" title="1.网页分析"></a>1. Page Analysis</h1><p>Sometimes the pages you want to scrape are paginated, as shown below:</p>
<p><a href="https://imgse.com/i/pAUAa1e"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/17/pAUAa1e.png"
alt="第一页.png"
><figcaption>第一页.png</figcaption></figure></a></p>
<p>Sometimes, even though the site is paginated, the address bar (URL) does not change when you flip pages; the page requests may then be hidden in the <code>network</code> tab of the browser console, as shown below:<br><a href="https://imgse.com/i/pAUAwXd"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/17/pAUAwXd.png"
alt="控制台.png"
><figcaption>控制台.png</figcaption></figure></a></p>
<h1 id="2-知道明确页数的情况"><a href="#2-知道明确页数的情况" class="headerlink" title="2.知道明确页数的情况"></a>2. When the Page Count Is Known</h1><p>Example code:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">base_url = <span class="string">"https://www.spiderbuf.cn/playground/s04?pageno={}"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"><span class="comment"># 实现翻页爬取</span></span><br><span class="line"><span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(<span class="number">1</span>, <span class="number">6</span>):</span><br><span class="line"> url = base_url.<span class="built_in">format</span>(i)</span><br><span class="line"> html = requests.get(url=url, headers=headers).text</span><br><span class="line"></span><br><span class="line"> f = <span class="built_in">open</span>(<span class="string">'./课程/04course/04_%d.html'</span> % i, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"> f.write(html)</span><br><span class="line"> f.close()</span><br><span class="line"></span><br><span class="line"> root = etree.HTML(html)</span><br><span class="line"> trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line"> f = <span class="built_in">open</span>(<span class="string">'./课程/04course/data04_%d.txt'</span> % i, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"> <span class="keyword">for</span> 
tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> s = s + <span class="built_in">str</span>(td.xpath(<span class="string">'string(.)'</span>)) + <span class="string">'|'</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<h1 id="3-不知道明确页数的情况"><a href="#3-不知道明确页数的情况" class="headerlink" title="3.不知道明确页数的情况"></a>3. When the Page Count Is Unknown</h1><p>You can find the total page count in the page source: first use <code>xpath</code> to extract the element containing the "total pages" label, then pull the number out of it with a regular expression:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"><span class="keyword">import</span> re <span class="comment"># regular-expression library</span></span><br><span class="line"></span><br><span class="line">base_url = <span class="string">"https://www.spiderbuf.cn/playground/s04?pageno={}"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line">    <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html_1 = requests.get(url= base_url.<span class="built_in">format</span>(<span class="number">1</span>), headers=headers).text</span><br><span class="line">root = etree.HTML(html_1)</span><br><span class="line">pages_text = root.xpath(<span class="string">'.//li/span/text()'</span>) <span class="comment"># returns a list</span></span><br><span class="line"><span class="built_in">print</span>(pages_text[<span class="number">0</span>])</span><br><span class="line"><span class="comment"># regex parsing: extract the digits</span></span><br><span class="line">pages = re.findall(<span class="string">'[0-9]'</span>, pages_text[<span class="number">0</span>]) <span class="comment"># returns a list</span></span><br><span class="line"><span class="built_in">print</span>(pages)</span><br></pre></td></tr></table></figure></div>
<p>输出结果:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">共5页</span><br><span class="line">['5']</span><br></pre></td></tr></table></figure></div>
<p>不过这个方法在这里并不算实用:既然能直接在网页上看到总页数,再去解析就有些多此一举。提它主要是为了引出正则表达式;关于正则表达式的系统教程,可以结合 AI 来学习。</p>
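<p>顺带补充一点:上面代码里的 <code>[0-9]</code> 只匹配单个数字,当总页数是两位数(比如“共12页”)时会被拆成 <code>['1', '2']</code>;想取完整页数应使用 <code>\d+</code> 匹配连续数字。下面是一个不依赖网络的小示例(页数文本是假设的,仅作演示):</p>

```python
import re

text = '共12页'  # 假设的页数文本,仅作演示

# 原文的写法:逐个匹配单个数字,两位数会被拆开
print(re.findall('[0-9]', text))  # ['1', '2']

# 用 + 匹配一串连续数字,才能得到完整页数
pages = int(re.findall(r'\d+', text)[0])
print(pages)  # 12
```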
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 06</title>
<url>/2024/10/18/06Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-分析网站"><a href="#1-分析网站" class="headerlink" title="1.分析网站"></a>1.分析网站</h1><p>第一步还是要先向网站发送请求,然后获得请求的内容:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/s05"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html = requests.get(url= url, headers=headers).text</span><br><span class="line"><span class="built_in">print</span>(html)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/05course/05.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br></pre></td></tr></table></figure></div>
<p>然后开始解析网页源码,在浏览器中,要爬取的网站页面点击右键,然后选择 <strong>查看网页源代码</strong> :<br><a href="https://imgse.com/i/pAUDMm4"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/18/pAUDMm4.png"
alt="如图操作.png"
><figcaption>如图操作.png</figcaption></figure></a></p>
<p>找到要爬取的图片部分:<br><a href="https://imgse.com/i/pAUD8t1"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/18/pAUD8t1.png"
alt="图片部分.png"
><figcaption>图片部分.png</figcaption></figure></a></p>
<p>点击里面的链接可以进入到图片页面:<br><a href="https://imgse.com/i/pAUD8t1"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/18/pAUD8t1.png"
alt="图片.png"
><figcaption>图片.png</figcaption></figure></a></p>
<p>注意这幅图中左上角的链接<code>https://www.spiderbuf.cn/static/images/beginner/1kwfkui2.jpg</code>。这条链接分成两部分来看:<code>https://www.spiderbuf.cn</code> + <code>/static/images/beginner/1kwfkui2.jpg</code></p>
<p>也就是说,要爬取图片的话,只需要解析出后半部分的链接即可。</p>
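<p>链接的拼接既可以直接用字符串相加(下文代码就是这么做的),也可以用标准库的 <code>urljoin</code>,后者能正确处理斜杠等边界情况。下面是一个不发请求的小示例(路径取自上文的例子):</p>

```python
from urllib.parse import urljoin

base = 'https://www.spiderbuf.cn'
src = '/static/images/beginner/1kwfkui2.jpg'  # 上文示例中的图片路径

# urljoin 按 URL 规则合并两部分,得到完整地址
full_url = urljoin(base, src)
print(full_url)  # https://www.spiderbuf.cn/static/images/beginner/1kwfkui2.jpg
```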
<h1 id="2-解析图片链接"><a href="#2-解析图片链接" class="headerlink" title="2.解析图片链接"></a>2.解析图片链接</h1><p>代码示例:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">root = etree.HTML(html)</span><br><span class="line">imgs = root.xpath(<span class="string">'//img/@src'</span>)</span><br><span class="line"><span class="keyword">for</span> item <span class="keyword">in</span> imgs:</span><br><span class="line"> img_data = requests.get(<span class="string">'https://www.spiderbuf.cn'</span> + item, headers=headers).content <span class="comment"># content 表示以二进制解析内容</span></span><br><span class="line"> img = <span class="built_in">open</span>(<span class="string">"./课程/05course/"</span> + <span class="built_in">str</span>(item).replace(<span class="string">'/'</span>, <span class="string">''</span>), <span class="string">'wb'</span>) <span class="comment"># b 表示以二进制的方式写入</span></span><br><span class="line"> img.write(img_data)</span><br><span class="line"> img.close()</span><br></pre></td></tr></table></figure></div>
<p><strong>为什么这里是写成<code>//img/@src</code>,而不是<code>//img[@src]/text()</code>?</strong></p>
<p>在 XPath 中,<code>//img/@src</code> 和 <code>//img[@src]/text()</code> 的含义和使用场景不同,主要区别在于它们提取的是元素的属性值还是元素的文本内容。</p>
<ol>
<li><code>//img/@src</code>: 提取属性值</li>
</ol>
<ul>
<li><strong>作用</strong> :<code>//img/@src</code> 用于选择所有 <code><img></code> 标签的 <code>src</code> 属性值。</li>
<li><strong>解释</strong> :<ul>
<li><code>//img</code>:选择文档中所有的 <code><img></code> 元素。</li>
<li><code>@src</code>:选择每个 <code><img></code> 元素的 <code>src</code> 属性值。</li>
</ul>
</li>
<li><strong>返回值</strong> :它直接返回属性的值,比如 <code>/static/images/beginner/9cwjdins.jpg</code>,即图片的 URL。</li>
</ul>
<ol start="2">
<li><code>//img[@src]/text()</code>: 提取元素文本</li>
</ol>
<ul>
<li><strong>作用</strong> :<code>//img[@src]/text()</code> 用于选择所有具有 <code>src</code> 属性的 <code><img></code> 元素的 <strong>文本内容</strong> 。</li>
<li><strong>解释</strong> :<ul>
<li><code>//img[@src]</code>:选择所有带有 <code>src</code> 属性的 <code><img></code> 元素。</li>
<li><code>text()</code>:提取该元素的文本内容。</li>
</ul>
</li>
<li><strong>返回值</strong> :它会返回元素的文本内容,而不是属性的值。</li>
</ul>
<p>在这里的<code><img></code> 标签是一个 <strong>自闭合标签</strong> ,通常不会包含任何文本内容。因此,<code>//img[@src]/text()</code> 通常不会返回任何有效的结果,因为 <code><img></code> 标签中<strong>没有文本</strong>。</p>
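<p>这两种写法的差别可以用一段本地构造的 HTML 直接验证(示例片段是假设的,不需要联网):</p>

```python
from lxml import etree

# 本地构造的 HTML 片段,包含两个自闭合的 img 标签
html = '<div><img src="/static/images/a.jpg"><img src="/static/images/b.jpg"></div>'
root = etree.HTML(html)

# 提取属性值:返回每个 img 的 src
print(root.xpath('//img/@src'))          # ['/static/images/a.jpg', '/static/images/b.jpg']

# 提取文本:img 没有文本内容,返回空列表
print(root.xpath('//img[@src]/text()'))  # []
```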
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 07</title>
<url>/2024/10/19/07Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-示例"><a href="#1-示例" class="headerlink" title="1.示例"></a>1.示例</h1><p>示例代码:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line"><span class="comment"># 注意这条链接</span></span><br><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/s06"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html = requests.get(url=url, headers=headers).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/06course/06.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line"><span class="built_in">print</span>(html)</span><br><span class="line"></span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/06course/data06.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span class="line"> 
<span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> s = s + <span class="built_in">str</span>(td.xpath(<span class="string">'string(.)'</span>)) + <span class="string">'|'</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>运行这段代码,会发现解析不出我们想要的内容。<code>url</code>和解析逻辑看起来填写得都正确,那么问题出在哪儿呢?其实就出在<code>url</code>上:这个<code>url</code>只是看起来正确,并不是真正包含数据内容的那个<code>url</code>。</p>
<h1 id="2-分析页面"><a href="#2-分析页面" class="headerlink" title="2.分析页面"></a>2.分析页面</h1><p>这时候我们来分析下网页源代码,发现在目标页面打开源码后,里面并没有我们想要的内容,如下图所示:<br><a href="https://imgse.com/i/pAUHxVs"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/19/pAUHxVs.png"
alt="网页源码.png"
><figcaption>网页源码.png</figcaption></figure></a><br>那么这时候我们就得找到 <strong>真实链接</strong> 。</p>
<p>按<code>F12</code>打开浏览器控制台,选择<code>network</code>:<br><a href="https://imgse.com/i/pAUbP2T"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/19/pAUbP2T.png"
alt="控制台.png"
><figcaption>控制台.png</figcaption></figure></a></p>
<p>先看左边(红色框)部分,选中一个,看一下它的<code>preview</code>,如图所示:<br><a href="https://imgse.com/i/pAUbkMF"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/19/pAUbkMF.png"
alt="preview.png"
><figcaption>preview.png</figcaption></figure></a></p>
<p>可以发现其中没有我们想要的内容,说明原始的<code>url</code>不对,我们得通过这种方式查找出包含内容的<code>url</code>。可以先把后缀是<code>.js</code>和<code>.css</code>的排除掉,它们分别是网页的脚本文件和样式文件。</p>
<p>如下图所示:<br><a href="https://imgse.com/i/pAUbQxO"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/19/pAUbQxO.png"
alt="pAUbQxO.png"
><figcaption>pAUbQxO.png</figcaption></figure></a><br>这个<code>inner</code>才是我们想要的,再看它的<code>headers</code>,就能找到正确的<code>url = https://www.spiderbuf.cn/playground/inner</code></p>
<p>把这个<code>url</code>替换掉原始<code>url</code>就行了:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/inner"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">html = requests.get(url=url, headers=headers).text</span><br><span class="line"></span><br><span class="line"><span class="string">'''剩余部分的代码不变......'''</span></span><br></pre></td></tr></table></figure></div>
<h1 id="3-总结"><a href="#3-总结" class="headerlink" title="3.总结"></a>3.总结</h1><p>如图所示:<br><a href="https://imgse.com/i/pAUb8qH"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/19/pAUb8qH.png"
alt="控制台.png"
><figcaption>控制台.png</figcaption></figure></a></p>
<p>可以发现在<code>html</code>下还嵌套着另一个<code>html</code>,这就是<code>iframe</code>(内嵌框架):相当于在网页里再嵌入一个“小浏览器”去打开另一个网页。遇到这种情况,只有在控制台观察它后台发出的请求,才能找出真实的<code>url</code>。</p>
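<p>知道了真实内容藏在<code>iframe</code>里之后,也可以让代码自己从外层页面解析出<code>iframe</code>指向的地址,再去请求它。下面用本地 HTML 字符串演示这个思路(片段是假设的,链接取自上文示例):</p>

```python
from lxml import etree
from urllib.parse import urljoin

# 假设的外层页面:真实数据在 iframe 指向的页面里
outer_html = '<html><body><iframe src="/playground/inner"></iframe></body></html>'

root = etree.HTML(outer_html)
iframe_src = root.xpath('//iframe/@src')[0]

# 拼出真实 url,之后对它发起 requests.get 即可
real_url = urljoin('https://www.spiderbuf.cn', iframe_src)
print(real_url)  # https://www.spiderbuf.cn/playground/inner
```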
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 08</title>
<url>/2024/10/19/08Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-解析网页"><a href="#1-解析网页" class="headerlink" title="1.解析网页"></a>1.解析网页</h1><p>解析网页的步骤与<strong>Python 爬虫教程 07</strong> 的差不多,因为直接用原始的<code>url</code>是无法爬取到数据的,还是需要通过浏览器的控制台才能找到。</p>
<p>原始的<code>url = https://www.spiderbuf.cn/playground/s07</code>。</p>
<p>代码示例:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/iplist"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">response = requests.get(url=url, headers=headers)</span><br><span class="line">response.encoding = <span class="string">'utf-8'</span> <span class="comment"># 设置编码为 utf-8</span></span><br><span class="line">json_data = response.text</span><br><span class="line"></span><br><span class="line"><span class="comment"># Save the raw HTML to a file</span></span><br><span class="line"><span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">'./课程/07course/07.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>) <span class="keyword">as</span> f:</span><br><span class="line"> f.write(json_data)</span><br><span class="line"></span><br><span class="line">ls = json.loads(json_data)</span><br><span class="line"></span><br><span class="line">txt = <span class="built_in">open</span>(<span class="string">'./课程/07course/data07.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> item <span class="keyword">in</span> ls:</span><br><span class="line"> s = <span class="string">f'<span class="subst">{item[<span class="string">'ip'</span>]}</span>|<span 
class="subst">{item[<span class="string">'mac'</span>]}</span>|<span class="subst">{item[<span class="string">'name'</span>]}</span>|<span class="subst">{item[<span class="string">'type'</span>]}</span>|<span class="subst">{item[<span class="string">'manufacturer'</span>]}</span>|<span class="subst">{item[<span class="string">'ports'</span>]}</span>|<span class="subst">{item[<span class="string">'status'</span>]}</span>\n'</span></span><br><span class="line"> txt.write(s)</span><br><span class="line"></span><br><span class="line">txt.close()</span><br></pre></td></tr></table></figure></div>
<p>可以看出爬取到的数据是<code>json</code>格式,注意这里的<code>response.encoding = 'utf-8' </code>,不设置的话,爬取到的中文内容就是乱码。</p>
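<p><code>json.loads</code>会把 JSON 文本转换成 Python 对象(这里是“字典组成的列表”),之后就能按键名取值、拼成竖线分隔的行。下面用一条假设的示例数据演示这个转换过程(字段名与正文一致,但只取了前几个):</p>

```python
import json

# 假设的接口返回内容,仅作演示
json_data = '[{"ip": "10.0.0.1", "mac": "AA:BB:CC:DD:EE:FF", "name": "router"}]'

ls = json.loads(json_data)  # JSON 文本 -> 列表,元素是字典
for item in ls:
    line = '|'.join(str(item[k]) for k in ('ip', 'mac', 'name'))
    print(line)  # 10.0.0.1|AA:BB:CC:DD:EE:FF|router
```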
<h1 id="2-说明(补充)"><a href="#2-说明(补充)" class="headerlink" title="2.说明(补充)"></a>2.说明(补充)</h1><p><strong>AJAX</strong> (Asynchronous JavaScript and XML)是一种在网页中异步加载数据的技术,它允许在不刷新整个网页的情况下更新页面的部分内容。<strong>动态加载数据</strong> 指的是通过 AJAX 技术从服务器获取数据,并将这些数据动态地插入到当前网页中。这个过程可以是数据的初次加载或用户交互时额外加载的数据,比如滚动页面加载更多内容、点击按钮加载新的数据等。</p>
<h2 id="2-1-AJAX-动态加载数据的关键点"><a href="#2-1-AJAX-动态加载数据的关键点" class="headerlink" title="2.1 AJAX 动态加载数据的关键点"></a>2.1 AJAX 动态加载数据的关键点</h2><ol>
<li><p><strong>异步加载</strong> :页面不需要刷新,也不会阻塞用户的其他操作。在数据请求发送后,用户可以继续操作页面,等待数据加载完成。</p>
</li>
<li><p><strong>服务器请求</strong> :AJAX 使用 JavaScript 发起 HTTP 请求(通常是 <code>GET</code> 或 <code>POST</code> 请求)到服务器,从而获取所需的数据。获取的数据格式通常是 JSON、XML 或 HTML。</p>
</li>
<li><p><strong>动态更新</strong> :当服务器返回数据后,JavaScript 会处理响应数据,并将其插入页面中的特定位置,更新页面的部分内容,而不会刷新整个网页。</p>
</li>
</ol>
<h2 id="2-2-工作流程"><a href="#2-2-工作流程" class="headerlink" title="2.2 工作流程"></a>2.2 工作流程</h2><ol>
<li><strong>用户操作</strong> :用户的某些操作(如滚动、点击按钮)触发 AJAX 请求。</li>
<li><strong>发送请求</strong> :通过 JavaScript 使用 <code>XMLHttpRequest</code> 或 <code>fetch</code> API 发送 HTTP 请求到服务器。</li>
<li><strong>服务器响应</strong> :服务器接收到请求后处理并返回数据(如 JSON、HTML 片段等)。</li>
<li><strong>动态更新页面</strong> :前端 JavaScript 解析服务器返回的数据,并将其动态插入页面中,使页面内容更新而不需要重新加载整个网页。</li>
</ol>
<h2 id="2-3-示例"><a href="#2-3-示例" class="headerlink" title="2.3 示例"></a>2.3 示例</h2><p>假设一个网页有一个按钮,当用户点击按钮时,通过 AJAX 加载一段新的数据并显示在页面中。以下是一个简单的 AJAX 动态加载例子(使用 <code>fetch</code> API):</p>
<div class="code-container" data-rel="Html"><figure class="iseeu highlight html"><table><tr><td class="code"><pre><span class="line"><span class="meta"><!DOCTYPE <span class="keyword">html</span>></span></span><br><span class="line"><span class="tag"><<span class="name">html</span> <span class="attr">lang</span>=<span class="string">"en"</span>></span></span><br><span class="line"><span class="tag"><<span class="name">head</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">charset</span>=<span class="string">"UTF-8"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">meta</span> <span class="attr">name</span>=<span class="string">"viewport"</span> <span class="attr">content</span>=<span class="string">"width=device-width, initial-scale=1.0"</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">title</span>></span>AJAX 动态加载示例<span class="tag"></<span class="name">title</span>></span></span><br><span class="line"><span class="tag"></<span class="name">head</span>></span></span><br><span class="line"><span class="tag"><<span class="name">body</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">h1</span>></span>AJAX 动态加载示例<span class="tag"></<span class="name">h1</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">div</span> <span class="attr">id</span>=<span class="string">"content"</span>></span>这里是初始内容。<span class="tag"></<span class="name">div</span>></span></span><br><span class="line"> <span class="tag"><<span class="name">button</span> <span class="attr">id</span>=<span class="string">"loadDataBtn"</span>></span>加载更多内容<span class="tag"></<span class="name">button</span>></span></span><br><span class="line"></span><br><span class="line"> <span class="tag"><<span class="name">script</span>></span><span class="language-javascript"></span></span><br><span class="line"><span 
class="language-javascript"> <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">'loadDataBtn'</span>).<span class="title function_">addEventListener</span>(<span class="string">'click'</span>, <span class="keyword">function</span>(<span class="params"></span>) {</span></span><br><span class="line"><span class="language-javascript"> <span class="title function_">fetch</span>(<span class="string">'https://api.example.com/get-more-data'</span>) <span class="comment">// 向服务器发送请求,这里的链接就是含有真实数据内容的 url</span></span></span><br><span class="line"><span class="language-javascript"> .<span class="title function_">then</span>(<span class="function"><span class="params">response</span> =></span> response.<span class="title function_">json</span>()) <span class="comment">// 解析返回的 JSON 数据</span></span></span><br><span class="line"><span class="language-javascript"> .<span class="title function_">then</span>(<span class="function"><span class="params">data</span> =></span> {</span></span><br><span class="line"><span class="language-javascript"> <span class="comment">// 将新数据动态插入页面中</span></span></span><br><span class="line"><span class="language-javascript"> <span class="variable language_">document</span>.<span class="title function_">getElementById</span>(<span class="string">'content'</span>).<span class="property">innerHTML</span> += <span class="string">`<p><span class="subst">${data.newContent}</span></p>`</span>;</span></span><br><span class="line"><span class="language-javascript"> })</span></span><br><span class="line"><span class="language-javascript"> .<span class="title function_">catch</span>(<span class="function"><span class="params">error</span> =></span> <span class="variable language_">console</span>.<span class="title function_">error</span>(<span class="string">'加载失败:'</span>, error));</span></span><br><span class="line"><span class="language-javascript"> });</span></span><br><span 
class="line"><span class="language-javascript"> </span><span class="tag"></<span class="name">script</span>></span></span><br><span class="line"><span class="tag"></<span class="name">body</span>></span></span><br><span class="line"><span class="tag"></<span class="name">html</span>></span></span><br></pre></td></tr></table></figure></div>]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Python 爬虫教程 09</title>
<url>/2024/10/21/09Python%E7%88%AC%E8%99%AB/</url>
<content><![CDATA[<h1 id="1-代码示例"><a href="#1-代码示例" class="headerlink" title="1.代码示例"></a>1.代码示例</h1><div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/s08"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line">html = requests.get(url=url, headers=headers).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/08course/08.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/08course/data08.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span class="line"> s = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span 
class="line"> s = s + <span class="built_in">str</span>(td.xpath(<span class="string">'string(.)'</span>)) + <span class="string">'|'</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p>直接运行这段代码是不会解析出任何数据的,同时可以看到抓取到的网页与我们想要的不一样。</p>
<h1 id="2-网页分析"><a href="#2-网页分析" class="headerlink" title="2.网页分析"></a>2.网页分析</h1><p>打开浏览器控制台,选择<code>network</code>:<br><a href="https://imgse.com/i/pAa40dx"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/21/pAa40dx.png"
alt="控制台.png"
><figcaption>控制台.png</figcaption></figure></a><br>可以发现,请求方式是<code>post</code>。所以我们就得在代码中采用<code>post</code>请求方式:</p>
<div class="code-container" data-rel="Python"><figure class="iseeu highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> requests</span><br><span class="line"><span class="keyword">from</span> lxml <span class="keyword">import</span> etree</span><br><span class="line"></span><br><span class="line">url = <span class="string">"https://www.spiderbuf.cn/playground/s08"</span></span><br><span class="line"></span><br><span class="line">headers = {</span><br><span class="line"> <span class="string">"User-Agent"</span>: <span class="string">"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"</span>,</span><br><span class="line">}</span><br><span class="line"><span class="comment"># 传入 post 请求中的数据</span></span><br><span class="line">payload = {<span class="string">'level'</span>: <span class="string">'8'</span>}</span><br><span class="line"><span class="comment"># post 请求</span></span><br><span class="line">html = requests.post(url=url, headers=headers, data=payload).text</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/08course/08.html'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line">f.write(html)</span><br><span class="line">f.close()</span><br><span class="line"></span><br><span class="line">root = etree.HTML(html)</span><br><span class="line">trs = root.xpath(<span class="string">'//tr'</span>)</span><br><span class="line"></span><br><span class="line">f = <span class="built_in">open</span>(<span class="string">'./课程/08course/data08.txt'</span>, <span class="string">'w'</span>, encoding=<span class="string">'utf-8'</span>)</span><br><span class="line"><span class="keyword">for</span> tr <span class="keyword">in</span> trs:</span><br><span class="line"> tds = tr.xpath(<span class="string">'./td'</span>)</span><br><span 
class="line"> s = <span class="string">''</span></span><br><span class="line"> <span class="keyword">for</span> td <span class="keyword">in</span> tds:</span><br><span class="line"> s = s + <span class="built_in">str</span>(td.xpath(<span class="string">'string(.)'</span>)) + <span class="string">'|'</span></span><br><span class="line"> <span class="built_in">print</span>(s)</span><br><span class="line"> <span class="keyword">if</span> s!= <span class="string">''</span>:</span><br><span class="line"> f.write(s + <span class="string">'\n'</span>)</span><br></pre></td></tr></table></figure></div>
<p><code>payload</code>是在这里:<br><a href="https://imgse.com/i/pAa4oY8"><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://s21.ax1x.com/2024/10/21/pAa4oY8.png"
alt="payload.png"
><figcaption>payload.png</figcaption></figure></a></p>
<p>可以参考这里:<a class="link" href="https://requests.readthedocs.io/projects/cn/zh-cn/latest/user/quickstart.html#post" >更加复杂的 POST 请求<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a></p>
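<p>关于<code>data=</code>参数可以再补充一点:传入字典时 requests 会把它编码成表单(application/x-www-form-urlencoded);如果接口要求的是 JSON 请求体,则应改用<code>json=</code>参数。借助 requests 的 <code>Request.prepare()</code> 可以在本地观察两者的区别,不用真的发请求:</p>

```python
import requests

payload = {'level': '8'}
url = 'https://www.spiderbuf.cn/playground/s08'

# data= :表单编码
form_req = requests.Request('POST', url, data=payload).prepare()
print(form_req.body)                     # level=8
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded

# json= :JSON 编码
json_req = requests.Request('POST', url, json=payload).prepare()
print(json_req.body)                     # b'{"level": "8"}'
print(json_req.headers['Content-Type'])  # application/json
```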
]]></content>
<categories>
<category>Python爬虫</category>
</categories>
<tags>
<tag>python</tag>
</tags>
</entry>
<entry>
<title>Conda教程</title>
<url>/2024/07/19/Conda%E6%95%99%E7%A8%8B/</url>
<content><![CDATA[<h1 id="一、conda-指令"><a href="#一、conda-指令" class="headerlink" title="一、conda 指令"></a>一、conda 指令</h1><h2 id="1-1-查看配置"><a href="#1-1-查看配置" class="headerlink" title="1.1 查看配置"></a>1.1 查看配置</h2><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line">conda config --show channels # 查看镜像源</span><br><span class="line">conda config --show-sources # 查看配置文件内容</span><br></pre></td></tr></table></figure></div>
<h2 id="1-2-设置代理端口"><a href="#1-2-设置代理端口" class="headerlink" title="1.2 设置代理端口"></a>1.2 设置代理端口</h2><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">添加代理地址端口</span></span><br><span class="line">conda config --set proxy_servers.http http://127.0.0.1:10809</span><br><span class="line">conda config --set proxy_servers.https http://127.0.0.1:10809</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">移除代理</span></span><br><span class="line">conda config --remove-key proxy_servers</span><br></pre></td></tr></table></figure></div>
<h2 id="1-3-conda环境"><a href="#1-3-conda环境" class="headerlink" title="1.3 conda环境"></a>1.3 conda环境</h2><div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># 创建(带名字)</span></span><br><span class="line">conda create -n <conda_name> python=<版本号></span><br><span class="line"><span class="comment"># 在指定文件路径下创建conda环境</span></span><br><span class="line">conda create --<span class="built_in">yes</span> --prefix /home/sky/桌面/pointTest/.conda python=3.11</span><br><span class="line"></span><br><span class="line"><span class="comment"># 激活conda环境</span></span><br><span class="line">conda activate <conda_name></span><br><span class="line"></span><br><span class="line"><span class="comment"># 回到base环境</span></span><br><span class="line">conda deactivate</span><br><span class="line"></span><br><span class="line"><span class="comment"># 查看有哪些conda环境</span></span><br><span class="line">conda info --envs</span><br><span class="line"></span><br><span class="line"><span class="comment"># 删除全部环境</span></span><br><span class="line">conda remove -n env_name --all</span><br><span class="line"><span class="comment"># 删除指定环境</span></span><br><span class="line">conda <span class="built_in">env</span> remove -n env_name</span><br><span class="line"></span><br><span class="line"><span class="comment"># 重命名环境(将 --clone 后面的环境重命名成 -n 后面的名字)</span></span><br><span class="line">conda create -n torch --<span class="built_in">clone</span> py3 <span class="comment"># 将 py3 重命名为 torch</span></span><br></pre></td></tr></table></figure></div>
<p>注意:<code>--prefix/-p</code>不能与<code>--name/-n</code>同时使用!</p>
<h2 id="1-4-下载库"><a href="#1-4-下载库" class="headerlink" title="1.4 下载库"></a>1.4 下载库</h2><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">从 conda-forge 渠道中提供的包安装</span></span><br><span class="line">conda install -c conda-forge <package_name></span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">查询 conda-forge 中的包</span></span><br><span class="line">conda search -c conda-forge <package_name></span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">安装指定版本的</span></span><br><span class="line">conda install -c conda-forge <package_name>=<版本号></span><br></pre></td></tr></table></figure></div>
<h3 id="1-4-1-安装GDAL库"><a href="#1-4-1-安装GDAL库" class="headerlink" title="1.4.1 安装GDAL库"></a>1.4.1 安装GDAL库</h3><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash"> 安装 gdal 的依赖库 geos 和 proj</span></span><br><span class="line">conda install geos proj</span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">安装指定版本GDAL</span></span><br><span class="line">conda install -c conda-forge gdal=3.2.1</span><br></pre></td></tr></table></figure></div>
<h2 id="1-5-迁移-conda-环境"><a href="#1-5-迁移-conda-环境" class="headerlink" title="1.5 迁移 conda 环境"></a>1.5 迁移 conda 环境</h2><p>将要迁移的环境打包</p>
<div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line">conda pack -n 虚拟环境名称 -o environment.tar.gz</span><br></pre></td></tr></table></figure></div>
<p>如果报错:<code>No command 'conda pack'</code></p>
<div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">尝试使用</span></span><br><span class="line">conda install -c conda-forge conda-pack</span><br></pre></td></tr></table></figure></div>
<p>复制压缩文件到新的电脑环境。进到conda的安装目录:/anaconda(或者miniconda)/envs/</p>
<div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line"><span class="meta prompt_"># </span><span class="language-bash">对于 ubuntu 可以通过 whereis conda 查看 conda的安装路径</span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash"><span class="built_in">cd</span> 到 conda 的安装路径</span></span><br><span class="line">mkdir environment</span><br><span class="line"><span class="meta prompt_"></span></span><br><span class="line"><span class="meta prompt_"># </span><span class="language-bash">解压conda环境:</span></span><br><span class="line">tar -xzvf environment.tar.gz -C environment</span><br></pre></td></tr></table></figure></div>
<h1 id="二、pip-指令"><a href="#二、pip-指令" class="headerlink" title="二、pip 指令"></a>二、pip 指令</h1><h2 id="2-1-使用临时镜像源下载库"><a href="#2-1-使用临时镜像源下载库" class="headerlink" title="2.1 使用临时镜像源下载库"></a>2.1 使用临时镜像源下载库</h2><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line">pip install <package_name> -i <镜像源url></span><br></pre></td></tr></table></figure></div>
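<p>例如,用清华镜像源临时安装 numpy(包名仅作示例);如果不想每次都加 <code>-i</code> 参数,也可以把镜像源写入 pip 配置设为默认:</p>

```shell
# 临时使用镜像源安装
pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple
# 将镜像源设为默认(写入 pip 配置文件),之后无需再加 -i 参数
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
```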
<h1 id="三、镜像源"><a href="#三、镜像源" class="headerlink" title="三、镜像源"></a>三、镜像源</h1><div class="code-container" data-rel="Shell"><figure class="iseeu highlight shell"><table><tr><td class="code"><pre><span class="line">https://pypi.tuna.tsinghua.edu.cn/simple # 清华</span><br><span class="line">https://pypi.mirrors.ustc.edu.cn/simple # 中科大</span><br><span class="line">http://mirrors.aliyun.com/pypi/simple/ # 阿里云</span><br><span class="line">http://pypi.douban.com/simple/ # 豆瓣</span><br></pre></td></tr></table></figure></div>
]]></content>
<tags>
<tag>conda</tag>
</tags>
</entry>
<entry>
<title>解决 Debian 下中文乱码问题</title>
<url>/2025/01/03/Debian%E4%B8%AD%E6%96%87%E4%B9%B1%E7%A0%81/</url>
<content><![CDATA[<h1 id="1-安装语言包"><a href="#1-安装语言包" class="headerlink" title="1.安装语言包"></a>1.安装语言包</h1><p>首先需要安装<code>locales</code>这个软件包:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">apt install locales</span><br></pre></td></tr></table></figure></div>
<p>然后执行下面命令并配置语言环境:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">dpkg-reconfigure locales</span><br></pre></td></tr></table></figure></div>
<h1 id="2-修改配置"><a href="#2-修改配置" class="headerlink" title="2.修改配置"></a>2.修改配置</h1><p>修改<code>~/.bashrc</code>文件:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">nano ~/.bashrc</span><br></pre></td></tr></table></figure></div>
<p>添加:</p>
<div class="code-container" data-rel="Txt"><figure class="iseeu highlight txt"><table><tr><td class="code"><pre><span class="line">export LANG=zh_CN.UTF-8</span><br><span class="line">export LANGUAGE=zh_CN:zh</span><br><span class="line">export LC_ALL=zh_CN.UTF-8</span><br></pre></td></tr></table></figure></div>
<p>然后执行:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">source</span> ~/.bashrc</span><br></pre></td></tr></table></figure></div>
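<p>配置生效后,可以用 <code>locale</code> 命令确认当前的语言环境变量:</p>

```shell
locale
# 正常情况下应输出类似:
# LANG=zh_CN.UTF-8
# LC_ALL=zh_CN.UTF-8
```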
<h1 id="3-参考链接"><a href="#3-参考链接" class="headerlink" title="3.参考链接"></a>3.参考链接</h1><ul>
<li><a class="link" href="https://developer.aliyun.com/article/1143167" >Debian配置系统中文语言及环境<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a></li>
</ul>
]]></content>
<tags>
<tag>linux</tag>
<tag>docker</tag>
</tags>
</entry>
<entry>
<title>解决 Debian 系统下,软件包更新和软件下载速度慢的问题</title>
<url>/2024/12/03/Debian%E9%95%9C%E5%83%8F%E6%BA%90/</url>
<content><![CDATA[<h1 id="1-问题复现"><a href="#1-问题复现" class="headerlink" title="1.问题复现"></a>1.问题复现</h1><p>在安装好 Debian 系统后,一般第一步都是执行<code>sudo apt update</code>,此时用的是自带的官方镜像源,速度很慢,所以我们会选择切换为国内的镜像源。下文以清华镜像源为例,并假设已经配置好了清华镜像源。</p>
<p>在执行更新时:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">sky@DESKTOP-IU948AR:~$ <span class="built_in">sudo</span> apt update</span><br><span class="line">[<span class="built_in">sudo</span>] password <span class="keyword">for</span> sky: </span><br><span class="line">Err:1 https://security.debian.org/debian-security bookworm-security InRelease</span><br><span class="line"> Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 151.101.2.132 443]</span><br><span class="line">Err:2 https://mirrors.tuna.tsinghua.edu.cn/debian bookworm InRelease</span><br><span class="line"> Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443]</span><br><span class="line">Err:3 https://mirrors.tuna.tsinghua.edu.cn/debian bookworm-updates InRelease</span><br><span class="line"> Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443]</span><br><span class="line">Err:4 https://mirrors.tuna.tsinghua.edu.cn/debian bookworm-backports InRelease</span><br><span class="line"> Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443]</span><br><span class="line">Reading package lists... Done</span><br><span class="line">Building dependency tree... Done</span><br><span class="line">Reading state information... 
Done</span><br><span class="line">All packages are up to <span class="built_in">date</span>.</span><br><span class="line">W: https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm/InRelease: No system certificates available. Try installing ca-certificates. </span><br><span class="line">W: https://security.debian.org/debian-security/dists/bookworm-security/InRelease: No system certificates available. Try installing ca-certificates.</span><br><span class="line">W: https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm-updates/InRelease: No system certificates available. Try installing ca-certificates.</span><br><span class="line">W: https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm-backports/InRelease: No system certificates available. Try installing ca-certificates.</span><br><span class="line">W: Failed to fetch https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm/InRelease Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443]</span><br><span class="line">W: Failed to fetch https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm-updates/InRelease Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443] </span><br><span class="line">W: Failed to fetch https://mirrors.tuna.tsinghua.edu.cn/debian/dists/bookworm-backports/InRelease Certificate verification failed: The certificate is NOT trusted. The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 101.6.15.130 443] </span><br><span class="line">W: Failed to fetch https://security.debian.org/debian-security/dists/bookworm-security/InRelease Certificate verification failed: The certificate is NOT trusted. 
The certificate issuer is unknown. Could not handshake: Error <span class="keyword">in</span> the certificate verification. [IP: 151.101.2.132 443] </span><br><span class="line">W: Some index files failed to download. They have been ignored, or old ones used instead.</span><br><span class="line"></span><br><span class="line">sky@DESKTOP-IU948AR:~$ <span class="built_in">sudo</span> apt install ca-certificates</span><br><span class="line">Reading package lists... Done</span><br><span class="line">Building dependency tree... Done</span><br><span class="line">Reading state information... Done</span><br><span class="line">Package ca-certificates is not available, but is referred to by another package.</span><br><span class="line">This may mean that the package is missing, has been obsoleted, or</span><br><span class="line">is only available from another <span class="built_in">source</span></span><br><span class="line"></span><br><span class="line">E: Package <span class="string">'ca-certificates'</span> has no installation candidate</span><br></pre></td></tr></table></figure></div>
<p>可以看出更新失败,需要我们安装<code>ca-certificates</code>,但是又无法成功安装<code>ca-certificates</code>。</p>
<h1 id="2-解决办法"><a href="#2-解决办法" class="headerlink" title="2.解决办法"></a>2.解决办法</h1><p>这个问题的核心是系统无法验证 HTTPS 源的证书,导致无法更新包管理器的索引或安装新软件包。这可能是因为系统缺少必要的 CA 证书或某些配置问题。</p>
<h2 id="2-1-方式一:手动安装ca-certificates包"><a href="#2-1-方式一:手动安装ca-certificates包" class="headerlink" title="2.1 方式一:手动安装ca-certificates包"></a>2.1 方式一:手动安装<code>ca-certificates</code>包</h2><h3 id="2-1-1-下载-DEB-包"><a href="#2-1-1-下载-DEB-包" class="headerlink" title="2.1.1 下载 DEB 包"></a>2.1.1 下载 DEB 包</h3><ol>
<li>在另一台可以正常联网的机器上,访问 <a class="link" href="https://packages.debian.org/" >Debian Packages<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a>。</li>
<li>下载与系统版本匹配的 <code>ca-certificates</code> 包。例如,适用于 Bookworm 的链接可能是:<div class="code-container" data-rel="Plaintext"><figure class="iseeu highlight plaintext"><table><tr><td class="code"><pre><span class="line">https://packages.debian.org/bookworm/all/ca-certificates/download</span><br></pre></td></tr></table></figure></div></li>
<li>使用 USB 或其他方法将下载的 <code>.deb</code> 文件复制到你的系统。</li>
</ol>
<h3 id="2-1-2-手动安装包"><a href="#2-1-2-手动安装包" class="headerlink" title="2.1.2 手动安装包"></a>2.1.2 手动安装包</h3><p>在目标机器上运行以下命令:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">sudo</span> dpkg -i /path/to/ca-certificates*.deb</span><br><span class="line"><span class="built_in">sudo</span> apt update</span><br></pre></td></tr></table></figure></div>
<hr>
<h3 id="2-2-方式二:临时禁用-HTTPS-验证(推荐)"><a href="#2-2-方式二:临时禁用-HTTPS-验证(推荐)" class="headerlink" title="2.2 方式二:临时禁用 HTTPS 验证(推荐)"></a>2.2 方式二:临时禁用 HTTPS 验证(推荐)</h3><p>可以尝试临时禁用 HTTPS 验证以更新软件源和安装 <code>ca-certificates</code>:</p>
<p>编辑 <code>/etc/apt/apt.conf.d/99disable-https-check</code> 文件:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">sudo</span> nano /etc/apt/apt.conf.d/99disable-https-check</span><br></pre></td></tr></table></figure></div>
<p>添加以下内容:</p>
<div class="code-container" data-rel="Plaintext"><figure class="iseeu highlight plaintext"><table><tr><td class="code"><pre><span class="line">Acquire::https::Verify-Peer "false";</span><br><span class="line">Acquire::https::Verify-Host "false";</span><br></pre></td></tr></table></figure></div>
<p>保存后运行以下命令:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">sudo</span> apt update</span><br><span class="line"><span class="built_in">sudo</span> apt install ca-certificates</span><br></pre></td></tr></table></figure></div>
<p>安装完 <code>ca-certificates</code> 后,删除或注释掉该配置以恢复安全性。</p>
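<p>恢复安全性的操作很简单,直接删除该临时配置文件并刷新软件源索引即可(文件路径与上文一致):</p>

```shell
sudo rm /etc/apt/apt.conf.d/99disable-https-check
sudo apt update
```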
<p>也可以直接在命令行临时禁用:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">apt -o Acquire::https::Verify-Peer=<span class="literal">false</span> -o Acquire::https::Verify-Host=<span class="literal">false</span> update</span><br></pre></td></tr></table></figure></div>]]></content>
<tags>
<tag>linux</tag>
</tags>
</entry>
<entry>
<title>通过Docker搭建Hexo博客</title>
<url>/2024/07/24/Docker-Hexo/</url>
<content><![CDATA[<h1 id="1-创建项目文件夹"><a href="#1-创建项目文件夹" class="headerlink" title="1.创建项目文件夹"></a>1.创建项目文件夹</h1><p>创建博客的工作目录:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">mkdir</span> -p ~/Hexo && <span class="built_in">cd</span> ~/Hexo</span><br><span class="line"></span><br><span class="line"><span class="comment"># 创建存放 Dockerfile 的文件夹</span></span><br><span class="line"><span class="built_in">mkdir</span> hexo_docker && <span class="built_in">cd</span> hexo_docker</span><br><span class="line"></span><br><span class="line"><span class="comment"># 创建 Dockerfile</span></span><br><span class="line"><span class="built_in">touch</span> Dockerfile</span><br></pre></td></tr></table></figure></div>
<h1 id="2-配置-Dockerfile"><a href="#2-配置-Dockerfile" class="headerlink" title="2.配置 Dockerfile"></a>2.配置 Dockerfile</h1><div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># 基础镜像</span></span><br><span class="line">FROM node:latest</span><br><span class="line"></span><br><span class="line"><span class="comment"># 维护者信息</span></span><br><span class="line">MAINTAINER yourname<[email protected]></span><br><span class="line"></span><br><span class="line"><span class="comment"># 工作目录</span></span><br><span class="line">WORKDIR /hexo</span><br><span class="line"></span><br><span class="line"><span class="comment"># 设置 npm 使用淘宝镜像源</span></span><br><span class="line">RUN npm config <span class="built_in">set</span> registry https://registry.npmmirror.com</span><br><span class="line"></span><br><span class="line"><span class="comment"># 安装 Hexo</span></span><br><span class="line">RUN npm install hexo-cli -g</span><br><span class="line">RUN hexo init blog</span><br><span class="line"><span class="comment"># 每条 RUN 都在独立的 shell 中执行,cd 不会延续到下一条命令,需合并执行</span></span><br><span class="line">RUN <span class="built_in">cd</span> blog && npm install</span><br><span class="line"></span><br><span class="line"><span class="comment"># 设置git</span></span><br><span class="line">RUN git config --global user.name <span class="string">"loskyertt"</span></span><br><span class="line">RUN git config --global user.email <span class="string">"[email protected]"</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 映射端口</span></span><br><span class="line">EXPOSE 4000</span><br><span class="line"></span><br><span class="line"><span class="comment"># 运行命令</span></span><br><span class="line">CMD [<span class="string">"/bin/bash"</span>]</span><br></pre></td></tr></table></figure></div>
<p>然后构建镜像(和<code>Dockerfile</code>同目录下):</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker build -t hexo:latest .</span><br></pre></td></tr></table></figure></div>
<p>Docker 的 BuildKit 可以加速构建过程并启用更多的优化选项,在构建镜像时启用 BuildKit:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">DOCKER_BUILDKIT=1 docker build -t hexo:latest .</span><br></pre></td></tr></table></figure></div>
<h1 id="3-更新镜像"><a href="#3-更新镜像" class="headerlink" title="3.更新镜像"></a>3.更新镜像</h1><p>要在已经构建的基础镜像更新镜像,可以使用以下两种方法:</p>
<h2 id="3-1-方法一:更新现有-Dockerfile-并重建镜像"><a href="#3-1-方法一:更新现有-Dockerfile-并重建镜像" class="headerlink" title="3.1 方法一:更新现有 Dockerfile 并重建镜像"></a>3.1 方法一:更新现有 Dockerfile 并重建镜像</h2><p>更新<code>Dockerfile</code>,然后重新构建镜像。</p>
<h2 id="3-2-方法二:从已有镜像启动容器并手动添加-Hexo"><a href="#3-2-方法二:从已有镜像启动容器并手动添加-Hexo" class="headerlink" title="3.2 方法二:从已有镜像启动容器并手动添加 Hexo"></a>3.2 方法二:从已有镜像启动容器并手动添加 Hexo</h2><ol>
<li><strong>启动一个容器</strong></li>
</ol>
<p>从已有的基础镜像启动一个交互式容器:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run -it hexo:latest /bin/bash</span><br></pre></td></tr></table></figure></div>
<ol start="2">
<li><p><strong>在容器内进行操作</strong></p>
</li>
<li><p><strong>提交容器为新镜像</strong></p>
</li>
</ol>
<p>退出容器(<code>exit</code> 命令),然后提交容器为新镜像:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker commit <container_id> hexo:with-hexo</span><br></pre></td></tr></table></figure></div>
<p><code>with-hexo</code>是镜像标签,可以自定义。这里的 <code><container_id></code> 是刚刚启动的容器的 ID。可以使用 <code>docker ps -a</code> 命令找到它。<br><strong>注意:</strong> 如果要把挂载的宿主机文件提交到镜像里,需要提前做一个<code>cp</code>操作,即通过<code>docker cp</code>把宿主机内的文件复制到容器内,然后再提交。</p>
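<p>例如,提交前先把宿主机挂载目录里的博客文件复制进容器(路径仅为示例,<code><container_id></code> 需替换为实际值):</p>

```shell
# 将宿主机目录中的内容复制到容器内的 /hexo 目录
docker cp ~/Hexo/hexo/. <container_id>:/hexo
# 再提交为新镜像
docker commit <container_id> hexo:with-hexo
```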
<h1 id="4-推送与备份镜像"><a href="#4-推送与备份镜像" class="headerlink" title="4.推送与备份镜像"></a>4.推送与备份镜像</h1><h2 id="4-1-推送镜像到-Docker-Hub"><a href="#4-1-推送镜像到-Docker-Hub" class="headerlink" title="4.1 推送镜像到 Docker Hub"></a>4.1 推送镜像到 Docker Hub</h2><ol>
<li><strong>登录 Docker Hub</strong></li>
</ol>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker login</span><br></pre></td></tr></table></figure></div>
<ol start="2">
<li><strong>标记镜像</strong></li>
</ol>
<p>将镜像标记为你的 Docker Hub 存储库。例如,将 <code>hexo:latest</code> 标记为 <code>yourusername/hexo:latest</code>:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker tag hexo:latest yourusername/hexo:latest</span><br></pre></td></tr></table></figure></div>
<ol start="3">
<li><strong>推送镜像</strong></li>
</ol>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker push yourusername/hexo:latest</span><br></pre></td></tr></table></figure></div>
<h2 id="4-1-备份镜像到本地"><a href="#4-1-备份镜像到本地" class="headerlink" title="4.2 备份镜像到本地"></a>4.2 备份镜像到本地</h2><ol>
<li><strong>保存镜像</strong></li>
</ol>
<p>使用 <code>docker save</code> 命令将镜像保存到一个 tar 文件。例如,将 <code>hexo:latest</code> 保存到 <code>hexo_latest.tar</code>:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker save -o hexo_latest.tar hexo:latest</span><br></pre></td></tr></table></figure></div>
<ol start="2">
<li><strong>加载镜像</strong></li>
</ol>
<p>以后你可以使用 <code>docker load</code> 命令从 tar 文件加载镜像。例如,从 <code>hexo_latest.tar</code> 加载镜像:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker load -i hexo_latest.tar</span><br></pre></td></tr></table></figure></div>
<h1 id="5-挂载容器文件的注意项"><a href="#5-挂载容器文件的注意项" class="headerlink" title="5.挂载容器文件的注意项"></a>5.挂载容器文件的注意项</h1><p>在构建镜像时,已经生成了<code>/hexo/blog</code>目录,如果直接执行:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run -it --name=<span class="string">"my-blog"</span> -p 4000:4000 -v ~/Hexo/hexo:/hexo hexo:latest /bin/bash</span><br></pre></td></tr></table></figure></div>
<p>这会在主机的<code>~/Hexo/hexo</code>目录和容器内的<code>/hexo</code>目录之间创建一个卷映射。这意味着容器内的<code>/hexo</code>目录会被主机上的<code>~/Hexo/hexo</code>目录的内容覆盖。如果主机上的<code>~/Hexo/hexo</code>目录是空的或不存在,那么容器内的<code>/hexo</code>目录也会是空的。</p>
<h3 id="5-6-1-解决方法"><a href="#5-6-1-解决方法" class="headerlink" title="5.1 解决方法"></a>5.1 解决方法</h3><h4 id="方式一、确保主机目录包含内容"><a href="#方式一、确保主机目录包含内容" class="headerlink" title="方式一、确保主机目录包含内容"></a>方式一、确保主机目录包含内容</h4><p>在主机上确保 <code>~/Hexo/hexo</code> 目录存在并包含<code>Hexo</code>项目文件。如果目录不存在或为空,可以先在主机上初始化<code>hexo</code>项目:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="built_in">mkdir</span> -p ~/Hexo/hexo</span><br><span class="line"><span class="built_in">cd</span> ~/Hexo/hexo</span><br><span class="line">npm install hexo-cli -g</span><br><span class="line">hexo init .</span><br><span class="line">npm install</span><br></pre></td></tr></table></figure></div>
<h4 id="方式二、在容器内初始化-Hexo-项目"><a href="#方式二、在容器内初始化-Hexo-项目" class="headerlink" title="方式二、在容器内初始化 Hexo 项目"></a>方式二、在容器内初始化 Hexo 项目</h4><p>如果希望在容器内初始化 Hexo 项目而不是依赖主机目录,可以先启动一个临时容器,初始化项目,然后将其复制到主机目录:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run -it hexo:latest /bin/bash</span><br></pre></td></tr></table></figure></div>
<p>在容器内执行以下命令:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">npm install hexo-cli -g</span><br><span class="line">hexo init /hexo</span><br><span class="line"><span class="built_in">cd</span> /hexo</span><br><span class="line">npm install</span><br></pre></td></tr></table></figure></div>
<p>退出容器(<code>exit</code>),然后将容器内的 <code>/hexo</code> 目录复制到主机:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker <span class="built_in">cp</span> <container_id>:/hexo ~/Hexo</span><br></pre></td></tr></table></figure></div>
<p>这里的 <code><container_id></code> 是你刚刚启动的容器的 ID。可以使用 <code>docker ps -a</code> 命令找到它。</p>
<h1 id="6-最终步骤"><a href="#6-最终步骤" class="headerlink" title="6.最终步骤"></a>6.最终步骤</h1><p>确保主机上的 <code>~/Hexo/hexo</code> 目录存在并包含 Hexo 项目文件,然后再次运行 Docker 容器:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run -it --name=<span class="string">"my-blog"</span> -p 4000:4000 -v ~/Hexo/hexo:/hexo hexo:latest /bin/bash</span><br></pre></td></tr></table></figure></div>
<p>这时,容器内的 <code>/hexo</code> 目录将包含主机上 <code>~/Hexo/hexo</code> 目录的内容。</p>
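<p>容器启动后,可以在容器内启动 Hexo 本地服务进行预览(假设博客位于 <code>/hexo/blog</code>):</p>

```shell
cd /hexo/blog
hexo clean && hexo generate
# 启动本地服务,默认监听 4000 端口
hexo server
```

<p>之后在宿主机浏览器访问 <code>http://localhost:4000</code> 即可;如果无法访问,可以尝试 <code>hexo server -i 0.0.0.0</code> 显式监听所有地址。</p>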
]]></content>
<tags>
<tag>docker</tag>
</tags>
</entry>
<entry>
<title>Docker 配置 MySQL 教程</title>
<url>/2024/09/05/Docker-MySQL/</url>
<content><![CDATA[<h1 id="1-Docker-MySQL-配置"><a href="#1-Docker-MySQL-配置" class="headerlink" title="1.Docker-MySQL 配置"></a>1.Docker-MySQL 配置</h1><h2 id="1-1-拉取镜像与创建容器"><a href="#1-1-拉取镜像与创建容器" class="headerlink" title="1.1 拉取镜像与创建容器"></a>1.1 拉取镜像与创建容器</h2><ol>
<li><p><strong>拉取 MySQL 镜像:</strong><br>通过运行 <code>docker pull mysql[:版本号]</code>,版本号可选,默认是latest。</p>
</li>
<li><p><strong>创建并运行 MySQL 容器(推荐):</strong><br>使用以下命令来运行 MySQL 容器。这里假设要在主机上绑定 3306 端口,并为 root 用户设置密码:</p>
</li>
</ol>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run \</span><br><span class="line">--name some-mysql \</span><br><span class="line">-e MYSQL_ROOT_PASSWORD=1234 \</span><br><span class="line">-p 3306:3306 \</span><br><span class="line">-v ~/docker_volume/mysql/data:/var/lib/mysql \</span><br><span class="line">-v ~/docker_volume/mysql/init:/docker-entrypoint-initdb.d \</span><br><span class="line">-v ~/docker_volume/mysql/conf:/etc/mysql/conf.d \</span><br><span class="line">-d mysql:<版本号></span><br></pre></td></tr></table></figure></div>
<p><strong>以上命令的说明:</strong></p>
<ul>
<li><code>--name some-mysql</code>:给容器指定一个名称,比如 <code>some-mysql</code>。</li>
<li><code>-e MYSQL_ROOT_PASSWORD=1234</code>:设置root用户的密码为 <code>1234</code>。</li>
<li><code>mysql:<版本号></code>:指定使用的 MySQL 镜像版本,如 <code>mysql:latest</code> 为最新版本。</li>
<li><code>-p</code>:设置端口,<code>3306:3306</code>中,前者是宿主机的端口,可以自定义,后者是映射到容器的端口。默认是<code>TCP</code>协议。</li>
<li><code>-v</code>:挂载容器里的文件(<code>~/docker_volume/mysql/data</code>为本地目录)。<strong>注意:</strong> 要先创建好对应的文件路径!!同时本地文件夹会把容器内的文件夹完全覆盖!</li>
</ul>
<h2 id="1-2-连接到容器"><a href="#1-2-连接到容器" class="headerlink" title="1.2 连接到容器"></a>1.2 连接到容器</h2><ol>
<li><strong>连接到 MySQL 容器:</strong></li>
</ol>
<p>进入到容器内部然后进行连接:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker <span class="built_in">exec</span> -it some-mysql bash</span><br><span class="line"></span><br><span class="line">mysql -uroot -p</span><br></pre></td></tr></table></figure></div>
<p>直接连接到容器:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker <span class="built_in">exec</span> -it some-mysql mysql -uroot -p</span><br></pre></td></tr></table></figure></div>
<p>这会启动一个终端会话,并要求输入上面设置的root用户的密码 (<code>1234</code>)。</p>
<p>以上命令的说明:</p>
<ol>
<li><strong>docker exec:</strong></li>
</ol>
<ul>
<li><strong>含义:</strong> 在一个已经运行的容器内执行命令。</li>
<li><strong>用途:</strong> 允许你在运行的容器中执行新的命令。</li>
</ul>
<ol start="2">
<li><strong>-it:</strong></li>
</ol>
<ul>
<li><strong><code>-i</code> (interactive):</strong> 保持标准输入(stdin)打开,使得你可以与容器中的进程进行交互。</li>
<li><strong><code>-t</code> (tty):</strong> 分配一个伪终端,提供一个终端会话环境。这两个选项一起使用可以进入容器并交互。</li>
</ul>
<ol start="3">
<li><strong>some-mysql:</strong></li>
</ol>
<ul>
<li><strong>含义:</strong> 容器的名称或 ID。</li>
</ul>
<ol start="4">
<li><strong>mysql:</strong></li>
</ol>
<ul>
<li><strong>含义:</strong> 这是在容器内执行的命令。在这个例子中,是启动 MySQL 客户端。</li>
<li><strong>用途:</strong> 连接到 MySQL 数据库。</li>
</ul>
<ol start="5">
<li><strong>-uroot:</strong></li>
</ol>
<ul>
<li><strong><code>-u</code>:</strong> 指定 MySQL 客户端的用户名。</li>
<li><strong>root:</strong> MySQL 数据库的用户名。在这个例子中,使用的是 <code>root</code> 用户。</li>
</ul>
<ol start="6">
<li><strong><code>-p</code>:</strong></li>
</ol>
<ul>
<li><strong>含义:</strong> 提示输入 MySQL 用户的密码。</li>
<li><strong>用途:</strong> 在执行命令后,你会被提示输入 <code>root</code> 用户的密码。这个密码是在运行容器时通过 <code>MYSQL_ROOT_PASSWORD</code> 环境变量设置的。</li>
</ul>
<p>以上命令的 <code>-p 3306:3306</code> 会将主机的3306端口映射到容器的3306端口,这样可以从主机上的MySQL客户端连接到容器中的MySQL服务。</p>
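<p>端口映射生效后,也可以不进入容器,直接从宿主机连接容器中的 MySQL(前提是宿主机已安装 mysql 客户端):</p>

```shell
mysql -h 127.0.0.1 -P 3306 -uroot -p
```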
<h2 id="1-3-更改容器信息"><a href="#1-3-更改容器信息" class="headerlink" title="1.3 更改容器信息"></a>1.3 更改容器信息</h2><h3 id="1-3-1-方法一:停止并删除现有容器"><a href="#1-3-1-方法一:停止并删除现有容器" class="headerlink" title="1.3.1 方法一:停止并删除现有容器"></a>1.3.1 方法一:停止并删除现有容器</h3><p>首先,停止并删除现有的 <code>mysql-test</code> 容器:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker start mysql-test <span class="comment"># 运行</span></span><br><span class="line">docker stop mysql-test <span class="comment"># 停止</span></span><br><span class="line">docker <span class="built_in">rm</span> mysql-test <span class="comment"># 删除</span></span><br></pre></td></tr></table></figure></div>
<p>然后再重新创建<code>mysql-test</code>即可。</p>
<h3 id="1-3-2-方法二:不同的容器名称"><a href="#1-3-2-方法二:不同的容器名称" class="headerlink" title="1.3.2 方法二:不同的容器名称"></a>1.3.2 方法二:不同的容器名称</h3><p>可以使用不同的名称来运行容器,这样就不需要删除现有的容器。例如:</p>
<div class="code-container" data-rel="Bash"><figure class="iseeu highlight bash"><table><tr><td class="code"><pre><span class="line">docker run --name mysql-test-2 -e MYSQL_ROOT_PASSWORD=0403 -p 3307:3306 -d mysql:latest</span><br></pre></td></tr></table></figure></div>
<p>这样会启动一个新容器,名称为 <code>mysql-test-2</code>,并绑定主机的3307端口到容器的3306端口。</p>
<h2 id="1-4-查看容器信息"><a href="#1-4-查看容器信息" class="headerlink" title="1.4 查看容器信息"></a>1.4 查看容器信息</h2><p>查看容器或镜像内部信息(如端口,ip地址,挂载卷等):</p>
<div class="code-container" data-rel="Plaintext"><figure class="iseeu highlight plaintext"><table><tr><td class="code"><pre><span class="line">docker inspect <容器名/容器ID/镜像名/镜像ID></span><br></pre></td></tr></table></figure></div>]]></content>
<tags>
<tag>docker</tag>
</tags>
</entry>
<entry>
<title>Openwrt 在 Docker 下运行并作为旁路由</title>
<url>/2024/07/30/Docker-Openwrt/</url>
<content><![CDATA[<h1 id="1-前言"><a href="#1-前言" class="headerlink" title="1.前言"></a>1.前言</h1><p>这几天想试着玩一下<code>Openwrt</code>来作旁路由,但是又没有软路由固件,后来考虑到<code>Openwrt</code>也是基于 Linux 的,那么在 Docker Hub 上应该有其对应的镜像吧,然后查了下果真有。于是有了后续的操作 … …</p>
<p><strong>推荐的镜像:</strong> <a class="link" href="https://hub.docker.com/r/zzsrv/openwrt" >zzsrv/openwrt<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a></p>
<h1 id="2-准备工作"><a href="#2-准备工作" class="headerlink" title="2.准备工作"></a>2.准备工作</h1><p>建议在虚拟机上安装一个 Linux 系统。我在实体机(Linux 系统)上试了下,能成功运行,但在该电脑上无法访问<code>Openwrt</code>的 <code>ip</code>地址,其它设备(比如手机)则能正常访问(需要多做一些配置)。</p>
<h2 id="2-1-虚拟机安装"><a href="#2-1-虚拟机安装" class="headerlink" title="2.1 虚拟机安装"></a>2.1 虚拟机安装</h2><p>虚拟机在这里只推荐<code>VirtualBox</code>。<a class="link" href="https://www.virtualbox.org/wiki/Downloads" >下载地址<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a></p>
<p>还需要下载插件:<a class="link" href="https://download.virtualbox.org/virtualbox/7.0.20/Oracle_VM_VirtualBox_Extension_Pack-7.0.20.vbox-extpack" >点这里<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a><br>或者点图片这里红色方框处的下载连接:<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/15123858a31e537afcf88.png"
alt="图一"
><figcaption>图一</figcaption></figure></p>
<p>下载完后直接双击下载好后的扩展,就能自动安装到 VirtualBox 中。</p>
<h2 id="2-2-Linux安装"><a href="#2-2-Linux安装" class="headerlink" title="2.2 Linux安装"></a>2.2 Linux安装</h2><p>推荐的 Linux 发行版:</p>
<ul>
<li><a class="link" href="https://endeavouros.com/#Download" >EndeavourOS<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a> </li>
<li><a class="link" href="https://cachyos.org/download/" >CachyOS<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a>,建议下载<code>Desktop Edition</code>版本</li>
<li><a class="link" href="https://www.linuxmint.com/download.php" >LinuxMint<i class="fa-solid fa-arrow-up-right ml-[0.2em] font-light align-text-top text-[0.7em] link-icon"></i></a>,建议下载<code>Xfce Edition</code>版本</li>
</ul>
<p>前两个是<code>Arch</code>系的发行版,<code>LinuxMint</code>基于 Ubuntu,属于<code>Debian</code>系的发行版。我这里用的是<code>CachyOS</code>,因为正好想尝试下这个发行版,好处是配置有国内的镜像源,下载东西嘎嘎快。<br>在安装时最好选择有桌面环境,并且桌面选择<code>xfce4</code>,毕竟轻量嘛,在虚拟机里用着也会更流畅。当然,也可以不需要桌面环境,那么在安装时就必须把语言设置为英语,因为终端界面的中文会是乱码,而且没有图形化界面也不好配置。</p>
<h2 id="2-3-虚拟机的创建"><a href="#2-3-虚拟机的创建" class="headerlink" title="2.3 虚拟机的创建"></a>2.3 虚拟机的创建</h2><p><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/bd3ae00cf631639f8637d.png"
alt="step1"
><figcaption>step1</figcaption></figure></p>
<p><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/d29cd0f0a8b49c85eba19.png"
alt="step2"
><figcaption>step2</figcaption></figure></p>
<p><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/5fa57afd8c3f1ce85dbaf.png"
alt="step3"
><figcaption>step3</figcaption></figure></p>
<p><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/3238679faab643583e0d2.png"
alt="step3"
><figcaption>step3</figcaption></figure></p>
<h2 id="2-4-虚拟机的配置"><a href="#2-4-虚拟机的配置" class="headerlink" title="2.4 虚拟机的配置"></a>2.4 虚拟机的配置</h2><p>创建完虚拟机后,需要对其进行配置:</p>
<ul>
<li><p>网络配置:<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"
data-src="https://im.gurl.eu.org/file/c0ac4b0af36dd0d8d8669.png"
alt="step1"
><figcaption>step1</figcaption></figure></p>
</li>
<li><p>显示配置<br><figure class="image-caption"><img
lazyload
src="/images/loading.svg"