Python抓取知乎几千张小姐姐图片是什么体验？

2021-04-01

来源：【公众号】

Python技术

知乎上有许多关于颜值、身材的话题，有些话题的回复数甚至高达几百上千，拥有成千上万的关注者与被浏览数。如果我们在摸鱼的时候欣赏这些话题将花费大量的时间，可以用 Python 制作一个下载知乎回答图片的小脚本，将图片下载到本地。

请求 URL 分析

首先打开 F12 控制台面板，看到照片的 URL 都是 https://pic4.zhimg.com/80/xxxx.jpg?source=xxx 这种格式的。

滚动知乎页面向下翻页，找到一个带 limit，offset 参数的 URL 请求。

检查 Response 面板中的内容是否包含了图片的 URL 地址，其中图片地址 URL 存在 data-original 属性中。

提取图片的 URL

从上图可以看出图片的地址存放在 content 属性下的 data-original 属性中。

下面代码将获取图片的地址，并写入文件。

import re import requests import os import urllib.request import ssl from urllib.parse import urlsplit from os.path import basename import json

ssl._create_default_https_context = ssl._create_unverified_context

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36",
    'Accept-Encoding': 'gzip, deflate' } def get_image_url(qid, title):     answers_url = 'https://www.zhihu.com/api/v4/questions/'+str(qid)+'/answers?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B*%5D.mark_infos%5B*%5D.url%3Bdata%5B*%5D.author.follower_count%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset={}&limit=10&sort_by=default&platform=desktop'     offset = 0     session = requests.Session()

    while True:
        page = session.get(answers_url.format(offset), headers = headers)
        json_text = json.loads(page.text)
        answers = json_text['data']

        offset += 10         if not answers:
            print('获取图片地址完成')
            return         pic_re = re.compile('data-original="(.*?)"', re.S)

        for answer in answers:
            tmp_list = []
            pic_urls = re.findall(pic_re, answer['content'])

            for item in pic_urls:  
                # 去掉转移字符                  pic_url = item.replace("", "")
                pic_url = pic_url.split('?')[0]

                # 去重复                 if pic_url not in tmp_list:
                    tmp_list.append(pic_url)

            
            for pic_url in tmp_list:
                if pic_url.endswith('r.jpg'):
                    print(pic_url)
                    write_file(title, pic_url) def write_file(title, pic_url):     file_name = title + '.txt'     f = open(file_name, 'a')
    f.write(pic_url + 'n')
    f.close()

示例结果：

下载图片

下面代码将读取文件中的图片地址并下载。

def read_file(title):
    file_name = title + '.txt'     pic_urls = []

    # 判断文件是否存在
    if not os.path.exists(file_name):
        return pic_urls

    with open(file_name, 'r') as f:
        for line in f:
            url = line.replace("n", "")
            if url not in pic_urls:
                pic_urls.append(url)

    print("文件中共有{}个不重复的 URL".format(len(pic_urls)))
    return pic_urls

def download_pic(pic_urls, title):

    # 创建文件夹
    if not os.path.exists(title):
        os.makedirs(title)

    error_pic_urls = []
    success_pic_num = 0     repeat_pic_num = 0     index = 1     for url in pic_urls:
        file_name = os.sep.join((title,basename(urlsplit(url)[2])))

        if os.path.exists(file_name):
            print("图片{}已存在".format(file_name))
            index += 1             repeat_pic_num += 1             continue

        try:
            urllib.request.urlretrieve(url, file_name)
            success_pic_num += 1             index += 1             print("下载{}完成！({}/{})".format(file_name, index, len(pic_urls)))
        except:
            print("下载{}失败！({}/{})".format(file_name, index, len(pic_urls)))
            error_pic_urls.append(url)
            index += 1             continue
        
    print("图片全部下载完毕！(成功：{}/重复：{}/失败：{})".format(success_pic_num, repeat_pic_num, len(error_pic_urls)))

    if len(error_pic_urls) > 0:
        print('下面打印失败的图片地址')
        for error_url in error_pic_urls:
            print(error_url)

结语

今天的文章用 Python 爬虫制作了一个小脚本，如果小伙伴们觉得文章有趣且有用，点个转发支持一下吧！

CDA数据分析师考试相关入口一览（建议收藏）：

▷ 想报名CDA认证考试，点击>>> “CDA报名” 了解CDA考试详情；

▷ 想学习CDA考试教材，点击>>> “CDA教材” 了解CDA考试详情；

▷ 想加入CDA考试题库，点击>>> “CDA题库” 了解CDA考试详情；

▷ 想了解CDA考试含金量，点击>>> “CDA含金量” 了解CDA考试详情；

requests python

数据分析咨询请扫描二维码

若不方便扫码，搜微信号：CDAshujufenxi

上一篇CDA LEVEL I 数据分析认证考试模拟题库（三十一）

下一篇CDA LEVEL II考试内容公布，这些知识你掌握了吗

Python抓取知乎几千张小姐姐图片是什么体验？

请求 URL 分析

提取图片的 URL

下载图片

CDA考试动态

CDA报考指南

热门栏目

最新资讯

【CDA持证人干货分享】13年国企财务：如何借助DeepS ...

【CDA持证人案例分享】Excel动态报表设计：基于业务 ...

【CDA干货】字节大佬：如何通过动态分级快速提升转 ...

Windows 系统和 MacOS 系统下的 Anaconda 安装教程 ...

数据运营的工作内容、技能要求及发展前景 ...

【干货】字节大佬：教培行业销售运营全景作战地图【 ...

四大一线城市约50%人口租房，数据分析能挖出哪些 “ ...

Python 实战案例 —RFM 客户价值分析模型 ...

【案例】奥利奥坚果新品与蒙牛“数字牧场”的成功经 ...

美关税政策下的全球金融市场动荡：深度数据分析与洞 ...

【重磅】苹果捐赠3000万给浙大这专业，透露未来就业 ...

非专业，怎么才能证明自己的数据分析能力？ ...

保姆级教程！一文读懂数据分析师的职业发展路径 ...

【干货】用DeepSeek三步骤搞定Excel数据清洗，效率 ...

被统计公式劝退？这门极简课程让你14天学会用Python ...

【干货】如何用RFM模型精准识别高价值客户？ ...

CDA数据分析师就业班3月29日开班，仅剩1个名额 ...

【案例】网飞Netflix流量漏斗分析案例 ...

tensorflow_datasets 如何load本地的数据集？ ...

《CDA二级教材》试读版上线CDA网校，助你轻松拿下二 ...