UNNAMED's Blog - Python3 Scrapy Study Notes (Part 2)

Summary: an introduction to Scrapy selectors, a Scrapy quick start, and a more advanced crawl.

I. Basics (Introduction to Selectors)

1. xpath(): takes an XPath expression and returns a SelectorList of all nodes that match it.

response.xpath("html")  # Selects the html element that is a child of the context node (here, the document root).
response.xpath("html/head")  # Selects every head element that is a child of html.
response.xpath("html//div")  # Selects every div element that is a descendant of html, at any depth.
response.xpath("/html/head/title")  # Selects the <title> element inside <head>.
response.xpath("/html/head/title/text()")  # Selects the text of that <title> element.
response.xpath("//td")  # Selects every <td> element.
response.xpath("//@href")  # Selects every href attribute in the document.
response.xpath("//a[@href]")  # Selects every a element that has an href attribute.
response.xpath('//div[@class="mine"]')  # Selects every div element whose class attribute is exactly "mine".
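The path patterns above can be tried without a running crawl. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selector, so it runs without Scrapy installed. ElementTree supports only a subset of XPath and needs well-formed markup, and the SAMPLE page is a hypothetical one made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
SAMPLE = """
<html>
  <head><title>Demo: page</title></head>
  <body>
    <div class="mine"><a href="/a.html">first</a></div>
    <div class="other"><a>no link</a><a href="/b.html">second</a></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <html> element

# Counterpart of /html/head/title/text(); the path is relative to <html>.
title_text = root.find("head/title").text

# Counterpart of //a[@href]: every <a> that has an href attribute.
links = [a.get("href") for a in root.findall(".//a[@href]")]

# Counterpart of //div[@class="mine"].
mine_divs = root.findall(".//div[@class='mine']")

print(title_text)
print(links)
print(len(mine_divs))
```

In a spider the same expressions would be passed to response.xpath(), which accepts full XPath 1.0 and tolerates the messy HTML that real pages contain.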

2. css(): takes a CSS expression and returns a SelectorList of all nodes that match it.

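# I have heard CSS expressions discouraged, so these notes use XPath; Scrapy itself supports both and translates CSS selectors to XPath internally.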

3. extract(): serializes the matched nodes to strings and returns them as a list.

response.xpath('/html/head/title/text()').extract()

4. re(): applies the given regular expression to the data and returns the matches as a list of strings.

response.xpath('//title/text()').re(r'(\w+):')
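To see what such a pattern captures, here is a small sketch that uses Python's own re module on a plain string. It is only an analogy for the selector's re() method, and the title text is a made-up example. The pattern is written as a raw string (r"...") so that \w is not treated as a string escape, which recent Python versions warn about.

```python
import re

# Hypothetical title text, chosen only to illustrate the pattern.
title = "Scrapy: a fast crawling framework"

# (\w+): captures the run of word characters that sits right before a colon.
words_before_colon = re.findall(r"(\w+):", title)

print(words_before_colon)
```

Because the pattern contains one capture group, only the captured part is returned and the colon is left out.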

II. Quick Start

A simple data extraction example:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ["mmonly.cc"]
    start_urls = [
        "http://www.mmonly.cc/ktmh/dmmn/136697.html",
    ]

    def parse(self, response):
        # To save the raw page for inspection:
        # with open("1.txt", 'wb') as f:
        #     f.write(response.body)
        title = response.xpath('/html/head/title/text()').extract()  # list of title strings
        img = response.xpath('//img//@src').extract()  # list of every image src value
        print("############# 原始数据 #############")  # banner: "raw data"
        print(title)  # raw list
        print(img)  # raw list
        print("############# 加工数据 #############")  # banner: "processed data"
        print(title[0])  # first title string
        print(img[1])  # second image URL; assumes the page has at least two images

Command prompt output:

c:\Users\Unnamed\Desktop\project_name>scrapy crawl example --nolog
############# 原始数据 #############
['动漫清纯美女高清精选图片 - 唯一图库']
['/skins/images/mmonly1.png', 'http://t1.mmonly.cc/uploads/tu/201703/52/11_15123
1135805_1.jpg', 'http://t1.mmonly.cc/uploads/tu/201612/31/37.png', 'http://t1.mm
only.cc/uploads/tu/sm/201601/29/27slt.jpg', 'http://t1.mmonly.cc/uploads/150407/
2-15040G00H3543.jpg', 'http://t1.mmonly.cc/uploads/150723/2-150H311362M31.jpg',
'http://t1.mmonly.cc/uploads/tu/sm/201601/05/139slt.jpg', 'http://t1.mmonly.cc/u
ploads/150813/8-150Q3092401554.jpg', 'http://t1.mmonly.cc/uploads/tu/sm/201602/0
4/27slt.jpg', 'http://t1.mmonly.cc/uploads/150723/2-150H3113941127.jpg']
############# 加工数据 #############
动漫清纯美女高清精选图片 - 唯一图库
http://t1.mmonly.cc/uploads/tu/201703/52/11_151231135805_1.jpg
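The first src value in the raw list above is a relative path, while the others are absolute URLs. A minimal sketch of normalizing them with the standard library's urljoin follows; the two src values are copied from the output above, and the page URL is the start URL of the spider.

```python
from urllib.parse import urljoin

page_url = "http://www.mmonly.cc/ktmh/dmmn/136697.html"

# Two src values taken from the raw output: one relative, one absolute.
srcs = [
    "/skins/images/mmonly1.png",
    "http://t1.mmonly.cc/uploads/tu/201612/31/37.png",
]

# urljoin resolves relative paths against the page URL and leaves
# absolute URLs unchanged.
absolute = [urljoin(page_url, src) for src in srcs]

for url in absolute:
    print(url)
```

Inside a spider callback, response.urljoin(src) does roughly the same thing against the URL of the response.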

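III. Advanced Crawling

Let's practice on the anime girls column at //www.mmonly.cc/ktmh/dmmn/list_29_1.html. I hope the site owner doesn't come after me, haha.

This time the goal is to grab every image in the whole column.

Crawling plan:

First get the number of pages in the column, using the link behind the "last page" button.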

response.xpath("//a[@href]").re(r'list_29_(\w+)\.html')[-1]
# Run the regular expression over every a element that has an href, then take the last match in the list.
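The page-count step can be checked offline on a list of href strings. This is a sketch under assumptions: the hrefs below are invented for illustration (only the list_29_N.html naming comes from the site), and it takes the largest page number rather than the last match, which is a swapped-in variation on the [-1] approach.

```python
import re

# Hypothetical href values as they might appear in the pagination bar.
hrefs = [
    "list_29_2.html",
    "list_29_3.html",
    "list_29_15.html",         # assumed "last page" link
    "/ktmh/dmmn/136697.html",  # unrelated link, ignored by the pattern
]

pattern = re.compile(r"list_29_(\d+)\.html")

page_numbers = []
for href in hrefs:
    match = pattern.search(href)
    if match:
        page_numbers.append(int(match.group(1)))

last_page = max(page_numbers)
print(last_page)
```

Taking the maximum does not depend on the "last page" link being the final match on the page, which the [-1] version quietly assumes.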

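This time we use something new: scrapy.Request(url, callback), which schedules a request for url and passes the response to callback.

The complete spider code: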

# -*- coding: utf-8 -*-
# scrapy crawl example --nolog
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ["mmonly.cc"]

    # Entry point: Scrapy calls this to get the first requests
    def start_requests(self):
        url = "http://www.mmonly.cc/ktmh/dmmn/"
        yield scrapy.Request(url, callback=self.parse_item)  # start crawling

    # Callback: queue every list page of the column
    def parse_item(self, response):
        # number = response.xpath("//a[@href]").re(r'list_29_(\w+)\.html')[-1]  # get the page count
        number = 2  # fixed page count for this demo
        # Pages are numbered from 1; list_29_0.html does not exist.
        for i in range(1, int(number) + 1):
            url = "http://www.mmonly.cc/ktmh/dmmn/list_29_" + str(i) + ".html"
            yield scrapy.Request(url, callback=self.parse_item_sub)  # crawl the list page

    # Callback: follow each detail-page link on a list page
    def parse_item_sub(self, response):
        links = response.xpath('/html//div[@class="ABox"]/a/@href').extract()
        # I spent hours on xpath('div[@class="ABox"]') returning nothing.
        # Changing it to /html//div[@class="ABox"] fixed it: without a leading //,
        # the path only matches direct children of the document root.
        for link in links:
            yield scrapy.Request(link, callback=self.parse_item_sub_sub)  # crawl the detail page
        print("end")

    # Callback: print the title and download link of a detail page
    def parse_item_sub_sub(self, response):
        title = response.xpath('/html/head/title/text()').extract()
        img = response.xpath('/html//a[@class="down-btn"]/@href').extract()
        print(title)
        print(img)
        # To crawl deeper, yield further scrapy.Request objects here; this example stops at this level.
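The list-page numbering used in parse_item can be tested on its own before running the spider. A minimal sketch follows; page_urls is a hypothetical helper name that does not appear in the spider, and the URL pattern is the one used above.

```python
def page_urls(base, count):
    """Return the list-page URLs for pages 1..count (the site has no page 0)."""
    return [f"{base}list_29_{i}.html" for i in range(1, count + 1)]


urls = page_urls("http://www.mmonly.cc/ktmh/dmmn/", 2)
for url in urls:
    print(url)
```

Starting the range at 1 keeps the "page 0 does not exist" rule in a single place.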

Screenshot of the run output (image not available):

Parts of this article draw on the following sources:

http://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/spiders.html?highlight=scrapy.Request

https://www.cnblogs.com/thunderLL/p/6551641.html

https://www.jianshu.com/p/461d74641e80

http://blog.xx21.cn/?id=6 [Scrapy] Python Scrapy Study Notes (Part 1)

http://blog.xx21.cn/?id=7 [Scrapy] Python Scrapy Study Notes (Part 2)
