Pipelines

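All spiders in the same Scrapy project pass their items through the same pipeline classes.
A pipeline checks spider.name to tell which spider produced an item and handles it accordingly.
Every pipeline class must define a process_item method, which processes each item.
A pipeline class can also define open_spider and close_spider methods, which run once when the spider is opened and once when it is closed.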
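
The call order of these hooks can be sketched without Scrapy installed. In the minimal simulation below, `RecordingPipeline`, `run`, and the two stand-in spiders are hypothetical names made up for illustration, not Scrapy classes. The real engine calls the same three methods but schedules them asynchronously.

```python
from types import SimpleNamespace


class RecordingPipeline:
    """Hypothetical pipeline that records hook calls and only handles the "hr" spider."""

    def open_spider(self, spider):
        # Called once, before any item is processed
        self.events = [f"open:{spider.name}"]

    def process_item(self, item, spider):
        # Branch on spider.name to decide how to handle the item
        if spider.name == "hr":
            self.events.append(f"item:{item['title']}")
        return item  # pass the item on to the next pipeline

    def close_spider(self, spider):
        # Called once, after the last item
        self.events.append(f"close:{spider.name}")


def run(pipeline, spider, items):
    """Simplified stand-in for the order in which the hooks are called."""
    pipeline.open_spider(spider)
    out = [pipeline.process_item(item, spider) for item in items]
    pipeline.close_spider(spider)
    return out


hr = SimpleNamespace(name="hr")       # stand-in for a spider named "hr"
other = SimpleNamespace(name="news")  # stand-in for another spider in the same project

pipeline = RecordingPipeline()
run(pipeline, hr, [{"title": "a"}, {"title": "b"}])
hr_events = list(pipeline.events)

run(pipeline, other, [{"title": "c"}])
other_events = list(pipeline.events)

print(hr_events)
print(other_events)
```

Both spiders trigger the open and close hooks, but only items from the "hr" spider are recorded, which is the effect of the spider.name check.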

Definition

Define the pipeline classes in pipelines.py.

python


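import csv

import pymysql


class WangyiPipeline:
    # Default pipeline: pass every item through unchanged
    def process_item(self, item, spider):
        return item


# Pipeline that stores items in a CSV file
class CsvPipeline:
    # Runs when the spider is opened
    def open_spider(self, spider):
        # Only handle the "hr" spider
        if spider.name == "hr":
            self.file = open(f"{spider.name}.csv", "w", newline="", encoding="utf-8")
            # The writer is created on the first item, once the field names are known
            self.writer = None

    # Runs when the spider is closed
    def close_spider(self, spider):
        # Only handle the "hr" spider
        if spider.name == "hr":
            self.file.close()

    # Process each item
    def process_item(self, item, spider):
        # Only handle the "hr" spider
        if spider.name == "hr":
            # Convert the item object to a dict
            row = dict(item)
            if self.writer is None:
                # Use the first item's keys as the CSV header
                self.writer = csv.DictWriter(
                    self.file, fieldnames=list(row.keys()), extrasaction="ignore"
                )
                self.writer.writeheader()
            self.writer.writerow(row)
        # Always return the item so that later pipelines receive it
        return item


# Pipeline that stores items in MySQL
class MysqlPipeline:
    # Runs when the spider is opened
    def open_spider(self, spider):
        # Only connect for the "hr" spider, matching close_spider below
        if spider.name == "hr":
            self.db = pymysql.connect(host="localhost", user="root", password="123456", database="wangyi")
            self.cursor = self.db.cursor()

    # Runs when the spider is closed
    def close_spider(self, spider):
        # Only handle the "hr" spider
        if spider.name == "hr":
            self.cursor.close()
            self.db.close()

    # Process each item
    def process_item(self, item, spider):
        # Only handle the "hr" spider
        if spider.name == "hr":
            # Convert the item object to a dict
            row = dict(item)
            # Execute an SQL statement (placeholder query)
            self.cursor.execute("select version()")
        # Always return the item so that later pipelines receive it
        return item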

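Configuration

Enable pipeline classes in settings.py; a pipeline that is not listed in ITEM_PIPELINES does not run.
Format: "path.to.PipelineClass": priority
Priority: the smaller the value, the higher the priority, so pipelines with smaller values run first. Values are conventionally in the 0-1000 range.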

python

ITEM_PIPELINES = {
    "myproject.pipelines.CsvPipeline": 300,
    "myproject.pipelines.MysqlPipeline": 400,
}
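
To make the priority rule concrete, the sketch below sorts a setting of the same shape by its values. It is plain Python written for illustration, not Scrapy's internal code, and the name `execution_order` is made up here. The dict is written in reverse order on purpose, to show that the numbers decide the order rather than the position in the dict.

```python
# Same shape as the ITEM_PIPELINES setting; dict order deliberately reversed
ITEM_PIPELINES = {
    "myproject.pipelines.MysqlPipeline": 400,
    "myproject.pipelines.CsvPipeline": 300,
}

# Sort the class paths by priority value: the lowest number runs first
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in execution_order:
    print(ITEM_PIPELINES[path], path)
```

With these values, CsvPipeline handles each item first, and whatever its process_item returns is what MysqlPipeline receives.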
