Python-Scrapy Application


1. Install Scrapy:
- Make sure pip is up to date
- Download lxml and install it locally
- Download pyOpenSSL and install it locally
- Download pywin32 and install it locally
- Download Twisted and install it locally
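How to install a module locally:
- Open cmd and change to the download location:
cd <download location>
- Install the downloaded wheel file with pip. The file name must match your Python version and platform; this one is for CPython 3.8 on 64-bit Windows:
pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl
- If pip fails with a "Read timed out" error, switch to a different package index (mirror):
pip install -i https://pypi.douban.com/simple scrapy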
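To confirm that the packages listed above ended up in the active Python environment, a small check like the following may help. It is only a sketch: the mapping from package name to import name (pyOpenSSL is imported as OpenSSL, pywin32 provides win32api) is stated here as an assumption of the example, and check_installed is a hypothetical helper, not part of Scrapy.

```python
import importlib.util

# Package name -> module name used to import it.
# (Mapping assumed for this sketch; pywin32 only exists on Windows.)
REQUIRED = {
    "lxml": "lxml",
    "pyOpenSSL": "OpenSSL",
    "pywin32": "win32api",
    "Twisted": "twisted",
    "Scrapy": "scrapy",
}


def check_installed(packages=REQUIRED):
    """Return {package name: True/False} without importing the packages."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in packages.items()
    }


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name:10s} {'OK' if ok else 'missing'}")
```

Any package reported as missing can then be installed from its downloaded wheel file as described above.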
2. Create a project:
scrapy startproject kuaidaili
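3. Create a spider:
scrapy genspider kuaidaili kuaidaili.com
The general form is:
scrapy genspider name url
Note: genspider refuses a spider name that is identical to the project name. If the project is also called kuaidaili, either give the project a different name or create the spider file by hand.
- In the project's settings.py, set the robots.txt rule to False:
ROBOTSTXT_OBEY = False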
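For orientation, the relevant part of settings.py might look like the excerpt below. Only ROBOTSTXT_OBEY comes from these notes; the two delay settings are an assumed addition (both are standard Scrapy settings) showing one way to space out requests, and the value 1.5 is arbitrary.

```python
# settings.py (excerpt)

# Do not fetch or obey the site's robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Assumed additions, not part of the original notes:
# wait about 1.5 seconds between requests to the same site ...
DOWNLOAD_DELAY = 1.5
# ... and let Scrapy vary each wait between 0.5x and 1.5x of that value.
RANDOMIZE_DOWNLOAD_DELAY = True
```

With a delay configured here, Scrapy applies the pause to every request, so the spider code itself does not need to sleep.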
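- Example: crawl the IP addresses listed on an IP proxy website
import random
import time

import scrapy

# Pause for a random 0-3 seconds; at module level this runs once, when the file is loaded
time.sleep(random.random() * 3)


# Define the spider class; it inherits from scrapy.Spider, the most basic spider class
class KuaidailiSpider(scrapy.Spider):
    name = 'kuaidaili'  # spider name; must be unique, no duplicates
    allowed_domains = ['kuaidaili.com']  # domains the spider is allowed to crawl
    start_urls = ['https://www.kuaidaili']  # URLs where crawling starts (use the full address of the listing page)

    def parse(self, response):
        # Parse the response; `response` holds the downloaded page source
        # Extract the data: one IP and port per table row
        selectors = response.xpath('//tr')  # select every <tr>
        for selector in selectors:
            # the leading "." continues the selection from the current node
            ip = selector.xpath('./td[1]/text()').get()
            port = selector.xpath('./td[2]/text()').get()
            print(ip, port)
            # items = {
            #     'ip': ip,
            #     'port': port
            # }
            # yield items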
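The row-by-row extraction can be tried without a network connection. The sketch below is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports the same simple relative paths, on a made-up sample table. The sample markup and the extract_proxies helper are assumptions for illustration; the real page layout may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample table standing in for the downloaded page.
SAMPLE = """
<table>
  <tr><th>IP</th><th>PORT</th></tr>
  <tr><td>10.0.0.1</td><td>8080</td></tr>
  <tr><td>10.0.0.2</td><td>3128</td></tr>
</table>
"""


def extract_proxies(markup):
    """Yield {'ip': ..., 'port': ...} for every row that has data cells."""
    root = ET.fromstring(markup.strip())
    for row in root.findall(".//tr"):    # every <tr> below the root
        ip = row.findtext("./td[1]")     # "." = relative to this row
        port = row.findtext("./td[2]")
        if ip is None or port is None:   # header rows have no <td>
            continue
        yield {"ip": ip.strip(), "port": port.strip()}


if __name__ == "__main__":
    for item in extract_proxies(SAMPLE):
        print(item)
```

Yielding a dict per row, as the commented-out items block in the spider would do, hands the data to Scrapy instead of only printing it. Skipping rows where a cell is missing avoids the None values that header rows would otherwise produce.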
4. Run the spider:
Open CMD in the project directory and run:
scrapy crawl kuaidaili
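The same command can also be launched from a Python script, for example to add an output file. This is a rough sketch: build_crawl_command is a hypothetical helper and proxies.json is an arbitrary file name. The -o option is Scrapy's feed export, which only writes data when the spider yields items rather than printing them.

```python
import subprocess


def build_crawl_command(spider, output=None):
    """Build the argument list for the `scrapy crawl` command."""
    cmd = ["scrapy", "crawl", spider]
    if output:
        cmd += ["-o", output]  # export yielded items to this file
    return cmd


if __name__ == "__main__":
    # Run from the project directory (the one containing scrapy.cfg).
    subprocess.run(build_crawl_command("kuaidaili", "proxies.json"), check=True)
```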