" Life is short,you need Python——Bruce Eckel"

[1]Python-Scrapy Application

1.Install scrapy:

  1. Make sure pip is up to date
  2. Download lxml and install it locally
  3. Download pyOpenSSL and install it locally
  4. Download pywin32 and install it locally
  5. Download Twisted and install it locally
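
Once the packages in the list above are installed, a quick check can confirm that Python can actually find them. The following is a minimal sketch, not part of Scrapy itself: the helper name missing_modules and the REQUIRED_MODULES list are made up for illustration. Note that the import names differ from the package names in two cases (pyOpenSSL is imported as OpenSSL, and pywin32 provides win32api).

```python
import importlib.util

# Import names, not pip package names:
# pyOpenSSL is imported as OpenSSL, pywin32 provides win32api.
REQUIRED_MODULES = ["lxml", "OpenSSL", "win32api", "twisted", "scrapy"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Anything the script reports as not installed yet still needs to be downloaded and installed as described in the next section.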

How to install a module locally:

  1. Open cmd and cd into the folder the file was downloaded to
  2. Run pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl
  3. The install may fail with a read timeout error
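
A downloaded .whl file only installs if it was built for your Python version and platform. In the file name above, cp38 stands for CPython 3.8 and win_amd64 for 64-bit Windows. The sketch below reads those tags from a file name and compares the Python tag with the running interpreter. The function names are hypothetical, and the check is deliberately simplified: it only recognizes CPython-specific tags such as cp38, not generic ones such as py3.

```python
import sys


def parse_wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # The last three dash-separated fields are always the three tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def matches_running_python(filename):
    """True if the wheel's Python tag names the exact CPython version
    running this code (simplified: generic tags like py3 are ignored)."""
    python_tag, _, _ = parse_wheel_tags(filename)
    current = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    return current in python_tag.split(".")


print(parse_wheel_tags("Twisted-20.3.0-cp38-cp38-win_amd64.whl"))
# -> ('cp38', 'cp38', 'win_amd64')
print(matches_running_python("Twisted-20.3.0-cp38-cp38-win_amd64.whl"))
```

If the second line prints False, download the wheel whose tags match your own Python version instead.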

Solution:

If pip reports a read timeout, switch to a different package index (mirror):

pip install -i https://pypi.douban.com/simple scrapy
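
If you prefer to run the install from a script, the same command can be assembled in Python. This is only a sketch: the helper name build_pip_command and the 100-second timeout are assumptions, while -i and --timeout are regular pip options. The block only prints the command; pass the list to subprocess.run to actually execute it.

```python
import sys


def build_pip_command(package,
                      index_url="https://pypi.douban.com/simple",
                      timeout=100):
    """Build a pip install command that uses a mirror and a longer timeout."""
    return [
        sys.executable, "-m", "pip", "install",
        "--timeout", str(timeout),  # socket timeout in seconds (assumed value)
        "-i", index_url,            # package index (mirror) to download from
        package,
    ]


command = build_pip_command("scrapy")
print(" ".join(command))
```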

2.Create project:

scrapy startproject kuaidaili

3.Create Spider

  • scrapy genspider kuaidaili kuaidaili.com

    –> general form: scrapy genspider <name> <domain>

  • Turn off robots.txt compliance in the project's settings.py

ROBOTSTXT_OBEY = False

  • Example: crawl the IP addresses listed on an IP proxy website

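import random
import time

import scrapy

# Pause for a random 0-3 seconds once, when this module is loaded
time.sleep(random.random() * 3)


# Spider class; inherits from scrapy.Spider, the most basic spider class
class KuaidailiSpider(scrapy.Spider):
    name = 'kuaidaili'  # spider name; must be unique within the project
    allowed_domains = ['kuaidaili.com']  # domains the spider may crawl
    start_urls = ['https://www.kuaidaili']  # URLs the crawl starts from

    def parse(self, response):
        # Parse the response; response holds the downloaded page source
        # Extract the IP and port from every table row
        selectors = response.xpath('//tr')  # select all <tr> elements
        for selector in selectors:
            # The leading . means "keep selecting under the current node"
            ip = selector.xpath('./td[1]/text()').get()
            port = selector.xpath('./td[2]/text()').get()
            print(ip, port)
            # items = {
            #     'ip': ip,
            #     'port': port
            # }
            # yield items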
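
To see what the two relative XPath expressions in parse select, the same row-by-row logic can be tried outside Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, so it is not the code Scrapy runs. The sample table and its addresses are invented for illustration. It also shows one detail worth knowing: //tr matches the header row too, which has no td cells, so in the spider .get() returns None for that row.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a proxy table (not real site data).
SAMPLE_HTML = """
<table>
  <tr><th>IP</th><th>PORT</th></tr>
  <tr><td>192.0.2.10</td><td>8080</td></tr>
  <tr><td>192.0.2.11</td><td>3128</td></tr>
</table>
"""


def extract_proxies(markup):
    """Return (ip, port) pairs, one per table row that has td cells."""
    root = ET.fromstring(markup)
    proxies = []
    for row in root.iter("tr"):           # every <tr>, like //tr
        ip = row.findtext("./td[1]")      # first <td> under this row
        port = row.findtext("./td[2]")    # second <td> under this row
        if ip is None or port is None:
            continue                      # header row: no <td> cells
        proxies.append((ip.strip(), port.strip()))
    return proxies


print(extract_proxies(SAMPLE_HTML))
# -> [('192.0.2.10', '8080'), ('192.0.2.11', '3128')]
```

Skipping rows where either value is missing, as done here, is one way to keep header rows out of the results.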


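The time.sleep call at the top of the spider runs only once, when the module is loaded, so it does not space out individual requests. Scrapy has its own settings for that. Below is a possible settings.py fragment. The setting names are Scrapy's own, but the delay value of 1.5 seconds is an assumption chosen to roughly match the average of the 0-3 second sleep.

```python
# settings.py (fragment)

# Do not consult robots.txt before requesting pages (as set above).
ROBOTSTXT_OBEY = False

# Base delay in seconds between requests to the same site.
# 1.5 is an assumed value; adjust it to what the target site tolerates.
DOWNLOAD_DELAY = 1.5

# Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY instead of a fixed interval.
RANDOMIZE_DOWNLOAD_DELAY = True
```
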
4.Run scrapy spider

Open CMD in the project folder and run:

scrapy crawl kuaidaili
