Example: implementing a web crawler in Python with RabbitMQ

  • Posted: 2022-11-30 17:03
Writing tasks.py
The code is as follows:
from celery import Celery
from tornado.httpclient import HTTPClient, HTTPError

app = Celery('tasks')
app.config_from_object('celeryconfig')

@app.task
def get_html(url):
    # Fetch the page synchronously; return its body, or None on an HTTP error.
    http_client = HTTPClient()
    try:
        response = http_client.fetch(url, follow_redirects=True)
        return response.body
    except HTTPError:
        return None
    finally:
        # Close the client whether or not the fetch succeeded.
        http_client.close()
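These tasks only execute once a Celery worker is running against this module. Assuming RabbitMQ is already installed and listening on its default port, a worker can typically be started with:

celery -A tasks worker --loglevel=info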
Writing celeryconfig.py
The code is as follows:
CELERY_IMPORTS = ('tasks',)
BROKER_URL = 'amqp://guest@localhost:5672//'
CELERY_RESULT_BACKEND = 'amqp://'
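With the broker configured, the setup can be sanity-checked from a Python shell before building the spider. A minimal check, using http://example.com/ purely as a placeholder URL:

from tasks import get_html

result = get_html.delay('http://example.com/')  # publish the task to RabbitMQ
print(result.get(timeout=10))                   # block until a worker returns the page body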
Writing spider.py
The code is as follows:
from tasks import get_html
from queue import Queue
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import threading

class spider(object):
    def __init__(self):
        self.visited = {}
        self.queue = Queue()

    def process_html(self, html):
        pass
        # print(html)

    def _add_links_to_queue(self, url_base, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a'):
            try:
                url = link['href']
            except KeyError:
                continue
            url_com = urlparse(url)
            if not url_com.netloc:
                # Relative link: resolve it against the page it was found on.
                self.queue.put(urljoin(url_base, url))
            else:
                self.queue.put(url_com.geturl())

    def start(self, url):
        self.queue.put(url)
        for i in range(20):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        self.queue.join()

    def _worker(self):
        while True:
            url = self.queue.get()
            if url in self.visited:
                # Already crawled; the queue item still has to be marked done.
                self.queue.task_done()
                continue
            # Hand the fetch off to a Celery worker and wait for the result.
            result = get_html.delay(url)
            try:
                html = result.get(timeout=5)
            except Exception as e:
                print(url)
                print(e)
                self.queue.task_done()
                continue
            self.process_html(html)
            self._add_links_to_queue(url, html)
            self.visited[url] = True
            self.queue.task_done()

s = spider()
s.start("http://www.1sucai.cn/")
Because of various edge cases in real-world HTML, the program still needs further refinement.
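One direction for that refinement is filtering hrefs before queueing them, since pages routinely contain mailto:, javascript:, and fragment-only links that the fetcher cannot follow. A minimal sketch; is_crawlable is a hypothetical helper, not part of the code above:

from urllib.parse import urldefrag, urlparse

def is_crawlable(url):
    # Keep only links an HTTP fetcher can actually follow.
    url, _fragment = urldefrag(url)  # strip '#...' anchors
    scheme = urlparse(url).scheme
    return scheme in ('', 'http', 'https')  # rejects mailto:, javascript:, etc.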