Scraping Site Titles from Baidu Search Result Pages with Python

Date: 2020-03-17 18:08

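[img]http://files.jb51.net/file_images/article/201501/2015122130900742.png?201502213913[/img]

Suppose you want to collect the search engine results page (SERP) entries whose titles contain "58同城" (58.com), while excluding any entries whose titles contain "北京" (Beijing) or "厦门" (Xiamen). The Python script below does exactly that. It uses BeautifulSoup to parse the HTML; for installation, see my other article: [url=http://www.1sucai.cn/article/60242.htm]Installing BeautifulSoup on Windows 8[/url]. The code is as follows: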

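# -*- coding: utf-8 -*-
# Collect result titles from Baidu search result pages (SERP).
# Ported to Python 3: urllib2 -> urllib.request, file() -> open(),
# print statement -> print().

__author__ = '曾是土木人'

import time
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

keyword = "58同城"             # titles must contain this (58.com)
excluded = ("北京", "厦门")     # skip titles containing Beijing or Xiamen


# Append one line to the output file
def WriteFile(fileName, content):
    try:
        with open(fileName, "a", encoding="utf-8") as fp:
            fp.write(content + "\n")
    except OSError:
        pass  # best effort: a failed write should not stop the crawl


# Fetch the HTML source; return an empty string if the request fails
def GetHtml(url):
    try:
        req = urllib.request.Request(url)
        response = urllib.request.urlopen(req, None, timeout=3)  # 3-second timeout
        return response.read().decode('utf-8', 'ignore')
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return ""


# Extract the SERP result titles, which sit in <h3> elements
def FetchTitle(html):
    soup = BeautifulSoup(html, "html.parser")
    for h3 in soup.find_all("h3"):
        title = h3.get_text(strip=True)
        # Drop titles that contain any excluded word
        if any(word in title for word in excluded):
            continue
        print(title)
        WriteFile("Result.txt", title)


if __name__ == "__main__":
    start = time.time()
    for i in range(0, 8):
        # The script asks for 100 results per request (rn) and moves the
        # offset (pn) forward by 100 each time.
        # The keyword is percent-encoded because it contains non-ASCII characters.
        url = ("http://www.baidu.com/s?wd=intitle:" + urllib.parse.quote(keyword)
               + "&rn=100&pn=" + str(i * 100))
        html = GetHtml(url)
        FetchTitle(html)
        time.sleep(1)  # pause between requests
    c = time.time() - start
    print('Elapsed time: %0.2f s' % c)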
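
The filtering step can be tried offline, without sending any request to Baidu. The following is only a minimal sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so that nothing needs to be installed, the names H3TitleParser and filter_titles are hypothetical helpers that do not appear in the script above, and SAMPLE_HTML is made-up markup rather than a real result page.

```python
from html.parser import HTMLParser


class H3TitleParser(HTMLParser):
    """Collect the text content of every <h3> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []   # finished titles, in document order
        self._depth = 0    # greater than 0 while inside an <h3>
        self._parts = []   # text fragments of the current <h3>

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # The <h3> is closed: join its fragments into one title
                self.titles.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        # Keep text only while inside an <h3>, including nested tags
        if self._depth > 0:
            self._parts.append(data)


def filter_titles(html, excluded):
    """Return the <h3> titles that contain none of the excluded substrings."""
    parser = H3TitleParser()
    parser.feed(html)
    parser.close()
    return [t for t in parser.titles if not any(w in t for w in excluded)]


# Made-up sample markup standing in for a downloaded result page
SAMPLE_HTML = (
    '<div><h3><a href="#">58同城 <em>北京</em>分类信息</a></h3>'
    '<h3><a href="#">58同城 上海分类信息</a></h3>'
    '<h3><a href="#">厦门58同城</a></h3></div>'
)

kept = filter_titles(SAMPLE_HTML, ("北京", "厦门"))
print(kept)
```

Because the excluded words are matched against the joined text of each heading, a word is still caught when the page wraps it in a nested tag such as <em>, which is how search result pages commonly highlight matched terms.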
