Web crawler scraping Baidu images (Baidu Hot News, issue 6)
優采云 Published: 2021-09-17 19:06
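On the Baidu hot news page, the first 6 items are scraped from the links under strong > a, and the next 30 are scraped from the sub-columns (domestic, international, local, entertainment, sports, and so on). The feature used to locate each item is the value of the mon attribute on its tag, where c is the column name and pn is the item's number within that column. Each category shows 12 items (8 for local news). To confirm this, just inspect the source of the original page.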
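To make the role of the mon attribute concrete, here is a minimal sketch that splits one such value into its fields and builds the value for a given column and item number. The helper names parse_mon and build_mon are hypothetical, and the assumption that every column uses the same "ct=1&a=2" prefix as the c=top value is not confirmed by the page itself, so check it against the page source before relying on it.

```python
from urllib.parse import parse_qs


def parse_mon(mon: str) -> dict:
    """Split a mon value such as 'ct=1&a=2&c=top&pn=3' into its fields."""
    fields = parse_qs(mon)
    # parse_qs returns a list of values per key; keep the first of each.
    return {key: values[0] for key, values in fields.items()}


def build_mon(column: str, index: int) -> str:
    """Build the mon value for item `index` of `column`.

    Assumption: the 'ct=1&a=2' prefix seen for c=top applies to other columns.
    """
    return f"ct=1&a=2&c={column}&pn={index}"


info = parse_mon("ct=1&a=2&c=top&pn=3")
print(info["c"], info["pn"])  # column name and item number
```

A value produced by build_mon can be passed as the mon keyword to soup.find_all('a', mon=...), which matches on the exact attribute string.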
The complete code is as follows:
import time

import requests
from bs4 import BeautifulSoup

url = 'http://news.baidu.com/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

print('Baidu News Python crawler')
print('Top headlines')
# The headline links sit inside <strong> tags; take the first 6.
sel_a = soup.select('strong a')
for a in sel_a[:6]:
    print(a.get_text())
    print(a.get('href'))

print('Hot news')
titles_b = []
titlew = ""
for i in range(1, 31):
    # Each hot-news link carries a mon attribute whose pn value is its number.
    sel_b = soup.find_all('a', mon="ct=1&a=2&c=top&pn=" + str(i))
    if sel_b:  # skip numbers that are missing from the page
        titles_b.append(sel_b[0])
for t in titles_b:
    print(t.get_text())
    print(t.get('href'))
    titlew = titlew + t.get_text() + "\n"

# Get the current date
now = time.strftime('%Y-%m-%d', time.localtime(time.time()))
# Write to a file
with open('news' + now + '.txt', 'a', encoding='utf-8') as file:
    file.write(titlew)  # titles only
While working on the crawler, you can download the web page to a local file and debug against that copy. The code is as follows:
from bs4 import BeautifulSoup

with open('local file path', encoding='utf-8') as f:
    # print(f.read())
    soup = BeautifulSoup(f, 'lxml')
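The local copy has to come from somewhere, so the sketch below shows one way to save the page once and then debug offline. It uses only the standard library; the function names save_page and fetch_and_save and the file name baidu_news.html are hypothetical, and the fallback to UTF-8 when the server does not declare a charset is an assumption.

```python
from pathlib import Path
from urllib.request import urlopen


def save_page(html: str, path: str) -> Path:
    """Write HTML text to disk as UTF-8 and return the path."""
    target = Path(path)
    target.write_text(html, encoding='utf-8')
    return target


def fetch_and_save(url: str, path: str, timeout: float = 10.0) -> Path:
    """Download url once and store its body for offline debugging."""
    with urlopen(url, timeout=timeout) as resp:
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = resp.headers.get_content_charset() or 'utf-8'
        html = resp.read().decode(charset, errors='replace')
    return save_page(html, path)


# Example call (performs one network request):
# fetch_and_save('http://news.baidu.com/', 'baidu_news.html')
```

The saved file can then be opened with encoding='utf-8' and passed to BeautifulSoup as in the snippet above, so selectors can be tuned without requesting the page again.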