告発＼金沢地方検察庁＼最高検察庁＼法務省＼石川県警察御中2020

＊ Scraping the URL and title of the latest Blogger post with Python's beautifulsoup4

python scraping

:CATEGORIES: python,scraping

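import re

import requests
from bs4 import BeautifulSoup

# Fetch the blog's top page and parse it.
load_url = "https://kk2020-09.blogspot.com/"
html = requests.get(load_url)
soup = BeautifulSoup(html.content, "html.parser")

# The first h3 with class "post-title" holds the link to the latest post.
# attrs takes a dict that maps attribute names to values.
t = str(soup.find("h3", attrs={"class": "post-title"}))

# Pull the href value and the link text out of the stringified tag.
m = re.findall('<a href="(.+)">(.+)<.+', t)[0]
url = m[0]
# Drop newlines and full-width spaces (U+3000) from the title.
title = m[1].replace('\n', '').replace('\u3000', '')
print("{0} {1}".format(url, title))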

Output:
https://kk2020-09.blogspot.com/2020/09/keitaadachi.html ＼弁護士足立敬太 @アレクサ六甲おろしかけて@keita_adachi＼接見報酬ですか・・・とはいえ前はカジュアルに再逮捕してたのに今はしなくなったのは人権要請ではな
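
As an aside, the same two values can probably be read without a regular expression, because BeautifulSoup exposes a tag's attributes and text directly. The following is only a minimal sketch run on an inline HTML sample. The sample markup, the example.blogspot.com URL, and the function name `latest_post` are assumptions made for illustration, not taken from the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a Blogger post heading (an assumption).
SAMPLE_HTML = """
<h3 class='post-title entry-title'>
<a href='https://example.blogspot.com/2020/09/sample.html'>Sample\u3000title</a>
</h3>
"""


def latest_post(html):
    """Return (url, title) of the first h3.post-title link, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h3 = soup.find("h3", class_="post-title")
    if h3 is None or h3.a is None:
        return None
    # Attribute values are available with dictionary-style access.
    url = h3.a["href"]
    # get_text() returns the link text; remove newlines and U+3000.
    title = h3.a.get_text().replace("\n", "").replace("\u3000", "")
    return url, title


if __name__ == "__main__":
    print("{0} {1}".format(*latest_post(SAMPLE_HTML)))
```

Because nothing is converted to a string first, this variant does not depend on how the tag happens to be serialized (quote style, attribute order).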

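Until now I have scraped web pages with Ruby's nokogiri, but this time I tried it in Python. Compared with nokogiri it is less convenient to use, and on top of that, converting the result to a string left an odd symbol, \u3000 (the escape for the full-width space), in the text.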
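
For reference, `\u3000` is the Python escape for U+3000, the full-width (ideographic) space commonly used in Japanese text, so it is ordinary whitespace rather than corruption. Below is a small sketch of three ways to deal with it, using only the standard library. The sample string and the function names are made up for illustration.

```python
import unicodedata

# Made-up sample containing a full-width space and a newline.
raw = "Python\u3000scraping\nmemo"


def drop_fullwidth_space(text):
    """Delete newlines and U+3000 outright."""
    return text.replace("\n", "").replace("\u3000", "")


def to_ascii_space(text):
    """NFKC normalization maps U+3000 to an ordinary space (U+0020)."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text):
    """str.split() treats U+3000 as whitespace, so runs collapse to one space."""
    return " ".join(text.split())


if __name__ == "__main__":
    print(repr(raw))
    print(drop_fullwidth_space(raw))
    print(repr(to_ascii_space(raw)))
    print(collapse_whitespace(raw))
```

Note that NFKC also rewrites other compatibility characters, such as full-width letters and digits, so `collapse_whitespace` may be the gentler choice when only the spacing should change.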

String manipulation is also considerably more tedious than in Ruby. It is a hassle, but I also feel it is deepening my understanding of programming.

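On Blogger, it seems that only the single latest post has a title that can be retrieved properly. The link list in the sidebar could be retrieved with soup.find(class_="posts"), but its title strings were truncated.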
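
One possible reason only a single title came back is that `find` returns just the first matching element, whereas `find_all` returns every match. The sketch below runs on hypothetical markup: two post headings plus a sidebar list with class `posts`. Whether the real Blogger page is laid out this way is an assumption, and the helper name `list_links` is invented here.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the real Blogger markup may differ.
SAMPLE_HTML = """
<h3 class="post-title"><a href="https://example.blogspot.com/2020/09/b.html">Second post</a></h3>
<h3 class="post-title"><a href="https://example.blogspot.com/2020/09/a.html">First post</a></h3>
<ul class="posts">
  <li><a href="https://example.blogspot.com/2020/09/b.html">Second po...</a></li>
  <li><a href="https://example.blogspot.com/2020/09/a.html">First po...</a></li>
</ul>
"""


def list_links(container):
    """Return (url, text) for every <a href=...> inside the given tag."""
    return [(a["href"], a.get_text(strip=True))
            for a in container.find_all("a", href=True)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching heading, not only the first one.
headings = [list_links(h3)[0] for h3 in soup.find_all("h3", class_="post-title")]

# The sidebar list, located by its class name.
sidebar = list_links(soup.find(class_="posts"))

if __name__ == "__main__":
    for url, title in headings + sidebar:
        print("{0} {1}".format(url, title))
```

Even when the sidebar text is shortened, the `href` values are presumably complete, so each linked page could be fetched in turn to read its full title.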