Crawlee for Python – Python Nagoya

It helps you build reliable Python web crawlers. Fast.

Crawlee for Python · Fast, reliable Python web crawlers.

Crawlee helps you build and maintain your Python crawlers. It’s open source and modern, with type hints for Python to help you catch bugs early.

Here is an overview of the basic usage and workflow of Crawlee for Python:

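⚙️ Setup and installation

Python 3.9 or later is required.
To install Crawlee together with its extras, including BeautifulSoup and Playwright, run: python -m pip install 'crawlee[all]' (crawlee.dev)
If you use Playwright, you also need to download the browser binaries with: playwright install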
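
After installing, a quick sanity check can confirm the interpreter version and that the packages are importable. The following is a minimal sketch using only the standard library. The helper name check_environment is made up for illustration, and the 3.9 minimum is simply the figure quoted above; check the documentation of the Crawlee release you install, since newer releases may require a later Python.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version quoted in this article (assumption)


def check_environment(packages=("crawlee", "bs4", "playwright")):
    """Return a report of the Python version check and package availability."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'OK' if ok else 'missing'}")
```

Note that this only checks the Python packages. It cannot tell whether the Playwright browser binaries have been downloaded.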

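1. Choosing a crawler

Crawlee provides the following crawlers. Pick one according to your use case:

BeautifulSoupCrawler: plain HTTP requests plus HTML parsing. Fast, but does not execute JavaScript. (crawlee.dev)
ParselCrawler: a good fit if you prefer working with CSS selectors (Parsel also supports XPath). As fast as BeautifulSoupCrawler. (crawlee.dev)
PlaywrightCrawler: drives a headless browser (Chromium or Firefox) and therefore handles JavaScript. Intended for dynamic content. (crawlee.dev)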

2. A simple crawler example

Example using BeautifulSoupCrawler

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request that has no more specific route.
    @crawler.router.default_handler
    async def handler(ctx: BeautifulSoupCrawlingContext) -> None:
        url = ctx.request.url
        # ctx.soup is the parsed page; fall back to '' when there is no <title>.
        title = ctx.soup.title.string if ctx.soup.title else ''
        ctx.log.info(f'Title of {url}: {title}')

    # Start the crawl from a single URL.
    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())

Passing URLs to run(...) automatically sets up a RequestQueue and configures concurrent processing. (crawlee.dev)
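
To make the idea of a request queue with concurrent processing more concrete, here is a toy model built only on asyncio. It is not how Crawlee is implemented. The names crawl, fake_handler and max_concurrency are hypothetical, and the handler does no network access. It only shows the general pattern: URLs are deduplicated, placed in a queue, and drained by a fixed number of workers.

```python
import asyncio


async def crawl(start_urls, handler, max_concurrency=3):
    """Toy model: a deduplicating queue drained by a fixed pool of workers."""
    queue: asyncio.Queue = asyncio.Queue()
    seen = set()

    def enqueue(url):
        # Each URL is queued at most once.
        if url not in seen:
            seen.add(url)
            queue.put_nowait(url)

    async def worker():
        while True:
            url = await queue.get()
            try:
                # The handler may return newly discovered URLs to enqueue.
                for found in await handler(url) or []:
                    enqueue(found)
            finally:
                queue.task_done()

    for url in start_urls:
        enqueue(url)

    workers = [asyncio.create_task(worker()) for _ in range(max_concurrency)]
    await queue.join()  # wait until every queued URL has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return seen


async def fake_handler(url):
    # Stand-in for a real request handler; no network access happens here.
    await asyncio.sleep(0)
    return ["https://example.com/a"] if url == "https://example.com/" else []


if __name__ == "__main__":
    start = ["https://example.com/", "https://example.com/"]
    visited = asyncio.run(crawl(start, fake_handler))
    print(sorted(visited))
```

The duplicate start URL is handled only once, and the link returned by the handler is picked up by whichever worker is free.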

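3. Crawling links (enqueue_links)

To collect links dynamically as the crawl proceeds, use enqueue_links:

await ctx.enqueue_links(selector='.collection-block-item', label='CATEGORY')

Targeting specific elements with a selector lets you exclude unwanted links and keep the crawl under control. (crawlee.dev)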
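
The effect of restricting links by selector and tagging them with a label can be illustrated without Crawlee. The sketch below uses only the standard library and is a simplification: it matches a single class name on <a> tags rather than a full CSS selector, and it assumes that .collection-block-item is applied to the anchor elements themselves. The sample HTML, the shop.example domain and the helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and self.css_class in classes:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def select_links(html, base_url, css_class, label):
    """Return (url, label) pairs for the matching links only."""
    collector = ClassLinkCollector(base_url, css_class)
    collector.feed(html)
    return [(url, label) for url in collector.links]


SAMPLE_HTML = """
<nav><a href="/about">About</a></nav>
<a class="collection-block-item" href="/collections/tv">TVs</a>
<a class="collection-block-item" href="/collections/audio">Audio</a>
"""

if __name__ == "__main__":
    pairs = select_links(
        SAMPLE_HTML, "https://shop.example/", "collection-block-item", "CATEGORY"
    )
    for url, label in pairs:
        print(label, url)
```

The navigation link is ignored, and each remaining URL carries the CATEGORY label. In Crawlee, such a label is what lets a request be routed to a handler registered for that label instead of the default handler.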

4. Storage and saving data

RequestQueue: manages and persists the URLs still to be visited (locally or on Apify). (crawlee.dev)
Dataset / KeyValueStore: available as destinations for scraped results. (crawlee.dev)
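
The difference between the two result stores is easier to see in a small model. The classes below are hypothetical stand-ins written with the standard library, not Crawlee's storage classes. The method names are only chosen to resemble Crawlee's, and the on-disk layout is an assumption made for the demo. The point is the contrast: a dataset is an append-only list of records, while a key-value store holds one value per key.

```python
import json
import tempfile
from pathlib import Path


class ToyDataset:
    """Append-only list of records, stored as one JSON file per record."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def push_data(self, record: dict) -> None:
        index = len(list(self.directory.glob("*.json"))) + 1
        path = self.directory / f"{index:09d}.json"
        path.write_text(json.dumps(record), encoding="utf-8")

    def get_data(self) -> list:
        files = sorted(self.directory.glob("*.json"))
        return [json.loads(f.read_text(encoding="utf-8")) for f in files]


class ToyKeyValueStore:
    """Named values: writing to an existing key overwrites it."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def set_value(self, key: str, value) -> None:
        path = self.directory / f"{key}.json"
        path.write_text(json.dumps(value), encoding="utf-8")

    def get_value(self, key: str, default=None):
        path = self.directory / f"{key}.json"
        if not path.exists():
            return default
        return json.loads(path.read_text(encoding="utf-8"))


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        dataset = ToyDataset(Path(tmp) / "datasets" / "default")
        store = ToyKeyValueStore(Path(tmp) / "key_value_stores" / "default")
        dataset.push_data({"url": "https://example.com/", "title": "Example"})
        dataset.push_data({"url": "https://example.com/a", "title": "A"})
        store.set_value("last_run", {"pages": 1})
        store.set_value("last_run", {"pages": 2})  # overwrites the previous value
        return dataset.get_data(), store.get_value("last_run")


if __name__ == "__main__":
    print(demo())
```

A dataset suits one record per scraped page, whereas a key-value store suits single named items such as run state or a downloaded file.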

5. Handling dynamic pages with PlaywrightCrawler

PlaywrightCrawler can also deal with JavaScript and cookies.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Run the browser without a visible window.
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def handler(ctx: PlaywrightCrawlingContext) -> None:
        # ctx.page is the Playwright page, available after the page has loaded.
        title = await ctx.page.title()
        await ctx.enqueue_links()  # also collect the links found on the page
        await ctx.push_data({'url': ctx.request.url, 'title': title})

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())

Because the DOM is processed after JavaScript has run, this approach can also handle dynamic content, pop-ups, and similar cases. (crawlee.dev)

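6. Advanced examples and practical guides

Handling infinite-scroll pages and automatically clicking through cookie banners is also possible.
For sites built with Next.js and similar frameworks, you can also call the underlying API directly, which makes scraping fast without running JavaScript.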
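
As an illustration of the direct-API idea, the sketch below reads the JSON payload that Next.js sites using the pages router embed in a __NEXT_DATA__ script tag, and builds the matching /_next/data/<buildId>/<page>.json URL. It uses only the standard library. The sample HTML, the shop.example domain and the helper names are hypothetical, and a given site may not expose this data at all (for example, if it uses a different routing mode), so treat it as a starting point rather than a general recipe.

```python
import json
import re
from urllib.parse import urljoin

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> dict:
    """Parse the JSON payload embedded in the page's __NEXT_DATA__ script."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script found")
    return json.loads(match.group(1))


def next_data_url(base_url: str, build_id: str, page_path: str) -> str:
    """Build the /_next/data/<buildId>/<page>.json URL for a page path."""
    page = page_path.strip("/") or "index"
    return urljoin(base_url, f"/_next/data/{build_id}/{page}.json")


# Hypothetical page source, shortened for the example.
SAMPLE_HTML = """
<html><body><div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"products": [{"name": "Widget", "price": 9.5}]}},
 "page": "/products", "buildId": "abc123"}
</script></body></html>
"""

if __name__ == "__main__":
    data = extract_next_data(SAMPLE_HTML)
    print(data["props"]["pageProps"]["products"])
    print(next_data_url("https://shop.example", data["buildId"], data["page"]))
```

Once the JSON URL is known, it can be fetched with a plain HTTP crawler such as BeautifulSoupCrawler or ParselCrawler, so no browser is needed for those pages.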

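✅ Summary

Install the package and make sure Python 3.9+ is available
Choose a crawler that matches your needs (HTTP or browser)
Start with run(), then extract data, save it, and follow links
Extend the crawler for real-world needs such as dynamic pages, infinite scroll, and direct API calls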
