📕 Here are some good solutions we found for this question:
Good solution: https://www.hellointerview.com/learn/system-design/problem-breakdowns/web-crawler
Should crawling be real-time or periodic?
How do you handle URL deduplication?
What system do you use to track visited vs unvisited URLs?
How do you implement a delay between requests to the same domain?
How do you prioritize pages (e.g., news sites vs. low-priority content)?
Should you support re-crawling? How do you schedule that?
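
One possible way to reason about the deduplication, per-domain delay, and prioritization questions together is a small in-memory URL frontier. The sketch below is only an illustration under stated assumptions: the `Frontier` class, the `normalize` helper, the fixed `delay_seconds`, and the integer priority scheme (lower means sooner) are all hypothetical names and choices, not taken from the linked solution. A real crawler would likely back the seen set and queue with a distributed store and honor robots.txt.

```python
import heapq
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings dedupe together."""
    parts = urlsplit(url)
    # Lowercase scheme and host, drop the fragment, default an empty path to "/".
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class Frontier:
    """Hypothetical single-process frontier: dedup + priority + per-host delay."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.seen = set()        # normalized URLs ever enqueued (pending or visited)
        self.heap = []           # (priority, seq, url); lower priority is served first
        self.next_allowed = {}   # host -> earliest time the next request may be sent
        self.seq = 0             # tie-breaker so equal priorities stay FIFO

    def add(self, url: str, priority: int = 10) -> bool:
        """Enqueue a URL unless an equivalent one was already seen."""
        key = normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        heapq.heappush(self.heap, (priority, self.seq, key))
        self.seq += 1
        return True

    def pop_ready(self, now: float):
        """Return the best-priority URL whose host is outside its delay window."""
        deferred = []
        result = None
        while self.heap:
            item = heapq.heappop(self.heap)
            host = urlsplit(item[2]).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                # Reserve the host until the delay has elapsed.
                self.next_allowed[host] = now + self.delay
                result = item[2]
                break
            deferred.append(item)
        # Put back everything that was skipped because its host is cooling down.
        for item in deferred:
            heapq.heappush(self.heap, item)
        return result
```

In this sketch, re-crawling could be approximated by removing a URL from `seen` and re-adding it once its refresh interval has passed; how that interval is chosen is left open, as in the question above.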