Design a Web Crawler


🙋 Here are some details you should know about this question:

Should crawling be real-time or periodic?

How do you handle URL deduplication?

What system do you use to track visited vs unvisited URLs?

How do you implement a delay between requests to the same domain?

How do you prioritize pages (e.g. news sites vs. low-priority content)?

Should you support re-crawling? How do you schedule that?
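
One way to make the questions above concrete is a small in-memory URL frontier. The sketch below is only illustrative, not a definitive design: the names `normalize`, `Frontier`, `add`, and `pop`, the default priority of 10, and the one-second delay are all assumptions made for this example. It shows deduplication through a normalized "seen" set, a priority heap (lower number means crawl sooner), a per-host delay, and re-crawling via a `not_before` timestamp. A real crawler would likely replace the in-memory set and heap with persistent or distributed storage.

```python
import heapq
import itertools
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different spellings dedupe together."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class Frontier:
    """Hypothetical in-memory frontier: dedup, priority, politeness, re-crawl."""

    def __init__(self, delay_seconds: float = 1.0):
        self.delay = delay_seconds
        self.seen = set()          # every URL ever enqueued (pending or visited)
        self.heap = []             # entries: (priority, seq, not_before, url)
        self.next_allowed = {}     # host -> earliest time of the next request
        self.counter = itertools.count()  # tie-breaker keeps FIFO order

    def add(self, url, priority=10, not_before=0.0, recrawl=False):
        """Enqueue a URL; returns False if it was already seen (unless recrawl)."""
        url = normalize(url)
        if url in self.seen and not recrawl:
            return False
        self.seen.add(url)
        heapq.heappush(self.heap, (priority, next(self.counter), not_before, url))
        return True

    def pop(self, now):
        """Return the best URL that may be fetched at time `now`, or None."""
        deferred = []
        result = None
        while self.heap:
            entry = heapq.heappop(self.heap)
            _, _, not_before, url = entry
            host = urlsplit(url).netloc
            if not_before <= now and self.next_allowed.get(host, 0.0) <= now:
                # Reserve the host so the next request waits `delay` seconds.
                self.next_allowed[host] = now + self.delay
                result = url
                break
            deferred.append(entry)  # not ready yet; try a lower-priority URL
        for entry in deferred:
            heapq.heappush(self.heap, entry)
        return result


frontier = Frontier(delay_seconds=2.0)
frontier.add("HTTP://Example.com/a#top", priority=5)
frontier.add("http://example.com/a")             # duplicate after normalization
frontier.add("http://example.com/b", priority=1)
frontier.add("http://news.example.org/", priority=0)
print(frontier.pop(now=0.0))  # highest priority, host is free
print(frontier.pop(now=0.0))  # next priority on a different host
print(frontier.pop(now=0.0))  # None: example.com is still in its delay window
```

Passing `now` explicitly keeps the sketch deterministic; a real crawler would use a clock such as `time.monotonic()`. Re-crawling here is just another `add` call with `recrawl=True` and a future `not_before`, which leaves open the harder question of how to choose the interval per page.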

