[GitHub Repo](https://github.com/ehayik/web-scraping-kata)
***
### What is my goal or intention?
- I want to implement a proof of concept: a proxy microservice that does the web scraping and returns the result via REST.
- I want to simulate random failures in the target website to evaluate which mechanisms, such as [[Cache Vs Buffer | caching]] and retries, could be used to overcome them.
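
To make the second goal concrete, here is a minimal sketch of retrying a flaky fetch and falling back to the last good response. It is plain Java with no Spring or Resilience4j; `ResilientFetcher`, its `fetch` function, and `maxAttempts` are hypothetical names used only for illustration, and the in-memory map stands in for a real cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical wrapper: retries a flaky fetch, then falls back to the last cached value. */
final class ResilientFetcher {
    private final Function<String, String> fetch; // stand-in for the real scraping call
    private final int maxAttempts;
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    ResilientFetcher(Function<String, String> fetch, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.fetch = fetch;
        this.maxAttempts = maxAttempts;
    }

    String get(String url) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String body = fetch.apply(url);
                cache.put(url, body); // remember the last good response
                return body;
            } catch (RuntimeException e) {
                last = e; // treated as a transient target-site failure; try again
            }
        }
        String cached = cache.get(url);
        if (cached != null) {
            return cached; // serve stale data rather than fail the request
        }
        throw last; // no attempt succeeded and nothing was cached
    }
}
```

The sketch retries immediately; a real setup would likely add a wait between attempts and an expiry on cached entries, which is what the retry and cache libraries listed below are for.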
### What do I already know?
- [[What web scraping is]].
- [Jsoup](https://jsoup.org) can help speed up HTML parsing.
- How to implement the [[Pattern - Object Pool | Object Pool pattern]].
- A Selenium *WebDriver* instance is not thread-safe.
- Each request needs its own *WebDriver* instance.
- The cost of initializing *WebDriver* instances is very high.
- A *WebDriver* object pool can reduce that cost by keeping some instances ready for reuse.
- The Apache Commons Pool framework offers a basic and robust implementation for pooling arbitrary objects.
- I could implement two levels of [[Caching in Spring | cache]] using Redis (distributed) and [[Caffeine Cache | Caffeine]] ([[Spring - How to setup Caffeine cache provider | local]]).
- I could also use [[Spring Boot - How to use Resilience4j Retry| Resilience4j Retry]] if WebDriver doesn't have a retry mechanism out of the box.
- I can use [[Selenium - WebDriver management with WebDriverManagement | WebDriverManager]] to automate *WebDriver* management locally, but not when running [[Selenium - Running Chrome WebDriver inside docker container | inside a Docker container]].
- It always downloads binaries compatible with the host machine, which can be a problem when running Docker on a MacBook with an M1 chip.
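
The pooling idea above can be sketched with the JDK alone. This is not the Apache Commons Pool API; `SimplePool`, `withResource`, and `idleCount` are hypothetical names, and the type parameter `T` stands in for a *WebDriver*. The point is only to show the borrow/return cycle that keeps each instance on one thread at a time.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; T would be a WebDriver in the real service. */
final class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the expensive creation cost once, up front
        }
    }

    /** Borrows an instance, runs the task, and always returns the instance to the pool. */
    <R> R withResource(long timeoutMillis, Function<T, R> task) {
        T resource;
        try {
            resource = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
        if (resource == null) {
            throw new IllegalStateException("no pooled instance available within timeout");
        }
        try {
            return task.apply(resource); // only the borrowing thread touches the instance
        } finally {
            idle.add(resource); // hand it back even if the task throws
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

A request handler would call something like `pool.withResource(5_000, driver -> scrape(driver, url))`, where `scrape` is a placeholder for the page-scraping logic. A production pool would also need to validate or replace instances that have crashed, which this sketch does not do.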
### What I don't know that I need to find out
- How to set up a Selenium web scraping Java project.
- How to implement the *PageObject* pattern.
- Whether a CAPTCHA challenge can be passed with Selenium WebDriver.
- Whether Selenium WebDriver provides a retry mechanism.
- How to set up Spring Cache with Redis.
- How to set up a Redis container.
- Whether I can set up multiple Spring Cache providers.
### Who can I talk to who might provide insights?
- IntelliJ AI Assistant 😍
### What can I read or listen to for relevant ideas?
- [Web Scraping in Java in 2023: The Complete Guide](https://www.zenrows.com/blog/web-scraping-java)
- [How To Perform Web Scraping with Selenium Java](https://www.lambdatest.com/blog/web-scraping-with-selenium-java/)
- [Web Scraping 101 (Using Selenium for Java)](https://galabra.medium.com/web-scraping-101-using-selenium-for-java-9d37c52ce7a2)
- [Speed up your Selenium Test with Jsoup library](https://www.linkedin.com/pulse/speed-up-your-selenium-test-jsoup-library-java-fakrudeen-shahul/)
- [Object Pool Pattern](https://java-design-patterns.com/patterns/object-pool/)
- [Pool resources using Apache's Commons Pool Framework](https://www.infoworld.com/article/2071834/pool-resources-using-apache-s-commons-pool-framework.html)
- [Guide to Spring Retry](https://www.baeldung.com/spring-retry)
- [Retry with Spring Boot and Resilience4j](https://reflectoring.io/retry-with-springboot-resilience4j/)
- [WebDriver Waiting Strategies](https://www.selenium.dev/documentation/webdriver/waits/)
- [Using Selenium WebDriver Waits as Retries in Your Tests](https://blog.testproject.io/2021/01/13/using-selenium-webdriver-waits-as-retries-in-your-selenium-tests/)
- [Using Multiple Cache Managers in Spring](https://www.baeldung.com/spring-multiple-cache-managers)
- [Spring Boot Cache with Redis](https://www.baeldung.com/spring-boot-redis-cache)
- [Page object models](https://www.selenium.dev/documentation/test_practices/encouraged/page_object_models/)
- [Implement Two-Level Cache With Spring | Baeldung](https://www.baeldung.com/spring-two-level-cache)