Brainstorm - Selenium Web Scraping Kata

<a href="https://www.buymeacoffee.com/eduardoeljaiek" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-blue.png" alt="Buy Me A Coffee" style="height: 35px !important;width: 125px !important;" ></a>

[GitHub Repo](https://github.com/ehayik/web-scraping-kata)

***

### What is my goal or intention?

- I want to implement a proof of concept: a proxy microservice that does the web scraping and returns the response via REST.
- I want to simulate random failures in the target website to evaluate which mechanisms could overcome them, such as [[Cache Vs Buffer | caching]] and retry.

### What do I already know?

- [[What web scraping is]].
- [Jsoup](https://jsoup.org) can help speed up scraping.
- How to implement the [[Pattern - Object Pool | Object Pool pattern]].
- A Selenium *WebDriver* instance is not thread-safe.
- Each request needs its own *WebDriver* instance.
- The cost of initializing *WebDriver* instances is very high.
- A *WebDriver* object pool can help reduce that cost by keeping some instances ready to be reused.
- The Apache Commons Pool framework offers a basic and robust implementation for pooling arbitrary objects.
- I could implement two levels of [[Caching in Spring | cache]] using Redis (distributed) and [[Caffeine Cache | Caffeine]] ([[Spring - How to setup Caffeine cache provider | local]]).
- I could also use [[Spring Boot - How to use Resilience4j Retry| Resilience4j Retry]] if WebDriver doesn't have a retry mechanism out of the box.
- I can use [[Selenium - WebDriver management with WebDriverManagement | WebDriverManager]] to automate *WebDriver* management locally, but not when running [[Selenium - Running Chrome WebDriver inside docker container | inside a Docker container]].
    - It always downloads binaries compatible with the host machine, which can be a problem when running Docker on a MacBook with an M1 chip.
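
The pooling idea above can be sketched with the JDK alone. This is only a rough illustration, not the kata's implementation: `FakeDriver` is a hypothetical stand-in for a Selenium *WebDriver*, and `SimplePool` is a hypothetical hand-rolled pool that a real project would likely replace with Apache Commons Pool (for example its `GenericObjectPool`). Because each borrowed instance is handed to exactly one caller at a time, the "not thread-safe" constraint is respected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical stand-in for a WebDriver: expensive to create, used by one thread at a time.
class FakeDriver {
    static final AtomicInteger CREATED = new AtomicInteger();
    final int id = CREATED.incrementAndGet();

    String fetchTitle(String url) {
        return "title-of-" + url + " (driver " + id + ")";
    }
}

// Hypothetical fixed-size pool: all instances are created up front and reused afterwards.
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Borrows an instance, runs the task with it, and always returns it to the pool.
    <R> R withResource(Function<T, R> task, long timeoutMillis) {
        T resource;
        try {
            resource = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
        if (resource == null) {
            throw new IllegalStateException("No pooled instance became available in time");
        }
        try {
            return task.apply(resource);
        } finally {
            idle.add(resource);
        }
    }

    int idleCount() {
        return idle.size();
    }
}

class PoolDemo {
    public static void main(String[] args) {
        SimplePool<FakeDriver> pool = new SimplePool<>(2, FakeDriver::new);
        for (int i = 0; i < 5; i++) {
            String url = "https://example.com/page/" + i;
            System.out.println(pool.withResource(d -> d.fetchTitle(url), 1000));
        }
        System.out.println("drivers created: " + FakeDriver.CREATED.get());
    }
}
```

In this demo five requests are served while only two `FakeDriver` instances are ever created, which is the saving the pool is meant to provide. A production pool would also need to validate and replace broken instances, something this sketch leaves out.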

### What I don't know that I need to find out

- How to set up a Selenium web scraping Java project.
- How to implement the *PageObject* pattern.
- Whether a CAPTCHA challenge can be passed with Selenium WebDriver.
- Whether Selenium WebDriver provides a retry mechanism.
- How to set up Spring Cache with Redis.
- How to set up a Redis container.
- Could I set up multiple Spring Cache providers?

### Who can I talk to who might provide insights?

- IntelliJ AI Assistant 😍

### What can I read or listen to for relevant ideas?

- [Web Scraping in Java in 2023: The Complete Guide](https://www.zenrows.com/blog/web-scraping-java)
- [How To Perform Web Scraping with Selenium Java](https://www.lambdatest.com/blog/web-scraping-with-selenium-java/)
- [Web Scraping 101 (Using Selenium for Java)](https://galabra.medium.com/web-scraping-101-using-selenium-for-java-9d37c52ce7a2)
- [Speed up your Selenium Test with Jsoup library](https://www.linkedin.com/pulse/speed-up-your-selenium-test-jsoup-library-java-fakrudeen-shahul/)
- [Object Pool Pattern](https://java-design-patterns.com/patterns/object-pool/)
- [Pool resources using Apache's Commons Pool Framework](https://www.infoworld.com/article/2071834/pool-resources-using-apache-s-commons-pool-framework.html)
- [Guide to Spring Retry](https://www.baeldung.com/spring-retry)
- [Retry with Spring Boot and Resilience4j](https://reflectoring.io/retry-with-springboot-resilience4j/)
- [WebDriver Waiting Strategies](https://www.selenium.dev/documentation/webdriver/waits/)
- [Using Selenium WebDriver Waits as Retries in Your Tests](https://blog.testproject.io/2021/01/13/using-selenium-webdriver-waits-as-retries-in-your-selenium-tests/)
- [Using Multiple Cache Managers in Spring](https://www.baeldung.com/spring-multiple-cache-managers)
- [Spring Boot Cache with Redis](https://www.baeldung.com/spring-boot-redis-cache)
- [Page object models](https://www.selenium.dev/documentation/test_practices/encouraged/page_object_models/)
- [Implement Two-Level Cache With Spring | Baeldung](https://www.baeldung.com/spring-two-level-cache)
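
As a rough note to self on the "caching and retry" goal, here is one way the two mechanisms could combine, written with the JDK only. Everything in it is an assumption for illustration: `RetryingCachedFetcher` is a hypothetical class, the scraper is just a `Function<String, String>`, and the cache is a plain in-memory map used as a fallback once retries are exhausted. In the real project Resilience4j Retry and the Spring cache abstraction would likely take over these roles, and a real retry would normally wait between attempts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical wrapper: retries a flaky scraper, remembers the last good result per URL,
// and serves that remembered result when every attempt fails.
class RetryingCachedFetcher {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> scraper;
    private final int maxAttempts;

    RetryingCachedFetcher(Function<String, String> scraper, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        this.scraper = scraper;
        this.maxAttempts = maxAttempts;
    }

    String fetch(String url) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String result = scraper.apply(url);
                cache.put(url, result); // refresh the fallback copy on every success
                return result;
            } catch (RuntimeException e) {
                last = e; // no backoff here; a real retry would pause before the next attempt
            }
        }
        String cached = cache.get(url);
        if (cached != null) {
            return cached; // possibly stale, but better than failing the request
        }
        throw new IllegalStateException("All " + maxAttempts + " attempts failed for " + url, last);
    }
}

class RetryDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated target website: only every third call succeeds.
        Function<String, String> flaky = url -> {
            if (calls.incrementAndGet() % 3 != 0) {
                throw new IllegalStateException("simulated failure");
            }
            return "content of " + url;
        };
        RetryingCachedFetcher fetcher = new RetryingCachedFetcher(flaky, 3);
        System.out.println(fetcher.fetch("https://example.com"));
        System.out.println("scraper calls: " + calls.get());
    }
}
```

Using the cache as a fallback rather than as the first lookup is a design choice made for this sketch; a cache-first setup with expiry (closer to the two-level Redis plus Caffeine idea) would reduce load on the target site instead of only masking failures.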