Instead of placing multiple sites on this page, I’ve decided to show only one, a more complex project.
Part of the project work was done in 2016. It was a site selling products from a single category, “dedicated GPS for autos”, with products from two manufacturers. Most of the SEO work was done then. A few months ago the client called and asked me to add more products and to mark as out of stock all items belonging to one of those two manufacturers. He needed an automatic solution to update prices and add products according to the manufacturer’s site.
This video doesn’t contain theme customization.
We’ll use WordPress, WP XML-RPC, PHP, Python, MongoDB, Bash scripting, Docker, AutoHotkey, iMacros and SQLite to:
- import all existing info into a local development setup, keeping post dates. Also import it into MongoDB
- modify certain products and product attributes. Keep only one manufacturer
- import some additional info (links, SKUs) from a CSV file
- make 2 scrapers: 1 crawler and 1 spider. The crawler will run once per week and is responsible for finding new products and assigning the correct product categories. It must take into account that a product can exist in multiple categories, so it needs to build 2 collections: 1 with product URLs (filtered, no dupes) and 1 with product categories (this one will contain dupes). The spider will run every 2 days, retrieve product info and check for updates
- a PHP updater and poster will run every day and is responsible for updating the site and MongoDB. If a product is already in WP or the database, it checks the price; if the price is the same, it moves on. If the product is not in the database or on the site, it imports it. If the manufacturer changed the price, it updates it accordingly
- an ad poster
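The updater's decision rule from the list above can be sketched in a few lines. This is a minimal illustration only: the real updater is written in PHP, and the function name `decide_action` and the three action labels are hypothetical, chosen for this example.

```python
from decimal import Decimal
from typing import Optional


def decide_action(scraped_price: Decimal, stored_price: Optional[Decimal]) -> str:
    """Decide what the daily updater should do with one scraped product.

    stored_price is None when the product exists neither on the site
    nor in the database (assumption: one lookup covers both).
    """
    if stored_price is None:
        return "import"   # new product: create it
    if scraped_price == stored_price:
        return "skip"     # same price: move on
    return "update"       # manufacturer changed the price


print(decide_action(Decimal("199.00"), None))               # import
print(decide_action(Decimal("199.00"), Decimal("199.00")))  # skip
print(decide_action(Decimal("179.00"), Decimal("199.00")))  # update
```

Prices are compared as `Decimal` rather than floats so that equal prices always compare as equal.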
The scraper will reside on a separate PC, in this case a Dell workstation, and will run 24/7. The setup looks like this:
– PHP 5.6, nginx, extended WP XML-RPC so it can accept custom methods from the poster. It is hosted on a dynamic IP.
– Docker, base OS Ubuntu (3 important containers: Python with Tor for the spider and crawler, PHP 7.1 for the poster and updater, and Mongo for the database) + some bash scripts and crons
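As an illustration of what the crawler in the Python container produces, here is a minimal sketch of splitting crawl results into the two collections mentioned earlier: unique product URLs, and URL/category pairs that keep duplicates. The function name, the dictionary keys and the example URLs are hypothetical, and plain lists stand in for the Mongo collections.

```python
def split_crawl_results(pairs):
    """Split (product_url, category) pairs found while crawling.

    Returns two lists:
      - product_urls: each URL once, in first-seen order
      - product_categories: one entry per pair, so a product listed
        in several categories appears several times
    """
    product_urls = []
    product_categories = []
    seen = set()
    for url, category in pairs:
        product_categories.append({"url": url, "category": category})
        if url not in seen:
            seen.add(url)
            product_urls.append(url)
    return product_urls, product_categories


found = [
    ("https://example.com/p/gps-a", "BMW"),
    ("https://example.com/p/gps-a", "Audi"),  # same product, second category
    ("https://example.com/p/gps-b", "Audi"),
]
urls, categories = split_crawl_results(found)
print(len(urls), len(categories))  # 2 3
```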
Any USB stick (minimum 512 MB) to hold the ad poster:
– It holds portable Firefox, iMacros, AutoHotkey and SQLite. It needs Windows XP or later to run. The poster can be improved with Tesseract OCR, for example to break simple captchas, but for now this will have to be enough
WTServer with PHP 5.6 holds 3 sites, IP 18.104.22.168:
– navigatia.io (main site clone): this is the site that we’re pulling data from,
– navigatia.vv: the main site, we will post info here too, and it will be exported as the main site
– navigatia.uo: made this for two reasons:
1. To see if it works on other themes
2. To post to more than 1 site. Maybe we will need to make more clones, or make 10 sites, each for a specific category
The Oracle VM contains the scraper, poster and Mongo database, IP 192.168.1.104, with PHP 7.1 and Python 2.6
I tried to mimic the actual site and scraper conditions.
Copying the old site: if it were only new content it would have been much simpler, but we need to get the old data and retain the dates while we do it.
- Couldn’t just dump the MariaDB database because it was too large (I was using some plugins that kept adding useless data to my db with every new post). I did not find any ready-made plugin that copies Yoast data and keeps the published date for all post types. I thought of making one, but I already needed XML-RPC for the updater, so..
- I needed to keep the published date, especially for images, because the site was already indexed. Example: an old image URL /2016/03/bmw.jpg will by default be uploaded to /2018/06/bmw.jpg, which of course means 404s from Google. I suppose we could do 301 redirects for images, but we would have to repeat that for future sites.
- Product categories go 3 levels deep, so you need to walk up recursively to discover the parents. Example ids: 257 has parent 100, 100 has parent 53, and 53’s parent is 0
- Products had cross-sells, so we need to keep the old ids and find the cross-selling products
- We didn’t have variable products, which made the script easier. If we did, I would have used the WooCommerce REST API
- Not all scraped products had the same price in meta as the displayed price
- All images needed optimization
- Some products had unwanted words
- Some products offered service in the whole country, while the client has only a local service
- Some products were combined
- Some products had a different image in meta than on the displayed page
- Some existing products needed their price and stock changed
- The trickiest part was mapping the manufacturer’s categories to ours. Ids are, after all, variables. If tomorrow I decide to do it again and skip a single step, the categories will not have the same ids. I needed to search by name, but for some brands we have more than one result in the db.
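The parent lookup for nested product categories can be sketched as below, using the example ids from the list (257, 100, 53). This is an illustration under assumptions, not the original script: the function name and the `parent_of` mapping are hypothetical, and a loop is used instead of recursion, which gives the same result for a simple chain.

```python
def ancestor_chain(term_id, parent_of):
    """Return the category and all its ancestors, child first.

    parent_of maps a category id to its parent id; 0 means "no parent",
    as in WordPress taxonomies.
    """
    chain = []
    seen = set()
    current = term_id
    while current != 0:
        if current in seen:
            # Guard against corrupted data where parents form a loop.
            raise ValueError("cycle in category parents at id %d" % current)
        seen.add(current)
        chain.append(current)
        current = parent_of[current]
    return chain


parents = {257: 100, 100: 53, 53: 0}
print(ancestor_chain(257, parents))  # [257, 100, 53]
```

Reversing the returned list gives the order in which the categories have to be created on a fresh site, top-level parent first.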
Statistics: lines written (including comments and empty lines, excluding external libraries):
– PHP: 6573 lines, of which 511 are settings
– Python: 1207
– AutoHotkey: 2040 (this also generates the iMacros lines)
– WP extending XML-RPC: 935
– Bash, .yml, Dockerfile, .conf: 1022
Libraries and tools used, and a big thanks for their work:
PHP, Python, Scrapy, AutoHotkey, iMacros, MongoDB, MariaDB, RoboMongo, Nginx, SQLite, Docker, Ubuntu, Oracle VirtualBox, WTServer, Atom, Notepad++, Firefox Portable, OBS, Yoast, WordPress, WooCommerce, WP Super Cache and of course Stack Overflow and Google