<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/" >

<channel>
	<title>Reviews Scraping &#8211; Dakidarts® Hub</title>
	<atom:link href="https://hub.dakidarts.com/tag/reviews-scraping/feed/" rel="self" type="application/rss+xml" />
	<link>https://hub.dakidarts.com</link>
	<description>Where creativity meets innovation.</description>
	<lastBuildDate>Mon, 11 Mar 2024 11:38:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://cdn.dakidarts.com/image/dakidarts-dws.svg</url>
	<title>Reviews Scraping &#8211; Dakidarts® Hub</title>
	<link>https://hub.dakidarts.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python</title>
		<link>https://hub.dakidarts.com/how-to-scrape-with-selenium-automate-aliexpress-reviews-scraping-with-python/</link>
					<comments>https://hub.dakidarts.com/how-to-scrape-with-selenium-automate-aliexpress-reviews-scraping-with-python/#respond</comments>
		
		<dc:creator><![CDATA[Dakidarts]]></dc:creator>
		<pubDate>Mon, 11 Mar 2024 09:39:24 +0000</pubDate>
				<category><![CDATA[How To 👨‍🏫]]></category>
		<category><![CDATA[Coding 👨‍💻]]></category>
		<category><![CDATA[Python 🪄]]></category>
		<category><![CDATA[AliExpress]]></category>
		<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Python Automation]]></category>
		<category><![CDATA[Reviews Scraping]]></category>
		<category><![CDATA[Selenium]]></category>
		<category><![CDATA[Web Scraping]]></category>
		<guid isPermaLink="false">https://hub.dakidarts.com/?p=5500</guid>

					<description><![CDATA[Welcome to the world of automated web scraping, where Python, Selenium, and a dash of magic come together to simplify the process of extracting valuable data. In this comprehensive tutorial, we'll embark on a journey to automate AliExpress reviews scraping, demystifying the intricacies of Selenium and guiding you through each step with clear explanations and practical code snippets.]]></description>
										<content:encoded><![CDATA[<div class="wpb-content-wrapper"><div class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_video_widget wpb_content_element vc_clearfix   vc_video-aspect-ratio-169 vc_video-el-width-100 vc_video-align-left" >
		<div class="wpb_wrapper">
			
			<div class="wpb_video_wrapper"><iframe title="Ali-Woo Reviews Scraper Live Demo &#x1fa84;" width="500" height="281" src="https://www.youtube.com/embed/HtOowgYIEQ8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe></div>
		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<p>Unlock the power of data with our step-by-step guide on how to scrape with Selenium, scraping AliExpress reviews using Python alongside BeautifulSoup. Learn to navigate through the complexities of web pages, gracefully handle errors, and extract invaluable insights effortlessly. Whether you&#8217;re a seasoned developer or a curious explorer, this tutorial promises an engaging dive into the world of automated web scraping, equipping you with the skills to gather and analyze AliExpress reviews like a pro.</p>
<p>Read on to discover the secrets of Selenium, the art of parsing with BeautifulSoup, and the joy of automating your AliExpress reviews scraping journey. Let&#8217;s turn the mundane into the extraordinary and transform your Python skills into a force of automation. The AliExpress reviews treasure trove awaits – are you ready to unearth it?</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#setting-up-the-environment">Setting Up the Environment</a>
<ul>
<li>2.1 <a href="#setting-up-the-environment">Importing Libraries</a></li>
<li>2.2 <a href="#setting-up-the-environment">Creating a WebDriver</a></li>
</ul>
</li>
<li><a href="#fetching-html-content">Fetching HTML Content</a>
<ul>
<li>3.1 <a href="#fetching-html-content">Defining the Function</a></li>
<li>3.2 <a href="#fetching-html-content">Handling Errors</a></li>
</ul>
</li>
<li><a href="#parsing-reviews-with-beautifulsoup">Parsing Reviews with BeautifulSoup</a>
<ul>
<li>4.1 <a href="#parsing-reviews-with-beautifulsoup" target="_new" rel="noopener">Review Element Structure</a></li>
<li>4.2 <a href="#parsing-reviews-with-beautifulsoup">Extracting Review Data</a></li>
</ul>
</li>
<li><a href="#saving-data-to-csv">Saving Data to CSV</a>
<ul>
<li>5.1 <a href="#saving-data-to-csv">Successful Reviews</a></li>
<li>5.2 <a href="#saving-data-to-csv">Skipped Reviews</a></li>
</ul>
</li>
<li><a href="#automating-the-process">Automating the Process</a>
<ul>
<li>6.1 <a href="#automating-the-process">Reading Product List from CSV</a></li>
<li>6.2 <a href="#automating-the-process" target="_new" rel="noopener">Scraping Reviews for Multiple Products</a></li>
</ul>
</li>
<li><a href="#usage">Usage</a></li>
</ol>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="introduction" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="introduction">Introduction</h2>
<p>Have you ever wished for a magical wand to fetch AliExpress reviews effortlessly? Well, say hello to Python, Selenium, and our step-by-step guide! In this blog post, we&#8217;re about to embark on an exciting adventure where coding meets commerce, and automation becomes your trusty sidekick.</p>
<p>Imagine a world where you can gather AliExpress reviews without the monotony of manual labor. Picture yourself sipping coffee while Python scripts do the heavy lifting for you. Intrigued? You should be! Join us as we unravel the mysteries of AliExpress reviews scraping, turning the seemingly complex into a walk in the virtual park.</p>
<p>Whether you&#8217;re a seasoned developer looking to enhance your skills or a curious soul eager to explore the realms of web scraping, this tutorial is your gateway. Fasten your seatbelt, because we&#8217;re about to blend code, creativity, and a sprinkle of humor to make your AliExpress reviews scraping journey not just informative, but downright enjoyable. Let the scraping saga begin!</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="setting-up-the-environment" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="setting-up-your-scraping-arsenal">Setting Up Your Scraping Arsenal</h2>
<p>Before we embark on our web scraping adventure, it&#8217;s crucial to set up the environment. This section covers the configuration of the Firefox WebDriver, installation of necessary Python packages, and the creation of essential functions.</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 12px"><span class="vc_empty_space_inner"></span></div>
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<p><em><strong>Prerequisites</strong></em></p>
<p>&#8211; <strong>Preparing Your Product List</strong></p>
<p>Before diving into the world of AliExpress reviews scraping, make sure you have your product list ready. Create a CSV file containing <code class="EnlighterJSRAW" data-enlighter-language="python">ali_id</code> and <code class="EnlighterJSRAW" data-enlighter-language="python">woo_id</code> columns. Each row should represent a product, with <code class="EnlighterJSRAW" data-enlighter-language="python">ali_id</code> being the AliExpress product ID and <code class="EnlighterJSRAW" data-enlighter-language="python">woo_id</code> as the corresponding WooCommerce ID if you plan to import the reviews to your Woo store.</p>
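As an illustration (the filename and IDs below are made up), the product list is just a two-column CSV; this minimal sketch writes one with Python's built-in csv module:

```python
import csv

# Hypothetical example rows: ali_id is the AliExpress product ID,
# woo_id is the matching WooCommerce product ID.
rows = [
    {"ali_id": "1005001234567890", "woo_id": "101"},
    {"ali_id": "1005009876543210", "woo_id": "102"},
]

with open("your-product-list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["ali_id", "woo_id"])
    writer.writeheader()
    writer.writerows(rows)
```

Any spreadsheet tool can produce the same file; the only requirement is the `ali_id` and `woo_id` header row.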
<p><strong>&#8211; Downloading GeckoDriver</strong></p>
<p>To harness the power of Selenium with Firefox, you&#8217;ll need GeckoDriver, the Firefox WebDriver. If you haven&#8217;t installed it yet, you can download it <a href="https://github.com/mozilla/geckodriver" target="_new" rel="noopener">here</a>. Make sure to place it in a directory accessible by your system.</p>
<p>&#8211; <strong>Creating an AliExpress Account</strong></p>
<p>To begin, you need an AliExpress account. If you don&#8217;t have one, head over to <a href="https://www.aliexpress.com/" target="_new" rel="noopener">AliExpress</a> and sign up. Don&#8217;t worry; it&#8217;s a quick and straightforward process.</p>

		</div>
	</div>

	<div  class="wpb_single_image wpb_content_element vc_align_left wpb_content_element">
		
		<figure class="wpb_wrapper vc_figure">
			<div class="vc_single_image-wrapper   vc_box_border_grey"><img  fetchpriority="high"  decoding="async"  width="1278"  height="746"  src="https://cdn.dakidarts.com/image/Web-Scraping-asset1-1.jpg"  class="vc_single_image-img attachment-full"  alt="How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python"  title="How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python"  srcset="https://cdn.dakidarts.com/image/Web-Scraping-asset1-1-300x175.jpg 300w, https://cdn.dakidarts.com/image/Web-Scraping-asset1-1-1024x598.jpg 1024w, https://cdn.dakidarts.com/image/Web-Scraping-asset1-1.jpg 1278w"  sizes="(max-width: 1278px) 100vw, 1278px" ></div>
		</figure>
	</div>

	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<p><strong>&#8211; Obtaining Your AliExpress Member ID</strong></p>
<p>Once you&#8217;ve successfully registered, navigate to the account settings page. Click on &#8220;Edit Profile,&#8221; where you&#8217;ll find your Member ID.</p>

		</div>
	</div>

	<div  class="wpb_single_image wpb_content_element vc_align_left wpb_content_element">
		
		<figure class="wpb_wrapper vc_figure">
			<div class="vc_single_image-wrapper   vc_box_border_grey"><img  decoding="async"  width="1278"  height="746"  src="https://cdn.dakidarts.com/image/Web-Scraping-asset2.jpg"  class="vc_single_image-img attachment-full"  alt="How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python"  title="How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python"  srcset="https://cdn.dakidarts.com/image/Web-Scraping-asset2-300x175.jpg 300w, https://cdn.dakidarts.com/image/Web-Scraping-asset2-1024x598.jpg 1024w, https://cdn.dakidarts.com/image/Web-Scraping-asset2.jpg 1278w"  sizes="(max-width: 1278px) 100vw, 1278px" ></div>
		</figure>
	</div>

	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<p>Now, locate the numerical values in the Member ID section. We&#8217;ll need this ID for our scraping adventure.</p>

		</div>
	</div>

	<div  class="wpb_single_image wpb_content_element vc_align_left wpb_content_element">
		
		<figure class="wpb_wrapper vc_figure">
			<div class="vc_single_image-wrapper   vc_box_border_grey"><img  loading="lazy"  decoding="async"  width="1278"  height="746"  src="https://cdn.dakidarts.com/image/Web-Scraping-asset3.jpg"  class="vc_single_image-img attachment-full"  alt="How To Scrape With Selenium: Automate AliExpress Reviews Scraping With Python"  title="Web-Scraping-asset3"  srcset="https://cdn.dakidarts.com/image/Web-Scraping-asset3-300x175.jpg 300w, https://cdn.dakidarts.com/image/Web-Scraping-asset3-1024x598.jpg 1024w, https://cdn.dakidarts.com/image/Web-Scraping-asset3.jpg 1278w"  sizes="auto, (max-width: 1278px) 100vw, 1278px" ></div>
		</figure>
	</div>

	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h3 id="2-1-importing-libraries">2.1 Importing Libraries</h3>
<p>If you haven&#8217;t installed Python on your machine, fear not! You can download it from <a href="https://www.python.org/downloads/" target="_new" rel="noopener">python.org</a>. Follow the installation instructions provided for your operating system.</p>
<p>Our secret weapons for this journey are Selenium and BeautifulSoup. Install them using the following commands:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="bash">pip install selenium
pip install beautifulsoup4</pre>
<p>We begin by importing the necessary libraries. Selenium is our go-to tool for web automation, while BeautifulSoup assists in parsing HTML structures. Additionally, we include modules for handling time, CSV file operations, and more.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Code snippet for importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service as FirefoxService
from bs4 import BeautifulSoup
import time
import csv</pre>
<h3 id="2-2-creating-a-shared-firefox-webdriver">2.2 Creating a Shared Firefox WebDriver</h3>
<p>To interact with AliExpress dynamically, we create a shared Firefox WebDriver instance using Selenium. This instance will facilitate headless browsing, ensuring a seamless and non-intrusive scraping process.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_driver():
  """
  Creates and returns a single shared Firefox WebDriver instance.
  """
  firefox_options = Options()
  firefox_options.add_argument('-headless')
  firefox_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 12.5; rv:114.0) Gecko/20100101 Firefox/114.0')
  geckodriver_path = 'driver/firefox/geckodriver'  # Replace with the path of the downloaded geckodriver
  firefox_service = FirefoxService(geckodriver_path)
  return webdriver.Firefox(service=firefox_service, options=firefox_options)</pre>
<p>Now that our arsenal is ready, let&#8217;s move on to the next section where the real action begins.</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="fetching-html-content" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="3-fetching-html-content">3. Fetching HTML Content</h2>
<p>Now that our environment is set up, it&#8217;s time to fetch the HTML content of AliExpress product reviews. This section guides you through defining the function responsible for this task and handling potential errors that may arise during the process.</p>
<h3 id="3-1-defining-the-function">3.1 Defining the Function</h3>
<p>In this step, we&#8217;ll create a function that navigates to the specified AliExpress product page, iterates through the desired number of review pages, and retrieves the HTML content of each page. The function, <code class="EnlighterJSRAW" data-enlighter-language="python">get_html_content</code>, ensures a graceful handling of potential errors.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_html_content(driver, url, page_num):
    try:
        driver.get(url)
        driver.implicitly_wait(10)  # Set a default implicit wait time

        all_reviews_html = []  # List to store all collected reviews

        for page in range(1, page_num + 1):
            # Wait for the reviews container to be present
            try:
                reviews_container_locator = (By.CSS_SELECTOR, "#transction-feedback &gt; div.feedback-list-wrap")
                WebDriverWait(driver, 20).until(EC.presence_of_element_located(reviews_container_locator))
            except Exception as e:
                print(f"Error waiting for reviews container on page {page}: {str(e)}")
                break

            # Execute JavaScript to get the outerHTML of the reviews container
            reviews_container_script = 'return document.querySelector("#transction-feedback &gt; div.feedback-list-wrap").outerHTML;'
            reviews_outer_html = driver.execute_script(reviews_container_script)
            all_reviews_html.append(reviews_outer_html)

            if page &lt; page_num:
                # Click the next page button
                try:
                    next_page_button_locator = (By.CSS_SELECTOR, "#complex-pager &gt; div &gt; div &gt; a.ui-pagination-next.ui-goto-page")
                    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(next_page_button_locator)).click()

                    # Wait for the next page to load
                    time.sleep(10)  # Adjust the sleep time based on the time it takes to load the next page
                except Exception as e:
                    print(f"Error clicking next page button on page {page}: {str(e)}")
                    break

        # Concatenate all collected reviews into a single string
        all_reviews_combined = '\n'.join(all_reviews_html)

        return all_reviews_combined
    except Exception as e:
        print(f"Error in get_html_content: {str(e)}")
        return None</pre>
<h3 id="3-2-handling-errors">3.2 Handling Errors</h3>
<p>Web scraping is an adventure, and like any adventure, we might encounter obstacles along the way. To ensure a smooth journey, our script incorporates error-handling mechanisms. The <code class="EnlighterJSRAW" data-enlighter-language="python">get_html_content</code> function gracefully manages errors, such as missing review containers or difficulties in navigating to the next page.</p>
<p>Stay tuned as we move on to the next section, where we&#8217;ll delve into parsing the retrieved HTML content.</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="parsing-reviews-with-beautifulsoup" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="4-parsing-reviews-with-beautifulsoup">4. Parsing Reviews with BeautifulSoup</h2>
<p>Now that we&#8217;ve successfully fetched the HTML content, it&#8217;s time to roll up our sleeves and dive into parsing the reviews. This section guides you through understanding the structure of a review element and extracting valuable review data.</p>
<h3 id="4-1-review-element-structure">4.1 Review Element Structure</h3>
<p>Before we extract data, it&#8217;s essential to understand how a review is structured in the HTML. In our case, each review is encapsulated within a <code class="EnlighterJSRAW" data-enlighter-language="html">div</code> element with the class <code class="EnlighterJSRAW" data-enlighter-language="html">feedback-item clearfix</code>. Nested within this structure are various sub-elements holding information such as user details, ratings, and feedback content.</p>
<h3 id="4-2-extracting-review-data">4.2 Extracting Review Data</h3>
<p>With the structure in mind, we proceed to extract valuable information from each review. The <code class="EnlighterJSRAW" data-enlighter-language="python">parse_reviews</code> function uses BeautifulSoup to navigate the HTML tree and extract the relevant data. Here&#8217;s the code:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def parse_reviews(html_content, product_id, woo_id):
    """
    Parses the HTML content and extracts review data using BeautifulSoup.
    """
    try:
        soup_html = BeautifulSoup(html_content, 'html.parser')

        reviews = []
        for review_element in soup_html.find_all('div', class_='feedback-item clearfix'):
            p_title_box = review_element.find('div', class_='fb-user-info')
            user_name_span = p_title_box.find('span', class_='user-name')

            if user_name_span:
                username_anchor = user_name_span.find('a')

                if username_anchor:
                    username_text = username_anchor.text.strip()

                    p_main = review_element.find('div', class_='fb-main')
                    rate_info_div = p_main.find('div', class_='f-rate-info')

                    star_view_span = rate_info_div.find('span', class_='star-view')

                    if star_view_span:
                        width_style = star_view_span.find('span')['style']
                        width_percentage = int(width_style.split(':')[-1].strip('%'))

                        if 0 &lt;= width_percentage &lt; 20:
                            r_star = 1
                        elif 20 &lt;= width_percentage &lt; 40:
                            r_star = 2
                        elif 40 &lt;= width_percentage &lt; 60:
                            r_star = 3
                        elif 60 &lt;= width_percentage &lt; 80:
                            r_star = 4
                        else:
                            r_star = 5
                    else:
                        r_star = None

                    p_content = p_main.find('div', class_='f-content')
                    b_rev = p_content.find('dl', class_='buyer-review')
                    b_rev_fb = b_rev.find('dt', class_='buyer-feedback')

                    pic_rev = b_rev.find('dd', class_='r-photo-list')

                    p_img = pic_rev.find('ul', class_='util-clearfix') if pic_rev is not None else None

                    media_list = [item['data-src'] for item in p_img.find_all('li', class_='pic-view-item')] if p_img else None  # each pic-view-item li carries its image URL in data-src

                    media_links = ','.join(media_list) if media_list else ''

                    productId = woo_id if woo_id is not None else product_id  # prefer the Woo ID when available

                    display_name = username_text

                    display_name = 'Store Shopper' if display_name == 'AliExpress Shopper' else display_name

                    email = "demo@demo.demo"

                    review_data = {
                        'review_content': b_rev_fb.find('span', class_=None).get_text(strip=True),
                        'review_score': r_star,
                        'date': b_rev_fb.find('span', class_='r-time-new').get_text(strip=True),
                        'product_id': productId,
                        'display_name': display_name,
                        'email': email,
                        'order_id': None,
                        'media': media_links
                    }

                    reviews.append(review_data)

        return reviews

    except Exception as e:
        print(f"Error in parse_reviews: {str(e)}")
        return []</pre>
<p>This function elegantly navigates the HTML structure and extracts essential information, including review content, score, date, and more.</p>
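The width-to-stars bucketing inside <code class="EnlighterJSRAW" data-enlighter-language="python">parse_reviews</code> can also be factored into a small standalone helper; this sketch (the function name is ours, not part of the original script) uses the same 20%-wide buckets and makes the mapping easy to test:

```python
def stars_from_width(width_percentage):
    """Map AliExpress's star-view width (0-100%) to a 1-5 star rating,
    using the same 20%-wide buckets as parse_reviews."""
    if width_percentage < 20:
        return 1
    elif width_percentage < 40:
        return 2
    elif width_percentage < 60:
        return 3
    elif width_percentage < 80:
        return 4
    return 5

print(stars_from_width(100))  # a full-width star bar means 5 stars
```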
<p>Stay with us as we continue our journey into automating AliExpress reviews scraping with Python. Next, we&#8217;ll explore saving the scraped data into a convenient CSV format.</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="saving-data-to-csv" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="5-saving-data-to-csv">5. Saving Data to CSV</h2>
<p>Congratulations on reaching this point! Now that we&#8217;ve mastered the art of extracting reviews, it&#8217;s time to preserve our findings. In this section, we&#8217;ll guide you through saving the scraped data into CSV files, making it easily accessible and organized.</p>
<h3 id="5-1-successful-reviews">5.1 Successful Reviews</h3>
<p>When reviews are successfully scraped, we want to store them in a structured CSV file. The <code class="EnlighterJSRAW" data-enlighter-language="python">save_to_csv</code> function takes care of this process:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Code snippet for saving successful reviews to CSV
def save_to_csv(reviews, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['review_content', 'review_score', 'date', 'product_id', 'display_name', 'email', 'order_id', 'media']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(reviews)
</pre>
<p>This function creates a CSV file with appropriate headers and populates it with the review data, neatly organized for further analysis or sharing.</p>
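To sanity-check the file layout, you can round-trip a dummy review through the function (the review values and filename below are invented for illustration; the function body is repeated so the snippet runs standalone):

```python
import csv

def save_to_csv(reviews, filename):
    # Same function as above, repeated so this snippet is self-contained.
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['review_content', 'review_score', 'date', 'product_id',
                      'display_name', 'email', 'order_id', 'media']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(reviews)

# A made-up review, just to inspect the resulting CSV.
dummy = [{
    'review_content': 'Great product!', 'review_score': 5,
    'date': '01 Jan 2024', 'product_id': '101',
    'display_name': 'Store Shopper', 'email': 'demo@demo.demo',
    'order_id': None, 'media': ''
}]
save_to_csv(dummy, 'dummy_reviews.csv')

with open('dummy_reviews.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['review_score'])  # -> '5' (CSV values read back as strings)
```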
<h3 id="5-2-skipped-reviews">5.2 Skipped Reviews</h3>
<p>Not all heroes wear capes, and not all reviews are scrapable. Fear not! We gracefully handle scenarios where reviews cannot be fetched, and we save the day by providing a CSV file indicating the skipped reviews:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Code snippet for saving skipped reviews to CSV
def e_save_to_csv(filename, fieldnames):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow({fieldname: 'null' for fieldname in fieldnames})
</pre>
<p>This function creates a CSV file for skipped reviews, ensuring that no data is left behind, and our scraping adventure continues seamlessly.</p>
<p>Stay tuned for the final leg of our journey—putting it all together and unleashing the power of Python to automate AliExpress reviews scraping!</p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="automating-the-process" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="6-automating-the-process">6. Automating the Process</h2>
<p>We&#8217;re almost there! In this section, we&#8217;ll unveil the grand finale—automating the entire process. Brace yourself for an exciting ride as we dive into the world of automating AliExpress reviews scraping with Python.</p>
<h3 id="6-1-reading-product-list-from-csv">6.1 Reading Product List from CSV</h3>
<p>Before we embark on our automated journey, we need a list of products to scrape. The <code class="EnlighterJSRAW" data-enlighter-language="python">read_product_csv</code> function reads product details from a CSV file, while the <code class="EnlighterJSRAW" data-enlighter-language="python">get_correct_url</code> helper resolves the canonical feedback URL for each product:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_correct_url(product_id, ali_member_id):
    base_url = 'https://feedback.aliexpress.com/display/productEvaluation.htm?v=2&amp;productId='
    url = f'{base_url}{product_id}&amp;ownerMemberId={ali_member_id}&amp;page=1'
    driver = get_driver()
    try:
        driver.get(url)
        driver.implicitly_wait(10)
        return driver.current_url
    finally:
        driver.quit()  # Close the driver after use
    
def read_product_csv(csv_filename):
    products = []
    with open(csv_filename, 'r', newline='', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            product_id = row.get('ali_id')
            woo_id = row.get('woo_id')
            if product_id and woo_id:
                products.append((product_id, woo_id))
    return products
</pre>
<p>This piece of the puzzle ensures that your script knows exactly which products to target, setting the stage for an efficient and accurate scraping performance.</p>
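Before Selenium resolves any redirect, the feedback URL is plain string concatenation; this hypothetical helper (the name and ID values are ours) mirrors the URL that <code class="EnlighterJSRAW" data-enlighter-language="python">get_correct_url</code> starts from:

```python
def build_feedback_url(product_id, ali_member_id, page=1):
    # Mirrors the URL assembled in get_correct_url, before the driver
    # follows any redirect. The IDs passed in below are placeholders.
    base_url = ('https://feedback.aliexpress.com/display/'
                'productEvaluation.htm?v=2&productId=')
    return f'{base_url}{product_id}&ownerMemberId={ali_member_id}&page={page}'

url = build_feedback_url('1005001234567890', '0000000000')
print(url)
```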
<h3 id="6-2-scraping-reviews-for-multiple-products">6.2 Scraping Reviews for Multiple Products</h3>
<p>Now, let&#8217;s orchestrate the grand performance. The <code class="EnlighterJSRAW" data-enlighter-language="python">get_reviews</code> and <code class="EnlighterJSRAW" data-enlighter-language="python">scrape_products</code> functions lead us through automating reviews scraping for multiple products:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python">def get_reviews(product_id, woo_id, page_num, ali_member_id):
    driver = get_driver()
  
    if woo_id is None:
        f_name = product_id
    else:
        f_name = woo_id
  
    try:
        url = get_correct_url(product_id, ali_member_id)
        html_content = get_html_content(driver, url, page_num)
        reviews = parse_reviews(html_content, product_id, woo_id)
    
        if reviews:
            csv_filename = f'reviews/{f_name}_reviews.csv'
            save_to_csv(reviews, csv_filename)
            print(f"Reviews scraped and saved to {csv_filename}")
        else:
            e_csv_filename = f'reviews/{f_name}_reviews_skipped.csv'
            e_save_to_csv(e_csv_filename, fieldnames=['review_content', 'review_score', 'date', 'product_id', 'display_name', 'email', 'order_id', 'media'])
            print("No reviews found.")
    finally:
        driver.quit()

def scrape_products(product_list, page_num, ali_member_id, delay_seconds):
    for product_id, woo_id in product_list:
        print(f"Scraping reviews for AliExpress ID: {product_id}, WooCommerce ID: {woo_id}")
        get_reviews(product_id, woo_id, page_num, ali_member_id)
        print(f"Waiting for {delay_seconds} seconds before the next scraping iteration...")
        time.sleep(delay_seconds)
</pre>
<p>In this culmination, the script takes the reins, navigating through your list of products and automating the entire review scraping process. Each function plays a vital role in this dance of automation, bringing us to the grand finale of our AliExpress reviews scraping adventure.</p>
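To see the orchestration without touching the network, you can inject a stub in place of <code class="EnlighterJSRAW" data-enlighter-language="python">get_reviews</code>; this dry-run sketch (the injected parameter and fake IDs are our own additions, not part of the original script) exercises the same loop shape as <code class="EnlighterJSRAW" data-enlighter-language="python">scrape_products</code>:

```python
import time

def scrape_products(product_list, page_num, ali_member_id, delay_seconds,
                    get_reviews_fn):
    # Same loop as above, but the scraping function is injected so the
    # flow can be tested with a stub instead of a live browser.
    for product_id, woo_id in product_list:
        print(f"Scraping reviews for AliExpress ID: {product_id}, "
              f"WooCommerce ID: {woo_id}")
        get_reviews_fn(product_id, woo_id, page_num, ali_member_id)
        time.sleep(delay_seconds)

calls = []
def fake_get_reviews(product_id, woo_id, page_num, ali_member_id):
    # Record the call instead of launching Firefox.
    calls.append((product_id, woo_id))

scrape_products([('111', '201'), ('222', '202')], 1, '0000000000', 0,
                fake_get_reviews)
print(calls)  # [('111', '201'), ('222', '202')]
```

The same dependency-injection trick is handy for unit-testing the real scraper later.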

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div id="usage" class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h2 id="7-usage">7. Usage</h2>
<p>To unleash the power of this AliExpress reviews scraper, follow these simple steps:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="python"># Example usage:
csv_filename = 'your-product-list.csv'  # Replace with the actual CSV file containing ali_id and woo_id columns
page_num = 1  # Number of pages to iterate. Adjust as needed
ali_member_id = '0000000000'  # Replace with your actual AliExpress Member ID
delay_seconds = 30  # How long you want the scraper to wait in seconds before scraping the next product in your list

product_list = read_product_csv(csv_filename)
scrape_products(product_list, page_num, ali_member_id, delay_seconds)
</pre>
<p>Adjust the <code class="EnlighterJSRAW" data-enlighter-language="python">csv_filename</code> to point to your product list CSV file, set the desired <code class="EnlighterJSRAW" data-enlighter-language="python">page_num</code> for review iteration, replace <code class="EnlighterJSRAW" data-enlighter-language="python">ali_member_id</code> with your AliExpress Member ID, and choose the <code class="EnlighterJSRAW" data-enlighter-language="python">delay_seconds</code> between scraping iterations. Now, let the script work its magic!</p>
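The `read_product_csv` helper referenced above lives in the full repository; as a rough sketch of what it does, a minimal stdlib-only version (assuming the CSV uses `ali_id` and `woo_id` header columns, as in the example file) might look like this:

```python
import csv

def read_product_csv(csv_filename):
    """Read (ali_id, woo_id) pairs from a CSV file whose header row
    contains 'ali_id' and 'woo_id' columns."""
    product_list = []
    with open(csv_filename, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            product_list.append((row['ali_id'].strip(), row['woo_id'].strip()))
    return product_list
```

Each returned tuple then feeds directly into `scrape_products` as a `(product_id, woo_id)` pair.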
<p>For the complete code demonstrated in this tutorial, visit <a href="https://github.com/dakidarts/ali-woo-reviews-scraper" target="_new" rel="noopener">my GitHub repository</a>.</p>
<p><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f680.png" alt="🚀" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Happy scraping!</p>
<h3 id="conclusion">Conclusion</h3>
<p>Congratulations! You&#8217;ve embarked on a journey into the realm of web scraping, mastering the art of automating AliExpress reviews extraction with Python, Selenium, and BeautifulSoup. Armed with this scraper, you can gather valuable insights and enhance your e-commerce endeavors. Feel free to explore, modify, and contribute to the code on <a href="https://github.com/dakidarts/ali-woo-reviews-scraper" target="_new" rel="noopener">GitHub</a>.</p>
<p>Now, go ahead and elevate your data-driven decision-making process! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f31f.png" alt="🌟" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>

		</div>
	</div>
<div class="vc_empty_space"   style="height: 32px"><span class="vc_empty_space_inner"></span></div></div></div></div></div><div class="vc_row wpb_row vc_row-fluid"><div class="wpb_column vc_column_container vc_col-sm-12"><div class="vc_column-inner"><div class="wpb_wrapper">
	<div class="wpb_text_column wpb_content_element" >
		<div class="wpb_wrapper">
			<h3 id="frequently-asked-questions-faqs">Frequently Asked Questions (FAQs)</h3>
<h4 id="q-why-do-i-need-to-scrape-aliexpress-reviews">Q: Why do I need to scrape AliExpress reviews?</h4>
<p><strong>A:</strong> Scraping AliExpress reviews allows you to gather valuable insights into product performance, customer satisfaction, and market trends. Whether you&#8217;re a seller or a researcher, this data can help you make informed decisions and stay ahead in the competitive e-commerce landscape.</p>
<h4 id="q-is-it-legal-to-scrape-aliexpress-reviews">Q: Is it legal to scrape AliExpress reviews?</h4>
<p><strong>A:</strong> While web scraping itself is a gray area, scraping websites like AliExpress may violate their terms of service. It&#8217;s crucial to review and comply with the website&#8217;s policies. Always ensure your scraping activities align with legal and ethical standards.</p>
<h4 id="q-can-i-use-this-script-for-other-websites">Q: Can I use this script for other websites?</h4>
<p><strong>A:</strong> This script is tailored for AliExpress. Adapting it for other websites requires understanding their HTML structure and may involve significant modifications. Always respect the terms and conditions of the websites you scrape.</p>
<h4 id="q-how-often-can-i-run-the-scraper">Q: How often can I run the scraper?</h4>
<p><strong>A:</strong> The frequency of scraping depends on AliExpress&#8217;s policies and your own needs. Running it too frequently may lead to IP blocking or other restrictions. Consider a reasonable scraping interval to avoid issues.</p>
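One simple way to keep the interval &#8220;reasonable&#8221; is to add random jitter around your base delay, so requests don&#8217;t arrive at a perfectly regular, bot-like rhythm. A small illustrative helper (the function name and jitter range are my own, not part of the original script):

```python
import random
import time

def polite_delay(base_seconds, jitter_seconds=10):
    """Return base_seconds plus a random jitter, so scraping requests
    are not issued at a fixed, easily fingerprinted interval."""
    return base_seconds + random.uniform(0, jitter_seconds)

# In the scraping loop you could then replace the fixed sleep with:
#     time.sleep(polite_delay(delay_seconds))
```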
<h4 id="q-what-if-the-script-stops-working-in-the-future">Q: What if the script stops working in the future?</h4>
<p><strong>A:</strong> Websites often update their structure, affecting scrapers. Regularly check for updates to the script or make adjustments based on changes in AliExpress&#8217;s HTML structure.</p>
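You can also make breakage easier to spot by failing loudly when an expected element disappears. A hypothetical guard around BeautifulSoup lookups (the selector you pass in would be whatever your parser currently relies on; nothing here is AliExpress-specific):

```python
from bs4 import BeautifulSoup

def select_or_warn(soup, css_selector):
    """Return all elements matching css_selector, printing a loud
    warning when nothing matches -- a common symptom of a layout change."""
    elements = soup.select(css_selector)
    if not elements:
        print(f"Warning: selector '{css_selector}' matched nothing; "
              "the page structure may have changed.")
    return elements
```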
<h4 id="q-can-i-scrape-reviews-for-any-aliexpress-product">Q: Can I scrape reviews for any AliExpress product?</h4>
<p><strong>A:</strong> In theory, yes. However, AliExpress may have measures in place to prevent automated scraping. Use the script responsibly, respect the website&#8217;s policies, and consider the impact on their servers.</p>
<h3 id="got-more-questions">Got More Questions?</h3>
<p>Feel free to reach out if you have additional questions or run into issues. Happy scraping! <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f575-fe0f-200d-2642-fe0f.png" alt="🕵️‍♂️" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/17.0.2/72x72/2728.png" alt="✨" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>

		</div>
	</div>
</div></div></div></div>
</div>]]></content:encoded>
					
					<wfw:commentRss>https://hub.dakidarts.com/how-to-scrape-with-selenium-automate-aliexpress-reviews-scraping-with-python/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<media:content url="https://cdn.dakidarts.com/image/Web-Scraping-1.jpg" medium="image"></media:content>
            <media:content url="https://www.youtube.com/embed/HtOowgYIEQ8" medium="video" width="1280" height="720">
			<media:player url="https://www.youtube.com/embed/HtOowgYIEQ8" />
			<media:title type="plain">Ali-Woo Reviews Scraper Live Demo 🪄</media:title>
			<media:description type="html"><![CDATA[Unlock the potential of your e-commerce venture with the Ali-Woo Reviews Scraper! 🚀 Watch live demo to see how this scraper seamlessly fetches AliExpress re...]]></media:description>
			<media:thumbnail url="https://cdn.dakidarts.com/image/Web-Scraping-1.jpg" />
			<media:rating scheme="urn:simple">nonadult</media:rating>
		</media:content>
	</item>
	</channel>
</rss>
