In this tutorial, we’ll walk through the process of creating a web application that scrapes product reviews from a website and utilizes OpenAI’s API to generate a summarized review opinion. This application can be a valuable tool for consumers looking to quickly understand the sentiments surrounding a product based on existing reviews.
Important Disclaimer: This tutorial is for educational purposes only. Scraping data from websites may violate their terms and conditions (T&Cs). It’s crucial to always check the T&Cs of any website before scraping data.
Here we will demonstrate by scraping the reviews on a product page of my website, which I built specifically for this post, so you are free to scrape it. Below is the page link:
https://codewithmarish.com/playground/scrape-reviews
Let's start by creating a Node.js project with npm init in your project directory and installing the dependencies: express, cors, openai, and puppeteer.
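For reference, here is a minimal package.json sketch. The version numbers are assumptions (any recent versions should work, though note that Puppeteer v22 removed the waitForXPath and $x APIs used below, so a v21-era release matches this code), and "type": "module" enables the import syntax used in this tutorial:

```json
{
  "name": "reviews-ai-scanner",
  "type": "module",
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.18.2",
    "openai": "^4.0.0",
    "puppeteer": "^21.0.0"
  }
}
```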
import cors from "cors";
import express from "express";
import puppeteer, { ElementHandle } from "puppeteer";
import OpenAI from "openai";
cors: Middleware for handling cross-origin resource sharing.
express: Node.js framework for building web applications.
puppeteer: A Node.js library that allows you to automate tasks such as web scraping, form submission, UI testing, and website interaction.
openai: Node.js client for interacting with the OpenAI API.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const prompt =
"Below are the reviews for a product, provide your review in a short paragraph";
const crawlWebsite = async (url) => {
// Puppeteer code for scraping product reviews
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
console.log("Page loading...");
await page.setViewport({ width: 1280, height: 968, deviceScaleFactor: 1 });
await page.goto(url, {
waitUntil: "networkidle0",
});
console.log("scraping...");
const data = [];
let filterSel = await page.$("select");
await filterSel?.scrollIntoView();
const filters = ["neutral", "positive", "negative"];
for (let j = 0; j < filters.length; j++) {
await filterSel?.scrollIntoView();
await filterSel?.type(filters[j]);
// Based on the HTML structure, our reviews div is a sibling of the div that has a child h2 with the text "Customer Reviews"
let customerReviewSelector = `//div[h2[text()='Customer Reviews']]/following-sibling::div`;
await page.waitForXPath(customerReviewSelector);
let [firstel, secondel] = await page.$x(customerReviewSelector);
await (secondel as ElementHandle<Element>).scrollIntoView();
let paginationButtons = await secondel.$$("button");
await (firstel as ElementHandle<Element>).scrollIntoView();
for (let k = 0; k < paginationButtons.length; k++) {
await paginationButtons[k].scrollIntoView();
await paginationButtons[k].click();
let reviewsComp = await firstel.$$("div");
for (let i = 0; i < reviewsComp.length; i++) {
try {
if (!(await reviewsComp[i].isVisible())) {
await reviewsComp[i].scrollIntoView();
}
let childDivs = await reviewsComp[i].$$("p");
if (childDivs) {
let review = await childDivs[0]?.evaluate((t) => {
return t.innerText;
});
let rating = await childDivs[1]?.evaluate((t) => {
return t.innerText;
});
data.push(`${review} with ${rating}.`);
}
} catch (err) {
console.log("Error: ", err, i, j, k);
}
}
}
}
await browser.close();
return data;
};
This function is responsible for scraping product reviews from a given URL using Puppeteer, a headless browser automation library. Let's go through it step by step:
1. Launching the browser: The launch method launches a new browser instance, and { headless: "new" } runs it in headless mode.
2. Creating a page: browser.newPage() creates a new page instance within the browser, and await page.setViewport({ width: 1280, height: 968, deviceScaleFactor: 1 }) sets the viewport size of the page.
3. Navigating to the URL: page.goto(url, { waitUntil: "networkidle0" }) tells the browser to go to the specified URL and wait until the network is idle. This ensures that all the content on the page is available before proceeding.
4. Scraping reviews: const filters = ["neutral", "positive", "negative"] defines an array of review filters, and a loop iterates over each filter to fetch reviews of different sentiments. page.$("select") finds the <select> element containing the review filter options. We bring the select element into view with scrollIntoView() before interacting with it, since interacting with an off-screen element can throw exceptions, and then select the current filter with filterSel.type(filters[j]).
customerReviewSelector = //div[h2[text()='Customer Reviews']]/following-sibling::div
This is the XPath selector we use to find the reviews. In our HTML structure there are no specific class names or IDs to identify the review text, so we extract based on structure: the div wrapping the reviews is a sibling of the div that has a child h2 with the text content "Customer Reviews". The same selector also matches the pagination controls, since both are siblings of the h2's parent. This technique is helpful because many websites have dynamically generated class names, or lack specific class names or IDs to uniquely identify each element.
page.waitForXPath(customerReviewSelector) waits for an element matching the XPath expression //div[h2[text()='Customer Reviews']]/following-sibling::div to appear, and page.$x(customerReviewSelector) returns the elements matching it. We can use element.isIntersectingViewport(), element.isVisible(), or element.isHidden() to check whether an element is in view. Inside the pagination loop, paginationButtons[k].click() loads each page of reviews so they can be scraped. Each review text and rating is pushed into the data array, and once all filters and pages have been processed the browser is closed with browser.close().
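The sibling-axis trick generalizes to any section heading. As a small illustration (this helper is hypothetical and not part of the tutorial code), the same selector can be built for an arbitrary h2 text:

```typescript
// Hypothetical helper: build the sibling-axis XPath used above for any
// section heading. It matches every div that follows a div containing
// an h2 with the given text content.
const siblingOfHeading = (heading: string): string =>
  `//div[h2[text()='${heading}']]/following-sibling::div`;

console.log(siblingOfHeading("Customer Reviews"));
// → //div[h2[text()='Customer Reviews']]/following-sibling::div
```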
const getAIReview = async (data: string[]) => {
const completion = await openai.chat.completions.create({
messages: [
{
role: "user",
content: `${prompt} ${data}`,
},
],
model: "gpt-3.5-turbo",
});
let message = completion.choices[0].message.content;
return message;
};
The getAIReview
function utilizes OpenAI's GPT-3.5 model to generate an AI-generated review based on the provided review data. The function calls openai.chat.completions.create()
to generate an AI response based on the provided review data. We pass the prompt and data in the messages array. The model
property specifies the version of the GPT model to be used ("gpt-3.5-turbo" in this case). Upon receiving the completion from the OpenAI API, the generated response message is extracted from completion.choices[0].message.content and returned.
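One detail worth noting: in `${prompt} ${data}`, the data array is interpolated into a template literal, so its items are joined with commas. A small helper (hypothetical, not in the original code) makes the joining explicit and keeps each review on its own line, which tends to be easier for the model to read:

```typescript
// Hypothetical helper: join the instruction and the scraped reviews with
// newlines instead of relying on the default comma join of `${data}`.
const buildPromptContent = (instruction: string, reviews: string[]): string =>
  [instruction, ...reviews].join("\n");

const content = buildPromptContent(
  "Below are the reviews for a product, provide your review in a short paragraph",
  ["Great battery life with 5 stars.", "Screen scratches easily with 2 stars."]
);
// content now holds the instruction followed by one review per line
```

In getAIReview, the message content could then be built with buildPromptContent(prompt, data) instead of `${prompt} ${data}`.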
const app = express();
app.use(cors());
app.use(express.json());
app.post("/reviews-ai-scanner", async (req, res) => {
try {
const data = await crawlWebsite(req.body.url);
const message = await getAIReview(data);
return res.json({ summary: message });
} catch (error) {
console.error("Error:", error);
return res.status(500).json({ error: "An error occurred" });
}
});
We set up an Express application and apply middleware for handling CORS and parsing JSON request bodies, then define a POST endpoint /reviews-ai-scanner to handle requests for analyzing product reviews. It calls the crawlWebsite function to scrape reviews and the getAIReview function to generate a review summary using OpenAI.
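Since the endpoint hands req.body.url straight to a headless browser, it is worth validating it first. A small guard (hypothetical, not part of the tutorial code) using the built-in URL class could look like this:

```typescript
// Hypothetical guard: accept only http(s) URLs before passing them to Puppeteer.
const isScrapableUrl = (input: string): boolean => {
  try {
    const url = new URL(input);
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    // new URL throws a TypeError on malformed input
    return false;
  }
};
```

In the route handler, a failed check could return res.status(400) before any scraping starts.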
Final code
import cors from "cors";
import express from "express";
import puppeteer, { ElementHandle } from "puppeteer";
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const prompt =
"Below are the reviews for a product, provide your review in a short paragraph";
const crawlWebsite = async (url: string) => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
console.log("Page loading...");
await page.setViewport({ width: 1280, height: 968, deviceScaleFactor: 1 });
await page.goto(url, {
waitUntil: "networkidle0",
});
console.log("scraping...");
const data = [];
let filterSel = await page.$("select");
await filterSel?.scrollIntoView();
const filters = ["neutral", "positive", "negative"];
for (let j = 0; j < filters.length; j++) {
await filterSel?.scrollIntoView();
await filterSel?.type(filters[j]);
// Based on the HTML structure, our reviews div is a sibling of the div that has a child h2 with the text "Customer Reviews"
let customerReviewSelector = `//div[h2[text()='Customer Reviews']]/following-sibling::div`;
await page.waitForXPath(customerReviewSelector);
let [firstel, secondel] = await page.$x(customerReviewSelector);
await (secondel as ElementHandle<Element>).scrollIntoView();
let paginationButtons = await secondel.$$("button");
await (firstel as ElementHandle<Element>).scrollIntoView();
for (let k = 0; k < paginationButtons.length; k++) {
await paginationButtons[k].scrollIntoView();
await paginationButtons[k].click();
let reviewsComp = await firstel.$$("div");
for (let i = 0; i < reviewsComp.length; i++) {
try {
if (!(await reviewsComp[i].isVisible())) {
await reviewsComp[i].scrollIntoView();
}
let childDivs = await reviewsComp[i].$$("p");
if (childDivs) {
let review = await childDivs[0]?.evaluate((t) => {
return t.innerText;
});
let rating = await childDivs[1]?.evaluate((t) => {
return t.innerText;
});
data.push(`${review} with ${rating}.`);
}
} catch (err) {
console.log("Error: ", err, i, j, k);
}
}
}
}
await browser.close();
return data;
};
const getAIReview = async (data: string[]) => {
const completion = await openai.chat.completions.create({
messages: [
{
role: "user",
content: `${prompt} ${data}`,
},
],
model: "gpt-3.5-turbo",
});
console.log(completion.choices[0].message);
let message = completion.choices[0].message.content;
return message;
};
const app = express();
app.use(cors());
app.use(express.json());
app.post("/reviews-ai-scanner", async (req, res) => {
try {
const data = await crawlWebsite(req.body.url);
const message = await getAIReview(data);
return res.json({ summary: message });
} catch (error) {
console.error("Error:", error);
return res.status(500).json({ error: "An error occurred" });
}
});
app.listen(3001, () => {
console.log("Server running on 3001");
});
"use client";
import React, { useState } from "react";
const ReviewsScan = () => {
const [urlInput, setUrlInput] = useState("");
const [output, setOutput] = useState("");
const [loading, setLoading] = useState(false);
const handleSubmit = async (e: React.FormEvent<HTMLFormElement>) => {
setLoading(true);
e.preventDefault();
const res = await fetch("http://localhost:3001/reviews-ai-scanner", {
method: "post",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ url: urlInput }),
});
const respData = await res.json();
setOutput(respData.summary);
setLoading(false);
};
return (
<div className="max-w-5xl px-4 container mx-auto flex flex-col">
<form className="mb-4 flex flex-col text-center" onSubmit={handleSubmit}>
<label className="mb-6 text-2xl" htmlFor="url">
Enter the URL
</label>
<input
required
className="px-3 py-2 text-center outline-none"
onChange={(e) => {
setUrlInput(e.target.value);
}}
placeholder="http://codewithmarish.com/playground/scrape-reviews"
name="url"
type="url"
/>
<button
type="submit"
className="px-4 py-2 bg-black rounded text-white w-fit tracking-widest self-center mt-4 uppercase"
>
Submit
</button>
</form>
{loading && <p className="text-center">Loading...</p>}
<div className="mt-6 h-72 bg-white rounded overflow-auto">
<p className="font-light tracking-widest p-2">{output}</p>
</div>
</div>
);
};
export default ReviewsScan;
Let's create a new component named ReviewsScan, which contains a form for accepting a URL as input and a div for showing the AI response.
State variables:
urlInput: stores the input URL provided by the user.
output: stores the summary generated by the backend.
loading: tracks whether the data is being loaded.
Handle submit:
handleSubmit: called when the form is submitted. It sends a POST request to the backend server with the user's URL input and, upon receiving a response, updates the output state variable with the summary generated by the backend.
JSX:
The form uses the handleSubmit function as its onSubmit event handler. If loading is true, a "Loading..." message indicates that data is being fetched. Finally, a container displays the output summary generated by the backend, which is stored in the output state variable.
Now you can start your Node.js server and Next.js application, enter the URL (https://codewithmarish.com/playground/scrape-reviews), and see it in action.
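The component assumes the response body is { summary: string }, but the backend can also reply with an error object. A defensive extraction helper (hypothetical, not in the original component) avoids rendering undefined in that case:

```typescript
// Hypothetical helper: pick a displayable string out of the backend response,
// whether it carries a summary or an error message.
type ScanResponse = { summary?: string; error?: string };

const extractSummary = (resp: ScanResponse): string =>
  resp.summary ?? resp.error ?? "No summary returned.";
```

In handleSubmit, setOutput(extractSummary(respData)) would then cover both the success and error shapes.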
Congratulations, you have now successfully built an AI-powered web application for analyzing product reviews. Along the way we've explored the intersection of web development and artificial intelligence by building a reviews analyzer web app with OpenAI, Node.js, and Next.js. Happy coding!
CodeWithMarish