Experience with scraping data from websites

One of the projects I’m working on has recently reached over 30k organic search clicks in the last month. The website is a service that scrapes data from different websites, reformats it, and displays it to users in a different way.

In this article, I will talk about one of the challenges I faced when I built this website.

JavaScript uglification and obfuscation are techniques used to transform code into a more compact and less readable form.

Uglification refers to the process of removing unnecessary characters from the source code without changing its functionality, primarily to reduce file size and improve load times. This includes stripping out whitespace, shortening variable names, and eliminating comments.
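
As a small hand-written illustration (not the output of any particular tool), the two functions below do the same work; the second is roughly what a minifier might produce. The function name, parameter names, and sample data are made up for this example.

```javascript
// Original, readable source.
function calculateTotalPrice(items, taxRate) {
  // Sum price * quantity for each item, then apply tax.
  let subtotal = 0;
  for (const item of items) {
    subtotal += item.price * item.quantity;
  }
  return subtotal * (1 + taxRate);
}

// Hand-minified equivalent: comments and whitespace removed, local names shortened.
// Property names (price, quantity) stay as they are because the data depends on them.
function c(t,r){let s=0;for(const i of t)s+=i.price*i.quantity;return s*(1+r)}

const cart = [{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }];
console.log(calculateTotalPrice(cart, 0.5) === c(cart, 0.5)); // prints true
```

The behavior is unchanged; only the information that helps a human reader is gone.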

Obfuscation, on the other hand, goes a step further by making the code intentionally difficult to understand to protect intellectual property and prevent reverse engineering. This involves complex renaming of variables and functions, altering the control flow, and embedding confusing patterns.
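
To give a rough idea of what this looks like, here is a hand-written sketch of two common patterns: string literals moved into a lookup table, and statements rewritten as a state machine (often called control-flow flattening). All identifiers are invented for illustration, and real obfuscators produce far longer and messier output.

```javascript
// Readable version.
function greet(name) {
  return 'Hello, ' + name + '!';
}

// Hand-obfuscated equivalent: strings live in a lookup table and the
// statements run inside a switch-based state machine.
const _0x1a = ['!', 'Hello, '];
function _0x2b(_0x3c) {
  let _0x4d = 2, _0x5e = '';
  while (_0x4d !== 0) {
    switch (_0x4d) {
      case 2: _0x5e = _0x1a[1] + _0x3c; _0x4d = 1; break; // build "Hello, <name>"
      case 1: _0x5e += _0x1a[0]; _0x4d = 0; break;        // append "!" and stop
    }
  }
  return _0x5e;
}

console.log(greet('Ada') === _0x2b('Ada')); // prints true
```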

While these techniques can be beneficial for performance and security, they pose significant challenges for developers who need to maintain or update the code, necessitating the use of deuglification tools and practices to restore readability and manageability.

My website relies heavily on data extracted from other websites. While some of these sites offer free APIs that I can use to retrieve the necessary data, others do not provide public APIs and even implement additional protections against scraping. This is where the challenge and excitement of deuglifying website code and scraping their data come into play. By deciphering and cleaning up the obfuscated and minified JavaScript code, I can navigate through the protective layers, allowing me to access and utilize the data effectively.

What makes scraping data from other websites feasible is that web application code is always sent to the client and run in the browser. This means developers cannot completely hide their code or ship it only as a compiled binary, at least not until WebAssembly (Wasm) becomes more widespread. Because JavaScript is delivered as source text, it remains accessible and readable, allowing me to analyze the original behavior of the code and effectively scrape the website.

Deuglifying JavaScript code can be approached in several ways, depending on how the code was transformed. Many websites simply run a minifier such as UglifyJS over their code during the build process. For these websites, understanding the uglified code is manageable with the help of various DevTools features, such as stepping through the code in the debugger and tracing the initiators of XHR requests, which make it possible to reverse-engineer and comprehend the minified JavaScript.

Some of the websites I work with use JavaScript obfuscation libraries in an attempt to protect their code from reverse engineering and tampering. This adds an extra layer of challenge for me, as I need to deobfuscate their code to access the necessary data. These websites often embed the data I need within their JavaScript code, and the code itself changes depending on the data being requested. Deobfuscating a JavaScript file is feasible when the code remains static, but it becomes impractical when the content of the code frequently changes. In such cases, traditional deobfuscation techniques won’t work, and I need to employ dynamic analysis and other advanced methods to extract the data.

To handle these websites, I examined each site’s JavaScript files to understand their specific patterns. I then installed Node.js on my VPS and executed the deobfuscated code along with additional JavaScript commands to extract the data I needed. This approach allowed me to bypass the obfuscation and retrieve the necessary information effectively.

For example, if a website loads data from a specific variable returned by a function and displays it when the user clicks a button in the UI, I replicate this process in Node.js by executing the original code along with additional code I developed for this purpose.
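
A minimal sketch of this idea, assuming Node's built-in `vm` module is used to run the site's script. The script content, the `pageData` variable, and the stubbed `document` object are all hypothetical stand-ins, not code from any real site.

```javascript
const vm = require('node:vm');

// Stand-in for a script downloaded from the target site (hypothetical content).
const siteScript = `
  function _0xload() { return { title: 'Example', items: [1, 2, 3] }; }
  var pageData = _0xload();
  document.getElementById('show').addEventListener('click', function () {
    document.title = pageData.title;
  });
`;

// Minimal stubs for the browser globals the script touches.
const sandbox = {
  document: {
    title: '',
    getElementById: () => ({ addEventListener: () => {} }),
  },
};

vm.createContext(sandbox);
vm.runInContext(siteScript, sandbox, { timeout: 1000 });

// Extra code appended by the scraper: read the value the UI would have shown.
const extracted = vm.runInContext('JSON.stringify(pageData)', sandbox);
console.log(JSON.parse(extracted));
```

Note that the Node.js documentation states the `vm` module is not a security mechanism, so scripts from third-party sites should be run in an isolated environment.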

I considered other solutions, such as running the website in a headless browser and extracting data once the page is rendered and loaded. However, since my server has limited resources (4 CPU cores, 6 GB RAM) and I have a large number of users, I needed a more efficient solution. By using Node.js to imitate the entire process, I can efficiently extract the necessary data without overloading my server.

Some websites go to great lengths to make it challenging to analyze their workings by employing a DevTools detector library. This library immediately closes the page if a user attempts to open DevTools and clears any console logs at regular intervals. However, even these websites can be effectively analyzed with the assistance of chrome-response-override, a Chrome extension I contributed to. This extension allows for the overriding of responses to requests sent from the browser. I utilize this extension to intercept and modify responses for JavaScript files. By replacing the original JavaScript file’s content with a version where I’ve removed the code responsible for initiating the DevTools detector library, I can debug the website’s code without triggering the DevTools detector mechanism. This approach enables me to effectively analyze and debug websites despite their attempts to thwart inspection.

One realization I’ve had while working with DevTools detector libraries is that their effectiveness varies. Some libraries are more robust than others. For instance, certain libraries fail to detect the DevTools window if the page is opened within an iframe. This small workaround can save a significant amount of time when analyzing certain websites. However, on the flip side, I’ve encountered websites using versions of these libraries that can detect DevTools even when the website is run from within an iframe. This discrepancy underscores the importance of thorough testing and adaptation to specific circumstances when dealing with DevTools detection mechanisms.

As an example of my approach to deuglifying websites, consider a recent experience I had with a particular site I was attempting to scrape. Initially, everything was running smoothly until they implemented a Web Application Firewall (WAF). Following this update, my scraping efforts were thwarted, as the WAF blocked access to the data I needed. In response, I embarked on a thorough analysis of the WAF’s mechanisms to understand how it operated and devised a strategy to bypass it. This process of deobfuscating the JavaScript code to overcome the WAF proved to be one of the most fun challenges I’ve encountered recently.

Upon closer examination of the issue, I discovered that when my service attempted to load data from the website, it was redirected to another page. This page prompted me to undergo a security check before proceeding to the intended destination.

When I examined the code in the JavaScript files for the security check page, I noticed that one of the files had been uglified. The content of this JavaScript file appeared as follows:


My initial thought was that this code might be using eval to evaluate the original JavaScript code, which could be encoded within the obfuscated code. To test this hypothesis, I opened DevTools and executed the command eval = console.log in the console. Next, I pasted the obfuscated code into the console, and my suspicion was confirmed. The original code was logged to the console:

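var p1 = 'KY8Ec9dQrNUWim4jVQom8vZ5aAEd4m4V',
    p2 = 'N2aImEFlZoFEFKfKVzp1JuHHIySj4MSG',
    p3 = 'LOgF8G9poMMSVigm6ssxwtjlQ51zIVhR';
(function(h) {
    var l = p1.length,
        u = 'undefined',
        i, o = '';
    // Interleave the characters of p1, p2 and p3 into the challenge string
    for (i = 0; i < l; i++) o += p1[i] + p2[i] + p3[i];
    // Redirect to the current URL with the challenge parameter replaced
    location.href = h.replace(/&?challenge=[^&]+/g, '') + (h.indexOf('?') < 0 ? '?' : '&') + 'challenge=' + o;
})(location.href); // closing call is missing from the logged snippet; assumed here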
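
The `eval` replacement used to reveal this code can also be reproduced outside the browser. The following is a minimal Node.js sketch of the same idea; the packed script is a hypothetical stand-in and much simpler than a real packer's output.

```javascript
// Stand-in for a packed script: the real source is stored as an encoded string
// and handed to eval at run time (hypothetical payload).
const packedScript = "eval(decodeURIComponent('var%20secret%20%3D%2042%3B'));";

const captured = [];
const realEval = globalThis.eval;

// Same idea as typing `eval = console.log` in the DevTools console:
// the payload is recorded instead of executed.
globalThis.eval = (source) => { captured.push(source); };
try {
  // new Function compiles the code in the global scope, so `eval` resolves to the replacement.
  new Function(packedScript)();
} finally {
  globalThis.eval = realEval; // always restore the real eval
}

console.log(captured[0]); // prints: var secret = 42;
```

This only works when the script actually goes through the global `eval`; code that unpacks itself through other means needs a different hook.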

Upon examining the original code, I saw that it interleaves the characters of three strings (p1, p2, and p3) to form a new string, which is then appended to the current URL as the challenge parameter before the browser is redirected to that link. When I made a cURL request to this link, it returned a Set-Cookie header and redirected to the original page. Notably, the values of p1, p2, and p3 change dynamically, so they cannot be hard-coded. To address this, I executed this JavaScript code in Node.js along with additional code to retrieve the redirect link containing the challenge string.

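// Stub for the browser's location object, which does not exist in Node.js
const location = {href: ''};
// Original code (loaded via cURL)
var p1 = 'KY8Ec9dQrNUWim4jVQom8vZ5aAEd4m4V',
    p2 = 'N2aImEFlZoFEFKfKVzp1JuHHIySj4MSG',
    p3 = 'LOgF8G9poMMSVigm6ssxwtjlQ51zIVhR';
(function(h) {
    var l = p1.length,
        u = 'undefined',
        i, o = '';
    for (i = 0; i < l; i++) o += p1[i] + p2[i] + p3[i];
    location.href = h.replace(/&?challenge=[^&]+/g, '') + (h.indexOf('?') < 0 ? '?' : '&') + 'challenge=' + o;
})(location.href); // closing call is missing from the captured snippet; assumed here
// End original code
// Additional code (assumed): print the redirect link containing the challenge string
console.log(location.href);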

After obtaining the redirect link, I used cURL to access it and retrieved the cookies from the response header. Next, I set these cookies in the request I sent to the original page I intended to scrape. By incorporating these cookies into the request, I successfully loaded the data from that page once again.
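
The cookie hand-off can be sketched as a small helper that turns the Set-Cookie values from the challenge response into a single Cookie request header. The helper name and the cookie names below are hypothetical; the real names depend on the WAF in question.

```javascript
// Turn Set-Cookie header values into a single Cookie request header.
// Attributes such as Path, Max-Age or HttpOnly are dropped; only name=value is kept.
function toCookieHeader(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

// Hypothetical headers, as they might appear in the challenge response.
const setCookie = [
  'waf_token=abc123; Path=/; HttpOnly',
  'session=xyz; Max-Age=3600',
];

console.log(toCookieHeader(setCookie)); // prints: waf_token=abc123; session=xyz
```

The resulting string would then be sent as the `Cookie` header of the request to the page being scraped.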

Deuglifying JavaScript code becomes easier with each new website I deuglify. However, there are still two scenarios where I struggle to bypass the JS Challenge on certain websites using the aforementioned strategies.

The first case is Google reCAPTCHA, and to be honest, bypassing it feels almost impossible, especially with reCAPTCHA v2.

The second case involves the Cloudflare security check, which is considerably more complex than the simple check I decoded previously. Despite its complexity, I found a way to bypass the Cloudflare security check, albeit through a more resource-intensive method.

The challenge with Cloudflare is its ability to detect Puppeteer or any other browser controlled by automation software. To circumvent this issue, I had to take a different approach. I installed GNOME on my VPS, launched Firefox, and opened the page of the website I intended to scrape. To avoid triggering any automated detection mechanisms, I developed a Firefox add-on that refreshes this page every hour.
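
A rough sketch of what the background script of such an add-on could look like, assuming the WebExtensions `alarms` and `tabs` APIs are used. This is an illustration rather than the actual add-on: the function name, the alarm name, and the URL pattern are hypothetical.

```javascript
// Registers a repeating alarm that reloads every tab matching urlPattern.
// `browser` is the WebExtensions API object; it is passed in as a parameter so
// the logic can be exercised outside Firefox with a stand-in object.
function installRefresher(browser, urlPattern, periodInMinutes = 60) {
  const ALARM = 'refresh-target-page';
  browser.alarms.create(ALARM, { periodInMinutes });
  browser.alarms.onAlarm.addListener(async (alarm) => {
    if (alarm.name !== ALARM) return; // ignore alarms set by other code
    const tabs = await browser.tabs.query({ url: urlPattern });
    await Promise.all(tabs.map((tab) => browser.tabs.reload(tab.id)));
  });
}

// In the add-on's background script (the manifest would need the "alarms"
// permission plus "tabs" or a matching host permission):
// installRefresher(browser, 'https://example.com/*');
```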

In summary, the process of deuglifying JavaScript code reveals the dynamic nature of web development, where challenges spark innovation. From unraveling obfuscated code to confronting security measures like Google reCAPTCHA and Cloudflare checks, each obstacle fuels my determination to find solutions. Through perseverance and trial and error, I unearth valuable insights and push the boundaries of what’s possible in web scraping.

Update 01/06/2024

I’m spending more time working on SEO for the website now. I’m following some online strategies to increase the authority of the blog and the number of backlinks. As a result, the organic traffic keeps increasing over time 💪.

