Serhii Korol

Say Goodbye to WebDriver: Modern Alternatives for Browser Automation – Part 2

Introduction

Good day! In my previous article, I covered how to launch the Chrome browser, bypass anti-bot tests, and scrape websites effectively. Today, we'll move on to interacting with the DOM. More specifically, I'll show you how to access elements inside shadow roots and how to type text into inputs. So grab a cup of coffee, get comfortable, and let's jump right in!

Base implementation

I won't repeat the details of launching Chrome, but here is the foundational code:

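        // Namespaces used by the snippets in this article: System.Diagnostics, System.IO,
        // System.Net.WebSockets, System.Text, System.Text.Json
        static async Task InteractionWithDom()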
        {
            const int port = 9222;
            // macOS path; adjust it on Windows or Linux
            var chromePath = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome";
            var userDataDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

            //🚀Step 1: Start Chrome
            Console.WriteLine("🚀Starting a new Chrome instance...");
            Directory.CreateDirectory(userDataDir);

            var psi = new ProcessStartInfo
            {
                FileName = chromePath,
                Arguments = string.Join(" ",
                    $"--remote-debugging-port={port}",
                    "--no-first-run",
                    $"--user-data-dir={userDataDir}",
                    "https://selectorshub.com/xpath-practice-page/")
            };

            var chromeProcess = Process.Start(psi);
            if (chromeProcess == null)
            {
                Console.WriteLine("❌Failed to start Chrome.");
                return;
            }

            Console.WriteLine("🚀Chrome started. Waiting for initialization...");
            await Task.Delay(5000);
            try
            {
                //✅Step 2: Get WebSocket Debugger URL
                string? debuggerUrl = await GetPageWebSocketUrl();
                Console.WriteLine(debuggerUrl);
                if (string.IsNullOrEmpty(debuggerUrl))
                {
                    Console.WriteLine("❌Failed to retrieve WebSocket Debugger URL.");
                    return;
                }

                // ⚙️Step 3: Connect to WebSocket

                using var ws = new ClientWebSocket();
                await ws.ConnectAsync(new Uri(debuggerUrl), CancellationToken.None);

                // Steps 4 to 8 (the DOM interaction and screenshot code shown below) go here

                //🚪Step 9: Close
                Console.WriteLine("🚪Press Enter to close...");
                Console.ReadLine();
                await ws.CloseAsync(WebSocketCloseStatus.NormalClosure, "Close", CancellationToken.None);
            }
            catch (Exception ex)
            {
                Console.WriteLine($"❌ Error: {ex.Message}");
            }
            finally
            {
                if (!chromeProcess.HasExited)
                {
                    // Chrome spawns helper processes, so terminate the whole process tree
                    chromeProcess.Kill(entireProcessTree: true);
                }
            }
        }
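GetPageWebSocketUrl() comes from Part 1 and isn't repeated here. For orientation, here is a minimal sketch of what such a helper typically does, assuming Chrome's HTTP discovery endpoint http://localhost:9222/json/list: it returns a JSON array of targets, and each page target carries a webSocketDebuggerUrl. The function name PickPageWebSocketUrl is hypothetical, and it only parses the JSON, leaving the HTTP request to the caller.

```csharp
using System.Text.Json;

// Hypothetical helper: given the JSON array returned by http://localhost:9222/json/list,
// return the WebSocket URL of the first target whose "type" is "page" (or null if none).
static string? PickPageWebSocketUrl(string targetsJson)
{
    using var doc = JsonDocument.Parse(targetsJson);
    foreach (var target in doc.RootElement.EnumerateArray())
    {
        // Skip service workers, iframes and other non-page targets
        if (target.TryGetProperty("type", out var type) && type.GetString() == "page" &&
            target.TryGetProperty("webSocketDebuggerUrl", out var url))
        {
            return url.GetString();
        }
    }
    return null;
}

// Usage (assumes Chrome is listening on port 9222):
// using var http = new HttpClient();
// var json = await http.GetStringAsync("http://localhost:9222/json/list");
// var wsUrl = PickPageWebSocketUrl(json);
```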

Filling Forms

Let's proceed to Step 4. To illustrate DOM manipulation, I'll use the SelectorsHub practice page. Our first task is to fill out the "Company" input field.

Identifying the Input Selector

[Image: the Company input field]

This is what it looks like in the code.

[Image: the Company input's HTML]

We begin by obtaining the input element's selector and locating it within the DOM:

var companyInputSelector =
                    "#content > div > div.elementor.elementor-1097 > section.elementor-section.elementor-top-section.elementor-element.elementor-element-0731668.elementor-section-boxed.elementor-section-height-default > div > div.elementor-column.elementor-col-50.elementor-top-column.elementor-element.elementor-element-b7d792b > div > div.elementor-element.elementor-element-459c920.elementor-widget__width-inherit.elementor-widget.elementor-widget-html > div > div > div:nth-child(11) > div > div > div > input[type=\"text\"]:nth-child(3)";

                var companyInputObjectId = await QuerySelector(ws, companyInputSelector, 1);

Querying the DOM

We need to retrieve the object ID of the input element. Here's how the QuerySelector method works:

        private static async Task<string?> QuerySelector(ClientWebSocket ws, string selector, int id)
        {
            var command = new
            {
                id,
                method = "Runtime.evaluate",
                @params = new
                {
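                    // A JSON-encoded string is also a valid JavaScript string literal,
                    // so quotes inside the selector can't break the expression
                    expression = $"document.querySelector({JsonSerializer.Serialize(selector)})",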
                    returnByValue = false
                }
            };

            var response = await SendAndReceive(ws, command);


            using var doc = JsonDocument.Parse(response);

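            if (doc.RootElement.TryGetProperty("result", out var result) &&
                !result.TryGetProperty("exceptionDetails", out _) && // a thrown error also comes back as an object with an objectId
                result.GetProperty("result").TryGetProperty("objectId", out var objectId))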
            {
                return objectId.GetString();
            }

            throw new Exception($"❌Selector: '{selector}' not found");
        }

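Here we use Runtime.evaluate, which evaluates a JavaScript expression in the page's global context. Because returnByValue is false, Chrome doesn't try to serialize the result (a DOM element can't be turned into JSON anyway); instead it returns a remote object reference, and its objectId is the handle we pass to later commands. The id parameter is the CDP message id: Chrome echoes it in the matching response, which is how replies are paired with requests. This walkthrough reuses the same id within each step, which works because every command waits for its reply before the next one is sent; if you start handling events or sending commands concurrently, give every command a unique, increasing id. After sending the command, we await the response and parse out the objectId.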

    private static async Task<string> SendAndReceive(ClientWebSocket ws, object command)
    {
        string message = JsonSerializer.Serialize(command);
        byte[] buffer = Encoding.UTF8.GetBytes(message);
        await ws.SendAsync(new ArraySegment<byte>(buffer), WebSocketMessageType.Text, true, CancellationToken.None);

        var response = await ReceiveMessage(ws);
        Console.WriteLine($"✅[Response]: {response}");
        return response;
    }
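SendAndReceive() assumes that the next message on the socket is the reply to the command it just sent. That holds in this walkthrough because no CDP domain is enabled, so Chrome has no events to push; once you enable one (Page.enable, for example), events can arrive in between. Here is a minimal sketch of a more defensive approach, assuming every command gets its own id. NextId() and IsResponseFor() are hypothetical helpers, written as standalone functions so you can adapt them to the article's class.

```csharp
using System.Text.Json;
using System.Threading;

// Hypothetical id source: every command gets a fresh, increasing id (thread-safe).
int lastId = 0;
int NextId() => Interlocked.Increment(ref lastId);

// A reply echoes the id of the command it answers; events carry "method" and no "id".
static bool IsResponseFor(string message, int id)
{
    using var doc = JsonDocument.Parse(message);
    return doc.RootElement.TryGetProperty("id", out var idProp) && idProp.GetInt32() == id;
}

// Usage idea: keep reading until the reply for this command arrives, skipping events
// string reply;
// do { reply = await ReceiveMessage(ws); } while (!IsResponseFor(reply, id));
```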

Sending and Receiving Data

ReceiveMessage() reads the reply from the socket. Note that the reply is never null: if querySelector() finds nothing, Chrome still answers, but the result describes a JavaScript null (subtype "null") and carries no objectId:

✅[Response]: {"id":1,"result":{"result":{"type":"object","subtype":"null","value":null}}}
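QuerySelector() treats any missing objectId as an error. If you'd rather tell the possible outcomes apart, here is a hedged sketch of a parser that distinguishes three cases: a protocol error (the reply has a top-level error field), a JavaScript exception (result.exceptionDetails, for example from an invalid CSS selector), and a plain null result like the one above. The exception case matters because result.result then describes the thrown error object, which can carry an objectId of its own. ObjectIdOrNull is a hypothetical name.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: returns the objectId of the remote object in a Runtime.evaluate or
// Runtime.callFunctionOn reply, null when the expression evaluated to null/undefined,
// and throws when the command failed or the JavaScript threw.
static string? ObjectIdOrNull(string response)
{
    using var doc = JsonDocument.Parse(response);
    var root = doc.RootElement;

    // Protocol-level failure, e.g. an unknown method or a stale objectId
    if (root.TryGetProperty("error", out var error))
        throw new InvalidOperationException($"CDP error: {error.GetRawText()}");

    var result = root.GetProperty("result");

    // The JavaScript threw; result.result then describes the exception object
    if (result.TryGetProperty("exceptionDetails", out var details))
        throw new InvalidOperationException($"JavaScript exception: {details.GetRawText()}");

    // Found: a node reference with an objectId. Not found: subtype "null", no objectId.
    return result.GetProperty("result").TryGetProperty("objectId", out var id) ? id.GetString() : null;
}
```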

The method reads the reply into a fixed 4 KB buffer. That is enough for the short JSON replies in this article, but it isn't guaranteed in general: a single ReceiveAsync call returns at most one buffer's worth of data, and a larger message arrives in several chunks. That's why the screenshot section below switches to a reader that loops until the end of the message.

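        private static async Task<string> ReceiveMessage(ClientWebSocket ws)
        {
            // 4 KB is enough for the short DOM replies used in this article
            var buffer = new byte[4096];
            var result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);

            // One ReceiveAsync call returns at most one buffer's worth of data; fail loudly
            // instead of handing truncated JSON to the parser
            if (!result.EndOfMessage)
                throw new Exception("❌Response exceeds the 4 KB buffer; use a looping reader.");

            return Encoding.UTF8.GetString(buffer, 0, result.Count);
        }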

Inserting Text

Once you have the object ID, check it before passing it on. QuerySelector() already throws when nothing matches, but its return type is string?, so a short guard handles the null case explicitly:

if(string.IsNullOrEmpty(companyInputObjectId)) return;
await InsertText(companyInputObjectId, ws, "dev.to", 1);

Let's take a closer look at InsertText(). It receives the element's objectId and sends a Runtime.callFunctionOn command that runs a small JavaScript function on that element. The function focuses the input, appends the text one character at a time while dispatching keyboard and input events, pauses briefly between characters, and fires a change event at the end. Note that it appends to whatever the field already contains rather than replacing it.

        private static async Task InsertText(string? objectId, ClientWebSocket ws, string text, int id)
        {
            var command = new
            {
                id,
                method = "Runtime.callFunctionOn",
                @params = new
                {
                    objectId, // the target element; it becomes `this` inside the function
                    functionDeclaration = @"
                async function simulateTyping(text) {
                    this.focus();
                    for (const char of text) {
                        // keydown, value change, input, keyup: roughly the order of a real keystroke
                        this.dispatchEvent(new KeyboardEvent('keydown', { key: char, bubbles: true }));
                        this.value += char;
                        this.dispatchEvent(new InputEvent('input', { data: char, inputType: 'insertText', bubbles: true }));
                        this.dispatchEvent(new KeyboardEvent('keyup', { key: char, bubbles: true }));
                        await new Promise(resolve => setTimeout(resolve, 100)); // 100 ms between keystrokes
                    }
                    this.dispatchEvent(new Event('change', { bubbles: true }));
                }",
                    arguments = new[]
                    {
                        new { value = text }
                    },
                    awaitPromise = true, // reply only after the async function has finished typing
                    userGesture = true
                }
            };

            var response = await SendAndReceive(ws, command);
            Console.WriteLine($"📝 Text Insertion Response: {response}");
        }
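Since every command is just an anonymous object run through System.Text.Json, it can help to see the JSON that actually goes over the socket. The sketch below builds a comparable Runtime.callFunctionOn payload; BuildCallFunctionOn is a hypothetical helper and "demo-object-id" is a made-up placeholder. Note that the C# member @params is serialized as "params", that arguments is an array of objects with a value field, and that awaitPromise makes Chrome wait for an async function's promise before replying.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper: builds a Runtime.callFunctionOn command the same way the article does.
static string BuildCallFunctionOn(int id, string objectId, string functionDeclaration, params object[] args)
{
    var command = new
    {
        id,
        method = "Runtime.callFunctionOn",
        @params = new
        {
            objectId,                                                   // becomes `this` in the function
            functionDeclaration,
            arguments = Array.ConvertAll(args, a => new { value = a }), // positional arguments
            awaitPromise = true,                                        // wait for async functions
            returnByValue = false
        }
    };
    return JsonSerializer.Serialize(command);
}

Console.WriteLine(BuildCallFunctionOn(1, "demo-object-id", "function(t) { this.value = t; }", "dev.to"));
```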

I used Runtime.callFunctionOn here. Unlike Runtime.evaluate, it calls a function on a specific remote object: the objectId becomes this inside the function, and the arguments array supplies its parameters (here, the text to type). The function mimics a person typing by adding one character at a time with a 100 ms pause between keystrokes. Because it is async, awaitPromise = true tells Chrome to wait for the returned promise before replying; without it, the reply arrives immediately while the text is still being typed. Here's the result:

[Image: text typed into the Company field]
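Events created with dispatchEvent() have isTrusted set to false, and some sites ignore them. If that's a problem, CDP's Input domain is an alternative worth trying: focus the element with DOM.focus, which accepts an objectId, then send Input.insertText (marked experimental in the protocol), which lets Chrome itself insert the text into the focused element, much like an IME would. The sketch below only builds the two command objects; BuildFocusAndInsert is a hypothetical name, and you would send each object with the article's SendAndReceive() helper.

```csharp
// Hypothetical helper: commands for "focus this element, then insert text" via CDP.
// DOM.focus takes the element's objectId; Input.insertText types into whatever has focus.
static object[] BuildFocusAndInsert(int firstId, string objectId, string text)
{
    var focus = new { id = firstId, method = "DOM.focus", @params = new { objectId } };
    var insert = new { id = firstId + 1, method = "Input.insertText", @params = new { text } };
    return new object[] { focus, insert };
}

// Usage with the article's helpers:
// foreach (var cmd in BuildFocusAndInsert(10, companyInputObjectId, "dev.to"))
//     await SendAndReceive(ws, cmd);
```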

Shadow DOM Interaction

In the next example, I'll show how to work with shadow roots. A shadow root is the root node of a DOM subtree that is attached to a host element but kept separate from the main document tree: document.querySelector() doesn't search inside it, which is why reaching its elements takes a few extra steps. Shadow roots come in two modes, open and closed. Here we'll focus on open shadow roots, which page scripts can reach through the host's shadowRoot property; for closed ones, that property returns null.

Let’s walk through a practical example. Suppose I want to access an input element that is nested inside a shadow root. To achieve this, I’ll need to traverse the shadow DOM hierarchy step by step. Here’s how it can be done.

[Image: the shadow root on the practice page]

Search the parent element

Because document.querySelector() doesn't search inside shadow roots, we can't reach the input directly. Instead, we first find the element the shadow root is attached to, called the shadow host (here, #userName). From the host, the shadowRoot property gives us the encapsulated subtree, as long as the root is open.

var userNameObjectId = await QuerySelector(ws, "#userName", 2);

Search the shadow root

Once we have the host, the next step is to get a reference to its shadow root. Despite the method's name, this isn't an HTML id: like the host element, the shadow root comes back as a remote object, and its objectId is what the next command needs.

if (string.IsNullOrEmpty(userNameObjectId)) return;
var shadowRootId = await GetShadowRootId(ws, userNameObjectId, 2);

GetShadowRootId() runs a one-line function on the host via Runtime.callFunctionOn: it returns this.shadowRoot, or null if there is none. Since returnByValue is false, Chrome answers with a reference to the shadow root, and we extract its objectId. If the root is closed or missing, the result is null, the reply contains no objectId, and the method throws.

        private static async Task<string?> GetShadowRootId(ClientWebSocket ws, string hostObjectId, int id)
        {
            var command = new
            {
                id,
                method = "Runtime.callFunctionOn",
                @params = new
                {
                    objectId = hostObjectId,
                    functionDeclaration = "function() { return this.shadowRoot || null; }",
                    returnByValue = false
                }
            };

            var response = await SendAndReceive(ws, command);
            Console.WriteLine($"Shadow Root Response: {response}");

            using var doc = JsonDocument.Parse(response);
            if (!doc.RootElement.GetProperty("result").GetProperty("result")
                    .TryGetProperty("objectId", out var shadowRootId))
                throw new Exception("❌ Shadow root not found (no shadow root, or it is closed).");
            Console.WriteLine("✅ Shadow root found.");
            return shadowRootId.GetString();

        }
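One limitation worth knowing: for a closed shadow root, this.shadowRoot is null, so GetShadowRootId() can't tell "closed" apart from "no shadow root at all". The DOM domain can: DOM.describeNode accepts an objectId, and with pierce set to true the returned node should list its shadowRoots, each with a shadowRootType of "open", "closed" or "user-agent". The sketch below builds that request and reads the field; the helper names are hypothetical, and the sample replies in the test are hand-written, not captured from Chrome.

```csharp
using System.Text.Json;

// Hypothetical request: describe the host element, including its shadow roots.
static object BuildDescribeNode(int id, string hostObjectId) => new
{
    id,
    method = "DOM.describeNode",
    @params = new { objectId = hostObjectId, depth = 1, pierce = true }
};

// Returns the first shadow root's type ("open", "closed", "user-agent"), or null if none.
static string? ShadowRootType(string response)
{
    using var doc = JsonDocument.Parse(response);
    var node = doc.RootElement.GetProperty("result").GetProperty("node");
    if (!node.TryGetProperty("shadowRoots", out var roots) || roots.GetArrayLength() == 0)
        return null;
    return roots[0].TryGetProperty("shadowRootType", out var type) ? type.GetString() : null;
}
```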

Search the element in the shadow root

Next, we locate the input nested inside the shadow root. A ShadowRoot supports querySelector() and querySelectorAll() just like document does, so once we hold the shadow root's objectId, we can query it directly and then read the input's value, change it, or type into it.

if (string.IsNullOrEmpty(shadowRootId)) return;
var inputObjectId = await QuerySelectorInShadowRoot(ws, shadowRootId, "#kils", 2);

QuerySelectorInShadowRoot() again uses Runtime.callFunctionOn, this time on the shadow root's objectId. Inside the function, this refers to the ShadowRoot, and because ShadowRoot implements querySelector(), this.querySelector(...) searches only the encapsulated subtree. That's why we call a function on the shadow root instead of evaluating a global expression: it gives us the right this.

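        private static async Task<string?> QuerySelectorInShadowRoot(ClientWebSocket ws, string shadowRootObjectId, string selector, int id)
        {
            var command = new
            {
                id,
                method = "Runtime.callFunctionOn",
                @params = new
                {
                    objectId = shadowRootObjectId, // inside the function, `this` is the ShadowRoot
                    // Pass the selector as an argument rather than splicing it into the source,
                    // so quotes in the selector can't break the function
                    functionDeclaration = "function(selector) { return this.querySelector(selector); }",
                    arguments = new[] { new { value = selector } },
                    returnByValue = false
                }
            };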

            var response = await SendAndReceive(ws, command);

            using var doc = JsonDocument.Parse(response);

            if (!doc.RootElement.TryGetProperty("result", out var result) ||
                !result.TryGetProperty("result", out var innerResult) ||
                !innerResult.TryGetProperty("objectId", out var objectId))
                throw new Exception($"❌ Element '{selector}' not found");
            Console.WriteLine($"🎯 Element '{selector}' found.");
            return objectId.GetString();
        }
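When every shadow root along the path is open, the host lookup, shadowRoot access and inner query can also be collapsed into a single Runtime.evaluate expression. The hypothetical helper below builds such an expression: it chains querySelector and shadowRoot with optional chaining, so a missing element makes the expression evaluate to null or undefined instead of throwing a TypeError, and it serializes each selector as a JSON string literal so quotes can't break the script.

```csharp
using System.Text.Json;

// Hypothetical helper: builds a JS expression that walks through open shadow roots.
// Every selector except the last must match a shadow host.
static string BuildShadowPath(params string[] selectors)
{
    var expression = "document";
    for (var i = 0; i < selectors.Length; i++)
    {
        // The first lookup runs on document; later ones on a shadow root that may be missing
        var access = i == 0 ? "." : "?.";
        // A JSON string literal is also a valid JavaScript string literal
        expression += $"{access}querySelector({JsonSerializer.Serialize(selectors[i])})";
        if (i < selectors.Length - 1)
            expression += "?.shadowRoot"; // step into the host's open shadow root
    }
    return expression;
}

// BuildShadowPath("#userName", "#kils") returns:
// document.querySelector("#userName")?.shadowRoot?.querySelector("#kils")
```

Passed as the expression of a Runtime.evaluate command with returnByValue = false, this should yield the input's objectId in one round trip. The step-by-step version remains easier to debug, since each stage reports its own failure.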

Insert text

Finally, we type the text into the input exactly as in the first example; the only difference is that the objectId now comes from the shadow root query.

if (string.IsNullOrEmpty(inputObjectId)) return;
await InsertText(inputObjectId, ws, "John Doe", 2);
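To confirm that the text actually landed (typed input is easy to lose if the page re-renders the field), you can read the value back. A hedged sketch: call Runtime.callFunctionOn with function() { return this.value; } and returnByValue = true, so Chrome serializes the value into result.result.value instead of returning an objectId. BuildReadValue and ReadValue are hypothetical names.

```csharp
using System.Text.Json;

// Command that returns the element's current value as plain JSON (not an objectId).
static object BuildReadValue(int id, string objectId) => new
{
    id,
    method = "Runtime.callFunctionOn",
    @params = new
    {
        objectId,
        functionDeclaration = "function() { return this.value; }",
        returnByValue = true
    }
};

// Extracts result.result.value from the reply; null if there is no string value.
static string? ReadValue(string response)
{
    using var doc = JsonDocument.Parse(response);
    var remote = doc.RootElement.GetProperty("result").GetProperty("result");
    return remote.TryGetProperty("value", out var value) && value.ValueKind == JsonValueKind.String
        ? value.GetString()
        : null;
}

// Usage with the article's helpers:
// var reply = await SendAndReceive(ws, BuildReadValue(2, inputObjectId));
// Console.WriteLine(ReadValue(reply));
```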

Here's the result:

[Image: text typed into the username input]

Bonus: Capturing a Screenshot

In the previous part of this series, several readers asked how to capture screenshots, so here's how it's done. It's surprisingly straightforward.

To take a screenshot, decide where the image should be saved and send a single command. The snippet below saves it to the project folder: with the default build output layout, AppContext.BaseDirectory is bin/<Configuration>/<TargetFramework>/, so climbing three levels above that directory reaches the project root.

var projectRoot = Directory.GetParent(AppContext.BaseDirectory)?.Parent?.Parent?.Parent?.FullName;
var screenshotPath = Path.Combine(projectRoot ?? ".", "screenshot.png");
await CaptureScreenshot(ws, screenshotPath, 3);

The Chrome DevTools Protocol has a built-in command for this, Page.captureScreenshot. You can choose the image format (png, jpeg or webp) and, optionally, a clip rectangle to capture only part of the page. The command has no delay option of its own, so anything that must finish first, such as typing or animations, has to complete before you send it; a Task.Delay beforehand is the simplest way to allow for that.

The reply carries the entire image as a base64 string in result.data, which is far larger than the 4 KB buffer used by ReceiveMessage(). SendAndReceiveLarge() therefore reads the message in 8 KB chunks and keeps going until the WebSocket reports the end of the message.

        private static async Task<string> SendAndReceiveLarge(ClientWebSocket ws, object command)
        {
            var message = JsonSerializer.Serialize(command);
            var bytes = Encoding.UTF8.GetBytes(message);
            await ws.SendAsync(new ArraySegment<byte>(bytes), WebSocketMessageType.Text, true, CancellationToken.None);

            var buffer = new byte[8192];
            using var payload = new MemoryStream();

            // Keep reading 8 KB chunks until the WebSocket reports the end of the message
            WebSocketReceiveResult result;
            do
            {
                result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
                payload.Write(buffer, 0, result.Count);
            }
            while (!result.EndOfMessage);

            // Decode once at the end so a multi-byte UTF-8 character split across chunks stays intact
            return Encoding.UTF8.GetString(payload.ToArray());
        }
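The snippet above calls CaptureScreenshot(), whose body isn't listed. Here is a hedged sketch of how the pieces might fit together, not necessarily the author's version: send Page.captureScreenshot (here with format "png"), read the reply with SendAndReceiveLarge(), then base64-decode result.data and write the bytes to disk. To keep it testable without a browser, the sketch splits out the pure parts; BuildCaptureScreenshot and SaveScreenshot are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Page.captureScreenshot request; format may be "png", "jpeg" or "webp".
static object BuildCaptureScreenshot(int id) => new
{
    id,
    method = "Page.captureScreenshot",
    @params = new { format = "png" }
};

// The reply carries the image as a base64 string in result.data.
static void SaveScreenshot(string response, string path)
{
    using var doc = JsonDocument.Parse(response);
    var base64 = doc.RootElement.GetProperty("result").GetProperty("data").GetString()
                 ?? throw new InvalidOperationException("Screenshot reply has no data.");
    File.WriteAllBytes(path, Convert.FromBase64String(base64));
}

// A possible CaptureScreenshot built from these parts and the article's SendAndReceiveLarge():
// static async Task CaptureScreenshot(ClientWebSocket ws, string path, int id)
// {
//     var reply = await SendAndReceiveLarge(ws, BuildCaptureScreenshot(id));
//     SaveScreenshot(reply, path);
// }
```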

The screenshot will be saved in the project's root directory. Let's take a look.

[Image: the captured screenshot]

Conclusion

In this article, I've shown how to interact with the DOM tree using the Chrome DevTools Protocol directly. At first glance, this approach looks more complex than using tools like Selenium WebDriver or PuppeteerSharp (which itself drives Chrome over this protocol). In return, you get direct access to every protocol command and full control over how the browser is driven, without relying on third-party libraries.

Additionally, the Chrome DevTools Protocol is regularly updated by the Chrome team, ensuring you always have access to the latest features and capabilities. This contrasts with third-party libraries, which may lag behind in updates or require adjustments to stay compatible. While the learning curve may be steeper, the power and independence it provides make it a compelling choice for advanced browser automation tasks.


As always, you can access the full source code by clicking this link. Feel free to explore, experiment, and adapt it to your needs.

In the next part of this series, I'll show you how to interact with closed shadow roots, which are even more restrictive than open ones. I'll also include a small bonus. Stay tuned for more insights and practical examples! Happy coding! 🚀
