Web Scraping using C#
In our previous blog, we covered the basics of web scraping along with its use cases. In this article, we look at how to do web scraping using C#. C# is a widely used programming language for developing web-based, Windows-based, and console-based applications, and it also provides options for web scraping. There are a few ways to get data from a website, such as through an API or through web scraping, and C# supports both. A single C# application can therefore pull data from multiple sites, some through an API and some through web scraping.
Projects that already have an established C# codebase need not move to a different language for web scraping. C# also makes it easier to link the scraped data with databases, APIs, front-end systems, and so on, since all of these can be connected from a C# backend. C# runs on .NET, and many .NET libraries are available for web scraping, backed by an active community. This makes it easier for developers to get the job done and to turn to community forums when issues arise. Let us now dive deeper and understand web scraping with C#.
Table of contents
- Web scraping with C#
Web scraping using C#
Today, we are going to do web scraping with HTML parsing. In this approach, the web scraper looks for content inside specific HTML tags and retrieves the required data.
We are going to scrape https://coinmarketcap.com/. This website lists cryptocurrency information such as the current price, the percentage change over the last 24 hours and 7 days, market capitalization, and volume for a large number of currencies.
Open Visual Studio, click on ‘Create a new project’, and choose the ‘Windows Forms App (.NET Framework)’ project template. Now click on the ‘Create’ option.
Note - Here, we are choosing Windows Forms. The same functionality can also be achieved with a console application. Choose whichever project template suits you.
On the next screen, provide a valid name for the solution and the path where the solution should be created. In this example, we are going with .NET Framework version 4.7.2. Either .NET Core or .NET Framework can be chosen, based on your convenience.
Once the project is created, it contains a Form1.cs file, which holds the design of the Windows form. Since we are focusing on the web scraping feature, we are skipping the creation of buttons and text fields in the form. However, if the URL to scrape needs to be supplied at run time, you can add a text box (input field) and a button (for submission) so that users can enter a URL of their choice for scraping.
Select the Form1.cs file in Solution Explorer and press F7. This opens the code behind Form1.cs, which was created as part of the project template.
Now, to ease the handling of the HTML page data in the C# code, we are going to install a NuGet package that helps us locate and navigate to a particular node in the HTML fetched from the website.
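Package to install - HtmlAgilityPack (open the NuGet package manager for the solution, search for the package, and install it in the solution). Alternatively, run Install-Package HtmlAgilityPack in the Package Manager Console.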
The next step is to add the code that connects to the URL and grabs the page data from the webpage. In this example, we are going to use HttpClient. The URL to call is hard-coded, and client.GetStringAsync fetches the page content from the website, which is then stored in a local variable. All of these actions live in the method GetDataFromWebPage.
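The original listing for this step is not reproduced here, so the following is only a minimal sketch of what a GetDataFromWebPage method could look like. The class name PageFetcher and the Main entry point are assumptions added for illustration. It is written as a small console-style program so it can be tried outside the form.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper class; only the method name GetDataFromWebPage comes from the article.
public static class PageFetcher
{
    // One shared HttpClient is reused for all requests made by the application.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the page at the given URL and returns its HTML as a string.
    // GetStringAsync throws HttpRequestException if the response status is not a success code.
    public static async Task<string> GetDataFromWebPage(string url)
    {
        return await Client.GetStringAsync(url);
    }

    public static async Task Main()
    {
        // The URL is hard-coded, as described in the article.
        string html = await GetDataFromWebPage("https://coinmarketcap.com/");
        Console.WriteLine("Downloaded " + html.Length + " characters of HTML");
    }
}
```

In a Windows Forms project, the same method could be awaited from an async event handler instead of Main.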
Once the data is obtained from the webpage, parse the HTML using the HtmlDocument class that comes with HtmlAgilityPack. Do this by loading the HTML string into an HtmlDocument variable.
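As a rough illustration of this step, the sketch below loads an HTML string into an HtmlDocument. The class name HtmlLoader, the method name LoadDocument, and the sample markup are assumptions made for this example. In the real flow, the string would be the page content returned by the download step.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper class used only for this illustration.
public static class HtmlLoader
{
    // Parses an HTML string into an in-memory document tree.
    public static HtmlDocument LoadDocument(string html)
    {
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);
        return htmlDocument;
    }

    public static void Main()
    {
        // Placeholder markup standing in for the downloaded page; not real site data.
        string html = "<html><body><table><tbody>"
                    + "<tr><td>Coin A</td><td>$1.00</td></tr>"
                    + "</tbody></table></body></html>";

        HtmlDocument document = LoadDocument(html);

        // Count the <td> elements to confirm the markup was parsed into nodes.
        int cellCount = document.DocumentNode.Descendants("td").Count();
        Console.WriteLine("Parsed " + cellCount + " table cells");
    }
}
```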
After loading the HTML into the variable, navigate to the ‘tbody’ HTML tag. This is the tag under which the rows of cryptocurrency data are present.
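One way to reach the ‘tbody’ tag is an XPath query through SelectSingleNode, sketched below. The class and method names are hypothetical. Taking the first ‘tbody’ in the document is an assumption about the page layout, and the site's markup may change over time.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper class used only for this illustration.
public static class TableLocator
{
    // Returns the first <tbody> element in the document, or null when there is none.
    public static HtmlNode FindTableBody(HtmlDocument document)
    {
        return document.DocumentNode.SelectSingleNode("//tbody");
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        // Placeholder markup standing in for the downloaded page; not real site data.
        document.LoadHtml("<table><thead><tr><th>Name</th></tr></thead>"
                        + "<tbody><tr><td>Coin A</td></tr></tbody></table>");

        HtmlNode tableBody = FindTableBody(document);
        if (tableBody == null)
        {
            Console.WriteLine("No tbody element found");
            return;
        }
        Console.WriteLine("Found element: " + tableBody.Name);
    }
}
```

Checking for null before using the node avoids a NullReferenceException if the page structure is different from what the scraper expects.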
After obtaining the table body, get its child nodes. This gives the list of all the rows present in the table. To reach the required data, we then need to traverse the columns: for each row obtained, get its child nodes, which are the individual columns (cells).
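The row and column traversal described above might be sketched as follows. The names RowReader and ReadRows are hypothetical. Note one deliberate deviation from the text: the sketch uses Elements("tr") and Elements("td") instead of the raw ChildNodes collection, because ChildNodes can also contain whitespace text nodes between tags.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper class used only for this illustration.
public static class RowReader
{
    // Returns one list of cell texts per <tr> directly under the given table body.
    public static List<List<string>> ReadRows(HtmlNode tableBody)
    {
        var rows = new List<List<string>>();
        foreach (HtmlNode row in tableBody.Elements("tr"))
        {
            var cells = new List<string>();
            foreach (HtmlNode cell in row.Elements("td"))
            {
                // Decode HTML entities such as &amp; and trim surrounding whitespace.
                cells.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            }
            rows.Add(cells);
        }
        return rows;
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        // Placeholder markup with line breaks between tags; not real site data.
        document.LoadHtml("<table><tbody>\n"
                        + "<tr><td>Coin A</td><td>$1.00</td></tr>\n"
                        + "<tr><td>Coin B</td><td>$2.50</td></tr>\n"
                        + "</tbody></table>");

        HtmlNode tableBody = document.DocumentNode.SelectSingleNode("//tbody");
        foreach (List<string> cells in ReadRows(tableBody))
        {
            Console.WriteLine(string.Join(" | ", cells));
        }
    }
}
```

Which cell index holds the currency name or the price depends on the page's current column order, so it is best confirmed by inspecting the live markup.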
In this example, we take the currency name and its value from the scraped data and write them to a CSV file. The ParseHtml method has been modified to do this, and a new method, WriteDataToCSV, has been added to write the data to the CSV file.
Here is the sample output of the CSV file with two columns, one for the currency name and another for its value. Based on your requirement, you can extract whichever columns you need, save them in the format of your choice, and work on the data as required.
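A WriteDataToCSV method along these lines could look like the sketch below. Only the method name comes from the article. The class name CsvExport, the EscapeCsvField helper, the header row, the output file name, and the sample values are assumptions. Field escaping is included because formatted prices often contain commas.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper class; only the method name WriteDataToCSV comes from the article.
public static class CsvExport
{
    // Wraps a field in quotes when it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    public static string EscapeCsvField(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) < 0)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Writes a header line followed by one "name,value" line per entry.
    public static void WriteDataToCSV(string path, IEnumerable<KeyValuePair<string, string>> rows)
    {
        var builder = new StringBuilder();
        builder.AppendLine("Name,Price");
        foreach (KeyValuePair<string, string> row in rows)
        {
            builder.AppendLine(EscapeCsvField(row.Key) + "," + EscapeCsvField(row.Value));
        }
        File.WriteAllText(path, builder.ToString(), Encoding.UTF8);
    }

    public static void Main()
    {
        // Placeholder values for illustration; not real market data.
        var rows = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Coin A", "$1,234.50"),
            new KeyValuePair<string, string>("Coin B", "$2.50")
        };
        WriteDataToCSV("crypto.csv", rows);
        Console.WriteLine("Wrote " + rows.Count + " rows to crypto.csv");
    }
}
```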
With the power of C# and the NuGet package, the task of scraping the data from a website, filtering the required data, and writing it to a file turned out to be quite easy and effortless. Also, the number of lines of code is quite low for such a major task. Similarly, the same data can be manipulated, modified, updated, and stored in relational databases or any files of the user’s choice using C#.