Tuesday, March 31, 2020

What exactly happens over the network when you visit a website

1. You type www.cnn.com in the Internet Explorer browser
The browser interprets www.cnn.com as a request for the site's main page. Because no protocol was specified, it defaults to HTTP on port 80.
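
As a rough illustration of this defaulting step, the sketch below fills in the missing protocol, port and path for whatever was typed. The function name normalize and the DEFAULT_PORTS table are made up for this example; a real browser does considerably more (search keywords, HTTPS upgrades and so on).

```python
from urllib.parse import urlsplit

# Well-known default ports for the two web protocols.
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(typed, default_scheme="http"):
    """Return (scheme, host, port, path) for the text typed in the address bar."""
    # Without "scheme://" urlsplit would treat the whole text as a path,
    # so prepend the default scheme first.
    if "://" not in typed:
        typed = f"{default_scheme}://{typed}"
    parts = urlsplit(typed)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    path = parts.path or "/"   # an empty path means the main page
    return parts.scheme, parts.hostname, port, path

print(normalize("www.cnn.com"))
# ('http', 'www.cnn.com', 80, '/')
```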

2. The IE browser performs a DNS lookup to find the IP address of www.cnn.com
a. the browser's own DNS cache is searched first. If there is no match....
b. the browser makes a system call and the operating system searches for the IP address of www.cnn.com in its own cache. If it is not found there....
c. the operating system sends the query to its configured DNS server, often the home router, which checks its cache. If the IP address of www.cnn.com is not found there....
d. the query goes on to the DNS server of the Internet Service Provider, whose cache will almost certainly hold the IP address of www.cnn.com. If it is still not found there....
e. the DNS server of your ISP will begin a recursive search, from the root nameserver, through the .com top-level nameserver, to CNN's nameserver. Usually the root nameserver is not actually queried, since the ISP's DNS server will already have the addresses of the .com nameservers in its cache.
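
The cache chain above can be sketched as a list of dictionaries that are tried in order. This is only a toy model: the cache names, the resolve function and fake_recursive_lookup are invented for illustration, and real caches also expire entries after a time-to-live. In real code a single call such as socket.getaddrinfo hands the whole job to the operating system.

```python
def resolve(name, caches, recursive_lookup):
    """Try each cache in order; fall back to a full recursive lookup."""
    for label, cache in caches:
        if name in cache:
            return cache[name], label
    address = recursive_lookup(name)
    # Fill every cache on the way back so the next lookup is a hit.
    for _, cache in caches:
        cache[name] = address
    return address, "recursive lookup"

caches = [
    ("browser cache", {}),
    ("OS cache", {}),
    ("router cache", {}),
    ("ISP resolver cache", {}),
]

def fake_recursive_lookup(name):
    # Stand-in for the root -> .com -> CNN nameserver queries.
    return "157.166.248.11"

first = resolve("www.cnn.com", caches, fake_recursive_lookup)
second = resolve("www.cnn.com", caches, fake_recursive_lookup)
print(first)   # ('157.166.248.11', 'recursive lookup')
print(second)  # ('157.166.248.11', 'browser cache')
```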

Sidenote:
  • Round-robin DNS is a feature of DNS where multiple IP addresses are returned for a name instead of only one. At the time of writing, www.cnn.com maps to 4 IP addresses
  • Load balancers are used on major sites for high performance: each incoming request is directed to one of several servers so that the load is spread evenly
  • Geographic DNS helps improve scalability by mapping a domain name to different IP addresses depending on the geographic location of the client machine
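
A minimal sketch of the round-robin idea, assuming a nameserver that simply rotates its list of addresses after every answer. The class name RoundRobinRecord is invented here, and the 192.0.2.x addresses come from the range reserved for documentation, not from CNN.

```python
from collections import deque

class RoundRobinRecord:
    """Hands out the same addresses in a rotating order."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        current = list(self._addresses)
        # Move the first address to the back for the next query.
        self._addresses.rotate(-1)
        return current

record = RoundRobinRecord(["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"])
# Most clients use the first address in the answer.
firsts = [record.answer()[0] for _ in range(5)]
print(firsts)
# ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.4', '192.0.2.1']
```

Since most clients pick the first address, rotating the order spreads successive visitors across all four machines.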

3. The IE browser opens a TCP connection to 157.166.248.11 to retrieve the content of the main page

4. TCP/IP sends the request to the gateway, which forwards it towards CNN. Before that, TCP divides the request into packets so that it can cross the network, and it will also reassemble the packets sent back by the CNN server. The HTTP request being sent contains the following:
  • A GET request line and a "Host" header say that the main page of www.cnn.com is to be fetched
  • The browser identifies itself ("User-Agent" header)
  • It states what types of responses it will accept ("Accept" and "Accept-Encoding" headers)
  • The "Connection" header asks the server to keep the TCP connection open for further requests
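
To make the request concrete, here is a sketch that assembles the raw bytes of such a GET request. The function name and the User-Agent string "ExampleBrowser/1.0" are placeholders; a real browser sends more headers than these.

```python
def build_get_request(host, path="/", user_agent="ExampleBrowser/1.0"):
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",                     # what is being fetched
        f"Host: {host}",                            # which site it is fetched from
        f"User-Agent: {user_agent}",                # the browser identifies itself
        "Accept: text/html,application/xhtml+xml",  # acceptable response types
        "Accept-Encoding: gzip",                    # acceptable compression
        "Connection: keep-alive",                   # keep the TCP connection open
    ]
    # Header lines end with CRLF; an empty line ends the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

print(build_get_request("www.cnn.com").decode("ascii"))
```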

5. The CNN server handles the request
The server will receive the GET request, process it, assemble the front page and send back a response.

Well, actually it is much more complicated than that:
  • Web server software
The web server software (Apache, for example) receives the HTTP request and decides which request handler should be executed to handle it. A request handler is a program (written in PHP, Python or Ruby, for instance) that reads the request and generates the HTML for the response.
  • Request handler
The request handler reads the request, its parameters and cookies. It will read and possibly update some data stored on the server. Then the request handler will generate an HTML response.
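
The split between web server software and request handler can be sketched as a lookup table from paths to functions. Everything here (ROUTES, dispatch, the "visitor" cookie, the dictionary-shaped request) is invented for illustration and is not how Apache or CNN actually does it.

```python
from http.cookies import SimpleCookie

def front_page(request):
    # The handler reads the cookies sent with the request...
    cookie = SimpleCookie(request.get("cookie", ""))
    visitor = cookie["visitor"].value if "visitor" in cookie else "guest"
    # ...and generates the HTML for the response.
    return 200, f"<html><body>Hello, {visitor}</body></html>"

def not_found(request):
    return 404, "<html><body>Not found</body></html>"

# The "web server software" part: decide which handler gets the request.
ROUTES = {"/": front_page}

def dispatch(request):
    handler = ROUTES.get(request["path"], not_found)
    return handler(request)

print(dispatch({"path": "/", "cookie": "visitor=alice"}))
# (200, '<html><body>Hello, alice</body></html>')
```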

6. The CNN database server is checked for any content that needs to be sent

7. The CNN server-side script engine will pull out the front page from its cache and send back an HTML response
The "Content-Encoding" header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you will see the HTML page you would expect. In addition to compression, headers specify whether and how to cache the page, any cookies to set, privacy information, etc.
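
A small sketch of the compression step, using Python's gzip module on a made-up page. The helper decode_body and the sample HTML are assumptions for the example; the point is only that the browser looks at "Content-Encoding" before reading the body.

```python
import gzip

# A made-up page; repetitive HTML compresses very well.
html = b"<html><body>" + b"<p>Breaking news</p>" * 50 + b"</body></html>"

# What a server would send along with "Content-Encoding: gzip".
compressed = gzip.compress(html)
headers = {"Content-Encoding": "gzip", "Content-Type": "text/html"}

def decode_body(headers, body):
    """Undo the compression named in the Content-Encoding header, if any."""
    if headers.get("Content-Encoding") == "gzip":
        return gzip.decompress(body)
    return body

restored = decode_body(headers, compressed)
print(len(html), len(compressed), restored == html)
```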

8. An ID for the client (typically a session ID) is stored in the database and also sent back to the client, usually as a cookie, so that there is a record of who was serviced

9. On the client machine, TCP/IP reassembles the received packets and hands the complete data to the client browser

10. The IE browser begins rendering the website even before it has received the entire HTML document
As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files. Each of these URLs will go through a process similar to what the HTML page went through. So the browser will look up the CNN domain name, send a request to the URL, follow redirects, etc.
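
As an illustration of how the browser finds those other URLs, the sketch below scans a tiny made-up page for img, script and link tags. The class name ResourceFinder and all paths in the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ResourceFinder(HTMLParser):
    """Collects the URLs of images, scripts and stylesheets in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Relative URLs are resolved against the page's own URL.
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.urls.append(urljoin(self.base_url, attrs["href"]))

page = (
    '<html><head><link rel="stylesheet" href="/css/main.css">'
    '<script src="js/app.js"></script></head>'
    '<body><img src="http://img.example.com/logo.png"></body></html>'
)
finder = ResourceFinder("http://www.cnn.com/")
finder.feed(page)
print(finder.urls)
# ['http://www.cnn.com/css/main.css', 'http://www.cnn.com/js/app.js',
#  'http://img.example.com/logo.png']
```

The third URL lives on a different domain, so fetching it starts with a fresh DNS lookup.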

However, unlike the dynamic pages, the static files can be cached by the browser. Some of the files may be served up from cache, without contacting the web server at all. The browser knows how long to cache a particular file because the response that returned the file contains an "Expires" header. Additionally, each response may also contain an ETag header that works like a version number: when the browser asks again for a file it already has, it sends the stored ETag along, and if the file has not changed the server answers "304 Not Modified" without sending the file again.
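
The caching rules can be sketched with a toy server and a toy browser cache. This is a simplified model under stated assumptions: serve, BrowserCache and the "expires_in" field (a number of seconds standing in for the "Expires" date) are all invented for the example, and time is passed in as a plain number.

```python
def serve(resource, if_none_match=None):
    """Hypothetical server: return (status, headers, body)."""
    headers = {"ETag": resource["etag"], "expires_in": resource["expires_in"]}
    if if_none_match == resource["etag"]:
        return 304, headers, b""          # unchanged: send no body
    return 200, headers, resource["body"]

class BrowserCache:
    def __init__(self):
        self.entries = {}

    def fetch(self, url, resource, now):
        cached = self.entries.get(url)
        if cached and now < cached["expires_at"]:
            return "cache", cached["body"]   # still fresh: no request at all
        etag = cached["etag"] if cached else None
        status, headers, body = serve(resource, if_none_match=etag)
        if status == 304:
            body = cached["body"]            # reuse the stored copy
        self.entries[url] = {
            "etag": headers["ETag"],
            "expires_at": now + headers["expires_in"],
            "body": body,
        }
        return str(status), body

logo = {"etag": '"v1"', "expires_in": 60, "body": b"logo-v1"}
cache = BrowserCache()
url = "http://www.cnn.com/logo.png"   # hypothetical file
print(cache.fetch(url, logo, now=0))   # ('200', b'logo-v1')
print(cache.fetch(url, logo, now=30))  # ('cache', b'logo-v1')
print(cache.fetch(url, logo, now=90))  # ('304', b'logo-v1')
```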

11. The browser sends further asynchronous requests
The client continues to communicate with the server even after the page is rendered.
To update the content on your page, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. This pattern is referred to as AJAX (Asynchronous JavaScript and XML).
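
In the browser this request is built by JavaScript, but the idea of a "programmatically constructed" request can be shown in Python as well. The sketch below only builds the request object and does not send it; the URL, the payload and the function name build_async_request are made up for the example.

```python
import json
from urllib.request import Request

def build_async_request(url, payload=None):
    """Build (but do not send) a GET, or a POST carrying JSON data."""
    if payload is None:
        return Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical "special URL" that returns fresh headlines.
req = build_async_request("http://www.example.com/api/headlines", {"section": "world"})
print(req.get_method(), req.full_url)
# POST http://www.example.com/api/headlines
```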

12. TCP/IP delivers all the additional scripts and pages the browser requested

13. The browser places the different images retrieved at their positions on the page

14. There it is, you are reading the CNN news webpage



Summary
=======
  • Browser interprets www.cnn.com as a request for the main page. No protocol is specified, so the default is HTTP on port 80. It queries DNS for the CNN address
  • The DNS query goes to the ISP's servers and returns the address 157.166.248.11
  • Browser opens a TCP connection to 157.166.248.11 and requests the content of the main page
  • TCP/IP sends the request to the gateway to be forwarded. Before that, TCP divides the request into packets so that it can cross the network, and it will also reassemble the packets sent back by the CNN server
  • The CNN web server gets the request from this client
  • The CNN server-side script assembles the front page
  • The CNN database server is checked for any content that needs to be sent
  • However, the CNN server-side script engine will pull out the front page from its cache
  • An ID for the client is stored in the database and also sent to the client so that there is a record of who was serviced
  • On the client machine, TCP/IP reassembles the data and hands it to the client browser
  • The browser will also retrieve a bunch of additional scripts, pages and links from the CNN web server
  • The browser sends a separate HTTP request for each of these, and TCP/IP carries the requests to the server
  • DNS is consulted again for any other domain names these files are hosted on
  • TCP/IP delivers all the additional scripts and pages that the browser requested
  • The browser places the different images retrieved at their positions on the page
  • Browser then says: "Done" and there it is, the CNN web page to read



