Downloading HTTPS web content – the hard way (for Wireshark fans)

Problem: You want to download some content (e.g. a video) from a website that doesn’t want you to download that content, only to stream it. You may want to do this to view it later (maybe it expires after a while), or to view it offline or on a different device. Please note that circumventing the content owner’s wishes against downloading may be a legal problem, but there are less sophisticated ways of doing it (e.g. screen recording, either in software or with a camera pointed at your screen), so we’ll analyze it from a technical point of view.

Until recently, downloading video content was fairly easy: start the debugger (F12 in the browser), go to the Network tab, identify the video resource, copy the curl command and paste it somewhere for offline download. But sites got clever – they detect when you open the debugger and stop serving the video content, so you can’t find the video link.

The solution to this is more convoluted – we’re going to find the download URL by doing a packet capture, so that the browser can’t detect your snooping attempt. But the sites aren’t stupid either – most of them are HTTPS-enabled. So you need to decrypt the HTTPS traffic to look inside.

Fortunately, since you control the browser, you can tell it to dump the TLS session keys to a file: https://redflagsecurity.net/2019/03/10/decrypting-tls-wireshark/

So, you can start a new browser instance (on Linux) with:
  SSLKEYLOGFILE=.ssl.log chromium-browser

and navigate to the desired site, then load the page that holds the video you want to download.
Also prepare Wireshark to do packet capturing on your outgoing interface (full capture, not just headers). Make sure you configure Wireshark as described in the guide above, so that it decrypts TLS traffic with the keys from the same file the browser writes to.
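As a side note, the setting from that guide can also be put directly in Wireshark’s preferences file instead of clicking through the GUI. The path below is the usual one on Linux, and the value should be wherever your SSLKEYLOGFILE points (the key is named tls.keylog_file on recent Wireshark versions, ssl.keylog_file on older ones):

```
# ~/.config/wireshark/preferences
tls.keylog_file: /home/you/.ssl.log
```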

Press play on your video (it shouldn’t matter much whether you’re at the beginning or not, but it’s best to be) and let it play for a short while – 10–20 s. You may then pause the video and stop the capture. (An alternative would be to play the video fully in the browser while the capture is running and extract it via File -> Export Objects -> HTTP, but that might take a long time because you’d need to play the whole thing in your browser.)

Now comes the tricky part – where you kind of need to know your way around Wireshark. Your challenge is to find the data stream in your packet capture. It’s usually the largest transfer between you and a server, and you should be able to find it relatively easily by going to Statistics -> Conversations and sorting the TCP tab by Bytes. The largest transfer should be your desired content, and now you know the destination IP address. If you right click on it and select Apply as Filter -> Selected -> A<->B, you should see only the relevant traffic in Wireshark.

After the TLS handshake you should see an HTTP GET request that we need to “convert” into a curl command. Thankfully there’s an easy way to do that: select the Hypertext Transfer Protocol section in the GET packet -> Right click -> Copy -> All Visible Selected Tree Items and you should get something like this in your clipboard:

Hypertext Transfer Protocol
    GET /secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5 HTTP/1.1\r\n
    Host: hty4e3.vkcache.com\r\n
    Connection: keep-alive\r\n
    Origin:
    User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36\r\n
    Accept: */*\r\n
     [truncated]Referer:
    Accept-Encoding: gzip, deflate, br\r\n
    Accept-Language: en-US,en;q=0.9,ro;q=0.8\r\n
    \r\n
    [Full request URI:
    [HTTP request 1/1]
    [Response in frame: 2535]

You need to do some trimming in a text editor:
* remove the \r\n markers from the ends of the lines (can be done with find and replace)
* remove the Hypertext Transfer Protocol line
* remove any [truncated] markers
* remove everything after the lone \r\n or blank line (it signifies the end of the headers)
* remove the indentation so that everything is left-aligned
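If you’d rather not do the trimming by hand, a sed pipeline like this one does the same cleanup (a sketch – a sample dump is inlined below via a heredoc; in practice you’d feed it the file you saved from the clipboard):

```shell
# strip indentation and [truncated] markers, remove the literal \r\n endings,
# then drop the protocol/bracket lines and everything past the blank line
# that ends the headers
sed -e 's/^ *//' -e 's/^\[truncated\]//' -e 's/\\r\\n$//' <<'EOF' | sed -e '/^Hypertext Transfer Protocol/d' -e '/^\[/d' -e '/^$/,$d'
Hypertext Transfer Protocol
    GET /video/chunk HTTP/1.1\r\n
    Host: example.com\r\n
    Accept: */*\r\n
    \r\n
    [HTTP request 1/1]
EOF
```

Run on a real dump, the output should look like the cleaned-up block shown next.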

The end result should look like:

GET /secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5 HTTP/1.1
Host: hty4e3.vkcache.com
Connection: keep-alive
Origin:
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36
Accept: */*
Referer:
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,ro;q=0.8

Now we can use h2c (“headers to curl”) to convert it into a curl request: https://curl.haxx.se/h2c/. Simply paste the text into the form, click convert, and it should produce something like:

curl --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: " --header "Referer: " --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" "https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5"
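As an aside, if you’d rather not paste headers into a web form, a rough local approximation of what h2c does (my own sketch, not the actual tool) can be put together with awk – the request line becomes the path, the Host header becomes the base of an https:// URL, and everything else becomes a --header option:

```shell
# build a curl command from a cleaned request dump (sample inlined below;
# in practice feed it the trimmed header file from the previous step)
awk '
BEGIN    { printf "curl --compressed" }
NR == 1  { path = $2; next }               # request line: "GET /path HTTP/1.1"
/^Host:/ { host = $2 }                     # remember Host for the URL
/^Accept-Encoding:/ { next }               # --compressed covers this
NF       { printf " --header \"%s\"", $0 }
END      { printf " \"https://%s%s\"\n", host, path }
' <<'EOF'
GET /video/chunk HTTP/1.1
Host: example.com
Accept: */*
EOF
```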

Time to test it. Remember to redirect output to a file:

curl --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: " --header "Referer: " --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" "https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag5Num5" > test

It should have downloaded a file called test. Let’s see what it is:

$ file test
test: MPEG transport stream data
$ mediainfo test
General
ID                                       : 1 (0x1)
Complete name                            : test
Format                                   : MPEG-TS
File size                                : 4.71 MiB
Duration                                 : 19 s 960 ms
Overall bit rate mode                    : Variable
Overall bit rate                         : 1 977 kb/s
FileExtension_Invalid                    : ts m2t m2s m4t m4s tmf ts tp trp ty

In some cases the server might give you back gzipped data – file should tell you if that’s the case, and you’ll need to uncompress it to proceed. You should now be able to play the file in vlc to make sure the data is fine.
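If you want to script that check instead of eyeballing the file output, gzip streams always start with the magic bytes 1f 8b, so something like this works (a sketch – the file name test is the one from the curl step above):

```shell
# if the first two bytes are the gzip magic number (1f 8b),
# rename and decompress so we're left with the raw MPEG-TS data
if [ "$(head -c 2 test | od -An -tx1 | tr -d ' \n')" = "1f8b" ]; then
    mv test test.gz && gunzip test.gz
fi
```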

Now, there’s one more issue. For efficiency reasons, caching services like vkcache.com store large files in chunks (2–10 MB in size). Your web player knows how to request the next chunk, but our capture has only one. You’ll need to guess the other fragment names and download all of them. As you can see, the server file name is 1510638299qz9g3.mp666Frag5Num5. The most likely things to iterate on are Frag5 and Num5. We’ll try one, then the other, and if we don’t get different chunks we’ll try both. How many chunks can we expect? That depends on the length of your content – for 1 hour of content you can expect ~300 2 MB chunks. You can always try to download chunks that aren’t there; we’ll remove them later.
Note one little change we need to make: add -L (follow redirects) to the curl command line (it’s not suggested by default), because for some chunks the site will redirect you to some other storage server and you need to be able to follow it.
Let’s see what happens when we run this little script:

$ cat downloader.sh
#!/bin/bash

for F in `seq 5 5`;
do
    for N in `seq 0 300`;
    do
        echo "Downloading Frag $F Num $N"
        curl -L --compressed --header "Accept-Language: en-US,en;q=0.9,ro;q=0.8" --header "Connection: keep-alive" --header "Origin: " --header "Referer: " --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36" "https://hty4e3.vkcache.com/secip/0/1V6iYKTjWZJtAfOEC39TRg/ODAuOTcuMjM4Ljc3/1557435600/hls-vod-s1/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp666Frag${F}Num${N}" > Frag${F}Num${N}.ts
    done
done
At some point the chunks will start coming back with 0 bytes downloaded – that’s how you know when to stop.
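The loop could also stop itself instead of you watching it. A small sketch (the Frag/Num file names are the ones used above; real chunks are megabytes, while missing ones come back empty or as a small error page):

```shell
#!/bin/bash
# returns success when the named chunk is missing, empty, or suspiciously
# small (under 10 kB) - i.e. when the server had nothing left to give us
chunk_is_small() {
    [ ! -s "$1" ] || [ "$(wc -c < "$1")" -lt 10240 ]
}

# usage inside the downloader, right after the curl call:
#   chunk_is_small "Frag${F}Num${N}.ts" && break
```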
When it’s done you should be left with a bunch of Frag5Num***.ts files in your current directory. Take your time and test a few files (make sure they play in VLC and that file/mediainfo output makes sense).

Next, let’s delete “empty” files. They’re not exactly empty, but should contain an error response:

$ cat Frag5Num299.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /hls-vod/flv/api/files/videos/2017/11/14/1510638299qz9g3.mp4Frag5Num299.ts was not found on this server.</p>
</body></html>

$ find . -name "*.ts" -size -10k -print -delete

Now we need to concatenate the chunks into one MPEG-TS file with ffmpeg. First, generate the file list it needs with this script (adjust as needed):

$ cat concatenate.sh
#!/bin/bash
base=Frag5Num
> filelist.txt   # start with an empty list so reruns don't append duplicates
for N in `seq 0 300`;
do
    if [ -f "${base}${N}.ts" ]; then
        echo "file '${base}${N}.ts'" >> filelist.txt
    fi
done

$ bash ./concatenate.sh

We can now run ffmpeg to stitch the chunks together in one file:
$ ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.ts

So – what can the media provider do against this kind of attack? Well, lots, actually…
* they can use one-time URLs/headers that expire once used. But this would add complexity on their end and hurt their load balancers/caches – if every viewer gets unique URLs, cached content can’t be reused.
* they can use harder-to-guess chunk names. The player normally receives a list of them, so they don’t need to be consecutive. But then the attack would simply focus on intercepting that list, removing the guesswork (I was too lazy to look for it).
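For illustration, the first idea – expiring, signed URLs – is commonly built as an HMAC over the path plus an expiry timestamp. This is a generic sketch, not what any particular site does, and all the names in it are made up:

```shell
# server side: sign path + expiry with a secret only the server knows
secret="made-up-server-secret"
path="/videos/2017/11/14/1510638299qz9g3.mp4"
expires=$(( $(date +%s) + 3600 ))   # link valid for one hour
sig=$(printf '%s%s' "$path" "$expires" \
      | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')
echo "https://cdn.example.com${path}?e=${expires}&sig=${sig}"
# the edge server recomputes sig from path+e and rejects mismatches or
# links where e is in the past - no per-request server state needed
```

Note that this stays cache-friendly as long as the URL is the same for everyone during its validity window; making it truly one-time is what kills caching.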


Leave your ideas/suggestions in the comments below
