MIME detection from URL

Cristian Amarie

2006-02-05 17:15:11 UTC

Input: URL
Desired output: a true/false answer on whether the URL represents a text file.
For me, at least, the answer is true if the MIME type is successfully
detected and it is "text/plain".

(The actual problem is a little larger: I'm inside a BHO's ShowContextMenu
handler for an anchor, and I need to know whether the anchor's URL will end
up displaying a text file.)

For direct URLs such as http://www.rfc-editor.org/rfc/rfc1957.txt, passing
the URL to FindMimeFromData seems quite sufficient.

The trouble starts when the URL is redirected. For example, navigate to
http://www.rfc-editor.org/cgi-bin/rfcsearch.pl, enter an RFC number to
search for (such as 1957), choose RFC File ASCII+, and set the "RFC Content
Via" radio button to HTTP. The resulting link is
http://www.rfc-editor.org/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=1957&type=http&file_format=txt
Clicking that link makes IE redirect to the first URL above.

I tried three ways to determine whether the final redirection target of
the latter URL is a text file:
1. InternetOpen + InternetOpenUrl + InternetReadFile to read 8192 bytes.
The result was saved to a local file called C:\url (without any extension)
and the filename was passed to FindMimeFromData.
The result is E_FAIL.
(Naming the file C:\url.txt returns S_OK, but this is obviously arbitrary,
since FindMimeFromData also looks at the file extension.)
Call:
E_FAIL << FindMimeFromData(0, L"C:\\url", 0, 0, 0, 0, &wsMime, 0);
It looks to me as if FindMimeFromData does not check the file content in
this case.

2. The same as 1, but the downloaded buffer is passed to FindMimeFromData
instead of the local filename.
Call:
S_OK * << FindMimeFromData(0, 0, (LPVOID)pBuffer, cbRead, 0, 0, &wsMime, 0);
The return value is S_OK and wsMime is "text/plain".
Seems OK, but see below *...

3. Call FindMimeFromData, passing the URL directly.
Call:
E_FAIL << FindMimeFromData(0, pwUrl, 0, 0, 0, 0, &wsMime, 0);

If the RFC search is done via HTTP, the result seems OK.
But if the "RFC Content Via" radio button on the search page is set to FTP,
the data downloaded by InternetReadFile is

<html><head><title>Click to get to RFC 1957></head>
<body bgcolor="#FFFFFF"><center>
<h1>Your browser does not support redirection.</h1>
<p>You must click ...
etc.

which to me looks more like a "text/html" MIME type than "text/plain".

I want to determine whether the content is plain text (and not HTML, even
though HTML is text too, after all).
I keep rereading the MSDN documentation on MIME type detection, but I still
can't figure out why the HTML buffer above is detected as text/plain and not
text/html.

I would like to perform a more accurate text/plain detection (based on
partially downloaded WinInet content if absolutely necessary), but not
before ruling out the possibility that I am using FindMimeFromData
incorrectly.
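
A minimal sketch of what such a stricter check could look like, assuming the buffer has already been downloaded and a MIME type has already been reported for it. LooksLikeHtml and IsPlainTextCandidate are hypothetical helper names, not Windows APIs, and the tag list is an illustrative heuristic rather than the algorithm urlmon uses. It does not explain why the buffer is reported as text/plain; it only filters that case out afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: returns true when the first bytes of a downloaded
// buffer start with a common document-level HTML tag.
bool LooksLikeHtml(const char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size == 0)
        return false;

    // Inspect only a bounded prefix, lower-cased for case-insensitive matching.
    const std::size_t kMaxScan = 512;
    std::string head(data, std::min(size, kMaxScan));
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Skip leading whitespace; the markup is expected to come first.
    const std::size_t start = head.find_first_not_of(" \t\r\n");
    if (start == std::string::npos)
        return false;

    // Illustrative list only; real pages can start in many other ways.
    static const char* const kTags[] = {
        "<!doctype html", "<html", "<head", "<body", "<title"
    };
    for (const char* tag : kTags)
    {
        if (head.compare(start, std::strlen(tag), tag) == 0)
            return true;
    }
    return false;
}

// Hypothetical helper: accepts the buffer as plain text only if the reported
// MIME type is "text/plain" AND the content does not look like HTML.
bool IsPlainTextCandidate(const std::wstring& sniffedMime,
                          const char* data, std::size_t size)
{
    return sniffedMime == L"text/plain" && !LooksLikeHtml(data, size);
}
```

With the "Your browser does not support redirection" page above, IsPlainTextCandidate would return false even when the reported type is "text/plain", because the buffer starts with an html tag.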

Looking forward to any suggestions.

Many thanks,
Cristian Amarie

Scherbina Vladimir

2006-02-05 17:39:40 UTC

Hello, Cristian.

Using "InternetOpen + InternetOpenUrl + InternetReadFile" will not let you
detect the presence of a "Location" header. Why not make the request
yourself and read the Location header with HttpQueryInfo, as I advised here:
http://groups.google.com/group/microsoft.public.inetsdk.programming.wininet/browse_thread/thread/3c3dfd28eae3f52b/3fd0da342a3367c3?lnk=st&q=HttpQueryInfo+returning+wrong+status+code+200+not+301&rnum=1#3fd0da342a3367c3

This will let you walk through all the "Location" headers until you reach
the link to a *.txt file. You can then pass that URL to
FindMimeFromData.
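
The loop described above could be sketched roughly as follows. This is portable C++ with the network call abstracted away: LocationFetcher, ResolveFinalUrl and EndsWithTxt are hypothetical names, and the callback is assumed to wrap a WinInet request that does not follow redirects on its own (for instance HttpOpenRequest with INTERNET_FLAG_NO_AUTO_REDIRECT, then HttpSendRequest and HttpQueryInfo with HTTP_QUERY_LOCATION). The hop limit is an added safeguard, not something from the thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical callback: given a URL, returns the value of its "Location"
// response header, or an empty string when there is none (not a redirect,
// or not an HTTP URL at all).
using LocationFetcher = std::function<std::string(const std::string&)>;

// Follows "Location" headers starting at `url` and returns the last URL
// reached. Gives up after `maxHops` redirects so a redirect cycle cannot hang.
std::string ResolveFinalUrl(std::string url,
                            const LocationFetcher& fetchLocation,
                            int maxHops = 10)
{
    assert(fetchLocation);  // an empty std::function would throw when called
    for (int hop = 0; hop < maxHops; ++hop)
    {
        std::string next = fetchLocation(url);
        if (next.empty())
            break;  // no Location header: `url` is the final target
        url = std::move(next);
    }
    return url;
}

// Case-insensitive check for a ".txt" suffix on the resolved URL.
// A trailing query string or fragment would defeat this simple test.
bool EndsWithTxt(const std::string& url)
{
    static const std::string kSuffix = ".txt";
    if (url.size() < kSuffix.size())
        return false;
    return std::equal(
        kSuffix.begin(), kSuffix.end(),
        url.end() - static_cast<std::ptrdiff_t>(kSuffix.size()),
        [](char expected, char actual) {
            return expected == static_cast<char>(
                std::tolower(static_cast<unsigned char>(actual)));
        });
}
```

Keeping the HTTP call behind a callback makes the loop easy to exercise with a fake redirect table before wiring it to WinInet.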

--
Vladimir

Cristian Amarie

2006-02-06 06:29:23 UTC

I have tried both HTTP_QUERY_LOCATION and HTTP_QUERY_CONTENT_LOCATION,
but I get ERROR_HTTP_HEADER_NOT_FOUND (for the URL I'm testing with).
I'll keep trying; I probably need a few more passes before I can resolve
the redirected URL to the text file.

Thanks,
Cristian

Scherbina Vladimir

2006-02-06 06:41:14 UTC

Post the exact code snippet that fails and I'll try to help you figure out
what's wrong.
--
Vladimir

Cristian Amarie

2006-02-06 06:52:34 UTC

Already figured it out, sorry about the error - my mistake.

1. Get the "redirection" URL from the BHO + browser (
http://www.rfc-editor.org/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=1980&type=ftp&file_format=txt );
2. Call InternetCrackUrlA to split it into the
host name ( www.rfc-editor.org )
and the
relative URL (
/cgi-bin/rfcdoctype.pl?loc=RFC&letsgo=1980&type=ftp&file_format=txt );
3. Call InternetConnectA with the host name;
4. Call HttpOpenRequestA with the verb GET (your example used POST and I
copied it without thinking... sorry about that), passing the relative URL as
the 3rd parameter, lpszObjectName.
With HTTP_QUERY_LOCATION I get the real URL of the text file, which in
this case is
ftp://ftp.rfc-editor.org/in-notes/rfc1980.txt

HTTP_QUERY_CONTENT_LOCATION does return "header not found", but I believe
this is normal for this page.
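
To illustrate step 2, here is a rough, portable stand-in for the split into host name and relative URL. HttpUrlParts and SplitHttpUrl are hypothetical names used only for this sketch; on Windows, InternetCrackUrl does the real work, and as far as I know it reports the path and the query string in separate URL_COMPONENTS fields (lpszUrlPath and lpszExtraInfo), so the two would need to be joined before being passed as lpszObjectName. Ports, user info and https are deliberately left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical container for the two pieces used in steps 3 and 4.
struct HttpUrlParts
{
    std::string host;    // e.g. "www.rfc-editor.org", for InternetConnect
    std::string object;  // path plus query string, for HttpOpenRequest
};

// Splits an absolute "http://" URL (lower-case scheme only) into host and
// object. Returns false for any other scheme or for an empty host.
bool SplitHttpUrl(const std::string& url, HttpUrlParts& out)
{
    static const std::string kScheme = "http://";
    if (url.compare(0, kScheme.size(), kScheme) != 0)
        return false;

    const std::size_t hostBegin = kScheme.size();
    const std::size_t hostEnd = url.find_first_of("/?", hostBegin);

    HttpUrlParts parts;
    if (hostEnd == std::string::npos)
    {
        // No path at all: request the server root.
        parts.host = url.substr(hostBegin);
        parts.object = "/";
    }
    else
    {
        parts.host = url.substr(hostBegin, hostEnd - hostBegin);
        // A query directly after the host still needs a leading slash.
        parts.object = (url[hostEnd] == '?') ? "/" + url.substr(hostEnd)
                                             : url.substr(hostEnd);
    }
    if (parts.host.empty())
        return false;

    assert(parts.object.front() == '/');  // invariant relied on by callers
    out = parts;
    return true;
}
```

For the redirection URL in step 1 this yields the host www.rfc-editor.org and an object starting with /cgi-bin/rfcdoctype.pl, while the final ftp:// URL is rejected, which is a natural place to stop following redirects.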

Many many thanks,
Cristian

Scherbina Vladimir

2006-02-06 09:55:55 UTC

BTW, if you're inside a BHO, you can get the URLs in the BeforeNavigate
handler, without calling HttpOpenRequest, etc. (Of course, this only applies
when the user actually navigates to the URL.)

--
Vladimir

Cristian Amarie

2006-02-06 17:17:03 UTC

I know, but the point is to decide BEFORE the user clicks the link.
The decision determines whether to show a custom context menu item, "Open
as text", which opens the text file referred to by the URL in an ActiveX
control hosted by IE, much as Adobe Acrobat does.

Cristian
