Ruby character encoding when using Base64.encode

#1
Looking at the source of Ruby's `Base64.encode64`, I cannot determine what character encoding a string is converted to, if any, before the data is encoded in Base64. A UTF-8 string encoded in Base64 is going to look very different from a UTF-16 string encoded in Base64. Does Ruby make any promises regarding this operation?

#2
The [fine manual][1] has this to say:

> **encode64(bin)**
> Returns the Base64-encoded version of bin. This method complies with RFC 2045.

Section 6.8 of [RFC 2045][2] says:

> **6.8. Base64 Content-Transfer-Encoding**
>
> The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. [...]
>
> A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, "=", is used to signify a special processing function.)

So Base64 encodes *bytes* into ASCII. If those bytes actually represent a UTF-8 encoded string then the UTF-8 string will be broken down into individual bytes and those bytes will be converted to Base64; for example, if you have a UTF-8 string `'µ'` then you'll end up encoding the bytes `0xc2` and `0xb5` (in that order) to the Base64 representation `"wrU=\n"`. If you start out with a binary string `"\xc2\xb5"` (which just happens to match the UTF-8 version of `'µ'`) then you'll get the same `"wrU=\n"` output.

When you decode `"wrU=\n"`, you'll get the bytes `"\xc2\xb5"` and you'll have to know that those bytes are supposed to be UTF-8 encoded text rather than some arbitrary blob of bits. This is why separate content type and character set metadata are attached to the Base64.
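To make the `'µ'` example concrete, here is a minimal sketch using the standard `base64` library. The helper name `decode64_as_utf8` is made up for illustration; it only re-labels the decoded bytes and assumes the caller already knows they are UTF-8.

```ruby
require "base64"

# Hypothetical helper: decode Base64 and tag the resulting bytes as UTF-8.
# Base64 carries no charset information, so the caller must know this.
def decode64_as_utf8(encoded)
  Base64.decode64(encoded).force_encoding(Encoding::UTF_8)
end

text  = "\u00B5"       # 'µ' as a UTF-8 string
bytes = "\xC2\xB5".b   # the same two bytes, tagged as binary

p Base64.encode64(text)                # "wrU=\n"
p Base64.encode64(bytes)               # "wrU=\n"
p Base64.decode64("wrU=\n").bytes      # [194, 181]
p decode64_as_utf8("wrU=\n") == text   # true
```

Both inputs produce identical output because only the bytes are looked at; the encoding tag on the Ruby string plays no part.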

Similarly, if you have a UTF-16 string then it will be broken into bytes and those bytes will be encoded just like any other byte string. Of course this case is a little more complicated due to byte order issues but that's why we have content type and character set headers and BOMs.
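As a rough illustration of the byte order point, the sketch below transcodes the same character before encoding it. The helper `encode64_as` is a hypothetical name introduced here, not part of the `base64` library.

```ruby
require "base64"

# Hypothetical helper: transcode the text first, then Base64-encode
# whatever bytes that encoding produced.
def encode64_as(text, encoding)
  Base64.encode64(text.encode(encoding))
end

text = "\u00B5"   # 'µ'

p encode64_as(text, Encoding::UTF_8)      # "wrU=\n"  (bytes C2 B5)
p encode64_as(text, Encoding::UTF_16LE)   # "tQA=\n"  (bytes B5 00)
p encode64_as(text, Encoding::UTF_16BE)   # "ALU=\n"  (bytes 00 B5)
```

The same character yields three different Base64 strings, so whoever decodes the data needs to be told which encoding (and byte order) was used.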

The main point is that Base64 works with *bytes*, not characters. What those bytes represent (UTF-8 text, UTF-16 text, a PNG image, ...) is someone else's problem. Base64 just converts a byte stream to a subset of US-ASCII and then back to bytes; the format of those bytes must be specified separately.

---

I did some poking around in the source and the results might be of interest even if they're not completely relevant. The [`encode64` method][3] is simply this:

    def encode64(bin)
      [bin].pack("m")
    end

Then if you look through [`Array#pack`][4]:

    static VALUE
    pack_pack(VALUE ary, VALUE fmt)
    {
        /*...*/
        int enc_info = 1; /* 0 - BINARY, 1 - US-ASCII, 2 - UTF-8 */

and keep an eye on `enc_info`, you'll see that an `'m'` format leaves `enc_info` alone, so the packed string comes out as US-ASCII and `encode64` produces US-ASCII output as expected.
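This can be checked from Ruby without reading the C source. The sketch below uses a hypothetical wrapper `pack_m` around the `[bin].pack("m")` call shown above and compares it with `Base64.encode64` for inputs carrying different encoding tags.

```ruby
require "base64"

# Hypothetical wrapper around the same pack call that encode64 uses.
def pack_m(bin)
  [bin].pack("m")
end

inputs = [
  "\u00B5",                            # UTF-8 text
  "\xC2\xB5".b,                        # raw binary bytes
  "\u00B5".encode(Encoding::UTF_16LE)  # UTF-16LE text
]

inputs.each do |bin|
  packed = pack_m(bin)
  # The output is tagged US-ASCII no matter how the input was tagged.
  puts "#{packed.inspect} #{packed.encoding} #{packed == Base64.encode64(bin)}"
end
```

Each line should report `US-ASCII` and `true`, which matches the `enc_info` reading of the source.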


#3
An example of encoding and decoding a UTF-8 string in Base64:

    require 'base64'
    => true
    text = "intérnalionálização"
    => "intérnalionálização"
    text.encoding
    => #<Encoding:UTF-8>
    encoded = Base64.encode64(text)
    => "aW50w6lybmFsaW9uw6FsaXphw6fDo28=\n"
    encoded.encoding
    => #<Encoding:US-ASCII>
    decoded = Base64.decode64(encoded)
    => "int\xC3\xA9rnalion\xC3\xA1liza\xC3\xA7\xC3\xA3o"
    decoded.encoding
    => #<Encoding:ASCII-8BIT>
    decoded = decoded.force_encoding('UTF-8')
    => "intérnalionálização"
    decoded.encoding
    => #<Encoding:UTF-8>