How can I globally ignore invalid byte sequences in UTF-8 strings?

#1
I have a Rails application that has survived migrations since Rails version 1, and I would like to ignore **all** invalid byte sequences in it, to keep backwards compatibility.

**I can't know the input encoding**.


Example:

> "- Men\xFC -".split("n")
ArgumentError: invalid byte sequence in UTF-8
from (irb):4:in `split'
from (irb):4
from /home/fotanus/.rvm/rubies/ruby-2.0.0-rc2/bin/irb:16:in `<main>'

I can overcome this problem in one line, by using the following, for example:

> "- Men\xFC -".unpack("C*").pack("U*").split("n")
=> ["- Me", "ü -"]

However, I would like to always ignore invalid byte sequences and disable these errors globally, either in Ruby itself or in Rails.

#2
Encoding in Ruby 1.9 and 2.0 is a bit tricky. \xFC is the code for the special character ü in ISO-8859-1, and the byte FC also appears in the UTF-16 encoding of ü (`U+00FC`), though not in its UTF-8 encoding, which is the two bytes `C3 BC`. The behaviour could be an artifact of Ruby's pack/unpack functions. Packing and unpacking Unicode characters with the `U*` template string is not problematic:

>> "- Menü -".unpack('U*').pack("U*")
=> "- Menü -"

You can create the "wrong" string, i.e. a string with an invalid encoding, if you first unpack UTF-8 characters (`U*`) and then pack unsigned single bytes (`C*`):

>> "- Menü -".unpack('U*').pack("C*")
=> "- Men\xFC -"

This string no longer has a valid encoding. Apparently the conversion can be reversed by applying the two steps in the opposite order (a bit like operators in quantum physics):

>> "- Menü -".unpack('U*').pack("C*").unpack("C*").pack("U*")
=> "- Menü -"


In this case it is also possible to "fix" the broken string by first tagging it as ISO-8859-1 and then converting it to UTF-8, though I am not sure whether this works only by accident, because the byte happens to encode ü in that character set:

>> "- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
=> "- Menü -"
>> "- Men\xFC -".encode("UTF-8", 'ISO-8859-1')
=> "- Menü -"
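As an aside for anyone on Ruby 2.1 or later: the newer String#scrub handles invalid bytes directly, either replacing them with U+FFFD or letting a block decide (it is not available in the Ruby 2.0 release the question targets):

```ruby
s = "- Men\xFC -"                 # tagged UTF-8, but \xFC is invalid
p s.scrub                         # => "- Men\uFFFD -" (default replacement)
# a block receives each run of invalid bytes and returns its replacement:
p s.scrub { |bytes| bytes.unpack("C*").pack("U*") }  # => "- Menü -"
```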





#3
I don't think you can globally turn off the UTF-8 checking without much difficulty. I would instead focus on fixing up all the strings that enter your application, at the boundary where they come in (e.g. when you query the database or receive HTTP requests).

Let's suppose the strings coming in have the BINARY (a.k.a. ASCII-8BIT) encoding. This can be simulated like this:

s = "Men\xFC".force_encoding('BINARY') # => "Men\xFC"

Then we can convert them to UTF-8 using String#encode, replacing any invalid or undefined characters with the UTF-8 replacement character:

s = s.encode("UTF-8", invalid: :replace, undef: :replace) # => "Men\uFFFD"
s.valid_encoding? # => true
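This replacement is lossy for data that was actually valid UTF-8, as a quick sketch shows:

```ruby
# "ü" in UTF-8 is the two bytes C3 BC; tagged as BINARY, each of those
# bytes is undefined for the conversion and gets replaced separately.
s = "Menü".dup.force_encoding("BINARY")
p s.encode("UTF-8", invalid: :replace, undef: :replace)
# => "Men\uFFFD\uFFFD" -- one character became two replacement characters
```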

Unfortunately, the steps above would end up mangling a lot of valid UTF-8 codepoints, because their bytes would not be recognized. If you had a three-byte UTF-8 character like "\uFFFD", it would be interpreted as three separate bytes, and each would get converted to the replacement character. Maybe you could do something like this:

def to_utf8(str)
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end
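A quick sanity check of the helper (the definition is repeated so the snippet runs standalone):

```ruby
def to_utf8(str)
  str = str.force_encoding("UTF-8")
  return str if str.valid_encoding?
  str = str.force_encoding("BINARY")
  str.encode("UTF-8", invalid: :replace, undef: :replace)
end

p to_utf8("Menü".dup)     # => "Menü"      (valid UTF-8 passes through)
p to_utf8("Men\xFC".dup)  # => "Men\uFFFD" (the invalid byte is replaced)
```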

That's the best I could think of. Unfortunately, I don't know of a great way to tell Ruby to treat the string as UTF-8 and just replace all the invalid bytes.


#4
In Ruby 2.0 you can use the String#b method, a shorthand that returns a copy of the string with its encoding set to BINARY (ASCII-8BIT).
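Applied to the string from the question, for example:

```ruby
str = "- Men\xFC -"
parts = str.b.split("n")  # String#b returns a BINARY-encoded copy
p parts                   # => ["- Me", "\xFC -"], both tagged ASCII-8BIT
```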

#5
If you can configure your database/page/whatever to give you strings as ASCII-8BIT, the following will recover their real encoding.

Use Ruby's stdlib encoding guessing library. Pass all your strings through something like this:

require 'nkf'
str = "- Men\xFC -"
str.force_encoding(NKF.guess(str))

The NKF library will guess the encoding (usually successfully) and force that encoding onto the string. If you don't fully trust NKF, you can also build this safeguard around string operations:

begin
  str.split
rescue ArgumentError
  str.force_encoding('BINARY')
  retry
end

This falls back to BINARY if NKF guessed incorrectly. You can turn this into a method wrapper:

def str_op(s)
  begin
    yield s
  rescue ArgumentError
    s.force_encoding('BINARY')
    retry
  end
end
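Usage might look like this (note that force_encoding mutates the string passed in, so the fallback sticks for later operations):

```ruby
def str_op(s)
  begin
    yield s
  rescue ArgumentError
    s.force_encoding('BINARY')
    retry
  end
end

# the first split raises, the rescue retags the string, the retry succeeds
p str_op("- Men\xFC -".dup) { |s| s.split("n") }  # => ["- Me", "\xFC -"]
```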

#6
If you just want to operate on the raw bytes, you can try forcing the string's encoding to ASCII-8BIT/BINARY.

str.force_encoding("BINARY").split("n")

This isn't going to get your ü back, though, since your source string in this case is ISO-8859-1 (or something like it):

"- Men\xFC -".force_encoding("ISO-8859-1").encode("UTF-8")
=> "- Menü -"

If you want to get multibyte characters, you *have* to know what the source charset is.
Once you `force_encoding` to BINARY, you literally just have the raw bytes, so multibyte characters won't be interpreted as such.

If the data is from your database, you can change your connection mechanism to use an ASCII-8BIT or BINARY encoding; Ruby *should* flag them accordingly then. Alternately, you can monkeypatch the database driver to force encoding on all strings read from it. This is a massive hammer, though, and might be the absolutely wrong thing to do.

The *right* answer is to fix your string encodings. This may require a database fix, a connection-encoding fix in the database driver, or some combination thereof. All the bytes are still there, but if you're dealing with a given charset, you should, if at all possible, let Ruby know that your data is in that encoding. A common mistake is to use the mysql2 driver to connect to a MySQL database whose data is in latin1, but to specify a utf-8 charset for the connection. Rails then takes the latin1 data from the DB and interprets it as utf-8, rather than interpreting it as latin1, which could then be converted cleanly to UTF-8.
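The mislabeling can be reproduced in plain Ruby, independent of any database (a sketch):

```ruby
latin1 = "Men\xFC".b.force_encoding("ISO-8859-1")  # what is really stored
mislabeled = latin1.dup.force_encoding("UTF-8")    # what the wrong charset yields
p mislabeled.valid_encoding?  # => false -- hence the ArgumentErrors
p latin1.encode("UTF-8")      # => "Menü" -- the clean conversion path
```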

If you can elaborate on where the strings are coming from, a more complete answer might be possible. There are also global(-ish) Rails approaches to defaulting string encodings that may be worth investigating.


