Arabic text shows as question marks (?) #5661

Closed
opened 2025-11-02 06:32:15 -06:00 by GiteaMirror · 20 comments
Owner

Originally created by @zuhairamahdi on GitHub (Jun 30, 2020).

  • Gitea version (or commit ref):
  • Git version:
  • Operating system:
  • Database (use [x]):
    • PostgreSQL
    • MySQL
    • MSSQL
    • SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

Hi,
we just noticed that when we create a new issue or pull request all the Arabic text transforms to question marks.
I think that we can fix this by changing the database from varchar to be nvarchar in MS SQL.

Screenshots

arabictext

Originally created by @zuhairamahdi on GitHub (Jun 30, 2020). <!-- NOTE: If your issue is a security concern, please send an email to security@gitea.io instead of opening a public issue --> <!-- 1. Please speak English, this is the language all maintainers can speak and write. 2. Please ask questions or configuration/deploy problems on our Discord server (https://discord.gg/gitea) or forum (https://discourse.gitea.io). 3. Please take a moment to check that your issue doesn't already exist. 4. Please give all relevant information below for bug reports, because incomplete details will be handled as an invalid report. --> - Gitea version (or commit ref): - Git version: - Operating system: - Database (use `[x]`): - [ ] PostgreSQL - [ ] MySQL - [x] MSSQL - [ ] SQLite - Can you reproduce the bug at https://try.gitea.io: - [ ] Yes (provide example URL) - [ ] No - [ ] Not relevant - Log gist: ## Description Hi, we just noticed that when we create a new issue or pull request all the Arabic text transforms to question marks. I think that we can fix this by changing the database from `varchar` to be `nvarchar` in MS SQL. ## Screenshots ![arabictext](https://user-images.githubusercontent.com/6876269/86105923-9c7ec280-bac8-11ea-8173-e969f9464a8c.PNG) <!-- **If this issue involves the Web Interface, please include a screenshot** -->
GiteaMirror added the type/enhancement label 2025-11-02 06:32:15 -06:00
Author
Owner

@zeripath commented on GitHub (Jun 30, 2020):

This is going to be a problem with your database collation and character set

@zeripath commented on GitHub (Jun 30, 2020): This is going to be a problem with your database collation and character set
Author
Owner

@zuhairamahdi commented on GitHub (Jul 6, 2020):

@zeripath
it is a problem with the type in MS SQL as varchar doesn't include Arabic characters and need to be replaced to be nvarchar

@zuhairamahdi commented on GitHub (Jul 6, 2020): @zeripath it is a problem with the type in MS SQL as `varchar` doesn't include Arabic characters and need to be replaced to be `nvarchar`
Author
Owner

@zeripath commented on GitHub (Jul 6, 2020):

OK as a holding measure whilst we think about this:

https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15#remarks

If you have sites that support multiple languages:

Starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues.
If using a lower version of the SQL Server Database Engine, consider using the Unicode nchar or nvarchar data types to minimize character conversion issues.

AFAICS You can set the collation to a UTF-8 collation to make varchar and nvarchar essentially the same.

My concern with migrating to use NVARCHAR is that we're essentially doubling the size of our data when setting the collation properly would allow better encodings to be used by the db.

@lunny what are your thoughts?

@zeripath commented on GitHub (Jul 6, 2020): OK as a holding measure whilst we think about this: https://docs.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver15#remarks > If you have sites that support multiple languages: > > Starting with SQL Server 2019 (15.x), consider using a UTF-8 enabled collation to support Unicode and minimize character conversion issues. > If using a lower version of the SQL Server Database Engine, consider using the Unicode nchar or nvarchar data types to minimize character conversion issues. AFAICS You can set the collation to a UTF-8 collation to make varchar and nvarchar essentially the same. My concern with migrating to use NVARCHAR is that we're essentially doubling the size of our data when setting the collation properly would allow better encodings to be used by the db. @lunny what are your thoughts?
Author
Owner

@lunny commented on GitHub (Jul 7, 2020):

In fact I have ever thought xorm should provide an option to consider VARCHAR is NVARCHAR, but I don't know if anyone need that. Maybe it's the answer of this issue?

@lunny commented on GitHub (Jul 7, 2020): In fact I have ever thought xorm should provide an option to consider `VARCHAR` is `NVARCHAR`, but I don't know if anyone need that. Maybe it's the answer of this issue?
Author
Owner

@LukeOwlclaw commented on GitHub (Jul 9, 2020):

We are having the same issue for emoticons. They become ??

This problem cannot be reproduced on https://try.gitea.io/Luke2/Repo2/pulls/1

What kind of database is try.gitea.io using?

@LukeOwlclaw commented on GitHub (Jul 9, 2020): We are having the same issue for emoticons. They become ?? This problem cannot be reproduced on https://try.gitea.io/Luke2/Repo2/pulls/1 What kind of database is try.gitea.io using?
Author
Owner

@zeripath commented on GitHub (Jul 9, 2020):

@LukeOwlclaw what database are you using? If, as I suspect, you are using MySQL in which you need to change to use utf8mb4 as the default charset and use gitea convert to convert your database from utf8 to utf8mb4.

@zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem?

@zeripath commented on GitHub (Jul 9, 2020): @LukeOwlclaw what database are you using? If, as I suspect, you are using MySQL in which you need to change to use utf8mb4 as the default charset and use `gitea convert` to convert your database from utf8 to utf8mb4. @zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem?
Author
Owner

@zuhairamahdi commented on GitHub (Jul 12, 2020):

@zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem?

I don't think it is possible to change the collation to UTF-8 as we use SQL Server 2012 not 2019
though there is a long list of collation to use which break the data if we selected Arabic:
image

@zuhairamahdi commented on GitHub (Jul 12, 2020): > > @zuhairamahdi did changing the collation to a UTF8 enabled collation solve your problem? I don't think it is possible to change the collation to UTF-8 as we use SQL Server 2012 not 2019 though there is a long list of collation to use which break the data if we selected Arabic: ![image](https://user-images.githubusercontent.com/6876269/87247594-74467a80-c45d-11ea-97ad-851b17e11e6f.png)
Author
Owner

@zeripath commented on GitHub (Jul 12, 2020):

Googling I think you might want to try: Latin1_General_100_CI_AI_SC

@zeripath commented on GitHub (Jul 12, 2020): Googling I think you might want to try: `Latin1_General_100_CI_AI_SC`
Author
Owner

@LukeOwlclaw commented on GitHub (Jul 13, 2020):

@zeripath We are using SQL Server 2016. So we are facing the same problem as @zuhairamahdi.

@LukeOwlclaw commented on GitHub (Jul 13, 2020): @zeripath We are using SQL Server 2016. So we are facing the same problem as @zuhairamahdi.
Author
Owner

@zuhairamahdi commented on GitHub (Jul 13, 2020):

Googling I think you might want to try: Latin1_General_100_CI_AI_SC

I tried it in an instance in my local environment and did not fix the issue

@zuhairamahdi commented on GitHub (Jul 13, 2020): > Googling I think you might want to try: `Latin1_General_100_CI_AI_SC` I tried it in an instance in my local environment and did not fix the issue
Author
Owner

@zeripath commented on GitHub (Jul 13, 2020):

OK, I think we need to add an option to https://gitea.com/xorm/xorm to allow us to default to nvarchar as required. Looking at engine.Sync2 I think if you change your columns to NVARCHAR manually that xorm would just deal with it without changing them back or complaining too loud.

Of course - make a backup before you do this.

@zeripath commented on GitHub (Jul 13, 2020): OK, I think we need to add an option to https://gitea.com/xorm/xorm to allow us to default to nvarchar as required. Looking at `engine.Sync2` I think if you change your columns to NVARCHAR manually that xorm would just deal with it without changing them back or complaining too loud. Of course - make a backup before you do this.
Author
Owner

@PaulBol commented on GitHub (Jul 14, 2020):

I did a quick test and changed the data type for comments with

ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max)

Everything worked fine afterwards. I'm only not sure if the change might interfere with a future database update.

For us this change seems sufficient. We want to keep things like issue titles, user names and release notes in plain English. So there's little reason to encourage users to include fancy characters. In other cases using UTF-16 makes little sense like for (commit) hashes, email addresses and access tokens.

If helpful for anyone, here's a T-SQL command for generating a script for converting all varchar columns to nvarchar. But I'm not in a position to tell if applying it blindly is a good idea.

SELECT 'ALTER TABLE [' + TABLE_NAME + '] ALTER COLUMN [' + COLUMN_NAME + '] nvarchar(' + IIF(CHARACTER_MAXIMUM_LENGTH=-1,'max',CAST(CHARACTER_MAXIMUM_LENGTH AS varchar)) + ')' + IIF(IS_NULLABLE='NO',' NOT NULL','') FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE='varchar'
@PaulBol commented on GitHub (Jul 14, 2020): I did a quick test and changed the data type for comments with ```tsql ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max) ``` Everything worked fine afterwards. I'm only not sure if the change might interfere with a future database update. For us this change seems sufficient. We want to keep things like issue titles, user names and release notes in plain English. So there's little reason to encourage users to include fancy characters. In other cases using UTF-16 makes little sense like for (commit) hashes, email addresses and access tokens. If helpful for anyone, here's a T-SQL command for generating a script for converting all `varchar` columns to `nvarchar`. But I'm not in a position to tell if applying it blindly is a good idea. ```tsql SELECT 'ALTER TABLE [' + TABLE_NAME + '] ALTER COLUMN [' + COLUMN_NAME + '] nvarchar(' + IIF(CHARACTER_MAXIMUM_LENGTH=-1,'max',CAST(CHARACTER_MAXIMUM_LENGTH AS varchar)) + ')' + IIF(IS_NULLABLE='NO',' NOT NULL','') FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE='varchar' ```
Author
Owner

@zuhairamahdi commented on GitHub (Jul 14, 2020):

ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max)

Thanks @PaulBol
it actually worked:
image

hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from varchar to nvarchar

@zuhairamahdi commented on GitHub (Jul 14, 2020): > ```tsql > ALTER TABLE [comment] ALTER COLUMN [content] nvarchar(max) > ``` Thanks @PaulBol it actually worked: ![image](https://user-images.githubusercontent.com/6876269/87424608-eb5a4b00-c5e4-11ea-9401-eb61c704e20a.png) hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from `varchar` to `nvarchar`
Author
Owner

@PaulBol commented on GitHub (Jul 16, 2020):

hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from varchar to nvarchar

Note that the SELECT 'ALTER TABLE ... command that I shared above does this. But it will create a script that changes every varchar column which is not needed. E.g. for columns storing commit hashes with only characters [0-9a-f].

@PaulBol commented on GitHub (Jul 16, 2020): > hope that there is a way to do it automatically rather than going manually and inspecting the tables if there is a column that need to be changed from varchar to nvarchar Note that the `SELECT 'ALTER TABLE ...` command that I shared above does this. But it will create a script that changes *every* `varchar` column which is not needed. E.g. for columns storing commit hashes with only characters [0-9a-f].
Author
Owner

@lunny commented on GitHub (Jul 16, 2020):

In fact I have ever thought xorm should provide an option to consider VARCHAR is NVARCHAR, but I don't know if anyone need that. Maybe it's the answer of this issue?

I have said that. I think xorm should provide a special option for MSSQL

engine, err := NewEngineWithParams(driver, connStr, map[string]string{
 "DEFAULT_VARCHAR": "NVARCHAR",
})

Then all new columns will be created with nvarchar even if the tag on field of struct are blank or VARCHAR.

Of course, for existance columns, you have to convert them manually like above.

@lunny commented on GitHub (Jul 16, 2020): > In fact I have ever thought xorm should provide an option to consider `VARCHAR` is `NVARCHAR`, but I don't know if anyone need that. Maybe it's the answer of this issue? I have said that. I think xorm should provide a special option for MSSQL ```go engine, err := NewEngineWithParams(driver, connStr, map[string]string{ "DEFAULT_VARCHAR": "NVARCHAR", }) ``` Then all new columns will be created with `nvarchar` even if the tag on field of struct are blank or `VARCHAR`. Of course, for existance columns, you have to convert them manually like above.
Author
Owner

@zeripath commented on GitHub (Jul 16, 2020):

So I've looked at using that SQL and putting it directly in to a gitea convert command - but unfortunately it would require the constraints to be dropped and readded over time.

@zeripath commented on GitHub (Jul 16, 2020): So I've looked at using that SQL and putting it directly in to a gitea convert command - but unfortunately it would require the constraints to be dropped and readded over time.
Author
Owner

@lunny commented on GitHub (Jul 18, 2020):

I have sent a PR to xorm and will send a PR to gitea after that merged. Please review https://gitea.com/xorm/xorm/pulls/1741

@lunny commented on GitHub (Jul 18, 2020): I have sent a PR to xorm and will send a PR to gitea after that merged. Please review https://gitea.com/xorm/xorm/pulls/1741
Author
Owner

@lunny commented on GitHub (Jul 19, 2020):

I have sent a PR #12269 to try to resolve this issue.

@lunny commented on GitHub (Jul 19, 2020): I have sent a PR #12269 to try to resolve this issue.
Author
Owner

@zuhairamahdi commented on GitHub (Jul 19, 2020):

I have sent a PR #12269 to try to resolve this issue.

thanks @lunny will this change current columns already in the database? or it will work only for new instances?

@zuhairamahdi commented on GitHub (Jul 19, 2020): > I have sent a PR #12269 to try to resolve this issue. thanks @lunny will this change current columns already in the database? or it will work only for new instances?
Author
Owner

@lunny commented on GitHub (Jul 19, 2020):

@zuhairamahdi It's only for new instance or new column on exist instance. You have to convert old column manually.

@lunny commented on GitHub (Jul 19, 2020): @zuhairamahdi It's only for new instance or new column on exist instance. You have to convert old column manually.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/gitea#5661