Incorrect "slow" response message in Cache Configuration test #13947

Closed
opened 2025-11-02 10:58:04 -06:00 by GiteaMirror · 14 comments
Owner

Originally created by @H0llyW00dzZ on GitHub (Jan 10, 2025).

Description

When testing the Cache Configuration with Redis in the admin panel, the cache test keeps displaying the message: `Cache test successful, but response is slow: 6.725281ms.`

However, a response time of around 6ms should not be considered slow. The message seems to be misleading and may cause confusion for administrators.

Expected behavior:
The cache test should provide an accurate assessment of the response time and not label a response time of around 6ms as `slow`.

Gitea Version

v1.23.0

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

![Screenshot from 2025-01-10 09-34-01](https://github.com/user-attachments/assets/fd47477e-12e7-43ab-9460-8bb112c435ca)

Git Version

2.47.1

Operating System

Linux

How are you running Gitea?

Running on Kubernetes

Database

SQLite

GiteaMirror added the type/bug label 2025-11-02 10:58:04 -06:00

@wxiaoguang commented on GitHub (Jan 10, 2025):

Related: Add cache test for admins #31265 @6543

You can safely ignore that message, and I still think that test should be removed from admin panel.


@H0llyW00dzZ commented on GitHub (Jan 10, 2025):

Related: Add cache test for admins #31265 @6543

You can safely ignore that message, and I still think that test should be removed from admin panel.

@wxiaoguang It seems that the implementation in PR #31265 is incorrect. The test measures performance in microseconds (μs), which is a local measurement. However, in real-world scenarios, it should start from milliseconds (ms). Basically, any response time under 30ms, even if it extends to microseconds, can still be considered fast or low latency.


@TheFox0x7 commented on GitHub (Jan 10, 2025):

I can fix this, but the question remains how:

  1. Remove the test entirely per PR comments (Gitea won't start if the cache doesn't work so testing if it works is pointless)
  2. Replace it with pure information about the timing (so no warning).
  3. Change time to pass to 100ms so the test won't fail

Any other suggestions or ideas? I'd add a Prometheus histogram there, but then there's no way to get that information by default without enabling metrics and scraping (which would be an issue if switched to otel). And self-querying the endpoint feels weird.


@lunny commented on GitHub (Jan 10, 2025):

I prefer the solution 3.


@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

I can fix this, but the question remains how:

  1. Remove the test entirely per PR comments (Gitea won't start if the cache doesn't work so testing if it works is pointless)
  2. Replace it with pure information about the timing (so no warning).
  3. Change time to pass to 100ms so the test won't fail

Any other suggestions or ideas? I'd add a prometheus histogram there but then there's no way to get that information by default without enabling metrics and scarping (which would be an issue if switched to otel). And self querying the endpoint feels weird.

@TheFox0x7
Nah, don't remove it. It's pretty useful for testing.

Another idea is to change the current implementation by adjusting the ratio. For example, if the latency is under 30ms, it can still be considered fast or low latency. However, if the latency goes around 50ms to 100ms, then it can be considered slow or high latency. This is because under 30ms is considered an average stable network latency.


@wxiaoguang commented on GitHub (Jan 11, 2025):

Removing it should be fine because:

  • Gitea instance already does cache check during startup, it doesn't make sense to "test" it again and again
  • For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

Removing it should be fine because:

  • Gitea instance already does cache check during startup, it doesn't make sense to "test" it again and again
  • For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

Cache check during startup is different from cache testing repeatedly. When you press test again and again, the results will always vary, sometimes ranging from low to high latency.

For example:

![image](https://github.com/user-attachments/assets/fe4039b7-6522-4ece-8f59-68e512d06346)
![image](https://github.com/user-attachments/assets/867ca482-4c77-4b38-93ff-5b12e73c59eb)
![image](https://github.com/user-attachments/assets/6054be9a-d04e-4079-891e-79b1dda7aa5d)

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.


@wxiaoguang commented on GitHub (Jan 11, 2025):

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.

But:

For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.


@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.

But:

For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

No, you don't really need something like tracing and profiling in this case.

In PR https://github.com/go-gitea/gitea/pull/31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

```go
const (
	testCacheKey       = "DefaultCache.TestKey"
	SlowCacheThreshold = 100 * time.Microsecond
)

func Test() (time.Duration, error) {
	if defaultCache == nil {
		return 0, fmt.Errorf("default cache not initialized")
	}

	testData := fmt.Sprintf("%x", make([]byte, 500))

	start := time.Now()

	if err := defaultCache.Delete(testCacheKey); err != nil {
		return 0, fmt.Errorf("expect cache to delete data based on key if exist but got: %w", err)
	}
	if err := defaultCache.Put(testCacheKey, testData, 10); err != nil {
		return 0, fmt.Errorf("expect cache to store data but got: %w", err)
	}
	testVal, hit := defaultCache.Get(testCacheKey)
	if !hit {
		return 0, fmt.Errorf("expect cache hit but got none")
	}
	if testVal != testData {
		return 0, fmt.Errorf("expect cache to return same value as stored but got other")
	}

	return time.Since(start), nil
}
```

@wxiaoguang commented on GitHub (Jan 11, 2025):

No, you don't really need something like tracing and profiling in this case.

In PR #31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

Does it make sense? This time, you do a test, it shows that "the cache is fast", then some time later, there is some cache performance problem, then users report that "the response is slow", then you go to admin panel click "test", at that time the cache recovers, you still see "the cache is fast". Does it resolve any real world problem?


@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

No, you don't really need something like tracing and profiling in this case.
In PR #31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

Does it make sense? This time, you do a test, it shows that "the cache is fast", then some time later, there is some cache performance problem, then users report that "the response is slow", then you go to admin panel click "test", at that time the cache recovers, you still see "the cache is fast". Does it resolve any real world problem?

Of course, if the response times are accurate. However, the message `Cache test successful, but response is slow: 6.725281ms` can be misleading.

A latency of 6ms is generally considered fast or low latency, and it's well within acceptable performance standards. Labeling a response time under 30ms as "slow" might not be appropriate in most cases, as it can create confusion.


@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

So the PR #31265 only needs to improve this part:

```go
elapsed, err := cache.Test()
if err != nil {
    ctx.Data["CacheError"] = err
} else if elapsed > cache.SlowCacheThreshold {
    ctx.Data["CacheSlow"] = fmt.Sprint(elapsed)
}
```

Change the `SlowCacheThreshold` to:

```go
const SlowCacheThreshold = 30 * time.Millisecond
```

Then it might look like this:

```go
elapsed, err := cache.Test()
if err != nil {
    ctx.Data["CacheError"] = err
} else if elapsed > cache.SlowCacheThreshold {
    ctx.Data["CacheSlow"] = fmt.Sprintf("Cache response is slow: %v", elapsed)
} else {
    ctx.Data["CacheFast"] = fmt.Sprintf("Cache response is fast: %v", elapsed)
}
```

@TheFox0x7 commented on GitHub (Jan 11, 2025):

Does it make sense? This time, you do a test, it shows that "the cache is fast", then some time later, there is some cache performance problem, then users report that "the response is slow", then you go to admin panel click "test", at that time the cache recovers, you still see "the cache is fast". Does it resolve any real world problem

Of course, if the response times are accurate

But they aren't really. You measure a set of 3 operations in a one off time measurement and that's prone to outliers. To have it accurate, you'd need to measure the operations for a bit (say 10-100 times) and take an average from them, at which point you're just re-implementing metrics but without all the details and benefits of it.

@H0llyW00dzZ Maybe you can explain your use case better? Right now, the longer I look at the feature, the more I agree with @wxiaoguang that it's kind of pointless, so a real user scenario would show its usefulness better. In particular:

  • Do you believe the test is accurate when you use it?
  • How do you use it? once or refreshing the page/clicking the button multiple times?
  • Why wouldn't histogram in metrics endpoint cover this instead?

@H0llyW00dzZ commented on GitHub (Jan 11, 2025):

Does it make sense? This time, you do a test, it shows that "the cache is fast", then some time later, there is some cache performance problem, then users report that "the response is slow", then you go to admin panel click "test", at that time the cache recovers, you still see "the cache is fast". Does it resolve any real world problem

Of course, if the response times are accurate

But they aren't really. You measure a set of 3 operations in a one off time measurement and that's prone to outliers. To have it accurate, you'd need to measure the operations for a bit (say 10-100 times) and take an average from them, at which point you're just re-implementing metrics but without all the details and benefits of it.

@H0llyW00dzZ Maybe you can explain your usecase better? Right now the longer I look at the feature the more I agree with @wxiaoguang that it's kind of pointless, so a real user scenario would show the usefulness of it better. In particular:

  • Do you believe the test is accurate when you use it?
  • How do you use it? once or refreshing the page/clicking the button multiple times?
  • Why wouldn't histogram in metrics endpoint cover this instead?
  • Do you believe the test is accurate when you use it?

Yes, because the code in the `Test()` function is correct for checking the latency of cache operations:

```go
func Test() (time.Duration, error) {
	if defaultCache == nil {
		return 0, fmt.Errorf("default cache not initialized")
	}

	testData := fmt.Sprintf("%x", make([]byte, 500))

	start := time.Now()

	if err := defaultCache.Delete(testCacheKey); err != nil {
		return 0, fmt.Errorf("expect cache to delete data based on key if exist but got: %w", err)
	}
	if err := defaultCache.Put(testCacheKey, testData, 10); err != nil {
		return 0, fmt.Errorf("expect cache to store data but got: %w", err)
	}
	testVal, hit := defaultCache.Get(testCacheKey)
	if !hit {
		return 0, fmt.Errorf("expect cache hit but got none")
	}
	if testVal != testData {
		return 0, fmt.Errorf("expect cache to return same value as stored but got other")
	}

	return time.Since(start), nil
}
```

The function performs a series of cache operations (delete, put, and get) and measures the total time taken to complete them, providing an accurate measurement of the cache latency.

  • How do you use it?

I only use it when there are many crawlers viewing the source code.

  • Why wouldn't histogram in metrics endpoint cover this instead?

Do you mean the metrics endpoint from the Gitea configuration?
Reference: github-starred/gitea#13947