Problem Description
Sometimes, the Apache DolphinScheduler project management page fails to load. When inspecting in the browser, the request [http://ip:12345/dolphinscheduler/projects?pageSize=10&pageNo=1&searchVal=]
times out.
Troubleshooting Steps
1. Check DolphinScheduler Logs (No Exceptions Found)
tail /home/dolphinscheduler/api-server/logs/dolphinscheduler-api.log
2. Restart the DolphinScheduler api-server
Module
sh /home/dolphinscheduler/bin/dolphinscheduler-daemon.sh stop api-server
sh /home/dolphinscheduler/bin/dolphinscheduler-daemon.sh start api-server
Result: The page still fails to load.
3. Test the Timeout Request and Observe the Response
Testing reveals that data can be returned, but it takes over ten seconds.
4. Check the Source Code for Potential Slow Queries
Code location:
5. Use Arthas to Analyze Runtime Performance
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar
# Identify the DolphinScheduler API process; in my case, it was the first one.
Trace the runtime to find where the delay occurs. It was identified in the queryProjectListPaging
method.
trace org.apache.dolphinscheduler.api.service.impl.ProjectServiceImpl queryProjectListPaging -n 1
Inspecting the queryProjectListPaging
method:
6. Run the SQL Query and Analyze
Testing the SQL query reveals that it is not inherently slow.
SELECT * FROM t_ds_project p
WHERE 1=1 AND p.id IN (1, 2, 3, 5, 6, 8, 9, 12, 13, 14, 16, 17, 19, 21, 22, 23, 25, 26)
GROUP BY p.id
ORDER BY p.id DESC;
Testing the SQL being executed by DolphinScheduler, using Arthas and the CTBots MyBatis SQL Parser:
7. Identify the Slow Query
The slow SQL query was identified as follows:
SELECT
p.id, p.name, p.code, p.description, p.user_id, p.flag, p.create_time, p.update_time,
u.user_name AS user_name,
(SELECT COUNT(*) FROM t_ds_process_definition AS def WHERE def.project_code = p.code) AS def_count,
(SELECT COUNT(*) FROM t_ds_process_definition_log def, t_ds_process_instance inst
WHERE def.code = inst.process_definition_code
AND def.version = inst.process_definition_version
AND def.project_code = p.code
AND inst.state = 1) AS inst_running_count
FROM t_ds_project p
LEFT JOIN t_ds_user u ON u.id = p.user_id
WHERE 1=1
AND p.id IN (1, 2, 3, 5, 6, 8, 9, 12, 13, 14, 16, 17, 19, 21, 22, 23, 25, 26)
GROUP BY p.id, u.user_name
ORDER BY p.id DESC LIMIT 100;
Analysis using EXPLAIN
revealed that the bottleneck was in the inst_running_count
subquery.
The t_ds_process_definition
table contains approximately 280,000 rows, while t_ds_process_definition_log
has 1,000 rows.
8. Optimize the Query with Indexes
Adding an index to the t_ds_process_instance
table significantly improved performance:
CREATE INDEX qwl_diy_index
ON t_ds_process_instance (process_definition_code, process_definition_version, state);
9. Verify the Results
After adding the index, the query execution time improved, and the page loaded correctly.
By addressing the slow query with an appropriate index, the issue was resolved. If this article was helpful, feel free to like, bookmark, or share it! 😊
Top comments (0)