通过自定义 Kong 插件与 Consul 元数据实现浏览器到微服务的全链路追踪


我们的分布式追踪系统 Jaeger 已经上线运行了一段时间,但团队始终面临一个棘手的诊断难题。当用户报告前端某个操作响应缓慢时,我们能在 Jaeger 中看到从 API 网关 Kong 开始到后端各个微服务的完整调用链,却唯独缺失了最关键的一环:这条链路究竟是由前端哪个具体的用户行为触发的?链路数据在网关处凭空出现,仿佛是无源之水。我们知道问题出在从浏览器到网关这一段的上下文丢失,但真正的挑战在于,即便我们能将前端的 Trace ID 传递到 Kong,获取到的信息依然贫乏。在真实项目中,我们需要的不仅仅是“哪个服务被调用了”,而是“哪个版本的服务实例,在哪个环境下,被前端的哪个组件调用了”。这些宝贵的元数据,分散在我们的服务注册中心 Consul 和前端应用逻辑中,而标准的 OpenTelemetry 插件并不能将它们自动缝合起来。

问题的核心演变为:如何构建一个机制,在 API 网关层,利用前端传递的追踪上下文,动态地从 Consul 查询服务元数据,并将这些信息作为属性(Attributes)注入到当前的追踪 Span 中,从而极大地丰富 Jaeger 中的链路信息,实现从用户点击到数据库查询的端到端、富元数据诊断。这需要我们跳出标准插件的限制,亲手打造一个连接 OpenTelemetry 上下文与 Consul 服务元数据的桥梁。

第一步:搭建可复现的基础设施

要解决这个问题,首先需要一个能够完整复现该场景的本地环境。我们使用 Docker Compose 来编排所有依赖组件:Kong 作为 API 网关、Consul 作为服务发现与配置中心、Jaeger 用于追踪数据收集与展示,以及一个用 Go 编写的简单后端服务 echo-service。

# docker-compose.yml
version: '3.8'

services:
  jaeger:
    image: jaegertracing/all-in-one:1.41
    ports:
      - "16686:16686" # Jaeger UI
      - "4317:4317"   # OTLP gRPC receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  consul:
    image: consul:1.13
    ports:
      - "8500:8500" # Consul UI & HTTP API
    command: "agent -server -ui -client=0.0.0.0 -bootstrap-expect=1"

  echo-service:
    build:
      context: ./echo-service
    ports:
      - "8080" # Port is dynamic, exposed to host for debug only
    environment:
      - SERVICE_NAME=echo-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
      # This service will register itself to Consul
      - CONSUL_HTTP_ADDR=http://consul:8500
      # We add version metadata during registration
      - SERVICE_VERSION=1.2.3-beta
      - DEPLOY_REGION=us-east-1

  kong:
    image: kong:3.0
    ports:
      - "8000:8000" # Proxy
      - "8001:8001" # Admin API
    environment:
      KONG_DATABASE: 'off'
      KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
      KONG_DNS_RESOLVER: consul:8600 # Consul's DNS interface (8600), not its HTTP API port (8500)
      KONG_PLUGINS: bundled,consul-enricher # Enable our custom plugin
      KONG_LUA_PACKAGE_PATH: /usr/local/share/lua/5.1/?.lua;;
      KONG_LOG_LEVEL: debug
      # Kong 3.0: enable the built-in tracing instrumentation so the
      # opentelemetry plugin has spans to export; the OTLP endpoint itself
      # is configured on the plugin in kong.yml below.
      KONG_OPENTELEMETRY_TRACING: all
    volumes:
      - ./kong/config:/etc/kong/
      - ./kong/plugins/consul-enricher:/usr/local/share/lua/5.1/kong/plugins/consul-enricher
    depends_on:
      - jaeger
      - consul
      - echo-service

networks:
  default:
    name: kong-jaeger-consul-net

这个 docker-compose.yml 文件是整个方案的基石。关键配置在于:

  1. KONG_DNS_RESOLVER: consul:8600: 指示 Kong 使用 Consul 的 DNS 接口(8600 端口,注意不是 8500 的 HTTP API 端口)作为其 DNS 解析器。这使得 Kong 可以通过 echo-service.service.consul 这样的地址动态发现后端服务,本列表之后的小片段演示了这类 DNS 查询会返回什么。
  2. KONG_PLUGINS: bundled,consul-enricher: 声明加载一个名为 consul-enricher 的自定义插件。
  3. KONG_LUA_PACKAGE_PATH: 确保 Kong 能找到我们自定义插件的 Lua 代码。
  4. volumes: 我们将自定义插件的源代码目录挂载到 Kong 容器的插件路径下。
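
在动手写插件之前,可以先直观地看一下 Consul 的 DNS 接口会返回什么。下面是一个示意性的小脚本(假设在 OpenResty 环境中通过 resty 命令行运行,并且已将 Consul 的 8600/udp DNS 端口额外映射到本机,docker-compose 中默认没有映射),分别查询 SRV 与 TXT 记录——后文的插件正是依赖这两类记录获取实例地址与服务元数据:

-- verify-consul-dns.lua(示意脚本,非项目必需文件)
local resolver = require "resty.dns.resolver"

local r, err = resolver:new{
  nameservers = { { "127.0.0.1", 8600 } }, -- Consul 的 DNS 端口;按实际环境替换地址
  retrans = 3,
  timeout = 2000,
}
if not r then
  error("failed to create resolver: " .. (err or "unknown"))
end

-- SRV 记录:健康实例的地址与端口
local srv = r:query("echo-service.service.consul", { qtype = r.TYPE_SRV })
if srv then
  for _, ans in ipairs(srv) do
    print(string.format("SRV: target=%s port=%s", tostring(ans.target), tostring(ans.port)))
  end
end

-- TXT 记录:正文的插件假定服务元数据会以 key=value 文本的形式出现在这里
local txt = r:query("echo-service.service.consul", { qtype = r.TYPE_TXT })
if txt then
  for _, ans in ipairs(txt) do
    print("TXT: " .. tostring(ans.txt))
  end
end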

第二步:后端服务与服务注册

后端服务 echo-service 使用 Go 编写,它不仅需要处理业务请求,还承担着两个关键的云原生任务:集成 OpenTelemetry SDK 以便参与链路追踪,以及在启动时将自身注册到 Consul 并附带丰富的元数据。

// echo-service/main.go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/google/uuid"
	"github.com/hashicorp/consul/api"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.12.0"
)

var serviceID string
var tracer = otel.Tracer("echo-service")

func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(), otlptracegrpc.WithInsecure())
	if err != nil {
		return nil, fmt.Errorf("failed to create OTLP gRPC exporter: %w", err)
	}

	res, err := resource.New(context.Background(),
		resource.WithAttributes(
			semconv.ServiceNameKey.String(os.Getenv("SERVICE_NAME")),
		),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create resource: %w", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
	return tp, nil
}

func registerServiceInConsul() {
	serviceName := os.Getenv("SERVICE_NAME")
	serviceVersion := os.Getenv("SERVICE_VERSION")
	deployRegion := os.Getenv("DEPLOY_REGION")

	config := api.DefaultConfig()
	config.Address = os.Getenv("CONSUL_HTTP_ADDR")
	consulClient, err := api.NewClient(config)
	if err != nil {
		log.Fatalf("Failed to create consul client: %v", err)
	}

	serviceID = fmt.Sprintf("%s-%s", serviceName, uuid.New().String())
	port := 8080 // In a real app, this would be dynamic

	registration := &api.AgentServiceRegistration{
		ID:      serviceID,
		Name:    serviceName,
		Port:    port,
		Address: os.Getenv("HOSTNAME"), // Use hostname in docker network
		Meta: map[string]string{
			"version": serviceVersion,
			"region":  deployRegion,
		},
		Check: &api.AgentServiceCheck{
			HTTP:                           fmt.Sprintf("http://%s:%d/health", os.Getenv("HOSTNAME"), port),
			Interval:                       "10s",
			Timeout:                        "1s",
			DeregisterCriticalServiceAfter: "1m",
		},
	}

	if err := consulClient.Agent().ServiceRegister(registration); err != nil {
		log.Fatalf("Failed to register service with consul: %v", err)
	}
	log.Printf("Successfully registered service %s with Consul", serviceID)
}

func deregisterService(client *api.Client) {
	if err := client.Agent().ServiceDeregister(serviceID); err != nil {
		log.Printf("Failed to deregister service: %v", err)
	} else {
		log.Printf("Successfully deregistered service %s", serviceID)
	}
}

func main() {
	tp, err := initTracer()
	if err != nil {
		log.Fatalf("Failed to initialize tracer: %v", err)
	}
	defer func() {
		if err := tp.Shutdown(context.Background()); err != nil {
			log.Printf("Error shutting down tracer provider: %v", err)
		}
	}()

	registerServiceInConsul()

	// Graceful shutdown setup
	quit := make(chan os.Signal, 1)
	signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-quit
		log.Println("Shutting down server...")
		config := api.DefaultConfig()
		config.Address = os.Getenv("CONSUL_HTTP_ADDR")
		consulClient, _ := api.NewClient(config)
		deregisterService(consulClient)
		os.Exit(0)
	}()

	echoHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The otelhttp handler wrapper automatically creates a child span
		_, span := tracer.Start(r.Context(), "handle-echo-request")
		defer span.End()

		w.WriteHeader(http.StatusOK)
		w.Write([]byte("Request processed and traced.\n"))
	})
	
	healthHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})

	// Wrap handler with OpenTelemetry middleware
	otelHandler := otelhttp.NewHandler(echoHandler, "echo-server")

	http.Handle("/echo", otelHandler)
	http.Handle("/health", healthHandler)

	log.Println("Echo service listening on port 8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed: %v", err)
	}
}

这段 Go 代码的核心在于 registerServiceInConsul 函数。它从环境变量读取服务版本和部署区域,并将这些信息作为 Meta 字段注册到 Consul。这正是我们稍后希望在 Jaeger Span 中看到的数据。
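
服务启动后,可以顺手验证一下元数据确实写进了 Consul。除了直接打开 Consul UI,也可以用一小段 Lua 调用 Consul 的 Catalog HTTP API 来确认(示意脚本,基于 Kong 自带的 lua-resty-http 与 cjson,需在 OpenResty 环境中运行;正文的插件出于性能考虑走的是 DNS 路线,但用 HTTP API 做一次性验证很直观):

-- verify-consul-meta.lua(示意脚本,非项目必需文件)
local http  = require "resty.http"
local cjson = require "cjson.safe"

local httpc = http.new()
-- 假设 Consul 的 HTTP API 已映射到本机 8500 端口(与 docker-compose 一致)
local res, err = httpc:request_uri("http://127.0.0.1:8500/v1/catalog/service/echo-service", {
  method = "GET",
})
if not res then
  error("failed to query consul: " .. (err or "unknown"))
end

local instances = cjson.decode(res.body) or {}
for _, inst in ipairs(instances) do
  local meta = inst.ServiceMeta or {}
  print(string.format("instance %s -> version=%s region=%s",
    tostring(inst.ServiceID), tostring(meta.version), tostring(meta.region)))
end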

第三步:前端追踪上下文的生成与传递

为了模拟前端环境,我们使用一段 JavaScript 代码,它利用 OpenTelemetry JS SDK 创建一个父 Span,然后通过 fetch 发出请求。这段代码通常会由 Babel 和 Webpack 等工具打包进现代前端应用中。

// A simplified example of frontend tracing instrumentation
// In a real project, this would be part of your application bootstrap code.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { SimpleSpanProcessor, ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import { Resource } from '@opentelemetry/resources';
import { context, trace, SpanStatusCode } from '@opentelemetry/api';

// --- Configuration ---
const provider = new WebTracerProvider({
    resource: new Resource({ 'service.name': 'browser-app' }),
});

// For demo, we export to the console. In prod, this would go to a collector:
// const exporter = new OTLPTraceExporter({ url: 'http://jaeger-collector/v1/traces' });
// provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register({
    contextManager: new ZoneContextManager(),
    propagator: new W3CTraceContextPropagator(),
});

registerInstrumentations({
    instrumentations: [
        new FetchInstrumentation({
            // The fetch call below goes cross-origin to localhost:8000, so trace
            // header propagation must be explicitly allowed for that origin.
            propagateTraceHeaderCorsUrls: [/localhost:8000/],
        }),
    ],
});

const tracer = provider.getTracer('my-frontend-app');

// --- Business Logic ---
function performAction() {
    // Create a parent span for the user action
    const parentSpan = tracer.startSpan('user-clicks-submit-button');
    
    // Set attributes relevant to the frontend context
    parentSpan.setAttribute('user.id', 'user-12345');
    parentSpan.setAttribute('component.name', 'CheckoutForm');

    // Make the API call within the context of the parent span
    context.with(trace.setSpan(context.active(), parentSpan), async () => {
        try {
            // The FetchInstrumentation will automatically create a child span
            // and inject W3C Trace Context headers (traceparent, tracestate).
            const response = await fetch('http://localhost:8000/api/echo', {
                method: 'GET',
                headers: { 'Content-Type': 'application/json' },
            });
            console.log('API call successful:', await response.text());
        } catch (error) {
            console.error('API call failed:', error);
            parentSpan.recordException(error);
            parentSpan.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
        } finally {
            parentSpan.end(); // End the parent span when the action is complete
        }
    });
}

// Simulate a user action
document.getElementById('action-button').addEventListener('click', performAction);

这段代码的关键在于 W3CTraceContextPropagator 和 FetchInstrumentation。当 fetch 被调用时,它们会自动在 HTTP 请求头中注入 traceparent Header,格式类似于 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01。这个 Header 携带着 Trace ID 和 Parent Span ID,是实现跨服务上下文传递的标准方式。
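
顺带一提,如果想在网关侧的自定义逻辑里直接读取并解析这个 Header(例如做访问日志与 Trace 的关联),可以参考下面这个示意性的解析函数。parse_traceparent 是本文自拟的名字,并非 Kong 或 OpenTelemetry 提供的 API,仅用来说明 W3C Trace Context 的结构:

-- 示意:解析 W3C traceparent Header(格式为 version-traceid-parentid-flags)
-- 例如:00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
local function parse_traceparent(header)
  if type(header) ~= "string" then
    return nil, "missing traceparent header"
  end
  local version, trace_id, parent_id, flags =
    header:match("^(%x%x)%-(%x+)%-(%x+)%-(%x%x)$")
  if not trace_id or #trace_id ~= 32 or #parent_id ~= 16 then
    return nil, "malformed traceparent: " .. header
  end
  return {
    version        = version,
    trace_id       = trace_id,
    parent_span_id = parent_id,
    -- flags 的最低位表示是否被采样,这里只做最简化的判断
    sampled        = flags == "01",
  }
end

-- 在 Kong 插件中可以这样调用(kong.request.get_header 是 Kong PDK 的标准方法):
-- local ctx, err = parse_traceparent(kong.request.get_header("traceparent"))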

第四步:自定义 Kong 插件的实现

现在进入核心部分:编写一个 Lua 插件,它将在 Kong 的请求处理流程中,从 Consul 获取元数据并附加到 OpenTelemetry Span 上。

首先,定义插件的 schema,这决定了插件有哪些可配置项。

-- kong/plugins/consul-enricher/schema.lua
local typedefs = require "kong.db.schema.typedefs"

return {
  name = "consul-enricher",
  fields = {
    { protocols = typedefs.protocols_http },
    -- An (empty for now) config record is still expected by Kong's plugin loader;
    -- configuration options can be added here later.
    { config = {
        type = "record",
        fields = {},
      },
    },
  },
}

然后是插件的主体逻辑 handler.lua。Kong 插件的生命周期函数中最适合我们这个场景的是 access 阶段,此时请求即将被代理到上游服务,并且 Kong 的 OpenTelemetry 插件已经创建了当前的 Server Span。

-- kong/plugins/consul-enricher/handler.lua
local kong = kong

-- A small utility for pretty printing tables for debugging
local function pretty_print(t)
  local s = "{"
  for k, v in pairs(t) do
    s = s .. string.format(" [%s] = %s,", tostring(k), tostring(v))
  end
  return s .. " }"
end

local ConsulEnricherHandler = {
  PRIORITY = 100, -- Run late enough in the access phase; the request span is created earlier by Kong's tracing instrumentation
  VERSION = "0.1.0",
}

function ConsulEnricherHandler:access(conf)
  -- 1. Get the current request span created by Kong's tracing instrumentation
  -- (the opentelemetry plugin is responsible for exporting it). kong.tracing
  -- is part of the PDK in Kong 3.x; guard the call defensively anyway.
  local span = kong.tracing and kong.tracing.active_span and kong.tracing.active_span()
  if not span then
    kong.log.debug("consul-enricher: no active span found, skipping.")
    return
  end

  -- 2. Get the upstream service object that Kong has resolved for this request.
  local service = kong.router.get_service()
  if not service then
    kong.log.debug("consul-enricher: no service associated with the route, skipping.")
    return
  end

  -- 3. We are interested in services that are dynamically resolved by Consul.
  -- Their host will be in the format `servicename.service.consul`.
  -- We extract the service name from the host.
  local service_name = service.host:match("^([^%.]+)%.service%.consul$")
  if not service_name then
    kong.log.debug("consul-enricher: service host '", service.host, "' is not a Consul service, skipping.")
    return
  end
  
  kong.log.debug("consul-enricher: found Consul service: ", service_name)

  -- 4. Use Kong's built-in DNS client to query Consul for service metadata.
  -- This is a powerful, non-obvious feature. It respects Kong's DNS settings.
  -- The query returns a list of records for the service. We only need the first one.
  local records, err = kong.dns.query(service.host)
  if err or not records or #records == 0 then
    kong.log.err("consul-enricher: failed to resolve service from Consul: ", err)
    return
  end

  -- 5. The metadata is available in the TXT records of the Consul DNS response.
  -- It's a bit convoluted to parse, but it's reliable.
  -- Example TXT record: "consul-service-meta-version=1.2.3-beta"
  local txt_records, err = kong.dns.query(service_name .. ".service.consul", { qtype = kong.dns.TYPE_TXT })
  if err or not txt_records then
    kong.log.warn("consul-enricher: could not fetch TXT records for metadata: ", err)
  else
    for _, record in ipairs(txt_records) do
      local key, value = record.txt:match("consul%-service%-meta%-([^=]+)=(.+)")
      if key and value then
        local attribute_key = "consul.service.meta." .. key
        kong.log.debug("consul-enricher: setting span attribute: ", attribute_key, " = ", value)
        
        -- This is the magic line: add metadata to the current span.
        span:set_attribute(attribute_key, value)
      end
    end
  end
  
  -- Also add the resolved IP and Port as attributes for debugging.
  local target_ip = records[1].ip
  local target_port = records[1].port
  span:set_attribute("net.peer.ip", target_ip)
  span:set_attribute("net.peer.port", target_port)
  
  kong.log.debug("consul-enricher: enriched span for service ", service_name, " targeting ", target_ip, ":", target_port)
end

return ConsulEnricherHandler

代码逻辑剖析:

  1. kong.tracing.active_span(): 通过 Kong 3.x 的 tracing PDK 获取当前请求的活动 Span。该 Span 由 Kong 的追踪埋点创建,OpenTelemetry 插件负责将其导出到 Jaeger。
  2. kong.router.get_service(): 获取当前请求匹配到的 Kong Service 对象。
  3. service.host:match(...): 我们通过匹配 service 的 host 字段是否为 .service.consul 后缀来判断它是否是一个由 Consul 管理的服务。
  4. kong.dns.query(...): 这是整个插件的关键。我们使用 Kong 内置的、异步的 DNS 客户端去查询 Consul。这比自己实现一个 HTTP 客户端去请求 Consul API 更高效、更优雅,因为它复用了 Kong 的连接池和 DNS 缓存机制。
  5. 解析 TXT 记录:Consul 的 DNS 接口会将服务的 Meta 数据编码在 TXT 记录中。我们通过 Lua 模式匹配解析出这些键值对,本列表之后附有一个可独立运行的验证片段。
  6. span:set_attribute(...): 将从 Consul 获取的元数据,如 version 和 region,以 consul.service.meta.version 这样的键名设置到 Span 的属性中。
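
为了便于单独验证上面第 5 点中的解析模式,下面给出一个用普通 lua 解释器即可运行的小片段。其中的示例 TXT 字符串是按照 handler.lua 所假定的 consul-service-meta-<key>=<value> 编码格式虚构的:

-- parse-meta-demo.lua(独立验证脚本,非插件的一部分)
local samples = {
  "consul-service-meta-version=1.2.3-beta",
  "consul-service-meta-region=us-east-1",
  "some-unrelated-txt-record", -- 不匹配的记录会被安全地跳过
}

for _, txt in ipairs(samples) do
  local key, value = txt:match("consul%-service%-meta%-([^=]+)=(.+)")
  if key and value then
    print(("consul.service.meta.%s = %s"):format(key, value))
  end
end

-- 预期输出:
-- consul.service.meta.version = 1.2.3-beta
-- consul.service.meta.region = us-east-1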

第五步:整合与验证

最后一步是将所有配置整合起来,并验证最终效果。这是 Kong 的声明式配置文件 kong.yml:

# kong/config/kong.yml
_format_version: "3.0"
_transform: true

services:
  - name: echo-service-proxy
    # Use Consul's service discovery name
    host: echo-service.service.consul
    port: 8080
    protocol: http
    plugins:
      - name: opentelemetry
        config:
          # Kong's opentelemetry plugin exports OTLP over HTTP,
          # so point it at Jaeger's OTLP/HTTP receiver.
          endpoint: http://jaeger:4318/v1/traces
          # Sample everything at the gateway; let the backend decide what to keep.
          sampling_rate: 1
      - name: consul-enricher # Enable our custom plugin on this service

routes:
  - name: echo-route
    paths:
      - /api # strip_path removes the matched /api prefix, so the upstream receives /echo
    strip_path: true
    methods:
      - GET
    service: echo-service-proxy

现在,启动整个环境 docker-compose up。当所有服务正常运行后,在浏览器中打开一个简单的 HTML 页面,执行前面编写的 JavaScript 代码段来触发 fetch 请求。

接着,打开 Jaeger UI (http://localhost:16686)。搜索 browser-app 服务,你会看到一条完整的链路:

graph TD
    A[browser-app: user-clicks-submit-button] --> B[browser-app: GET]
    B --> C[kong: GET /api/echo]
    C --> D[echo-service: echo-server]
    D --> E[echo-service: handle-echo-request]

这本身已经是标准的链路追踪了。但真正的价值在于点击 kong: GET /api/echo 这个 Span。在它的 “Tags” 或 “Attributes” 标签页中,你将看到我们通过自定义插件注入的宝贵信息:

  • consul.service.meta.version: 1.2.3-beta
  • consul.service.meta.region: us-east-1
  • net.peer.ip: 172.x.x.x (echo-service 容器的 IP)
  • net.peer.port: 8080

当线上出现问题时,运维人员不再需要猜测是哪个版本的服务出了问题。他们可以直接在 Jaeger UI 中看到,这次缓慢的请求是被路由到了 1.2.3-beta 版本的、部署在 us-east-1 区域的服务实例上。结合前端 Span 中的 user.id 和 component.name,我们拥有了从用户界面交互到特定后端服务实例的完整、详细的调试信息。

方案局限性与未来展望

这个方案有效地解决了问题,但并非没有权衡。每次请求通过 Kong,插件都会对 Consul 进行一次或多次 DNS 查询。尽管 Kong 的 DNS 模块有缓存,在高并发场景下,这仍然可能成为一个性能瓶颈或故障点。在真实项目中,需要对插件的性能进行压测,并考虑为 Consul 查询结果添加一层内存缓存(例如使用 lua-resty-lrucache),并制定当 Consul 不可用时的优雅降级策略(例如,跳过增强步骤,只记录一条警告日志)。
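
下面是上述“内存缓存 + 优雅降级”思路的一个示意性草图,基于 lua-resty-lrucache(OpenResty 自带该库)。其中 cached_consul_lookup 为本文自拟的函数名,TTL 与容量等参数都只是示例值,需要结合压测结果调整:

-- kong/plugins/consul-enricher/cache.lua(示意草图,非生产级实现)
local lrucache = require "resty.lrucache"

-- 每个 nginx worker 各持有一份缓存,最多保留 512 个服务条目
local cache, err = lrucache.new(512)
if not cache then
  error("failed to create lrucache: " .. (err or "unknown"))
end

local CACHE_TTL = 30 -- 秒;服务元数据变化不频繁,可接受短暂的不一致

-- resolve_fn 封装实际的 Consul 查询(例如正文 handler 中的 kong.dns 调用)
local function cached_consul_lookup(service_host, resolve_fn)
  local cached = cache:get(service_host)
  if cached ~= nil then
    return cached
  end

  local ok, result = pcall(resolve_fn, service_host)
  if not ok or not result then
    -- 优雅降级:只记录一条警告并返回 nil,调用方跳过 Span 增强即可
    kong.log.warn("consul lookup failed for ", service_host, ", skip enrichment")
    return nil
  end

  cache:set(service_host, result, CACHE_TTL)
  return result
end

return cached_consul_lookup

由于缓存是 worker 级别的,各个 worker 会独立过期、独立回源,对元数据这类弱一致信息来说完全可以接受,同时避免了跨 worker 共享字典的锁开销。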

此外,该实现强依赖于 Kong 内部的 kong.dns API,未来 Kong 版本升级可能会带来兼容性风险。一个更具前瞻性的方向是探索使用 Kong 新的 Go 或 WebAssembly (WASM) 插件系统。用 Go 实现这个逻辑会更加健壮,测试也更方便,并且可以利用 Consul 官方的 Go SDK。WASM 则提供了跨语言的能力,让团队可以用 Rust 或 TinyGo 等语言编写高性能、安全沙箱化的插件,这或许是未来 API 网关可扩展性的最终形态。

