Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sQBwV-008Nia-GC for pgsql-general@arkaria.postgresql.org; Sat, 06 Jul 2024 20:23:35 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sQBwU-006BVt-5h for pgsql-general@arkaria.postgresql.org; Sat, 06 Jul 2024 20:23:34 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sQBwT-006BVl-Q8 for pgsql-general@lists.postgresql.org; Sat, 06 Jul 2024 20:23:33 +0000 Received: from mail-oa1-x36.google.com ([2001:4860:4864:20::36]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sQBwR-000mOT-0J for pgsql-general@lists.postgresql.org; Sat, 06 Jul 2024 20:23:32 +0000 Received: by mail-oa1-x36.google.com with SMTP id 586e51a60fabf-250ca14422aso1490547fac.0 for ; Sat, 06 Jul 2024 13:23:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1720297408; x=1720902208; darn=lists.postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=+sVQIgqp35G+ceO07m06g6Zp3+EfHUHckutwpkf0Y9A=; b=mwomsN9oeEP9mXzf8D5r1AB8IHsdElbIdam2HsOiap9QmgqGF5xSi8df5w8VvZTb1A OjfbnmJsnYEmxTN7e2KptzEBetVSK6gawENXDBQKgCl/R9ha+vdYWY6x+pumFK+BkG1k +Dy2rgqJu/45bG4g3Dg6esDQIenKtHqltkVMQeABcz6cIofj+0pkwk83OP4j7n0fKfSy J8jXHY6OG2tA9kBQy+JjHWXih5PhrjAsyKLgOiAp54HvqkJYn/TSZiXllMbsDhrrcO8S DqCu5HW2jYW0HSvwoYxdnOX57MykePEMTNzR0aNrx98S6nSYbvnicS9Wy525xJfTCy0c DUqw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1720297408; x=1720902208; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=+sVQIgqp35G+ceO07m06g6Zp3+EfHUHckutwpkf0Y9A=; b=ZHQNBHhvrDcvR4eafUva9+C1W8qN73NqziJ2kVvHmgCWVbEQDtxuSJLsVGDHfZdCGi OmFDJ7JHJqd2HikPGO8wW3WHSLgdo/ZsklHnZAJ5u04pcZr/53mFTGr0UPK5n1UvXjdj F2oi/RoSwIwSwWs/vDuKa/UBq29VSpqPCct4znKl30/eUgZNLVXDXUW7KVi+uDUWRUR9 hRtfQFgDjDyV4hEK9W1+lSG3vksabkDbcMK+FdZmi06i9sq8EVNLACk0Xlzbc6n5ntRl TxWV7FE3QOYNxYmOb9FXOAEKwEkkmMgSCwrlAOgFEJkMbaU67J8egYiZ0sQ2om3yGqe1 2CAA== X-Gm-Message-State: AOJu0YyBoXz58ZRYoCoHxsGsllygSk96g3sfAJwCs0aKyXkzOpTlq3eb 1fNNggW2tyWuSVdxS2ou6/pTjT62ZWNZoMXDwAfRhuAyQzT+3+hB7ckgcu9aNIOaCjHpCDX5nrz XWssuwtIJpjNFdB1xY1ZF7px2uX/Qqw== X-Google-Smtp-Source: AGHT+IEKtybJ8zXkuB/eNGPCrvPDkycvrauqzibbEjNvIeDxz9XTsejWFRmjwCOHi6BlfufgKvJmJo5xOY30RPyVSto= X-Received: by 2002:a05:6870:5b30:b0:25d:f0ba:eab7 with SMTP id 586e51a60fabf-25e2ba25c4emr4187533fac.18.1720297408400; Sat, 06 Jul 2024 13:23:28 -0700 (PDT) MIME-Version: 1.0 References: <11d5753c-578e-41a2-af17-6de956f03058@aklaver.com> In-Reply-To: From: Ron Johnson Date: Sat, 6 Jul 2024 16:23:17 -0400 Message-ID: Subject: Re: Load a csv or a avro? To: pgsql-general Content-Type: multipart/alternative; boundary="000000000000ef943f061c99f389" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000ef943f061c99f389 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sat, Jul 6, 2024 at 4:10=E2=80=AFPM sud wrote: > On Fri, Jul 5, 2024 at 8:24=E2=80=AFPM Adrian Klaver > wrote: > >> On 7/5/24 02:08, sud wrote: >> > Hello all, >> > >> > Its postgres database. We have option of getting files in csv and/or i= n >> > avro format messages from another system to load it into our postgres >> > database. The volume will be 300million messages per day across many >> > files in batches. >> >> Are dumping the entire contents of each file or are you pulling a >> portion of the data out? >> >> >> > Yes, all the fields in the file have to be loaded to the columns in the > tables in postgres. > But you didn't say *which* columns or *which* tables. If one row of CSV input must be split into multiple tables, then it might be pretty slow. > But how will that matter here for deciding if we should ask the data in > .csv or .avro format from the outside system to load into the postgres > database in row and column format? Again my understanding was that > irrespective of anything , the .csv file load will always faster as becau= se > the data is already stored in row and column format as compared to the > .avro file in which the parser has to perform additional job to make it r= ow > and column format or map it to the columns of the database table. Is my > understanding correct here? > Yes and no. It all depends on how well each input row maps to a Postgresql table. Bottom line: you want an absolute answer, but we can't give you an absolute answer, since we don't know what the input data looks like, and we don't know what the Postgresql tables look like. An AVRO file *might* be faster to input than CSV, or it might be horribly slower. And you might incompetently program a CSV importer so that it's horribly slow. We can't give absolute answers without knowing more details than the ambiguous generalities in your emails. --000000000000ef943f061c99f389 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Sat, Jul 6, 2024 at 4:10=E2=80=AFPM su= d <suds1434@gmail.com> wrot= e:
On Fri, Jul 5, 2024 at 8:24= =E2=80=AFPM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 7/= 5/24 02:08, sud wrote:
> Hello all,
>
> Its postgres database. We have option of getting files in csv and/or i= n
> avro format messages from another system to load it into our postgres =
> database. The volume will be 300million messages per day across many <= br> > files in batches.

Are dumping the entire contents of each file or are you pulling a
portion of the data out?



Yes, all the fields in the file have t= o be loaded to the columns in the tables in postgres.

But you didn't say which=C2=A0colu= mns or which=C2=A0tables.

If one row of CSV= input must be split into multiple tables, then it might be pretty slow.
=C2=A0
But how will that matter here f= or deciding if we should ask the=C2=A0data in .csv or .avro format from the= outside system to load into the postgres database in=C2=A0row and column f= ormat? Again my understanding=C2=A0was that irrespective=C2=A0of anything ,= the .csv file load will always faster as because the data is already=C2=A0= stored in row and column format as compared=C2=A0to the .avro file in which= the parser has to perform additional=C2=A0job to make it row and column fo= rmat or map it to the columns of the database table. Is my understanding=C2= =A0correct here?

Yes and = no.=C2=A0 It all depends on how well each input row maps to a Postgresql ta= ble.

Bottom line: you want an absolute answer, but= we can't give you an absolute answer, since we don't know what the= input data looks like, and we don't know what the Postgresql tables lo= ok like.

An AVRO file might=C2=A0be faster = to input than CSV, or it might be horribly slower.

And you might incompetently program a CSV importer so that it's horrib= ly slow.

We can't give absolute answers withou= t knowing more details than the ambiguous generalities in your emails.
--000000000000ef943f061c99f389--