Commit 96749554 authored by Daniel Stenberg

LIBCURL-STRUCTS: new document

This is the first version of this new document, detailing the seven
perhaps most important internal structs in libcurl source code:

  1.1 SessionHandle
  1.2 connectdata
  1.3 Curl_multi
  1.4 Curl_handler
  1.5 conncache
  1.6 Curl_share
  1.7 CookieInfo
parent 78574940
+41 −72
@@ -111,6 +111,9 @@ Windows vs Unix
Library
=======

 (See LIBCURL-STRUCTS for a separate document describing all major internal
 structs and their purposes.)

 There are plenty of entry points to the library, namely each publicly defined
 function that libcurl offers to applications. All of those functions are
 rather small and easy-to-follow. All the ones prefixed with 'curl_easy' are
@@ -135,16 +138,18 @@ Library
 options is documented in the man page. This function mainly sets things in
 the 'SessionHandle' struct.

 curl_easy_perform() does a whole lot of things:
 curl_easy_perform() is just a wrapper function that makes use of the multi
 API.  It basically curl_multi_init(), curl_multi_add_handle(),
 curl_multi_wait(), and curl_multi_perform() until the transfer is done and
 then returns.

 It starts off in the lib/easy.c file by calling Curl_perform() and the main
 work then continues in lib/url.c. The flow continues with a call to
 Curl_connect() to connect to the remote site.
 Some of the most important key functions in url.c are called from multi.c
 when certain key steps are to be made in the transfer operation.

 o Curl_connect()

   ... analyzes the URL, it separates the different components and connects to
   the remote host. This may involve using a proxy and/or using SSL. The
   Analyzes the URL, it separates the different components and connects to the
   remote host. This may involve using a proxy and/or using SSL. The
   Curl_resolv() function in lib/hostip.c is used for looking up host names
   (it does then use the proper underlying method, which may vary between
   platforms and builds).
@@ -160,10 +165,7 @@ Library
 o Curl_do()

   Curl_do() makes sure the proper protocol-specific function is called. The
   functions are named after the protocols they handle. Curl_ftp(),
   Curl_http(), Curl_dict(), etc. They all reside in their respective files
   (ftp.c, http.c and dict.c). HTTPS is handled by Curl_http() and FTPS by
   Curl_ftp().
   functions are named after the protocols they handle.

   The protocol-specific functions of course deal with protocol-specific
   negotiations and setup. They have access to the Curl_sendf() (from
@@ -182,10 +184,9 @@ Library
   be called with some basic info about the upcoming transfer: what socket(s)
   to read/write and the expected file transfer sizes (if known).

 o Transfer()
 o Curl_readwrite()

   Curl_perform() then calls Transfer() in lib/transfer.c that performs the
   entire file transfer.
   Called during the transfer of the actual protocol payload.

   During transfer, the progress functions in lib/progress.c are called at a
   frequent interval (or at the user's choice, a specified callback might get
@@ -207,33 +208,11 @@ Library
   used. This function is only used when we are certain that no more transfers
   are going to be made on the connection. It can also be closed by force, or
   it can be called to make sure that libcurl doesn't keep too many
   connections alive at the same time (there's a default amount of 5 but that
   can be changed with the CURLOPT_MAXCONNECTS option).
   connections alive at the same time.

   This function cleans up all resources that are associated with a single
   connection.

 Curl_perform() is the function that does the main "connect - do - transfer -
 done" loop. It loops if there's a Location: to follow.

 When completed, the curl_easy_cleanup() should be called to free up used
 resources. It runs Curl_disconnect() on all open connections.

 A quick roundup on internal function sequences (many of these call
 protocol-specific function-pointers):

  Curl_connect - connects to a remote site and does initial connect fluff
   This also checks for an existing connection to the requested site and uses
   that one if it is possible.

   Curl_do - starts a transfer
    Curl_handler::do_it() - transfers data
   Curl_done - ends a transfer

  Curl_disconnect - disconnects from a remote site. This is called when the
   disconnect is really requested, which doesn't necessarily have to be
   exactly after curl_done in case we want to keep the connection open for
   a while.

 HTTP(S)

@@ -316,48 +295,38 @@ Persistent Connections
   hold connection-oriented data. It is meant to hold the root data as well as
   all the options etc that the library-user may choose.
 o The 'SessionHandle' struct holds the "connection cache" (an array of
   pointers to 'connectdata' structs). There's one connectdata struct
   allocated for each connection that libcurl knows about. Note that when you
   use the multi interface, the multi handle will hold the connection cache
   and not the particular easy handle. This of course to allow all easy handles
   in a multi stack to be able to share and re-use connections.
   pointers to 'connectdata' structs).
 o This enables the 'curl handle' to be reused on subsequent transfers.
 o When we are about to perform a transfer with curl_easy_perform(), we first
   check for an already existing connection in the cache that we can use,
   otherwise we create a new one and add to the cache. If the cache is full
   already when we add a new connection, we close one of the present ones. We
   select which one to close dependent on the close policy that may have been
   previously set.
 o When the transfer operation is complete, we try to leave the connection
   open. Particular options may tell us not to, and protocols may signal
   closure on connections and then we don't keep it open of course.
 o When libcurl is told to perform a transfer, it first checks for an already
   existing connection in the cache that we can use. Otherwise it creates a
   new one and adds that to the cache. If the cache is full already when a new
   connection is added, it will first close the oldest unused one.
 o When the transfer operation is complete, the connection is left
   open. Particular options may tell libcurl not to, and protocols may signal
   closure on connections and then they won't be kept open of course.
 o When curl_easy_cleanup() is called, we close all still opened connections,
   unless of course the multi interface "owns" the connections.

 You do realize that the curl handle must be re-used in order for the
 persistent connections to work.
 The curl handle must be re-used in order for the persistent connections to
 work.

multi interface/non-blocking
============================

 We make an effort to provide a non-blocking interface to the library, the
 multi interface. To make that interface work as good as possible, no
 low-level functions within libcurl must be written to work in a blocking
 manner.
 The multi interface is a non-blocking interface to the library. To make that
 interface work as well as possible, no low-level functions within libcurl
 must be written to work in a blocking manner. (There are still a few spots
 violating this rule.)

 One of the primary reasons we introduced c-ares support was to allow the name
 resolve phase to be perfectly non-blocking as well.

 The ultimate goal is to provide the easy interface simply by wrapping the
 multi interface functions and thus treat everything internally as the multi
 interface is the single interface we have.

 The FTP and the SFTP/SCP protocols are thus perfect examples of how we adapt
 and adjust the code to allow non-blocking operations even on multi-stage
 protocols. They are built around state machines that return when they could
 block waiting for data.  The DICT, LDAP and TELNET protocols are crappy
 examples and they are subject for rewrite in the future to better fit the
 libcurl protocol family.
 The FTP and the SFTP/SCP protocols are examples of how we adapt and adjust
 the code to allow non-blocking operations even on multi-stage command-
 response protocols. They are built around state machines that return when
 they would otherwise block waiting for data.  The DICT, LDAP and TELNET
 protocols are crappy examples and they are subject for rewrite in the future
 to better fit the libcurl protocol family.

SSL libraries
=============
@@ -408,12 +377,12 @@ API/ABI
Client
======

 main() resides in src/main.c together with most of the client code.
 main() resides in src/tool_main.c.

 src/tool_hugehelp.c is automatically generated by the mkhelp.pl perl script
 to display the complete "manual" and the src/urlglob.c file holds the
 functions used for the URL-"globbing" support. Globbing in the sense that
 the {} and [] expansion stuff is there.
 to display the complete "manual" and the src/tool_urlglob.c file holds the
 functions used for the URL-"globbing" support. Globbing in the sense that the
 {} and [] expansion stuff is there.

 The client mostly messes around to setup its 'config' struct properly, then
 it calls the curl_easy_*() functions of the library and when it gets back
@@ -425,8 +394,8 @@ Client
 curl_easy_getinfo() function to extract useful information from the curl
 session.

 Recent versions may loop and do all this several times if many URLs were
 specified on the command line or config file.
 It may loop and do all this several times if many URLs were specified on the
 command line or config file.

Memory Debugging
================

docs/LIBCURL-STRUCTS

0 → 100644
+245 −0
                                  _   _ ____  _
                              ___| | | |  _ \| |
                             / __| | | | |_) | |
                            | (__| |_| |  _ <| |___
                             \___|\___/|_| \_\_____|

Structs in libcurl

This document should cover 7.32.0 pretty accurately, but will make sense even
for older and later versions as things don't change drastically that often.

 1. The main structs in libcurl
  1.1 SessionHandle
  1.2 connectdata
  1.3 Curl_multi
  1.4 Curl_handler
  1.5 conncache
  1.6 Curl_share
  1.7 CookieInfo

==============================================================================

1. The main structs in libcurl

  1.1 SessionHandle

  The SessionHandle struct is the one returned to the outside in the external
  API as a "CURL *". This is usually known as an easy handle in API
  documentation and examples.

  Information and state that is related to the actual connection is in the
  'connectdata' struct. When a transfer is about to be made, libcurl will
  either create a new connection or re-use an existing one. The particular
  connectdata that is used by this handle is pointed out by
  SessionHandle->easy_conn.

  Data and information that regards this particular single transfer is put in
  the SingleRequest sub-struct.

  When the SessionHandle struct is added to a multi handle, as it must be in
  order to do any transfer, the ->multi member will point to the Curl_multi
  struct it belongs to. The ->prev and ->next members will then be used by the
  multi code to keep a linked list of SessionHandle structs that are added to
  that same multi handle. libcurl always uses multi so ->multi *will* point to
  a Curl_multi when a transfer is in progress.

  ->mstate is the multi state of this particular SessionHandle. When
  multi_runsingle() is called, it will act on this handle according to which
  state it is in. The mstate is also what tells which sockets to return for a
  specific SessionHandle when curl_multi_fdset() is called etc.

  The libcurl source code generally uses the name 'data' for the variable that
  points to the SessionHandle.
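
  As a rough illustration of the members described above, here is a minimal
  sketch of a handle whose ->mstate decides what a multi_runsingle()-like
  function does next. It is not the real definition from lib/urldata.h: every
  sketch_* name, the reduced set of states and the order of the transitions
  are assumptions made up for this example.

```c
#include <stddef.h>

/* Simplified stand-ins for illustration only. The real structs have many
   more members; all sketch_* names are hypothetical. */
struct sketch_conn { int sockfd; };
struct sketch_multi { int num_easy; };

typedef enum {
  SKETCH_INIT,
  SKETCH_CONNECT,
  SKETCH_DO,
  SKETCH_PERFORM,
  SKETCH_DONE
} sketch_mstate;

struct sketch_handle {
  struct sketch_conn *easy_conn;  /* connection in use, NULL if none */
  struct sketch_multi *multi;     /* multi handle this is added to */
  struct sketch_handle *prev;     /* linked list kept by the multi code */
  struct sketch_handle *next;
  sketch_mstate mstate;           /* where in the transfer this handle is */
};

/* Advance one handle by one step, picking the action from ->mstate.
   Returns the new state. */
static sketch_mstate sketch_runsingle(struct sketch_handle *data,
                                      struct sketch_conn *conn)
{
  switch(data->mstate) {
  case SKETCH_INIT:
    data->mstate = SKETCH_CONNECT;
    break;
  case SKETCH_CONNECT:
    data->easy_conn = conn;       /* a new or re-used connection */
    data->mstate = SKETCH_DO;
    break;
  case SKETCH_DO:
    data->mstate = SKETCH_PERFORM;
    break;
  case SKETCH_PERFORM:
    data->easy_conn = NULL;       /* the transfer no longer uses it */
    data->mstate = SKETCH_DONE;
    break;
  case SKETCH_DONE:
    break;                        /* nothing more to do */
  }
  return data->mstate;
}
```

  The point of the sketch is only that the handle carries its own state, so
  the same function can be called repeatedly without blocking.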


  1.2 connectdata

  A general idea in libcurl is to keep connections around in a connection
  "cache" after they have been used in case they will be used again, and then
  re-use an existing one instead of creating a new one, as that gives a
  significant performance boost.

  Each 'connectdata' identifies a single physical connection to a server. If
  the connection can't be kept alive, the connection will be closed after use
  and then this struct can be removed from the cache and freed.

  Thus, the same SessionHandle can be used multiple times and each time select
  another connectdata struct to use for the connection. Keep this in mind, as
  it is then important to consider if options or choices are based on the
  connection or the SessionHandle.

  Functions in libcurl will assume that connectdata->data points to the
  SessionHandle that uses this connection.

  As a special complexity, some protocols supported by libcurl require a
  special disconnect procedure that is more than just shutting down the
  socket. It can involve sending one or more commands to the server before
  doing so. Since connections are kept in the connection cache after use, the
  original SessionHandle may no longer be around when the time comes to shut
  down a particular connection. For this purpose, libcurl holds a special
  dummy 'closure_handle' SessionHandle in the Curl_multi struct to use when
  closing such connections.

  FTP uses two TCP connections for a typical transfer but it keeps both in
  this single struct and thus can be considered a single connection for most
  internal concerns.

  The libcurl source code generally uses the name 'conn' for the variable that
  points to the connectdata.


  1.3 Curl_multi

  Internally, the easy interface is implemented as a wrapper around multi
  interface functions. This makes everything multi interface.

  Curl_multi is the multi handle struct exposed as "CURLM *" in external APIs.

  This struct holds a list of SessionHandle structs that have been added to
  this handle with curl_multi_add_handle(). The start of the list is ->easyp
  and ->num_easy is a counter of added SessionHandles.

  ->msglist is a linked list of messages to send back when
  curl_multi_info_read() is called. Basically a node is added to that list
  when an individual SessionHandle's transfer has completed.

  ->hostcache points to the name cache. It is a hash table for looking up name
  to IP. The nodes have a limited lifetime in there and this cache is meant
  to reduce the time for when the same name is wanted within a short period of
  time.

  ->timetree points to a tree of SessionHandles, sorted by the remaining time
  until it should be checked - normally some sort of timeout. Each
  SessionHandle has one node in the tree.

  ->sockhash is a hash table that allows fast lookups from a socket descriptor
  to the SessionHandle that uses that descriptor. This is necessary for the
  multi_socket API.

  ->conn_cache points to the connection cache. It keeps track of all
  connections that are kept after use. The cache has a maximum size.

  ->closure_handle is described in the 'connectdata' section.

  The libcurl source code generally uses the name 'multi' for the variable that
  points to the Curl_multi struct.
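
  The ->easyp list and the ->num_easy counter can be pictured with the
  following minimal sketch. It only models the list bookkeeping; the demo_*
  names and the two helper functions are hypothetical and are not how
  curl_multi_add_handle() is actually implemented.

```c
#include <stddef.h>

struct demo_multi;

struct demo_easy {
  struct demo_multi *multi;  /* set while added to a multi handle */
  struct demo_easy *prev;
  struct demo_easy *next;
};

struct demo_multi {
  struct demo_easy *easyp;   /* start of the list of added handles */
  int num_easy;              /* number of added handles */
};

/* Append 'data' to the list. Returns 0 on success, -1 if the handle is
   already part of a multi handle. */
static int demo_add_handle(struct demo_multi *multi, struct demo_easy *data)
{
  struct demo_easy *last = multi->easyp;
  if(data->multi)
    return -1;
  data->multi = multi;
  data->next = NULL;
  if(!last) {
    data->prev = NULL;
    multi->easyp = data;
  }
  else {
    while(last->next)
      last = last->next;
    last->next = data;
    data->prev = last;
  }
  multi->num_easy++;
  return 0;
}

/* Unlink 'data' again. Returns 0 on success, -1 if it is not in 'multi'. */
static int demo_remove_handle(struct demo_multi *multi,
                              struct demo_easy *data)
{
  if(data->multi != multi)
    return -1;
  if(data->prev)
    data->prev->next = data->next;
  else
    multi->easyp = data->next;
  if(data->next)
    data->next->prev = data->prev;
  data->multi = NULL;
  data->prev = data->next = NULL;
  multi->num_easy--;
  return 0;
}
```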


  1.4 Curl_handler

  Each unique protocol that is supported by libcurl needs to provide at least
  one Curl_handler struct. It defines what the protocol is called and what
  functions the main code should call to deal with protocol specific issues.
  In general, there's a source file named [protocol].c in which there's a
  "struct Curl_handler Curl_handler_[protocol]" declared. In url.c there's
  then a single array that points to all individual Curl_handler structs,
  and that array is scanned through when a URL is given to libcurl to
  work with.
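
  The scan through the handler array could look roughly like the sketch
  below. The toy_* struct is a cut-down stand-in with only three members, the
  two handlers and their return values are invented for the example, and the
  case-insensitive match is an assumption about how schemes are compared.

```c
#include <stddef.h>
#include <ctype.h>

/* Cut-down, hypothetical version of a protocol handler. */
struct toy_handler {
  const char *scheme;   /* URL scheme name */
  int (*do_it)(void);   /* protocol-specific DO function */
  long defport;         /* default port */
};

static int toy_http_do(void) { return 1; }
static int toy_ftp_do(void) { return 2; }

static const struct toy_handler toy_handler_http = { "HTTP", toy_http_do, 80 };
static const struct toy_handler toy_handler_ftp = { "FTP", toy_ftp_do, 21 };

/* One array pointing to every handler, terminated by NULL. */
static const struct toy_handler * const toy_protocols[] = {
  &toy_handler_http,
  &toy_handler_ftp,
  NULL
};

/* Case-insensitive string comparison, returns 1 on match. */
static int toy_equal(const char *a, const char *b)
{
  while(*a && *b) {
    if(toupper((unsigned char)*a) != toupper((unsigned char)*b))
      return 0;
    a++;
    b++;
  }
  return *a == *b;
}

/* Scan the array for the handler of the given scheme, NULL if unknown. */
static const struct toy_handler *toy_find(const char *scheme)
{
  const struct toy_handler * const *pp;
  for(pp = toy_protocols; *pp; pp++) {
    if(toy_equal((*pp)->scheme, scheme))
      return *pp;
  }
  return NULL;
}
```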

  ->scheme is the URL scheme name, usually spelled out in uppercase. That's
  "HTTP" or "FTP" etc. The SSL version of a protocol needs its own
  Curl_handler setup, so HTTPS is separate from HTTP.

  ->setup_connection is called to allow the protocol code to allocate protocol
  specific data that then gets associated with that SessionHandle for the rest
  of this transfer. It gets freed again at the end of the transfer. It will be
  called before the 'connectdata' for the transfer has been selected/created.
  Most protocols will allocate their private 'struct [PROTOCOL]' here and
  assign SessionHandle->req.protop to point to it.

  ->connect_it allows a protocol to do some specific actions after the TCP
  connect is done, that can still be considered part of the connection phase.

  Some protocols will alter the connectdata->recv[] and connectdata->send[]
  function pointers in this function.

  ->connecting is similarly a function that keeps getting called as long as the
  protocol considers itself still in the connecting phase.

  ->do_it is the function called to issue the transfer request. What we call
  the DO action internally. If one DO call is not enough and more work is
  needed for the entire DO sequence to complete, ->doing is then usually
  also provided. Each protocol that needs to do multiple commands or similar
  for do/doing needs to implement its own state machine (see SCP, SFTP,
  FTP). Some protocols (only FTP and only due to historical reasons) have a
  separate piece of the DO state called DO_MORE.

  ->doing keeps getting called while issuing the transfer request command(s).

  ->done gets called when the transfer is complete and DONE. That's after the
  main data has been transferred.

  ->do_more gets called during the DO_MORE state. The FTP protocol uses this
  state when setting up the second connection.
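
  To make the do_it/doing split concrete, here is a small sketch of a
  protocol that needs several commands before its DO sequence is complete.
  The step_* names, the 'commands_left' counter and the int-based signatures
  are assumptions for the example. The while loop stands in for the multi
  state machine, which in libcurl returns to the caller between the calls
  instead of looping.

```c
/* Hypothetical per-transfer protocol data: how many commands remain. */
struct step_proto {
  int commands_left;   /* must be at least 1 when the DO phase starts */
};

/* DO: issue the first command. Sets *done when the sequence is complete. */
static int step_do_it(struct step_proto *p, int *done)
{
  p->commands_left--;
  *done = (p->commands_left == 0);
  return 0;
}

/* DOING: issue the next command, called until *done is set. */
static int step_doing(struct step_proto *p, int *done)
{
  p->commands_left--;
  *done = (p->commands_left == 0);
  return 0;
}

/* Drive the DO phase and return how many calls it took in total. */
static int step_run_do_phase(struct step_proto *p)
{
  int done = 0;
  int calls = 1;
  step_do_it(p, &done);
  while(!done) {
    step_doing(p, &done);
    calls++;
  }
  return calls;
}
```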

  ->proto_getsock
  ->doing_getsock
  ->domore_getsock
  ->perform_getsock
  Functions that return socket information. Which socket(s) to wait for which
  action(s) during the particular multi state.

  ->disconnect is called immediately before the TCP connection is shut down.

  ->readwrite gets called during transfer to allow the protocol to do extra
  reads/writes.

  ->defport is the default TCP or UDP port this protocol uses.

  ->protocol is one or more bits in the CURLPROTO_* set. The SSL versions have
  their "base" protocol set and then the SSL variation. Like "HTTP|HTTPS".

  ->flags is a bitmask with additional information about the protocol that will
  make it get treated differently by the generic engine:

    PROTOPT_SSL - will make it connect and negotiate SSL

    PROTOPT_DUAL - this protocol uses two connections

    PROTOPT_CLOSEACTION - this protocol has actions to do before closing the
    connection. This flag is no longer used by code, yet still set for a bunch
    of protocol handlers.
  
    PROTOPT_DIRLOCK - "direction lock". The SSH protocols set this bit to
    limit which "direction" of socket actions the main engine will concern
    itself with.

    PROTOPT_NONETWORK - a protocol that doesn't use network (read file:)

    PROTOPT_NEEDSPWD - this protocol needs a password and will use a default
    one unless one is provided

    PROTOPT_NOURLQUERY - this protocol can't handle a query part on the URL
    (?foo=bar)
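
  A generic engine can test such a bitmask as in the sketch below. The flag
  names are the ones listed above, but the bit values given here are only
  illustrative and the flags_* helper functions are hypothetical. The
  authoritative definitions are in the libcurl source.

```c
/* Illustrative bit values, assumed for this example only. */
#define PROTOPT_SSL         (1<<0)
#define PROTOPT_DUAL        (1<<1)
#define PROTOPT_CLOSEACTION (1<<2)
#define PROTOPT_DIRLOCK     (1<<3)
#define PROTOPT_NONETWORK   (1<<4)
#define PROTOPT_NEEDSPWD    (1<<5)
#define PROTOPT_NOURLQUERY  (1<<6)

/* Does the protocol need SSL to be negotiated on connect? */
static int flags_wants_ssl(unsigned int flags)
{
  return (flags & PROTOPT_SSL) != 0;
}

/* How many connections does a transfer with this protocol use? */
static int flags_connections(unsigned int flags)
{
  return (flags & PROTOPT_DUAL) ? 2 : 1;
}

/* Does the protocol use the network at all? */
static int flags_uses_network(unsigned int flags)
{
  return !(flags & PROTOPT_NONETWORK);
}

/* May the URL carry a query part for this protocol? */
static int flags_allows_query(unsigned int flags)
{
  return !(flags & PROTOPT_NOURLQUERY);
}
```

  A handler then combines the bits it needs with a bitwise OR, for example
  PROTOPT_SSL | PROTOPT_DUAL for a protocol that is both encrypted and uses
  two connections.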


  1.5 conncache

  This is a hash table with connections for later re-use. Each SessionHandle
  has a pointer to its connection cache. Each multi handle sets up a connection
  cache that all added SessionHandles share by default.
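
  The idea of a hash table of re-usable connections can be sketched as below.
  This is a toy model: the mini_* names, the fixed number of buckets, the
  hash function and the choice of the host name as the key are all
  assumptions for the example and say nothing about how the real conncache
  is keyed or sized.

```c
#include <stddef.h>
#include <string.h>

#define MINI_BUCKETS 8

struct mini_conn {
  const char *host;        /* key used for the lookup (assumed) */
  int inuse;               /* non-zero while a transfer uses it */
  struct mini_conn *next;  /* chain within one bucket */
};

struct mini_conncache {
  struct mini_conn *bucket[MINI_BUCKETS];
  size_t num_connections;
};

/* Simple string hash reduced to a bucket index. */
static size_t mini_hash(const char *key)
{
  size_t h = 5381;
  while(*key)
    h = h * 33 + (unsigned char)*key++;
  return h % MINI_BUCKETS;
}

/* Store a connection in the cache for later re-use. */
static void mini_cache_add(struct mini_conncache *c, struct mini_conn *conn)
{
  size_t i = mini_hash(conn->host);
  conn->next = c->bucket[i];
  c->bucket[i] = conn;
  c->num_connections++;
}

/* Find an idle connection to 'host' and mark it as in use. Returns NULL
   when there is none, in which case a new connection would be created. */
static struct mini_conn *mini_cache_find(struct mini_conncache *c,
                                         const char *host)
{
  struct mini_conn *conn;
  for(conn = c->bucket[mini_hash(host)]; conn; conn = conn->next) {
    if(!conn->inuse && !strcmp(conn->host, host)) {
      conn->inuse = 1;
      return conn;
    }
  }
  return NULL;
}
```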


  1.6 Curl_share
  
  The libcurl share API allocates a Curl_share struct, exposed to the external
  API as "CURLSH *".

  The idea is that the struct can have a set of own versions of caches and
  pools and then by providing this struct in the CURLOPT_SHARE option, those
  specific SessionHandles will use the caches/pools that this share handle
  holds.

  Then individual SessionHandle structs can be made to share specific things
  that they otherwise wouldn't, such as cookies.

  The Curl_share struct can currently hold cookies, DNS cache and the SSL
  session cache.

  
  1.7 CookieInfo

  This is the main cookie struct. It holds all known cookies and related
  information. Each SessionHandle has its own private CookieInfo even when
  they are added to a multi handle. They can be made to share cookies by using
  the share API.
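
  The relation between a handle's private cookies and a share object can be
  sketched like this. The tiny_* structs and the helper function are
  hypothetical simplifications: each handle owns a cookie store, and a share
  object, when set and holding cookies, takes precedence over it.

```c
#include <stddef.h>

struct tiny_cookieinfo {
  int numcookies;                   /* number of known cookies */
};

struct tiny_share {
  struct tiny_cookieinfo *cookies;  /* shared store, NULL if not shared */
};

struct tiny_handle {
  struct tiny_cookieinfo own;       /* private to this handle */
  struct tiny_share *share;         /* NULL unless a share object is set */
};

/* Return the cookie store this handle should use for its transfer. */
static struct tiny_cookieinfo *tiny_cookies(struct tiny_handle *data)
{
  if(data->share && data->share->cookies)
    return data->share->cookies;
  return &data->own;
}
```

  Two handles without a share object get different stores from this function,
  while two handles pointing to the same share object get the same one.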
+2 −2
@@ -5,7 +5,7 @@
#                            | (__| |_| |  _ <| |___
#                             \___|\___/|_| \_\_____|
#
# Copyright (C) 1998 - 2012, Daniel Stenberg, <daniel@haxx.se>, et al.
# Copyright (C) 1998 - 2013, Daniel Stenberg, <daniel@haxx.se>, et al.
#
# This software is licensed as described in the file COPYING, which
# you should have received as part of this distribution. The terms
@@ -36,7 +36,7 @@ EXTRA_DIST = MANUAL BUGS CONTRIBUTE FAQ FEATURES INTERNALS SSLCERTS \
 README.win32 RESOURCES TODO TheArtOfHttpScripting THANKS VERSIONS	 \
 KNOWN_BUGS BINDINGS $(man_MANS) $(HTMLPAGES) HISTORY INSTALL		 \
 $(PDFPAGES) LICENSE-MIXING README.netware DISTRO-DILEMMA INSTALL.devcpp \
 MAIL-ETIQUETTE HTTP-COOKIES
 MAIL-ETIQUETTE HTTP-COOKIES LIBCURL-STRUCTS

MAN2HTML= roffit < $< >$@