UDP Sockets Hang On Reads

15 Mar 2012

I am having an issue with UDP sockets that only occurs when there is a lot of network traffic. I am transferring packets at 100Hz, and it works fine when there isn't much traffic on the network. When there are multiple broadcast packets floating around, as is the case when I am running X-Plane, my application will eventually hang.

I have tried SimpleSocket, mbedNet, and one other library, thinking maybe it was a library thing, but all of them behave exactly the same.

I have found the following things:

1) Before the hang happens I get false positives where it processes the same packet multiple times (when I use a RecvFrom-type function).

2) A few reads later, the mbed board will eventually hang in the RecvFrom-type call.

3) I ended up writing a pacer routine on the reads (see the sketch below), but this is merely a bandage until I can get this running normally.
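
The pacer is nothing fancy. Here is a minimal sketch of the idea, assuming the standard mbed Timer API (CheckBuff and RecvTime match the names in my project; the rest of the loop is elided):

    Timer CheckBuff;        // restarted after every paced read
    float RecvTime = 0.1f;  // minimum seconds between reads

    CheckBuff.start();
    while ( true )
    {
        // Only poll the socket once enough time has elapsed since the
        // last read, instead of hammering it on every loop iteration.
        if ( CheckBuff.read() > RecvTime )
        {
            CheckBuff.reset();
            if ( datagram.receive( &buddy, 0.0 ) > 0 )
            {
                BuffLen = datagram.read( buff, 200 );
                // ... process the packet ...
            }
        }
        // ... rest of the main loop ...
    }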

I have an interrupt routine running at 10kHz, but I tried disabling it and I still got the same result.

I've kind of run out of ideas. Any thoughts/suggestions would be most appreciated.

Here is the main loop using SimpleSocket.

    while ( true )
    {
        // Just a little something to say that I am alive, ~once a second.
        // (ENCLOOP and TimeoutCount are incremented in the 10kHz
        // interrupt routine mentioned above.)
        if ( ( ENCLOOP % 10000 ) == 0 )
        {
            pc.printf( "*" );
        }

        if ( ( TimeoutCount >= 10000 ) && !TimeoutFlag )
        {
            TimeoutFlag  = true;
            TimeoutCount = 10000;
            pc.printf( "<Not Connected>\n" );
        }

        // Handle UDP packets; the zero timeout makes this a non-blocking poll.
        if ( datagram.receive( &buddy, 0.0 ) > 0 )
        {
            pc.printf( "R" );

            YellowEthLED = true;
            IpAddr ip   = buddy.getIp();
            int    port = buddy.getPort();
            BuffLen = datagram.read( buff, 200 );

            pToEncoder = (TOENC *)buff;

            if ( ProcessPacks() )
            {
                TimeoutCount = 0;
                if ( TimeoutFlag )
                {
                    pc.printf( "[Connected]\n" );
                    TimeoutFlag = false;
                }

                // Reply to the sender with the current encoder data.
                datagram.write( (char *)&FromEncoder, sizeof( FromEncoder ) );
                datagram.send( ip, port );
                pc.printf( "W" );
            }
        }
    }

The loop prints a * roughly once a second to show it is alive. If it receives a packet it prints an R, and if the packet is valid it sends a reply packet and prints a W.

-Doug

17 Mar 2012

How hard is it to set up a repro for this issue? If it is something that I could reproduce here, I could slap it under the debugger and take a look to see if I could help. It could be related to a couple of issues that I encountered in the ethernet driver and fixed in my private copy when building my web server, since they too only occurred under heavier load.

-Adam

17 Mar 2012

Wow Adam,

Thank you for looking into this. Here is the project.

/media/uploads/DugALug/octo_quad_reader.zip

For the problem to occur, you need to change the following lines in the main:

From:

        if ( ( CheckBuff.read() > RecvTime ) && ( ConfigMode ) )

To:

        if ( 1 ) //( CheckBuff.read() > RecvTime ) && ( ConfigMode ) )

And change this:

        if ( SendBuff.read() > SendTime )

To:

        if ( TimeoutCount == 0 )

Here is the application that I use to talk to it:

/media/uploads/DugALug/deliverable.zip

It is a Windows application. Just unzip it into a folder and run it. (If it is talking, the box above the table will turn green.)

It works completely fine as long as there is not a ton of traffic on the network. When the traffic starts getting heavy, the code will simply lock up and stop running. This threw me for a loop for a couple of weeks, because I couldn't replicate it at my office. I ran the Windows application for 48 hours without incident. It only took 2 seconds for it to fail on site. (Isn't that just the way of engineering?)

The code is hard-coded to default to an IP address of 192.168.1.252.

If this address is a problem, you can change the project's hard-coded IP address AND the addresses in the .ini file on the Windows side.

Again, thanks for the help.

-Doug

22 Mar 2012

I took a look at this code tonight and while I could get it to start up and run on my mbed, I couldn't get it to crash/hang or ever connect to the application running on the PC.

My network is set up with 192.168.0.1 as the gateway and a mask of 255.255.255.0, so I made these changes in an attempt to get the connect to work:

uint8_t       IPStr[4]   = { 192, 168,   0, 252 };  // My Default IP Address
uint8_t       NMStr[4]   = { 255, 255, 255,   0 };  // My NetMask
uint8_t       GWStr[4]   = { 192, 168,   0,   1 };  // My Gateway
uint8_t       HostIp[4]  = { 192, 168,   0,  14 };  // My Host
int           IPort      = 6100;

- and -
                        broadcastNet =  { 192, 168,   0, 255 },
                        broadcastAll =  { 255, 255, 255, 255 };

Is there anything else I should have changed? I even turned off the firewall on my Windows XP machine, and the mbed still indicated that it couldn't connect. I can ping the mbed running your code from my Windows XP machine at address 192.168.0.252.

This is what I saw logged to the console:

Software Version Number 1.031000
 Initializing
Using Defaults For Configuration Mode!
Send Frequency: 20.000Hz Recv Frequency: 10.000Hz
Setting Up Datagram
[INFO      ]      NetIF: Initializing NetIF layer
[INFO      ]      NetIF: Interface 'en0' registered (192.168.0.252/255.255.255.0 gw 192.168.0.1) using driver 'mbedEMAC'
[INFO      ]      NetIF: Registered periodic function 'ARP cache' with period 10 seconds
Setting IP Address to [192.168.0.252] Port: 6100 
In Main Loop...
*<Not Connected>
*T*T*T

Even though it wasn't connected, I pushed my network up to 50% utilization by running some code against one mbed device while another was running your code, but I didn't get any hangs or crashes. Maybe it has to be connected for the problem to show up.

-Adam

22 Mar 2012

Hey Adam,

Thanks for running it and looking into this. It should have connected. Well, on the PC side, did you edit the .ini file to use the same network values?

Those values need to match what was hard-coded in the mbed device.

    DEF_BOARDID    = 1
    DEF_IP_Address = '192.168.0.252'
    DEF_IP_NetMask = '255.255.255.0'
    DEF_IP_Gateway = '192.168.0.1'
    DEF_IP_HOST    = '192.168.0.255'
    DEF_IP_PORT    = 6100

    ALT_BOARDID    = 1
    ALT_IP_Address = '192.168.0.252'
    ALT_IP_NetMask = '255.255.255.0'
    ALT_IP_Gateway = '192.168.0.1'
    ALT_IP_HOST    = '192.168.0.1'

The problem only seems to occur when it is connected; that seems to be important to triggering the failure.

-Doug

24 Mar 2012

Updating the INI file did the trick. Unfortunately, I don't appear to be able to reproduce the type of network conditions that lead to the issue you were hitting. I will keep it running here for a while to see if it happens to repro sooner or later.

Are you able to run experiments in the field where it actually fails in an attempt to narrow down the cause of the problem? If so, now that I know how to test it to make sure that I don't break it, I could try building you a debuggable version of your binary. I would send you this binary to try in the field. You would need to have a PC connected to the mbed via the USB cable and this PC would need to be running arm-none-eabi-gdb. If the crash/hang reproduces with that binary, I can give you gdb commands (if you aren't already familiar with gdb) that should help narrow down the problem. Let me know if you want to give that a try.

Thanks,

Adam

24 Mar 2012

Hey Adam,

Yeah, I don't know how to reproduce it other than on the full-up system. My heartache is that when the problem appears, it locks up immediately. Using Wireshark, I can look at some of the traffic, but not all of it. There is a mix of traffic: 12 devices using Modbus TCP (each running at 100Hz), 4 devices using UDP (also at 100Hz), and a couple of broadcast packets at lower rates.

With the code that I sent you, you can see my workaround. I ended up just spewing data out at a fixed rate and ignoring the reads completely (in non-config mode). This works without a hitch, but it prevents the user from resetting the encoder and it generates a lot of useless traffic.

I am up for trying to debug this. I haven't run gdb in many years, but I am sure I could figure it out. Is there a particular link for arm-none-eabi-gdb?

Thanks again for helping me out here.

-Doug

24 Mar 2012

About a half hour after I made my previous post, it did hang on my network.

I have started trying to port your code to compile with GCC, and the first thing I noticed is that you weren't really using EthernetIf at all. You mentioned this in your first post, but in all honesty I didn't really know what mbedNet was, so I just glossed over it. :) That does mean the issue I hit before would not be what is impacting you, since it only pertained to EthernetIf. I am currently updating the mbedNet code to get it to compile with GCC.

I will send you a link to the build of gdb to run along with the binary if I can get it to build properly with GCC and actually have something for you to run under the debugger.

-Adam

24 Mar 2012

Hey Adam,

I can change it back to using EthernetIF if you would prefer (mbedNet was just my attempt to get the thing working). It behaved exactly the same with both libraries, so I think it is something lower level than the library, or some core principle that isn't cool.

-Doug

24 Mar 2012

Doug Joseph wrote:

I can change it back to using EthernetIF if you would prefer

Let's delay that for now. I have mbedNet compiling on GCC now but it is crashing even earlier. It could be related to the problem you see with the online compiler or I broke something else during the port. The debugging of this crash is helping me find weaknesses in my debug monitor so I am finding it very useful :)

24 Mar 2012

Adam Green wrote:

The debugging of this crash is helping me find weaknesses in my debug monitor so I am finding it very useful

Well that is good :). I am thankful for your time and effort on this.

-Doug

25 Mar 2012

I will start with the part that you probably care the most about. Once I got your sample to start up after being built with gcc, I got a crash in Recv_Data() as it tried to free a data buffer from the queue. This is caused by the fact that there are allocations taking place in Hook_UDPv4(), which is called from within the Ethernet ISR, while frees are happening in the main code. This can easily lead to heap corruption, and it becomes more likely the more often you receive UDP packets. Removing the calls to read the data would get rid of the main line heap free calls, which would protect the heap from corruption as well (and is probably why your send-only workaround holds up). The mbedNet code should be rewritten to remove the dynamic memory allocations from within the ISR.

Program received signal SIGSEGV, Segmentation fault.
0x0000cb2a in _free_r () at EncoderDefs.h:8
8	DigitalIn EncPairB[8] = { ( p12 ), ( p14 ), ( p16 ), ( p18 ), ( p20 ), ( p22 ), ( p24 ), ( p26 ) };
(gdb) bt
#0  0x0000cb2a in _free_r () at EncoderDefs.h:8
#1  0x00004ece in Recv_Data (entry=0x10002de8, data=0x10000b84 "@0 \020\001", length=1024) at mbedNet/Sockets.cpp:244
#2  0x00005858 in Sockets_RecvFrom (socket=0, data=0x10000b84 "@0 \020\001", length=1024, flags=0, 
    remoteAddr=0x10002bdc, addrLen=0x10007f60) at mbedNet/Sockets.cpp:567
#3  0x000013f0 in main () at Encoder_Handler.cpp:425

You can either try refactoring mbedNet to not require any dynamic memory allocations from the ISR (for example, pre-allocate all of the buffers when the socket is created and then just mark them as free or in use), or try switching to EthernetIf and we can debug whatever problems it has. Its failure should at least be different, since this particular queuing code looks specific to mbedNet.
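
To make the pre-allocation idea concrete, here is a rough sketch of what I mean (the names and sizes here are mine, not from mbedNet): reserve all the storage up front so that the ISR never calls malloc() and the main line never calls free().

    #include <stdint.h>
    #include <stddef.h>

    #define POOL_BUFFER_COUNT 8
    #define POOL_BUFFER_SIZE  1536    // room for a full Ethernet frame

    typedef struct
    {
        volatile uint32_t inUse;      // 0 = free, 1 = owned
        uint32_t          length;     // bytes valid in data[]
        uint8_t           data[POOL_BUFFER_SIZE];
    } PoolBuffer;

    static PoolBuffer g_pool[POOL_BUFFER_COUNT];

    // Called from the Ethernet ISR in place of malloc().
    static PoolBuffer *Pool_Allocate(void)
    {
        for (int i = 0 ; i < POOL_BUFFER_COUNT ; i++)
        {
            if (!g_pool[i].inUse)
            {
                g_pool[i].inUse = 1;
                return &g_pool[i];
            }
        }
        return NULL;    // pool exhausted: drop this packet
    }

    // Called from the main line in place of free(). A single word store
    // is atomic on the Cortex-M3, so as long as allocations only ever
    // happen in the ISR, no extra locking is needed.
    static void Pool_Free(PoolBuffer *buffer)
    {
        buffer->inUse = 0;
    }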

Now for me to ramble on about other things that I hit while looking at this code, which you probably don't care about. The first is that after the initial port I couldn't get the code to even start without it crashing almost immediately in the ISR. It would crash on this line of ENET_IRQHandler():

	LPC_EMAC->IntClear = status;

At first, when I went to dump the LPC_EMAC registers in the debugger, I just got a bunch of 0xdeadabba values. I thought this was caused by some missing code to enable the Ethernet peripheral or to properly set its clock. After much debugging, I determined that it didn't start happening until the peripheral was actually powered up, and that it was actually a bug in my debug monitor. It turns out that those registers can't be read a byte at a time, which is what the monitor tries to do based on the protocol spec. I do see that there is a way for me to know that the user is actually accessing them a word at a time and do the right thing from the monitor, so I will need to make that modification.

So what was leading to the crash once the Ethernet peripheral was powered up correctly? It turns out that the non-optimized code generated by GCC for the above line is:

          0x00006ddc <+16>:	mov.w	r3, #1342177280	; 0x50000000
        => 0x00006de0 <+20>:	ldr.w	r2, [r3, #4072]	; 0xfe8
           0x00006de4 <+24>:	mov.w	r1, #0
           0x00006de8 <+28>:	ldr	r2, [r7, #44]	; 0x2c
           0x00006dea <+30>:	orr.w	r2, r1, r2
           0x00006dee <+34>:	str.w	r2, [r3, #4072]	; 0xfe8

Even though the IntClear register is only written to in the C code, GCC generates a read and then discards the resulting r2 value anyway. Since this register is write-only, that read is what led to the bus fault. It also does some extraneous ORing and such. I was able to work around this issue by convincing GCC that the uint32_t type is an "unsigned int" and not an "unsigned long"; both are 32-bit on this platform, but it appears that GCC doesn't treat them as equal, and that leads to the above issue.
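
For what it's worth, another way to sidestep this (a sketch of the idea, not what I actually changed in my port) is to write the register through a plain volatile pointer instead of the struct member, which compiles down to a single str with no preceding ldr. The base and offset below are taken from the disassembly above (0x50000000 + 0xFE8):

    #include <stdint.h>

    // IntClear is write-only, so make sure the access is a pure store.
    #define EMAC_INTCLEAR (*(volatile uint32_t *)(0x50000000UL + 0xFE8UL))

    static void clearEnetInterrupts(uint32_t status)
    {
        EMAC_INTCLEAR = status;    // single str; the register is never read
    }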

Once the above issue was worked around, I hit the malloc/free issue described above.

I am sort of disheartened to see how poorly gdb was able to step through your code with all optimizations turned off. I need to spend some time looking into that issue. I get truncated call stacks, and sometimes it treats C functions with symbols as though they didn't have symbols at all.

-Adam

25 Mar 2012

Wow Adam,

That is some serious sleuthing. Thanks for finding the problem. For the last hour, I've been looking at the Sockets.cpp and trying to figure out what the best course of action is.

I think it might be better for me to switch back to EthernetIf. Looking at how the malloc usage is implemented, I am worried about changing the allocation paradigm. The buffer handling is so intertwined with the rest of the code that it would be quite a trick to make sure every case was covered.

I will try to change it back to EthernetIf tomorrow and put the link here.

Again, thanks for your time, patience, and knowledge on this.

On a side note, gdb has always been a somewhat questionable debugger; this hasn't changed in years. I used to curse gdb regularly. But in all fairness, it has saved my butt many times too! In this case, with some patience and understanding, you were able to find the problem. That is way beyond what I could have done without you and gdb! Gdb has serious grief with reentrancy and ISRs; to change that would add an order of magnitude to its complexity.

-Doug

27 Mar 2012

Okay Adam,

I finally got around to re-porting it. I kept both versions in there in case you needed to compare. I haven't had a chance to run it, but it should work. (It worked before.)

/media/uploads/DugALug/octo_quad_reader.zip

Sorry it took me a while to port it.

-Doug

30 Mar 2012

Doug Joseph wrote:

I kept both versions in there in case you needed to compare. I haven't had a chance to run it, but it should work. (It worked before.)

/media/uploads/DugALug/octo_quad_reader.zip

What is different in this code? When I compare Encoder_Handler.cpp between what is found in this zip and what was found in the previous zip, I don't see any differences. Does the inclusion of the SimpleSocket library account for the change?

It also looks like Encoder_Handler.cpp in this zip archive still contains the modifications that make it send-only. Is that correct? Do I have to make the same mods to this code?

You mention above that you have both versions in this archive. Maybe that is what is confusing me and I am just not seeing the new code.

Thanks,

Adam

30 Mar 2012

Adam,

Hmmmmm... Okay, I deleted the zip and redid it. Maybe I didn't export it correctly?

Try this one and see if it is any better.

/media/uploads/DugALug/octo_quad_reader.zip

-Doug

03 Apr 2012

Doug, I see the EthernetNetIf code in this new archive and it builds with the online compiler. I have pulled it down to my local machine and started looking at whether I can get it ported to GCC to make debugging easier.

-Adam

03 Apr 2012

Cool, Adam. Thanks!

-Doug

04 Apr 2012

Doug,

I remember now what I don't like about EthernetNetIf: it's really complicated! Too complicated for my simple little mind to figure out! I ended up just stripping out all the layers above lwip and modifying your code to use the raw lwip UDP APIs instead. I developed and debugged it with gcc and then pushed it up to the online compiler. It can be imported from here:

[Not found]

I just followed the code flow of your EthernetNetIf version, where it only sent a packet back to the PC in response to packets that had been sent from the PC. The raw lwip API works through a callback mechanism. From your main program, you poll the network stack in your main loop (I am doing this through a call to SNetwork_Poll()). Within this polling call, it takes any packets arriving from the ethernet driver and pushes them up through the network stack. If a packet is a UDP datagram destined for your port, the stack calls the handleUdpReceive() function that I registered for your UDP port in main(). You can look at this callback code (in main.cpp) to see how I pulled the data out and then sent back a response. You can copy your data into the same g_outboundPbuf->payload buffer (as long as it is the same size as the FromEncoder structure sent in the callback) and use it to make a call to udp_sendto() from your main thread if you want to send data more often.
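
Here is a condensed sketch of the pattern (handleUdpReceive, SNetwork_Poll, and port 6100 are the names from my port; the udp_* and pbuf_* calls are the standard raw lwip 1.x API; FromEncoder stands in for the reply structure from your code, and I have simplified it to allocate a fresh pbuf per reply instead of reusing g_outboundPbuf):

    #include <stdint.h>
    #include <string.h>
    #include "lwip/udp.h"
    #include "lwip/pbuf.h"

    // Stand-in for the reply structure from the encoder code.
    static struct { uint32_t counts[8]; } FromEncoder;

    static struct udp_pcb *g_pcb;

    // Called by the stack, from within the polling call, for every
    // datagram that arrives on the bound port.
    static void handleUdpReceive(void *arg, struct udp_pcb *pcb,
                                 struct pbuf *p,
                                 struct ip_addr *addr, u16_t port)
    {
        (void)arg;

        // p->payload / p->len hold the incoming datagram. Build the
        // reply and bounce it straight back to the sender.
        struct pbuf *reply = pbuf_alloc(PBUF_TRANSPORT,
                                        sizeof(FromEncoder), PBUF_RAM);
        if (reply != NULL)
        {
            memcpy(reply->payload, &FromEncoder, sizeof(FromEncoder));
            udp_sendto(pcb, reply, addr, port);
            pbuf_free(reply);
        }
        pbuf_free(p);   // the callback owns the received pbuf
    }

    // One-time setup in main():
    //     g_pcb = udp_new();
    //     udp_bind(g_pcb, IP_ADDR_ANY, 6100);
    //     udp_recv(g_pcb, handleUdpReceive, NULL);
    // ...then keep calling SNetwork_Poll() from the main loop to pump
    // packets up from the driver into the stack.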

I have run it here for almost an hour and I don't see any hangs. It does sometimes time out on the PC, marks the connection as down, and then quickly reestablishes the connection and keeps on running.

-Adam

04 Apr 2012

Wow Adam,

I know what you mean about complexity. That's why on Windows I have a class for socket handling that is made up of basically 5 commands: Open, Listen, Read, Write, Close.
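
The shape of it, roughly (a sketch from memory, not the actual class):

    // The whole public surface of my Windows UDP wrapper, more or less.
    class SimpleUdp
    {
    public:
        bool Open( const char *localIp, int port );    // bind the socket
        bool Listen( float timeoutSeconds );           // wait for traffic
        int  Read( void *buffer, int maxLength );      // returns bytes read
        int  Write( const void *buffer, int length );  // returns bytes sent
        void Close();
    };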

I am a little concerned about the connection dropping on the PC. That means it didn't get a reply from the mbed device for over a second (on the PC side, I keep sending packets whether I receive one or not, so it is dropping a bunch of packets). The tool I sent you runs at 10-15Hz. In the real application I am running at a fixed 100Hz, and losing over 100 frames would be devastating to the application. I am wondering if it is just due to timing and we are getting partial packets which are being rejected? Eventually it would correct itself, because of the time gaps between packets.

I will give this a try on our net and see how it does. It will probably have to be some time next week since the sim is down right now for the rest of the week.

Thank you so much for your work on this. I am blown away that you would put this much time and effort into it. I really appreciate your time and knowledge.

-Doug

11 Apr 2012

Hey Adam,

The new code seems to be dropping lots of packets. :(

I'll play with it some more tomorrow.

-Doug

11 Apr 2012

Doug Joseph wrote:

The new code seems to be dropping lots of packets. :(

That's not good!

To be clear, I only respond to datagrams sent from the PC; I didn't copy the code which would broadcast datagrams on an even more regular basis. Maybe it is these missing higher-rate broadcasts that look like dropped packets from your application's responsiveness point of view? If not, let me know exactly what you are seeing and I will see if I can reproduce it under the debugger.

-Adam

11 Apr 2012

Hey Adam,

No, this is the ping-pong interface (only speak when spoken to). My host keeps sending packets whether it receives one or not and counts sent versus received packets. I am getting about 1 packet in 4, and sometimes it won't respond at all for as much as a second.

I will try putting a counter on the mbed side to see what I am receiving from the host. (If only I had Wireshark over there... lol.)

-Doug

11 Apr 2012

Doug,

Thanks for the clarification. I might be able to reproduce that issue here. What ping rate were you using?

One experiment that you could try on your side is to disable your EncoderHandler() routine to see if maybe it is starving the network stack of CPU cycles.

-Adam

11 Apr 2012

Hey Adam,

I will try disabling the encoder stuff too.

I am running at 100Hz.

-Doug

12 Apr 2012

Another experiment would be to try commenting out the printf()s. I sent those down a different path in my port so that they would go to the gdb console when running under the debugger. This path might be slower.

-Adam

21 Apr 2012

Hey Adam,

Our sim has been down. I will try hitting it this week. Sorry to keep you waiting.

-Doug

21 Apr 2012

Sounds good. I have been busy refactoring code from one of my own projects anyway.

While your original problem needed the network conditions from your field environment to reproduce, does this packet dropping issue require it as well or does it reproduce on your office network too?

-Adam

22 Apr 2012

Hey Adam,

Yes, it even happens with the test application that I sent you... that is why the button says disconnected for a moment and then turns back to green again every now and then. At 10x the speed it is much more prevalent.

-Doug

15 Sep 2012

Adam Green wrote:

Doug Joseph wrote:

I can change it back to using EthernetIF if you would prefer

Let's delay that for now. I have mbedNet compiling on GCC now but it is crashing even earlier. It could be related to the problem you see with the online compiler or I broke something else during the port. The debugging of this crash is helping me find weaknesses in my debug monitor so I am finding it very useful :)

Could you please share some details on how to get this to compile using GCC?

Thanks